Sudachi Language Resources

Stream data with DagsHub Direct Data Access (DDA):

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/sudachi-dataset")

fs.listdir("s3://sudachi")

Description:

Japanese dictionaries and pre-trained models (word embeddings and language models) for natural language processing. SudachiDict is the dictionary for Sudachi, a Japanese tokenizer (morphological analyzer). chiVe is a set of pretrained Japanese word embeddings (word vectors), trained on the ultra-large-scale web corpus NWJC from the National Institute for Japanese Language and Linguistics, with the text analyzed by Sudachi. chiTra is a library for using large-scale pre-trained language models with the Japanese tokenizer SudachiPy.

Update Frequency:

The dictionaries are updated every few months to add neologisms and fix existing entries.

Managed By:

https://www.worksap.co.jp/about/csr/nlp/

Resources:

  1. resource:
    • Description: SudachiDict: binary format of the morphological analysis dictionaries; chiVe: pretrained word embeddings in various formats

    • ARN: arn:aws:s3:::sudachi

    • Region: ap-northeast-1

    • Type: S3 Bucket
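
The ARN and region above are enough to address the bucket directly over HTTPS. The following is a minimal sketch, not an official client: it assumes the bucket permits anonymous listing (typical for Registry of Open Data buckets, but not confirmed here), and the helper names `bucket_from_arn`, `list_url`, and `list_keys` are hypothetical. It uses only the Python standard library and the S3 ListObjectsV2 REST call.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

ARN = "arn:aws:s3:::sudachi"   # from the resource entry above
REGION = "ap-northeast-1"      # from the resource entry above

# XML namespace used by S3 ListObjectsV2 responses
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def bucket_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 ARN of the form arn:aws:s3:::<bucket>."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "s3"]:
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # Drop any object key that follows the bucket name
    return parts[5].split("/", 1)[0]


def list_url(bucket: str, region: str, prefix: str = "", max_keys: int = 20) -> str:
    """Build a virtual-hosted-style ListObjectsV2 URL for the bucket."""
    params = {"list-type": "2"}
    if prefix:
        params["prefix"] = prefix
    params["max-keys"] = str(max_keys)
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{urlencode(params)}"


def list_keys(bucket: str, region: str, prefix: str = "", max_keys: int = 20) -> list:
    """Fetch one page of object keys.

    Needs network access and assumes the bucket allows unsigned requests.
    """
    with urlopen(list_url(bucket, region, prefix, max_keys)) as resp:
        root = ET.parse(resp).getroot()
    return [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]
```

With network access, `list_keys(bucket_from_arn(ARN), REGION)` would return the first page of object keys. Keeping URL construction separate from the network call makes the addressing logic easy to check offline.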

Tags:

aws-pds, natural language processing

Publication:

  1. publication:

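    • Title: chiVe: Towards Industrial-Strength Japanese Word Vector Resources: Acquiring and Improving Distributed Representations with the Morphological Analyzer Sudachi and the Ultra-Large-Scale Web Corpus NWJC (original title: chiVe: 製品利用可能な日本語単語ベクトル資源の実現へ向けて ~形態素解析器Sudachiと超大規模ウェブコーパスNWJCによる分散表現の獲得と改良~)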
    • URL: https://www.ieice.org/ken/paper/20200910U1zQ/
    • AuthorName: 久本空海, 山村崇, 勝田哲弘, 竹林佑斗, 髙岡一馬, 内田佳孝, 岡照晃, 浅原正幸

About

sudachi-dataset originates from the Registry of Open Data on AWS.