
WikiText-103

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies.
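
For a quick look at the data, the sketch below loads WikiText-103 through the Hugging Face `datasets` library and counts whitespace-separated tokens per split. Treat it as a minimal sketch under assumptions: the `wikitext` / `wikitext-103-v1` configuration is the copy published on that hub, not necessarily the files tracked in this repository, and whitespace splitting only approximates the tokenization behind the official counts.

```python
# Minimal sketch: load WikiText-103 and count whitespace tokens per split.
# Assumes the Hugging Face `datasets` package and its public
# "wikitext" / "wikitext-103-v1" configuration; adapt the loading step
# if you are working directly from the raw files in this repository.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-v1")

for split in ("train", "validation", "test"):
    n_tokens = sum(len(line.split()) for line in wikitext[split]["text"])
    print(f"{split}: {n_tokens:,} whitespace-separated tokens")
```

On the train split this should land in the neighborhood of the 100-million-token figure quoted above, though the exact number depends on the tokenization used.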

The following paper introduces the dataset in detail. If you use this dataset in published work, please cite:

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016.
Pointer Sentinel Mixture Models. http://arxiv.org/abs/1609.07843

Dataset statistics

In comparison to the Mikolov-processed version of the Penn Treebank (PTB), the WikiText datasets are larger. WikiText-2 aims to be of a similar size to the PTB, while WikiText-103 contains all articles extracted from Wikipedia. The WikiText datasets also retain numbers (as opposed to replacing them with N), case (as opposed to lowercasing all text), and punctuation (as opposed to stripping it out).
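
To make that preprocessing gap concrete, here is a small, purely illustrative sketch that applies a rough PTB-style normalization (lowercase everything, strip punctuation, collapse numbers to N) to a WikiText-style line. It approximates, rather than reproduces, the Mikolov pipeline.

```python
# Illustrative sketch only: a rough PTB-style normalization, approximating
# (not reproducing) the Mikolov preprocessing that WikiText deliberately avoids.
import re
import string

def ptb_style(text: str) -> str:
    text = text.lower()                                   # PTB lowercases all text
    text = re.sub(r"\d+(\.\d+)?", "N", text)              # numbers are replaced with N
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation is stripped
    return " ".join(text.split())

line = "The film grossed $4.3 million in 1995."           # hypothetical WikiText-style sentence
print(line)              # WikiText keeps case, numbers and punctuation as written
print(ptb_style(line))   # -> "the film grossed N million in N"
```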

Contact information

If you have questions about the dataset or want to report new results, contact Stephen Merity.
