
WikiText-103

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies.
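
For a quick look at the data, the sketch below loads WikiText-103 through the Hugging Face `datasets` library and counts whitespace-separated tokens per split. Treat it as a minimal sketch under assumptions: the `wikitext` / `wikitext-103-v1` configuration is the copy published on that hub, not necessarily the files tracked in this repository, and whitespace splitting only approximates the tokenization behind the official counts.

```python
# Minimal sketch: load WikiText-103 and count whitespace tokens per split.
# Assumes the Hugging Face `datasets` package and its public
# "wikitext" / "wikitext-103-v1" configuration; adapt the loading step
# if you are working directly from the raw files in this repository.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-v1")

for split in ("train", "validation", "test"):
    n_tokens = sum(len(line.split()) for line in wikitext[split]["text"])
    print(f"{split}: {n_tokens:,} whitespace-separated tokens")
```

On the train split this should land in the neighborhood of the 100-million-token figure quoted above, though the exact number depends on the tokenization used.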

The following paper introduces the dataset in detail. If you use this dataset in published work, please cite:

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016.
Pointer Sentinel Mixture Models. http://arxiv.org/abs/1609.07843

Dataset statistics

In comparison to the Mikolov-processed version of the Penn Treebank (PTB), the WikiText datasets are larger. WikiText-2 aims to be of a similar size to the PTB, while WikiText-103 contains all articles extracted from Wikipedia. The WikiText datasets also retain numbers (as opposed to replacing them with N), case (as opposed to lowercasing all text), and punctuation (as opposed to stripping it out).
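
To make that preprocessing gap concrete, here is a small, purely illustrative sketch that applies a rough PTB-style normalization (lowercase everything, strip punctuation, collapse numbers to N) to a WikiText-style line. It approximates, rather than reproduces, the Mikolov pipeline.

```python
# Illustrative sketch only: a rough PTB-style normalization, approximating
# (not reproducing) the Mikolov preprocessing that WikiText deliberately avoids.
import re
import string

def ptb_style(text: str) -> str:
    text = text.lower()                                   # PTB lowercases all text
    text = re.sub(r"\d+(\.\d+)?", "N", text)              # numbers are replaced with N
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation is stripped
    return " ".join(text.split())

line = "The film grossed $4.3 million in 1995."           # hypothetical WikiText-style sentence
print(line)              # WikiText keeps case, numbers and punctuation as written
print(ptb_style(line))   # -> "the film grossed N million in N"
```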

Contact information

If you have questions about the dataset or want to report new results, contact Stephen Merity.
