readme.md

openwebtext dataset

after running prepare.py (preprocess) we get:

  • train.bin is ~17GB, val.bin ~8.5MB
  • train has ~9B tokens (9,035,582,198)
  • val has ~4M tokens (4,434,897)
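as a sanity check, the file sizes can be related to the token counts. the sketch below is only that: it assumes each token id is stored as an unsigned 16-bit integer (2 bytes), which this readme does not state, and the helper name `expected_size_bytes` is made up for illustration.

```python
# Token counts quoted in the list above.
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897

# Assumption (not stated in this readme): every token id is written as an
# unsigned 16-bit integer, i.e. 2 bytes per token.
BYTES_PER_TOKEN = 2


def expected_size_bytes(num_tokens: int, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Hypothetical helper: size of a flat binary file of token ids."""
    return num_tokens * bytes_per_token


train_bytes = expected_size_bytes(TRAIN_TOKENS)
val_bytes = expected_size_bytes(VAL_TOKENS)

# Report both binary (GiB/MiB) and decimal (GB/MB) units, since "GB" is ambiguous.
print(f"train.bin: {train_bytes / 2**30:.2f} GiB ({train_bytes / 1e9:.2f} GB)")
print(f"val.bin:   {val_bytes / 2**20:.2f} MiB ({val_bytes / 1e6:.2f} MB)")
```

under that 2-bytes-per-token assumption the numbers come out to roughly 16.8 GiB and 8.5 MiB, which lines up with the sizes quoted above if "GB" and "MB" are read as binary units.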

this came from 8,013,769 documents in total.
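
the counts above also give a rough feel for the data. a small sketch, using only the numbers quoted in this readme and assuming the document count covers train and val together (the variable names are illustrative):

```python
# Numbers quoted in this readme.
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897
NUM_DOCUMENTS = 8_013_769  # assumed to cover train and val together

total_tokens = TRAIN_TOKENS + VAL_TOKENS

# Average document length in tokens, across both splits.
avg_tokens_per_doc = total_tokens / NUM_DOCUMENTS

# Share of all tokens that ended up in the validation split.
val_fraction = VAL_TOKENS / total_tokens

print(f"total tokens:        {total_tokens:,}")
print(f"avg tokens/document: {avg_tokens_per_doc:.1f}")
print(f"val share of tokens: {val_fraction:.4%}")
```

this works out to a bit over 1,100 tokens per document on average, with the validation split holding just under 0.05% of all tokens.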

