
Install the DagsHub Python client:
pip install dagshub
To stream this data directly from DagsHub:
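from dagshub.streaming import DagsHubFilesystem
# Attach a streaming filesystem rooted at the current directory to the dataset repo
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/commoncrawl-dataset")
# List the top-level entries of the Common Crawl bucket
fs.listdir("s3://commoncrawl")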
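The bucket listing can be long, so it may help to look at only the first few entries. The following is a minimal sketch, not part of the DagsHub API: `preview_listing` and `SupportsListdir` are hypothetical names introduced here, and the helper assumes only that the filesystem object exposes `listdir(path)` returning an iterable of entry names, as in the snippet above.

```python
from itertools import islice
from typing import Iterable, List, Protocol


class SupportsListdir(Protocol):
    """Hypothetical protocol: any object with a listdir(path) method."""

    def listdir(self, path: str) -> Iterable[str]: ...


def preview_listing(fs: SupportsListdir, path: str, limit: int = 10) -> List[str]:
    """Return at most `limit` entry names from `path`, in the order listdir yields them."""
    entries = fs.listdir(path)
    # islice stops after `limit` items, so only a short preview is kept
    return [str(name) for name in islice(entries, limit)]
```

With the `fs` object created above, `preview_listing(fs, "s3://commoncrawl", limit=5)` would return up to five entry names. The helper itself makes no network calls; any cost comes from the underlying `listdir`.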
Description
A corpus of web crawl data composed of over 50 billion web pages.
Additional information
Documentation
Update frequency
Monthly
Managed by
Common Crawl
License
This data is available for anyone to use under the Common Crawl Terms of Use.