Common Crawl
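Stream data with DDA (DagsHub Direct Data Access):
from dagshub.streaming import DagsHubFilesystem
# Point the streaming filesystem at the dataset repository
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/commoncrawl-dataset")
# List the top level of the connected Common Crawl bucket
fs.listdir("s3://commoncrawl")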
Description:
A corpus of web crawl data composed of over 50 billion web pages.
Update Frequency:
Monthly
Managed By:
https://commoncrawl.org/
Resources:
- resource:
- Description: Crawl data (WARC and ARC format)
- ARN: arn:aws:s3:::commoncrawl
- Region: us-east-1
- Type: S3 Bucket
- AccountRequired: True
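The resource fields above (ARN, region, account requirement) are enough to address the bucket directly. The following is a minimal sketch, not an official access method: the helper names `bucket_from_arn`, `s3_uri`, and `list_prefixes` are hypothetical, and `list_prefixes` assumes the third-party `boto3` package is installed and that AWS credentials are configured, since the entry lists AccountRequired: True.

```python
def bucket_from_arn(arn: str) -> str:
    """Return the bucket name from an S3 ARN such as arn:aws:s3:::name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # The resource part may be "bucket" or "bucket/key"; keep only the bucket.
    bucket = parts[5].split("/", 1)[0]
    if not bucket:
        raise ValueError(f"ARN has no bucket name: {arn!r}")
    return bucket


def s3_uri(arn: str, key: str = "") -> str:
    """Build an s3:// URI for a key inside the bucket named by the ARN."""
    return f"s3://{bucket_from_arn(arn)}/{key.lstrip('/')}"


def list_prefixes(arn: str, region: str, prefix: str = "", max_keys: int = 20):
    """List "directory"-style prefixes under `prefix` (hypothetical helper).

    Assumes boto3 is installed and AWS credentials are available.
    """
    import boto3  # third-party; imported lazily so the helpers above work without it

    client = boto3.client("s3", region_name=region)
    response = client.list_objects_v2(
        Bucket=bucket_from_arn(arn),
        Prefix=prefix,
        Delimiter="/",
        MaxKeys=max_keys,
    )
    # CommonPrefixes is absent when nothing matches, so default to an empty list.
    return [item["Prefix"] for item in response.get("CommonPrefixes", [])]
```

With the values from this entry, `list_prefixes("arn:aws:s3:::commoncrawl", "us-east-1")` would query the bucket in its home region. Keeping the ARN parsing separate from the network call lets the addressing logic be checked without credentials.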
Tags:
aws-pds, encyclopedic, natural language processing, internet
Publication:
-
publication:
- Title: Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl
- URL: https://arxiv.org/pdf/1710.01779.pdf
- AuthorName: Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, Chris Biemann
-
publication:
- Title: Large-scale analysis of style injection by relative path overwrite
- URL: https://doi.org/10.1145/3178876.3186090
- AuthorName: Sajjad Arshad, Seyed Ali Mirheidari, Tobias Lauinger, Bruno Crispo, Engin Kirda, William Robertson
-
publication:
- Title: CC-News-En: A large English news corpus
- URL: https://doi.org/10.1145/3340531.3412762
- AuthorName: Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R. Trippas, J. Shane Culpepper, Alistair Moffat
-
publication:
- Title: On the impact of publicly available news and information transfer to financial markets
- URL: https://arxiv.org/abs/2010.12002
- AuthorName: Metod Jazbec, Barna Pásztor, Felix Faltings, Nino Antulov-Fantulin, Petter N. Kolm
-
publication:
- Title: Language models are few-shot learners
- URL: https://arxiv.org/abs/2005.14165
- AuthorName: Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al.