
Software Heritage Graph Dataset for Machine Learning

Install DagsHub:

pip install dagshub

To stream this data directly from DagsHub:

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/software-heritage-dataset")

fs.listdir("s3://softwareheritage")

Description

Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild.
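To illustrate what "fully deduplicated Merkle DAG" means in practice, here is a minimal sketch of a content-addressed store in which each node's identifier is a hash of its own contents plus the identifiers of its children. The `MerkleStore` class, its method names, and the serialization format are hypothetical and chosen only for illustration; they are not the archive's actual identifier scheme or schema.

```python
import hashlib


class MerkleStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps node identifier -> (kind, payload). Identical nodes share one entry.
        self.nodes = {}

    def _put(self, kind: str, payload: bytes) -> str:
        # The identifier depends only on the node kind and its payload,
        # so storing the same node twice yields the same identifier.
        node_id = hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()
        self.nodes.setdefault(node_id, (kind, payload))
        return node_id

    def add_content(self, data: bytes) -> str:
        return self._put("content", data)

    def add_directory(self, entries: dict) -> str:
        # entries maps a file name to a child identifier; sorting makes
        # the serialization (and therefore the hash) deterministic.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("directory", "\n".join(lines).encode())

    def add_revision(self, directory_id: str, parents: list, message: str) -> str:
        # A revision (commit) points at one directory and zero or more parents.
        lines = [f"directory {directory_id}"]
        lines += [f"parent {p}" for p in parents]
        lines += ["", message]
        return self._put("revision", "\n".join(lines).encode())


store = MerkleStore()

# Two versions of a small project that share an unchanged README.
readme = store.add_content(b"hello\n")
main_v1 = store.add_content(b"print('v1')\n")
main_v2 = store.add_content(b"print('v2')\n")

dir_v1 = store.add_directory({"README": readme, "main.py": main_v1})
dir_v2 = store.add_directory({"README": readme, "main.py": main_v2})

rev1 = store.add_revision(dir_v1, [], "initial")
rev2 = store.add_revision(dir_v2, [rev1], "update main")

# The same README observed again (say, in another repository) adds no new node.
same_readme = store.add_content(b"hello\n")
node_count = len(store.nodes)
```

Because identifiers are derived from content, the unchanged README is stored once no matter how many directories, commits, or repositories reference it, which is the property that lets an archive of this kind link files, directories, and commits without duplication.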

Additional information

Update frequency

Data is updated yearly

Managed by

Software Heritage

License

Creative Commons Attribution 4.0 International.

By accessing the dataset, you agree with the Software Heritage [Ethical
Charter for using the archive
data](https://www.softwareheritage.org/legal/users-ethical-charter/) and
the [terms of use for bulk
access](https://www.softwareheritage.org/legal/bulk-access-terms-of-use/).

Related datasets

Common Screens

Helpful Sentences from Reviews

Humor Detection from Product Question Answering Systems

Japanese Tokenizer Dictionaries
