Software Heritage Graph Dataset

Stream data with DDA:

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/software-heritage-dataset")

fs.listdir("s3://softwareheritage")

Description:

Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive.

The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild.
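To make "fully deduplicated Merkle DAG" concrete, here is a minimal sketch of a content-addressed store in which each node's identifier is a hash of its own content, and a directory's identifier depends on the identifiers of its children. The `MerkleStore` class, its method names, and the exact hashing layout are hypothetical and chosen only for illustration; this is not the identifier scheme Software Heritage itself uses.

```python
import hashlib


class MerkleStore:
    """Toy content-addressed store (illustrative only, not the archive's real scheme)."""

    def __init__(self):
        # Maps node id -> (kind, payload). Identical nodes share one entry.
        self.nodes = {}

    def add_content(self, data: bytes) -> str:
        # A file's id depends only on its bytes, so identical files collapse.
        node_id = hashlib.sha1(b"content\0" + data).hexdigest()
        self.nodes[node_id] = ("content", data)
        return node_id

    def add_directory(self, entries: dict) -> str:
        # entries maps entry name -> child node id.
        # Sorting makes the id independent of insertion order.
        payload = b"".join(
            f"{name}\0{child}\n".encode() for name, child in sorted(entries.items())
        )
        node_id = hashlib.sha1(b"directory\0" + payload).hexdigest()
        self.nodes[node_id] = ("directory", dict(entries))
        return node_id


store = MerkleStore()
readme = store.add_content(b"hello\n")
dir_a = store.add_directory({"README": readme})
# The same tree observed in a second repository maps onto the same nodes.
dir_b = store.add_directory({"README": store.add_content(b"hello\n")})
```

Because identifiers are derived from content, the two directories above resolve to the same node, and the store holds one file node and one directory node no matter how many times that tree is seen.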

Update Frequency:

Data is updated yearly

Managed By:

Software Heritage

Resources:

  1. resource:

    • Description: Software Heritage Graph Dataset
    • ARN: arn:aws:s3:::softwareheritage
    • Region: us-east-1
    • Type: S3 Bucket
  2. resource:

    • Description: S3 Inventory files
    • ARN: arn:aws:s3:::softwareheritage-inventory
    • Region: us-east-1
    • Type: S3 Bucket
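The resources above can also be browsed directly with boto3. The following is a sketch under two assumptions: that `boto3` is installed, and that the buckets allow anonymous (unsigned) reads, as is common for Registry of Open Data buckets. The helper names `bucket_from_arn` and `list_top_level_prefixes` are hypothetical.

```python
def bucket_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 ARN such as arn:aws:s3:::softwareheritage."""
    prefix = "arn:aws:s3:::"
    if not arn.startswith(prefix):
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # Drop any object key that follows the bucket name.
    return arn[len(prefix):].split("/", 1)[0]


def list_top_level_prefixes(arn: str, region: str = "us-east-1", max_keys: int = 100):
    """Return the top-level "folders" from the first page of results for a public bucket."""
    # Imported here so the ARN helper above works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Assumption: the bucket permits anonymous access, so requests are not signed.
    client = boto3.client(
        "s3", region_name=region, config=Config(signature_version=UNSIGNED)
    )
    response = client.list_objects_v2(
        Bucket=bucket_from_arn(arn), Delimiter="/", MaxKeys=max_keys
    )
    return [p["Prefix"] for p in response.get("CommonPrefixes", [])]


bucket = bucket_from_arn("arn:aws:s3:::softwareheritage")
```

Calling `list_top_level_prefixes("arn:aws:s3:::softwareheritage")` requires network access. If the bucket rejects unsigned requests, configure AWS credentials and drop the `Config(signature_version=UNSIGNED)` argument.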

Tags:

aws-pds, source code, open source software, free software, digital preservation

About

The software-heritage-dataset repository originates from the Registry of Open Data on AWS.