
AWS Open-Registry Project:

Goal:

Scrape the metadata of every dataset in awslabs/open-data-registry/dataset and compile it into a single CSV file. The CSV file is then used to automatically create a repository on DagsHub for each dataset, connect its S3 bucket, and add a comprehensive README.md to the repository. Each README.md includes an example of how to use DagsHub Direct Data Access (DDA).

Check out the open-data-registry from AWS: awslabs/open-data-registry

Data Acquisition:

python main.py --pipeline retrieve_info
  • OUTPUT:
    • open_data_registry_comprehensive_table.csv
    • deprecated_dataset.csv: Lists the YAML files whose datasets are marked as deprecated.
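As a rough illustration of how the two output files could be produced, the sketch below splits already-parsed YAML records into active and deprecated rows and serializes them as CSV. The `DeprecatedNotice` key, the `file` and `name` columns, and the helper names are assumptions made for this example, not taken from the project's code.

```python
import csv
import io

# Assumption: a dataset counts as deprecated when this key is present and non-empty.
DEPRECATED_KEY = "DeprecatedNotice"


def split_records(records):
    """Separate active datasets from deprecated ones.

    `records` maps a YAML file name to its already-parsed content (a dict).
    """
    active, deprecated = [], []
    for filename, meta in sorted(records.items()):
        row = {"file": filename, "name": meta.get("Name", "")}
        if meta.get(DEPRECATED_KEY):
            deprecated.append(row)
        else:
            active.append(row)
    return active, deprecated


def to_csv(rows, fieldnames=("file", "name")):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up entries standing in for parsed YAML files.
sample = {
    "alpha.yaml": {"Name": "Alpha"},
    "beta.yaml": {"Name": "Beta", "DeprecatedNotice": "No longer updated."},
}
active_rows, deprecated_rows = split_records(sample)
print(to_csv(deprecated_rows))
```

In this sketch the YAML parsing itself is left out, so any YAML loader can feed `split_records`.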

Automatically Create Repository on DagsHub:

  1. Default setting:
python main.py --pipeline connect_bucket
  2. Control the start and end index of the csv_file:
python main.py --pipeline connect_bucket --start 4 --end 7
  3. Override the default csv_file:
python main.py --pipeline connect_bucket --csv other.csv
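The flags above could be parsed along the following lines. This is a minimal sketch rather than the project's actual `main.py`: the default CSV name, the restriction to two pipeline names, and the choice to treat `--end` as exclusive (as in Python slicing) are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Open-registry pipelines (sketch)")
    # Assumption: only the two pipeline names mentioned in this README exist.
    parser.add_argument("--pipeline", required=True,
                        choices=["retrieve_info", "connect_bucket"])
    parser.add_argument("--start", type=int, default=None,
                        help="index of the first CSV row to process")
    parser.add_argument("--end", type=int, default=None,
                        help="index at which to stop (assumed exclusive)")
    # Assumption: the default is the table written by retrieve_info.
    parser.add_argument("--csv", default="open_data_registry_comprehensive_table.csv")
    return parser


def select_rows(rows, start=None, end=None):
    """Return the slice of rows selected by --start and --end."""
    return rows[start:end]


args = build_parser().parse_args(
    ["--pipeline", "connect_bucket", "--start", "4", "--end", "7"]
)
rows = [f"dataset_{i}" for i in range(10)]
selected = select_rows(rows, args.start, args.end)
print(selected)
```

Leaving `--start` and `--end` unset selects every row, which matches the default-setting command.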

The terminal prompts for your DagsHub username, DagsHub password, organization (press Enter to skip if not needed), AWS access key, and AWS secret key.
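A prompt sequence like this one could be sketched with the standard `getpass` module, which hides secrets while they are typed. The function name, the prompt wording, and the fallback of `ownername` to the username when the organization is skipped are assumptions for illustration only.

```python
import getpass


def collect_credentials(ask=input, ask_secret=getpass.getpass):
    """Ask for the values the pipeline needs and return them as a dict.

    `ask` and `ask_secret` are injectable so the function can be exercised
    without a terminal.
    """
    username = ask("DagsHub username: ").strip()
    password = ask_secret("DagsHub password: ")
    ownername = ask("Organization (press Enter to skip): ").strip()
    access_key = ask_secret("AWS access key: ")
    access_secret_key = ask_secret("AWS secret key: ")
    return {
        "username": username,
        "password": password,
        # Assumption: without an organization, the user owns the repositories.
        "ownername": ownername or username,
        "access_key": access_key,
        "access_secret_key": access_secret_key,
    }


# Scripted answers stand in for keyboard input in this example.
plain_answers = iter(["alice", ""])
secret_answers = iter(["example-password", "EXAMPLE-KEY", "example-secret"])
creds = collect_credentials(
    ask=lambda _prompt: next(plain_answers),
    ask_secret=lambda _prompt: next(secret_answers),
)
print(creds["ownername"])
```

Calling `collect_credentials()` with no arguments would prompt interactively instead.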

  • Key Description:

    • username: DagsHub account username
    • ownername: DagsHub organization name; press Enter to skip if you don't have an organization
    • password: DagsHub account password
    • host: DagsHub host URL
    • access_key: AWS S3 bucket access key
    • access_secret_key: AWS S3 bucket secret access key
    • token: DagsHub account API token
  • OUTPUT:

    • missing_s3_bucket_link.csv: Lists the YAML files that lack an S3 bucket link.
    • fail_to_connect_bucket.csv: Lists the datasets whose S3 bucket could not be connected to the repository.
  • Link to all datasets: DagsHub-Datasets
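The bookkeeping behind the two output files listed above might look like the following sketch. The column names (`file`, `name`, `s3_bucket`) and the `connect` callable are hypothetical stand-ins; the real pipeline's connection step is not shown in this README.

```python
def process_rows(rows, connect):
    """Try to connect each dataset's bucket and collect the problem cases.

    Returns (missing, failed): files without a bucket link, and
    (dataset name, error message) pairs for connections that raised.
    """
    missing, failed = [], []
    for row in rows:
        bucket = row.get("s3_bucket", "")
        if not bucket:
            # Nothing to connect; record the source file and move on.
            missing.append(row["file"])
            continue
        try:
            connect(row["name"], bucket)
        except Exception as exc:
            # Keep going so one bad dataset does not stop the whole run.
            failed.append((row["name"], str(exc)))
    return missing, failed


def fake_connect(repo_name, bucket):
    """Stand-in for the real connection step; fails for one bucket."""
    if bucket == "bucket-c":
        raise RuntimeError("access denied")


sample_rows = [
    {"file": "a.yaml", "name": "a", "s3_bucket": "bucket-a"},
    {"file": "b.yaml", "name": "b", "s3_bucket": ""},
    {"file": "c.yaml", "name": "c", "s3_bucket": "bucket-c"},
]
missing, failed = process_rows(sample_rows, fake_connect)
print(missing, failed)
```

The two returned lists correspond to the contents one would write to missing_s3_bucket_link.csv and fail_to_connect_bucket.csv.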

Functions for Repository Manipulation:

  1. Create Repository: dagshub.upload.wrapper.create_repo
  2. Migrate Repository: migrate_repo
  3. Delete Repository: delete_repo
  4. Add Tags to Repository: add_tags
  5. Create README: readme
  6. Connect Bucket to Repository: connect_bucket