400 Datasets from the AWS Data Registry Are Available on DagsHub
- Kang-Chi Ho
- 3 min read
- 3 years ago
 
We're excited to share that we have added over 400 datasets from the AWS Registry to DagsHub, which you can view, stream, and use in your machine learning projects. These datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals, and cover various domains, including audio, computer vision, NLP, geology, biology, and tabular data.

Where can you find the 400+ datasets from the AWS Registry?
You can explore the datasets from the AWS Registry on the new DagsHub Dataset page. It lists all available, up-to-date datasets in a user-friendly interface that lets you browse and filter them by category, such as Audio, Computer Vision, NLP, Geology, Biology, and Tabular data.
Each dataset card holds a brief description of the dataset, a link to the dataset repository, a code snippet for streaming it, and all of its relevant tags. With DagsHub’s Data Catalog, you can seamlessly explore the datasets, view their content, and use them in your machine learning projects.
How can you use a dataset from the AWS Registry with DagsHub?
Direct Data Access (DDA) supports streaming files from an S3 bucket connected to a DagsHub repository. This means you can stream a subset of an AWS Registry dataset without downloading the whole dataset to your local storage.
Check out this example:
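from dagshub.streaming import DagsHubFilesystem
# Create a virtual filesystem rooted at the current directory and backed by the DagsHub repository
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/fast-ai-imageclas-dataset")
# List the contents of the connected S3 bucket without downloading them
fs.listdir("s3://fast-ai-imageclas")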
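Since the point of streaming is to work with only part of a dataset, it can help to narrow a listing before touching any files. The following is a minimal sketch, not part of the DagsHub client: `list_subset` and `SupportsListdir` are hypothetical names, and it assumes that `fs.listdir` returns plain entry names the way `os.listdir` does. The suffixes and the limit are illustrative defaults.

```python
from typing import Iterable, List, Protocol, Sequence


class SupportsListdir(Protocol):
    """Any object with an os.listdir-style method (assumed to include DagsHubFilesystem)."""

    def listdir(self, path: str) -> Iterable[str]: ...


def list_subset(
    fs: SupportsListdir,
    path: str,
    suffixes: Sequence[str] = (".jpg", ".png"),
    limit: int = 10,
) -> List[str]:
    """Return up to `limit` entry names under `path` that end with one of `suffixes`.

    Names are sorted first so repeated calls give the same subset, and the
    suffix match is case-insensitive.
    """
    wanted = tuple(s.lower() for s in suffixes)
    matches: List[str] = []
    for name in sorted(fs.listdir(path)):
        if name.lower().endswith(wanted):
            matches.append(name)
            if len(matches) >= limit:
                break
    return matches
```

With the `fs` object from the example above, a call such as `list_subset(fs, "s3://fast-ai-imageclas", suffixes=(".tgz",), limit=5)` would return at most five matching names, which you could then open one at a time instead of fetching the whole bucket.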
Demo: How to use a dataset from AWS Registry
To find a dataset that meets your needs, check out the Dataset DB landing page and explore the supported datasets under a specific data domain by clicking the “Click for more” button.

Once you find a dataset you are interested in, click its dataset card to see more information on the dataset page.

Press the link button to access the repository, or copy the Direct Data Access (DDA) snippet to stream the data.

How to Filter Datasets on DagsHub
Another way to explore datasets for your machine learning project is to filter them in the data catalog. For instance, if you are looking for computer vision datasets provided by the AWS Registry, choose open-data-registry under General and AWS S3 under Integration. Then set the data domain topic to computer vision, and you are ready to start exploring the computer vision datasets provided by the AWS Registry.

How to Build an ML Project Using a Dataset from AWS
If you're looking for an example of using the Direct Data Access feature from DagsHub, check out this project on DagsHub! The project demonstrates how to train a computer vision model using the MNIST dataset provided by DagsHub-Datasets.
With over 400 datasets and Direct Data Access from DagsHub, you'll never have to worry about a lack of quality data for your machine learning projects again. So what are you waiting for? Start exploring the Dataset DB today!