Connect S3-Compatible Storage To DagsHub: Manage Data And Code In The Same Place

Storage May 23, 2023

When it comes to data, DagsHub places significant emphasis on versioning and collaboration, using DVC under the hood for version control. However, we acknowledge that not all users rely on DVC for data management, which has limited their ability to make full use of DagsHub.

To address this, we have expanded our support: users can now connect their S3 buckets and other S3-compatible storage to their DagsHub repository and view the contents, even if the data is not versioned with DVC.

Users can now effortlessly access and interact with data stored in their connected buckets without leaving the DagsHub platform. Whether or not their data is versioned with DVC, users can still benefit from the convenience of a single platform for managing data and code while retaining the ability to version their data changes with DVC if they choose to do so.

In this blog post, we will show you how to connect your S3 bucket to DagsHub and view its content.

What is S3-compatible storage?

S3-compatible storage is a cloud storage system that supports the same API as Amazon Simple Storage Service (S3). S3 is a widely used object storage service from Amazon Web Services (AWS) that lets users store and retrieve data over the internet. Other cloud providers offer storage that implements the S3 API, so the same tools and API calls used with Amazon S3 work against their services as well. This compatibility makes it easier to migrate between storage providers and lets users keep working with the existing S3 ecosystem.
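To make the "same tools and API calls" point concrete, here is a minimal sketch, assuming the boto3 library is installed. In this sketch the only thing that changes between Amazon S3 and another S3-compatible provider is the endpoint URL handed to the client. The helper names and the endpoint https://storage.example.com are hypothetical and used for illustration only.

```python
from typing import Optional


def s3_client_kwargs(endpoint_url: Optional[str] = None,
                     region: Optional[str] = None) -> dict:
    """Build keyword arguments for an S3 client.

    Leaving endpoint_url as None targets Amazon S3 itself; passing the
    endpoint of another provider targets that provider instead.
    """
    kwargs = {}
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url
    if region:
        kwargs["region_name"] = region
    return kwargs


def make_s3_client(endpoint_url: Optional[str] = None,
                   region: Optional[str] = None):
    """Create a boto3 S3 client (boto3 is imported lazily and assumed installed)."""
    import boto3
    return boto3.client("s3", **s3_client_kwargs(endpoint_url, region))


# Keyword arguments for Amazon S3 and for a hypothetical compatible provider.
aws_kwargs = s3_client_kwargs(region="us-east-1")
other_kwargs = s3_client_kwargs(endpoint_url="https://storage.example.com")
```

With either client, a call such as client.list_objects_v2(Bucket="my-bucket") is written the same way, which is what makes a provider "S3-compatible" from the user's point of view.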

S3-compatible storage can hold any kind of object, such as images, videos, audio files, CSV files, and JSON files. These services are widely used for storing large amounts of data because they offer high availability, durability, scalability, and security. They also support features such as encryption, lifecycle management, versioning, and access control.

How to connect an S3 bucket to a DagsHub repository?

To connect an S3 bucket, or any other S3-compatible storage, to a DagsHub repository, you need the following:

  • A DagsHub account
  • A DagsHub repository
  • An S3 bucket (or any other S3-compatible bucket) that contains some data files
  • The credentials for accessing the bucket

Once you have these ready, follow these few simple steps to connect your storage to DagsHub:

  1. Go to your DagsHub repository page
  2. Click on Remote
  3. Click on the Data tab
  4. Click on the relevant storage provider
  5. Click on the add a key button and fill in the URL of the bucket (e.g., s3://my-bucket) and its region
  6. Enter the credentials for accessing the bucket (e.g., AWS Access Key ID and AWS Secret Access Key)

That’s it! You have successfully connected your S3 bucket to your DagsHub repository.
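Before pasting a bucket URL into the form in step 5, it can help to check that it has the expected s3://my-bucket shape. The following is a small sketch using only the Python standard library; the function name parse_bucket_url is hypothetical, and the handling of a trailing path (prefix) is general S3 URL handling rather than something the DagsHub form is known to require.

```python
from urllib.parse import urlparse


def parse_bucket_url(url: str) -> tuple:
    """Split an s3:// URL into (bucket, prefix); raise ValueError if malformed."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected a URL like s3://my-bucket, got {url!r}")
    # The bucket name is the "host" part; anything after it is an optional prefix.
    return parsed.netloc, parsed.path.strip("/")


bucket, prefix = parse_bucket_url("s3://my-bucket")
print(bucket, repr(prefix))
```

A value such as my-bucket without the s3:// scheme raises ValueError, which catches a common copy-and-paste mistake early.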

How to access and view the connected S3 bucket?

Once you have connected an S3 bucket (or any other S3-compatible bucket) to your DagsHub repository, you can browse its contents from the repository page:

  1. Go back to your DagsHub repository page
  2. Click on the folder named s3://<bucket_name>
  3. You will see a file browser that shows all the objects stored in that bucket.

How to stream files hosted on an S3 Bucket

Direct Data Access supports streaming files from an S3 bucket connected to a DagsHub repository. This means you can stream a subset of a dataset without downloading the whole dataset to your local storage.

Check out the example:

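from dagshub.streaming import DagsHubFilesystem

# Filesystem-like view of the repository, rooted at the current directory
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/<username>/<reponame>")

# List the objects in the connected bucket without downloading them
print(fs.listdir("s3://<bucket_name>"))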
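To pick out the subset of files you actually want to stream, the listing can be filtered before anything is opened. The sketch below assumes only that fs.listdir returns a list of entry names, as in the example above; the helper list_matching and the FakeFilesystem stand-in are hypothetical and exist so the snippet runs without network access.

```python
from typing import Iterable, List


def list_matching(fs, directory: str, suffixes: Iterable[str]) -> List[str]:
    """Return sorted entry names in `directory` ending with one of `suffixes`.

    `fs` is any object with an os.listdir-style `listdir(path)` method,
    such as a DagsHubFilesystem instance.
    """
    wanted = tuple(s.lower() for s in suffixes)
    return sorted(name for name in fs.listdir(directory)
                  if name.lower().endswith(wanted))


class FakeFilesystem:
    """Stand-in used only to demonstrate the helper offline."""

    def listdir(self, path):
        return ["b.csv", "a.PNG", "notes.txt", "c.png"]


images = list_matching(FakeFilesystem(), "s3://<bucket_name>", [".png"])
print(images)
```

In real use you would pass the DagsHubFilesystem object in place of FakeFilesystem, so that only the matching files need to be streamed.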
