Connect External Storage

DagsHub supports connecting external storage to DagsHub repositories to access and interact with your data and large files without leaving the DagsHub platform.

What type of storage is supported?

DagsHub provides integration with AWS S3, Google Cloud Storage (GCS), and S3-compatible storage.

What is an S3-compatible storage?

S3-compatible storage is a storage system that exposes the same API as Amazon Simple Storage Service (S3). S3 is a widely used object storage service provided by Amazon Web Services (AWS) that allows users to store and retrieve data over the internet. Alternative providers offer storage that is compatible with the S3 API, so you can use the same tools and API calls that you use with Amazon S3.

S3-style object stores can hold any kind of object, such as images, videos, audio files, CSV files, and JSON files. They are widely used for storing large amounts of data because they offer high availability, durability, scalability, and security. They also support various features such as encryption, lifecycle management, versioning, and access control.
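
To make "same tools and API calls" concrete, here is a minimal sketch of how a generic S3 client is typically pointed at either AWS or an S3-compatible provider. It assumes the third-party `boto3` library; the helper names, the placeholder credentials, and the `https://s3.example.com` endpoint are hypothetical and are not part of DagsHub.

```python
from typing import Optional


def s3_client_kwargs(access_key: str, secret_key: str,
                     endpoint_url: Optional[str] = None,
                     region: Optional[str] = None) -> dict:
    """Build keyword arguments for boto3.client("s3", ...)."""
    kwargs = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    # An S3-compatible provider is addressed through its own endpoint URL;
    # for AWS itself the endpoint is usually left unset.
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url
    if region:
        kwargs["region_name"] = region
    return kwargs


def make_s3_client(**kwargs):
    """Create the client; boto3 is imported lazily because it is third-party."""
    import boto3
    return boto3.client("s3", **kwargs)


# Placeholder values for illustration only.
aws_kwargs = s3_client_kwargs("EXAMPLE_KEY_ID", "EXAMPLE_SECRET", region="us-east-1")
compat_kwargs = s3_client_kwargs("EXAMPLE_KEY_ID", "EXAMPLE_SECRET",
                                 endpoint_url="https://s3.example.com")
```

Only the endpoint differs between the two configurations, which is why the connection form asks for an endpoint URL for some providers.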

How to connect an S3 bucket to a DagsHub repository?

The flow below shows how to connect an AWS S3 bucket. To connect GCS or S3-compatible storage, choose the relevant option in the Data tab.

!!! note "Setting appropriate permissions"
    If the storage you're connecting is not public, you need to set up read permissions so that DagsHub can show your data to you. You can read how to do so in the Setup Remote Storage section.
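
As a rough illustration of what "read permissions" can look like on AWS, the sketch below builds a generic read-only IAM policy for a single bucket. This is an assumption about a typical setup, not DagsHub's documented requirement; the Setup Remote Storage section remains the authoritative source. The `read_only_policy` helper is hypothetical, and the bucket name is taken from the example URL used on this page.

```python
import json


def read_only_policy(bucket: str) -> dict:
    """Return a generic AWS IAM policy allowing listing and reading one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
            # Reading applies to the objects inside the bucket.
            {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": [f"{bucket_arn}/*"]},
        ],
    }


policy = read_only_policy("good-dog-pics")
print(json.dumps(policy, indent=2))
```

The policy grants no write or delete actions, which matches the read-only access described in the note.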

On your DagsHub repo page:

  1. Click the Remote button
  2. Click the Data tab
  3. Click on the relevant storage provider
  4. Click the Add a key button
  5. Fill in the bucket URL and prefix (e.g. s3://good-dog-pics/10plus)
  6. Select or fill in the Region (if applicable for the provider)
  7. Fill in the Endpoint URL (if applicable for the provider)
  8. Click Next
  9. Enter the credentials for accessing the bucket (e.g., AWS Access Key ID and AWS Secret Access Key)

Figure: Connecting an S3-compatible bucket
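
The bucket URL in step 5 combines the bucket name and an optional prefix. The sketch below shows how such a URL is conventionally split, using only the Python standard library. It is an illustration, not DagsHub's own parser; the `split_bucket_url` helper is hypothetical, and accepting the `gs://` scheme for GCS is an assumption.

```python
from urllib.parse import urlparse


def split_bucket_url(url: str) -> tuple:
    """Split a bucket URL into (scheme, bucket, prefix)."""
    parsed = urlparse(url)
    # The bucket name sits where a hostname normally would.
    if parsed.scheme not in ("s3", "gs") or not parsed.netloc:
        raise ValueError(f"Not a recognised bucket URL: {url!r}")
    # Everything after the bucket name is the prefix (may be empty).
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


scheme, bucket, prefix = split_bucket_url("s3://good-dog-pics/10plus")
print(scheme, bucket, prefix)
```

For the example URL, the bucket is `good-dog-pics` and the prefix is `10plus`, so only objects under `10plus` fall within the connected location.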

How to access and view the connected S3 bucket?

Once you have connected an S3 bucket (or any other supported bucket) to your DagsHub repository:

  1. Go back to your DagsHub repository page
  2. Click on the folder named s3://<bucket_name>
  3. You will see a file browser that shows all the objects stored in that bucket.

Figure: Viewing a connected S3-compatible bucket

How to stream files hosted on an S3 bucket?

Data Streaming supports streaming files from an S3 bucket connected to a DagsHub repository. This means you can stream a subset of a dataset without downloading the whole dataset to your local storage.

Check out the example:

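```python
from dagshub.streaming import DagsHubFilesystem

# Create a virtual filesystem for the repository, rooted at the current directory
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/<username>/<reponame>")

# List the objects under the connected bucket prefix
fs.listdir("s3://good-dog-pics/10plus")
```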
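
To illustrate "streaming a subset", the sketch below filters a directory listing down to image files before anything is opened. It assumes that `listdir` returns plain file names, as `os.listdir` does. The `list_by_extension` helper, the `FakeFilesystem` stand-in, and the file names are hypothetical; the stand-in is there only so that the example runs without a DagsHub connection.

```python
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def list_by_extension(fs, directory, extensions=IMAGE_EXTENSIONS):
    """Return the sorted names in `directory` that end with one of `extensions`."""
    names = fs.listdir(directory)
    return sorted(name for name in names if name.lower().endswith(extensions))


class FakeFilesystem:
    """Minimal stand-in exposing only listdir(), for offline illustration."""

    def __init__(self, listing):
        self._listing = listing

    def listdir(self, path):
        return list(self._listing.get(path, []))


fake_fs = FakeFilesystem(
    {"s3://good-dog-pics/10plus": ["rex.jpg", "notes.txt", "Luna.PNG"]}
)
selected = list_by_extension(fake_fs, "s3://good-dog-pics/10plus")
print(selected)
```

With a real connection, the `fs` object from the example above would be passed in place of `fake_fs`.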