A Scalable Google Drive Alternative for ML – Launching DagsHub Storage Buckets with Colab Integration

Google Colab Apr 01, 2024

TL;DR

Today we're excited to launch DagsHub Storage Buckets – an S3-compatible, easy-to-use, ML-focused storage solution – integrated with Google Colab for easy and scalable access in projects that work with large-scale datasets. The main benefits of DagsHub Storage:

  1. High throughput, supporting millions of files
  2. Easy-to-use access controls and sharing
  3. A convenient UI for viewing files, integrated with data management, experiment management, and labeling tools

Check out the example notebook to get started:

Open in Colab

Building a scalable and easy-to-use storage solution for Colab

Colab is one of the most popular tools for machine learning and AI development. Up until now, the main ways people managed data with Colab were either the simple, integrated route through Google Drive, or one of the cloud storage services like S3, GCS, or Azure Blob Storage. Each has major shortcomings.

GDrive was not designed for ML workflows, which require handling very large numbers of files and large data volumes for dataset management. If you use one of the cloud storage solutions, you'll have to deal with complicated access controls, and there's no convenient way to visualize, share, and collaborate on the data stored there.

Enhancing the Colab Experience with DagsHub Storage

That's why we're excited to launch DagsHub Storage Buckets. Each repo on DagsHub comes with a DagsHub Storage Bucket, which is fully S3-compatible, meaning it works with Boto, S3FS, and RClone, and can easily be mounted to your Colab instance. It also provides an easy way to grant read/write/admin access and a visualization layer for your bucket, solving the challenges of working with both GDrive and the cloud storage providers.
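Because the bucket speaks the S3 protocol, any S3 client can talk to it. Here's a minimal sketch using boto3; the endpoint URL format, the helper names, and the use of a DagsHub access token for both credential fields are assumptions for illustration – copy the exact connection details from your repo's storage bucket page:

```python
def bucket_endpoint(owner: str) -> str:
    """Build the S3 endpoint URL for a user's DagsHub buckets.

    The URL scheme here is an assumption for illustration; use the
    real endpoint shown on your repo's storage bucket page.
    """
    return f"https://dagshub.com/api/v1/repo-buckets/s3/{owner}"


def list_bucket(owner: str, repo: str, token: str) -> list[str]:
    """List object keys in a repo's DagsHub Storage Bucket via boto3."""
    import boto3  # imported lazily so the URL helper above stays dependency-free

    s3 = boto3.client(
        "s3",
        endpoint_url=bucket_endpoint(owner),
        aws_access_key_id=token,      # a DagsHub access token stands in
        aws_secret_access_key=token,  # for both S3 credential fields
    )
    resp = s3.list_objects_v2(Bucket=repo)
    return [obj["Key"] for obj in resp.get("Contents", [])]
```

The same client configuration works for `get_object`, `put_object`, and any other standard S3 call, which is what makes tools like S3FS and RClone work out of the box.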

This integration builds on our existing partnership with Colab, which allows DagsHub users to open notebooks in Colab directly from DagsHub and commit them back using Git or DVC. Our goal has always been to simplify the ML development cycle by handling the MLOps complexities so you can focus on building.


Why DagsHub and Colab?

Google Colab has become an indispensable tool for data scientists and ML practitioners, thanks to its free, interactive computing environment that supports popular libraries like TensorFlow and PyTorch. It offers powerful GPUs and TPUs for computation, facilitates real-time collaboration, and now, with DagsHub integration, closes the loop on the ML training lifecycle.

DagsHub provides a platform to manage and collaborate in the machine learning experimentation and development lifecycle, with tools for dataset curation and annotation, experiment tracking, and a model registry.

Together, DagsHub + Colab cover the entire model development process, from data collection to model management, making it accessible and convenient.

How Does It Work?

Automatic setup

The easiest way to get started in Colab is to install the DagsHub client and run the following:

%pip install -q dagshub
import dagshub.colab

# Authenticates this Colab instance with DagsHub and returns the repo to work with
DAGSHUB_REPO = dagshub.colab.login()


Manual setup

Alternatively, you can use the DagsHub UI to sign up for DagsHub, and create your first repository. Below the file list, you'll see your DagsHub Storage Bucket, which will be empty.



Data Upload

The easiest way to upload your data is with the DagsHub client. Simply run (via bash):

!dagshub upload <user_name>/<repo_name> <local_path> <remote_path> --bucket

Or via Python:

dagshub.upload_files("<user_name>/<repo_name>", "<local_path>", remote_path="<remote_path>", bucket=True)

Mounting & Syncing with DagsHub Storage Buckets

You can mount DagsHub Storage Buckets much like you would Google Drive, but with a more scalable and robust backend, built from the ground up for machine learning use cases. Let's see how that works.

The commands below set up everything you need for this to work, including installing the necessary packages, such as RClone and FUSE3, and configuring the remote. You only need to run a single command for each operation!
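For the curious, here's roughly what the client sets up behind the scenes: an RClone remote pointing at the bucket's S3 endpoint, which is then FUSE-mounted into the Colab filesystem. The remote name, endpoint format, and flags below are illustrative assumptions, not the exact configuration the client generates:

```
# ~/.config/rclone/rclone.conf entry (illustrative):
# [dagshub]
# type = s3
# provider = Other
# access_key_id = <your_dagshub_token>
# secret_access_key = <your_dagshub_token>
# endpoint = https://dagshub.com/api/v1/repo-buckets/s3/<user_name>

# FUSE-mount the bucket into the local filesystem:
rclone mount dagshub:<repo_name> <repo_name>/dagshub_storage/ --daemon
```

The DagsHub client wraps all of this in a single call, so you don't need to manage the configuration yourself.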

Syncing a Local Folder to DagsHub Storage

Sometimes you want to sync a local folder with your DagsHub storage remote. You can do this by running:

dagshub.storage.sync("<user_name>/<repo_name>", "<local_path>", "<remote_path>")

This will sync the local_path to the remote_path inside your storage bucket!

Mounting a Bucket to Colab

To mount your bucket, simply run:

mount_path = dagshub.storage.mount("<user_name>/<repo_name>")

This will mount it at the <repo_name>/dagshub_storage/ path, unless you provide a custom path via the path= argument. The function returns the mount directory, so the easiest way to unmount the bucket is to run:

dagshub.storage.unmount(repo="<user_name>/<repo_name>", path=mount_path)
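Once mounted, the bucket behaves like any local directory, so plain filesystem APIs work on it. A small sketch (the helper name is ours, not part of the DagsHub client) that lists files under a mount path:

```python
from pathlib import Path


def list_mounted_files(mount_path, pattern="*"):
    """Recursively collect file paths under a mounted bucket directory,
    relative to the mount root. Works on any directory, so it can be
    pointed at the path returned by dagshub.storage.mount()."""
    return sorted(
        str(p.relative_to(mount_path))
        for p in Path(mount_path).rglob(pattern)
        if p.is_file()
    )
```

For example, `list_mounted_files(mount_path, "*.jpg")` would enumerate image files in the bucket. Since FUSE exposes the bucket as ordinary files, training scripts and data loaders can read datasets from it directly, with no S3-specific code.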

A full tour of DagsHub Storage Buckets in Colab

Our detailed guide and example notebook walk you through installing necessary packages, setting up your DagsHub repository, and using RClone for data upload and access.

We cover mounting the bucket to Colab, running model inference, and even training directly on bucket data while tracking an experiment with MLflow on DagsHub, so the entire lifecycle of ML development is covered, showcasing the seamless integration of DagsHub Storage with Google Colab.

Try it out here:

Open in Colab

Join Us on This Exciting Journey

This integration is more than just a technical enhancement; it's a step towards realizing our vision of making ML development more accessible and collaborative. By joining forces with Google Colab and launching DagsHub Storage, we're excited to offer the ML community a more efficient, scalable, and collaborative environment.

Explore the capabilities of DagsHub Storage and see how it can transform your ML workflows. We're eager to support your journey and continue breaking down barriers in ML development.


Dean Pleban

Co-Founder & CEO of DAGsHub. Building the home for data science collaboration. Interested in machine learning, physics and philosophy. Join https://DAGsHub.com
