Launching Direct Data Access - The new and improved way to interface with your data
TL;DR: DagsHub is launching Direct Data Access – a new and improved way to interact with your data. It provides an intuitive interface to stream and upload data for any ML project, requires no adaptation of your existing code, and keeps all the benefits of a versioned, shareable dataset based on open-source tools. If you want to dive into the docs, you can head there now.
Our main focus at DagsHub is creating a new and collaborative standard for how people build machine learning projects. We want data scientists to have an easy-to-use platform that provides an intuitive way to manage, collaborate on, and scale their work.
Machine learning projects contain many complex parts, but today we’ll focus on the heart and soul of our work as data scientists - Data. We're launching a new solution that will take the way you interact with your data to the next level, unlocking new streaming and uploading capabilities.
Data: The heart and soul of every machine learning project
Data is the heart and soul of every machine learning project. It’s a Data Scientist’s best friend, and one of the main factors that improves model performance. But it isn’t all rainbows and unicorns – interacting with your data can be annoying in a few ways. To explain what I mean, let’s talk about the world before DDA.
There are two common approaches available today for interacting with data while working on a machine-learning project:
- The CLI approach
- The API approach
The CLI approach
Let’s start with the CLI approach - in this approach you store your data, often with versions, either manually (e.g. with the S3 CLI), creating directories with meaningful names for your data, or by using DVC (a Git extension for data versioning). This approach relies on pushing (uploading) and pulling (downloading) an entire dataset. It’s better for sharing datasets across teams, managing changes, and overall control.
However, the CLI approach can be time-consuming and pricey; to start working, the whole dataset needs to be pulled first. In addition, adding just one image requires pulling the entire dataset, committing, and pushing it back.
Imagine you want to add 10 files to an already tracked million-file dataset. You need to clone the repo and pull the original dataset, add the 10 new files to DVC, then commit and push everything back to your remote. This becomes a very painful process as your data grows.
The API approach
Moving on to the API approach - downloading and uploading data using API calls, typically from a Python client. This approach reduces complexity by providing a fluent, intuitive interface and allowing users to operate only on the relevant sections of their datasets rather than the whole thing, including via data streaming.
This approach has its pitfalls as well, which are exacerbated when you work with your real data rather than a toy dataset. The setup process is often complicated, since you need to conform to custom interfaces to support the functionality. It also makes it hard to keep an organized workflow and requires much more discipline to maintain and collaborate properly.
The best of both worlds
Contemplating which approach is better got us thinking - what if we didn’t have to choose? Could we have an intuitive solution for interacting with specific parts of our datasets while collaborating with a team and with no maintenance overhead?
Introducing Direct Data Access
So far, DagsHub has been a great platform to manage and visualize your data, share it with collaborators, and provide control over all aspects of your project, based on open-source tools like Git and DVC.
With the introduction of Direct Data Access (we like to call it DDA), you can combine the best of the CLI and API approaches – a Python client, CLI, and REST API (all in one) that lets you:
- Stream your data, including subsets of it, directly
- Mount a smart virtual filesystem connected to your DagsHub repo
- Upload data in an intuitive way that preserves versioning
This makes working with your data easier, faster, and more intuitive, without changing anything in your project structure or code.
How It Works
Direct Data Access is built for ease of use and flexibility. Let’s go over a quick guide.
Install it by running the following in your terminal:
$ pip3 install dagshub
Data Streaming
The easiest way to get started is the install_hooks approach, which automatically detects calls to Python's built-in file operations and modifies them to retrieve files from DagsHub when they aren’t found locally.
Just add the following two lines to the Python code that accesses your data:
from dagshub.streaming import install_hooks
install_hooks()
That’s all you need to do – after this, you have streaming access to all your project files. Your Python code will act as if all your files exist locally, so opening or displaying a file just works, even if it was never downloaded.
We also do smart caching, so files remain available once they’ve been streamed. For a runnable example, check out this Colab Notebook.
Data Upload
You don't need to pull the entire dataset anymore.
The upload API and CLI let you upload or append files to existing DVC (or Git) directories, without downloading anything to your machine, quickly and efficiently. You can now send files you’d like to track to DagsHub and have them automatically added to your project, using DVC or Git for versioning.
To use the upload CLI, simply run:
$ dagshub upload <repo_owner>/<repo_name> <local_file_path> <path_in_remote>
Or run it directly in your Python code with the following snippet:
from dagshub.upload import Repo
repo = Repo("<repo_owner>", "<repo_name>") # Optional: username, password, token
# Upload a single file to a repository in one line
repo.upload(file="<local_file_path>", path="<path_in_remote>", versioning="dvc") # Optional: versioning, new_branch, commit_message
Advanced usage and next steps
The above covers only the most basic usage of DDA. We also provide ways to upload multiple files at once, a no-nonsense Python API for those who don’t like magic, and a full virtual mounted filesystem.
To learn more, check out the full documentation.
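As one illustration of batching uploads, here is a stdlib-only sketch that walks a local folder and mirrors its layout remotely. The directory names are hypothetical, and the actual upload call (commented out) simply reuses the repo.upload signature from the snippet above, one file at a time:

```python
from pathlib import Path

def collect_uploads(local_dir, remote_prefix):
    """Map every file under local_dir to its target path in the repo,
    preserving the directory structure."""
    local_dir = Path(local_dir)
    return [
        (str(p), f"{remote_prefix}/{p.relative_to(local_dir).as_posix()}")
        for p in sorted(local_dir.rglob("*"))
        if p.is_file()
    ]

# Hypothetical usage, reusing the documented single-file upload call:
# for local_path, remote_path in collect_uploads("new_images", "images"):
#     repo.upload(file=local_path, path=remote_path, versioning="dvc")
```

Check the documentation for the supported ways to upload multiple files in a single versioned commit rather than one commit per file.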
Unlocking new ways to work with your data in machine learning
As a Data Scientist working on machine learning projects, data is at the center of almost everything you do on a day-to-day basis. The potential improvements to your workflow enabled by the new DDA API allow you to work smarter and more efficiently.
Reduce the time to start a training run
Downloading the entire dataset before starting a training run can take up a lot of time and effort. If you’re using an expensive GPU machine, that might also be a cost you’re not willing to pay.
DDA provides a way to avoid this by treating your data as if it were on your machine from the start, downloading files only as your training run needs them. This makes it much faster to start training!
Training on a subset of data
In other cases, you might have a huge dataset and want to run your training or processing only on a subset. If you were working with a table in a database, that subset would be just a query away, but for unstructured data there’s no simple way to do this.
With DDA you can download only what you use, intuitively zooming in on the part of your data you need, which saves space and time and can accelerate your iteration cycle.
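The "query" step for unstructured data can be as simple as ordinary path filtering: pick the slice of the repo's file listing you care about, and with streaming enabled, only the files you actually open ever get fetched. A small sketch with hypothetical paths:

```python
from fnmatch import fnmatch

def select_subset(all_paths, pattern):
    """Keep only the files matching a glob-style pattern."""
    return [p for p in all_paths if fnmatch(p, pattern)]

# Hypothetical repo listing; in practice this would come from your project.
listing = [
    "images/cats/001.jpg",
    "images/dogs/002.jpg",
    "images/cats/003.jpg",
]

cat_images = select_subset(listing, "images/cats/*.jpg")
# With install_hooks() active, iterating over cat_images and opening each
# file would stream just those files - the dogs never get downloaded.
```

This is the unstructured-data analogue of a WHERE clause: filtering happens on paths before any bytes move over the network.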
Reporting data from edge devices
What if you have a bunch of edge devices (let’s say cameras) that collect data, but want to train a model centrally? Data collection can become a huge pain – requiring a central location to send the data to, then sorting it and adding it to your project repo.
With DDA you can skip this entire issue by uploading the data directly (via the Python or REST APIs), adding it as a new version of your dataset.
Active learning
Last but not least – Active Learning. This is a topic for an entire blog (or two), but in short, Active Learning is a way to automatically and iteratively improve your machine learning model by collecting additional data after it has been deployed and using it to retrain the model, increasing performance and reducing errors.
With DDA, you can complete the active learning loop fully on DagsHub. We’ll show how to do this in a separate blog, so stay tuned.
Try it out
To get started with the Direct Data Access API, check out our docs. The code is open-source and available on GitHub. As always, we’d love to get your feedback. Feel free to reach out to us via our Discord channel, where our team is waiting to answer your questions, help out in any way, or just talk about data, machine learning, and the universe’s secrets.