
Stream & Download Data

Data Download is a component of the DagsHub client and API libraries that lets you stream your data from any DagsHub project. It makes it simple to get started with your data science work, even when you have a large dataset.

Installation and Setup

DagsHub's data access tools come with the DagsHub client libraries. To install them, run the following:

$ pip3 install dagshub

Using the full functionality of DagsHub's data access tools requires authentication, which you can set up by running:

$ dagshub login
This guides you through signing in and provides a temporary token for authenticating with the system.

Info

If you prefer to use a non-temporary token for logging in, you can run the following command:

$ dagshub login --token <your dagshub token>
By using this command, all client features will work until the token is revoked from the DagsHub UI.

How does Data Streaming work?

Data streaming via DagsHub's data access tools has two main implementations, Python Hooks and Mounted Filesystem, each suited to different cases. Review the support matrix below for more details and recommendations on when to use each one.

The Python Hooks method automatically detects calls to Python's built-in file operations (such as open()), and if the files exist on your DagsHub repo, it will load them on the fly as they're requested. This means that most Python ML and data libraries will automatically work with this method, without requiring manual integration.

The Mounted Filesystem implementation is based on FUSE (Filesystem in Userspace). It creates a virtual mounted filesystem that reflects your DagsHub repo and behaves like a part of your local filesystem for all intents and purposes.

How to use Data Streaming?

To use Python Hooks, open your DagsHub project and add the following two lines to the Python code that accesses your data:

from dagshub.streaming import install_hooks
install_hooks()
That’s it! You now have streaming access to all your project files.

Note: You can stream files from a specific branch or commit by setting the branch parameter.
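As a rough sketch of the note above, the wrapper below passes branch as a keyword argument to install_hooks. The wrapper name enable_streaming is hypothetical, and passing branch by keyword is an assumption based on the note, so check the client's documentation for the exact signature.

```python
def enable_streaming(branch=None):
    """Install DagsHub's Python Hooks, optionally pinned to a branch or commit."""
    # Imported lazily so this helper can be defined before dagshub is installed.
    from dagshub.streaming import install_hooks

    if branch is None:
        install_hooks()  # default behavior, as in the two-line snippet
    else:
        # Assumption: install_hooks accepts the branch parameter as a keyword.
        install_hooks(branch=branch)
```

Calling enable_streaming("my-experiment") would then stream files from a branch with that (illustrative) name.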

To see an example of this that actually runs, check out the Colab below:

Open in Colab

Known Limitations

  1. Some frameworks, such as TensorFlow and OpenCV, which rely on routines written in C or C++ for file input/output, are currently not supported.
  2. dvc repro and dvc run commands for stages that have DVC-tracked files in deps will not work and will show errors about missing data. To run such a stage, use the --downstream flag instead, or run it manually and then use dvc commit.

Mounted Filesystem – Experimental

The Mounted Filesystem approach uses FUSE under the hood. This bypasses the limitations of the Python Hooks approach by creating a fully virtual filesystem that connects your remote to the local workspace. It supports all frameworks and non-Python languages. However, note that this approach works only on Linux machines and is currently unstable. Read more about it in the DagsHub client.

Non-magical API approach

Magic is awesome, but sometimes you need more control over how you access your project files and prefer a direct API. If you want to state explicitly and unambiguously that you're using DagsHub Streaming, or if none of the other methods are supported on your machine, we also offer a straightforward Python client class that you can use.

Just copy the following code into your Python code:

from dagshub.streaming import DagsHubFilesystem
fs = DagsHubFilesystem()

Then replace any use of Python's file-handling functions in the following way:

  • open() → fs.open()
  • os.stat() → fs.stat()
  • os.listdir() → fs.listdir()
  • os.scandir() → fs.scandir()

You can pass our client's functions the same arguments you would pass to the built-in functions, and they will provide streaming functionality. For example:

fs.open('/full/path/from/root/to/dvc/managed/file')
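One way to take advantage of this drop-in design is to write helpers that accept the filesystem object as a parameter. The sketch below assumes only the four methods listed above; preview_directory and LocalFS are hypothetical names, and LocalFS is a local-disk stand-in so the example runs without a DagsHub repo.

```python
import os
import stat
import tempfile


def preview_directory(fs, directory, n_lines=3):
    """Return {filename: first n_lines lines} for regular files in directory."""
    previews = {}
    for name in sorted(fs.listdir(directory)):
        path = os.path.join(directory, name)
        if not stat.S_ISREG(fs.stat(path).st_mode):
            continue  # skip subdirectories and other non-regular entries
        with fs.open(path) as f:
            # zip stops after n_lines without reading the rest of the file.
            previews[name] = [line.rstrip("\n") for _, line in zip(range(n_lines), f)]
    return previews


class LocalFS:
    """Stand-in exposing the same four methods, backed by the local disk."""
    open = staticmethod(open)
    stat = staticmethod(os.stat)
    listdir = staticmethod(os.listdir)
    scandir = staticmethod(os.scandir)


def demo():
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "a.csv"), "w") as f:
            f.write("x,y\n1,2\n3,4\n5,6\n")
        os.mkdir(os.path.join(root, "sub"))
        return preview_directory(LocalFS(), root, n_lines=2)
```

In a real project you would pass the fs object created with DagsHubFilesystem() instead of LocalFS(), assuming its methods return the same kinds of values as their built-in counterparts.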

Contributing

Contributions are welcome! Data Streaming is part of the open-source DagsHub client, and contributions of dedicated support for auto-logging, data streaming, and more will be greatly appreciated. You can start by creating an issue or by asking for guidance in our Discord community.