
Direct Data Access

Direct Data Access, or DDA for short, is a magical component of the DagsHub client and API libraries that lets you stream your data from, and upload it to, any DagsHub project. It makes it extremely simple to get started with your data science work, even when you have a large dataset.

In other words, DDA gives you the organization and reproducibility provided by DVC, with the ease of use and flexibility of a data API, without requiring any changes to your project. It also means you can stream and upload data directly while holding only part of your project’s files locally, without pulling everything to your machine.

Direct Data Access has two main components:

  1. Data Streaming
    1. Python Hooks approach
    2. Mounted Filesystem approach – Experimental
  2. Data Upload

Installation and Setup

DDA comes with the new DagsHub client libraries. To install it, simply run:

$ pip3 install dagshub

Using the full functionality of DDA requires authentication, which you can set up easily by running:

$ dagshub login

This will guide you through signing in and will provide a temporary token for authenticating with the system.

!!! info
    If you prefer to use a non-temporary token for logging in, you can run the following command:

    $ dagshub login --token <your dagshub token>

    By using this command, all client features will work until the token is revoked from the DagsHub UI.

Data Streaming

How does Data Streaming work?

Data streaming via DDA has two main implementations, Python Hooks and Mounted Filesystem, each suited to different cases. Review the support matrix below for more details and recommendations on when to use each one.

The Python Hooks method automatically detects calls to Python's built-in file operations (such as open()), and if the files exist on your DagsHub repo, it will load them on the fly as they're requested. This means that most Python ML and data libraries will automatically work with this method, without requiring manual integration.

The Mounted Filesystem implementation relies on FUSE (Filesystem in Userspace). It creates a virtual mounted filesystem reflecting your DagsHub repo, which behaves like a part of your local filesystem for all intents and purposes.

How to use Data Streaming?

To use Python Hooks, open your DagsHub project and copy the following two lines of code into the Python code that accesses your data:

from dagshub.streaming import install_hooks
install_hooks()

That’s it! You now have streaming access to all your project files.
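
For instance, here is a minimal sketch of streaming a DVC-tracked file straight into pandas; the path data/train.csv is a hypothetical file in your repo:

from dagshub.streaming import install_hooks
import pandas as pd

install_hooks()

# Because install_hooks() intercepts Python's built-in file operations,
# the file is fetched from DagsHub on first access. "data/train.csv" is
# a hypothetical DVC-tracked path in your project.
df = pd.read_csv("data/train.csv")
print(df.head())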

To see an example of this that actually runs, check out the Colab below:

Open In Colab

!!! warning "Known Limitations"
    1. Some frameworks, such as TensorFlow and OpenCV, which rely on routines written in C or C++ for file input/output, are currently not supported.
    2. dvc repro and dvc run commands for stages that have DVC-tracked files in their deps will not work, showing errors about missing data. To run such a stage, use the --downstream flag instead, or run it manually and use dvc commit.
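
For example, a sketch of the workaround for the second limitation, assuming a hypothetical pipeline stage named train:

$ dvc repro --downstream train
$ dvc commit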

Mounted Filesystem – Experimental

The Mounted Filesystem approach uses FUSE under the hood. It bypasses the limitations of the Python Hooks approach by creating a fully virtual filesystem that connects your remote to the local workspace, and it supports all frameworks and non-Python languages. However, note that FUSE is only available on Linux machines and is currently unstable. Read more about it in the DagsHub client README.

Non-magical API approach

Magic is awesome, but sometimes you need more control over how you access your project files and prefer a direct API. If you want to state explicitly and unambiguously that you’re using DagsHub Streaming, or if none of the other methods are supported on your machine, we also offer a straightforward Python client class that you can use.

Just copy the following code into your Python code:

from dagshub.streaming import DagsHubFilesystem
fs = DagsHubFilesystem()

Then replace any use of Python’s file-handling functions in the following way:

  • open() → fs.open()
  • os.stat() → fs.stat()
  • os.listdir() → fs.listdir()
  • os.scandir() → fs.scandir()

You can pass the same arguments to our client’s functions as you would to the built-in functions, and streaming functionality will be provided, e.g.:

fs.open('/full/path/from/root/to/dvc/managed/file')
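
Here is a slightly fuller sketch, assuming a hypothetical DVC-tracked directory data/ containing train.csv in your repo:

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem()

# List a DVC-tracked directory without downloading its contents;
# "data" is a hypothetical directory in your repo.
print(fs.listdir("data"))

# fs.open() accepts the same arguments as the built-in open()
with fs.open("data/train.csv") as f:
    print(f.readline())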

Data Upload

You don't need to pull the entire dataset anymore.

The upload API lets you quickly and efficiently upload or append files to existing DVC directories, without downloading anything to your machine.

How does Data Upload work?

Data Upload consists of an API and a Python client library that enable you to send files you’d like to track to DagsHub and have them automatically added to your project, using DVC or Git for versioning. To accomplish this, we implement all the logic for tracking new files on our server, so that you end up with a fully DVC- (or Git-) tracked file or folder.

How to use Data Upload?

After installing the client, you can use the upload function for both Git- and DVC-tracked files.

Upload single files using the DagsHub CLI

You can upload a single file to any location in your repository, including DVC directories, by using the DagsHub CLI in your terminal. This utility is useful in active learning scenarios, where you want to append a new file to your dataset.

A basic usage example is:

$ dagshub upload <repo_owner>/<repo_name> <local_file_path> <path_in_remote>

Options:

-m, --message TEXT  Commit message for the upload 
-b, --branch TEXT   Branch to upload the file to - this is required for private repositories
--update            Force update an existing file
-v, --verbose       Verbosity level
--help              Show this message and exit.
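
For example, a hypothetical invocation that appends a newly labeled image to a DVC-tracked folder on the main branch (the repo, file, and branch names are placeholders):

$ dagshub upload my-user/my-dataset new_image.jpg data/images/new_image.jpg -m "Append a newly labeled image" -b main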

Upload a single file using the Python client

A basic usage example is as follows:

from dagshub.upload import Repo

repo = Repo("<repo_owner>", "<repo_name>")  # Optional: username, password, token, branch

# Upload a single file to a repository in one line
repo.upload(file="<local_file_path>", path="<path_in_remote>", versioning="dvc")  # Optional: versioning, new_branch, commit_message

This will upload a single file to DagsHub, which will be tracked by DVC.

Upload multiple files using the Python client

To upload multiple files, use:

# Upload multiple files to a DVC folder in a repository with a single commit
ds = repo.directory("<name_of_remote_folder>")

# Add a file-like object (path is the relative path inside the remote folder)
f = open("<local_file_path>", 'rb')
ds.add(file=f, path="<path_in_remote>")

# Or add a local file path
ds.add(file="<local_file_path>", path="<path_in_remote>")
ds.commit("<commit_message>", versioning="dvc")

This will upload multiple files to a folder in your DagsHub repo in a single commit, with a custom commit message.
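
Building on this, here is a minimal sketch that stages every file in a local folder and commits once; the local folder images/ and the remote folder name images are hypothetical:

import os
from dagshub.upload import Repo

repo = Repo("<repo_owner>", "<repo_name>")
ds = repo.directory("images")  # hypothetical remote folder name

# Walk a hypothetical local folder, staging each file under the same
# relative path inside the remote folder, then commit everything at once.
local_dir = "images"
for root, _, files in os.walk(local_dir):
    for name in files:
        local_path = os.path.join(root, name)
        ds.add(file=local_path, path=os.path.relpath(local_path, local_dir))

ds.commit("Add images folder", versioning="dvc")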

Automagic Repo Configuration

Parts of DDA will try to automatically pick up the configuration required to communicate with DagsHub. For example, Data Streaming will use your git repository’s configuration to determine the branch you’re currently working on, as well as your authentication username and password.

The OAuth token acquired via dagshub login is cached locally, so you don’t need to log in every time you run your scripts.

If you need to override the automatically detected configuration, use the following environment variables and options in the CLI:

  • --repo (a command line option)
  • DAGSHUB_USERNAME
  • DAGSHUB_PASSWORD
  • DAGSHUB_USER_TOKEN

Or provide the relevant arguments to the Python entrypoints:

  • repo_url= (For Data Streaming)
  • username=
  • password=
  • token=
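
For example, a sketch of overriding the detected repo and credentials when constructing the streaming client; the URL and token values are placeholders:

from dagshub.streaming import DagsHubFilesystem

# Override the auto-detected repo and credentials; the values below
# are placeholders for your own repo URL and token.
fs = DagsHubFilesystem(
    repo_url="https://dagshub.com/<repo_owner>/<repo_name>",
    token="<your dagshub token>",
)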

Data Upload Use Cases

Appending files to a DVC directory

Adding files to an existing DVC directory stored in a remote is time-consuming and sometimes expensive. Normally, you need to run dvc pull to download the dataset to your local system, add the new files to the folder, then run dvc commit and dvc push to re-upload everything.

In cases where you want to add 10 files to a million-file dataset, this can become a very painful process.

With Direct Data Access and the Data Upload functionality, DagsHub takes care of that for you. Since we host or connect with your DVC remote, we calculate the new hashes, and commit the new DVC-tracked and modified Git-tracked files on your behalf.

All the above methods for uploading files work, but the easiest way is to use the CLI, by running the following:

dagshub upload <repo_owner>/<repo_name> <local_file_path> <path_in_remote>

Here, <path_in_remote> should be a DVC-tracked folder in your DagsHub repository. After running this, you will see a new commit, and the appended file will appear as part of the directory on DagsHub.

!!! important
    For uploading to private repositories, you must use the --branch BRANCH_NAME option.

Creating a dataset repo from scratch with Python

Sometimes you need to set up a repository to track a new dataset. This might be useful to share it for training and experimentation across your team, to start labeling raw data, or just to centrally manage something that exists on a single server.

With DDA, we provide a simple command to do this, fully in Python, without needing to do anything (except signing up) on DagsHub.

This is the easiest way to create a versioned dataset repo and upload it directly from your IDE or Notebook.

To do this, use the create_dataset() function. Here is a basic usage example:

from dagshub.upload import create_dataset

repo = create_dataset("<dataset_name>", "<path/to/data/directory>")  # Optional: glob_exclude, org_name, private
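
As a sketch of the optional arguments, assuming a hypothetical exclusion pattern and a private repo:

from dagshub.upload import create_dataset

# "*.tmp" is a hypothetical glob pattern for files to skip during upload.
repo = create_dataset(
    "<dataset_name>",
    "<path/to/data/directory>",
    glob_exclude="*.tmp",
    private=True,
)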

Support Matrix

The following table shows which use cases are supported for each component of Direct Data Access and recommendations for when to use each data streaming implementation. We’ll update it as we add support for additional use cases. Please let us know on our Discord community if you have any requests.

Data Streaming Support

|                                     | Python Hooks | Mounted Filesystem |
| ----------------------------------- | ------------ | ------------------ |
| Stable                              | ✓            | ✗                  |
| TensorFlow                          | ✗            | ✓                  |
| DVC Repro                           | ✗            | ✓                  |
| DVC Repro (with --downstream)       | ✓            | ✓                  |
| Python Support                      | ✓            | ✓                  |
| Non-Python Support                  | ✗            | ✓                  |
| No Additional Installations         | ✓            | ✗                  |
| Files Visible in the File Explorer  | ✗            | ✓                  |
| Support for Windows & Mac           | ✓            | ✗                  |
| Support for Linux                   | ✓            | ✓                  |
| C (Low-Level) Open                  | ✗            | ✓                  |

Recommendations

  • Python Hooks are recommended for use on Windows & Mac, and with any framework that uses Python and doesn’t rely on C-level file opens.
  • Mounted Filesystem is recommended for cases where Python Hooks don’t apply and you’re using a Linux system (Colab is a great example).

Data Upload Support

  • The upload client currently supports DagsHub repositories only. GitHub-connected repositories and empty repositories are not yet supported.
  • Uploading files that are outputs of DVC pipeline stages is also not supported, since those files are expected to be produced by a dvc repro command.
  • Deleting files is not yet supported.

Contributing

Contributions are welcome! Direct Data Access is part of the open-source DagsHub client, and contributions of dedicated support for auto-logging, data streaming, and more will be greatly appreciated. You can start by creating an issue, or by asking for guidance on our Discord community.
