Upload Data¶
Data Upload is a component of the DagsHub client and API libraries that lets you upload data to any DagsHub project, making it extremely simple to get started with your data science work.
You no longer need to pull the entire dataset: the upload API lets you upload or append files to existing DVC directories quickly and efficiently, without downloading anything to your machine.
Installation and Setup¶
DagsHub's data access tools come with the DagsHub client library. To install it, run:
$ pip3 install dagshub
Using the full functionality of DagsHub's data access tools requires authentication, which you can set up by running:
$ dagshub login
Info
If you prefer to use a non-temporary token for logging in, you can run the following command:
$ dagshub login --token <your dagshub token>
How does Data Upload work?¶
Data Upload is an API and a Python client library that enable you to send files you’d like to track to DagsHub and have them automatically added to your project, using DVC or Git for versioning. To accomplish this, we implement all the logic for tracking new files on our server, so that you end up with a fully DVC- (or Git-) tracked file or folder.
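To make the versioning step concrete: DVC identifies each file version by an MD5 hash of its contents, which is the kind of digest the server computes for you on upload. Here is a minimal sketch of that hashing step (illustrative only; `md5_of_file` is a hypothetical helper, not part of the DagsHub client):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 content hash of a file, the kind of digest DVC
    uses to identify file versions. Illustrative sketch only, not
    DagsHub's actual server-side implementation."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so arbitrarily large files fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the server computes this hash for you, the file never needs to exist on your machine alongside the rest of the dataset.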
How to use Data Upload?¶
After installing the client, you can use the upload function for both Git and DVC-tracked files.
Upload single files using the DagsHub CLI¶
You can upload a single file to any location in your repository, including DVC directories, by using the DagsHub CLI in your terminal. This utility is useful in active learning scenarios, where you want to append new files to your dataset.
A basic usage example is:
$ dagshub upload <repo_owner>/<repo_name> <local_file_path> <path_in_remote>
Options:
-m, --message TEXT Commit message for the upload
-b, --branch TEXT Branch to upload the file to - this is required for private repositories
--update Force update an existing file
-v, --verbose Verbosity level
--help Show this message and exit.
Upload a single file using the Python client¶
A basic usage example is as follows:
from dagshub.upload import Repo
repo = Repo("<repo_owner>", "<repo_name>") # Optional: username, password, token, branch
# Upload a single file to a repository in one line
repo.upload(local_path="<local_file_path>", remote_path="<path_in_remote>", versioning="dvc") # Optional: versioning, new_branch, commit_message
Upload multiple files using the Python client¶
To upload multiple files, use:
# Upload multiple files to a dvc folder in a repository with a single commit
ds = repo.directory("<name_of_remote_folder>")
# Add file-like object (path_in_remote is the relative path inside of the remote folder)
f = open("<local_file_path>", 'rb')
ds.add(file=f, path="<path_in_remote>")
# Or add a local file path
ds.add(file="<local_file_path>", path="<path_in_remote>")
ds.commit("<commit_message>", versioning="dvc")
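When you have a whole local folder to add, you can build the local-to-remote path mapping with the standard library before adding each file. A sketch under the assumption that you want to preserve the local directory structure inside the remote folder (`map_local_to_remote` is a hypothetical helper, not part of the client):

```python
from pathlib import Path

def map_local_to_remote(local_dir, remote_dir):
    """Pair every file under local_dir with its relative path inside
    the remote folder, preserving the directory structure.
    Hypothetical helper for illustration."""
    local_dir = Path(local_dir)
    return [
        (str(p), f"{remote_dir}/{p.relative_to(local_dir).as_posix()}")
        for p in sorted(local_dir.rglob("*"))
        if p.is_file()
    ]

# With the mapping in hand, the additions mirror the example above:
# for local_path, remote_path in map_local_to_remote("<local_dir>", "<name_of_remote_folder>"):
#     ds.add(file=local_path, path=remote_path)
# ds.commit("<commit_message>", versioning="dvc")
```

Since all the added files go into a single commit, this keeps your repository history clean even for large batches.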
Automagic Repo Configuration¶
DagsHub's client will try to pick up configuration required to communicate with DagsHub. For example, Data Streaming will use the configuration of your git repository to get the branch you're currently working on and your authentication username and password.
The OAuth token acquired via dagshub login is cached locally, so you don't need to log in every time you run your scripts.
If you need to override the automatically detected configuration, use the following environment variables and options in the CLI:
- --repo (a command line option)
- DAGSHUB_USERNAME
- DAGSHUB_PASSWORD
- DAGSHUB_USER_TOKEN
Or provide the relevant arguments to the Python entrypoints:
- repo_url= (for Data Streaming)
- username=
- password=
- token=
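A plausible resolution order, based on the description above, is: explicit arguments win over environment variables, which win over values auto-detected from the local git repository. A hedged sketch of that precedence (`resolve_setting` is a hypothetical helper; the client's actual resolution code may differ):

```python
import os

def resolve_setting(explicit, env_var, detected=None):
    """Pick a configuration value: an explicitly passed argument wins,
    then the environment variable, then anything auto-detected from
    the local git repo. Hypothetical sketch of the precedence, not
    the DagsHub client's actual implementation."""
    if explicit is not None:
        return explicit
    if env_var in os.environ:
        return os.environ[env_var]
    return detected

# e.g. resolve_setting(None, "DAGSHUB_USERNAME", detected="<git_config_user>")
```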
Data Upload Use Cases¶
Appending files to a DVC directory¶
Adding files to an existing DVC directory stored in a remote is time-consuming and sometimes expensive. Normally, you need to start by running dvc pull to download the dataset to your local system, add the new files to the folder, then run dvc commit and dvc push to re-upload them.
In cases where you want to add 10 files to a million-file dataset, this can become a very painful process.
With Direct Data Access and the Data Upload functionality, DagsHub takes care of that for you. Since we host or connect with your DVC remote, we calculate the new hashes, and commit the new DVC-tracked and modified Git-tracked files on your behalf.
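To illustrate why this is cheap: DVC represents a tracked directory as a small JSON manifest of per-file hashes (a ".dir" object), so appending a file only requires hashing the new file and extending the manifest, never re-hashing the existing million files. A simplified sketch of that idea (illustrative only; the exact serialization DVC uses, and DagsHub's server-side code, differ in detail):

```python
import hashlib
import json

def append_to_manifest(manifest, new_file_bytes, relpath):
    """Extend a DVC-style directory manifest (a list of
    {"md5", "relpath"} entries) with one new file, hashing only that
    file's contents. Simplified sketch of the server-side logic."""
    entry = {"md5": hashlib.md5(new_file_bytes).hexdigest(), "relpath": relpath}
    updated = sorted(manifest + [entry], key=lambda e: e["relpath"])
    # The directory's new identity is a hash over the manifest itself,
    # not over the files, so it is cheap to recompute
    dir_hash = hashlib.md5(
        json.dumps(updated, sort_keys=True).encode()
    ).hexdigest() + ".dir"
    return updated, dir_hash
```

The cost of the append is proportional to the size of the new file plus the manifest, regardless of how large the existing dataset is.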
All the above methods for uploading files work, but the easiest way to do this is to use the CLI, by running the following:
dagshub upload <repo_owner>/<repo_name> <local_file_path> <path_in_remote>
Here, <path_in_remote> should point inside a DVC-tracked folder in your DagsHub repository. After running this, you will see a new commit, and the appended file will appear as part of the directory on DagsHub.
Important
For uploading to private repositories, you must use the --branch BRANCH_NAME option.
Creating a dataset repo from scratch with Python¶
Sometimes you need to set up a repository to track a new dataset. This might be useful to share it for training and experimentation across your team, to start labeling raw data, or just to centrally manage something that exists on a single server.
With the DagsHub client, we provide a simple command to do this, fully in Python, without needing to do anything on DagsHub (except signing up).
This is the easiest way to create a versioned dataset repo and upload it directly from your IDE or Notebook.
To do this, we can use the create_dataset() function. Here is a basic usage example:
from dagshub.upload import create_dataset
repo = create_dataset("<dataset_name>", "<path/to/data/directory>") # Optional: glob_exclude, org_name, private
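The glob_exclude option filters files out of the initial upload, for example to skip temporary or cache files. A hedged sketch of that kind of pattern-based filtering, using the standard library (`filter_excluded` is a hypothetical helper; the client's actual matching semantics may differ):

```python
from fnmatch import fnmatch

def filter_excluded(paths, glob_exclude):
    """Drop any path matching the exclusion pattern, e.g. "*.tmp".
    Illustrative sketch of what a glob_exclude option does; the
    DagsHub client's actual matching may differ."""
    return [p for p in paths if not fnmatch(p, glob_exclude)]

# e.g. filter_excluded(["a.csv", "b.tmp"], "*.tmp") keeps only "a.csv"
```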
Support Matrix¶
The following table shows which use cases are supported for each component of Direct Data Access and recommendations for when to use each data streaming implementation. We’ll update it as we add support for additional use cases. Please let us know on our Discord community if you have any requests.
Data Streaming Support¶
| | Python Hooks | Mounted Filesystem |
| --- | --- | --- |
| Stable | V | X |
| Tensorflow | X | V |
| DVC Repro | X | V |
| DVC Repro (with --downstream) | V | V |
| Python Support | V | V |
| Non-Python Support | X | V |
| No additional Installations | V | X |
| Files are visible in the file explorer | X | V |
| Support for Windows & Mac | V | X |
| Support for Linux | V | V |
| C (low-level) Open | X | V |
Recommendations¶
- Python Hooks are recommended for use on Windows & Mac, and with any framework that uses Python and doesn’t rely on C-level opens.
- Mounted Filesystem is recommended for cases where Python Hooks don’t apply and you’re using a Linux system (Colab is a great example).
Data Upload Support¶
- The upload client currently supports DagsHub repositories only. GitHub-connected repositories and empty repositories are not yet supported.
- Uploading files that are outputs of DVC pipeline stages is also not supported, as those files are expected to be produced as part of a dvc repro command.
- Deleting files is not yet supported.
Contributing¶
Contributions are welcome! Data Upload is part of the open-source DagsHub client, and contributions of dedicated support for auto-logging, data streaming, and more will be greatly appreciated. You can start by creating an issue, or by asking for guidance on our Discord community.