Direct Data Access, or DDA for short, is a magical component of the DagsHub client and API libraries that lets you stream your data from, and upload it to, any DagsHub project. It makes it extremely simple to get started with your data science work, even when you have a large dataset.
In other words, DDA gives you the organization and reproducibility provided by DVC, with the ease of use and flexibility of a data API, without requiring any changes to your project. It also means you can stream and upload data directly, working with only part of your project’s files without pulling everything to your machine.
Direct Data Access has 2 main components: Data Streaming and Data Upload.
DDA comes with the new DagsHub client libraries. To install it, simply type in the following:
```bash
$ pip3 install dagshub
```
Using all functionality of DDA requires authentication, and you can do this easily by running:
```bash
$ dagshub login
```
This will guide you through signing in and provide a temporary token for authenticating with the system.
!!! info
    If you prefer to use a non-temporary token for logging in, you can run the following command:

    ```bash
    $ dagshub login --token <your dagshub token>
    ```

    By using this command, all client features will work until the token is revoked from the DagsHub UI.
Data streaming via DDA has two main implementations, Python Hooks and Mounted Filesystem, each valid for different cases. Review the support matrix below for more details and recommendations on when to use each one.
The Python Hooks method automatically detects calls to Python's built-in file operations (such as `open()`), and if the files exist on your DagsHub repo, it will load them on the fly as they're requested. This means that most Python ML and data libraries will automatically work with this method, without requiring manual integration.
The Mounted Filesystem implementation relies on FUSE (Filesystem in Userspace). It creates a virtual mounted filesystem reflecting your DagsHub repo, which behaves like part of your local filesystem for all intents and purposes.
To use Python Hooks, open your DagsHub project, and copy the following 2 lines of code into your Python code which accesses your data:
```python
from dagshub.streaming import install_hooks
install_hooks()
```
That’s it! You now have streaming access to all your project files.
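For instance, here is a minimal sketch of what that looks like in practice. The file path and the use of pandas are illustrative assumptions, and the script is assumed to run from inside a clone of your repository:

```python
from dagshub.streaming import install_hooks
import pandas as pd

# Assumes this runs from inside a clone of your DagsHub repository,
# so install_hooks() can pick up the repo details automatically.
install_hooks()

# "data/train.csv" is a placeholder for a DVC-tracked file in your repo.
# It is streamed from DagsHub on the fly the first time it's opened.
df = pd.read_csv("data/train.csv")
print(df.head())
```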
To see an example of this that actually runs, check out the Colab below:
!!! warning "Known Limitations"
    1. Some frameworks, such as TensorFlow and OpenCV, which rely on routines written in C or C++ for file input/output, are currently not supported.
    2. `dvc repro` and `dvc run` commands for stages that have DVC-tracked files in `deps` will not work, showing missing-data errors. To run such a stage, use the `--downstream` flag instead, or run it manually and use `dvc commit`.
The Mounted Filesystem approach uses FUSE under the hood. This bypasses the limitations in the Python Hooks approach by creating a fully virtual filesystem that connects your remote to the local workspace. It supports all frameworks and non-Python languages. However, note that FUSE only supports Linux machines and is currently unstable. Read more about it in the DagsHub client README.
Magic is awesome, but sometimes you need more control over how you access your project files and prefer a direct API. If you want to explicitly and unambiguously state that you're using DagsHub Streaming, or if none of the other methods are supported on your machine, we also offer a straightforward Python client class that you can use.
Just copy the following code into your Python code:
```python
from dagshub.streaming import DagsHubFilesystem
fs = DagsHubFilesystem()
```
Then replace any use of Python's file-handling functions in the following way:

* `open()` → `fs.open()`
* `os.stat()` → `fs.stat()`
* `os.listdir()` → `fs.listdir()`
* `os.scandir()` → `fs.scandir()`
You can pass the same arguments you would to the built-in functions to our client's functions, and streaming functionality will be provided. e.g.:
```python
fs.open('/full/path/from/root/to/dvc/managed/file')
```
You don't need to pull the entire dataset anymore.
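As a minimal sketch, the explicit client can be combined with the rest of your code like this. The `data/` folder, file name, and use of pandas are assumptions standing in for your own project layout:

```python
from dagshub.streaming import DagsHubFilesystem
import pandas as pd

fs = DagsHubFilesystem()

# List a DVC-tracked directory without pulling it ("data" is a placeholder)
print(fs.listdir("data"))

# Stream a single file into pandas - only this file is fetched
with fs.open("data/train.csv") as f:
    df = pd.read_csv(f)
```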
The upload API lets you upload or append files to existing DVC directories, quickly and efficiently, without downloading anything to your machine.
Data Upload is an API and a Python client library that enable you to send files you’d like to track to DagsHub and have them automatically added to your project, using DVC or Git for versioning. To accomplish this, we implement all the logic for tracking new files on our server, so that you end up with a fully DVC- (or Git-) tracked file or folder.
After installing the client, you can use the upload function for both Git and DVC-tracked files.
You can upload a single file to any location in your repository, including DVC directories, by using the DagsHub CLI in your terminal. This utility is useful for active learning scenarios where you want to append a new file to your dataset.
A basic usage example is:
```bash
$ dagshub upload <repo_owner>/<repo_name> <local_file_path> <path_in_remote>
```
```
Options:
  -m, --message TEXT  Commit message for the upload
  -b, --branch TEXT   Branch to upload the file to - this is required for private repositories
  --update            Force update an existing file
  -v, --verbose       Verbosity level
  --help              Show this message and exit.
```
A basic usage example is as follows:
```python
from dagshub.upload import Repo

repo = Repo("<repo_owner>", "<repo_name>")  # Optional: username, password, token, branch

# Upload a single file to a repository in one line
repo.upload(file="<local_file_path>", path="<path_in_remote>", versioning="dvc")  # Optional: versioning, new_branch, commit_message
```
This will upload a single file to DagsHub, which will be tracked by DVC.
To upload multiple files, use:
```python
# Upload multiple files to a DVC folder in a repository with a single commit
ds = repo.directory("<name_of_remote_folder>")

# Add a file-like object (path_in_remote is the relative path inside the remote folder)
f = open("<local_file_path>", 'rb')
ds.add(file=f, path="<path_in_remote>")

# Or add a local file path
ds.add(file="<local_file_path>", path="<path_in_remote>")

ds.commit("<commit_message>", versioning="dvc")
```
This will upload multiple files to a folder in your DagsHub repo in a single commit, with a custom commit message.
Parts of DDA will try to pick up configuration required to communicate with DagsHub. For example, Data Streaming will use the configuration of your git repository to get the branch you're currently working on and your authentication username and password.
The OAuth token acquired via `dagshub login` is cached locally, so you don't need to log in every time you run your scripts.
If you need to override the automatically detected configuration, use the following environment variables and options in the CLI:

* `--repo` (a command line option)
* `DAGSHUB_USERNAME`
* `DAGSHUB_PASSWORD`
* `DAGSHUB_USER_TOKEN`
Or provide the relevant arguments to the Python entrypoints (see the sketch after this list):

* `repo_url=` (for Data Streaming)
* `username=`
* `password=`
* `token=`
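For example, here is a sketch of constructing the streaming client with an explicit configuration instead of relying on auto-detection; the repository URL and token values are placeholders:

```python
from dagshub.streaming import DagsHubFilesystem

# All values are placeholders - substitute your own repository and credentials.
fs = DagsHubFilesystem(
    repo_url="https://dagshub.com/<repo_owner>/<repo_name>",
    token="<your dagshub token>",  # alternatively, username=... and password=...
)
```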
Adding files to an existing DVC directory stored in a remote is time-consuming and sometimes expensive. Normally, you need to run `dvc pull` to download the dataset to a local system, add the new files to the folder, and then run `dvc commit` and `dvc push` to re-upload everything.
In cases where you want to add 10 files to a million-file dataset, this can become a very painful process.
With Direct Data Access and the Data Upload functionality, DagsHub takes care of that for you. Since we host or connect with your DVC remote, we calculate the new hashes, and commit the new DVC-tracked and modified Git-tracked files on your behalf.
All the above methods for uploading files work, but the easiest way to do this is to use the CLI, by running the following:
```bash
dagshub upload <repo_owner>/<repo_name> <local_file_path> <path_in_remote>
```

Here, `<path_in_remote>` is a DVC-tracked folder in your DagsHub repository. After running this, you will see a new commit, and the appended file will appear as part of the directory on DagsHub.
!!! important
    For uploading to private repositories, you must use the `--branch BRANCH_NAME` option.
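If you prefer to stay in Python, the same append can be done with the `Repo.upload()` API shown earlier; the names below are placeholders:

```python
from dagshub.upload import Repo

repo = Repo("<repo_owner>", "<repo_name>")  # Optional: username, password, token, branch

# "<path_in_remote>" should point inside an existing DVC-tracked folder,
# so the new file is appended to that directory.
repo.upload(
    file="<local_file_path>",
    path="<path_in_remote>",
    versioning="dvc",
    commit_message="Append a new file to the dataset",
)
```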
Sometimes you need to set up a repository to track a new dataset. This might be useful to share it for training and experimentation across your team, to start labeling raw data, or just to centrally manage something that exists on a single server.
With DDA, we provide a simple command to do this, which can be accomplished fully in Python, without needing to do anything (except signing up) on DagsHub.
This is the easiest way to create a versioned dataset repo and upload it directly from your IDE or Notebook.
To do this, we can use the `create_dataset()` function. Here is a basic usage example:
```python
from dagshub.upload import create_dataset

repo = create_dataset("<dataset_name>", "<path/to/data/directory>")  # Optional: glob_exclude, org_name, private
```
The following table shows which use cases are supported for each component of Direct Data Access and recommendations for when to use each data streaming implementation. We’ll update it as we add support for additional use cases. Please let us know on our Discord community if you have any requests.
|  | Python Hooks | Mounted Filesystem |
|---|---|---|
| Stable | V | X |
| Tensorflow | X | V |
| DVC Repro | X | V |
| DVC Repro (with `--downstream`)* | V | V |
| Python Support | V | V |
| Non-Python Support | X | V |
| No additional Installations | V | X |
| Files are visible in the file explorer | X | V |
| Support for Windows & Mac | V | X |
| Support for Linux | V | V |
| C (low-level) Open | X | V |

\* Using the `--downstream` flag of the `dvc repro` command.
Contributions are welcome! Direct Data Access is part of the open-source DagsHub client, and contributions of dedicated support for auto-logging, data streaming, and more will be greatly appreciated. You can start by creating an issue, or asking for guidance on our Discord community.