Version Data Files¶
When your data files change, for example if you overwrite a file's contents, you need to version your data files for full reproducibility. This should be done alongside Dataset Versioning.
For data file versioning, DagsHub fully supports vanilla DVC, while also providing DagsHub Client, which leverages DVC under the hood but makes using it more convenient.
DagsHub is the easiest way to work with DVC. If you're already familiar with DVC and want to use it, here is a guide on using DVC directly with DagsHub.
The main advantage of using the DagsHub Client over vanilla DVC is that it uses DagsHub's backend to calculate the DVC hashes, without performing any computation locally or requiring the entire folder's contents to be present on your machine.
Let's see how to version our data directory with the DagsHub Client. With the Client, you can version your data from the CLI or from Python scripts, making it easy to integrate into your project pipeline.
For the purpose of this guide, let's assume your data is in a folder called data/
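If you don't have a dataset handy, you can create a small placeholder folder to follow along; the file name and contents below are arbitrary examples, not anything DagsHub requires.
# Create a tiny example data/ folder to follow along with this guide
# (the file name and contents are arbitrary placeholders).
from pathlib import Path

Path("data").mkdir(exist_ok=True)
Path("data/raw.csv").write_text("id,value\n1,0.5\n2,0.7\n")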
Installing DagsHub Client¶
We will start by installing the DagsHub Client using pip:
pip install dagshub
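Uploading to a repository also requires authentication. If no token is configured, the client will typically walk you through an interactive login; for non-interactive use (for example in CI), the snippet below is a minimal sketch that assumes the client's dagshub.auth.add_app_token helper and a personal access token from your DagsHub settings, so check the client documentation for the exact flow your version supports.
# Optional, non-interactive authentication sketch.
# Assumes the client exposes dagshub.auth.add_app_token(); verify against
# your installed version before relying on it.
import os

from dagshub import auth

# A personal access token generated in your DagsHub user settings,
# read here from an illustrative environment variable.
auth.add_app_token(os.environ["DAGSHUB_TOKEN"])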
Version data from CLI¶
In your project's folder, put your data files in your data/ folder, then run the following command:
dagshub upload --update --versioning dvc --message "<commit_message>" "<repo_owner>/<repo_name>" "local/path/to/data/" "remote/path/to/data/"
Let's explain the flags and arguments:
--update: Tells DagsHub to update existing files with new content if they changed in the new DVC version.
--message "<commit_message>": The commit message for the new version. Since every DVC version also creates a new Git commit, this message will appear in your Git history.
--versioning dvc: Tells the client to version the files with DVC. It also accepts git, or you can leave it out and the client will choose whichever makes sense.
"<repo_owner>/<repo_name>": Your repo identifier. If your repo is https://dagshub.com/my_user/my_repo, then you should use "my_user/my_repo".
"local/path/to/data/": The local path to upload from. If our data is in a folder called data/ in our current working directory, the argument will be data/.
"remote/path/to/data/": The path in the DagsHub repo to put the data in. If we'd like the files to be committed to the same folder remotely, we can use data/.
The output will look as follows:
dagshub upload --update --versioning dvc --message "Adding raw data" "my_user/my_repo" "data/" "data/"
⠙ Uploading files... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
Uploading files (1) to "my_user/my_repo"...
Upload finished successfully!
Directory upload complete, uploaded n files
Version data using Python¶
You can also create new versions using the DagsHub Client in Python. Below is the same example as before, in Pythonic form.
import dagshub

repo = dagshub.upload.Repo("my_user", "my_repo")
repo.upload(local_path="data/", remote_path="data/",
            commit_message="Added Raw Data", versioning="dvc")
By running this, the data directory is uploaded to DagsHub Storage, versioned by DVC, and the new data.dvc pointer file is committed to Git.
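Because versioning is just a Python call, it is easy to fold into a pipeline step. Below is a minimal sketch of such a step, reusing only the Repo.upload() call shown above; the script layout, command-line arguments, and defaults are illustrative rather than part of the DagsHub API.
# version_data.py: illustrative wrapper around the upload call shown above,
# so a pipeline stage can version data/ with a custom commit message.
import argparse

import dagshub


def version_data(repo_owner, repo_name, local_path, remote_path, message):
    """Upload a local folder to a DagsHub repo and version it with DVC."""
    repo = dagshub.upload.Repo(repo_owner, repo_name)
    repo.upload(
        local_path=local_path,
        remote_path=remote_path,
        commit_message=message,
        versioning="dvc",
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Version a data folder on DagsHub")
    parser.add_argument("--message", required=True, help="Commit message for the new version")
    parser.add_argument("--local-path", default="data/")
    parser.add_argument("--remote-path", default="data/")
    args = parser.parse_args()
    version_data("my_user", "my_repo", args.local_path, args.remote_path, args.message)
Re-running this step after the contents of data/ change should create another version with the new commit message, mirroring the CLI flow above.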
Results¶
Now, our DagsHub repository will look like this:
And we'll be able to see our data file in the data/ folder:
DagsHub's Data Catalog
As we can see in the image above, DagsHub displays the content of files (e.g. CSV, YAML, images, etc.) tracked by both Git and DVC. In this case, the CSV file, tracked by DVC, is displayed as a table that you can filter and compare across different commits.
Since the DagsHub Client performs all the hash calculation and DVC file creation on the DagsHub remote, which keeps the process short and simple, the new commit only exists remotely, so we'll want to pull it from DagsHub. We can do that with Git by running the following command:
git pull
Next Steps¶
Now your local workspace mirrors the DagsHub remote. Continue to learn how to version your dataset metadata, or learn how to track ML experiments with DagsHub.