
Version Datasets (with Data Engine)

Most real-world datasets are add-only: data is added to the dataset over time, and enrichments like metadata, annotations, and predictions change, but the actual file contents remain the same. The best way to manage and version those changes is with DagsHub Data Engine.

To do this, we first upload our data to DagsHub and create a datasource from it in Data Engine. Then, using Data Engine's versioning query syntax, we can easily revert to the dataset's state at any point in time.

Uploading The First Dataset Version to DagsHub

DagsHub supports a few different ways to upload or connect data to our platform. Choose the best approach for you:

  1. When your data files are local or if you'd like to host your data on DagsHub directly – Upload your data files to DagsHub Storage.
  2. When your data files are already in cloud storage – DagsHub supports a variety of external storage integrations. If your data is already in one of these storage buckets, you can simply connect it to DagsHub.
  3. When file contents also change – If your data files also change, read the guide on versioning data files. Then come back to this guide to learn how to version metadata on top of your uploaded DVC versions.

For the purpose of this guide, we'll go with the first option, using DagsHub's onboard storage bucket (think of it as Google Drive for machine learning).

Let's assume our dataset has 50 files for now, in a local folder called data/.

To upload it, make sure you have the latest version of the DagsHub client installed (with pip install -U dagshub). Then run the following snippet in Python:

from dagshub import get_repo_bucket_client

client = get_repo_bucket_client("<repo_owner>/<repo_name>", flavor="s3fs")

# Upload folder contents
client.put("data/", "<repo_name>/data", recursive=True)

This is what the bucket storage looks like now: Bucket storage after first upload

Creating a datasource to generate the first version of our dataset

Now that our data is in DagsHub, we can create a datasource from it.

To create a datasource in the UI:

  1. Click on the "Datasets" tab
  2. Select "Add new source" and "Choose from existing data..."
  3. Select the data folder inside the connected bucket
  4. Give it a name you like.
  5. DagsHub scans your folder and creates a table representing your dataset.

Datasource Creation

To create a datasource in Python:

from dagshub.data_engine import datasources

datasources.create_datasource(
    repo="<repo_owner>/<repo_name>",
    name="my-first-datasource",
    path="s3://<repo_name>/data",
)

You can add any custom enrichments to your datasource, including metadata, annotations, and predictions. Each change is logged, and you'll be able to revert to an older version at any time.
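As a sketch of what an enrichment might look like, the snippet below attaches a hypothetical "label" metadata field to a few datapoints using Data Engine's metadata context. The field name, file paths, and label values here are illustrative, not part of this guide's dataset:

```python
def add_labels(ds, labels):
    """Attach a 'label' metadata field to datapoints.

    `ds` is a Data Engine datasource; `labels` maps datapoint paths
    (relative to the datasource root) to label strings. Both the field
    name and the example values are assumptions for illustration.
    """
    with ds.metadata_context() as ctx:
        for path, label in labels.items():
            ctx.update_metadata(path, {"label": label})


# Example usage (repo and file names are hypothetical):
# ds = datasources.get("<repo_owner>/<repo_name>", "my-first-datasource")
# add_labels(ds, {"cat_001.jpg": "cat", "dog_001.jpg": "dog"})
```

Each call like this is recorded, so the enriched state becomes part of the datasource's version history.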

Adding more data to the dataset

Now that we have V1 of our dataset, let's add 50 more datapoints to create V2.

To do that, put your new data files in your local data folder, then run the following code:

client.put("data/", "<repo_name>/data", recursive=True)

Note

If you connected your external storage, simply add the new files there, or if you used DVC to upload versioned data files, repeat that step with the new data files.

Now, if we run len(client.listdir("<repo_name>/data")) we will see we get 100 files!

Updating the datasource and creating a new version

To update our datasource, we can either click the rescan button in the UI:

Datasource Rescan

Or use the Python command:

ds = datasources.get("<repo_owner>/<repo_name>", "my-first-datasource")
ds.scan_source()

If we visualize the datasource in the UI, we'll see the new data files.

So where's the new version?

Returning to old versions and states of our datasource

DagsHub Data Engine keeps a log of all states and changes to your datasource metadata, and provides a simple way to return to old versions. It has many more capabilities, which you can read about in the full versioning query syntax documentation.

The simplest syntax is the global as_of(t) query. Let's see how we can get 2 versions of our datasource easily:

from datetime import datetime, timedelta

t_now = datetime.now()
v2 = ds.as_of(t_now)
v2.all()  # Output: QueryResult of datasource my-first-datasource with 100 datapoint(s)

Since I created the new version 10 minutes ago, going back 15 minutes returns the old version:

t_prev = datetime.now() - timedelta(minutes=15)
v1 = ds.as_of(t_prev)
v1.all() # Output: QueryResult of datasource my-first-datasource with 50 datapoint(s)

In practice, you can record your experiment's start time, then use it to retrieve the dataset state at that exact moment.
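One simple pattern for this, sketched here with only the standard library (the file name is an assumption), is to record the experiment's start time alongside your run artifacts and pass it to as_of() later:

```python
import json
from datetime import datetime, timezone


def save_start_time(path="run_info.json"):
    """Record the current UTC time so the dataset state can be recovered later."""
    t = datetime.now(timezone.utc)
    with open(path, "w") as f:
        json.dump({"started_at": t.isoformat()}, f)
    return t


def load_start_time(path="run_info.json"):
    """Load the recorded start time, e.g. to pass to ds.as_of(...)."""
    with open(path) as f:
        return datetime.fromisoformat(json.load(f)["started_at"])


# Later, to reproduce the experiment's dataset state:
# v_then = ds.as_of(load_start_time())
```

Storing the timestamp in UTC avoids ambiguity if the experiment and the later query run in different timezones.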

This versioning applies to any metadata change; for example, if you update your annotations, you can revert to an older version of them at any time. Data Engine also offers an easy way to compare different versions of the same metadata column by combining select() with as_of(). Read more about that here.
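As a sketch of that comparison, the snippet below fetches an older and a current version of the same column side by side. The Field import path and the "annotation" column name are assumptions based on the Data Engine docs, so check them against the current API:

```python
def compare_annotation_versions(ds, t_old, column="annotation"):
    """Fetch two versions of the same metadata column side by side.

    Assumes `Field` is importable from the Data Engine model module;
    the default column name "annotation" is illustrative.
    """
    from dagshub.data_engine.model.datasource import Field

    return ds.select(
        Field(column, as_of=t_old, alias=f"{column}_old"),  # older version
        Field(column),                                      # current version
    ).all()


# Example usage (t_prev as computed above):
# result = compare_annotation_versions(ds, t_prev)
```

Aliasing the older column lets both versions appear in one query result, so you can diff them row by row.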

Next Steps

Now that you've started managing your dataset versions, learn how to enrich them with your custom metadata, visualize them, annotate them or convert them to dataloaders for training.

You can also continue to learn how to track your experiments on DagsHub.