---
title: Version Your Datasets & Metadata with DagsHub
description: Learn how to version your datasets, metadata, and annotations with DagsHub.
---
Most real-world datasets are add-only: data is added over time, and enrichments like metadata, annotations, and predictions change, but the actual file contents remain the same. The best way to manage and version those changes is with DagsHub Data Engine.
To do this, we need to upload our data to DagsHub, and create a datasource from it in Data Engine. Then, using Data Engine's version query syntax, we can easily revert to the dataset state at any point in time.
DagsHub supports a few different ways to upload or connect data to the platform. Choose the approach that fits you best:
For the purpose of this guide, we'll go with the first option: DagsHub's onboard storage bucket (think of it as Google Drive for machine learning).
Let's assume our dataset currently has 50 files, in a local folder called `data/`.

To upload it, make sure you have the latest version of the DagsHub client installed (`pip install -U dagshub`), then run the following snippet in Python:
```python
from dagshub import get_repo_bucket_client

# Get an s3fs-flavored client for the repo's storage bucket
client = get_repo_bucket_client("<repo_owner>/<repo_name>", flavor="s3fs")

# Upload the folder contents
client.put("data/", "<repo_name>/data", recursive=True)
```
This is how the bucket storage looks now:

Now that our data is in DagsHub, we can create a datasource from it.
To create a datasource in the UI:
To create a datasource in Python:
```python
from dagshub.data_engine import datasources

datasources.create_datasource(
    repo="<repo_owner>/<repo_name>",
    name="my-first-datasource",
    path="s3://<repo_name>/data",
)
```
You can add any custom enrichments to your datasource, including metadata, annotations, and predictions. Each change is logged, and you'll be able to revert to an older version at any time.
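As a sketch of what such an enrichment could look like, the snippet below builds a metadata payload for two datapoints. The repo, datasource, file paths, and field names are all illustrative placeholders, and the `metadata_context` call is shown commented out since it requires a live DagsHub repo:

```python
# Hypothetical enrichment payload: one dict of metadata fields per
# datapoint path. The "split" and "label" fields are placeholders.
enrichments = {
    "data/img_001.jpg": {"split": "train", "label": "cat"},
    "data/img_002.jpg": {"split": "valid", "label": "dog"},
}

# Attaching it to a datasource would look roughly like this
# (commented out -- it needs a live DagsHub repo to run):
#
# from dagshub.data_engine import datasources
# ds = datasources.get("<repo_owner>/<repo_name>", "my-first-datasource")
# with ds.metadata_context() as ctx:
#     for path, fields in enrichments.items():
#         ctx.update_metadata(path, fields)
```

Every such metadata update is recorded in the datasource's change log, which is what makes the time-based queries below possible.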
Now that we have V1 of our dataset, let's add 50 more datapoints to create V2.
To do that, put your new data files in your local data folder, then run the following code:
```python
client.put("data/", "<repo_name>/data", recursive=True)
```
!!! note
    If you connected your external storage, simply add the new files there. If you used DVC to upload versioned data files, repeat that step with the new files.
Now, if we run `len(client.listdir("<repo_name>/data"))`, we'll see that we get 100 files!
To update our datasource, we can either click the rescan button in the UI:
Or use the Python command:
```python
ds = datasources.get("<repo_owner>/<repo_name>", "my-first-datasource")
ds.scan_source()
```
If we visualize the datasource in the UI, we'll see the new data files.
So where's the new version?
DagsHub Data Engine keeps a log of all states and changes to your datasource metadata, and provides a simple way to return to old versions. It has many more capabilities, which you can read about in the full versioning query syntax documentation.

The simplest option is the global `as_of(t)` query. Let's see how we can easily get two versions of our datasource:
```python
from datetime import datetime, timedelta

t_now = datetime.now()

v2 = ds.as_of(t_now)
v2.all()  # Output: QueryResult of datasource my-first-datasource with 100 datapoint(s)
```
Since I created the new version 10 minutes ago, going back in time 15 minutes returns my old version:
```python
t_prev = datetime.now() - timedelta(minutes=15)

v1 = ds.as_of(t_prev)
v1.all()  # Output: QueryResult of datasource my-first-datasource with 50 datapoint(s)
```
In practice, you can record your experiment's start time, then use it to retrieve the exact dataset state at that moment.
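A minimal sketch of that pattern is below. The `as_of` call is commented out since it needs a live datasource; the ISO-string round-trip is just one possible way to persist the timestamp, e.g. logged as an experiment parameter:

```python
from datetime import datetime

# Record the moment the experiment starts.
experiment_start = datetime.now()

# ... training runs; meanwhile new data may be added to the datasource ...

# Persist the timestamp with the experiment, e.g. as an ISO string
# saved alongside your other run parameters.
stamp = experiment_start.isoformat()

# Later, restore it and pin the dataset to exactly what the experiment saw:
restored = datetime.fromisoformat(stamp)
# snapshot = ds.as_of(restored)   # ds from datasources.get(...)
# snapshot.all()
```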
This versioning works for any metadata change. For example, if you update your annotations, you can revert to an older version of them at any time. Data Engine also offers an easy way to compare different versions of the same metadata column, using `select()` combined with `as_of()`. Read more about that here.
Now that you've started managing your dataset versions, learn how to enrich them with your custom metadata, visualize them, annotate them, or convert them to dataloaders for training.
You can also continue to learn how to track your experiments on DagsHub.