Version Datasets, Data Files, and Code

Versioning your datasets and code is a critical part of data science projects: it makes ML experiments reproducible, provides traceability, and makes collaboration among team members easier.

Let's look at when and how to version your datasets with DagsHub Data Engine, your data files with DVC, and your code with Git, and how to manage and host all of these components on DagsHub.

When does versioning make sense?

Versioning makes sense in cases where your datasets might change, and you want to keep track of those changes.

This might be done for various reasons, for example:

  • Reverting in case of some error or other unexpected event
  • Reproducing previous experiment results to verify them or continue working on previous research directions
  • Meeting regulatory requirements in certain medical, automotive, and other use cases, where an external auditor may require you to share your source data as it was when a certain model was trained
  • Debugging models that are misbehaving

In most ML projects, it is recommended to version your data.

Do I need to version datasets, data files or both?

There are three main data change scenarios relevant to versioning:

  1. Data never changes (very rare in production ML projects) – In this case, versioning the data may be unnecessary.
  2. Data is add-only and metadata (for example, annotations) changes, but the data files themselves don't change (this is the most common use case) – In this case, dataset versioning is required.
  3. Data files change (imagine an image whose pixels change between versions) – In this case, both dataset versioning and data file versioning are required.

In all cases, versioning code is required.
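
The scenarios above can be sketched as a small decision helper. This is only an illustration, not part of DagsHub, DVC, or Git: the names `Scenario`, `classify`, and `required_versioning` are hypothetical, and a snapshot is assumed to be a plain mapping from file path to content hash. Treating a removed file as "data files change" is also an assumption made here, since a removal is not add-only.

```python
from __future__ import annotations

from enum import Enum


class Scenario(Enum):
    STATIC = "data never changes"
    ADD_ONLY = "data is add-only, metadata may change"
    FILES_CHANGE = "data files change"


def classify(old: dict[str, str], new: dict[str, str],
             metadata_changed: bool = False) -> Scenario:
    """Compare two snapshots (hypothetical format: file path -> content hash)."""
    # An existing file whose hash differs, or that disappeared, means the
    # data files themselves changed (removal handling is an assumption).
    for path, digest in old.items():
        if new.get(path) != digest:
            return Scenario.FILES_CHANGE
    # All old files are intact: new files or new metadata means add-only.
    if len(new) > len(old) or metadata_changed:
        return Scenario.ADD_ONLY
    return Scenario.STATIC


def required_versioning(scenario: Scenario) -> set[str]:
    """Map a scenario to what should be versioned, following the list above."""
    needs = {"code"}  # code is versioned in every scenario
    if scenario in (Scenario.ADD_ONLY, Scenario.FILES_CHANGE):
        needs.add("dataset")
    if scenario is Scenario.FILES_CHANGE:
        needs.add("data files")
    return needs


if __name__ == "__main__":
    before = {"img/001.png": "aaa"}
    after = {"img/001.png": "aaa", "img/002.png": "bbb"}
    scenario = classify(before, after)
    print(scenario.value, "->", sorted(required_versioning(scenario)))
```

In practice, the content hashes would come from whatever tool tracks your files; the sketch only shows how the three scenarios differ.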

Select what type of versioning you'd like to dive into: