Announcing DagsHub Support for DVC 3.0

DVC Sep 28, 2023

We're happy to share that DagsHub supports DVC 3! Version your data with DVC 3 and host it on DagsHub's free, fully configured remote object storage or on your own S3-compatible storage.

With DagsHub, you don't need to duplicate files versioned with DVC 2.X when you move to DVC 3! We handle it for you automatically, with no data duplication.

[Image: DagsHub integration with DVC]

What major change did DVC 3.0 introduce?

The most significant change DVC 3.0 introduced is how files are hashed and stored in the cache. Previously, to support cross-platform projects, DVC attempted to smooth over line-ending differences between Windows and Unix systems before hashing file content. However, this led to cases where DVC misidentified binary files as text, or normalized a text file that was never intended to be cross-platform.

Now, DVC treats all files as binary, ensuring distinct identification for text files with different line endings.
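
To make the difference concrete, here is a minimal sketch. It is a simplified illustration, not DVC's actual code: the function names are hypothetical, and the "legacy" function assumes normalization simply means replacing CRLF with LF before hashing.

```python
import hashlib


def legacy_text_md5(data: bytes) -> str:
    """Simplified stand-in for pre-3.0 behavior: normalize CRLF to LF, then hash."""
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


def binary_md5(data: bytes) -> str:
    """Behavior described above for DVC 3.0: hash the raw bytes unchanged."""
    return hashlib.md5(data).hexdigest()


# The same CSV content saved with Unix and with Windows line endings.
unix_text = b"id,label\n1,cat\n"
windows_text = b"id,label\r\n1,cat\r\n"

same_under_legacy = legacy_text_md5(unix_text) == legacy_text_md5(windows_text)
same_under_binary = binary_md5(unix_text) == binary_md5(windows_text)
print(same_under_legacy, same_under_binary)  # prints: True False
```

With normalization, both variants collapse to one hash. When every file is treated as binary, they are two distinct objects.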

Note: When upgrading to DVC 3.0, users with cross-platform pipelines should ensure consistent line endings in text outputs. For example, if the same pipeline runs in both Unix and Windows environments and has a stage that produces text output, make sure that output uses the same line endings regardless of where the pipeline ran.
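
One way to follow this note in a Python stage is to fix the line ending explicitly when writing text outputs. The sketch below assumes your stage writes its output from Python; `write_text_output` is a hypothetical helper name, not part of DVC.

```python
def write_text_output(path: str, lines: list[str]) -> None:
    """Write lines with LF endings on every platform."""
    # newline="\n" disables the platform-specific newline translation
    # that would otherwise turn "\n" into "\r\n" on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        for line in lines:
            fh.write(line + "\n")
```

Because the bytes on disk are then identical on Unix and Windows, the stage output hashes to the same value wherever the pipeline runs.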

How does DVC handle files that were versioned with releases earlier than 3.0?

Files versioned with DVC 3.0 are stored separately from those tracked in older releases to avoid hash collisions. DVC can still read cached files from previous versions, so only new or modified data is stored again in the new location.

DVC users can manually migrate existing local DVC cache data to the DVC 3.0 location using the 'dvc cache migrate' command.

How does the new DVC hashing method affect DagsHub users?

DagsHub supports the hashing methods of both DVC 2.X and DVC 3, even when used in the same storage.

With the new hashing method, files are no longer stored at the root of the cache but under files/md5/… instead. To support both layouts, we've implemented a fallback mechanism that checks the other location when a file is not found in the first one.
For example, if a file was versioned with DVC 2.X and someone tries to access it with DVC 3, we first check whether the file exists under files/md5/…, and if it does not, we look for it in the root.

This means that with DagsHub, you don't need to duplicate files versioned with DVC 2.X to DVC 3! We handle it for you automatically with no data duplication.
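
The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not DagsHub's implementation: `resolve_object_key` and `existing_keys` are hypothetical names, the storage is modeled as a plain set of keys, and `rel_key` stands for whatever relative path identifies the object in either layout.

```python
from typing import Optional

# Prefix used by the DVC 3 layout, as described in the text.
DVC3_PREFIX = "files/md5/"


def resolve_object_key(rel_key: str, existing_keys: set[str]) -> Optional[str]:
    """Return the storage key to read from, or None if it is in neither location."""
    dvc3_key = DVC3_PREFIX + rel_key
    if dvc3_key in existing_keys:   # first choice: the DVC 3 location
        return dvc3_key
    if rel_key in existing_keys:    # fallback: the root (DVC 2.X) location
        return rel_key
    return None


# Toy storage holding one object in each layout (made-up keys).
bucket = {"files/md5/new-object", "old-object"}
print(resolve_object_key("new-object", bucket))  # prints: files/md5/new-object
print(resolve_object_key("old-object", bucket))  # prints: old-object
```

Since the object is served from wherever it already lives, nothing has to be copied between the two locations.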

What is the integration between DagsHub and DVC?

DagsHub's integration with DVC includes a fully configured remote object storage that DVC can manage, viewing and diffing of DVC-tracked files hosted on DagsHub Storage or S3-compatible storage, and data pipeline visualization.

This means you can view and diff your data right next to your code files and create a single source of truth for your project.

DagsHub Storage

DagsHub automatically configures a remote object storage bucket with 100 GB of free space for every repository. The storage can be managed by DVC and easily configured on any machine. DagsHub parses the DVC pointer files (.dvc) and the dvc.lock file, and displays the DVC-tracked files under the Files tab.

S3-compatible storage

Just as with DagsHub Storage, you can connect an existing AWS S3, Google Storage, or other S3-compatible bucket to DagsHub and view the DVC-tracked files under the Files tab.

Visualize DVC pipelines

DagsHub parses the dvc.lock and dvc.yaml files to create the interactive data pipeline. The pipeline is versioned and holds valuable information about the different files, metrics, and data steps.
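
As a rough idea of how a pipeline graph can be derived from stage definitions, here is a minimal sketch. It is not DagsHub's implementation. It assumes the `stages` mapping from dvc.yaml has already been loaded into a Python dict and that every `deps` and `outs` entry is a plain path string. The function name and example stages are hypothetical.

```python
def pipeline_edges(stages: dict) -> set[tuple[str, str]]:
    """Return (upstream, downstream) stage pairs linked by a shared file path."""
    # Map every output path to the stage that produces it.
    producers = {}
    for name, stage in stages.items():
        for out in stage.get("outs", []):
            producers[out] = name

    # A stage depends on another stage when one of its deps is that stage's output.
    edges = set()
    for name, stage in stages.items():
        for dep in stage.get("deps", []):
            upstream = producers.get(dep)
            if upstream is not None and upstream != name:
                edges.add((upstream, name))
    return edges


example_stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data/clean.csv"], "outs": []},
}
print(sorted(pipeline_edges(example_stages)))
```

Each edge connects the stage that writes a file to a stage that reads it, which is the information a pipeline view needs in order to draw the graph.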


For further details and a comprehensive list of breaking changes, please refer to the official DVC 3.0 release notes.

Nir Barazida

MLOps Team Lead @ DagsHub
