
DVC

DVC is an open-source version control tool for machine learning projects, designed to handle large files, datasets, machine learning models, and metrics. It works on top of Git, so it integrates easily with your existing Git repositories. The DagsHub integration with DVC includes a fully configured remote object storage managed by DVC, viewing and diffing of DVC-tracked files hosted on DagsHub Storage or S3-compatible storage, and data pipeline visualization.

How does the integration of DagsHub with DVC work?

DagsHub Storage

DagsHub automatically configures a remote object storage for every repository, with 100 GB of free space. The storage can be managed by DVC and easily configured on any machine. Using the DVC pointer files (.dvc) and the dvc.lock file hosted in the Git commit, DagsHub parses the storage and displays the DVC-tracked files under the Files tab.
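To make the role of the pointer files concrete, here is a minimal sketch of how a file typically ends up with a .dvc pointer committed to Git. The function name `track_with_dvc` and the example path are hypothetical. The sketch assumes the directory is already initialized with Git and DVC.

```shell
# Sketch: put one file under DVC control and commit its pointer file to Git.
# track_with_dvc is a hypothetical helper, not part of DVC or DagsHub.
track_with_dvc() {
    target="$1"
    # "dvc add" writes <target>.dvc next to the file and lists the file
    # in a .gitignore in the same directory, so Git only sees the pointer.
    dvc add "$target" &&
    git add "$target.dvc" "$(dirname "$target")/.gitignore" &&
    git commit -m "Track $target with DVC"
}
```

For example, `track_with_dvc data/raw.csv` would commit data/raw.csv.dvc, and that committed pointer is what DagsHub reads to list the file.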

Learn how to use DagsHub Storage

External Storage Buckets

DagsHub supports visualizing and managing DVC data stored on AWS S3, Google Cloud Storage, Azure Blob Storage, or any S3-compatible storage, including MinIO.

Learn how to configure your external bucket.

Visualize DVC pipelines

DagsHub parses the dvc.lock and dvc.yaml files to create an interactive data pipeline graph. The pipeline is versioned and holds valuable information about the different files, metrics, and data steps.
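As a rough illustration of the kind of file DagsHub reads, the snippet below writes a hypothetical two-stage dvc.yaml. The stage names, scripts, and paths are placeholders invented for this sketch. The dvc.lock file is not written by hand; DVC generates it when the pipeline is run.

```shell
# Write a hypothetical two-stage pipeline definition.
# All stage names, scripts, and data paths are placeholders.
cat > dvc.yaml <<'EOF'
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/prepared.csv
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
EOF
```

Each stage declares its command, its dependencies, and its outputs. Those declarations describe how the stages connect to one another.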

How to use DVC with DagsHub?

DagsHub Storage

Configure DagsHub Storage on your machine

  1. Go to your repository homepage.
  2. Click the remote button and select the Data tab.
  3. Select DVC.
  4. Copy the commands that set up DagsHub Storage on your local machine.

DVC remote

  1. Open a terminal in your project directory, then paste and run the commands:

    dvc remote add origin s3://dvc
    dvc remote modify origin endpointurl https://dagshub.com/<DagsHub-user-name>/<repo_name>.s3
    dvc remote modify origin --local access_key_id <Token>
    dvc remote modify origin --local secret_access_key <Token>
    
    Why --local?

    Everything you configure without --local ends up in the .dvc/config file, which is tracked by Git and will appear in your repository. Personal information such as authentication details should always be kept local.

That's it! You can now pull data from your remote storage.

Note: For this process to succeed, you need to be inside a directory that has been initialized with both Git and DVC. To learn how to do that, please follow the first part of the Get Started section.
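One way to double-check the --local advice is to scan the Git-tracked config for credential keys before committing. This is only a sketch: the helper name `tracked_config_is_clean` is hypothetical, and it assumes the credentials would appear under the option names used in the commands above (access_key_id, secret_access_key). Values set with --local are stored in .dvc/config.local, which DVC keeps out of Git.

```shell
# Sketch: succeed (exit 0) when the Git-tracked DVC config holds no credentials.
# tracked_config_is_clean is a hypothetical helper, not a DVC command.
tracked_config_is_clean() {
    config_file="${1:-.dvc/config}"
    # Nothing to leak if the file does not exist yet.
    [ -f "$config_file" ] || return 0
    if grep -Eq 'access_key_id|secret_access_key' "$config_file"; then
        echo "credentials found in $config_file; set them with --local instead" >&2
        return 1
    fi
    return 0
}
```

Running `tracked_config_is_clean` from the repository root before `git commit` gives a quick signal that a token was not written to the shared config by mistake.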

Pulling data

    dvc pull -r origin

Pushing data

    dvc push -r origin

Visualize DVC pipelines

  1. Run a DVC pipeline.
  2. Version the dvc.lock and dvc.yaml files using Git.
  3. Use Git to version the files not tracked by DVC.
  4. Push the Git-tracked and DVC-tracked files to DagsHub.

Note: You can follow the Pipeline tutorial to learn how to build a DVC pipeline.
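The four steps above can be sketched as a single shell function. The name `publish_pipeline` is hypothetical. The sketch assumes a dvc.yaml already exists, the DVC remote is named origin as configured earlier, and the current Git branch already has an upstream on DagsHub.

```shell
# Sketch: run the pipeline, commit its definition, and push code and data.
# publish_pipeline is a hypothetical helper.
# Usage: publish_pipeline "<commit message>" [other files to commit...]
publish_pipeline() {
    message="${1:?usage: publish_pipeline <commit message> [paths...]}"
    shift
    # Step 1: run the pipeline; this creates or updates dvc.lock.
    dvc repro &&
    # Steps 2 and 3: stage the pipeline files plus any other Git-tracked files.
    git add dvc.yaml dvc.lock "$@" &&
    git commit -m "$message" &&
    # Step 4: push the Git history, then the DVC-tracked data.
    git push &&
    dvc push -r origin
}
```

For example, `publish_pipeline "Retrain model" train.py` commits the pipeline files together with train.py. The commands are chained with `&&`, so a failure at any step stops the rest.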