Data Merging

What is Data Merging?

Data Merging is a method of submitting contributions to a data science project. It is the natural extension of a regular code merge when you use a data versioning system alongside your regular Git repository.

File changes

When you open a data science pull request in your project, DAGsHub detects the changes in the cached files of your DVC project and displays the list of changed files in the Files changed tab of the data science pull request.

Data diff

If you provide a storage access key with read permissions, DAGsHub retrieves additional information about the changed files:

  1. File Size
  2. Directory Content - When you define a directory as a DVC cached output or dependency, DAGsHub will list the changed files inside the directory.
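The directory listing described above can be illustrated with a minimal sketch. It assumes that a tracked directory is represented by a JSON manifest listing each file's `relpath` and `md5`, which is how DVC stores directory entries in its cache as far as we know. The function name `diff_dir_manifests` and the sample hashes are hypothetical, and this is not DAGsHub's actual implementation.

```python
import json


def diff_dir_manifests(base_json: str, head_json: str) -> dict:
    """Classify files as added, removed, or modified between two manifests.

    Each manifest is assumed to be a JSON list of objects with
    "relpath" and "md5" keys (an assumption for this sketch).
    """
    base = {entry["relpath"]: entry["md5"] for entry in json.loads(base_json)}
    head = {entry["relpath"]: entry["md5"] for entry in json.loads(head_json)}
    return {
        "added": sorted(p for p in head if p not in base),
        "removed": sorted(p for p in base if p not in head),
        "modified": sorted(p for p in head if p in base and head[p] != base[p]),
    }


# Hypothetical manifests for the base and head versions of one directory.
base_manifest = json.dumps([
    {"md5": "aaa111", "relpath": "train.csv"},
    {"md5": "bbb222", "relpath": "test.csv"},
])
head_manifest = json.dumps([
    {"md5": "aaa999", "relpath": "train.csv"},
    {"md5": "ccc333", "relpath": "valid.csv"},
])

result = diff_dir_manifests(base_manifest, head_manifest)
print(result)
```

Comparing by relative path and hash means a file whose content changed shows up as modified even when its name is unchanged.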

To get the full benefit of this feature, both the base repository and the head repository of a data science pull request need to have a Storage Access Key configured.

Data merge

DAGsHub can copy new cache data from one remote to another!

Until now, collaborating on a DVC project required every potential contributor to have write permissions to the same remote cache. This is no longer the case. Anybody can open a data science pull request from their forked repository, and you can safely decide whether to merge the changes in the code as well as the changes in the data.
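Conceptually, copying new cache data between remotes amounts to transferring only the objects that the pull request needs and that the base remote does not already hold. The sketch below models each remote as a plain dictionary keyed by content hash. The function `copy_new_objects` and all hashes are hypothetical, and the real service talks to cloud storage rather than to in-memory dictionaries.

```python
def copy_new_objects(head_remote: dict, base_remote: dict, needed_hashes: set) -> list:
    """Copy objects required by the pull request that the base remote lacks.

    Remotes are modeled as dicts mapping a content hash to file bytes
    (an assumption made for illustration only).
    """
    copied = []
    for content_hash in sorted(needed_hashes):
        if content_hash in base_remote:
            continue  # Already present in the base remote, nothing to copy.
        if content_hash not in head_remote:
            raise KeyError(f"object {content_hash} is missing from the head remote")
        base_remote[content_hash] = head_remote[content_hash]
        copied.append(content_hash)
    return copied


# Hypothetical remotes: the fork (head) holds two objects the PR references.
head_remote = {
    "aaa999": b"new train data",
    "ccc333": b"validation data",
    "zzz000": b"unrelated data",
}
base_remote = {
    "aaa111": b"old train data",
    "ccc333": b"validation data",
}

copied = copy_new_objects(head_remote, base_remote, {"aaa999", "ccc333"})
print(copied)
```

Only the objects referenced by the pull request are considered, so unrelated data in the contributor's remote never reaches the base remote.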

To use data merge, both of the following conditions must be met:

  1. The base repository has a Storage Access Key configured with write permissions
  2. The head repository has a Storage Access Key configured with at least read permissions
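The two conditions above can be expressed as a small check. This is only a sketch: the function `can_data_merge` and the representation of permissions as sets of strings are assumptions made for illustration, not part of any DAGsHub API.

```python
def can_data_merge(base_permissions: set, head_permissions: set) -> bool:
    """Return True when both data merge conditions hold.

    base_permissions: permissions of the base repository's storage key.
    head_permissions: permissions of the head repository's storage key.
    """
    base_can_write = "write" in base_permissions
    head_can_read = "read" in head_permissions
    return base_can_write and head_can_read


# Base key can write and head key can read: data merge is possible.
print(can_data_merge({"read", "write"}, {"read"}))

# Base key is read-only: the data diff may still work, but not the merge.
print(can_data_merge({"read"}, {"read", "write"}))
```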

Adding a Storage Access Key

For the Data PR to work properly, you need to provide DAGsHub with a storage access key.

  1. Go to your repository's home page
  2. Click Settings
  3. Go to Storage Keys
  4. Click Add Storage Key
  5. Enter your storage URL (see the list of supported storage types)
  6. Fill in the credentials corresponding to the storage type you are using
  7. Click Add Storage Key

Data Merging - Supported storage types

The storage types currently supported are:

  * Google Cloud Storage

We are working on adding support for additional storage types. If you have a specific request, feel free to send us feedback or join the community and ask directly.