Data Pull Request

What is a Data Pull Request?

A Data Pull Request is a way to submit contributions to a data science project. It is the natural extension of a regular Pull Request for projects that use a data versioning system alongside their Git repository.

File changes

When you open a pull request in your project, DAGsHub detects the changes in the cached files of your DVC project and displays the list of changed files in the Files changed tab of the pull request.

Data diff

If you provide a storage access key with read permissions, DAGsHub retrieves additional information about the changed files:

  1. File size
  2. Directory content - When you define a directory as a DVC cached output or dependency, DAGsHub lists the changed files inside the directory.
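
The directory listing described above can be pictured as a comparison of two manifests. The following is only a minimal sketch, not DAGsHub's implementation: it assumes each version of a directory is described by a list of entries with `relpath` and `md5` fields (the shape DVC uses for its `.dir` manifests), and the function name `diff_directory` and the sample manifests are hypothetical.

```python
from typing import Dict, List


def diff_directory(base: List[dict], head: List[dict]) -> Dict[str, List[str]]:
    """Classify files as added, removed, or modified between two manifests."""
    # Map each relative path to its content hash for quick lookups.
    base_hashes = {entry["relpath"]: entry["md5"] for entry in base}
    head_hashes = {entry["relpath"]: entry["md5"] for entry in head}

    added = sorted(set(head_hashes) - set(base_hashes))
    removed = sorted(set(base_hashes) - set(head_hashes))
    # A file is modified when it exists on both sides with different hashes.
    modified = sorted(
        path
        for path in set(base_hashes) & set(head_hashes)
        if base_hashes[path] != head_hashes[path]
    )
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical manifests for the base and head versions of a data directory.
base_manifest = [
    {"relpath": "train.csv", "md5": "aaa"},
    {"relpath": "test.csv", "md5": "bbb"},
]
head_manifest = [
    {"relpath": "train.csv", "md5": "ccc"},
    {"relpath": "val.csv", "md5": "ddd"},
]

print(diff_directory(base_manifest, head_manifest))
# {'added': ['val.csv'], 'removed': ['test.csv'], 'modified': ['train.csv']}
```

Comparing hashes rather than file contents means a listing like this needs only the manifests, which is one reason read access to the storage is enough for the diff.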

To get the full benefit of this feature, both the base repository and the head repository of the Pull Request need to have a Storage Access Key configured.

Data merge

DAGsHub can copy new cache data from one remote to another!

Previously, collaborating on a DVC project required every potential contributor to have write permission on the same remote cache. This is no longer the case. Anyone can open a Pull Request from a forked repository, and you can safely decide whether to merge the changes in the code as well as the changes in the data.

For data merge to work, both of these conditions must be met:

  1. The base repository has a Storage Access Key configured with write permissions
  2. The head repository has a Storage Access Key configured with at least read permissions
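
The two conditions and the copy step can be sketched as follows. This is an illustrative model only, not how DAGsHub implements data merge: remotes are represented as in-memory dictionaries keyed by content hash, permissions as sets of strings, and the names `can_merge_data` and `merge_cache` are hypothetical.

```python
from typing import Dict, List, Set


def can_merge_data(base_permissions: Set[str], head_permissions: Set[str]) -> bool:
    """Check both conditions: base key can write, head key can at least read."""
    return "write" in base_permissions and "read" in head_permissions


def merge_cache(head_remote: Dict[str, bytes], base_remote: Dict[str, bytes]) -> List[str]:
    """Copy objects found in the head remote but missing from the base remote.

    Returns the sorted list of hashes that were copied.
    """
    missing = sorted(set(head_remote) - set(base_remote))
    for digest in missing:
        base_remote[digest] = head_remote[digest]
    return missing


# Hypothetical remotes: the fork (head) has one object the base lacks.
head = {"aaa": b"old data", "ccc": b"new data"}
base = {"aaa": b"old data"}

if can_merge_data({"read", "write"}, {"read"}):
    print(merge_cache(head, base))  # ['ccc']
```

Only objects missing from the base remote are copied, so contributors never need write access to the base storage themselves.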

Adding a Storage Access Key

For the Data PR to work properly, you need to provide DAGsHub with a storage access key.

  1. Go to your repository's home page
  2. Click Settings
  3. Go to Storage Keys
  4. Click Add Storage Key
  5. Enter your storage URL (see the list of supported storage types)
  6. Fill in the credentials corresponding to the storage type you are using
  7. Click Add Storage Key

Data Pull Request - Supported storage types

The only storage type currently supported is Google Cloud Storage.
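
As a rough illustration, a storage URL could be checked against this restriction before it is submitted. The sketch below assumes Google Cloud Storage locations are written with the `gs://` scheme, as they are for DVC remotes; the exact URL format DAGsHub accepts is not stated here, and `is_supported_storage_url` and the bucket names are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: only the Google Cloud Storage scheme is accepted.
SUPPORTED_SCHEMES = {"gs"}


def is_supported_storage_url(url: str) -> bool:
    """Return True if the URL uses a supported scheme and names a bucket."""
    parsed = urlparse(url)
    return parsed.scheme in SUPPORTED_SCHEMES and bool(parsed.netloc)


print(is_supported_storage_url("gs://my-bucket/dvc-cache"))  # True
print(is_supported_storage_url("s3://my-bucket/dvc-cache"))  # False
```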

We are working on adding support for additional storage types. If you have a specific request, feel free to send us feedback or join the community and ask directly.