
Data Science Pull Requests

You discovered the experiments you wanted to work on, understood what you want to do to improve them, reproduced the results and made some modifications that improve some key metrics. Now you want to contribute that work back to the project.

TL;DR

Data Science Pull Requests are a standard way for data scientists to review data science work and to accept code, data and experiment contributions to a project. If you have a DAGsHub project, you will see the data science pull request view whenever you create a pull request in that project.

Overview

Contributing – The final step of the collaborative process is arguably the most important one. Without it, the workflow is a one-sided monologue, which means collaboration isn't really happening. In practice, contributing breaks down into two tasks: reviewing contributions and merging them.

Contributing data science work is hard – The goal of data science pull requests is to lower that friction and make contributing as easy as possible, for two reasons:

  1. If we have a standard way to discuss and review data science, we don't need to invent custom workflows from scratch for every project, and teams can move faster.
  2. A standard process for reviewing and contributing work promotes Open Source Data Science, which doesn't really exist today.

To see the Data Science Pull Requests in a project or create a new one, just click the pull request tab (1) on the project's homepage. Once you're in the pull requests tab, creating a new pull request is as simple as clicking the New Pull Request (2) button and choosing which branch you want to start the pull request from.

Screenshot Pull Request Tab

Data Science Review

Screenshot Data Science Pull Request Tabs – Ready for review

Before accepting a pull request, you'll probably want to review the work done by the contributor or team member. DAGsHub provides a standardized, automatic way to do that through a few features dedicated to the data science workflow.

Experiment Review

Before diving into the details of the files and structural changes that are up for review, we usually want to take a look at the experiment parameters and results in a data science pull request.

The change might be an updated metric, an ablation test to understand the effect of a parameter on performance, or a "simple" parameter tuning. In any case, we want to understand why a pull request is interesting and what the proposed data science (as opposed to just code) changes are.

With experiment review you can see all the familiar views in the DAGsHub Experiments tab, with a few important changes.

First, the base experiment (the one being compared against) is marked in blue for convenience. All the experiments proposed as part of the pull request appear below it for ease of comparison.

Screenshot Pull request experiment tab – Note the blue row which is the base experiment

Second, when you open a single experiment view or the experiment comparison view, you will see a comment box that lets you add comments in context. Once created, a comment is added to the Conversation tab with a link back to the same view, making it easy for team members to understand what is being discussed.

Screenshot Pull request experiment tab – Note the blue row which is the base experiment
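The comparison between the base experiment and a proposed one can be sketched in a few lines. This is only an illustration of what a reviewer looks for (which parameters changed, and how each shared metric moved), not DAGsHub's implementation. The `Experiment` class, the `compare_to_base` helper and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical stand-in for one row of an experiment table."""
    name: str
    params: dict
    metrics: dict


def compare_to_base(base, candidate):
    """Return (changed parameters, metric deltas) of candidate relative to base."""
    # Parameters whose value differs, as (base value, candidate value) pairs.
    changed_params = {
        key: (base.params.get(key), candidate.params.get(key))
        for key in sorted(set(base.params) | set(candidate.params))
        if base.params.get(key) != candidate.params.get(key)
    }
    # Deltas only for metrics that both experiments report.
    metric_deltas = {
        key: candidate.metrics[key] - base.metrics[key]
        for key in sorted(set(base.metrics) & set(candidate.metrics))
    }
    return changed_params, metric_deltas


# Made-up values, for illustration only.
base = Experiment("base", {"lr": 0.01, "epochs": 10}, {"accuracy": 0.5})
proposed = Experiment("proposed", {"lr": 0.001, "epochs": 10}, {"accuracy": 0.75})

changed, deltas = compare_to_base(base, proposed)
print(changed)
print(deltas)
```

Here only `lr` shows up as a changed parameter, which is the kind of signal that tells a reviewer the pull request is a parameter tuning rather than a structural change.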

Code Review

Code is still an important part of data science. In the Files Changed tab of the pull request, you can see which code files changed in your project and what was added or removed, presented in a way that focuses you on what's important.

However, we know that some things matter especially to data scientists – like working with notebooks...

Notebook Review

...So we also added notebook diffing as part of the review process. See what changed in your notebooks in an easy-to-understand way.

Screenshot Notebook diffing in data science pull requests
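To see why notebooks need dedicated diffing, recall that a notebook file is JSON that mixes cell sources with outputs and metadata, so a raw text diff is noisy. The sketch below is not DAGsHub's notebook diff. It is a minimal, assumed approach that extracts only the cell sources and runs a plain line diff over them with Python's standard `difflib`. The helper names and the sample notebooks are hypothetical.

```python
import difflib
import json


def cell_sources(notebook_json):
    """Return the source text of every cell in a notebook given as a JSON string."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # The notebook format allows source to be a string or a list of strings.
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


def diff_notebooks(old_json, new_json):
    """Return unified-diff lines over cell sources, ignoring outputs and metadata."""
    old_lines = "\n".join(cell_sources(old_json)).splitlines()
    new_lines = "\n".join(cell_sources(new_json)).splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="old.ipynb", tofile="new.ipynb", lineterm=""
        )
    )


# Two tiny, made-up notebooks that differ in one line of one code cell.
old_nb = json.dumps(
    {"cells": [{"cell_type": "code", "source": ["lr = 0.01\n", "epochs = 10"], "outputs": []}]}
)
new_nb = json.dumps(
    {"cells": [{"cell_type": "code", "source": ["lr = 0.001\n", "epochs = 10"], "outputs": []}]}
)

diff = diff_notebooks(old_nb, new_nb)
print("\n".join(diff))
```

Because outputs are skipped, re-running a notebook without editing its cells produces an empty diff in this sketch.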

Data & Model Review

Reviewing code changes is necessary, but not sufficient, to understand a data science project. We also need to see data and model changes to get the entire picture. In the Files Changed tab, in addition to code changes (tracked in Git), you will see changes to data, models and any other artifact tracked by DVC. This means that all changes can be viewed in one place.

Screenshot Data and model files changed
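DVC records a content hash for each tracked file in small metadata files that Git versions, so a change to data or a model shows up as a changed hash. The sketch below illustrates that idea only. It assumes the hashes have already been read into plain dictionaries mapping path to hash, and it is not how DAGsHub computes the Files Changed view. The function name, paths and hash values are hypothetical.

```python
def diff_tracked_files(base, head):
    """Classify paths as added, removed or modified by comparing content hashes.

    Both arguments map a tracked path to its content hash.
    """
    added = sorted(path for path in head if path not in base)
    removed = sorted(path for path in base if path not in head)
    modified = sorted(
        path for path in head if path in base and head[path] != base[path]
    )
    return {"added": added, "removed": removed, "modified": modified}


# Made-up paths and hashes for the base branch and the pull request branch.
base_files = {"data/train.csv": "a1", "models/model.pkl": "b2", "data/old.csv": "c3"}
head_files = {"data/train.csv": "a1", "models/model.pkl": "b9", "data/extra.csv": "d4"}

changes = diff_tracked_files(base_files, head_files)
print(changes)
```

Unchanged files such as `data/train.csv` never appear in the result, so the reviewer only sees artifacts whose content actually differs between the two branches.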

Data Merging

After reviewing the incoming pull request, you've decided to accept it. Congratulations! Now, you want to incorporate all components into your original project: code, data, models and experiments. Ideally, you want it to happen as automatically as possible, like in a normal pull request.

DAGsHub lets you do exactly that with data merging. Setting it up is straightforward and is covered in the data merging doc page. Once you've completed the setup successfully, the Conversation tab will show a data merging section.

Screenshot Data merging section

This section shows you how much data will be copied into your remote upon accepting this pull request. Once you click the Merge Pull Request button, the copy is performed automatically. This means that everything in your project, both Git-tracked and DVC-tracked files, will be merged.
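One way to reason about "how much data will be copied" is to treat both remotes as content-addressed stores: only objects whose hash is missing from the target remote need to be copied. The sketch below works under that assumption and does not describe DAGsHub's actual merge logic. The `plan_copy` helper, the hashes and the sizes are all hypothetical.

```python
def plan_copy(source_objects, target_hashes):
    """Return (hashes missing from the target, total bytes to copy).

    source_objects maps a content hash to its size in bytes.
    target_hashes is the set of hashes already present in the target remote.
    """
    missing = {
        content_hash: size
        for content_hash, size in source_objects.items()
        if content_hash not in target_hashes
    }
    return sorted(missing), sum(missing.values())


# Made-up objects in the contributor's remote, and hashes the target already has.
contributor_objects = {"aaa": 100, "bbb": 250, "ccc": 50}
target_remote = {"aaa"}

to_copy, total_bytes = plan_copy(contributor_objects, target_remote)
print(to_copy, total_bytes)
```

Under this assumption, a pull request that only touches code copies nothing, because every data object it references already exists in the target remote.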

That's it! You've successfully made (or received) a data science contribution.

Next Steps

Congratulations! You've successfully completed one data science workflow cycle. Perhaps it's time to discover another experiment or explore a new project to work on?