Data Science Pull Requests¶
You discovered the experiments you wanted to work on, understood how you wanted to improve them, reproduced the results, and made modifications that improve some key metrics. Now you want to contribute that work back to the project.
TL;DR¶
Data Science Pull Requests are a standard way for data scientists to review data science work and to accept code, data, and experiment contributions to a project. If you have a DagsHub project, you will see the data science pull request view when you create a pull request in that project.
Overview¶
Contributing, the final step of the collaborative process, is arguably the most important one. Without it, the workflow is a one-sided monologue, which means collaboration isn't really happening. In practice, contributing breaks down into two tasks: reviewing contributions and merging them.
Contributing data science work is hard. The goal of data science pull requests is to lower friction and make contributing as easy as possible, for two reasons:
- If we have a standard way to discuss and review data science, that means we don't need to invent custom workflows from scratch for every project, and teams can move faster.
- A standard process for reviewing and contributing work promotes Open Source Data Science, which is still uncommon today.
To see the Data Science Pull Requests in a project or create a new one, just click the pull request tab (1) on the project's homepage. Once you're in the pull requests tab, creating a new pull request is as simple as clicking the New Pull Request (2) button and choosing which branch you want to start the pull request from.
Pull Request Tab
Data Science Review¶
Data Science Pull Request Tabs – Ready for review
Before accepting a pull request, you'll probably want to review the work done by the contributor or team member. DagsHub offers a standardized, automatic way to do that through a few features dedicated to the data science workflow.
Experiment Review¶
Before diving into the details of the files and structural changes that are up for review, we usually want to take a look at the experiment parameters and results in a data science pull request.
The change might be an updated metric, an ablation test to understand the effect of a parameter on performance, or a "simple" round of parameter tuning. In any case, we want to understand why a pull request is interesting and what the proposed data science (as opposed to just code) changes are.
Experiment review shows all the familiar views from the DagsHub Experiments tab, with two important changes.
First, the base experiment, the one being compared against, is marked in blue for convenience. All the experiments suggested as part of the pull request appear below it for ease of comparison.
Pull request experiment tab – Note the blue row which is the base experiment
Second, when you open a single experiment view or the experiment comparison view, you will see a comment box that lets you add comments in context. After you create a comment, it is added to the Conversation tab with a link back to the same view, making it easy for team members to understand what is being discussed.
Pull request experiment tab – Comment box for the Parallel Coordinate Plot
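To make the idea of comparing proposed experiments against a base experiment concrete, here is a minimal Python sketch. It is not the DagsHub API or how the Experiments tab is implemented; the function name, experiment names, and metric values are all hypothetical and exist only for illustration.

```python
from typing import Dict

Metrics = Dict[str, float]


def metric_deltas(base: Metrics, proposed: Metrics) -> Metrics:
    """Return proposed minus base for every metric present in both experiments."""
    shared = sorted(set(base) & set(proposed))
    return {name: proposed[name] - base[name] for name in shared}


# Hypothetical metric values, for illustration only.
base_experiment = {"accuracy": 0.90, "loss": 0.30}
proposed_experiments = {
    "tune-learning-rate": {"accuracy": 0.92, "loss": 0.25},
    "ablate-dropout": {"accuracy": 0.88, "loss": 0.34},
}

for name, metrics in proposed_experiments.items():
    deltas = metric_deltas(base_experiment, metrics)
    # Signed formatting makes improvements and regressions easy to spot.
    summary = ", ".join(f"{metric}: {delta:+.3f}" for metric, delta in deltas.items())
    print(f"{name} vs. base -> {summary}")
```

Metrics that appear in only one of the two experiments are skipped, since there is nothing to compare them against.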
Code Review¶
Code is still an important part of data science. In the Files Changed tab of the pull request, you can see which code files changed in your project and what was added or removed, presented in a way that focuses on what's important.
However, we know that some things matter especially to data scientists – like working with notebooks...
Notebook Review¶
...So we also added notebook diffing as part of the review process. See what changed in your notebooks in an easy-to-understand way.
Notebook diffing in data science pull requests
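For a rough sense of what a notebook diff involves, the sketch below compares the code cells of two notebooks using only the Python standard library. It relies on the fact that `.ipynb` files are JSON documents with a `cells` list. It is an illustrative assumption, not how DagsHub implements notebook diffing, and the function names and sample notebooks are hypothetical.

```python
import difflib
import json
from typing import List


def code_cells(notebook_json: str) -> List[str]:
    """Return the source of each code cell in a notebook given as a JSON string."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows "source" to be a string or a list of strings.
        sources.append(source if isinstance(source, str) else "".join(source))
    return sources


def diff_notebooks(old_json: str, new_json: str) -> List[str]:
    """Return a unified diff of the code cells of two notebooks."""
    old_lines = "\n".join(code_cells(old_json)).splitlines()
    new_lines = "\n".join(code_cells(new_json)).splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="base", tofile="proposed", lineterm=""
        )
    )


# Hypothetical notebooks, for illustration only.
old = json.dumps({"cells": [{"cell_type": "code", "source": ["lr = 0.1\n", "epochs = 5"]}]})
new = json.dumps({"cells": [{"cell_type": "code", "source": ["lr = 0.01\n", "epochs = 5"]}]})

for line in diff_notebooks(old, new):
    print(line)
```

Diffing only the cell sources ignores outputs and metadata, which is usually what a reviewer wants when the question is what changed in the code.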
Data & Model Review¶
Reviewing code changes is necessary, but not sufficient, to understand a data science project. We also need to see data and model changes to get the entire picture. In the Files Changed tab, in addition to code changes (tracked in Git), you will see changes to data, models, and any other artifacts tracked by DVC. This means that all changes can be viewed in one place.
Data and model files changed
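DVC records a content hash for each file it tracks, so a change to data or a model shows up as a changed hash. The sketch below assumes you have already collected a path-to-hash mapping for the base branch and for the proposed branch, and groups the tracked paths into added, removed, and modified. The function name, paths, and hash values are hypothetical, and this is not DagsHub's implementation.

```python
from typing import Dict, List


def classify_changes(base: Dict[str, str], proposed: Dict[str, str]) -> Dict[str, List[str]]:
    """Group tracked paths into added, removed, and modified by comparing hashes."""
    return {
        "added": sorted(set(proposed) - set(base)),
        "removed": sorted(set(base) - set(proposed)),
        "modified": sorted(
            path for path in set(base) & set(proposed) if base[path] != proposed[path]
        ),
    }


# Hypothetical path-to-hash mappings, for illustration only.
base_files = {"data/train.csv": "a1b2", "models/model.pkl": "c3d4", "data/old.csv": "e5f6"}
proposed_files = {"data/train.csv": "a1b2", "models/model.pkl": "0f9e", "data/extra.csv": "7a8b"}

for kind, paths in classify_changes(base_files, proposed_files).items():
    print(f"{kind}: {paths}")
```

Comparing hashes rather than file contents keeps the check cheap even when the tracked files are large.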
Next Steps¶
Congratulations! You've successfully completed one data science workflow cycle. Perhaps it's time to discover another experiment or explore a new project to work on?