Data science is a research-driven field, and exploring many solutions to a problem is a core principle. As a project evolves and grows in complexity, we need to compare results and see which approaches are more promising than others. In this process, we need to ensure we don't lose track of the project's components or miss critical information. Moreover, we need to be able to reproduce results and manage past experiments so that we don't waste time exploring the same hypothesis twice. For this reason, it's necessary to use a structured workflow to explore new experiments.
In this blog post we’ll cover the following topics:
- Key challenges in data science projects and the desired capabilities to overcome them.
- Insights from the software development workflow.
- The recommended workflow to run a new experiment.
Key challenges in the Data Science Workflow
To better understand the challenges in the experiment tracking workflow, let's first define the steps involved in a typical project:
- Understand the business problem at hand.
- Gather the raw data.
- Explore, transform, clean, and prepare the data.
- Create and select models based on the data.
- Train, test, tune and deploy the model.
- Monitor the model's performance.
The workflow outlined above provides a road map to deploying a machine learning model to production and monitoring it. As we can see, much of the process in a data science project is based on trial-and-error. The data scientist runs tests, compares the results, reruns them, compares the results, and so on. As a result, we have to deal with three main challenges.
The Challenge: Manage the different experiments/approaches to solving the problem.
When working on a data science project, we often need to explore different approaches to solve a problem - hyperparameter tuning, model architecture, data processing methods, etc. Every approach might be different from, and even orthogonal to, the others. Moreover, every approach will have sub-experiments, challenging the researcher to choose the best result.
The Desired Capability: Comparison tools that present the different results and compare them by various parameters.
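To make this capability concrete, here is a minimal sketch of a comparison helper. The record layout (`name`, `params`, `metrics`), the function names, and the numbers are hypothetical and only illustrate the idea of ranking experiments by a metric while showing the parameters that produced it.

```python
# Hypothetical experiment records; the values are illustrative, not real results.
experiments = [
    {"name": "exp-1", "params": {"lr": 0.1, "depth": 3}, "metrics": {"accuracy": 0.81}},
    {"name": "exp-2", "params": {"lr": 0.01, "depth": 5}, "metrics": {"accuracy": 0.86}},
    {"name": "exp-3", "params": {"lr": 0.01, "depth": 8}, "metrics": {"accuracy": 0.84}},
]


def rank_experiments(records, metric, higher_is_better=True):
    """Return the records sorted from best to worst on the given metric."""
    return sorted(records, key=lambda r: r["metrics"][metric], reverse=higher_is_better)


def format_table(records, metric):
    """Render the records as a plain-text table: name, every parameter, the metric."""
    param_names = sorted({key for r in records for key in r["params"]})
    header = ["name", *param_names, metric]
    rows = [
        [r["name"], *(str(r["params"].get(p, "-")) for p in param_names), str(r["metrics"][metric])]
        for r in records
    ]
    table = [header, *rows]
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)) for row in table
    )


if __name__ == "__main__":
    print(format_table(rank_experiments(experiments, "accuracy"), "accuracy"))
```

Dedicated experiment tracking tools offer much richer views, but even a small table like this makes it clear which parameters changed between runs.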
The Challenge: Reproduce results of previous experiments with all of the project's components.
When exploring sub-experiments, we can't tell in advance which one will produce the desired result. It might be the first experiment, the Xth, or the last one. Thus, we need to be able to reproduce the code, data, and model of the experiment with the best result.
The Desired Capability: Encapsulate all of the project’s components with the result of the experiment.
The Challenge: Work in parallel and synchronize the project's components.
When working in a production environment, one of the greatest challenges is collaborating with other data scientists. When several people work on different experiments simultaneously, keeping the code, data, and models in sync becomes an excruciating task.
The Desired Capability: Version all of the project’s components.
Learning from Software Development
Because of the similarities between software development and data science, and since software development in teams has had significantly more time to mature, let’s see how we can adapt and apply methods from that domain and use them to define an efficient workflow in the data science field.
The software development workflow is an extensive discipline. I want to draw our attention to the “Git Feature Branch Workflow”, which defines a strict branching model. It provides a robust framework for managing large-scale projects with a considerable number of developers. For simplicity, this blog post will not cover the use of Develop and Release branches.
In the Feature Branch Workflow, we will store the production version of the project on the Master branch and the developed features on different ones. Git will isolate the working environment for every branch, so developers can edit, stage, and commit changes to a feature branch without affecting the production version.
When using a Git hosting platform (e.g., GitHub, BitBucket, GitLab), the developer will not simply merge a finished feature into the Master branch but will create a pull request (PR). At this point, they will assign it to a colleague, who reviews the entire set of changes.
Finally, when the feature is ready for deployment, the developer will merge the feature's branch to the Master branch. Then, CI/CD tools will pull the new version, build it, and deploy it to production.
This workflow enables many teams to work in parallel and manage their project efficiently. With some modifications, we can adapt it to data science to overcome the challenges described earlier. Let's explore how we can do that.
Experiment Tracking Workflow
Based on the challenges mentioned throughout this blog post, the desired capabilities, and the traditional software development workflow, we suggest a new approach.
The approach consists of four elements. Each element builds on the previous one, and together they form a holistic solution.
The foundation of this workflow is Data and Model Versioning. Tracking all the project's components makes reproducing prior work easier. Furthermore, when working in a team, it enables us to manage the project more efficiently and work in parallel on the different components. Several MLOps tools support data versioning, including DVC, MLflow, and Pachyderm.
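To illustrate the general idea behind data versioning, here is a minimal sketch that fingerprints a data file by its content and writes a small pointer file that could be committed to Git next to the code. This is not the API of DVC, MLflow, or Pachyderm; the function names and the `.sha256` pointer format are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path):
    """Write '<digest>  <file name>' next to the data file and return the pointer path.

    The pointer file is tiny, so it can live in Git even when the data cannot.
    (Hypothetical format chosen for this sketch.)
    """
    data_path = Path(data_path)
    pointer = data_path.with_name(data_path.name + ".sha256")
    pointer.write_text(f"{file_digest(data_path)}  {data_path.name}\n")
    return pointer
```

If the digest recorded in a commit matches the digest of the file on disk, we know the code at that commit is being run against the same data.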
The second element is Encapsulating the Experiment's Components with the result. To reproduce the results of an experiment, we need to be able to retrieve the versions of the code, data, and model that produced them. We will use one of the above tools to version the data and model files, and Git to version the code files. We recommend logging the experiment's parameters and metrics to a human-readable file in an open format and tracking that file with Git.
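As a minimal sketch of this recommendation, the snippet below writes an experiment's parameters and metrics to a JSON file. The file name, the `params`/`metrics` layout, and the function names are assumptions made for illustration; YAML or CSV would work just as well.

```python
import json
from pathlib import Path


def log_experiment(path, params, metrics):
    """Write parameters and metrics to a JSON file and return the stored record.

    Sorted keys and fixed indentation keep Git diffs between experiments small.
    """
    record = {"params": params, "metrics": metrics}
    Path(path).write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
    return record


def load_experiment(path):
    """Read a record previously written by log_experiment."""
    return json.loads(Path(path).read_text())
```

Committing this file together with the code means that every commit carries both the experiment's configuration and its outcome.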
The third element is the Isolation of Experiments. This element was inspired by the software development workflow, where the production code is held on the Master branch and the evolving features are stored on separate branches. In our case, the project's best components will be held on the Master branch, and every experiment will be hosted on its own branch. This way, it will be easy to manage and track the various experiments, review every experiment's result, and avoid repeating the same experiment twice.
The fourth and last element is the Data Science Pull Request (DSPR). When creating a PR for a software development project, the only element that needs to be reviewed and merged is the code. However, a data science project has more components. Therefore, to enable an efficient review process with a complete picture of the experiment, the DSPR will include the code, data, model, and experiment results.
Experiment Tracking Workflow
Create a branch for the experiment and check it out.
Run sub-experiments on the branch and track the qualitative results with the project’s components.
- Parallel Approach - For every sub-experiment, create a new branch and run the experiment on it. When having multiple experiments, this approach can become hard to manage.
- Sequential Approach - For every sub-experiment, create a new commit of the project version. This approach allows you to choose which experiment you want to track and filter noise.
Compare the results of different experiments and choose the best one.
If the best experiment result is better than the result of the model on the Master branch:
- Parallel Approach - check out the branch that stores the experiment with the best results.
- Sequential Approach - check out the commit with the best result and create a new branch from it.
Create a Data Science Pull Request with all of the project components - code, data, models, and experiments.
After a review of the DSPR and approval, simply merge the files to the Master branch.
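The comparison step in the workflow above can be sketched as a small helper. It assumes, purely for illustration, that each sub-experiment is identified by a Git ref (a branch name in the parallel approach, a commit hash in the sequential one) and summarized by a single metric; the function name and arguments are hypothetical.

```python
def pick_candidate(baseline, results, higher_is_better=True):
    """Return the ref whose metric beats the Master baseline, or None.

    baseline: metric of the model currently on the Master branch.
    results:  mapping of Git ref (branch or commit) -> metric of that sub-experiment.
    """
    if not results:
        return None
    choose = max if higher_is_better else min
    best_ref = choose(results, key=results.get)
    best_value = results[best_ref]
    improved = best_value > baseline if higher_is_better else best_value < baseline
    # Only an experiment that improves on the baseline is worth a DSPR.
    return best_ref if improved else None
```

When the helper returns a ref, that is the branch or commit to check out before opening the DSPR; when it returns `None`, the Master branch stays as it is.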
Using the above workflow and elements, exploring a new hypothesis becomes a more manageable task. We can easily compare the results of the experiments and determine what approach was more effective. If you have tried this workflow and it helped you - I would love to hear about it via Twitter or LinkedIn. If you have suggestions for optimization, we can talk about them on our Discord channel.