Why Git is not enough for data science

Martin Daniel
5 years ago

Developer @ DAGsHub


TL;DR: Git is used in almost every software development project to track changes to code and files. That same ability to track every change has driven a sharp increase in Git's adoption for data science projects. In this post we discuss:

Benefits of Git for data science
The gaps and limitations of Git
Best practices for using Git for data science projects

If you are already familiar with Git, jump to the section “Is Git important to learn for data science?”

What is Git and how does it work?

“Git is a free and open-source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.” - Git

As the description states, Git is a version control system. It records and tracks every change made to source code and lets you quickly and easily recover any previous state.

Git uses a distributed version control model. This means that there can be many copies (or forks/remotes in the GitHub world) of the repository. When working locally, Git is the program that you will use to keep track of changes to your repository.

GitHub.com is a hosting service that acts as a remote location for your repository. GitHub provides a backup of your work that can be retrieved if your local copy is lost (e.g., if your computer falls off a pier). GitHub also allows you to share your work and collaborate with others on a project.

GitLab and Bitbucket are similar tools.

https://git-scm.com/book/en/v2/book/01-introduction/images/distributed.png

Source: Pro Git by Scott Chacon and Ben Straub.

How Git and GitHub, GitLab, or Bitbucket help you work better

There are several practical ways Git helps a development project.

Keep track of changes to your code locally using Git.
Synchronize code between different versions (i.e., your own versions or others’ versions).
Test changes to code without losing the original.
Revert to an older version of code, if needed.
Back up your files in the cloud (GitHub/GitLab/Bitbucket).
Share your files on GitHub/GitLab/Bitbucket and collaborate with others.

By now it should be clear why Git is such a powerful tool: it records, tracks, and stores any change made to almost any file in a project. It also helps you work well as a team, which is one of the main reasons Git is so widely used in software development projects.

These benefits also apply to data science projects, where Git manages the code that supports the work, but they do not translate one-to-one.

Is Git important to learn for data science?

If you want to join a data science project as a collaborator, you will face some challenges, for example:

If you are working on an open source project, you need to review all ongoing research and repositories on a specific topic and pick the most promising one.
You need to understand the current state of that project and how it has evolved over time.
You need to identify which directions are promising and still worth exploring. Reviewing ideas and approaches that were tried and abandoned is also important here, since you don’t want to repeat work that someone already tried unsuccessfully. These failed approaches are usually undocumented and forgotten, which is a huge challenge.
You need to collect all the pieces of the project (data, code, etc.), which might be spread across multiple platforms and are sometimes not fully accessible.
Last but not least, once you’ve made an improvement or explored a new direction, there is no easy way to contribute your results back to the project.

To summarize, today’s tools do not handle these challenges gracefully.

Now let’s see how Git can help us fill in some of the missing pieces.

With Git’s ability to track every change made to our files, we can show all the directions taken, when they were taken, and by whom! You can read the entire Git history as a story, understand what was done at every step (commit), and get some documentation along the way (commit messages). You can also share it with other collaborators using one of the platforms mentioned earlier.
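To make the "history as a story" idea concrete, here is a minimal sketch that reads a repository's history into Python records, so you can see what was done, when, and by whom. The names `Commit`, `parse_log`, and `read_history` are my own illustrative choices, not part of Git. It assumes the `git` command is installed and uses the unit separator character to split fields, since that character is unlikely to show up in a commit message.

```python
import subprocess
from typing import NamedTuple

# Unit separator: unlikely to appear in author names or commit subjects.
SEP = "\x1f"
# %h = short hash, %an = author name, %ad = author date, %s = subject line.
LOG_FORMAT = "%h%x1f%an%x1f%ad%x1f%s"


class Commit(NamedTuple):
    short_hash: str
    author: str
    date: str
    subject: str


def parse_log(text: str) -> list[Commit]:
    """Turn `git log` output (formatted with LOG_FORMAT) into records."""
    commits = []
    for line in text.split("\n"):
        if not line:
            continue
        # maxsplit=3 keeps the subject intact even if it contains SEP.
        short_hash, author, date, subject = line.split(SEP, 3)
        commits.append(Commit(short_hash, author, date, subject))
    return commits


def read_history(repo_path: str = ".") -> list[Commit]:
    """Read the history of the repository at repo_path, newest commit first."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--date=short",
         f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return parse_log(result.stdout)
```

Calling `read_history(".")` inside a repository returns one record per commit, which you can then filter by author or date to follow a single line of work.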

By using a more traditional software development workflow, you begin to treat your models more like an application and less like a script, which makes them easier to manage and leads to higher quality outcomes.

Still, Git has limitations

Although there are significant benefits from using version control tools like Git, they come with a high overhead cost.

The overhead comes from the need to ensure every change goes through the "commit" process, which most often means using the command line and terminal. Since the terminal is unfamiliar to many analysts (and even data scientists), you don't just need to learn Git, you also need to learn the terminal! This is not quick, and having your efficiency suffer while you struggle to remember which command to write is a huge turn-off. If this is your case, check out the blog post Effective Linux & Bash for Data Scientist.

Also, Git can't do all the heavy lifting on its own. What do I mean by this?

Git has no built-in support for experiment tracking. Here is a nice post comparing some existing tools: ML experiment tracking tools that fit your data science workflow.
Git does not handle big files (datasets and models) well. You can find more information about this in the post Comparing Data Version Control tools.
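One practical consequence of the big-file limitation is that it pays to check a project folder for large files before committing. Below is a minimal sketch of such a check. The function name `find_large_files` is hypothetical, and the default threshold of 100 MiB is an assumption based on the per-file limit GitHub enforces on pushes; adjust it for your hosting platform.

```python
import os

# Assumed threshold: GitHub rejects pushes containing files over 100 MiB.
DEFAULT_LIMIT = 100 * 1024 * 1024


def find_large_files(root: str, limit_bytes: int = DEFAULT_LIMIT) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for files larger than limit_bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own internal directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks, which may point outside the project
            size = os.path.getsize(path)
            if size > limit_bytes:
                large.append((os.path.relpath(path, root), size))
    return sorted(large)
```

Anything this reports is a candidate for a data versioning tool or external storage rather than a plain Git commit.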

Best practices for structuring a Data Science project using Git

With all that said, I propose a way to integrate the Git mindset into your DS project. It is composed of a few components: experiment tracking, version control, and treating data as source code.

Experiment tracking

You can implement experiment tracking in two ways: with a dedicated tool or with Git. You can find more information about this in ML experiment tracking tools that fit your data science workflow.

External tracking

You log all your experiment information to an external system.

This approach has some advantages:

Many excellent tools have been developed for it.
It’s intuitive: there is no need to stop and create a new commit in a Git project before taking a new direction.

But with advantages come disadvantages:

There is no clear connection between the code and the experiment results.
It’s hard to review.
Reproducibility: it’s not easy to reproduce what led to an experiment result.
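As a minimal sketch of external tracking, the code below appends one JSON record per experiment to a log file that lives outside the Git history. It stands in for a dedicated tool and is not the API of any of them; `log_experiment`, `best_run`, and the record fields are hypothetical names. The optional `commit` field is my own addition: storing the commit hash with each run is one simple way to soften the missing link between code and results.

```python
import json
import time
from pathlib import Path


def log_experiment(log_path, params, metrics, commit=None):
    """Append one experiment record to a JSON Lines file and return it."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "commit": commit,  # hash of the code that produced the run, if known
        "params": params,
        "metrics": metrics,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(log_path, metric, higher_is_better=True):
    """Return the logged run with the best value for metric, or None."""
    with Path(log_path).open(encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    runs = [r for r in runs if metric in r["metrics"]]
    if not runs:
        return None
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["metrics"][metric])
```

For example, `log_experiment("runs.jsonl", {"lr": 0.01}, {"accuracy": 0.86}, commit="def5678")` records a run, and `best_run("runs.jsonl", "accuracy")` later retrieves the strongest one.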

Version Control with Git Tracking

You treat each experiment as a Git commit. This means that any change to the project creates a new version, since code, data, and parameters are all part of the source code.

Some advantages are:

Reproducing is easy: just do a git checkout and you have the code, parameters, data, and models.
You get all the context related to an experiment.
Collaboration: as mentioned throughout this article, Git combined with one of the hosting platforms lets you parallelize work.
If combined with data versioning tools, you can also accept data contributions.
GitOps, CI/CD: it is easier to integrate with the existing Git ecosystem for CI/CD and PRs.

It also has some disadvantages:

It can get messy: a lot of experiments means a lot of commits.
It requires a change of mindset: you have to start treating any new direction as a commit.

Of course, you can also mix both approaches.
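For the Git tracking approach, parameters and metrics need to live in files that Git can diff. Here is a minimal sketch, assuming you keep them as JSON files in the repository; the file names `params.json` and `metrics.json` and the helper names are hypothetical. Sorting the keys keeps the output stable, so a commit only shows the values that actually changed.

```python
import json
from pathlib import Path


def write_tracked_file(path, values):
    """Write values as stable, diff-friendly JSON and return the text written."""
    text = json.dumps(values, indent=2, sort_keys=True) + "\n"
    Path(path).write_text(text, encoding="utf-8")
    return text


def describe_changes(old, new):
    """List which keys differ between two runs, e.g. for a commit message."""
    changes = []
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes.append(f"{key}: {old.get(key)!r} -> {new.get(key)!r}")
    return changes
```

After a run you would write both files, then record the experiment with `git add params.json metrics.json` followed by `git commit`, using the output of `describe_changes` as a starting point for the commit message.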

Data as source code

As I mentioned before, Git was developed to track changes in text files, not large binary files, so tracking a project's data set directly in Git is not a good option. This is why I recommend one of two options:

For a data set that does not change, you can upload it to a server and access it through a URL.
For a data set that might change, you should consider versioning it with a dedicated tool. You can find a great comparison in Comparing Data Version Control tools.

You can also find more information about why it is a good idea to version your data set in the blog post Datasets should behave like Git repositories.
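To illustrate what treating data as source code can look like, here is a minimal sketch of the general idea: keep the large file outside Git and commit only a small text "pointer" that records its hash, size, and (optionally) where to download it. This is not the file format of any particular tool; the function names and pointer fields are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path, pointer_path, url=None):
    """Write a small JSON pointer describing the dataset and return it."""
    data_path = Path(data_path)
    pointer = {
        "file": data_path.name,
        "sha256": file_sha256(data_path),
        "size_bytes": data_path.stat().st_size,
        "url": url,  # where the data can be fetched from, if hosted
    }
    text = json.dumps(pointer, indent=2, sort_keys=True) + "\n"
    Path(pointer_path).write_text(text, encoding="utf-8")
    return pointer


def matches_pointer(data_path, pointer_path):
    """Check that the local data is the exact version the pointer describes."""
    pointer = json.loads(Path(pointer_path).read_text(encoding="utf-8"))
    return file_sha256(data_path) == pointer["sha256"]
```

Because the pointer is a small text file, it can be committed alongside the code, and `matches_pointer` tells a collaborator whether the data they have is the version that commit expects.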

Conclusion

Implementing these suggested practices for Git offers several benefits:

Consolidate all your project files, data, and models in one place.
Review tools that make it easier to contribute to an ongoing project and to check those contributions.
Easier reproduction and reuse of work from previous projects.
CI/CD: if you are happy with the contributions that were made, you can automatically merge them, test the code and the data, and ship them to production.

Day by day, Git is being used in more data science projects. I hope that after reading this article you have a better understanding of its limitations and strengths, and of how you can use it with your colleagues. Good luck!