Q: So what IS DAGsHub exactly?
A: DAGsHub is a web platform for data version control and collaboration for data scientists and machine learning engineers.
Q: Seriously, I don’t get it... What is it?
A: It’s like GitHub for data science and machine learning.
Q: Why can’t I just use Git?
A: Basically, regular Git handles large files poorly, and many data science and machine learning projects depend on large datasets and models.
git-lfs is an extension to Git that can version large files, but that solves only half of the problem.
Git and git-lfs don't version the data pipeline. This means that when something in your pipeline is modified, you won't know that the outputs at the end of the pipeline (e.g. the trained model) are out of date and should be reproduced.
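To make the problem concrete, here is a minimal Python sketch of what tracking a pipeline involves: each stage remembers a hash of its inputs, and a stage counts as stale when one of its inputs changed or when a stage upstream of it is stale. The stage names, file names, and the `snapshot` and `stale_stages` helpers are all hypothetical, and the files live in memory for simplicity. This illustrates the idea only; it is not how DVC or DAGsHub is implemented.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Fingerprint file contents so that changes can be detected."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical three-stage pipeline, listed in execution order.
# Each stage reads its `deps` and writes its `outs`.
STAGES = {
    "prepare": {"deps": ["raw.csv", "prepare.py"], "outs": ["clean.csv"]},
    "train": {"deps": ["clean.csv", "train.py"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "eval.py"], "outs": ["metrics.json"]},
}


def snapshot(files):
    """Record the hash of every dependency, per stage, as of the last run."""
    return {
        name: {dep: content_hash(files[dep]) for dep in stage["deps"]}
        for name, stage in STAGES.items()
    }


def stale_stages(files, recorded):
    """Return the stages that must be rerun, in execution order."""
    stale = []
    dirty_outputs = set()  # outputs of stages already known to be stale
    for name, stage in STAGES.items():
        changed = any(
            content_hash(files[dep]) != recorded[name][dep]
            for dep in stage["deps"]
        )
        upstream_dirty = any(dep in dirty_outputs for dep in stage["deps"])
        if changed or upstream_dirty:
            stale.append(name)
            dirty_outputs.update(stage["outs"])
    return stale


# In-memory stand-ins for the project files (assumed contents).
files = {
    "raw.csv": b"a,b\n1,2\n",
    "prepare.py": b"# cleaning code v1",
    "clean.csv": b"a,b\n1,2\n",
    "train.py": b"# training code v1",
    "model.pkl": b"model-v1",
    "eval.py": b"# evaluation code v1",
    "metrics.json": b'{"acc": 0.9}',
}
recorded = snapshot(files)

# Editing only the cleaning script invalidates everything downstream of it.
files["prepare.py"] = b"# cleaning code v2"
print(stale_stages(files, recorded))  # ['prepare', 'train', 'evaluate']
```

Plain Git would record the edit to `prepare.py`, but nothing in the commit says that `model.pkl` no longer matches the code that is supposed to produce it. That missing link is the second half of the problem.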
Q: So, then, does DAGsHub do all of that stuff?
A: The short answer: YES.
The longer answer is that DAGsHub is built on git and DVC, an open-source command-line tool for data and pipeline versioning. You use git for exactly the same things you would in a regular code project, and you use DVC on top for the DS/ML versioning stuff. DAGsHub adds visualizations and automation features on top of that.
Q: Does that mean I need to learn a whole new framework again?
A: The great thing about DVC is that it doesn’t affect code versioning. You still use plain old git for that.
DVC adds commands for DS and ML on top of that, but the syntax is similar to git, so it’s not entirely unfamiliar. Several everyday git commands have a close DVC counterpart, for example dvc add, dvc push, dvc pull and dvc checkout.
Q: So why not just use Git and DVC through the command line?
A: In a nutshell: DAGsHub is to DVC what GitHub is to git.
DVC is great, and so is git. But they are both command-line tools, and as such they have some limitations that DAGsHub addresses.
First of all, the command line gives you no convenient way to visualize your pipeline or get an overview of your project metrics. DAGsHub shows your pipeline as a, wait for it, DAG (!!!), short for directed acyclic graph, where every node is a file, with important details and a direct link to the file itself. This is especially important for team projects, where you want everyone on the same page and seeing the same high-level picture.
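For readers who have not met the term, the following Python sketch shows what it means to treat a pipeline as a DAG whose nodes are files. The file names and the `build_order` and `ancestors` helpers are hypothetical, and the sketch relies on the standard-library `graphlib` module (Python 3.9+). It is only an illustration of the data structure, not DAGsHub's or DVC's actual code.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each file maps to the set of files it is built from.
PIPELINE = {
    "clean.csv": {"raw.csv", "prepare.py"},
    "model.pkl": {"clean.csv", "train.py"},
    "metrics.json": {"model.pkl", "eval.py"},
}


def build_order(graph):
    """List every node so that each file appears after everything it depends on."""
    return list(TopologicalSorter(graph).static_order())


def ancestors(graph, node):
    """Collect every file that `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, ()))
    return seen


order = build_order(PIPELINE)
print(order)  # one valid build order; source files come before what they produce
print(sorted(ancestors(PIPELINE, "model.pkl")))
# ['clean.csv', 'prepare.py', 'raw.csv', 'train.py']
```

A graph view answers questions like "what went into this model?" at a glance, which is the kind of overview that is hard to get from command-line output alone.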
You can send someone a link to your DAGsHub repo, and give them a way to explore your project, including downloading your data and models from any past version or branch, without forcing them to clone or run any code.
Building on the powerful foundations of git and DVC, we have many more features in the works, which should make life easier for everyone.
Q: Most tools that offer data pipeline versioning require adding lines of code to my project and/or importing libraries. What does DAGsHub or DVC require me to do?
A: NOTHING! This is why we love DVC so much. Just like git, it is non-intrusive and not bloated. You just install the program and it works.
Q: Then surely, it works only for certain languages and with certain ML libraries?
A: Nope. Completely, 100% language and library agnostic. DVC and DAGsHub don’t care if you’re using Python or R, Keras or PyTorch.
Q: OK, but I like GitHub, and that’s what I’m using for my project. So you can’t help me, right?
A: Actually, we can. You can make a DAGsHub account and mirror a repo if you don’t want to migrate. That way you can manage your code on GitHub and get most of the awesome features DAGsHub has to offer here.
After logging in, go to: https://dagshub.com/repo/migrate to mirror your project from GitHub or any other existing repo.
Q: Sounds good... How much will it cost me?
A: Starting at a whopping $0, DAGsHub is completely free for open source projects. Private repos are coming soon, and will entail a small monthly subscription. If you’re interested, contact us at email@example.com for more details.
Q: So how do I use DAGsHub?
A: You can start with the tutorial.