
Tutorial Overview

Creating an awesome project using DVC and DAGsHub

This tutorial walks through building a model that classifies images of hand-written digits (0 to 9), using MNIST as the dataset. This problem is often considered the "Hello, World" of machine learning, so it is relatively simple.

The focus of the tutorial is to show how we use DVC to version our data pipeline, what benefits that brings to our workflow, and the advantages of using DAGsHub both as a repo for our projects and as a pipeline visualization tool.

Samples from the MNIST test data set (source: Josef Steppan on Wikimedia Commons)

DVC?

DVC, short for Data Version Control, is a tool that addresses the versioning and reproducibility problems in the data science and machine learning fields. It does so by enabling data versioning as well as pipeline versioning, which in turn enables experiment reproducibility and easier collaboration. DVC is built to work synergistically [1] alongside Git, which is still used as the backbone for file versioning.
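
One way to picture how data versioning can sit alongside Git: the large data file stays out of Git, and a small text file recording its content hash is committed instead. The sketch below is a conceptual illustration of that idea only, not DVC's implementation. The function names and the `.pointer.json` format are hypothetical, and SHA-256 is chosen here just for the example.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small pointer file next to the data file (hypothetical format).

    The pointer is tiny and text-based, so it is the kind of file that
    could be committed to Git in place of the data itself.
    """
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer = {
        "path": data_path.name,
        "sha256": file_digest(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def is_unchanged(data_path: Path, pointer_path: Path) -> bool:
    """Check whether the data file still matches the recorded hash."""
    pointer = json.loads(pointer_path.read_text())
    return pointer["sha256"] == file_digest(data_path)
```

Calling `write_pointer` on a data file and later `is_unchanged` on the same pair tells you whether the data has drifted from the version recorded in the pointer, which is the basic property that makes a data version reproducible.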

Too slow for you?

Here is a link to the complete code repo. You can go over it or use the code as you wish.

DAGsHub MNIST Tutorial Repo

The tutorial will guide you, step-by-step, to create this repo.


  1. We're almost sure that's a real word