
title: DVC and DagsHub Tutorial: Classifying MNIST Handwritten Digits with ML Pipelines – Reproducing Results
description: Embark on a journey through machine learning basics with this DagsHub tutorial, where you'll learn to classify MNIST handwritten digits. Discover how to version your data pipeline with DVC and leverage DagsHub for project repository management and pipeline visualization, streamlining your ML workflows.


Experimentation and Reproducibility

Section overview

Now that we have established our pipeline, it's time to enjoy the fruits of our labor. This mainly comes in the form of easy experimentation and automatic reproducibility.

When we experiment and change the parameters of our pipeline (e.g. change hyperparameters, preprocessing, or even switching to a new dataset), DVC automagically knows what has changed and re-runs only the relevant stages of the pipeline, building upon non-changed stages to save time.
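
DVC can do this because every stage in dvc.yaml declares its command, its dependencies, and its outputs, so changing a dependency invalidates exactly that stage and everything downstream of it. As a rough sketch (the precise dependency and output lists here are assumptions for illustration, based on the files our pipeline reads and writes), the featurization stage might be declared like this:

stages:
  featurization:
    cmd: python3 code/featurization.py
    deps:
      - code/featurization.py
      - data/train_data.csv
      - data/test_data.csv
    outs:
      - data/processed_train_data.npy
      - data/processed_test_data.npy
      - data/norm_params.json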

Reproducibility means that after making your changes and telling DVC to recalculate the pipeline, you can dvc push the resulting files to a shared remote. That way, when someone else (or you yourself 3 months from now) checks out the experiment branch, they immediately get all the necessary context to reproduce your results. Remember, even failed experiments can be a useful source of information!

In this section you'll see how great that is.

[Image: A DAGnicorn with automagical powers (source: Mark Glancy on Pexels)]

This section covers the following experiments:

  • Performing Principal Component Analysis (PCA) to reduce the number of features in the data from 784 to 15.
  • Changing the model from an SVM to a Convolutional Neural Network (CNN) using PyTorch.
  • Merging the chosen model into the master branch.

Let's get started.

Principal Component Analysis

We decided to see what results we get by reducing the number of features in our data from 784 (28*28 pixels) to 15.

Let's start by creating a new branch.

git checkout -b PCA

Now, since nothing has really changed, if we use the dvc repro command, nothing will happen.

$ dvc repro
Stage 'data/test_data.csv.dvc' didn't change, skipping
Stage 'data/train_data.csv.dvc' didn't change, skipping
Stage 'featurization' didn't change, skipping
Stage 'training' didn't change, skipping
Stage 'eval' didn't change, skipping
Data and pipelines are up to date.

But that's not really interesting. Let's start changing our code. For this experiment we edit only the featurization stage - featurization.py. Logically, that means we'll need to re-run the featurization stage, followed by model training and evaluation.

The new code can be downloaded from this link.

Here is the code:

"""
Create feature CSVs for train and test datasets
"""
import json
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import pickle
import base64

def featurization():
    # Load data-sets
    print("Loading data sets...")
    train_data = pd.read_csv('./data/train_data.csv', header=None, dtype=float).values
    test_data = pd.read_csv('./data/test_data.csv', header=None, dtype=float).values
    print("done.")

    # BEGIN NEW CODE
    # Create PCA object of the 15 most important components
    print("Creating PCA object...")
    pca = PCA(n_components=15, whiten=True)
    pca.fit(train_data[:, 1:])

    train_labels = train_data[:, 0].reshape([train_data.shape[0], 1])
    test_labels = test_data[:, 0].reshape([test_data.shape[0], 1])

    train_data = np.concatenate([train_labels, pca.transform(train_data[:, 1:])], axis=1)
    test_data = np.concatenate([test_labels, pca.transform(test_data[:, 1:])], axis=1)
    print("done.")

    # END NEW CODE

    print("Saving processed datasets and normalization parameters...")
    # Save processed data-sets
    np.save('./data/processed_train_data', train_data)
    np.save('./data/processed_test_data', test_data)

    # Save learned PCA for future inference
    with open('./data/norm_params.json', 'w') as f:
        pca_as_string = base64.encodebytes(pickle.dumps(pca)).decode("utf-8")
        json.dump({ 'pca': pca_as_string }, f)

    print("done.")


if __name__ == '__main__':
    featurization()
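
A note on that last step: featurization.py pickles the fitted PCA object and base64-encodes it into norm_params.json, so future inference code can restore it by reversing the encoding. Here is a minimal sketch (not part of the tutorial's pipeline) of how that restoration might look:

import json
import base64
import pickle

import numpy as np

# Reverse the encoding used in featurization.py to restore the PCA object
with open('./data/norm_params.json') as f:
    pca = pickle.loads(base64.decodebytes(json.load(f)['pca'].encode("utf-8")))

# Project a new flattened 28x28 image into the same 15-dimensional space
sample = np.zeros((1, 784))  # stand-in for a real image
print(pca.transform(sample).shape)  # (1, 15)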

Now this is where the magic happens. Simply type:

dvc repro

The full output of dvc repro should look like this:
Stage 'data/test_data.csv.dvc' didn't change, skipping
Stage 'data/train_data.csv.dvc' didn't change, skipping
Running stage 'featurization' with command:
    python3 code/featurization.py
Loading data sets...
done.
Creating PCA object...
done.
Saving processed datasets and normalization parameters...
done.
Updating lock file 'dvc.lock'

Running stage 'training' with command:
    python3 code/train_model.py
Load training data...
Choosing smaller sample to shorten training time...
done.
Training model...
done.
Save model and training time metric...
done.
Updating lock file 'dvc.lock'

Running stage 'eval' with command:
    python3 code/eval.py
Loading data and model...
done.
Running model on test data...
done.
Calculating metrics...
done.
Updating lock file 'dvc.lock'

To track the changes with git, run:

    git add dvc.lock

DVC checks all .dvc files and stages in dvc.yaml (and dvc.lock) to see what has changed. Since the import stages didn't change, they won't be re-run. Upon reaching the featurization stage, DVC detects the modified code and runs it again, followed by the training and evaluation stages.

Committing to Git

Let's commit this change to Git:

$ git status -s
 M code/featurization.py
 M data/test_data.csv.dvc
 M dvc.lock
 M dvc.yaml
 M metrics/eval.json
 M metrics/train_metric.json

git add .
git commit -m "Performed PCA to choose 15 features"
git push origin PCA

Visualizing changes

Now that we committed the results to Git, we can see the change in metrics.

$ dvc metrics show -a
Revision    Path                       accuracy    training_time
PCA         metrics/train_metric.json  -           5.82843
PCA         metrics/eval.json          0.8295      -
update      metrics/train_metric.json  -           62.85439
update      metrics/eval.json          0.9845      -

Our accuracy did drop significantly, from 98% to 82%, but our training time was reduced to under one tenth of the original time!
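
If you're curious how much information those 15 components actually preserve, scikit-learn exposes this directly on the fitted PCA object. Here is a small sketch (not part of the pipeline itself) that loads the PCA we saved in norm_params.json and prints the fraction of variance it retains:

import json
import base64
import pickle

# Restore the PCA object saved by featurization.py
with open('./data/norm_params.json') as f:
    pca = pickle.loads(base64.decodebytes(json.load(f)['pca'].encode("utf-8")))

# Fraction of the original pixel variance captured by the 15 components
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")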


Optional - pushing to the remote cache

If you performed the optional stage, now would be a good time to push the updated files to the cloud.

To do this, simply run the dvc push command again:

$ dvc push -r gs_remote

Notice that this time, only changed files are uploaded to the cloud. The imported data, for example, doesn't need to be pushed again, which saves a lot of time.


Experiment conclusion

Let's assume that this accuracy is not good enough for us. We will go back to the master branch and create a new branch for the next experiment.

Convolutional Neural Network

We now turn to the awesome power of neural networks to tackle this digit classification problem.

[Image: A cool image of a neuron. So exciting! (source: ColiN00B on Pixabay)]

The code in this part is based on the code from the PyTorch examples repo.

Creating a new branch from the master branch

We'd like to start fresh from our original pipeline (before the introduction of PCA). For that we need to do two things.

git checkout -b CNN master
dvc checkout

The first command is the regular Git checkout that branches off master.

After checking out the Git files, our DVC-tracked files - data, model, and metrics - still refer to the PCA branch. The second command takes care of that: dvc checkout makes DVC look up the appropriate hashes in the cache folder and restore the matching versions to the working copy.

Automate dvc checkout

If you want DVC to automatically checkout whenever you switch Git branches, use the handy dvc install command.

To verify that we are indeed working on a copy of the last master branch commit, you can perform:

$ dvc repro
Stage 'data/test_data.csv.dvc' didn't change, skipping
Stage 'data/train_data.csv.dvc' didn't change, skipping
Stage 'featurization' didn't change, skipping
Stage 'training' didn't change, skipping
Stage 'eval' didn't change, skipping
Data and pipelines are up to date.

Updating requirements

For this experiment, we are going to use PyTorch, and therefore need to install this package. Simply type:

pip3 install torch==1.5.1
pip3 freeze > requirements.txt

If you're working on a machine without a GPU, you can install the lighter CPU-only build instead:

pip3 install torch==1.5.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip3 freeze > requirements.txt

Modifying the training code

We need to make a few changes here. First let's create a new code file named my_torch_model.py which will contain a class definition of our CNN. You can download the complete code from this link.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

Explanation about the network structure

This is a neural network consisting of two convolutional layers and two fully connected layers. After each convolutional layer we apply a ReLU activation function followed by 2D max pooling. The tensor is then flattened to fit the shape of the fully connected layers. The 4*4*50 input size of fc1 comes from the shape arithmetic: each 5x5 convolution (with no padding) shrinks the 28x28 input to 24x24, pooling halves it to 12x12, the second convolution shrinks it to 8x8, and the final pooling leaves a 4x4 map in each of the 50 channels. The flattened tensor passes through the first fully connected layer followed by another ReLU, and finally the last fully connected layer. We apply log_softmax, which results in a tensor holding a log-probability estimate for each class. We will later take the maximum of these estimates and use it as the classification result.
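
To sanity-check that shape arithmetic, you can push a dummy batch through the network; a quick sketch, assuming my_torch_model.py is importable:

import torch

from my_torch_model import Net

# A dummy batch of one 28x28 grayscale image
net = Net()
x = torch.zeros(1, 1, 28, 28)
out = net(x)
print(out.shape)  # torch.Size([1, 10]) - one log-probability per digit class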

Next, let's modify the code for our model training. The complete code can be found in this link.

The new code has a lot of changes, so here is a view of the whole train_model.py after the changes:

"""
Train classification model for MNIST
"""
import json
import pickle
import numpy as np
import time

# New imports
import torch
import torch.utils.data
import torch.nn.functional as F
import torch.optim as optim

from my_torch_model import Net

# New function
def train(model, device, train_loader, optimizer, epoch):
    log_interval = 100
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def train_model():
    # Measure training time
    start_time = time.time()

    # Setting up network
    print("Setting up Params...")
    device = torch.device("cpu")
    batch_size = 64
    epochs = 3
    learning_rate = 0.01
    momentum = 0.5
    print("done.")

    # Load training data
    print("Load training data...")
    train_data = np.load('./data/processed_train_data.npy')

    # Divide loaded data-set into data and labels
    labels = torch.Tensor(train_data[:, 0]).long()
    data = torch.Tensor(train_data[:, 1:].reshape([train_data.shape[0], 1, 28, 28]))
    torch_train_data = torch.utils.data.TensorDataset(data, labels)
    train_loader = torch.utils.data.DataLoader(torch_train_data,
                                               batch_size=batch_size,
                                               shuffle=True)
    print("done.")

    # Define CNN model and train
    print("Training model...")
    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(),
                          lr=learning_rate,
                          momentum=momentum)

    for epoch in range(1, epochs + 1):
        train(model, device, train_loader, optimizer, epoch)
    print("done.")

    # Save model as pkl
    print("Save model and training time metric...")
    with open("./data/model.pkl", 'wb') as f:
        pickle.dump(model, f)

    # End training time measurement
    end_time = time.time()

    # Create metric for model training time
    with open('./metrics/train_metric.json', 'w') as f:
        json.dump({'training_time': end_time - start_time}, f)
    print("done.")


if __name__ == '__main__':
    train_model()

The main changes are the new train() function and the train_loader, which produces batches of 64 images for training. We train for three epochs using an SGD (Stochastic Gradient Descent) optimizer.
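
If you want to see exactly what the train_loader hands the network on each iteration, you can peek at a single batch; this quick sketch mirrors the loading code from train_model.py:

import numpy as np
import torch
import torch.utils.data

# Build the same dataset and loader as train_model.py
train_data = np.load('./data/processed_train_data.npy')
labels = torch.Tensor(train_data[:, 0]).long()
data = torch.Tensor(train_data[:, 1:].reshape([-1, 1, 28, 28]))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(data, labels), batch_size=64, shuffle=True)

# Each iteration yields a (data, target) pair - exactly what train() consumes
batch_data, batch_target = next(iter(loader))
print(batch_data.shape)    # torch.Size([64, 1, 28, 28])
print(batch_target.shape)  # torch.Size([64])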

Finally, we must change eval.py to properly evaluate our new model. The complete code can be found here.

Here is the new code:

"""
Evaluate model performance
"""
import pickle
import json
import numpy as np
from sklearn.metrics import accuracy_score
import torch


def eval_model():
    # Load test data
    print("Loading data and model...")
    test_data = np.load('./data/processed_test_data.npy')

    # Load trained model
    with open('./data/model.pkl', 'rb') as f:
        model = pickle.load(f)

    # Switch model to evaluation (inference) mode
    model.eval()

    print("done.")

    # Divide loaded data-set into data and labels
    labels = test_data[:, 0]
    data = torch.Tensor(test_data[:, 1:].reshape([test_data.shape[0], 1, 28, 28]))

    # Run model on test data
    print("Running model on test data...")
    predictions = model(data).max(1, keepdim=True)[1].cpu().data.numpy()
    print("done.")

    # Calculate metric scores
    print("Calculating metrics...")
    metrics = {'accuracy': accuracy_score(labels, predictions)}

    # Save metrics to json file
    with open('./metrics/eval.json', 'w') as f:
        json.dump(metrics, f)
    print("done.")


if __name__ == '__main__':
    eval_model()

The changes here are mainly "cosmetic": swapping data types and function calls for the ones required to evaluate a PyTorch model.
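
One optional refinement that is not in the original code: wrapping inference in torch.no_grad() stops PyTorch from building a gradient graph, which saves memory and time during evaluation. The prediction step could then be written as:

import torch

# Hypothetical variant of the prediction line in eval.py;
# `model` and `data` are the objects loaded earlier in eval_model()
with torch.no_grad():
    predictions = model(data).argmax(dim=1).cpu().numpy()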

Reproduction

Now that that's out of the way, let's reproduce the model. Here we expect only the training and evaluation stages to be re-run.

dvc repro

The full output of dvc repro should look like this:
Stage 'data/test_data.csv.dvc' didn't change, skipping
Stage 'data/train_data.csv.dvc' didn't change, skipping
Stage 'featurization' didn't change, skipping
Running stage 'training' with command:
    python3 code/train_model.py
Setting up Params...
done.
Load training data...
done.
Training model...
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.306390
Train Epoch: 1 [6400/60000 (11%)]   Loss: 0.644356
Train Epoch: 1 [12800/60000 (21%)]  Loss: 0.321557
Train Epoch: 1 [19200/60000 (32%)]  Loss: 0.132828
Train Epoch: 1 [25600/60000 (43%)]  Loss: 0.220711
Train Epoch: 1 [32000/60000 (53%)]  Loss: 0.053678
Train Epoch: 1 [38400/60000 (64%)]  Loss: 0.167499
Train Epoch: 1 [44800/60000 (75%)]  Loss: 0.188741
Train Epoch: 1 [51200/60000 (85%)]  Loss: 0.180456
Train Epoch: 1 [57600/60000 (96%)]  Loss: 0.111947
Train Epoch: 2 [0/60000 (0%)]   Loss: 0.208979
Train Epoch: 2 [6400/60000 (11%)]   Loss: 0.226273
Train Epoch: 2 [12800/60000 (21%)]  Loss: 0.058727
Train Epoch: 2 [19200/60000 (32%)]  Loss: 0.046682
Train Epoch: 2 [25600/60000 (43%)]  Loss: 0.039229
Train Epoch: 2 [32000/60000 (53%)]  Loss: 0.033384
Train Epoch: 2 [38400/60000 (64%)]  Loss: 0.163442
Train Epoch: 2 [44800/60000 (75%)]  Loss: 0.027922
Train Epoch: 2 [51200/60000 (85%)]  Loss: 0.041510
Train Epoch: 2 [57600/60000 (96%)]  Loss: 0.017791
Train Epoch: 3 [0/60000 (0%)]   Loss: 0.053239
Train Epoch: 3 [6400/60000 (11%)]   Loss: 0.082304
Train Epoch: 3 [12800/60000 (21%)]  Loss: 0.081448
Train Epoch: 3 [19200/60000 (32%)]  Loss: 0.035810
Train Epoch: 3 [25600/60000 (43%)]  Loss: 0.042171
Train Epoch: 3 [32000/60000 (53%)]  Loss: 0.099796
Train Epoch: 3 [38400/60000 (64%)]  Loss: 0.045084
Train Epoch: 3 [44800/60000 (75%)]  Loss: 0.020555
Train Epoch: 3 [51200/60000 (85%)]  Loss: 0.047152
Train Epoch: 3 [57600/60000 (96%)]  Loss: 0.006924
done.
Save model and training time metric...
done.
Updating lock file 'dvc.lock'

Running stage 'eval' with command:
    python3 code/eval.py
Loading data and model...
done.
Running model on test data...
done.
Calculating metrics...
done.
Updating lock file 'dvc.lock'

To track the changes with git, run:

    git add dvc.lock

Everything should run smoothly.

Committing to Git

Let's commit this experiment to Git:

$ git status -s
  M code/eval.py
  M code/train_model.py
  M metrics/eval.json
  M metrics/train_metric.json
  M requirements.txt
  M dvc.yaml
  M dvc.lock
 ?? code/my_torch_model.py

git add .
git commit -m "Experiment with CNN model"
git push origin CNN

Visualizing changes

Now that we committed the results to Git, we can see the change in metrics.

$ dvc metrics show -a
CNN:
    metrics/eval.json:
        accuracy: 0.9861
    metrics/train_metric.json:
        training_time: 119.05487275123596
PCA:
    metrics/eval.json:
        accuracy: 0.8047
    metrics/train_metric.json:
        training_time: 0.9064040184020996
master:
    metrics/eval.json:
        accuracy: 0.8583
    metrics/train_metric.json:
        training_time: 11.965423107147217

This model took much longer to train, but performed over 12% better than our best previous result. Let's merge this branch into master.

Merging the chosen model to master

To merge, perform the usual commands:

git checkout master
git merge CNN
git push

To make sure we have all the updated files, let's get the relevant DVC files as well, and try dvc repro.

$ dvc checkout

$ dvc repro
Stage 'data/test_data.csv.dvc' didn't change, skipping
Stage 'data/train_data.csv.dvc' didn't change, skipping
Stage 'featurization' didn't change, skipping
Restored stage 'training' from run-cache
Skipping run, checking out outputs

Stage 'eval' didn't change, skipping

Success!


Optional - pushing to the remote cache

If you performed the optional stage, now would be a good time to push the updated files for all branches to the cloud.

To push all branches to the remote, use the following command:

dvc push -a -r gs_remote


And you are done, my friend! Congrats on completing the DagsHub tutorial. If you have any questions or feedback, feel free to use the online chat or send us an email at contact@dagshub.com.

(∩,,◕◞౪◟◕)⊃━☆+ ゚ .+ .゚.゚。 ゚ 。. +゚ 。゚.゚。☆。。 . 。 o .。゚。.o。 。 .。

See the project on DagsHub