
Experimentation and Reproducibility

Section overview

Now that we have established our pipeline, it's time to enjoy the fruits of our labor. This mainly comes in the form of easy experimentation and automatic reproducibility.

When we experiment and change parts of our pipeline (e.g. hyperparameters, preprocessing, or even the dataset), DVC automagically detects what has changed and re-runs only the affected stages of the pipeline, reusing the results of unchanged stages to save time.
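The idea behind this change detection can be sketched in a few lines of Python. This is a simplified illustration, not DVC's actual implementation: the `file_md5` and `stage_changed` helpers and the `recorded` dictionary are hypothetical names, standing in for the checksums DVC stores in its .dvc files.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Return the MD5 hex digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_changed(dependencies, recorded):
    """A stage is stale if any dependency's checksum differs from the recorded one."""
    return any(file_md5(path) != recorded.get(path) for path in dependencies)


# Demo: record a checksum, then edit the file and check again.
workdir = tempfile.mkdtemp()
script = os.path.join(workdir, "featurization.py")
with open(script, "w") as f:
    f.write("n_components = 784\n")

recorded = {script: file_md5(script)}
unchanged_result = stage_changed([script], recorded)

with open(script, "w") as f:
    f.write("n_components = 15\n")
changed_result = stage_changed([script], recorded)

print(unchanged_result, changed_result)
```

Comparing checksums rather than timestamps means that a file which is touched but not modified does not trigger a re-run.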

Reproducibility means that after making your changes and telling DVC to recalculate the pipeline, you can dvc push the resulting files to a shared remote. That way, when someone else (or you yourself 3 months from now) checks out the experiment branch, they immediately get all the necessary context to reproduce your results. Remember, even failed experiments can be a useful source of information!

In this section you'll see how great that is.

Screenshot
A DAGnicorn with automagical powers (source: Mark Glancy on Pexels)

This section covers the following experiments:

  • Performing Principal Component Analysis (PCA) to reduce the number of features in the data from 784 to 15.
  • Changing the model from an SVM to a Convolutional Neural Network (CNN) using PyTorch.
  • Merging the chosen model into the master branch.

Let's get started.

Principal Component Analysis

We decided to see what results we get by reducing the number of features in our data, from 784 (28*28 pixels) to 15.

Let's start by creating a new branch.

$ git checkout -b "PCA"

Now, since nothing has really changed, if we use the dvc repro command, nothing will happen.

$ dvc repro
Stage 'train_data.csv.dvc' didn't change.
Stage 'test_data.csv.dvc' didn't change.
Stage 'featurization.dvc' didn't change.
Stage 'training.dvc' didn't change.
Stage 'Dvcfile' didn't change.
Pipeline is up to date. Nothing to reproduce.

But that's not really interesting. Let's start changing our code. For this experiment we edit only the featurization code, featurization.py. Logically, that means we'll need to re-run the featurization stage, followed by model training and evaluation.

The new code can be downloaded from this link.

Here is the code:

"""
Create feature CSVs for train and test datasets
"""
import json
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import pickle
import base64

def featurization():
    # Load data-sets
    print("Loading data sets...")
    train_data = pd.read_csv('./data/train_data.csv', header=None, dtype=float).values
    test_data = pd.read_csv('./data/test_data.csv', header=None, dtype=float).values
    print("done.")

    # Create PCA object of the 15 most important components
    print("Creating PCA object...")
    pca = PCA(n_components=15, whiten=True)
    pca.fit(train_data[:, 1:])

    train_labels = train_data[:, 0].reshape([train_data.shape[0], 1])
    test_labels = test_data[:, 0].reshape([test_data.shape[0], 1])

    train_data = np.concatenate([train_labels, pca.transform(train_data[:, 1:])], axis=1)
    test_data = np.concatenate([test_labels, pca.transform(test_data[:, 1:])], axis=1)
    print("done.")

    # END NEW CODE

    print("Saving processed datasets and normalization parameters...")
    # Save PCA-transformed data sets
    np.save('./data/processed_train_data', train_data)
    np.save('./data/processed_test_data', test_data)

    # Save learned PCA for future inference
    with open('./data/norm_params.json', 'w') as f:
        pca_as_string = base64.encodebytes(pickle.dumps(pca)).decode("utf-8")
        json.dump({ 'pca': pca_as_string }, f)

    print("done.")


if __name__ == '__main__':
    featurization()
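The script stores the fitted PCA as a base64-encoded pickle inside a JSON file, so that it can be reloaded for inference later. A minimal sketch of that round trip follows. The `encode_object` and `decode_object` helpers are hypothetical names, and a plain dictionary stands in for the fitted PCA object so that the sketch runs without scikit-learn.

```python
import base64
import json
import pickle


def encode_object(obj):
    """Pickle an object and encode the bytes as a JSON-safe string."""
    return base64.encodebytes(pickle.dumps(obj)).decode("utf-8")


def decode_object(text):
    """Reverse encode_object: decode the base64 text and unpickle it."""
    return pickle.loads(base64.decodebytes(text.encode("utf-8")))


# A plain dict stands in for the fitted PCA object in this sketch.
stand_in = {"n_components": 15, "whiten": True}

# Same layout as norm_params.json: one "pca" key holding the encoded object.
payload = json.dumps({"pca": encode_object(stand_in)})
restored = decode_object(json.loads(payload)["pca"])
print(restored == stand_in)
```

Since unpickling can execute arbitrary code, a file like this should only be loaded when it comes from a source you trust.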

Now this is where the magic happens. Simply type

dvc repro

The full output of dvc repro should look like this:
Stage 'train_data.csv.dvc' didn't change.
Stage 'test_data.csv.dvc' didn't change.
Dependency 'code/featurization.py' of 'featurization.dvc' changed.
Stage 'featurization.dvc' changed.
Reproducing 'featurization.dvc'
Running command:
     python3 code/featurization.py
Loading data sets...
done.
Creating PCA object...
done.
Saving processed datasets and normalization parameters...
done.
Saving 'data/norm_params.json' to cache '.dvc/cache'.
Saving 'data/processed_train_data.npy' to cache '.dvc/cache'.
Saving 'data/processed_test_data.npy' to cache '.dvc/cache'.
Saving information to 'featurization.dvc'.
Dependency 'data/processed_train_data.npy' of 'training.dvc' changed.
Stage 'training.dvc' changed.
Reproducing 'training.dvc'
Running command:
     python3 code/train_model.py
Load training data...
Choosing smaller sample to shorten training time...
done.
Training model...
done.
Save model and training time metric...
done.
Saving 'data/model.pkl' to cache '.dvc/cache'.
Output 'metrics/train_metric.json' doesn't use cache. Skipping saving.
Saving information to 'training.dvc'.
Dependency 'data/processed_test_data.npy' of 'Dvcfile' changed.
Stage 'Dvcfile' changed.
Reproducing 'Dvcfile'
Running command:
     python3 code/eval.py
Loading data and model...
done.
Running model on test data...
done.
Calculating metrics...
done.
Output 'metrics/eval.json' doesn't use cache. Skipping saving.
Saving information to 'Dvcfile'.

To track the changes with git run:

    git add featurization.dvc training.dvc Dvcfile

DVC checks all .dvc files (or stages) to see what has changed. Since the import stages didn't change, they won't be re-run. The featurization stage did change, so DVC runs it again, followed by the training and evaluation stages that depend on its outputs.
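The way a change propagates downstream can be sketched as a walk over the stages in pipeline order. This is an illustration under stated assumptions, not DVC's real algorithm: `stages_to_rerun` is a hypothetical helper, and the dependency and output lists are reconstructed from the log above (the `code/eval.py` dependency in particular is assumed).

```python
def stages_to_rerun(pipeline, changed_files):
    """Return the stages that must run again, in pipeline order.

    `pipeline` is an ordered list of (stage, dependencies, outputs) tuples.
    A stage is stale if it depends on a changed file or on the output of
    a stale stage.
    """
    dirty = set(changed_files)
    stale = []
    for stage, deps, outs in pipeline:
        if dirty.intersection(deps):
            stale.append(stage)
            dirty.update(outs)  # outputs of a re-run stage count as changed
    return stale


# Assumed dependency graph, listed in execution order.
pipeline = [
    ("train_data.csv.dvc", [], ["data/train_data.csv"]),
    ("test_data.csv.dvc", [], ["data/test_data.csv"]),
    ("featurization.dvc",
     ["code/featurization.py", "data/train_data.csv", "data/test_data.csv"],
     ["data/processed_train_data.npy", "data/processed_test_data.npy",
      "data/norm_params.json"]),
    ("training.dvc",
     ["code/train_model.py", "data/processed_train_data.npy"],
     ["data/model.pkl", "metrics/train_metric.json"]),
    ("Dvcfile",
     ["code/eval.py", "data/model.pkl", "data/processed_test_data.npy"],
     ["metrics/eval.json"]),
]

after_featurization_edit = stages_to_rerun(pipeline, ["code/featurization.py"])
after_training_edit = stages_to_rerun(pipeline, ["code/train_model.py"])
print(after_featurization_edit)
print(after_training_edit)
```

Editing the featurization code invalidates three stages, while editing only the training code leaves the featurization stage untouched.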

Committing to Git

Let's commit this change to Git:

$ git status -s
 M Dvcfile
 M code/featurization.py
 M featurization.dvc
 M metrics/eval.json
 M metrics/train_metric.json
 M training.dvc

$ git add .
$ git commit -m "Performed PCA to choose 15 features"
$ git push origin PCA

Visualizing changes

Now that we committed the results to Git, we can see the change in metrics.

$ dvc metrics show -a
PCA:
    metrics/train_metric.json: [3.2777750492095947]
    metrics/eval.json: [0.804]
master:
    metrics/train_metric.json: [34.35910105705261]
    metrics/eval.json: [0.8583]

This means that our accuracy dropped from about 86% to about 80%, but our training time was reduced to roughly one tenth of the original time!


Optional - pushing to the remote cache

If you performed the optional stage, now would be a good time to push the updated files to the cloud.

To do this, simply type the command again.

$ dvc push -r gs_remote

Notice that this time, only the changed files are uploaded to the cloud. The imported data, for example, doesn't need to be pushed again, which saves a lot of time.
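Conceptually, an incremental push is a set difference between the checksums in the local cache and those already on the remote. The sketch below illustrates that idea only. `files_to_upload` and the short checksums are made up for the example, and DVC's real logic is more involved.

```python
def files_to_upload(local_cache, remote_cache):
    """Return the checksums present locally but missing from the remote."""
    return sorted(set(local_cache) - set(remote_cache))


# Hypothetical checksums: the remote already holds the imported data.
remote = {"aaa111", "bbb222", "ccc333"}
local = remote | {"ddd444", "eee555"}  # new processed data and model

pending = files_to_upload(local, remote)
print(pending)
```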


Experiment conclusion

Let's assume that this is not good enough for us. We will go back to the master branch and create a new branch for the next experiment.

Convolutional Neural Network

We now turn to the awesome power of neural networks to tackle this digit classification problem.

Screenshot
A cool image of a neuron. So exciting! (source: ColiN00B on Pixabay)

The code in this part is based on the code from the PyTorch examples repo.

Creating a new branch from the master branch

We'd like to start fresh from our original pipeline (before the introduction of PCA). For that we need to do two things.

git checkout -b CNN master
dvc checkout

The first command is the regular Git checkout that branches from the master.

After checking out the Git files, our DVC-tracked files (data, model and metrics) still hold the contents from the PCA branch. The second command, dvc checkout, takes care of that: DVC looks up the appropriate hash in our cache folder and restores the matching files to the working copy.
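To make the hash lookup concrete, here is a minimal sketch of a content-addressed cache. It is a simplified model, not DVC's code: the helper names are hypothetical, and the two-character directory prefix is an assumption about the cache layout that may differ between DVC versions.

```python
import hashlib
import os
import shutil
import tempfile


def cache_path(cache_dir, checksum):
    """Map a checksum to its location in a content-addressed cache."""
    return os.path.join(cache_dir, checksum[:2], checksum[2:])


def save_to_cache(cache_dir, path):
    """Copy a file into the cache under its MD5 checksum and return the checksum."""
    with open(path, "rb") as f:
        checksum = hashlib.md5(f.read()).hexdigest()
    target = cache_path(cache_dir, checksum)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.copyfile(path, target)
    return checksum


def checkout(cache_dir, checksum, path):
    """Restore the cached version identified by `checksum` to `path`."""
    shutil.copyfile(cache_path(cache_dir, checksum), path)


workdir = tempfile.mkdtemp()
cache_dir = os.path.join(workdir, "cache")
model = os.path.join(workdir, "model.pkl")

# Two versions of the same file end up as two entries in the cache.
with open(model, "w") as f:
    f.write("SVM model from master")
master_checksum = save_to_cache(cache_dir, model)

with open(model, "w") as f:
    f.write("SVM model trained on PCA features")
save_to_cache(cache_dir, model)

# Switching back to master: restore the file recorded for that branch.
checkout(cache_dir, master_checksum, model)
with open(model) as f:
    restored = f.read()
print(restored)
```

Git only has to track the small file that records the checksum, while the large file itself lives in the cache.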

Automate dvc checkout

If you want DVC to automatically checkout whenever you switch Git branches, use the handy dvc install command.

To verify that we are indeed working on a copy of the last master branch commit, you can perform:

$ dvc repro
Stage 'train_data.csv.dvc' didn't change.
Stage 'test_data.csv.dvc' didn't change.
Stage 'featurization.dvc' didn't change.
Stage 'training.dvc' didn't change.
Stage 'Dvcfile' didn't change.
Pipeline is up to date. Nothing to reproduce.

Updating requirements

For this experiment, we are going to use PyTorch, and therefore need to install this package. Simply type:

pip install torch==1.0.0 && pip freeze > requirements.txt

Modifying the training code

We need to make a few changes here. First let's create a new code file named my_torch_model.py which will contain a class definition of our CNN. You can download the complete code from this link.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

Explanation of the network structure

This is a neural network consisting of 2 convolutional layers and 2 fully connected layers. After each convolutional layer we apply a ReLU activation function, followed by 2x2 max pooling. The tensor is then reshaped to fit the input of the fully connected layers. It passes through the first fully connected layer followed by another ReLU, and finally through the last fully connected layer. We apply a log_softmax, which results in a tensor holding a log-probability estimate for each class given the current input. We will later take the maximum of these estimates and use its index as the classification result.
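The 4*4*50 input size of the first fully connected layer follows from the layer sizes above. A short worked calculation, using the standard output-size formula for convolution and pooling windows (the `conv_out` helper is a hypothetical name introduced for this sketch):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                  # MNIST images are 28x28
size = conv_out(size, kernel=5)            # conv1: 28 -> 24
size = conv_out(size, kernel=2, stride=2)  # max pool: 24 -> 12
size = conv_out(size, kernel=5)            # conv2: 12 -> 8
size = conv_out(size, kernel=2, stride=2)  # max pool: 8 -> 4

channels = 50                              # output channels of conv2
flat_features = size * size * channels
print(size, flat_features)                 # 4 800
```

If the kernel sizes or the input resolution change, the value passed to `nn.Linear` and `x.view` has to be recomputed the same way.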

Next, let's modify the code for our model training. The complete code can be found in this link.

The new code has a lot of changes, so here is a view of the whole train_model.py after the changes:

"""
Train classification model for MNIST
"""
import json
import pickle
import numpy as np
import time

# New imports
import torch
import torch.utils.data
import torch.nn.functional as F
import torch.optim as optim

from my_torch_model import Net

# New function
def train(model, device, train_loader, optimizer, epoch):
    log_interval = 100
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def train_model():
    # Measure training time
    start_time = time.time()

    # Setting up network
    print("Setting up Params...")
    device = torch.device("cpu")
    batch_size = 64
    epochs = 3
    learning_rate = 0.01
    momentum = 0.5
    print("done.")

    # Load training data
    print("Load training data...")
    train_data = np.load('./data/processed_train_data.npy')

    # Divide loaded data-set into data and labels
    labels = torch.Tensor(train_data[:, 0]).long()
    data = torch.Tensor(train_data[:, 1:].reshape([train_data.shape[0], 1, 28, 28]))
    torch_train_data = torch.utils.data.TensorDataset(data, labels)
    train_loader = torch.utils.data.DataLoader(torch_train_data,
                                               batch_size=batch_size,
                                               shuffle=True)
    print("done.")

    # Define CNN classifier and train model
    print("Training model...")
    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(),
                          lr=learning_rate,
                          momentum=momentum)

    for epoch in range(1, epochs + 1):
        train(model, device, train_loader, optimizer, epoch)
    print("done.")

    # Save model as pkl
    print("Save model and training time metric...")
    with open("./data/model.pkl", 'wb') as f:
        pickle.dump(model, f)

    # End training time measurement
    end_time = time.time()

    # Create metric for model training time
    with open('./metrics/train_metric.json', 'w') as f:
        json.dump({'training_time': end_time - start_time}, f)
    print("done.")


if __name__ == '__main__':
    train_model()

The main changes are the train() function and the train_loader which produces batches of 64 images for training. We train for three epochs using an SGD (Stochastic Gradient Descent) optimizer.
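The batch size also determines what the progress lines printed by train() look like. The arithmetic can be sketched as follows, assuming the full MNIST training set of 60000 images (the size reported in the training output):

```python
import math

dataset_size = 60000   # training images
batch_size = 64
log_interval = 100     # train() prints every 100th batch

batches_per_epoch = math.ceil(dataset_size / batch_size)
logged_batches = list(range(0, batches_per_epoch, log_interval))

# Reproduce the bracketed part of the first few progress lines.
progress = []
for batch_idx in logged_batches[:3]:
    seen = batch_idx * batch_size
    percent = 100.0 * batch_idx / batches_per_epoch
    progress.append("[{}/{} ({:.0f}%)]".format(seen, dataset_size, percent))

print(batches_per_epoch, len(logged_batches))
print(progress)
```

With 938 batches per epoch, a line is printed for batches 0, 100, ..., 900, so each epoch produces ten progress lines.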

Finally, we must change eval.py to properly evaluate our new model. The complete code can be found here.

Here is the new code:

"""
Evaluate model performance
"""
import pickle
import json
import numpy as np
from sklearn.metrics import accuracy_score
import torch


def eval_model():
    # Load test data
    print("Loading data and model...")
    test_data = np.load('./data/processed_test_data.npy')

    # Load trained model
    with open('./data/model.pkl', 'rb') as f:
        model = pickle.load(f)

    # Switch model to evaluation (inference) mode
    model.eval()

    print("done.")

    # Divide loaded data-set into data and labels
    labels = test_data[:, 0]
    data = torch.Tensor(test_data[:, 1:].reshape([test_data.shape[0], 1, 28, 28]))

    # Run model on test data
    print("Running model on test data...")
    predictions = model(data).max(1, keepdim=True)[1].cpu().data.numpy()
    print("done.")

    # Calculate metric scores
    print("Calculating metrics...")
    metrics = {'accuracy': accuracy_score(labels, predictions)}

    # Save metrics to json file
    with open('./metrics/eval.json', 'w') as f:
        json.dump(metrics, f)
    print("done.")


if __name__ == '__main__':
    eval_model()

The changes here are mainly "cosmetic": we swap the data types and calls for the ones required to test a PyTorch model.
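The prediction step, taking the class with the largest log-probability and comparing it with the label, can be illustrated without PyTorch. The following is a dependency-free sketch with made-up scores. The helper functions are hypothetical stand-ins for `F.log_softmax`, `Tensor.max` and `accuracy_score`, not the code used in eval.py.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax of a list of raw class scores."""
    peak = max(scores)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_sum for s in scores]


def predict(log_probs):
    """Index of the largest log-probability, i.e. the predicted class."""
    return max(range(len(log_probs)), key=log_probs.__getitem__)


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up raw scores for three inputs and three classes.
raw_scores = [[2.0, 0.5, 0.1], [0.2, 0.3, 3.0], [1.0, 1.5, 0.2]]
labels = [0, 2, 0]

predictions = [predict(log_softmax(row)) for row in raw_scores]
score = accuracy(labels, predictions)
print(predictions, score)
```

Because log_softmax is monotonic, the predicted class is the same as the index of the largest raw score. The log-probabilities matter for the training loss, not for the argmax.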

Reproduction

Now that that's out of the way, let's reproduce the model. Here we expect only the training and evaluation stages to be re-run.

dvc repro

The full output of dvc repro should look like this:
Stage 'train_data.csv.dvc' didn't change.
Stage 'test_data.csv.dvc' didn't change.
Stage 'featurization.dvc' didn't change.
Dependency 'code/train_model.py' of 'training.dvc' changed.
Output 'data/model.pkl' of 'training.dvc' changed.
Stage 'training.dvc' changed.
Reproducing 'training.dvc'
Running command:
     python3 code/train_model.py
Setting up Params...
done.
Load training data...
done.
Training model...
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.318451
Train Epoch: 1 [6400/60000 (11%)]   Loss: 0.618438
Train Epoch: 1 [12800/60000 (21%)]  Loss: 0.250020
Train Epoch: 1 [19200/60000 (32%)]  Loss: 0.320296
Train Epoch: 1 [25600/60000 (43%)]  Loss: 0.171401
Train Epoch: 1 [32000/60000 (53%)]  Loss: 0.284555
Train Epoch: 1 [38400/60000 (64%)]  Loss: 0.124762
Train Epoch: 1 [44800/60000 (75%)]  Loss: 0.198001
Train Epoch: 1 [51200/60000 (85%)]  Loss: 0.055579
Train Epoch: 1 [57600/60000 (96%)]  Loss: 0.093504
Train Epoch: 2 [0/60000 (0%)]   Loss: 0.043873
Train Epoch: 2 [6400/60000 (11%)]   Loss: 0.040321
Train Epoch: 2 [12800/60000 (21%)]  Loss: 0.118847
Train Epoch: 2 [19200/60000 (32%)]  Loss: 0.069297
Train Epoch: 2 [25600/60000 (43%)]  Loss: 0.121079
Train Epoch: 2 [32000/60000 (53%)]  Loss: 0.186296
Train Epoch: 2 [38400/60000 (64%)]  Loss: 0.117776
Train Epoch: 2 [44800/60000 (75%)]  Loss: 0.130233
Train Epoch: 2 [51200/60000 (85%)]  Loss: 0.091888
Train Epoch: 2 [57600/60000 (96%)]  Loss: 0.119703
Train Epoch: 3 [0/60000 (0%)]   Loss: 0.155649
Train Epoch: 3 [6400/60000 (11%)]   Loss: 0.078831
Train Epoch: 3 [12800/60000 (21%)]  Loss: 0.033430
Train Epoch: 3 [19200/60000 (32%)]  Loss: 0.199271
Train Epoch: 3 [25600/60000 (43%)]  Loss: 0.109838
Train Epoch: 3 [32000/60000 (53%)]  Loss: 0.242237
Train Epoch: 3 [38400/60000 (64%)]  Loss: 0.024396
Train Epoch: 3 [44800/60000 (75%)]  Loss: 0.076162
Train Epoch: 3 [51200/60000 (85%)]  Loss: 0.044260
Train Epoch: 3 [57600/60000 (96%)]  Loss: 0.071618
done.
Save model and training time metric...
done.
Saving 'data/model.pkl' to cache '.dvc/cache'.
Output 'metrics/train_metric.json' doesn't use cache. Skipping saving.
Saving information to 'training.dvc'.
Dependency 'data/model.pkl' of 'Dvcfile' changed.
Stage 'Dvcfile' changed.
Reproducing 'Dvcfile'
Running command:
     python3 code/eval.py
Loading data and model...
done.
Running model on test data...
done.
Calculating metrics...
done.
Output 'metrics/eval.json' doesn't use cache. Skipping saving.
Saving information to 'Dvcfile'.

To track the changes with git run:

    git add training.dvc Dvcfile

Everything should run smoothly.

Committing to Git

Let's commit this experiment to Git:

$ git status -s
 M Dvcfile
 M code/eval.py
 M code/train_model.py
 M metrics/eval.json
 M metrics/train_metric.json
 M requirements.txt
 M training.dvc
?? code/my_torch_model.py

$ git add .
$ git commit -m "Experiment with CNN model"
$ git push origin CNN

Visualizing changes

Now that we committed the results to Git, we can see the change in metrics.

$ dvc metrics show -a
CNN:
    metrics/train_metric.json: [150.54586172103882]
    metrics/eval.json: [0.9845]
PCA:
    metrics/train_metric.json: [3.2777750492095947]
    metrics/eval.json: [0.804]
master:
    metrics/train_metric.json: [34.35910105705261]
    metrics/eval.json: [0.8583]

Or in the DAGsHub branches view:

Screenshot
You can compare metrics across different branches to see how different, parallel approaches compare with each other. Here we are looking at the accuracy metric.

Or in the DAGsHub commits view:

Screenshot
Looking at metrics in the commits view allows you to see the evolution of your metrics over time. Here we are looking at the training time metric.

This model took much longer to train, but scored over 12 percentage points higher in accuracy than our best previous result. Let's merge this branch into master.
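The trade-off can be quantified directly from the numbers reported by dvc metrics show. A small sketch, with the values copied from the output above (the `comparison` dictionary is a name introduced for this example):

```python
metrics = {
    "master": {"training_time": 34.35910105705261, "accuracy": 0.8583},
    "PCA": {"training_time": 3.2777750492095947, "accuracy": 0.804},
    "CNN": {"training_time": 150.54586172103882, "accuracy": 0.9845},
}

baseline = metrics["master"]
comparison = {}
for branch in ("PCA", "CNN"):
    m = metrics[branch]
    comparison[branch] = {
        # Difference in accuracy, in percentage points
        "points": 100 * (m["accuracy"] - baseline["accuracy"]),
        # Training time relative to master
        "time_ratio": m["training_time"] / baseline["training_time"],
    }
    print("{}: {:+.2f} accuracy points, {:.2f}x training time".format(
        branch, comparison[branch]["points"], comparison[branch]["time_ratio"]))
```

The CNN gains about 12.6 percentage points of accuracy for roughly 4.4 times the training time, while PCA gives up about 5.4 points for roughly a tenth of the training time.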

Merging the chosen model to master

To merge, perform the usual commands:

git checkout master
git merge CNN
git push

To make sure we have all the updated files, let's check out the relevant DVC files as well, and try dvc repro.

$ dvc checkout

$ dvc repro
Stage 'train_data.csv.dvc' didn't change.
Stage 'test_data.csv.dvc' didn't change.
Stage 'featurization.dvc' didn't change.
Stage 'training.dvc' didn't change.
Stage 'Dvcfile' didn't change.
Pipeline is up to date. Nothing to reproduce.

Success!


Optional - pushing to the remote cache

If you performed the optional stage, now would be a good time to push the updated files for all branches to the cloud.

To push all branches into the remote use the following command:

dvc push -a -r gs_remote


And you are done, my friend! Congrats on completing the DAGsHub tutorial. If you have any questions, or feedback, feel free to use the online chat or send us an email at contact@dagshub.com.

(∩,,◕◞౪◟◕)⊃━☆+ ゚ .+ .゚.゚。 ゚ 。. +゚ 。゚.゚。☆。。 . 。 o .。゚。.o。 。 .。