
Defining the Pipeline

Section overview

In this section, we cover creating our basic pipeline, which will use a multi-class Support Vector Machine (SVM) to classify the data. Later on in the tutorial, we will experiment with different solutions, and learn how each variation affects the metrics we are tracking.

Our pipeline will consist of the following stages:

  1. Import training and test data.
    • Output: The train and test splits of the data.
  2. Featurization or pre-processing.
    • Output: Processed data, ready for model training.
  3. Training, in which our model is created and trained based on the training dataset.
    • Output: The trained model, plus training metrics.
  4. Evaluation, in which our model is scored for performance on the testing dataset.
    • Output: Metrics for our model's performance on the test data.

We have chosen two metrics which will be measured throughout our pipeline - the model training time and test accuracy. They are not necessarily the "correct" metrics for a project like this, but are used mainly to show how DVC works with metrics.

DVC uses the command dvc run in order to create stages in the pipeline. A stage is defined by its dependencies, outputs and metrics, as well as the command needed to reproduce the stage. We will see examples for many of these in this tutorial.

Importing the data

We'll use the MNIST datasets in CSV format, as prepared by https://pjreddie.com.

We'll go over two ways to add the data to your code. The most straightforward way is using the dvc import command.

dvc import https://pjreddie.com/media/files/mnist_train.csv data/train_data.csv
dvc import https://pjreddie.com/media/files/mnist_test.csv data/test_data.csv
This will download the file directly from the url and begin tracking it using DVC.

Another way is to download the files to your project folder using wget and then use dvc add to track them.

wget https://pjreddie.com/media/files/mnist_train.csv -O data/train_data.csv
wget https://pjreddie.com/media/files/mnist_test.csv -O data/test_data.csv

dvc add data/train_data.csv
dvc add data/test_data.csv

Info

dvc add is similar to git add - it tells DVC that this is a file whose changes we should be tracking. The immediate effect of this command is that the file is added to .dvc/cache.

Committing progress to Git

Let's check the Git status of the data folder.

$ git status -s
?? data/
?? test_data.csv.dvc
?? train_data.csv.dvc

$ git status -s data/
?? data/.gitignore

$ cat data/.gitignore
/train_data.csv
/test_data.csv
If you used the second method for adding the files, the two .dvc files will be inside the data directory.

As you might have noticed, the data files themselves (which weigh around 130 MB together) are not tracked by Git. Adding them to the .gitignore file is part of what the dvc import and dvc add commands do. These commands also keep the actual file in the DVC cache located in .dvc/cache and, where the file system supports it, create a reflink to the file in its intended location in data/. This means that in practice you don't use twice the space, which is a pretty neat feature.

Now, to commit the changes:

git add .
git commit -m "Imported training and test datasets"
git push

Visualizing changes

Let's go back to our repo page and see how it looks now.

Screenshot
Repo view after importing data

Surprise! We have a DAG with 6 nodes appearing in the Data Pipeline view. Let's dive into the meaning and uses of each node.

The first node is the remote location we imported the data from. If we click on it, it expands and we can see the following details:


Data node

The blue color signifies that this is a generic file node, which is usually data but could be normalization parameters or anything that doesn't fall into any of the other file categories.

The rest of the data represented by this node is as follows:

  1. Icon representing the file location. In our case, the file is downloaded from the internets, and therefore we have a globe icon.
  2. The file name.
  3. Path to the file, in the case of http/s, this is a clickable link to the file's original location.
  4. The location of the file in textual form.

The lower files are the same as these, but represent the data files that are now in our project. Since the working copy is located on our machine, they have an HDD icon, and if you expand them, the path link is disabled due to the fact that they are not available online (Remember, they are not committed to git!). We will solve this in the optional stage Adding a Remote Cache at the end of this section.

The nodes in the middle are stage nodes (or nodes representing .dvc files). Let's expand one and see what details they hold:


DVC node

Here, the gray color represents that this is a stage node, meaning it describes a stage in the pipeline.

The other details represent the following:

  1. In the case of stage nodes, the icon is a terminal icon, representing that this is a command. It's pointless to show a location icon, since .dvc files are always checked into the git repo.
  2. The command that created this stage. We may truncate it if it's too long. In this case, the command is just a dvc add or import. Later in the tutorial, we will see stages where the command is more complex and defined by the user.
  3. Path to the DVC file containing the details of the command. You can click on it to see the contents of the .dvc file in your browser.
  4. The full command used to create the stage. Not much info here in the case of dvc add or import.
  5. Location of the dvc file. Should always be Git.

Now let's move on to create the rest of the pipeline.

Featurization or pre-processing

First, we need to create a python module to pre-process our data. You can download it from this link and save it to your code/ folder.

Here is the code:

"""
Create feature CSVs for train and test datasets
"""
import json
import numpy as np
import pandas as pd


def featurization():
    # Load data-sets as writable NumPy arrays, so the in-place normalization
    # below does not depend on pandas returning a writable view
    print("Loading data sets...")
    train_data = pd.read_csv('./data/train_data.csv', header=None, dtype=float).to_numpy(copy=True)
    test_data = pd.read_csv('./data/test_data.csv', header=None, dtype=float).to_numpy(copy=True)
    print("done.")

    # Normalize the train data
    print("Normalizing data...")
    # We choose all columns except the first, since that is where our labels are
    train_mean = train_data[:, 1:].mean()
    train_std = train_data[:, 1:].std()

    # Normalize train and test data according to the train data distribution
    train_data[:, 1:] -= train_mean
    train_data[:, 1:] /= train_std
    test_data[:, 1:] -= train_mean
    test_data[:, 1:] /= train_std

    print("done.")

    print("Saving processed datasets and normalization parameters...")
    # Save normalized data-sets
    np.save('./data/processed_train_data', train_data)
    np.save('./data/processed_test_data', test_data)

    # Save mean and std for future inference
    with open('./data/norm_params.json', 'w') as f:
        json.dump({'mean': train_mean, 'std': train_std}, f)

    print("done.")


if __name__ == '__main__':
    featurization()
The code loads the training and test datasets using pandas, normalizes them, saves them as .npy files, and saves the normalization parameters in a separate .json file. We'll need those later, for inference in production.
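
To illustrate why the normalization parameters are saved, here is a minimal sketch of how they might be reused at inference time. The function names load_norm_params and normalize_for_inference are hypothetical and not part of the tutorial's code; the only things taken from the tutorial are the file path and the 'mean' and 'std' keys written by featurization.py.

```python
import json
import numpy as np


def load_norm_params(path="./data/norm_params.json"):
    """Read the mean and std that featurization.py saved (hypothetical helper)."""
    with open(path) as f:
        return json.load(f)


def normalize_for_inference(pixels, params):
    """Apply the training-set mean and std to new, unlabeled pixel values."""
    pixels = np.asarray(pixels, dtype=float)
    return (pixels - params["mean"]) / params["std"]
```

New images must be scaled with the training statistics rather than their own, otherwise the model sees inputs from a different distribution than the one it was trained on.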

A few things to note

  • It is important to note file names. The processed data files will be saved as processed_train_data.npy and processed_test_data.npy and the normalization parameters will be saved as norm_params.json. This is important as we'll use these as arguments for the dvc run command.

Now, to create the stage and run the code, use the following command:

dvc run \
-d data/train_data.csv \
-d data/test_data.csv \
-d code/featurization.py \
-o data/norm_params.json \
-o data/processed_train_data.npy \
-o data/processed_test_data.npy \
-f featurization.dvc \
python3 code/featurization.py
The -d flags define dependencies of the stage, and include both data and code files (in our case).

The -o flags define cached outputs. This means that after running the command, DVC will cache these files in .dvc/cache and track their changes (as well as add them to .gitignore).

Uncached outputs

DVC also enables you to define uncached outputs with the -O flag. These files will be tracked by Git instead of living in the .dvc/cache.

The -f specifies what to name the stage file resulting from this dvc run command. It defaults to the name of the first output appearing in the command - in this case norm_params.json.dvc, but it's better practice to give it a descriptive name. If you forget to specify the name in advance, you can always just rename the file after the command completes.

The last line is the command itself. It is possible to input any shell command as the dvc command.

Every dvc run command results in a .dvc stage file. Let's look at the contents of the stage file:

$ cat featurization.dvc
cmd: ' python3 code/featurization.py'
deps:
- md5: 5b49cf1b57fb9d6102b559d59d99df7c
  path: data/train_data.csv
- md5: c807df8d6d804ab2647fc15c3d40f543
  path: data/test_data.csv
- md5: 55f2ab79ee6dad39bd0a96ffff39dc64
  path: code/featurization.py
md5: cd1851464314765e96e47cc3f945669f
outs:
- cache: true
  md5: e46984ac8b7097bfddfe5d9210f78ca4
  path: data/norm_params.json
- cache: true
  md5: 9ee0468925c998fda26d197a14d1caec
  path: data/processed_train_data.npy
- cache: true
  md5: a5257a91e73920bdd4cafd0f88105b74
  path: data/processed_test_data.npy

This file holds details on the stage, including the command which you can see in the Data Pipeline view, as well as the dependencies and outputs. For each of these, an MD5 hash is saved, which is used to determine when a dependency or output has changed and the pipeline should be reproduced. Examples will come later in the tutorial.
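
The idea of hash-based change detection can be sketched in a few lines of Python. This is only an illustration of the principle, not DVC's actual implementation, which may differ in details; file_md5 and has_changed are hypothetical helper names.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_md5):
    """True if the file's current hash differs from the one recorded earlier."""
    return file_md5(path) != recorded_md5
```

If any dependency of a stage reports a change in this sense, the stage's outputs can no longer be trusted and the stage needs to be run again.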

Let's perform another step before committing to Git.

Model training

After processing our data, we now wish to train a multiclass SVM on the training data. We will create a file called code/train_model.py. You can download the complete file from this link.

Here is the code:

"""
Train classification model for MNIST
"""
import json
import pickle
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
import time


def train_model():
    # Measure training time
    start_time = time.time()

    # Load training data
    print("Load training data...")
    train_data = np.load('./data/processed_train_data.npy')

    # Choose a random sample of images from the training data.
    # This is important since SVM training time increases quadratically with the number of training samples.
    print("Choosing smaller sample to shorten training time...")
    # Set a random seed so that we get the same "random" choices when we try to recreate the experiment.
    np.random.seed(42)

    num_samples = 5000
    choice = np.random.choice(train_data.shape[0], num_samples, replace=False)
    train_data = train_data[choice, :]

    # Divide loaded data-set into data and labels
    labels = train_data[:, 0]
    data = train_data[:, 1:]
    print("done.")

    # Define SVM classifier and train model
    print("Training model...")
    model = OneVsRestClassifier(SVC(kernel='linear'), n_jobs=6)
    model.fit(data, labels)
    print("done.")

    # Save model as pkl
    print("Save model and training time metric...")
    with open("./data/model.pkl", 'wb') as f:
        pickle.dump(model, f)

    # End training time measurement
    end_time = time.time()

    # Create metric for model training time
    with open('./metrics/train_metric.json', 'w') as f:
        json.dump({'training_time': end_time - start_time}, f)
    print("done.")


if __name__ == '__main__':
    train_model()
The code loads the processed training data file, takes a smaller random subsample to maintain reasonable training times, divides it into data and labels, and trains a scikit-learn multiclass SVM on it. All the while we measure the training time, as we would like to keep it low (we don't want to waste your time while going through this tutorial ∩(︶▽︶)∩).
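
The role of the random seed can be shown in isolation. The sketch below wraps the same NumPy calls used in train_model.py in a hypothetical helper, sample_rows, so that the effect is easy to check: calling it twice with the same seed should select the same rows.

```python
import numpy as np


def sample_rows(data, num_samples, seed=42):
    """Pick num_samples distinct rows, reproducibly for a given seed."""
    np.random.seed(seed)
    choice = np.random.choice(data.shape[0], num_samples, replace=False)
    return data[choice, :]


if __name__ == "__main__":
    toy = np.arange(40).reshape(20, 2)
    first = sample_rows(toy, 5)
    second = sample_rows(toy, 5)
    # Same seed, same rows: the experiment can be recreated.
    print(np.array_equal(first, second))
```

Without a fixed seed, every run would train on a different subsample, and changes in the metrics could not be attributed to changes in the code or data.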

Now, let's run the training stage:

dvc run \
-d data/processed_train_data.npy \
-d code/train_model.py \
-M metrics/train_metric.json \
-o data/model.pkl \
-f training.dvc \
python3 code/train_model.py

Here, we have a new flag -M which tells DVC to mark the following file as a metric. Metric files are expected to be small and will be checked into git rather than being tracked by DVC. In DVC jargon, they will be uncached.

Committing to Git

Let's check the Git status of the data folder.

$ git status -su
M data/.gitignore
?? code/featurization.py
?? code/train_model.py
?? featurization.dvc
?? metrics/train_metric.json
?? training.dvc
We now have two new pipeline stages, each represented by a .dvc file, as well as the two code files, one for each stage. The metric file will be committed to git as well.

The .gitignore file has been updated to include the DVC tracked files:

$ cat data/.gitignore
/train_data.csv
/test_data.csv
/norm_params.json
/processed_train_data.npy
/processed_test_data.npy

Now, to commit the changes:

git add .
git commit -m "Trained basic multiclass SVM model"
git push

Visualizing changes

First, it's worth noting that in order to view metrics in the command line, you can use the built-in command:

$ dvc metrics show
    metrics/train_metric.json: {"training_time": 34.35910105705261}
So we know that training has taken a bit longer than 34 seconds. We think that's a reasonable time for this tutorial.

Let's go back to our repo page and see how our data pipeline looks now.

Screenshot
Data Pipeline view after processing data and training model

Now THAT is a DAG. Let's quickly go over the flow of our pipeline:

  1. We imported the data from an online https:// address and tracked it locally.
  2. Using featurization.py we took that data and processed it, outputting the processed data as .npy files and a normalization parameters .json file.
  3. We then used our next block of code train_model.py and processed_train_data.npy to train an SVM model, while measuring the training time and tracking it as a metric.

That's it. There are three stages, represented by three levels of .dvc files.
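
To make the idea of "levels" concrete, here is a rough sketch of how an execution order could be derived from the dependencies and outputs declared so far. It is an illustration only and not how DVC is implemented; the STAGES table is transcribed from the commands in this section, execution_order is a hypothetical helper, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each stage file maps to (dependencies, outputs), as declared in this section.
STAGES = {
    "train_data.csv.dvc": ([], ["data/train_data.csv"]),
    "test_data.csv.dvc": ([], ["data/test_data.csv"]),
    "featurization.dvc": (
        ["data/train_data.csv", "data/test_data.csv", "code/featurization.py"],
        ["data/norm_params.json", "data/processed_train_data.npy",
         "data/processed_test_data.npy"],
    ),
    "training.dvc": (
        ["data/processed_train_data.npy", "code/train_model.py"],
        ["data/model.pkl", "metrics/train_metric.json"],
    ),
}


def execution_order(stages):
    """Order stages so that every stage runs after the stages producing its inputs."""
    # Map each output file to the stage that produces it.
    producer = {out: name for name, (_, outs) in stages.items() for out in outs}
    # Map each stage to the set of stages it depends on.
    # Plain source files (such as code) have no producer and are skipped.
    graph = {
        name: {producer[dep] for dep in deps if dep in producer}
        for name, (deps, _) in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STAGES))
```

The two import stages come first, then featurization, then training, which matches the three levels seen in the Data Pipeline view.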

We also have two new types of nodes.

The first is a code node. When expanded we get:


Code node

The only new thing here is the green color which represents that this is a code file.

The purple file represents a metric node:


Metric node

The metric node looks strange and uninformative right now, because we haven't defined its format yet. DAGsHub just shows a truncated view of its raw contents. In the recommended optional step defining metrics more specifically for DVC, we'll fix this to get a much better DAG visualization.

Let's create the final stage of our pipeline.

Evaluating the model

For the evaluation stage we have decided to use the accuracy metric. DVC gives you complete freedom in choosing your metrics, and will display metric details properly as long as they are defined in the same way throughout your project.
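
As a reminder of what the accuracy metric measures, here is a small pure-Python sketch. It is not the scikit-learn implementation used in the tutorial's code, just the underlying definition: the fraction of predictions that match the true labels. The function name accuracy is chosen here for illustration.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that exactly match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


if __name__ == "__main__":
    # Three of four digits are classified correctly.
    print(accuracy([1, 2, 3, 4], [1, 2, 0, 4]))
```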

We will call the evaluation module eval.py. You can download the code from this link and put it in your code folder.

This is the code:

"""
Evaluate model performance
"""
import pickle
import json
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score


def eval_model():
    # Load test data
    print("Loading data and model...")
    test_data = np.load('./data/processed_test_data.npy')

    # Load trained model
    with open('./data/model.pkl', 'rb') as f:
        model = pickle.load(f)
    print("done.")
    # Divide loaded data-set into data and labels
    labels = test_data[:, 0]
    data = test_data[:, 1:]

    # Run model on test data
    print("Running model on test data...")
    predictions = model.predict(data)
    print("done.")

    # Calculate metric scores
    print("Calculating metrics...")
    metrics = {'accuracy': accuracy_score(labels, predictions)}

    # Save metrics to json file
    with open('./metrics/eval.json', 'w') as f:
        json.dump(metrics, f)
    print("done.")


if __name__ == '__main__':
    eval_model()

Now let's use the last dvc run command for this pipeline to create the evaluation stage:

dvc run \
-d data/processed_test_data.npy \
-d data/model.pkl \
-d code/eval.py \
-M metrics/eval.json \
-f Dvcfile \
python3 code/eval.py

Note that instead of the usual .dvc naming convention, we named the stage file Dvcfile. When you name a stage file as Dvcfile, DVC will set it as the default reproducibility target (i.e. the end of the pipeline). We will dive into this topic when we use the dvc repro command in the next section.

Now, let's see what our model's accuracy score is:

$ dvc metrics show -a
master:
    metrics/eval.json: {"accuracy": 0.8583}
    metrics/train_metric.json: {"training_time": 34.35910105705261}
The -a flag shows the metrics for all existing Git branches. We can see that the resulting model's accuracy is about 86%, which isn't bad for something we trained in 34 seconds. We'll try to improve upon both metrics later in the tutorial.


Optional - defining metrics more specifically for DVC

We can tell DVC what type of files we are using in order to store our metrics, and where in the file our metric is stored. We do this using the dvc metrics modify command. Let's add these details to our metrics:

dvc metrics modify metrics/train_metric.json -t json -x training_time
dvc metrics modify metrics/eval.json -t json -x accuracy

Now let's see the (subtle) difference:

$ dvc metrics show -a
master:
    metrics/train_metric.json: [34.35910105705261]
    metrics/eval.json: [0.8583]
This is useful if you store more data in the metric files but want DVC to use only one of the data points as the referenced metric.
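
Conceptually, pointing DVC at a single key is similar to reading the JSON file and picking one field from it. The sketch below shows that idea in plain Python; read_metric is a hypothetical helper, and this is not how DVC itself resolves the path.

```python
import json


def read_metric(path, key):
    """Load a JSON metric file and return only the value stored under key."""
    with open(path) as f:
        return json.load(f)[key]
```

For example, read_metric('./metrics/eval.json', 'accuracy') would return just the accuracy value, ignoring any other fields the file might contain.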


Committing to Git

Let's check the Git status.

$ git status -s
 M training.dvc
?? Dvcfile
?? code/eval.py
?? metrics/eval.json

We have the new files created by the evaluation stage, as well as a modified training.dvc file resulting from the optional stage of metric definition (if you didn't perform it you should only see the last three lines).

Now, to commit the changes:

git add .
git commit -m "Evaluate basic SVM model"
git push

Visualizing changes

Our pipeline is now complete and should look like this in the Data Pipeline view:

Screenshot
Data Pipeline view after processing data and training model

As we can see the eval.py code file, as well as the model.pkl and processed_test_data.npy are used as dependencies for the last stage. The last stage outputs a new metric file eval.json which holds the accuracy metric.

Note how the metric nodes look much nicer, assuming you completed the optional step of configuring their type and xpath:


A metric node with a configured type + xpath. It will prominently display the actual metric name and its value, instead of the metric file's raw content.

Comparing metrics across different experiments and over time

Later on in this tutorial, we'll see how correctly defining our metric files will allow us to get a bird's eye view of the progression of our experiments - how different approaches affect our metrics, and how our metrics evolve over time as we iterate.

The "hard" task of setting up the pipeline is done, and we can now move onto experimenting and reaping the benefits of this setup. This is what we'll do in the next section.

You can also complete the next optional stage, which covers adding a remote cache. This is extremely important for team collaboration, and requires a remote storage account. If this is relevant to you, we recommend not skipping it.


Optional - adding a remote cache

Throughout this tutorial, we have used Git to version and share our code files as well as the .dvc stage files that have been created throughout our pipeline. The code managed by Git has been pushed to our Git remote (hosted on dagshub.com), but the data files managed by DVC are still on our local machine, in the .dvc/cache.

DVC, however, enables you to push your data and models to the cloud, and thus to share them with team members and collaborators, just like pushing your code to a shared Git remote enables you to share code with your teammates.

We will cover how to do that, using Google Storage (the cloud storage service, not to be confused with Google Drive). The same methods with very slight modifications will work with other cloud providers, such as AWS and Azure, and even with just plain old shared directories or SSH servers.

We assume that you have already created a storage bucket for this tutorial, have the appropriate permissions and the URL for it.

In our case we have called the bucket dagshub-tutorial and the link is gs://dagshub-tutorial. You will have viewing permission for our bucket but won't be able to push files to it.

To push our data to a remote cache, we need to do a few things:

Installing the DVC extension

Install the corresponding dvc remote. If you have the command line utility for the cloud service you are using, you can skip this step. Otherwise type in the following command (according to the service you are using):

All extensions:
pip install dvc[all]

AWS - Amazon Web Services:
pip install dvc[s3]

GS - Google Storage:
pip install dvc[gs]

Microsoft Azure:
pip install dvc[azure]

SSH:
pip install dvc[ssh]

After installation, you need to open a new terminal window for the changes to take effect.

Defining the remote in DVC

Define the dvc remote. We do this with one command (don't forget to replace the bucket name with your own bucket):

dvc remote add --local gs_remote gs://dagshub-tutorial
As long as you don't forget the --local flag, this shouldn't affect your Git repo. You can confirm this by running git status and seeing that there are no uncommitted changes.

Why --local?

It is our opinion that the configuration of the remote may vary between team members (working in various environments) and over time (if you switch between cloud providers), therefore it is prudent not to modify the .dvc/config file which is monitored by Git.

We prefer to use the local configuration instead. You can find it in .dvc/config.local, and confirm that it's ignored in .dvc/.gitignore.

That way you don't couple the current environment configuration to the code history. This is the same best practice which naturally occurs when you run git remote add - the configuration is only local to your own working repo, and won't be pushed to any git remote.

Pushing the files to the cloud

Is as simple as one command.

$ dvc push -r gs_remote
This step might take a while.

Profit!

To reap the benefits of doing this while using DAGsHub to host your repo, go to your repo settings and add the link to the bucket under Advanced Settings, in the Local DVC cache URL setting. In our case it looks something like this:


Local DVC cache URL setting

Going back to the nodes that were locally stored, they now have functioning links, and are therefore available for viewing or download for anyone who would want to (provided they have authorization to your bucket, of course).


Path link change after adding remote

We believe this is useful for several reasons:

  • If you want to let someone browse your data and trained models, you can just send them a link to your DAGsHub repo. They don't need to clone or run anything, or sift through undocumented directory structures to find the model they are looking for.
  • The files managed by DVC and pushed to the cloud are immutable - just like a specific version of a file which is saved in a Git commit, even if you continue working and the branch has moved on, you can always go back to some old branch or commit and the download links will still point to the files as they were in the past.
  • By using DVC and DAGsHub, you can preserve your own sanity when running a lot of different experiments in multiple parallel branches. Don't remember where you saved that model which you trained a month ago? Just take a look at your repo, it's a click away. Let software do the grunt work of organizing files, just like those wonderfully lazy software developers do.

Warning

When downloading a link through the graph, it might be saved as a file with the DVC hash as its filename. You can safely change it to the intended filename, including the original extension and it'll work just fine.

Additionally you can use the dvc pull command to retrieve the remote data, which was possibly pushed there by someone else.

dvc pull -r gs_remote


Congratulations on building your first DAGsHub/DVC pipeline. Let's experiment with it and see how DVC enables reproducibility, in the next section.