Build Data Pipelines
In this section, we'll learn how to create a reproducible data pipeline.
Versioning large data files and directories for data science is powerful, but often not enough. Data needs to be filtered, cleaned, and transformed before training ML models. For that purpose, we will use a system to define, execute, and track data pipelines: a series of data processing stages that produce a final result.
We can also use pipelines to orchestrate the training and evaluation stages and manage the entire project lifecycle.
Configure DagsHub¶
We'll start by creating a new project on DagsHub and configuring it on our machine.
Set up the project¶
We'll be working on the Urban Sound Classification project, where we develop a neural network model capable of recognizing urban sound events from a set of 10 categories. We'll use the Python librosa
library, which helps us extract essential numerical features from audio clips. These features will serve as the input for training the neural network model.
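To give a sense of what this feature extraction looks like, here is a minimal sketch using librosa. The clip name is taken from the dataset imported below, while `n_mfcc=40` is an illustrative choice rather than the project's confirmed setting:

```python
# Minimal sketch: extract MFCC features from a single clip with librosa.
# The path comes from the dataset below; n_mfcc=40 is an illustrative value.
import librosa
import numpy as np

# Load the clip; librosa resamples to 22,050 Hz by default.
y, sr = librosa.load("audio_data/audio/fold1/101415-3-0-2.wav")

# Compute Mel-frequency cepstral coefficients for each frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

# Average over time to get one fixed-length feature vector per clip.
features = np.mean(mfcc, axis=1)
print(features.shape)  # (40,)
```

A fixed-length vector like this is the kind of input the neural network model consumes.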
Import project¶
We will use the `dvc get` command to download the project's files.
What is `dvc get`?
The `dvc get` command downloads files from a Git repository or DVC storage without tracking them.
- Run the following commands from your CLI:
=== "Mac, Linux, Windows"
    ```
    pip install dvc
    dvc get https://dagshub.com/nirbarazida/urban-audio-classifier requirements.txt --rev 2f60f46
    dvc get https://dagshub.com/nirbarazida/urban-audio-classifier src --rev 2f60f46
    dvc get --rev fold1 https://dagshub.com/nirbarazida/UrbanSound8K-Thin audio_data.tar.gz
    tar -vzxf audio_data.tar.gz
    rm audio_data.tar.gz
    ```
Learn more about the project's structure
The new project structure:
```
.
├── audio_data
│   ├── audio
│   │   └── fold1
│   │       ├── 101415-3-0-2.wav
│   │       ├── 101415-3-0-3.wav
│   │       ├── ...
│   │       └── 99180-9-0-7.wav
│   └── metadata
│       └── UrbanSound8K.csv
├── requirements.txt
└── src
    ├── const.py
    ├── data
    │   ├── generator.py
    │   └── __init__.py
    ├── __init__.py
    └── model
        ├── __init__.py
        └── train.py
```
Install requirements¶
- Run the following command from your CLI:
pip install -r requirements.txt
Version code and data¶
The audio_data directory contains the datasets for this project, which are quite large. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.
Version data files¶
We will use DagsHub Client, which uses DVC under the hood, to version our data.
- We will start by installing DagsHub Client using pip:
pip install dagshub
- Use DagsHub Client to version the data:
dagshub upload --update --message "Add raw Data" "<repo-owner>/<repo-name>" "audio_data/" "audio_data/"
- Pull the new commit created by DagsHub, which contains the DVC pointer file (we'll peek at it in the sketch below):
git pull
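The pointer file is a small YAML document that Git tracks in place of the data itself. As a quick sanity check, you can peek at what it records; note that the filename `audio_data.dvc` and the standard `outs` layout are assumptions here, so match them to what actually appears in your repository:

```python
# Sketch: inspect the DVC pointer file that Git tracks instead of the data.
# Assumes a pointer named "audio_data.dvc" with the standard "outs" layout.
import yaml  # PyYAML

with open("audio_data.dvc") as f:
    pointer = yaml.safe_load(f)

# Each entry records a tracked path and the hash DVC uses to fetch its content.
for out in pointer["outs"]:
    print(out["path"], out.get("md5"), out.get("size"))
```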
Version code files¶
- We will use Git to version all the other files in this project:
```
git add requirements.txt src/
git commit -m "Add requirements and src to Git tracking"
git push
```
Build a data pipeline¶
To build a data pipeline, we can use the `dvc stage add` or `dvc run` commands.
With these commands, we can define the stage's dependencies, outputs, metrics, and more, and specify whether DVC should version each file.
The structure of the pipeline is saved in a `dvc.yaml` file with all the relevant information.
Once the pipeline runs, DVC versions the relevant files and saves their hashes in a file named `dvc.lock`. This file holds the information and version of every file in the pipeline, whether it is versioned by DVC or not.
This way, DVC knows which files in the pipeline were modified, and when you rerun the pipeline (with `dvc repro`), it knows which stage to start from and which stages it can skip because their dependencies weren't modified.
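As a toy illustration of that bookkeeping (not DVC's actual implementation), comparing a file's current hash against the value recorded in `dvc.lock` is enough to tell whether a dependency changed; the recorded hash below is taken from the `dvc.lock` example further down:

```python
# Toy illustration of DVC-style change detection (not DVC's real code):
# hash a dependency and compare it with the md5 recorded in dvc.lock.
import hashlib

def file_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

recorded = "194acee9d0ba6361048568f14d52060f"  # md5 of src/const.py in dvc.lock

if file_md5("src/const.py") == recorded:
    print("src/const.py unchanged - dependent stages can be skipped")
else:
    print("src/const.py changed - the train stage must rerun")
```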
- Run the following command from the CLI:
```
dvc run -n train \
        -d audio_data/audio/ \
        -d audio_data/metadata/UrbanSound8K.csv \
        -d src/data/ \
        -d src/model/ \
        -d src/const.py \
        --outs-persist models \
        -O params.yml \
        -M metrics.csv \
        python3 -m src.model.train
```
What are the `dvc.yaml` and `dvc.lock` files?
This action should generate two new files, `dvc.yaml` and `dvc.lock`, that contain the following:
- `dvc.yaml`

```
$ cat dvc.yaml
stages:
  train:
    cmd: python3 -m src.model.train
    deps:
    - audio_data/audio/
    - audio_data/metadata/UrbanSound8K.csv
    - src/const.py
    - src/data/
    - src/model/
    outs:
    - models:
        persist: true
    - params.yml:
        cache: false
    metrics:
    - metrics.csv:
        cache: false
```
- `dvc.lock`

```
$ cat dvc.lock
schema: '2.0'
stages:
  train:
    cmd: python3 -m src.model.train
    deps:
    - path: audio_data/audio/
      md5: 5beb628dfbf9ece0ef66a1bef896ff93.dir
      size: 807148184
      nfiles: 873
    - path: audio_data/metadata/UrbanSound8K.csv
      md5: 947b97f3241e6f8f956690fe23ac610b
      size: 51769
    - path: src/const.py
      md5: 194acee9d0ba6361048568f14d52060f
      size: 266
    - path: src/data/
      md5: d502927d542948f888f6e18020f43772.dir
      size: 3222
      nfiles: 4
    - path: src/model/
      md5: a7efe80e9af5e4c621d306f559346fdc.dir
      size: 3655
      nfiles: 4
    outs:
    - path: metrics.csv
      md5: ef92ebb7afaef832df13b11b2f4deba6
      size: 140126
    - path: models
      md5: 1ee45889b1a00246af367c53e0f6cff3.dir
      size: 1899629
      nfiles: 5
    - path: params.yml
      md5: 7680ae72213291b15db4ef1ebec17034
      size: 472
```
- We'll version and push the pipeline files using Git by running the following from the CLI:
```
git add dvc.lock dvc.yaml metrics.csv params.yml .gitignore
git commit -m "build a pipeline to train the model"
git push
```
- Last, we'll push the DVC-tracked files:
dvc push -r origin
Results¶
Congratulations! By completing this tutorial, you've built your very first data pipeline. You can now go to your DagsHub repository and see it!