Build Data Pipelines

In this section, we'll learn how to create a reproducible data pipeline.

Versioning large data files and directories for data science is powerful, but often not enough. Data needs to be filtered, cleaned, and transformed before training ML models. For that purpose, we will use DVC to define, execute, and track data pipelines - a series of data processing stages that produce a final result.

We can extend their usage to orchestrate the training and evaluation stages and manage the entire project lifecycle.
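
To make the idea concrete, here is an illustrative sketch of a two-stage pipeline defined with dvc stage add. The prepare stage, the src.data.prepare module, and the features/ directory are hypothetical names used for illustration only; the tutorial itself builds a single train stage below:

    # Hypothetical "prepare" stage: extract features from the raw audio
    dvc stage add -n prepare \
        -d audio_data/audio/ \
        -o features/ \
        python3 -m src.data.prepare

    # Hypothetical "train" stage: consume the extracted features
    dvc stage add -n train \
        -d features/ \
        -o models/ \
        python3 -m src.model.train

DVC records both stages in dvc.yaml, and rerunning the pipeline re-executes only the stages whose dependencies changed.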

Configure DagsHub

We'll start by creating a new project on DagsHub and configuring it on our machine.

Set up the project

We'll be working on the Urban Sound Classification project, where we develop a neural network model capable of recognizing urban sound events from a set of 10 categories. We'll use the Python librosa library, which helps us extract essential numerical features from audio clips. These features will serve as the input for training the neural network model.

Import project

We will use the dvc get command to download the project's files.

What is dvc get?

The dvc get command downloads files from a Git repository or DVC storage without tracking them.

  • Run the following commands from your CLI:
    pip install dvc
    dvc get https://dagshub.com/nirbarazida/urban-audio-classifier requirements.txt --rev 2f60f46
    dvc get https://dagshub.com/nirbarazida/urban-audio-classifier src --rev 2f60f46
    dvc get --rev fold1 https://dagshub.com/nirbarazida/UrbanSound8K-Thin audio_data.tar.gz
    tar -vzxf audio_data.tar.gz
    rm audio_data.tar.gz
    
Learn more about the project's structure

The new project structure:

.
├── audio_data
│   ├── audio
│   │   └── fold1
│   │       ├── 101415-3-0-2.wav
│   │       ├── 101415-3-0-3.wav
│   │       ├── ...
│   │       └── 99180-9-0-7.wav
│   └── metadata
│       └── UrbanSound8K.csv
├── requirements.txt
└── src
    ├── const.py
    ├── data
    │   ├── generator.py
    │   └── __init__.py
    ├── __init__.py
    └── model
        ├── __init__.py
        └── train.py

Install requirements

  • Run the following commands from your CLI:

    pip install -r requirements.txt
    

Version code and data

The data directory contains the datasets for this project, which are quite big. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.

Version data files

We will use DagsHub Client to version our data, which uses DVC under the hood.

  • We will start by installing DagsHub Client using pip

    pip install dagshub
    
  • Use DagsHub Client to version the data. The last two arguments are the path of the local directory and its target path in the repository:

    dagshub upload --update --message "Add raw Data" "<repo-owner>/<repo-name>" "audio_data/" "audio_data/"
    
  • Pull the new commit created by DagsHub, which contains the DVC pointer file:

    git pull
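
The pulled commit adds a DVC pointer file for the directory. Assuming it is named audio_data.dvc, it looks roughly like this (the hash and size values below are placeholders, not real output):

    $ cat audio_data.dvc
    outs:
    - md5: <directory-hash>.dir
      size: <size-in-bytes>
      nfiles: <number-of-files>
      path: audio_data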
    

Version code files

  • We will use Git to version all the other files we have in this project

    git add requirements.txt src/
    git commit -m "Add requirements and src to Git tracking"
    git push
    

Build a data pipeline

To build a data pipeline, we can use the dvc stage add or the dvc run commands. With these commands, we can define a stage's dependencies, outputs, metrics, and more, and specify whether or not DVC should version each file.

The structure of the pipeline is saved in a dvc.yaml file with all the relevant information.

Once the pipeline is run, DVC versions the relevant files and records their hashes in a file named dvc.lock. This file holds the version information of all the files in the pipeline - whether they are versioned by DVC or not.

This way, DVC knows which files in the pipeline were modified, and when you'd like to rerun the pipeline, it knows which stage to start from and which stages it can skip because their inputs weren't modified.
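
Once the pipeline below is in place, you can inspect and exploit this incremental behavior with a few standard DVC commands:

    dvc status   # report which stages have modified dependencies
    dvc repro    # rerun the pipeline, skipping stages that are up to date
    dvc dag      # visualize the stage graph defined in dvc.yaml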

  • Run the following command from the CLI. Here, --outs-persist declares the models directory as an output that isn't deleted before reruns, -O declares params.yml as an output that DVC doesn't cache (Git tracks it instead), and -M marks metrics.csv as an uncached metrics file:

    dvc run -n train \
    -d audio_data/audio/ \
    -d audio_data/metadata/UrbanSound8K.csv \
    -d src/data/ \
    -d src/model/ \
    -d src/const.py \
    --outs-persist models \
    -O params.yml \
    -M metrics.csv \
    python3 -m src.model.train
    
What are the dvc.yaml and dvc.lock files?

This action should generate two new files, dvc.yaml and dvc.lock, that contain the following:

  • dvc.yaml
    $ cat dvc.yaml
    stages:
      train:
        cmd: python3 -m src.model.train
        deps:
        - audio_data/audio/
        - audio_data/metadata/UrbanSound8K.csv
        - src/const.py
        - src/data/
        - src/model/
        outs:
        - models:
            persist: true
        - params.yml:
            cache: false
        metrics:
        - metrics.csv:
            cache: false
    
  • dvc.lock
    $ cat dvc.lock
    schema: '2.0'
    stages:
      train:
        cmd: python3 -m src.model.train
        deps:
        - path: audio_data/audio/
          md5: 5beb628dfbf9ece0ef66a1bef896ff93.dir
          size: 807148184
          nfiles: 873
        - path: audio_data/metadata/UrbanSound8K.csv
          md5: 947b97f3241e6f8f956690fe23ac610b
          size: 51769
        - path: src/const.py
          md5: 194acee9d0ba6361048568f14d52060f
          size: 266
        - path: src/data/
          md5: d502927d542948f888f6e18020f43772.dir
          size: 3222
          nfiles: 4
        - path: src/model/
          md5: a7efe80e9af5e4c621d306f559346fdc.dir
          size: 3655
          nfiles: 4
        outs:
        - path: metrics.csv
          md5: ef92ebb7afaef832df13b11b2f4deba6
          size: 140126
        - path: models
          md5: 1ee45889b1a00246af367c53e0f6cff3.dir
          size: 1899629
          nfiles: 5
        - path: params.yml
          md5: 7680ae72213291b15db4ef1ebec17034
          size: 472
    
  • We'll version and push the pipeline files using Git by running the following from the CLI:

    git add dvc.lock dvc.yaml metrics.csv params.yml .gitignore
    git commit -m "build a pipeline to train the model"
    git push
    
  • Lastly, we'll push the DVC-tracked files:

    dvc push -r origin
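
Note that dvc push assumes a DVC remote named origin is configured for this repository. If it isn't, a typical DagsHub remote setup looks like the following (the placeholders stand for your own repository details and access token):

    dvc remote add origin https://dagshub.com/<repo-owner>/<repo-name>.dvc
    dvc remote modify origin --local auth basic
    dvc remote modify origin --local user <your-dagshub-username>
    dvc remote modify origin --local password <your-dagshub-token>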
    

Results

Congratulations! By completing this tutorial, you've built your very first data pipeline. You can now go to your DagsHub repository and see it!

(Screenshot: the data pipeline as shown on the DagsHub repository page)

See the project on DagsHub