Build Data Pipelines
In this section, we'll learn how to create a reproducible data pipeline.
Versioning large data files and directories for data science is powerful, but often not enough. Data needs to be filtered, cleaned, and transformed before training ML models. For that purpose, we will use a system to define, execute, and track data pipelines: a series of data processing stages that produce a final result.
We can also use pipelines to orchestrate the training and evaluation stages and manage the entire project lifecycle.
Configure DagsHub¶
We'll start by creating a new project on DagsHub and configuring it on our machine.
Set up the project¶
We'll be working on the Urban Sound Classification project, where we develop a neural network model capable of recognizing urban sound events from a set of 10 categories. We'll use the Python librosa
library, which helps us extract essential numerical features from audio clips. These features will serve as the input for training the neural network model.
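To give a sense of what this feature extraction looks like, here is a minimal sketch using librosa. The clip name is taken from the dataset imported below, while `n_mfcc=40` is an illustrative choice rather than the project's confirmed setting:

```python
# Minimal sketch: extract MFCC features from a single clip with librosa.
# The path comes from the dataset below; n_mfcc=40 is an illustrative value.
import librosa
import numpy as np

# Load the clip; librosa resamples to 22,050 Hz by default.
y, sr = librosa.load("audio_data/audio/fold1/101415-3-0-2.wav")

# Compute Mel-frequency cepstral coefficients for each frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

# Average over time to get one fixed-length feature vector per clip.
features = np.mean(mfcc, axis=1)
print(features.shape)  # (40,)
```

A fixed-length vector like this is the kind of input the neural network model consumes.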
Import project¶
We will use the `dvc get` command to download the project's files.
What is `dvc get`?
The `dvc get` command downloads files from a Git repository or DVC storage without tracking them.
- Run the following commands from your CLI:
=== "Mac, Linux, Windows"
    ```
    pip install dvc
    dvc get https://dagshub.com/nirbarazida/urban-audio-classifier requirements.txt --rev 2f60f46
    dvc get https://dagshub.com/nirbarazida/urban-audio-classifier src --rev 2f60f46
    dvc get --rev fold1 https://dagshub.com/nirbarazida/UrbanSound8K-Thin audio_data.tar.gz
    tar -vzxf audio_data.tar.gz
    rm audio_data.tar.gz
    ```
Learn more about the project's structure
The new project structure:
```
.
├── audio_data
│   ├── audio
│   │   └── fold1
│   │       ├── 101415-3-0-2.wav
│   │       ├── 101415-3-0-3.wav
│   │       ├── ...
│   │       └── 99180-9-0-7.wav
│   └── metadata
│       └── UrbanSound8K.csv
├── requirements.txt
└── src
    ├── const.py
    ├── data
    │   ├── generator.py
    │   └── __init__.py
    ├── __init__.py
    └── model
        ├── __init__.py
        └── train.py
```
Install requirements¶
- Run the following command from your CLI:
pip install -r requirements.txt
Version code and data¶
The audio_data directory contains the datasets for this project, which are quite large. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.
Version data files¶
We will use DagsHub Client, which uses DVC under the hood, to version our data.
- We will start by installing DagsHub Client using pip:
pip install dagshub
- Use DagsHub Client to version the data:
dagshub upload --update --message "Add raw Data" "<repo-owner>/<repo-name>" "audio_data/" "audio_data/"
- Pull the new commit created by DagsHub, which contains the DVC pointer file (we'll peek at it in the sketch below):
git pull
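The pointer file is a small YAML document that Git tracks in place of the data itself. As a quick sanity check, you can peek at what it records; note that the filename `audio_data.dvc` and the standard `outs` layout are assumptions here, so match them to what actually appears in your repository:

```python
# Sketch: inspect the DVC pointer file that Git tracks instead of the data.
# Assumes a pointer named "audio_data.dvc" with the standard "outs" layout.
import yaml  # PyYAML

with open("audio_data.dvc") as f:
    pointer = yaml.safe_load(f)

# Each entry records a tracked path and the hash DVC uses to fetch its content.
for out in pointer["outs"]:
    print(out["path"], out.get("md5"), out.get("size"))
```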
Version code files¶
- We will use Git to version all the other files in this project:
```
git add requirements.txt src/
git commit -m "Add requirements and src to Git tracking"
git push
```
Build a data pipeline¶
To build a data pipeline, we can use the `dvc stage add` or `dvc run` commands.
With these commands, we can define the stage's dependencies, outputs, metrics, and more, and specify whether DVC should version each file.
The structure of the pipeline is saved in a `dvc.yaml` file with all the relevant information.
Once the pipeline runs, DVC versions the relevant files and saves their hashes in a file named `dvc.lock`. This file holds the information and version of every file in the pipeline, whether it is versioned by DVC or not.
This way, DVC knows which files in the pipeline were modified, and when you rerun the pipeline (with `dvc repro`), it knows which stage to start from and which stages it can skip because their dependencies weren't modified.
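As a toy illustration of that bookkeeping (not DVC's actual implementation), comparing a file's current hash against the value recorded in `dvc.lock` is enough to tell whether a dependency changed; the recorded hash below is taken from the `dvc.lock` example further down:

```python
# Toy illustration of DVC-style change detection (not DVC's real code):
# hash a dependency and compare it with the md5 recorded in dvc.lock.
import hashlib

def file_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

recorded = "194acee9d0ba6361048568f14d52060f"  # md5 of src/const.py in dvc.lock

if file_md5("src/const.py") == recorded:
    print("src/const.py unchanged - dependent stages can be skipped")
else:
    print("src/const.py changed - the train stage must rerun")
```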
- Run the following command from the CLI:
```
dvc run -n train \
        -d audio_data/audio/ \
        -d audio_data/metadata/UrbanSound8K.csv \
        -d src/data/ \
        -d src/model/ \
        -d src/const.py \
        --outs-persist models \
        -O params.yml \
        -M metrics.csv \
        python3 -m src.model.train
```
What are the `dvc.yaml` and `dvc.lock` files?
This action should generate two new files, `dvc.yaml` and `dvc.lock`, that contain the following:
- `dvc.yaml`

```
$ cat dvc.yaml
stages:
  train:
    cmd: python3 -m src.model.train
    deps:
    - audio_data/audio/
    - audio_data/metadata/UrbanSound8K.csv
    - src/const.py
    - src/data/
    - src/model/
    outs:
    - models:
        persist: true
    - params.yml:
        cache: false
    metrics:
    - metrics.csv:
        cache: false
```
- `dvc.lock`

```
$ cat dvc.lock
schema: '2.0'
stages:
  train:
    cmd: python3 -m src.model.train
    deps:
    - path: audio_data/audio/
      md5: 5beb628dfbf9ece0ef66a1bef896ff93.dir
      size: 807148184
      nfiles: 873
    - path: audio_data/metadata/UrbanSound8K.csv
      md5: 947b97f3241e6f8f956690fe23ac610b
      size: 51769
    - path: src/const.py
      md5: 194acee9d0ba6361048568f14d52060f
      size: 266
    - path: src/data/
      md5: d502927d542948f888f6e18020f43772.dir
      size: 3222
      nfiles: 4
    - path: src/model/
      md5: a7efe80e9af5e4c621d306f559346fdc.dir
      size: 3655
      nfiles: 4
    outs:
    - path: metrics.csv
      md5: ef92ebb7afaef832df13b11b2f4deba6
      size: 140126
    - path: models
      md5: 1ee45889b1a00246af367c53e0f6cff3.dir
      size: 1899629
      nfiles: 5
    - path: params.yml
      md5: 7680ae72213291b15db4ef1ebec17034
      size: 472
```
- We'll version and push the pipeline files using Git by running the following from the CLI:
```
git add dvc.lock dvc.yaml metrics.csv params.yml .gitignore
git commit -m "build a pipeline to train the model"
git push
```
- Last, we'll push the DVC-tracked files:
dvc push -r origin
Results¶
Congratulations! By completing this tutorial, you've built your very first data pipeline. You can now go to your DagsHub repository and see it!