In this section, we'll learn how to create a reproducible data pipeline.
Versioning large data files and directories for data science is powerful, but often not enough. Data needs to be filtered, cleaned, and transformed before training ML models. For that purpose, we will use a system to define, execute, and track data pipelines: a series of data processing stages that produce a final result.
We can extend their usage to orchestrate the training and evaluation stages and manage the entire project lifecycle.
We'll start by creating a new project on DagsHub and configuring it on our machine.
We'll be working on the Urban Sound Classification project, where we develop a neural network model capable of recognizing urban sound events from a set of 10 categories. We'll use the Python `librosa` library, which helps us extract essential numerical features from audio clips. These features will serve as the input for training the neural network model.
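To give a feel for what "numerical features from audio clips" means, here is a deliberately simplified, NumPy-only sketch: it summarizes a waveform as average magnitude per frequency band. It is a stand-in for what the project actually does with librosa (e.g. `librosa.feature.mfcc`), not the project's real feature extraction; the synthetic clip and band count are illustrative choices.

```python
import numpy as np

def simple_spectral_features(waveform: np.ndarray, n_bands: int = 10) -> np.ndarray:
    """Summarize a 1-D waveform as the average magnitude per frequency band.

    A simplified stand-in for librosa features such as MFCCs, which apply
    mel filterbanks and a DCT instead of these coarse, evenly split bands.
    """
    spectrum = np.abs(np.fft.rfft(waveform))         # magnitude spectrum
    bands = np.array_split(spectrum, n_bands)        # coarse frequency bands
    return np.array([band.mean() for band in bands]) # one value per band

# A synthetic one-second "clip" at 8 kHz: a 440 Hz tone plus light noise
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
clip = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(8000)

features = simple_spectral_features(clip)
print(features.shape)  # (10,): a fixed-size vector a neural network can consume
```

The key property is that clips of any length collapse to a fixed-size vector, which is what makes them usable as neural network input.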
We will use the `dvc get` command, which downloads files from a Git repository or DVC storage without tracking them.
??? info "What is `dvc get`?"
    The `dvc get` command downloads files from a Git repository or DVC storage without tracking them.
```bash
pip install dvc
dvc get https://dagshub.com/nirbarazida/urban-audio-classifier requirements.txt --rev 2f60f46
dvc get https://dagshub.com/nirbarazida/urban-audio-classifier src --rev 2f60f46
dvc get --rev fold1 https://dagshub.com/nirbarazida/UrbanSound8K-Thin audio_data.tar.gz
tar -vzxf audio_data.tar.gz
rm audio_data.tar.gz
```
??? checkpoint "Learn more about the project's structure"
    The new project structure:
    ```bash
    .
    ├── audio_data
    │   ├── audio
    │   │   └── fold1
    │   │       ├── 101415-3-0-2.wav
    │   │       ├── 101415-3-0-3.wav
    │   │       ├── ...
    │   │       └── 99180-9-0-7.wav
    │   └── metadata
    │       └── UrbanSound8K.csv
    ├── requirements.txt
    └── src
        ├── const.py
        ├── data
        │   ├── generator.py
        │   └── __init__.py
        ├── __init__.py
        └── model
            ├── __init__.py
            └── train.py
    ```
Run the following commands from your CLI:
=== "Mac, Linux, Windows"

    ```bash
    pip install -r requirements.txt
    ```
The data directory contains the datasets for this project, which are quite large. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.
We will use DagsHub Client to version our data, which uses DVC under the hood.
We will start by installing the DagsHub Client using `pip`:

=== "Mac, Linux, Windows"

    ```bash
    pip install dagshub
    ```
Use the DagsHub Client to version the data:

=== "Mac, Linux, Windows"

    ```bash
    dagshub upload --update --message "Add raw Data" "<repo-owner>/<repo-name>" "audio_data/" "audio_data/"
    ```
Pull the new commit created by DagsHub, which contains the DVC pointer file:

=== "Mac, Linux, Windows"

    ```bash
    git pull
    ```
We will use Git to version all the other files in this project:

=== "Mac, Linux, Windows"

    ```bash
    git add requirements.txt src/
    git commit -m "Add requirements and src to Git tracking"
    git push
    ```
To build a data pipeline, we can use the `dvc stage add` or `dvc run` commands. With these commands, we can define the stage's dependencies, outputs, metrics, and more, and specify whether or not DVC should version each file.
The structure of the pipeline is saved in a `dvc.yaml` file with all the relevant information. Once the pipeline runs, DVC versions the relevant files and saves their pointers in a file named `dvc.lock`. This file holds the information and version of all the files in the pipeline, whether they are versioned by DVC or not. This way, DVC knows which files in the pipeline were modified, and when you rerun it, DVC knows which stage to start from and which stages it can skip because their inputs weren't modified.
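Conceptually, that skip logic boils down to comparing each dependency's current hash against the hash recorded in the lock file. The sketch below illustrates the idea; the helper names and the in-memory "lock" dict are hypothetical, not DVC's actual implementation.

```python
import hashlib
import tempfile
from pathlib import Path

def file_md5(path: Path) -> str:
    """MD5 of a file's contents, like the hashes DVC records in dvc.lock."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def stage_needs_rerun(deps: list[Path], lock: dict[str, str]) -> bool:
    """A stage must rerun if any dependency's hash differs from the lock."""
    return any(lock.get(str(p)) != file_md5(p) for p in deps)

# Demo: a temporary file stands in for a tracked dependency
dep = Path(tempfile.mkdtemp()) / "metadata.csv"
dep.write_text("slice_file_name,classID\n101415-3-0-2.wav,3\n")

lock = {str(dep): file_md5(dep)}        # snapshot, like dvc.lock
print(stage_needs_rerun([dep], lock))   # False: nothing changed, stage is skipped
dep.write_text("modified contents")
print(stage_needs_rerun([dep], lock))   # True: dependency changed, stage reruns
```

DVC applies the same comparison per stage, so only stages whose inputs (or code) changed are re-executed.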
Run the following command from the CLI:
=== "Mac, Linux, Windows"

    ```bash
    dvc run -n train \
      -d audio_data/audio/ \
      -d audio_data/metadata/UrbanSound8K.csv \
      -d src/data/ \
      -d src/model/ \
      -d src/const.py \
      --outs-persist models \
      -O params.yml \
      -M metrics.csv \
      python3 -m src.model.train
    ```
??? checkpoint "What are the `dvc.yaml` and `dvc.lock` files?"
    This command generates two new files, `dvc.yaml` and `dvc.lock`, which contain the following:
    - dvc.yaml

        ```yaml
        $ cat dvc.yaml
        stages:
          train:
            cmd: python3 -m src.model.train
            deps:
            - audio_data/audio/
            - audio_data/metadata/UrbanSound8K.csv
            - src/const.py
            - src/data/
            - src/model/
            outs:
            - models:
                persist: true
            - params.yml:
                cache: false
            metrics:
            - metrics.csv:
                cache: false
        ```
    - dvc.lock

        ```yaml
        $ cat dvc.lock
        schema: '2.0'
        stages:
          train:
            cmd: python3 -m src.model.train
            deps:
            - path: audio_data/audio/
              md5: 5beb628dfbf9ece0ef66a1bef896ff93.dir
              size: 807148184
              nfiles: 873
            - path: audio_data/metadata/UrbanSound8K.csv
              md5: 947b97f3241e6f8f956690fe23ac610b
              size: 51769
            - path: src/const.py
              md5: 194acee9d0ba6361048568f14d52060f
              size: 266
            - path: src/data/
              md5: d502927d542948f888f6e18020f43772.dir
              size: 3222
              nfiles: 4
            - path: src/model/
              md5: a7efe80e9af5e4c621d306f559346fdc.dir
              size: 3655
              nfiles: 4
            outs:
            - path: metrics.csv
              md5: ef92ebb7afaef832df13b11b2f4deba6
              size: 140126
            - path: models
              md5: 1ee45889b1a00246af367c53e0f6cff3.dir
              size: 1899629
              nfiles: 5
            - path: params.yml
              md5: 7680ae72213291b15db4ef1ebec17034
              size: 472
        ```
We'll version and push the pipeline files using Git by running the following from the CLI:
=== "Mac, Linux, Windows"

    ```bash
    git add dvc.lock dvc.yaml metrics.csv params.yml .gitignore
    git commit -m "build a pipeline to train the model"
    git push
    ```
Last, we'll push the DVC-tracked files:

=== "Mac, Linux, Windows"

    ```bash
    dvc push -r origin
    ```
Congratulations! By completing this tutorial, you've built your very first data pipeline. You can now go to your DagsHub repository and see it!
[![data pipeline](assets/data_pipeline.png){: style="padding-top:0.7em"}](assets/data_pipeline.png){target=_blank}

See the project on DagsHub