In this section, we'll learn how to create a reproducible data pipeline.

Versioning large data files and directories for data science is powerful, but often not enough. Data needs to be filtered, cleaned, and transformed before training ML models. For that purpose, we will use DVC to define, execute, and track data pipelines: a series of data processing stages that produce a final result.

We can extend these pipelines to orchestrate the training and evaluation stages and to manage the entire project lifecycle.

Configure DagsHub

We'll start by creating a new project on DagsHub and configuring it on our machine.

Set up the project

We'll be working on the Urban Sound Classification project, where we develop a neural network model capable of recognizing urban sound events from a set of 10 categories. We'll use the Python librosa library, which helps us extract essential numerical features from audio clips. These features will serve as the input for training the neural network model.

Import project

To import the project files, we will use the dvc get command.

??? Info "What is dvc get?"

The `dvc get` command downloads files from a Git repository or DVC storage without tracking them.

  • Run the following commands from your CLI:

    === "Mac, Linux, Windows"

      ```bash
      pip install dvc
      dvc get https://dagshub.com/nirbarazida/urban-audio-classifier requirements.txt --rev 2f60f46
      dvc get https://dagshub.com/nirbarazida/urban-audio-classifier src --rev 2f60f46
      dvc get --rev fold1 https://dagshub.com/nirbarazida/UrbanSound8K-Thin audio_data.tar.gz
      tar -vzxf audio_data.tar.gz
      rm audio_data.tar.gz
      ```

??? checkpoint "Learn more about the project's structure"

The new project structure:
```bash
.
├── audio_data
│   ├── audio
│   │   └── fold1
│   │       ├── 101415-3-0-2.wav
│   │       ├── 101415-3-0-3.wav
│   │       ├── ...
│   │       └── 99180-9-0-7.wav
│   └── metadata
│       └── UrbanSound8K.csv
├── requirements.txt
└── src
    ├── const.py
    ├── data
    │   ├── generator.py
    │   └── __init__.py
    ├── __init__.py
    └── model
        ├── __init__.py
        └── train.py
```

Install requirements

  • Run the following commands from your CLI:

    === "Mac, Linux, Windows" bash pip install -r requirements.txt

Version code and data

The audio_data directory contains the datasets for this project, which are quite large. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.

Version data files

We will use DagsHub Client to version our data; it uses DVC under the hood.
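
For context, here is a minimal sketch of the plain-DVC equivalent of what the client does for us. It assumes DVC is already initialized in the repository and a storage remote is configured, so treat it as an illustration rather than a step to run:

```bash
# Illustrative only - the DagsHub Client performs the equivalent of:
dvc add audio_data                 # hash the directory and write audio_data.dvc
git add audio_data.dvc .gitignore  # track the small pointer file with Git
git commit -m "Add raw Data"
dvc push                           # upload the actual data to remote storage
```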

  • We will start by installing DagsHub Client using pip

    === "Mac, Linux, Windows" bash pip install dagshub

  • Use DagsHub Client to version the data

    === "Mac, Linux, Windows"

      ```bash
      dagshub upload --update --message "Add raw Data" "<repo-owner>/<repo-name>" "audio_data/" "audio_data/"
      ```
    
  • Pull the new commit created by DagsHub, which contains the DVC pointer file (we'll take a quick look at it below)

    === "Mac, Linux, Windows"

      ```bash
      git pull
      ```
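
To confirm the pull worked, you can take a quick look at what arrived. We're assuming here that the pointer file is named audio_data.dvc; its hash and size values will differ for your upload:

```bash
git log -1 --oneline  # the commit DagsHub created for us
cat audio_data.dvc    # the pointer file: an md5 of the directory, its size, and file count
```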
    

Version code files

  • We will use Git to version all the other files we have in this project

    === "Mac, Linux, Windows"

      ```bash
      git add requirements.txt src/
      git commit -m "Add requirements and src to Git tracking"
      git push
      ```
    

Build a data pipeline

To build a data pipeline, we can use the dvc stage add or dvc run commands. With these commands, we can define the stage's dependencies, outputs, metrics, and more, and specify whether DVC should version each file. We'll use dvc run here; a dvc stage add equivalent is sketched after it.

The structure of the pipeline is saved in a dvc.yaml file with all the relevant information.

Once the pipeline runs, DVC versions the relevant files and records their hashes in a file named dvc.lock. This file holds the version information for every file in the pipeline, whether versioned by DVC or not.

This way, DVC knows which files in the pipeline were modified. When you rerun the pipeline, it knows which stage to start from and which stages it can skip because their dependencies weren't modified.
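
For example, after modifying a file under src/, a quick sketch of how you would check and rerun only what changed:

```bash
dvc status   # lists stages whose dependencies changed since the last run
dvc repro    # re-executes those stages; unchanged stages are skipped
```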

  • Run the following command from the CLI

    === "Mac, Linux, Windows"

      ```bash
      dvc run -n train \
      -d audio_data/audio/ \
      -d audio_data/metadata/UrbanSound8K.csv \
      -d src/data/ \
      -d src/model/ \
      -d src/const.py \
      --outs-persist models \
      -O params.yml \
      -M metrics.csv \
      python3 -m src.model.train
      ```
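
If you'd rather define the stage without executing it right away, a dvc stage add sketch with the same flags would look like this. It only writes dvc.yaml; the stage then runs (and dvc.lock is generated) via dvc repro:

```bash
# Define the stage (writes dvc.yaml, does not run the command):
dvc stage add -n train \
  -d audio_data/audio/ \
  -d audio_data/metadata/UrbanSound8K.csv \
  -d src/data/ \
  -d src/model/ \
  -d src/const.py \
  --outs-persist models \
  -O params.yml \
  -M metrics.csv \
  python3 -m src.model.train
# Execute it:
dvc repro
```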
    

??? checkpoint "What are the dvc.yaml and dvc.lock files?"

This action should generate two new files, `dvc.yaml` and `dvc.lock`, which contain the following:

- dvc.yaml
    ```yaml
    $cat dvc.yaml
    stages:
      train:
        cmd: python3 -m src.model.train
        deps:
        - audio_data/audio/
        - audio_data/metadata/UrbanSound8K.csv
        - src/const.py
        - src/data/
        - src/model/
        outs:
        - models:
            persist: true
        - params.yml:
            cache: false
        metrics:
        - metrics.csv:
            cache: false
    ```
- dvc.lock
```yaml
$cat dvc.lock
schema: '2.0'
stages:
  train:
    cmd: python3 -m src.model.train
    deps:
    - path: audio_data/audio/
      md5: 5beb628dfbf9ece0ef66a1bef896ff93.dir
      size: 807148184
      nfiles: 873
    - path: audio_data/metadata/UrbanSound8K.csv
      md5: 947b97f3241e6f8f956690fe23ac610b
      size: 51769
    - path: src/const.py
      md5: 194acee9d0ba6361048568f14d52060f
      size: 266
    - path: src/data/
      md5: d502927d542948f888f6e18020f43772.dir
      size: 3222
      nfiles: 4
    - path: src/model/
      md5: a7efe80e9af5e4c621d306f559346fdc.dir
      size: 3655
      nfiles: 4
    outs:
    - path: metrics.csv
      md5: ef92ebb7afaef832df13b11b2f4deba6
      size: 140126
    - path: models
      md5: 1ee45889b1a00246af367c53e0f6cff3.dir
      size: 1899629
      nfiles: 5
    - path: params.yml
      md5: 7680ae72213291b15db4ef1ebec17034
      size: 472
```
  • We'll version and push the pipeline files using Git by running the following from the CLI:

    === "Mac, Linux, Windows"

      ```bash
      git add dvc.lock dvc.yaml metrics.csv params.yml .gitignore
      git commit -m "build a pipeline to train the model"
      git push
      ```
    
  • Lastly, we'll push the DVC-tracked files (if the origin remote isn't set up yet, see the sketch after this step)

    === "Mac, Linux, Windows"

      ```bash
      dvc push -r origin
      ```
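
If the origin DVC remote isn't configured yet, the push will fail. A typical DagsHub setup is sketched below; the URL pattern follows DagsHub's convention, and the username and token are placeholders:

```bash
dvc remote add origin https://dagshub.com/<repo-owner>/<repo-name>.dvc
dvc remote modify origin --local auth basic            # keep credentials out of Git
dvc remote modify origin --local user "<your-username>"
dvc remote modify origin --local password "<your-token>"
```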
    

Results

Congratulations! By completing this tutorial, you've built your very first data pipeline. You can now go to your DagsHub repository and see it!

[![data pipeline](assets/data_pipeline.png){: style="padding-top:0.7em"}](assets/data_pipeline.png){target=_blank} See the project on DagsHub