Sample ML project with DVC pipelines

Intent:

  • This repo is an introductory project for learning how to build ML models with DVC pipelines.
  • With DagsHub, users can visualise and compare their experiments.

Setup:

  1. Clone the repo.
  2. Initialise dvc.
  3. Configure a dvc remote.
  4. Version the data file with dvc.
  5. You are good to go. Now define the pipelines and build your first ML model and pipeline. Enjoy the reproducibility of experiments.
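
The first four setup steps map onto a handful of git and dvc commands. The sketch below only builds the command lines so they can be inspected before anything is run. The repository URL, the remote storage URL and the remote name `origin` are placeholders, not values taken from this repo, and `setup_commands` is a hypothetical helper.

```python
def setup_commands(repo_url, remote_url, remote_name="origin",
                   data_file="data/iris.csv"):
    """Return the setup steps as argv lists; nothing is executed here."""
    return [
        # 1. Clone the repo.
        ["git", "clone", repo_url],
        # 2. Initialise dvc (run from inside the cloned directory).
        ["dvc", "init"],
        # 3. Configure a default remote (-d) for dvc push / dvc pull.
        ["dvc", "remote", "add", "-d", remote_name, remote_url],
        # 4. Put the data file under dvc control.
        ["dvc", "add", data_file],
    ]
```

Each entry can be passed to `subprocess.run(argv, check=True)` or simply typed into a shell.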

Defining Pipelines

The idea is to build multiple models and ensemble them into a single, more stable model.

Your final pipeline structure will look like this:

                                                  +-------------------+
                                                  | data/iris.csv.dvc |
                                                  +-------------------+
                                                            *
                                                            *
                                                            *
                                                      +-----------+
                                                      | split.dvc |
                                                      +-----------+
                                                            *
                                                            *
                                                            *
                                                    +---------------+
                                                ****| featurize.dvc |****
                                        ********    +---------------+    ********
                                ********           **              ***           *********
                        ********                ***                   **                  ********
                   *****                      **                        **                        ********
+--------------------+             +---------------+             +-------------------+                    *****
| train_logistic.dvc |**           | train_svc.dvc |             | train_forrest.dvc |            ********
+--------------------+  ********   +---------------+             +-------------------+    ********
                                ********           **              ***           *********
                                        ********     ***         **      ********
                                                *****   **     **   *****
                                                 +--------------------+
                                                 | train_ensemble.dvc |
                                                 +--------------------+

The order is data -> train_test_split -> feature_extraction -> 3 models -> ensemble_model, and done!
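
This ordering is a topological sort of the stage graph: a stage can run only after every stage it depends on has finished. A minimal sketch follows, with the dependencies read off the diagram and the `dvc run` commands in this README. It needs Python 3.9+ for `graphlib`, and it illustrates the idea rather than how DVC resolves the graph internally.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
PIPELINE = {
    "data/iris.csv.dvc": set(),
    "split.dvc": {"data/iris.csv.dvc"},
    "featurize.dvc": {"split.dvc"},
    "train_logistic.dvc": {"featurize.dvc"},
    "train_svc.dvc": {"featurize.dvc"},
    "train_forrest.dvc": {"featurize.dvc"},
    # The ensemble reads the features and all three trained models.
    "train_ensemble.dvc": {
        "featurize.dvc",
        "train_logistic.dvc",
        "train_svc.dvc",
        "train_forrest.dvc",
    },
}


def execution_order(graph):
    """Return one valid order in which the stages can run."""
    return list(TopologicalSorter(graph).static_order())
```

The three training stages have no dependencies on each other, so their relative order in the result is arbitrary.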

With data/iris.csv versioned in DVC, you can start with your first pipeline.

Define Stage 1 => split.dvc, i.e. train_test_split.

dvc run -f split.dvc\
 -d data/iris.csv\
 -d src/train_test_split.py\
 -o data/split\
 python src/train_test_split.py -i "data/iris.csv" -o "data/split/"
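
The contents of src/train_test_split.py are not shown in this README. As a rough idea of what such a stage does, here is a standard-library sketch that reads a CSV and writes a reproducible train/test split into an output directory. The function name `split_csv`, the file names train.csv and test.csv, the 20% test fraction and the seed are all assumptions.

```python
import csv
import random
from pathlib import Path


def split_csv(input_path, output_dir, test_fraction=0.2, seed=42):
    """Write train.csv and test.csv under output_dir; return their row counts."""
    with open(input_path, newline="") as f:
        header, *rows = csv.reader(f)

    # A fixed seed keeps the split identical across runs.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    parts = {"test.csv": rows[:n_test], "train.csv": rows[n_test:]}

    # Create the output directory if it does not exist yet.
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, part in parts.items():
        with open(out_dir / name, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return len(parts["train.csv"]), len(parts["test.csv"])
```

Keeping the seed fixed matters here: if the split changed on every run, every downstream stage would see different data and results would not be comparable.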

Define Stage 2 => featurize.dvc, i.e. feature engineering.

dvc run -f featurize.dvc\
 -d data/split\
 -d src/feature_engineering.py\
 -p pca\
 -o data/features\
 -o data/models/pca/model.gz\
 -M data/models/pca/metrics.csv\
 python src/feature_engineering.py -i "data/split/" -o "data/features/" -o "data/models/pca/"
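
Note that the command above passes `-o` to the script twice. How src/feature_engineering.py handles this is not shown here. If it uses `argparse`, a repeated option is collected only when it is declared with `action="append"`; with the default action the last value silently wins. A minimal sketch under that assumption (the long option names are made up):

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Feature engineering stage (sketch).")
    parser.add_argument("-i", "--input", required=True,
                        help="directory with the split data")
    # action="append" collects every -o into a list, in command-line order.
    parser.add_argument("-o", "--output", action="append", required=True,
                        help="output directory; may be given more than once")
    return parser.parse_args(argv)


# Same arguments as in the dvc run command above.
args = parse_args(["-i", "data/split/", "-o", "data/features/", "-o", "data/models/pca/"])
features_dir, model_dir = args.output
```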

Define Stage 3.a => train_logistic.dvc, i.e. fit a Logistic Regression model.

dvc run -f train_logistic.dvc\
 -d src/logistic_regression.py\
 -d data/features\
 -p logistic\
 -o data/models/logistic/model.gz\
 -M data/models/logistic/metrics.csv\
 python src/logistic_regression.py -i "data/features/" -o "data/models/logistic/"
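
Each training stage declares a metrics.csv with `-M`, which marks the file as a metric and keeps it out of the DVC cache so it can be tracked in Git. The layout of that file is not shown in this README. The sketch below assumes the simplest one: a header row of metric names followed by one row of values. `write_metrics` and the metric names in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def write_metrics(output_dir, metrics):
    """Write a one-row metrics.csv: a header of metric names, then the values."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "metrics.csv"
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(metrics))
        writer.writeheader()
        writer.writerow(metrics)
    return path
```

A training script could call, for example, `write_metrics("data/models/logistic/", {"accuracy": acc})` after evaluating the model.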

Define Stage 3.b => train_svc.dvc, i.e. fit a Linear SVC model.

dvc run -f train_svc.dvc\
 -d src/linear_svc.py\
 -d data/features\
 -p svc\
 -o data/models/svc/model.gz\
 -M data/models/svc/metrics.csv\
 python src/linear_svc.py -i "data/features/" -o "data/models/svc/"

Define Stage 3.c => train_forrest.dvc, i.e. fit a Random Forest model.

dvc run -f train_forrest.dvc\
 -d src/random_forrest.py\
 -d data/features\
 -p forrest\
 -o data/models/r_forrest/model.gz\
 -M data/models/r_forrest/metrics.csv\
 python src/random_forrest.py -i "data/features/" -o "data/models/r_forrest/"

Define Stage 4 => train_ensemble.dvc, i.e. create an ensemble model.

dvc run -f train_ensemble.dvc\
 -d src/ensemble.py\
 -d data/features\
 -d data/models/logistic/model.gz\
 -d data/models/svc/model.gz\
 -d data/models/r_forrest/model.gz\
 -p ensemble\
 -o data/models/ensemble/model.gz\
 -M data/models/ensemble/metrics.csv\
 python src/ensemble.py -i "data/features/" -m "data/models/" -o "data/models/ensemble/"
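
This README does not say how src/ensemble.py combines the three models. One common choice is hard (majority) voting: each model predicts a label for every sample, and the most frequent label wins. A minimal sketch of that technique, offered as an illustration and not as the repo's actual method:

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists into one list by majority vote.

    Ties go to the label that was seen first in argument order.
    """
    combined = []
    # zip(*...) yields, for each sample, one predicted label per model.
    for votes in zip(*model_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```

With three models and an odd number of votes, a tie is possible only when all three disagree, which is one reason ensembles often use an odd number of members.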