Demo repository to test out DVC and DAGsHub https://github.com/arjvik/dvc-demo

Arjun Vikram 42dfa0e2be Update README.md to include experiment log command 2 months ago
.dvc 1b73f685f8 Push dvc to DAGsHub 3 months ago
.github 8dbc9e261a Update README.md 2 months ago
data
outputs
.dvcignore 3cf662ac39 dvc init 3 months ago
.gitignore 7c4c6fb410 Tried new SGDClassifier algorithm 3 months ago
Pipfile 695e8dd553 Modify main.py to seperate training steps and extract parameters 2 months ago
Pipfile.lock 695e8dd553 Modify main.py to seperate training steps and extract parameters 2 months ago
README.md 42dfa0e2be Update README.md to include experiment log command 2 months ago
data.dvc 8d51143947 Added dvc-tracked folders 3 months ago
dvc.lock 898fc3ba3a Return to original parameters 2 months ago
dvc.yaml ccceac7028 Move metrics files to being tracked by Git 2 months ago
main.py e08a41766a Store model performance in DVC-tracked metrics 2 months ago
metrics-test.yaml 898fc3ba3a Return to original parameters 2 months ago
metrics-train.yaml 898fc3ba3a Return to original parameters 2 months ago
params.yaml 898fc3ba3a Return to original parameters 2 months ago


DVC Demo Project

Demo project to test out DVC and DAGsHub

See this repository on DAGsHub and GitHub

Data Version Control (DVC) is a version control system built around the machine learning workflow. It lets you build and run pipelines of data and code, represented as directed acyclic (dependency) graphs, and tracks large outputs through Git-controlled metafiles. DAGsHub is a fully featured Git and DVC remote: DAGsHub is to DVC as GitHub is to Git.
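
As a rough illustration of the dependency-graph idea, the sketch below orders the five step names that main.py accepts (see Usage) using Python's standard-library graphlib. The edges in STAGE_DEPS are an assumption made for illustration only; the real stage graph is defined in dvc.yaml and may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency edges: each stage maps to the stages it depends on.
# The actual graph lives in dvc.yaml and may not be a simple chain.
STAGE_DEPS = {
    "split": set(),
    "featurize": {"split"},
    "tfidf": {"featurize"},
    "train": {"tfidf"},
    "test": {"train"},
}


def run_order(deps):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


print(" -> ".join(run_order(STAGE_DEPS)))
```

A tool such as DVC can use this kind of ordering to decide which stages to run first and, by comparing dependencies, which ones can be skipped.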

This repository implements a binary classifier that determines whether questions from Cross Validated (Stack Exchange) are about machine learning. The machine learning portion of this repository is unremarkable and uses standard techniques. The Python file main.py contains the code for all steps of the ML pipeline.

Usage: python3 main.py [split|featurize|tfidf|train|test]
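
A single-script, multi-step interface like the one in the usage line could be dispatched roughly as follows. This is a minimal sketch, not the contents of main.py: the function names build_parser, run_step, and main are hypothetical, and the step bodies are placeholders.

```python
import argparse

# Step names taken from the usage line above.
STEPS = ("split", "featurize", "tfidf", "train", "test")


def build_parser():
    """Build a parser that accepts exactly one of the pipeline steps."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("step", choices=STEPS, help="pipeline step to run")
    return parser


def run_step(step):
    # Placeholder: the real script would run the step's logic here.
    return f"running step: {step}"


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and run the requested step."""
    args = build_parser().parse_args(argv)
    return run_step(args.step)


print(main(["train"]))
```

Keeping each step behind its own subcommand lets every DVC stage invoke the same script with a different argument.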

View Experiment Log

The following command transforms the experiment log output by DVC into a human-readable format. It pipes the raw JSON from DVC into a jq program that converts it into a TSV (tab-separated values) table of metrics. The TSV is then piped to column to pretty-print it.

$ dvc metrics show --show-json --all-commits | jq -r '(["ID", "Train Accuracy", "Test Accuracy", "Train ROC AUC", "Test ROC AUC"] | ., map("=============")), (to_entries[] as {key: $id, value: {"metrics-train.yaml": $train, "metrics-test.yaml": $test}} | [$id[:10], $train.accuracy, $test.accuracy, $train.roc_auc, $test.roc_auc]) | @tsv' | column -tns$'\t'

Output:

ID              Train Accuracy      Test Accuracy   Train ROC AUC       Test ROC AUC
==============  ==============      ==============  ==============      ==============
workspace       0.9192533333333334  0.89608         0.9546657196819067  0.8611241864339614
8dbc9e261a      0.9192533333333334  0.89608         0.9546657196819067  0.8611241864339614
898fc3ba3a      0.9192533333333334  0.89608         0.9546657196819067  0.8611241864339614
1296fd461c      0.9185066666666667  0.8948          0.954678369966714   0.8550331715537685
75372d83ab      0.9195466666666666  0.89552         0.9541852826125495  0.8629513709508426
6a7e1866a3      0.9181866666666667  0.8964          0.9551873097990777  0.8595586812709163
ccceac7028      0.9192533333333334  0.89608         0.9546657196819067  0.8611241864339614
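
For readers without jq, the same transformation can be sketched in Python. This assumes the JSON shape implied by the jq filter above: a top-level object keyed by revision, whose values map "metrics-train.yaml" and "metrics-test.yaml" to objects with accuracy and roc_auc fields. Other DVC versions may emit a different shape. The helper names metrics_rows and format_table are hypothetical, and SAMPLE reuses two rows from the output above (the commit ID appears as shown there, already truncated).

```python
HEADER = ["ID", "Train Accuracy", "Test Accuracy", "Train ROC AUC", "Test ROC AUC"]

# Sample input in the assumed JSON shape, with values copied from the table above.
SAMPLE = {
    "workspace": {
        "metrics-train.yaml": {"accuracy": 0.9192533333333334, "roc_auc": 0.9546657196819067},
        "metrics-test.yaml": {"accuracy": 0.89608, "roc_auc": 0.8611241864339614},
    },
    "1296fd461c": {
        "metrics-train.yaml": {"accuracy": 0.9185066666666667, "roc_auc": 0.954678369966714},
        "metrics-test.yaml": {"accuracy": 0.8948, "roc_auc": 0.8550331715537685},
    },
}


def metrics_rows(metrics):
    """Build header, separator, and one row per revision (ID cut to 10 chars)."""
    rows = [HEADER, ["=" * 13] * len(HEADER)]
    for rev, files in metrics.items():
        train = files["metrics-train.yaml"]
        test = files["metrics-test.yaml"]
        rows.append([rev[:10], train["accuracy"], test["accuracy"],
                     train["roc_auc"], test["roc_auc"]])
    return rows


def format_table(rows):
    """Left-align every column to its widest cell, similar to `column -t`."""
    cells = [[str(cell) for cell in row] for row in rows]
    widths = [max(len(row[i]) for row in cells) for i in range(len(HEADER))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in cells
    )


print(format_table(metrics_rows(SAMPLE)))
```

To run it on real data, replace SAMPLE with json.load(sys.stdin) and pipe the output of dvc metrics show --show-json --all-commits into the script.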

DAGsHub Features

Experiment Tracker

Pipeline DAG

DVC-tracked folder view (outputs/)

Credits