DVC Tutorial

A repo for the DVC tutorial shown on DVC.org.

Step 0

Initial git commit. Here we have downloaded the code from the DVC site.
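Downloading the code looked roughly like the following sketch. The URL is taken from the classic DVC NLP tutorial; if the tutorial has moved, treat it as an assumption.

```shell
# Fetch the tutorial code and commit it to git.
wget https://code.dvc.org/tutorial/nlp/code.zip
unzip code.zip -d code && rm -f code.zip
git add code
git commit -m "download code"
```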

Step 1

Initialized DVC and added a virtual environment.
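A minimal sketch of this step; the environment name and tool (venv vs. virtualenv) are assumptions:

```shell
# Create and activate a virtual environment, install DVC, and initialize it.
python -m venv .env
source .env/bin/activate
pip install dvc
dvc init
git commit -m "init DVC"
```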

Step 2

Retrieved the example data, which is about 41 MB in size. Because of how DVC works, this data is not committed to the git repo; instead it exists in the DVC cache.
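Roughly, this step looked as follows (the data URL is from the classic DVC tutorial and is an assumption here):

```shell
# Download the ~41 MB archive and put it under DVC control.
# `dvc add` moves the file's content into .dvc/cache and creates a small
# .dvc pointer file, which is what actually gets committed to git.
wget https://data.dvc.org/tutorial/nlp/25K/Posts.xml.zip
dvc add Posts.xml.zip
git add Posts.xml.zip.dvc .gitignore
git commit -m "add raw data"
```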

Step 3

Unzipped the data file. Because the command is run through DVC, DVC automatically adds the unzipped data file to .gitignore and to .dvc/cache.
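A sketch of the unzip stage, assuming the `dvc run` syntax of the DVC version used in this tutorial:

```shell
# -d declares a dependency, -o a DVC-managed output.
# DVC adds Posts.xml to .gitignore and stores its content in .dvc/cache.
dvc run -d Posts.xml.zip -o Posts.xml unzip Posts.xml.zip
git add . && git commit -m "extract data"
```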

Step 4

Converted the XML to TSV and performed the train/test split. These are two consecutive steps of the data pipeline, which goes to show that you can run multiple pipeline stages before committing without any problems.
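The two stages can be sketched like this. The script names follow the code/ directory of the tutorial; the split arguments are illustrative assumptions:

```shell
# Stage 1: XML -> TSV.
dvc run -d code/xml_to_tsv.py -d Posts.xml -o Posts.tsv \
    python code/xml_to_tsv.py

# Stage 2: split into training and test sets.
dvc run -d code/split_train_test.py -d Posts.tsv \
    -o Posts-train.tsv -o Posts-test.tsv \
    python code/split_train_test.py

git add . && git commit -m "Process to TSV and separate test and training data"
```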

Step 5

Performed the following DVC steps: featurization, training, and model evaluation. For the final step we create an eval.txt file, which includes an AUC metric for measuring the performance of the model.

Departing from the original DVC tutorial, we have created this file as a metric, using the -M flag instead of the -o flag that appears in the original tutorial.
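A sketch of these three stages; output names match the .dvc files in this repo, while script names and arguments are assumptions based on the tutorial:

```shell
# Featurization: build feature matrices from the TSV splits.
dvc run -d code/featurization.py -d Posts-train.tsv -d Posts-test.tsv \
    -o matrix-train.p -o matrix-test.p \
    python code/featurization.py

# Training: fit the model on the training matrix.
dvc run -d code/train_model.py -d matrix-train.p -o model.p \
    python code/train_model.py

# Evaluation: -M marks eval.txt as a metric file rather than a plain output,
# so DVC can track and display it across branches.
dvc run -d code/evaluate.py -d model.p -d matrix-test.p -M eval.txt \
    python code/evaluate.py

dvc metrics show   # display the AUC recorded in eval.txt
```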

Step 6

Created a new branch called bigram. As its name suggests, we have tried to use bigrams (features extracted from word pairs) in addition to the unigrams (single-word features) used earlier.

This step is performed in order to try to improve our AUC metric. It has indeed improved, but by a very small amount, which is not so exciting. We are logging this relatively unsuccessful attempt nonetheless.
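The branching workflow can be sketched as follows; the exact code change to enable bigrams is an assumption:

```shell
git checkout -b bigram
# Edit code/featurization.py to extract bigrams as well, e.g. something like
#   CountVectorizer(ngram_range=(1, 2), ...)
dvc repro            # re-runs only the stages affected by the change
git commit -am "Bigrams"
dvc metrics show -a  # compare the AUC metric across all branches
```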

Step 7

We have now created a new branch called tuning, which aims to improve the model's AUC by changing the parameters of the random forest classifier used in this project. Here we changed the number of estimators to 700 (an increase of 600) and the number of jobs to 6 (an increase of 4).

After running the dvc repro command, we achieve a model with ~0.64 AUC, which is a decent improvement.
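A sketch of this step; the exact classifier code is an assumption, with the parameter values taken from the text above:

```shell
git checkout master      # branch off the unigram baseline
git checkout -b tuning
# In code/train_model.py, change the classifier parameters, e.g.
#   RandomForestClassifier(n_estimators=700, n_jobs=6, ...)
dvc repro                # retrain and re-evaluate with the new parameters
git commit -am "tune the model"
dvc metrics show         # ~0.64 AUC in our run
```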

Step 8

Here we try to combine the modifications from the former two branches in order to get another improvement. In this case the metric has, in fact, not improved: the AUC is now ~0.638, as opposed to the case shown in the original tutorial, where a small improvement was made. Nonetheless, we continue the flow of the tutorial and will perform the next steps as if it had improved.
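Combining the branches can be sketched like this; branch names follow the repo history, and conflict resolution is assumed to be done by hand:

```shell
# From the tuning branch, merge in the bigram changes.
git merge bigram
dvc repro            # retrain with both modifications combined
dvc metrics show     # ~0.638 AUC in our run
git commit -am "Merge bigrams into the tuned model"
```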

Step 9 - Current Step

We have now merged our "improved" model back into the original branch.
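Based on the commit history (merge train_bigram into master), the final merge presumably looked something like:

```shell
git checkout master
git merge train_bigram   # branch name taken from the commit history
dvc checkout             # sync the workspace files with the merged .dvc files
```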