
DVC Tutorial

A repo for the DVC tutorial shown on DVC.org.

Step 0

Initial Git commit. Here we have downloaded the code from the DVC site.

Step 1

Initialized DVC and added a virtual environment.
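A rough sketch of this step (the environment name and commit message are illustrative):

```shell
# Create and activate a virtual environment for the tutorial code
python3 -m venv .env
. .env/bin/activate
pip install dvc

# Initialize DVC inside the existing Git repo and commit its config
dvc init
git commit -m "init DVC"
```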

Step 2

Retrieved the example data, which is about 41 MB in size. Because of how DVC works, this data is not committed to the Git repo; instead it lives in the DVC cache.
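This corresponds roughly to the following commands; the archive URL and paths follow the original DVC tutorial and may have changed since:

```shell
# Download the ~41 MB example archive
mkdir -p data
wget -P data https://data.dvc.org/tutorial/nlp/100K/Posts.xml.zip

# Track it with DVC: the data goes into .dvc/cache, and only a small
# .dvc pointer file is committed to Git
dvc add data/Posts.xml.zip
git add data/Posts.xml.zip.dvc data/.gitignore
git commit -m "add raw data"
```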

Step 3

Unzipped the data file. Given the command used, DVC automatically adds the unzipped data file to .gitignore and stores it in .dvc/cache.
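With the old-style `dvc run` stage syntax in use at the time, the unzip step looks roughly like this (filenames follow the original tutorial):

```shell
# DVC records the dependency (-d) and output (-o), adds data/Posts.xml
# to .gitignore, and caches the unzipped file in .dvc/cache
dvc run -d data/Posts.xml.zip \
        -o data/Posts.xml \
        unzip data/Posts.xml.zip -d data/
```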

Step 4

Converted the XML to TSV and performed the train/test split. These are two consecutive stages of the data pipeline; this step shows that you can run multiple pipeline stages before committing without any problems.
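The two stages sketched below use the script and file names from the original tutorial, which may differ in this repo:

```shell
# Stage 1: convert the XML posts to TSV
dvc run -d code/xml_to_tsv.py -d data/Posts.xml \
        -o data/Posts.tsv \
        python code/xml_to_tsv.py

# Stage 2: split the TSV into training and test sets
dvc run -d code/split_train_test.py -d data/Posts.tsv \
        -o data/Posts-train.tsv -o data/Posts-test.tsv \
        python code/split_train_test.py
```

Each `dvc run` writes a stage file recording its dependencies and outputs, so both stages can be committed together afterwards.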

Step 5

Performed the following DVC steps: featurization, training, and model evaluation. For the final step we create an eval.txt file which includes an AUC metric for measuring the performance of the model.

Departing from the original DVC tutorial, we created this file as a metric using the -M flag instead of the -o flag used in the original tutorial.
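A sketch of the three stages (script and artifact names follow the original tutorial); note how the last stage uses `-M` rather than `-o`:

```shell
# Build feature matrices from the train/test TSVs
dvc run -d code/featurization.py \
        -d data/Posts-train.tsv -d data/Posts-test.tsv \
        -o data/matrix-train.p -o data/matrix-test.p \
        python code/featurization.py

# Train the random forest model
dvc run -d code/train_model.py -d data/matrix-train.p \
        -o data/model.p \
        python code/train_model.py

# Evaluate: -M marks eval.txt as a metric file, so it is kept in Git
# (not the cache) and can be compared across branches
dvc run -d code/evaluate.py -d data/model.p -d data/matrix-test.p \
        -M data/eval.txt \
        python code/evaluate.py
```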

Step 6

Created a new branch called bigram. As its name suggests, we tried to use bigrams (features extracted from word pairs) in addition to the unigrams (single-word features) used earlier.

This step is performed in order to try and improve our AUC metric. It has indeed improved, but by a very small amount, which is not very exciting. We log this relatively unsuccessful attempt nonetheless.
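The experiment workflow for this step can be sketched as follows (the edit to the featurization script is described, not shown):

```shell
git checkout -b bigram
# ...edit code/featurization.py to extract bigrams as well as unigrams...

dvc repro              # re-runs only the stages affected by the change
git commit -am "Bigrams"

dvc metrics show -a    # compare the AUC metric across all branches
```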

Step 7

We have now created a new branch called tuning, which aims to improve the model's performance on the AUC metric by changing the parameters of the random forest classifier used in this project. Here we have changed the number of estimators to 700 (an increase of 600) and the number of jobs to 6 (an increase of 4).

After running the `dvc repro` command we achieve a model with ~0.64 AUC, which is a decent improvement.
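A sketch of this step, assuming the tuning branch starts from master and the hyperparameters live in the training script:

```shell
git checkout master
git checkout -b tuning
# ...edit code/train_model.py: n_estimators=700, n_jobs=6...

dvc repro              # retrain and re-evaluate with the new parameters
git commit -am "tune random forest"
dvc metrics show       # AUC on this branch: ~0.64
```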

Step 8

Here we try to combine the modifications from the previous two stages in order to get a further improvement. In this case the metric has, in fact, not improved: the AUC is now ~0.638, as opposed to the original tutorial, where a small improvement had been made. Nonetheless we continue the flow of the tutorial and will perform the next steps as if it had improved.
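Combining the two branches can be sketched as a Git merge followed by a DVC re-run (direction of the merge is our assumption):

```shell
# On the tuning branch, merge in the bigram featurization changes
git merge bigram

dvc checkout           # sync cached data files with the merged .dvc files
dvc repro              # retrain with bigrams + tuned hyperparameters
dvc metrics show       # AUC here: ~0.638
```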

Step 9 - Current Step

We have now merged our "improved" model with the original branch.
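The final merge back can be sketched as (branch names assumed from the steps above):

```shell
git checkout master
git merge tuning       # "Merge bigrams into the tuned model"
dvc checkout           # restore the merged model and data from the cache
```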
