Machine learning pipeline

This repo provides an example of how to incorporate popular machine learning tools such as DVC, MLflow, and Hydra in your machine learning project. I use my project on predicting aggressive tweets as an example.

Find the article on how to use MLflow and Hydra here

Find the article on how to use DVC here

DVC

DVC is a data version control tool. To install DVC, run

pip install dvc

Hydra

With Hydra, you can compose your configuration dynamically. To install Hydra, simply run

pip install hydra-core --upgrade

MLflow

MLflow is a platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment. Install MLflow with

pip install mlflow

Project structure

src: directory for source code
mlruns: directory for MLflow runs
configs: directory for config files
outputs: results from Hydra runs. Each time you run a function wrapped in Hydra's decorator, the output is saved here. Hydra may also run your code from inside this directory, so if you want MLflow to keep writing to the mlruns folder in the directory you launched the script from, use

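import mlflow
import hydra
from hydra import utils

# Call this inside the function decorated with @hydra.main:
# get_original_cwd() returns the directory the script was launched from,
# not the per-run directory that Hydra creates under outputs.
mlflow.set_tracking_uri('file://' + utils.get_original_cwd() + '/mlruns')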

src/preprocessing.py: preprocessing
src/train_pipeline.py: training pipeline
src/train.py: training and saving the model
src/predict.py: loading the model and making predictions
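The tracking URI snippet above joins strings by hand, which works for Unix-style paths. As an alternative sketch, the URI can be built with `pathlib`, which also handles other path formats. `build_tracking_uri` is a hypothetical helper name of mine, not something defined in this repo.

```python
from pathlib import Path


def build_tracking_uri(project_root):
    """Return a file URI for the mlruns folder under project_root."""
    root = Path(project_root)
    if not root.is_absolute():
        # Path.as_uri() only works on absolute paths.
        raise ValueError("project_root must be an absolute path")
    return (root / "mlruns").as_uri()


# Example with the current directory; inside a Hydra app you would pass
# hydra.utils.get_original_cwd() instead.
print(build_tracking_uri(Path.cwd()))
```

The returned string can then be passed to `mlflow.set_tracking_uri` in place of the concatenated one.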

How to pull the data with DVC

Pull the data from Google Drive

dvc pull

How to run this file

To run the experiments defined in the configs and see how they are displayed in MLflow's UI, clone this repo and run

python src/train.py

Once the run is completed, you can start MLflow's server with

mlflow ui

Run this command from the same directory in which you ran the file, then open http://localhost:5000/ in your browser. You should be able to see your experiments there.

khuyentran1401 / Machine-learning-pipeline connected to https://github.com/khuyentran1401/Machine-learning-pipeline.git
