
In this section, we'll learn how to track machine learning experiments.

In the world of data science, experiments are a fundamental part of any project, since we rely heavily on research and empirical analysis. However, as your projects grow in complexity, keeping track of the various experiments, their configurations, and their results can quickly become overwhelming.

In this tutorial, we'll build a systematic solution, based on MLflow, to record and manage experiments.

Configure DagsHub

We'll start by creating a new project on DagsHub and configuring it with our local machine.

Set up the project

In this tutorial, we will work with the email classifier project, where we build a Random Forest classifier to detect spam emails.

??? Checkpoint "Learn more about the project's structure"

    This project is a simple 'Ham or Spam' classifier for emails using the Enron dataset, with the following structure:

    ```bash
    tree -I <venv-name>
    .
    ├── data
    │   └── enron.csv
    ├── requirements.txt
    └── src
        ├── const.py
        ├── data_preprocessing.py
        └── modeling.py

    2 directories, 5 files
    ```

    - <u>src directory</u> - Holds the data preprocessing, modeling, and const files:
        - `data_preprocessing.py` - Processes the raw data, splits it into train and test sets, and saves them to the data directory.
        - `modeling.py` - A simple Random Forest classifier.
        - `const.py` - Holds the project's constants.
    - <u>data directory</u> - Contains the raw data - `enron.csv`.
    - `requirements.txt` - Python dependencies required to run the Python files.
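For orientation, `const.py` defines the file paths and log-message strings that the other scripts import. A hypothetical sketch is below - the names match those referenced in `modeling.py`, but the exact values are an assumption and may differ in the actual project:

```python
# Hypothetical sketch of const.py - actual values in the project may differ.

# Paths to the processed train/test splits saved by data_preprocessing.py
X_TRAIN_PATH = "data/X_train.csv"
X_TEST_PATH = "data/X_test.csv"
Y_TRAIN_PATH = "data/y_train.csv"
Y_TEST_PATH = "data/y_test.csv"

# Log messages printed during modeling
M_MOD_INIT = "[DEBUG] Initialize Modeling"
M_MOD_LOAD_DATA = "[DEBUG] Loading data sets for modeling"
M_MOD_RFC = "[DEBUG] Runing Random Forest Classifier"
M_MOD_SCORE = "[INFO] Finished modeling with AUC Score:"
```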

Download files

To begin, we will use the `dvc get` command to download the project's files (code, data, and dependencies) to our local directory.

??? Info "What is dvc get?"

    The `dvc get` command downloads files from a Git repository or DVC storage without tracking them.

- Run the following commands from your CLI:

    === "Mac, Linux, Windows"

        ```bash
        pip install dvc==2.58.0
        dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
        dvc get https://dagshub.com/nirbarazida/hello-world-files src
        dvc get https://dagshub.com/nirbarazida/hello-world-files data/
        ```

Install requirements

- Run the following command from your CLI:

    === "Mac, Linux, Windows"

        ```bash
        pip install -r requirements.txt
        ```

Track experiments

We will use MLflow to track the experiments in our project. To make the experiments' information accessible outside our local machine, we'll use DagsHub's integration with MLflow to log the experiments to the repository's remote tracking server.

Configurations

- We'll start by installing MLflow and DagsHub from the CLI:

    === "Mac, Linux, Windows"

        ```bash
        pip install mlflow dagshub
        ```

- Next, we'll set up the credentials required for write access to your MLflow remote server on DagsHub. We'll do it with the DagsHub client, from a Python script:

    === "Mac, Linux, Windows"

        ```python
        import dagshub
        dagshub.init(repo_name="<repo-name>", repo_owner="<repo-owner>")
        ```
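Behind the scenes, `dagshub.init()` points MLflow at the repository's tracking server. MLflow also reads its tracking URI and credentials from standard environment variables, so the same setup can be sketched manually - the `.mlflow` URI suffix follows DagsHub's convention, and the token value is a placeholder:

```python
import os

# Hypothetical manual setup - dagshub.init() handles this for you.
# MLFLOW_TRACKING_URI / USERNAME / PASSWORD are standard MLflow env vars.
os.environ["MLFLOW_TRACKING_URI"] = "https://dagshub.com/<repo-owner>/<repo-name>.mlflow"
os.environ["MLFLOW_TRACKING_USERNAME"] = "<repo-owner>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<your-dagshub-token>"

print(os.environ["MLFLOW_TRACKING_URI"])
```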

    ??? info "Configure DagsHub from the CLI"

      You can also configure DagsHub's MLflow tracking server from the CLI. Read more about it on the [MLflow integration](../integration_guide/mlflow_tracking.md#3-set-up-your-credentials) page.
    

Add MLflow logging

To log the experiment's information with MLflow, we only need to add 3 lines of code to `modeling.py`:

- `import mlflow`
- `with mlflow.start_run():` - scopes each run to one block of code
- `mlflow.sklearn.autolog()` - enables automatic logging for the scikit-learn framework
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import pandas as pd
from const import *
import mlflow

print(M_MOD_INIT, '\n' + M_MOD_LOAD_DATA)
X_train = pd.read_csv(X_TRAIN_PATH)
X_test = pd.read_csv(X_TEST_PATH)
y_train = pd.read_csv(Y_TRAIN_PATH)
y_test = pd.read_csv(Y_TEST_PATH)

print(M_MOD_RFC)
with mlflow.start_run():
    mlflow.sklearn.autolog()

    rfc = RandomForestClassifier(n_estimators=1, random_state=0)

    # Train the model
    rfc.fit(X_train, y_train.values.ravel())
    y_pred = rfc.predict(X_test)

    print(M_MOD_SCORE, round(roc_auc_score(y_test, y_pred), 3))
```

!!! Note "MLflow autologging"

    MLflow supports autologging for many popular frameworks such as PyTorch, TensorFlow, XGBoost, and more. You can find all the details in the [MLflow docs](https://mlflow.org/docs/latest/tracking.html#automatic-logging).
- Run the above code from the CLI or from your IDE:

    === "Mac, Linux, Windows"

        ```bash
        python src/modeling.py

        [DEBUG] Initialize Modeling
        [DEBUG] Loading data sets for modeling
        [DEBUG] Runing Random Forest Classifier
        [INFO] Finished modeling with AUC Score: 0.931
        ```
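A note on the score above: when `roc_auc_score` receives hard 0/1 predictions rather than probabilities, it reduces to the mean of the true-positive rate and true-negative rate (i.e. balanced accuracy). A pure-Python sketch with made-up labels illustrates the calculation:

```python
def auc_hard_labels(y_true, y_pred):
    """ROC AUC for hard 0/1 predictions: mean of TPR and TNR (balanced accuracy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return (tp / pos + tn / neg) / 2

# Made-up example: 3 of 4 spam emails caught, all 4 ham emails kept
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
print(auc_hard_labels(y_true, y_pred))  # → 0.875
```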

Results

Congratulations! By completing this tutorial, you've run your very first ML experiment. You can now go to your DagsHub repository and see it under the Experiments tab.

[![track_ml_experiments](assets/track_ml_experiments.png){: style="padding-top:0.7em"}](assets/track_ml_experiments.png){target=_blank} See the project on DagsHub