
In this section, we'll learn how to track machine learning experiments.

In the world of data science, experiments are a fundamental part of any project, since we rely heavily on research and empirical analysis. However, as your projects grow in complexity, keeping track of the various experiments, their configurations, and their results can quickly become overwhelming.

In this tutorial, we'll build a systematic solution, based on MLflow, to record and manage experiments.

Configure DagsHub

We'll start by creating a new project on DagsHub and configuring it with our local machine.

Set up the project

In this tutorial, we will work with the email classifier project, where we build a Random Forest classifier to detect spam emails.

??? Checkpoint "Learn more about the project's structure"

    This project is a simple 'Ham or Spam' classifier for emails using the Enron dataset, with the following structure:

    ```bash
    tree -I <venv-name>
    .
    ├── data
    │   └── enron.csv
    ├── requirements.txt
    └── src
        ├── const.py
        ├── data_preprocessing.py
        └── modeling.py

    2 directories, 5 files
    ```

    - <u>src directory</u> - Holds the data preprocessing, modeling, and const files:
        - `data_preprocessing.py` - Processes the raw data, splits it into train and test sets, and saves them to the data directory.
        - `modeling.py` - A simple Random Forest classifier.
        - `const.py` - Holds the project's constants.
    - <u>data directory</u> - Contains the raw data - `enron.csv`.
    - `requirements.txt` - Python dependencies required to run the Python files.
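For orientation, `const.py` defines the file paths and log-message strings that the other scripts import. A hypothetical sketch is below - the names match those referenced in `modeling.py`, but the exact values are an assumption and may differ in the actual project:

```python
# Hypothetical sketch of const.py - actual values in the project may differ.

# Paths to the processed train/test splits saved by data_preprocessing.py
X_TRAIN_PATH = "data/X_train.csv"
X_TEST_PATH = "data/X_test.csv"
Y_TRAIN_PATH = "data/y_train.csv"
Y_TEST_PATH = "data/y_test.csv"

# Log messages printed during modeling
M_MOD_INIT = "[DEBUG] Initialize Modeling"
M_MOD_LOAD_DATA = "[DEBUG] Loading data sets for modeling"
M_MOD_RFC = "[DEBUG] Runing Random Forest Classifier"
M_MOD_SCORE = "[INFO] Finished modeling with AUC Score:"
```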

Download files

To begin, we will use the `dvc get` command to download the project's files (code, data, and dependencies) to our local directory.

??? Info "What is dvc get?"

    The `dvc get` command downloads files from a Git repository or DVC storage without tracking them.

- Run the following commands from your CLI:

    === "Mac, Linux, Windows"

        ```bash
        pip install dvc==2.58.0
        dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
        dvc get https://dagshub.com/nirbarazida/hello-world-files src
        dvc get https://dagshub.com/nirbarazida/hello-world-files data/
        ```

Install requirements

- Run the following command from your CLI:

    === "Mac, Linux, Windows"

        ```bash
        pip install -r requirements.txt
        ```

Track experiments

We will use MLflow to track the experiments in our project. To make the experiments' information accessible outside our local machine, we'll use DagsHub's integration with MLflow to log the experiments to the repository's remote tracking server.

Configurations

- We'll start by installing MLflow and DagsHub from the CLI:

    === "Mac, Linux, Windows"

        ```bash
        pip install mlflow dagshub
        ```

- Next, we'll set up the credentials required for write access to your MLflow remote server on DagsHub. We'll do it with the DagsHub client, from a Python script:

    === "Mac, Linux, Windows"

        ```python
        import dagshub
        dagshub.init(repo_name="<repo-name>", repo_owner="<repo-owner>")
        ```
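Behind the scenes, `dagshub.init()` points MLflow at the repository's tracking server. MLflow also reads its tracking URI and credentials from standard environment variables, so the same setup can be sketched manually - the `.mlflow` URI suffix follows DagsHub's convention, and the token value is a placeholder:

```python
import os

# Hypothetical manual setup - dagshub.init() handles this for you.
# MLFLOW_TRACKING_URI / USERNAME / PASSWORD are standard MLflow env vars.
os.environ["MLFLOW_TRACKING_URI"] = "https://dagshub.com/<repo-owner>/<repo-name>.mlflow"
os.environ["MLFLOW_TRACKING_USERNAME"] = "<repo-owner>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<your-dagshub-token>"

print(os.environ["MLFLOW_TRACKING_URI"])
```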

    ??? info "Configure DagsHub from the CLI"

      You can also configure DagsHub's MLflow tracking server from the CLI. Read more about it on the [MLflow integration](../integration_guide/mlflow_tracking.md#3-set-up-your-credentials) page.
    

Add MLflow logging

To log the experiment's information with MLflow, we only need to add 3 lines of code to `modeling.py`:

- `import mlflow`
- `with mlflow.start_run():` - scopes each run to one block of code
- `mlflow.sklearn.autolog()` - enables automatic logging for the scikit-learn framework
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import pandas as pd
from const import *
import mlflow

print(M_MOD_INIT, '\n' + M_MOD_LOAD_DATA)
X_train = pd.read_csv(X_TRAIN_PATH)
X_test = pd.read_csv(X_TEST_PATH)
y_train = pd.read_csv(Y_TRAIN_PATH)
y_test = pd.read_csv(Y_TEST_PATH)

print(M_MOD_RFC)
with mlflow.start_run():
    mlflow.sklearn.autolog()

    rfc = RandomForestClassifier(n_estimators=1, random_state=0)

    # Train the model
    rfc.fit(X_train, y_train.values.ravel())
    y_pred = rfc.predict(X_test)

    print(M_MOD_SCORE, round(roc_auc_score(y_test, y_pred), 3))
```

!!! Note "MLflow autologging"

    MLflow supports autologging for many popular frameworks such as PyTorch, TensorFlow, XGBoost, and more. You can find all the details in the [MLflow docs](https://mlflow.org/docs/latest/tracking.html#automatic-logging).
- Run the above code from the CLI or from your IDE:

    === "Mac, Linux, Windows"

        ```bash
        python src/modeling.py

        [DEBUG] Initialize Modeling
        [DEBUG] Loading data sets for modeling
        [DEBUG] Runing Random Forest Classifier
        [INFO] Finished modeling with AUC Score: 0.931
        ```
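A note on the score above: when `roc_auc_score` receives hard 0/1 predictions rather than probabilities, it reduces to the mean of the true-positive rate and true-negative rate (i.e. balanced accuracy). A pure-Python sketch with made-up labels illustrates the calculation:

```python
def auc_hard_labels(y_true, y_pred):
    """ROC AUC for hard 0/1 predictions: mean of TPR and TNR (balanced accuracy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return (tp / pos + tn / neg) / 2

# Made-up example: 3 of 4 spam emails caught, all 4 ham emails kept
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
print(auc_hard_labels(y_true, y_pred))  # → 0.875
```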

Results

Congratulations! By completing this tutorial, you've run your very first ML experiment. You can now go to your DagsHub repository and see it under the Experiments tab.

[![track_ml_experiments](assets/track_ml_experiments.png){: style="padding-top:0.7em"}](assets/track_ml_experiments.png){target=_blank} See the project on DagsHub