In this section, we'll learn how to track machine learning experiments.
In the world of data science, conducting experiments is a fundamental part of any project, since we rely heavily on research and empirical analysis. However, as your projects grow in complexity, keeping track of the various experiments, their configurations, and their results can quickly become overwhelming.

In this tutorial, we'll build a systematic solution for recording and managing experiments, based on MLflow.
We'll start by creating a new project on DagsHub and configuring it to work with our local machine.
In this tutorial, we will work with the email classifier project, where we build a Random Forest classifier to detect spam emails.
??? Checkpoint "Learn more about the project's structure"
    This project is a simple 'Ham or Spam' classifier for emails using the Enron data set, with the following structure:

    ```bash
    tree -I <venv-name>
    .
    ├── data
    │   └── enron.csv
    ├── requirements.txt
    └── src
        ├── const.py
        ├── data_preprocessing.py
        └── modeling.py

    2 directories, 5 files
    ```
    - <u>src directory</u> - Holds the data-preprocessing, modeling, and const files:
        - `data_preprocessing.py` - Processes the raw data, splits it into train and test sets, and saves them to the data directory.
        - `modeling.py` - A simple Random Forest classifier.
        - `const.py` - Holds the project's constants (a hypothetical sketch of this file follows below).
    - <u>data directory</u> - Contains the raw data - `enron.csv`.
    - `requirements.txt` - Python dependencies required to run the Python files.
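    For illustration only, `const.py` might look roughly like the sketch below. The constant names are taken from the references in `modeling.py` and the log messages shown later in this tutorial; the file paths are assumptions and may differ from the actual project.

    ```python
    # Hypothetical sketch of const.py - actual paths and values may differ.
    X_TRAIN_PATH = "data/X_train.csv"
    X_TEST_PATH = "data/X_test.csv"
    Y_TRAIN_PATH = "data/y_train.csv"
    Y_TEST_PATH = "data/y_test.csv"

    # Log messages mirroring the CLI output shown at the end of the tutorial
    M_MOD_INIT = "[DEBUG] Initialize Modeling"
    M_MOD_LOAD_DATA = "[DEBUG] Loading data sets for modeling"
    M_MOD_RFC = "[DEBUG] Runing Random Forest Classifier"
    M_MOD_SCORE = "[INFO] Finished modeling with AUC Score:"
    ```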
To begin, we will use the `dvc get` command to download the project's files (code, data, and dependencies) to our local directory.

??? info "What is `dvc get`?"
    The `dvc get` command downloads files from a Git repository or DVC storage without tracking them.
Run the following commands from your CLI:

=== "Mac, Linux, Windows"

    ```bash
    pip install dvc==2.58.0
    dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
    dvc get https://dagshub.com/nirbarazida/hello-world-files src
    dvc get https://dagshub.com/nirbarazida/hello-world-files data/
    ```
Next, install the project's dependencies from your CLI:

=== "Mac, Linux, Windows"

    ```bash
    pip install -r requirements.txt
    ```
We will use MLflow to track the experiments in our project. To make the experiments' information accessible outside our local machine, we'll use DagsHub's integration with MLflow to log the experiments to the repository's remote tracking server.
We'll start by installing MLflow and the DagsHub client from the CLI:

=== "Mac, Linux, Windows"

    ```bash
    pip install mlflow dagshub
    ```
Next, we'll set up the credentials required for write access to your MLflow remote server on DagsHub. We'll do it with the DagsHub client, from a Python script:

=== "Mac, Linux, Windows"

    ```python
    import dagshub
    dagshub.init(repo_name="<repo-name>", repo_owner="<repo-owner>")
    ```
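If you'd like to confirm that MLflow now points at the repository's remote tracking server, a quick optional check (not part of the tutorial files) is:

```python
import mlflow

# Should print the tracking URI configured by dagshub.init()
print(mlflow.get_tracking_uri())
```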
??? info "Configure DagsHub from the CLI"
    You can also configure DagsHub's MLflow tracking server from the CLI. Read more about it in the [MLflow integration](../integration_guide/mlflow_tracking.md#3-set-up-your-credentials) page.
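    For a rough idea of what that involves: MLflow reads the tracking server address and credentials from standard environment variables. The values below are placeholders, not taken from this tutorial; use the exact URI and token from your DagsHub repository as described on the linked page.

    ```bash
    # Placeholder values - replace with your repository's MLflow URI and your DagsHub token
    export MLFLOW_TRACKING_URI=https://dagshub.com/<repo-owner>/<repo-name>.mlflow
    export MLFLOW_TRACKING_USERNAME=<your-dagshub-username>
    export MLFLOW_TRACKING_PASSWORD=<your-dagshub-token>
    ```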
To log the experiment's information with MLflow, we only need to add three lines of code to `modeling.py`:
```python
import mlflow

with mlflow.start_run():         # "scope" each run in one block of code
    mlflow.sklearn.autolog()     # automatic logging for the sklearn framework
```

The updated `modeling.py` file looks like this:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import pandas as pd
from const import *
import mlflow

print(M_MOD_INIT, '\n' + M_MOD_LOAD_DATA)

X_train = pd.read_csv(X_TRAIN_PATH)
X_test = pd.read_csv(X_TEST_PATH)
y_train = pd.read_csv(Y_TRAIN_PATH)
y_test = pd.read_csv(Y_TEST_PATH)

print(M_MOD_RFC)

with mlflow.start_run():
    mlflow.sklearn.autolog()

    rfc = RandomForestClassifier(n_estimators=1, random_state=0)

    # Train the model
    rfc.fit(X_train, y_train.values.ravel())

    y_pred = rfc.predict(X_test)

    print(M_MOD_SCORE, round(roc_auc_score(y_test, y_pred), 3))
```
!!! Note "MLflow autologging"
    MLflow supports autologging for many popular frameworks such as PyTorch, TensorFlow, XGBoost, and more. You can find all the information in the [MLflow docs](https://mlflow.org/docs/latest/tracking.html#automatic-logging).
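    If a framework you use isn't covered by autologging, or you want explicit control over what gets recorded, you can also log parameters and metrics manually. A minimal sketch (the names and values below are illustrative, not part of this project):

    ```python
    import mlflow

    # Illustrative manual-logging run; parameter and metric names are examples only.
    with mlflow.start_run():
        mlflow.log_param("n_estimators", 1)
        mlflow.log_metric("roc_auc", 0.93)
    ```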
Run the above code from the CLI or from your IDE:

=== "Mac, Linux, Windows"

    ```bash
    python src/modeling.py

    [DEBUG] Initialize Modeling
    [DEBUG] Loading data sets for modeling
    [DEBUG] Runing Random Forest Classifier
    [INFO] Finished modeling with AUC Score: 0.931
    ```
Congratulations! By completing this tutorial, you've run your very first ML experiment. You can now go to your DagsHub repository and see it under the Experiments tab.
[![Experiments tab in the DagsHub repository](assets/track_ml_experiments.png){: style="padding-top:0.7em"}](assets/track_ml_experiments.png){target=_blank}
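Besides browsing runs in the web UI, you can also query them programmatically with MLflow's search API. An optional sanity check, assuming your credentials are configured as above:

```python
import mlflow

# Returns a pandas DataFrame describing the runs logged to the tracking server
runs = mlflow.search_runs()
print(runs[["run_id", "status", "start_time"]].head())
```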