| title | description |
| --- | --- |
| Version Control and ML Experimentation Tutorial with DagsHub – Experiment Tracking | Delve into version control and track machine learning experiments with DagsHub. This tutorial guides you through data exploration, experiment tracking with DagsHub and DVC, and creating a model to classify Stack Exchange questions on machine learning, offering practical workflow enhancements at each step. |
Now that we have a project and the raw data, the next step is to try different types of data processing and models to learn what works better.
In real life, this part is often where things get complicated, difficult to remember, track, and reproduce.
The data versioning we set up will help us keep track of data and model versions, and easily reproduce and share them. But how will we compare the different experiments we're going to run?
It's a sadly common tale: a data scientist gets really good results with some combination of data, model, and hyperparameters, only to later forget exactly what they did and have to rediscover it. This situation gets much worse when multiple team members are involved.
This level of the tutorial shows how using DagsHub's integration with MLflow allows us to easily keep a reproducible record of our experiments, both for ourselves and for our teammates.
The full resulting project can be found here:
See the project on DagsHub

We're now at a point where we can start experimenting with different models, hyperparameters, and data preprocessing. However, we don't have a way to record and compare results yet.
To solve this, we can use MLflow, which will record information about each of our experiments to the MLflow server provided with each DagsHub repository. Then, we can search, visualize, and compare our experiments on DagsHub.
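For reference, `dagshub.init()` configures MLflow's tracking URI to point at the repository's server. As a rough sketch (the `.mlflow` suffix on the repo URL reflects DagsHub's convention, but treat the exact URI format as an assumption), the URI it points to looks like this:

```python
DAGSHUB_REPO_OWNER = "<username>"  # placeholder: your DagsHub username
DAGSHUB_REPO = "DAGsHub-Tutorial"

# DagsHub serves each repository's MLflow tracking server at <repo URL>.mlflow
tracking_uri = f"https://dagshub.com/{DAGSHUB_REPO_OWNER}/{DAGSHUB_REPO}.mlflow"
print(tracking_uri)
```

If you prefer not to use the client library, you could pass this URI to `mlflow.set_tracking_uri()` yourself, along with credentials.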
MLflow is already installed, since it was included in our requirements.txt, so we can start right away by adjusting our code.
!!! tip
    Alternatively, you can download the complete file here: main.py
Let's make the following changes to main.py:
```python
import dagshub
import mlflow

DAGSHUB_REPO_OWNER = "<username>"
DAGSHUB_REPO = "DAGsHub-Tutorial"
dagshub.init(DAGSHUB_REPO, DAGSHUB_REPO_OWNER)


def get_or_create_experiment_id(name):
    exp = mlflow.get_experiment_by_name(name)
    if exp is None:
        exp_id = mlflow.create_experiment(name)
        return exp_id
    return exp.experiment_id
```
Next, update the `train()` function:
```python
def train():
    print('Loading data...')
    train_df = pd.read_csv(train_df_path)
    test_df = pd.read_csv(test_df_path)

    print('Engineering features...')
    train_df = feature_engineering(train_df)
    test_df = feature_engineering(test_df)

    exp_id = get_or_create_experiment_id("tutorial")
    with mlflow.start_run(experiment_id=exp_id):
        print('Fitting TFIDF...')
        train_tfidf, test_tfidf, tfidf = fit_tfidf(train_df, test_df)

        print('Saving TFIDF object...')
        joblib.dump(tfidf, 'outputs/tfidf.joblib')
        mlflow.log_params({'tfidf': tfidf.get_params()})

        print('Training model...')
        train_y = train_df[CLASS_LABEL]
        model = fit_model(train_tfidf, train_y)

        print('Saving trained model...')
        joblib.dump(model, 'outputs/model.joblib')
        mlflow.log_param("model_class", type(model).__name__)
        mlflow.log_params({'model': model.get_params()})

        print('Evaluating model...')
        train_metrics = eval_model(model, train_tfidf, train_y)
        print('Train metrics:')
        print(train_metrics)
        mlflow.log_metrics({f'train__{k}': v for k, v in train_metrics.items()})

        test_metrics = eval_model(model, test_tfidf, test_df[CLASS_LABEL])
        print('Test metrics:')
        print(test_metrics)
        mlflow.log_metrics({f'test__{k}': v for k, v in test_metrics.items()})
```
!!! note
    Notice the calls made to MLflow to log the hyperparameters of the experiment as well as metrics.
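The `eval_model()` function itself isn't shown in this snippet. As a minimal sketch (its real implementation and metric names are assumptions here), it returns a plain dict of metric names to values, which the `f'train__{k}'` / `f'test__{k}'` comprehensions then prefix before logging:

```python
def eval_model(model, X, y):
    # Minimal sketch: compute accuracy from the model's predictions.
    # The tutorial's actual eval_model may compute different metrics.
    preds = model.predict(X)
    correct = sum(int(p == t) for p, t in zip(preds, y))
    return {'accuracy': correct / len(y)}


class ConstantModel:
    """Hypothetical stand-in model that always predicts class 1."""
    def predict(self, X):
        return [1] * len(X)


metrics = eval_model(ConstantModel(), [[0], [1], [2], [3]], [1, 0, 1, 1])
train_metrics = {f'train__{k}': v for k, v in metrics.items()}
print(train_metrics)  # {'train__accuracy': 0.75}
```

The prefixing matters because train and test metrics share names: logging both under `accuracy` would overwrite one with the other, while `train__accuracy` and `test__accuracy` appear as separate columns in the experiments table.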
Commit the changed file:
```bash
git add main.py
git commit -m "Added experiment logging"
```
Now, we can run the first experiment which will be recorded:
```bash
python3 main.py train
```
Now, let's record this baseline experiment's parameters and results. Remember that since we trained a new model, our outputs have changed:
```
$ dvc status
outputs.dvc:
    changed outs:
        modified:           outputs
```
So we should commit them to DVC before committing to Git:
```bash
dvc commit -f outputs.dvc
# DVC will change the contents of outputs.dvc, to record the new hashes
# of the models saved in the outputs directory
git add outputs.dvc
git commit -m "Baseline experiment"
```
Now, we can let our imaginations run free with different configurations for experiments.
Here are a few examples (with a link to the code for them):
* `max_depth` parameter – main.py with different max depth

After each such modification, we'll want to save our code and models. We make sure to commit our code first, because MLflow will link any runs to a particular commit, if run from a Git repository. This lets you match up code changes with experiment results.
We can do that by running a set of commands like this:
=== "Without Branching Strategy"
    ```bash
    git add main.py
    git commit -m "Description of the experiment"
    python3 main.py train
    dvc commit -f outputs.dvc
    git add outputs.dvc
    git commit -m "Results of the experiment"
    ```
=== "With Branching Strategy"
    ```bash
    git add main.py
    git commit -m "Description of the experiment"
    python3 main.py train
    dvc commit -f outputs.dvc
    git checkout -b "Experiment branch name" # We recommend separating distinct experiments to separate branches. Read more in the note below.
    git add outputs.dvc
    git commit -m "Results of the experiment"
    git checkout master
    ```
Of course, it's a good (but optional) idea to change the commit message to something meaningful.
!!! note "Branching strategy for experiments"
    It's often hard to decide what structure to use for your project, and there are no right answers – it depends on your needs and preferences.

    ___Our recommendation___ is to separate distinct experiments (for example, different types of models) into separate branches, while smaller changes between runs (for example, changing model parameters) are consecutive commits on the same branch.
To see our experiments visualized, we can navigate to the "Experiments" tab in our DagsHub repo:
If you want to interact with the experiments table of our pre-made repo, you can find it here.
Here is what our experiments table looked like at this stage, after running a few different configurations:
This table has a row for each detected experiment in your Git history, showing its information and columns for hyperparameters and metrics. Each of these rows corresponds to a single Git commit.
You can interact with this table. Experiments marked as hidden are automatically hidden by default, but you can show them anyway by removing the default filter.
The next logical steps for this project would be to:
Stay tuned for updates to this tutorial, where we will show you how to implement these steps.
In the meantime, if you want to learn more about how to use DVC with DagsHub, you can follow our other tutorial, which focuses on data pipeline versioning & reproducibility.
![To Be Continued...](assets/to_be_continued.png)