Now that we have a project and the raw data, the next step is to try different types of data processing and models to learn what works better.
In real life, this part is often where things get complicated and become difficult to remember, track, and reproduce.
The data versioning we set up will help us keep track of data and model versions, and easily reproduce and share them. But how will we compare the different experiments we're going to run?
It's a sadly common tale: a data scientist gets really good results with some combination of data, model, and hyperparameters, only to later forget exactly what they did and have to rediscover it. This situation gets much worse when multiple team members are involved.
This level of the tutorial shows how using DAGsHub Logger allows us to easily keep a reproducible record of our experiments, both for ourselves and for our teammates.
The full resulting project can be found here: https://dagshub.com/DAGsHub-Official/DAGsHub-Tutorial
We're now at a point where we can start experimenting with different models, hyperparameters, and data preprocessing. However, we don't have a way to record and compare results yet.
To solve this, we can use the DAGsHub logger, which will record information about each of our experiments as Git commits. Then, we can push these Git commits to DAGsHub to search, visualize, and compare our experiments.
The logger is already installed, since it was included in our requirements.txt, so we can start adjusting our code right away.
!!! tip
    Alternatively, you can download the complete file here: main.py
Let's make the following changes to main.py:
Add an import line for the logger at the top of the file:

```python
import dagshub
```

Then, modify the `train()` function:
```python
def train():
    print('Loading data...')
    train_df = pd.read_csv(train_df_path)
    test_df = pd.read_csv(test_df_path)

    print('Engineering features...')
    train_df = feature_engineering(train_df)
    test_df = feature_engineering(test_df)

    with dagshub.dagshub_logger() as logger:
        print('Fitting TFIDF...')
        train_tfidf, test_tfidf, tfidf = fit_tfidf(train_df, test_df)

        print('Saving TFIDF object...')
        joblib.dump(tfidf, 'outputs/tfidf.joblib')
        logger.log_hyperparams({'tfidf': tfidf.get_params()})

        print('Training model...')
        train_y = train_df[CLASS_LABEL]
        model = fit_model(train_tfidf, train_y)

        print('Saving trained model...')
        joblib.dump(model, 'outputs/model.joblib')
        logger.log_hyperparams(model_class=type(model).__name__)
        logger.log_hyperparams({'model': model.get_params()})

        print('Evaluating model...')
        train_metrics = eval_model(model, train_tfidf, train_y)
        print('Train metrics:')
        print(train_metrics)
        logger.log_metrics({f'train__{k}': v for k, v in train_metrics.items()})

        test_metrics = eval_model(model, test_tfidf, test_df[CLASS_LABEL])
        print('Test metrics:')
        print(test_metrics)
        logger.log_metrics({f'test__{k}': v for k, v in test_metrics.items()})
```
!!! note
    Notice the calls made to the logger to log the hyperparameters of the experiment, as well as its metrics.
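If it helps to see the logger's API in isolation, here's a minimal, self-contained sketch using only the calls that appear in the function above. With no arguments, `dagshub_logger()` writes to `metrics.csv` and `params.yml` in the working directory; the values here are placeholders, not real results.

```python
import dagshub

with dagshub.dagshub_logger() as logger:
    # Hyperparameters can be logged as a dict or as keyword arguments
    logger.log_hyperparams({'model': {'alpha': 0.0001}})
    logger.log_hyperparams(model_class='SGDClassifier')

    # Metrics are written to metrics.csv along with a timestamp and step number
    logger.log_metrics({'train__roc_auc': 0.95})
```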
Commit the changed file:
```bash
git add main.py
git commit -m "Added experiment logging"
```
Now, we can run the first experiment, which will be recorded:

```bash
python3 main.py train
```
And note the 2 new files created by the logger:

```bash
$ git status -s
?? metrics.csv
?? params.yml
```
We can take a look at the contents of these files, and see that they're pretty readable:
```bash
$ head params.yml
model:
  alpha: 0.0001
  average: false
  class_weight: null
  early_stopping: false
  epsilon: 0.1
  eta0: 0.0
  fit_intercept: true
  l1_ratio: 0.15
  learning_rate: optimal
```
```bash
$ head metrics.csv
Name,Value,Timestamp,Step
"train__roc_auc",0.9546657196819067,1605478533224,1
"train__average_precision",0.7174193549553161,1605478533224,1
"train__accuracy",0.9192533333333334,1605478533224,1
"train__precision",0.794300518134715,1605478533224,1
"train__recall",0.3681556195965418,1605478533224,1
"train__f1",0.5031178208073515,1605478533224,1
"test__roc_auc",0.8611241864339614,1605478533309,1
"test__average_precision",0.45830796380276095,1605478533309,1
"test__accuracy",0.89608,1605478533309,1
```
You can see the full description of these file formats here.
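Since these are just a plain YAML file and a plain CSV file, you can also load them programmatically. Here's a minimal sketch, assuming `pandas` and `PyYAML` are available in your environment:

```python
import pandas as pd
import yaml

# Load the logged hyperparameters as a nested dict
with open('params.yml') as f:
    params = yaml.safe_load(f)
print(params['model']['alpha'])  # 0.0001

# Load the logged metrics and show the latest value of each metric
metrics = pd.read_csv('metrics.csv')
print(metrics.groupby('Name')['Value'].last())
```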
Now, let's record this baseline experiment's parameters and results. Remember that since we trained a new model, our outputs have changed:
```bash
$ dvc status
outputs.dvc:
    changed outs:
        modified:           outputs
```
So we should commit them to DVC before committing to Git:
```bash
dvc commit -f outputs.dvc
# DVC will change the contents of outputs.dvc, to record the new hashes
# of the models saved in the outputs directory
git add outputs.dvc
git add metrics.csv params.yml
git commit -m "Baseline experiment"
```
Now, we can let our imaginations run free with different configurations for experiments.
Here are a few examples (with a link to the code for them):
* Playing with the `max_depth` parameter – main.py with different max depth (see the sketch after this list)
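To illustrate, here's a hypothetical sketch of what such a variation could look like: a tree-based `fit_model` that makes `max_depth` relevant. This is an illustrative assumption, not the exact code in the linked main.py:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical variation of the project's fit_model function, swapping in a
# tree-based model so that max_depth becomes a tunable hyperparameter.
# Since train() saves the model and logs model.get_params(), this change
# will show up automatically in params.yml and in the experiments table.
def fit_model(train_X, train_y, max_depth=10):
    clf = RandomForestClassifier(max_depth=max_depth, random_state=42)
    clf.fit(train_X, train_y)
    return clf
```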
=== "Without Branching Strategy"
bash python3 main.py train dvc commit -f outputs.dvc git add outputs.dvc main.py metrics.csv params.yml git commit -m "Description of the experiment"
=== "With Branching Strategy"
bash python3 main.py train dvc commit -f outputs.dvc git checkout -b "Experiment branch name" # We recommend separating distinct experiments to separate branches. Read more in the note below. git add outputs.dvc main.py metrics.csv params.yml git commit -m "Description of the experiment" git checkout master
Of course, it's a good (but optional) idea to change the commit message to something meaningful.
!!! note "Branching strategy for experiments" Its often hard to decide what structure to use for your project, and there are no right answers – it depends on your needs and preferences.
___Our recommendation___ is to separate distinct experiments (for example, different types of models) into separate branches, while smaller changes between runs (for example, changing model parameters) are consecutive commits on the same branch.
To really start getting the benefits of DAGsHub, we should now push our Git commit, which captures an experiment and its results, to DAGsHub. That will allow us to visualize results.
```bash
# You may be asked for your DAGsHub username and password when running this command
git push --all
dvc push --all-commits
```
To see our experiments visualized, we can navigate to the "Experiments" tab in our DAGsHub repo:
If you want to interact with the experiments table of our pre-made repo, you can find it here.
Here is what our experiments table looked like at this stage, after running a few different configurations:
This table has a row for each experiment detected in your Git history, with columns for its general information, hyperparameters, and metrics. Each row corresponds to a single Git commit.
You can interact with this table to filter experiments and label them for comparison. For example, experiments labeled `hidden` are automatically hidden by default, but you can show them anyway by removing the default filter.
Stay tuned for updates to this tutorial, where we will show you how to implement the next logical steps for this project.
In the meantime, if you want to learn more about how to use DVC with DAGsHub, you can follow our other tutorial, which focuses on data pipeline versioning & reproducibility.