
Level 2 - Experimentation

Level overview

Now that we have a project and the raw data, and have understood its structure well enough to train a basic model from it, the next step is to try different types of data processing and models to learn what works better.

In real life, this part is often where things get complicated - hard to remember, track, and reproduce.

It's a sadly common tale: a data scientist gets really good results with some combination of data, model, and hyperparameters, only to later forget exactly what they did and have to rediscover it. This situation gets much worse when multiple team members are involved.

This level of the tutorial shows how using the DAGsHub Logger allows us to easily keep a reproducible record of our experiments, both for ourselves and for our teammates.

Too slow for you?

The full resulting project can be found here: https://dagshub.com/DAGsHub-Official/DAGsHub-Tutorial-StackExchange

Writing the basic training code

Let's use our existing insights and code from the data exploration level to get started with a single Python script which:

  1. Loads the data
  2. Processes the data
  3. Trains a classification model
  4. Evaluates the trained model and reports relevant metrics.

We'll put all this in a single script called main.py for now. You can download the complete file here: main.py and save it to your project folder.

Tip

Alternatively, you can create a file called main.py and copy the following into it:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score, accuracy_score, precision_score, recall_score, \
    f1_score
from sklearn.model_selection import train_test_split


def feature_engineering(raw_df):
    df = raw_df.copy()
    df['CreationDate'] = pd.to_datetime(df['CreationDate'])
    df['CreationDate_Epoch'] = df['CreationDate'].astype('int64') // 10 ** 9
    df['MachineLearning'] = df['Tags'].str.contains('machine-learning').fillna(False)
    df = df.drop(columns=['Id', 'Tags'])
    df['Title_Len'] = df.Title.str.len()
    df['Body_Len'] = df.Body.str.len()
    # Drop the correlated features
    df = df.drop(columns=['FavoriteCount'])
    df['Text'] = df['Title'].fillna('') + ' ' + df['Body'].fillna('')
    return df


def fit_tfidf(train_df, test_df):
    tfidf = TfidfVectorizer(max_features=25000)
    tfidf.fit(train_df['Text'])
    train_tfidf = tfidf.transform(train_df['Text'])
    test_tfidf = tfidf.transform(test_df['Text'])
    return train_tfidf, test_tfidf, tfidf


def fit_model(train_X, train_y):
    clf_tfidf = LogisticRegression()
    clf_tfidf.fit(train_X, train_y)
    return clf_tfidf


def eval_model(clf, X, y):
    y_proba = clf.predict_proba(X)[:, 1]
    y_pred = clf.predict(X)
    return {
        'roc_auc': roc_auc_score(y, y_proba),
        'average_precision': average_precision_score(y, y_proba),
        'accuracy': accuracy_score(y, y_pred),
        'precision': precision_score(y, y_pred),
        'recall': recall_score(y, y_pred),
        'f1': f1_score(y, y_pred),
    }


if __name__ == '__main__':
    print('Loading data...')
    df = pd.read_csv('data/CrossValidated-Questions.csv')
    train_df, test_df = train_test_split(df)
    del df

    train_df = feature_engineering(train_df)
    test_df = feature_engineering(test_df)

    print('Fitting TFIDF...')
    train_tfidf, test_tfidf, tfidf = fit_tfidf(train_df, test_df)

    print('Fitting classifier...')
    train_y = train_df['MachineLearning']
    model = fit_model(train_tfidf, train_y)

    train_metrics = eval_model(model, train_tfidf, train_y)
    print('Train metrics:')
    print(train_metrics)

    test_metrics = eval_model(model, test_tfidf, test_df['MachineLearning'])
    print('Test metrics:')
    print(test_metrics)

Running the training script for the first time

We can see that the script works by running:

python3 ./main.py

The output should look more or less like this:

Loading data...
Fitting TFIDF...
Fitting classifier...
Train metrics:
{'roc_auc': 0.9485806946672573, 'average_precision': 0.6257059393253142, 'accuracy': 0.9279466666666667, 'precision': 0.7319116527037319, 'recall': 0.2902446390818484, 'f1': 0.41565743944636674}
Test metrics:
{'roc_auc': 0.9163103279051873, 'average_precision': 0.4982401837282909, 'accuracy': 0.92064, 'precision': 0.5975308641975309, 'recall': 0.22595704948646125, 'f1': 0.32791327913279134}
Tip

If you encounter an error which looks like this:

Traceback (most recent call last):
  File "main.py", line 1, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

That probably means you forgot to activate your virtual environment:

Linux/Mac:
source .venv/bin/activate

Windows:
.venv\Scripts\activate.bat

We can see that we get decent performance considering the problem and how basic our model is, and that it's consistent with what we got during data exploration.

It's a good idea to commit this to Git so we can always get back to a working version:

git add main.py
git commit -m "Basic training script"

Now, we want to see how we can improve on this baseline performance.

Things to improve in the script

That script was nice just to see that everything works, but before we start really experimenting, there are some issues we should fix:

  • Right now, the test set will be different every time we run the script.
    If we want to compare different runs, we need to make sure the test set stays the same across different runs or risk introducing noise and uncertainty into our decision making.
    To fix this, we should do the train-test split as a separate step which we run only once, and train the model in a different step which we will run several times, with different configurations, using the same test set.
  • It's also a good idea to stratify our train-test split by the MachineLearning class, since our classes are imbalanced.
  • We didn't set random seeds - to get reproducible research and leave as little to chance as possible, this is also an important practice.
  • We should save our trained model as a file - otherwise, how will we use it in real life?

Simple things first - let's create a directory to save our outputs in (trained model & preprocessing objects):

mkdir -p outputs
echo /outputs/ >> .gitignore

Note that our outputs are also in .gitignore - you usually won't want to save these using Git, especially if dealing with large models like neural networks.
In our case, the TFIDF object is fairly large.

Now, we'll mainly change our main function so that it supports running either of the two steps, and make a few other code changes to address all the points above. You can download the complete file here: main.py

Tip

Alternatively, you can copy the updated main.py contents here:

import argparse
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score, accuracy_score, precision_score, recall_score, \
    f1_score
from sklearn.model_selection import train_test_split
import joblib

# Consts
CLASS_LABEL = 'MachineLearning'
train_df_path = 'data/train.csv.zip'
test_df_path = 'data/test.csv.zip'


def feature_engineering(raw_df):
    df = raw_df.copy()
    df['CreationDate'] = pd.to_datetime(df['CreationDate'])
    df['CreationDate_Epoch'] = df['CreationDate'].astype('int64') // 10 ** 9
    df = df.drop(columns=['Id', 'Tags'])
    df['Title_Len'] = df.Title.str.len()
    df['Body_Len'] = df.Body.str.len()
    # Drop the correlated features
    df = df.drop(columns=['FavoriteCount'])
    df['Text'] = df['Title'].fillna('') + ' ' + df['Body'].fillna('')
    return df


def fit_tfidf(train_df, test_df):
    tfidf = TfidfVectorizer(max_features=25000)
    tfidf.fit(train_df['Text'])
    train_tfidf = tfidf.transform(train_df['Text'])
    test_tfidf = tfidf.transform(test_df['Text'])
    return train_tfidf, test_tfidf, tfidf


def fit_model(train_X, train_y, random_state=42):
    clf_tfidf = LogisticRegression(random_state=random_state)
    clf_tfidf.fit(train_X, train_y)
    return clf_tfidf


def eval_model(clf, X, y):
    y_proba = clf.predict_proba(X)[:, 1]
    y_pred = clf.predict(X)
    return {
        'roc_auc': roc_auc_score(y, y_proba),
        'average_precision': average_precision_score(y, y_proba),
        'accuracy': accuracy_score(y, y_pred),
        'precision': precision_score(y, y_pred),
        'recall': recall_score(y, y_pred),
        'f1': f1_score(y, y_pred),
    }


def split(random_state=42):
    print('Loading data...')
    df = pd.read_csv('data/CrossValidated-Questions.csv')
    df[CLASS_LABEL] = df['Tags'].str.contains('machine-learning').fillna(False)
    train_df, test_df = train_test_split(df, random_state=random_state, stratify=df[CLASS_LABEL])

    print('Saving split data...')
    train_df.to_csv(train_df_path)
    test_df.to_csv(test_df_path)


def train():
    print('Loading data...')
    train_df = pd.read_csv(train_df_path)
    test_df = pd.read_csv(test_df_path)

    print('Engineering features...')
    train_df = feature_engineering(train_df)
    test_df = feature_engineering(test_df)

    print('Fitting TFIDF...')
    train_tfidf, test_tfidf, tfidf = fit_tfidf(train_df, test_df)

    print('Saving TFIDF object...')
    joblib.dump(tfidf, 'outputs/tfidf.joblib')

    print('Training model...')
    train_y = train_df[CLASS_LABEL]
    model = fit_model(train_tfidf, train_y)

    print('Saving trained model...')
    joblib.dump(model, 'outputs/model.joblib')

    print('Evaluating model...')
    train_metrics = eval_model(model, train_tfidf, train_y)
    print('Train metrics:')
    print(train_metrics)

    test_metrics = eval_model(model, test_tfidf, test_df[CLASS_LABEL])
    print('Test metrics:')
    print(test_metrics)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(title='Split or Train step:', dest='step')
    subparsers.required = True
    split_parser = subparsers.add_parser('split')
    split_parser.set_defaults(func=split)
    train_parser = subparsers.add_parser('train')
    train_parser.set_defaults(func=train)
    parser.parse_args().func()

Finally, to see that it works you can run the two steps one after another. Note that you should only run the split step once during this project's lifetime!

python main.py split
python main.py train
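
As an optional sanity check (not part of the original tutorial steps), you can load the saved artifacts and score a made-up question with them. Here's a minimal sketch, assuming the outputs/ paths used in main.py above; the sample question text is invented for illustration:

import joblib

# Load the TFIDF vectorizer and classifier written by `python main.py train`
tfidf = joblib.load('outputs/tfidf.joblib')
model = joblib.load('outputs/model.joblib')

# The vectorizer was fit on the combined Title + Body text, so pass a single text string
sample = ['How do I tune the hyperparameters of a random forest? I tried grid search but it is slow.']
probability = model.predict_proba(tfidf.transform(sample))[:, 1]
print(probability)  # Estimated probability that the question has the machine-learning tag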

And it's a good idea to commit our changes:

git add main.py .gitignore
git commit -m "Separate split step"

Using the logger to track experiments

We're now at a point where we can start experimenting with different models, hyperparameters, and data preprocessing. However, we don't have a way to record and compare results yet.

To solve this, we can use the DAGsHub logger, which will record information about each of our experiments as Git commits. Then, we can push these Git commits to DAGsHub to search, visualize and compare our experiments.

The logger is already installed, since it was included in our requirements.txt, so we can start adjusting our code right away.

Tip

Alternatively, you can download the complete file here: main.py

Let's make the following changes to main.py:

  • Add an import line to the top of the file:

    import dagshub
    

  • Now, modify the train() function, and add the prepare_log(d, prefix='') function:

    # Prepare a dictionary of either hyperparams or metrics for logging.
    def prepare_log(d, prefix=''):
        if prefix:
            prefix = f'{prefix}__'
    
        # Ensure all logged values are suitable for logging - complex objects aren't supported.
        def sanitize(value):
            return value if value is None or type(value) in [str, int, float, bool] else str(value)
    
        return {f'{prefix}{k}': sanitize(v) for k, v in d.items()}
    
    def train():
        print('Loading data...')
        train_df = pd.read_csv(train_df_path)
        test_df = pd.read_csv(test_df_path)
    
        print('Engineering features...')
        train_df = feature_engineering(train_df)
        test_df = feature_engineering(test_df)
    
        with dagshub.dagshub_logger() as logger:
            print('Fitting TFIDF...')
            train_tfidf, test_tfidf, tfidf = fit_tfidf(train_df, test_df)
    
            print('Saving TFIDF object...')
            joblib.dump(tfidf, 'outputs/tfidf.joblib')
            logger.log_hyperparams(prepare_log(tfidf.get_params(), 'tfidf'))
    
            print('Training model...')
            train_y = train_df[CLASS_LABEL]
            model = fit_model(train_tfidf, train_y)
    
            print('Saving trained model...')
            joblib.dump(model, 'outputs/model.joblib')
            logger.log_hyperparams(model_class=type(model).__name__)
            logger.log_hyperparams(prepare_log(model.get_params(), 'model'))
    
            print('Evaluating model...')
            train_metrics = eval_model(model, train_tfidf, train_y)
            print('Train metrics:')
            print(train_metrics)
            logger.log_metrics(prepare_log(train_metrics, 'train'))
    
            test_metrics = eval_model(model, test_tfidf, test_df[CLASS_LABEL])
            print('Test metrics:')
            print(test_metrics)
            logger.log_metrics(prepare_log(test_metrics, 'test'))
    

Note

Notice the calls made to the logger to record both the experiment's hyperparameters and its metrics.

Note

The prepare_log(d, prefix='') function is needed because some parameters of the model and of the TFIDF vectorizer share the same name. It differentiates their names with a prefix, and also verifies that all logged values are of a supported type.
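
To see concretely what the renaming does, here's a quick interactive example using parameter names that appear in the code above (in your run, the actual values come from get_params()):

>>> prepare_log({'max_features': 25000, 'ngram_range': (1, 1)}, prefix='tfidf')
{'tfidf__max_features': 25000, 'tfidf__ngram_range': '(1, 1)'}
>>> prepare_log({'C': 1.0, 'penalty': 'l2'}, prefix='model')
{'model__C': 1.0, 'model__penalty': 'l2'}

Note how the tuple is converted to a string by sanitize, while plain strings, numbers, booleans, and None pass through unchanged.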

Commit the changed file:

git add main.py
git commit -m "Added experiment logging"

Now, we can run the first experiment which will actually be recorded:

python main.py train

And note the two new files created by the logger:

$ git status -s
?? metrics.csv
?? params.yml

We can take a look at the contents of these files, and see that they're pretty readable.
You can see the full description of these file formats here.
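
If you'd rather peek at them from Python (optional), both files are plain text; a small sketch:

import pandas as pd

# metrics.csv is a regular CSV file, so pandas can load it directly
print(pd.read_csv('metrics.csv'))

# params.yml is human-readable YAML - printing it as text is enough for a quick look
with open('params.yml') as f:
    print(f.read())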

Now, let's record this baseline experiment's parameters and results:

git add metrics.csv params.yml
git commit -m "Baseline experiment"

Running a few more experiments

Now, we can let our imaginations run free with different configurations for experiments.

Here are a few examples of the kinds of changes you could try; one possibility is sketched below:
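
For instance, one hypothetical tweak (not taken from the tutorial's own example list) is to enlarge the TFIDF vocabulary and regularize the classifier more strongly - only the two fitting functions need to change:

def fit_tfidf(train_df, test_df):
    # Example change: include bigrams and a larger vocabulary
    tfidf = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
    tfidf.fit(train_df['Text'])
    train_tfidf = tfidf.transform(train_df['Text'])
    test_tfidf = tfidf.transform(test_df['Text'])
    return train_tfidf, test_tfidf, tfidf


def fit_model(train_X, train_y, random_state=42):
    # Example change: stronger regularization and a higher iteration cap
    clf_tfidf = LogisticRegression(C=0.5, max_iter=1000, random_state=random_state)
    clf_tfidf.fit(train_X, train_y)
    return clf_tfidf

Because the logger records tfidf.get_params() and model.get_params(), the new hyperparameter values show up in params.yml without any extra logging code.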

After each such modification, we'll want to save our code and run a set of commands like this:

python main.py train
git add main.py metrics.csv params.yml
git commit -m "Description of the experiment"

Of course, changing the commit message to something meaningful for each experiment is optional, but a good idea.

Pushing our committed experiments to DAGsHub

To really start getting the benefits of DAGsHub, we should now push our Git commits, each of which captures an experiment and its results, to DAGsHub. That will allow us to visualize and compare results.

# You may be asked for your DAGsHub username and password when running this command
git push origin --all

Visualizing experiments on DAGsHub

To see our experiments visualized, we can navigate to the "Experiments" tab in our DAGsHub repo.

If you want to interact with the experiments table of our pre-made repo, you can find it here.

Here is what our experiments table looked like at this stage, after running a few different configurations:

Experiments table

This table has a row for each detected experiment in your Git history, showing its information and columns for hyperparameters and metrics. Each of these rows corresponds to a single Git commit.

You can interact with this table to:

  • Filter experiments by hyperparameters - for example, by model class.
  • Filter & sort experiments by numeric metric values - i.e. easily find your best experiments, such as those above a minimum F1 test score.
  • Choose the columns to display in the table - by default, we limit the number of columns to a reasonable number.
  • Label experiments for easy filtering.
    Experiments labeled hidden are automatically hidden by default, but you can show them anyway by removing the default filter.
  • See the commit IDs and code of each experiment, for easy reproducibility.
  • Select experiments for comparison.
    For example, we can check the top 3 best experiments, then click on the Compare button to see all 3 of them side by side.

Next Steps

The next logical steps for this project would be to:

  • Encapsulate the data processing steps in a Pipeline, to make it easily runnable in production and not just training (a rough sketch follows this list).
  • Experiment more with data preprocessing and cleaning, and do so in a separate step to save processing time.
  • Add more training data, and see if it improves results.
  • Store the trained models & pipelines in a centrally accessible location, so it's easy to deploy them to production or synchronize with team members.
  • Track different versions of raw data, processed data and models using DVC, to make it easy for collaborators (and yourself) to reproduce experiments.
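
As a taste of the first bullet, here's a minimal, hypothetical sketch of wrapping the TFIDF vectorizer and the classifier in a single scikit-learn Pipeline; the DataFrame-level feature engineering would still need its own step or a custom transformer:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# One object that goes from raw text to a prediction, so inference needs a single
# joblib.load() and a single predict_proba() call.
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=25000)),
    ('model', LogisticRegression(random_state=42)),
])

# Fit on the 'Text' column built by feature_engineering(), then save it as one artifact:
# text_clf.fit(train_df['Text'], train_df[CLASS_LABEL])
# joblib.dump(text_clf, 'outputs/text_clf.joblib')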

Stay tuned for updates to this tutorial, where we will show you how to implement these steps.

In the meantime, if you want to learn more about how to use DVC with DAGsHub, you can follow our other tutorial, which focuses on data pipeline versioning & reproducibility.

To Be Continued...