
Track Experiments

In the previous part of the Get Started section, we learned how to track and push files to DAGsHub using Git and DVC. This part will cover how to track your Data Science Experiments and save their parameters and metrics. We assume you have a project that you want to add experiment tracking to. We will be showing an example based on the result of the last section, but you can adapt it to your project in a straightforward way.

Start From This Part

To start the project from this part, please follow the instructions below.

  • Fork the hello-world repository.
  • Clone the repository and work on the start-track-experiments branch using the following command (change the user name):

    git clone -b start-track-experiments https://dagshub.com/<DAGsHub-user-name>/hello-world.git
    

  • Create and activate a virtual environment.

  • Install the Python dependencies:
    pip3 install -r requirements.txt
    pip3 install dvc
    
  • Configure DVC locally and set DAGsHub storage as the remote.
  • Download the files using the following command:
    dvc get --rev processed-data https://dagshub.com/nirbarazida/hello-world-files data/
    
  • Track the data directory using DVC and the data.dvc file using Git.
  • Push the files to Git and DVC remotes.

Important

To avoid conflicts, work on the start-track-experiments branch for the rest of the tutorial.

Add DAGsHub Logger

DAGsHub logger is a plain Python Logger for your metrics and parameters. The logger saves the information as human-readable files – CSV for metrics files, and YAML for parameters. Once you push these files to your DAGsHub repository, they will be automatically parsed and visualized in the Experiments Tab. For further information please see the Experiment Tab documentation and the DAGsHub Logger repository.

Note

Since DAGsHub Experiments uses generic formats, you don't have to use DAGsHub Logger. Instead, you can write your metrics and parameters into metrics.csv and params.yml files however you want, and push them to your DAGsHub repository, where they will automatically be scanned and added to the experiment tab.
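Since the formats are generic, writing these files by hand requires nothing beyond the standard library. Below is a minimal sketch (the metric name, value, and parameter keys mirror the example output shown later in this section; the simple "key: value" lines are enough for valid YAML here):

```python
import csv
import time

# Write a metrics.csv in the format DAGsHub parses: Name,Value,Timestamp,Step
with open("metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Value", "Timestamp", "Step"])
    # Timestamp is in milliseconds since the epoch
    writer.writerow(["roc_auc_score", 0.931, int(time.time() * 1000), 1])

# Write a minimal params.yml by hand (plain "key: value" YAML lines)
with open("params.yml", "w") as f:
    f.write("model_class: RandomForestClassifier\n")
    f.write("n_estimators: 1\n")
```

Pushing these two files to your repository is all it takes for them to appear in the Experiments Tab.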

  • We will start by installing the 'dagshub' Python package in the project's virtual environment.

    pip3 install dagshub
    
  • Next, we will import 'dagshub' into the modeling.py module and track the Random Forest Classifier hyperparameters and ROC AUC score. You can copy the code below into your modeling.py file:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    import pandas as pd
    from const import *
    import dagshub
    
    print(M_MOD_INIT,'\n'+M_MOD_LOAD_DATA)
    X_train = pd.read_csv(X_TRAIN_PATH)
    X_test = pd.read_csv(X_TEST_PATH)
    y_train = pd.read_csv(Y_TRAIN_PATH)
    y_test = pd.read_csv(Y_TEST_PATH)
    
    print(M_MOD_RFC)
    with dagshub.dagshub_logger() as logger:
        rfc = RandomForestClassifier(n_estimators=1, random_state=0)
        # log the model's parameters
        logger.log_hyperparams(model_class=type(rfc).__name__)
        logger.log_hyperparams({'model': rfc.get_params()})
    
        # Train the model
        rfc.fit(X_train, y_train.values.ravel())
        y_pred = rfc.predict(X_test)
    
        # log the model's performances
        logger.log_metrics({'roc_auc_score': round(roc_auc_score(y_test, y_pred), 3)})
        print(M_MOD_SCORE, round(roc_auc_score(y_test, y_pred),3))
    
Checkpoint

Check that the current status of your Git tracking matches the following:

git status -s
    M modeling.py
  • Track and commit the changes with Git

    git add src/modeling.py
    git commit -m "Add DAGsHub Logger to the modeling module"
    

Create New Experiment

As mentioned above, to create a new experiment we need to update at least one of the two files, metrics.csv or params.yml, track them using Git, and push them to the DAGsHub repository. After editing the modeling.py module, running its script will generate those two files.

  • Run the modeling.py script

    python3 src/modeling.py
        [DEBUG] Initialize Modeling
        [DEBUG] Loading data sets for modeling
        [DEBUG] Running Random Forest Classifier
        [INFO] Finished modeling with AUC Score: 0.931
    git status -s
        ?? metrics.csv
        ?? params.yml
    
  • As we can see from the above output, two new files were created containing the current experiment's information.

The Files Content

The metrics.csv file has four fields:

  • Name - the name of the Metric.
  • Value - the value of the Metric.
  • Timestamp - the time that the log was written.
  • Step - the step number when logging multi-step metrics like loss.
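The Step field is what distinguishes repeated logs of the same metric, such as a loss logged once per epoch. A stdlib-only sketch of what such a multi-step metrics.csv could look like (the loss values here are made up for illustration):

```python
import csv
import time

# Append one metrics row per training step; the Step column tells the epochs apart
with open("metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Value", "Timestamp", "Step"])
    for step, loss in enumerate([0.9, 0.5, 0.3], start=1):  # dummy loss values
        writer.writerow(["loss", loss, int(time.time() * 1000), step])
```

Each row shares the metric name "loss" but carries its own step number, which is how the Experiments Tab can chart the metric over time.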

The params.yml file holds all the hyperparameters of the Random Forest Classifier.

Example of the files content:

cat metrics.csv
    Name,Value,Timestamp,Step
    "roc_auc_score",0.931,1615794229099,1
cat params.yml
    model:
      bootstrap: true
      ccp_alpha: 0.0
      class_weight: null
      criterion: gini
      max_depth: null
      max_features: auto
      max_leaf_nodes: null
      max_samples: null
      min_impurity_decrease: 0.0
      min_impurity_split: null
      min_samples_leaf: 1
      min_samples_split: 2
      min_weight_fraction_leaf: 0.0
      n_estimators: 1
      n_jobs: null
      oob_score: false
      random_state: 0
      verbose: 0
      warm_start: false
    model_class: RandomForestClassifier
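Because the experiment files are plain text, they are also easy to consume from scripts. For example, the metric row shown above can be read back with the standard library's csv module (the sample string below is copied from the `cat metrics.csv` output):

```python
import csv
import io

# The same rows shown by `cat metrics.csv` above
sample = "Name,Value,Timestamp,Step\nroc_auc_score,0.931,1615794229099,1\n"

rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["Name"], float(rows[0]["Value"]))
```

In a real script you would open metrics.csv directly instead of the in-memory sample.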
  • Commit and push the files to our DAGsHub repository using Git

    git add metrics.csv params.yml
    git commit -m "New Experiment - Random Forest Classifier with basic processing"
    git push
    
  • Let's check the new status of our repository. The two files were added to the repository, and one experiment was created.

  • The information about the experiment is displayed under the Experiment Tab. Congratulations - You created your first Experiment!

This part covers the Experiment Tracking workflow. We highly recommend reading the Experiment Tab documentation to explore the various features it has to offer. In the next part, we will learn how to explore a new hypothesis and switch between project versions.