
# DagsHub Tracking

DagsHub Tracking unlocks fully reproducible experiments using Git. Git is a cornerstone tool for managing data science projects: it lets us track, version, and reproduce code easily. DagsHub therefore builds on Git and extends its capabilities to track experiments as well. By tracking an experiment with Git, we can also encapsulate the code, data, and model that produced its results. This way, even as the project evolves or grows in complexity, we can easily reproduce experimental results.

## How does DagsHub Tracking work?

Creating a new experiment using Git tracking can be done in two ways:

### Track files with specific names

You can save the experiment's information to open-format files whose names end with params.yml / params.json for parameters and metrics.csv for metrics. Then, track them using Git and push them to the remote repository. DagsHub parses the files on the Git server, looking for these specific names, and when it finds a new or modified file, it generates a new experiment with the information the file contains.
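For instance, a training script can write such a parameters file directly with the standard library; no special tooling is required. Here is a minimal sketch (the logs/ directory and the parameter values are illustrative):

```python
import json
import os

# Hyperparameters - a flat mapping of names to string, number, or boolean values.
params = {"batch_size": 32, "learning_rate": 0.02, "max_nb_epochs": 2}

# Save to a file whose name ends with params.json so DagsHub will detect it.
os.makedirs("logs", exist_ok=True)
with open("logs/params.json", "w") as f:
    json.dump(params, f, indent=4)
```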

### Track files with DVC

When creating a pipeline with DVC, you define output files for parameters and/or metrics. You then use Git to track these files, together with the updated dvc.yaml and dvc.lock files, and push them to the remote repository. DagsHub parses the dvc.yaml and dvc.lock files on the Git server and looks for the relevant parameters and metrics files. When it finds a new or modified file, it generates a new experiment with the information they contain.
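For illustration, a dvc.yaml stage that declares parameters and a metrics file might look roughly like the following sketch (the stage name, script, and file paths are placeholders):

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    # Parameter keys, read from params.yaml by default
    params:
      - batch_size
      - learning_rate
    # Metrics files tracked by Git rather than the DVC cache
    metrics:
      - logs/test_metrics.csv:
          cache: false
```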

## What should the experiment files contain?

The parameters and metrics files should hold the data in the following formats.

### Parameter schema

The parameters format is a set of simple key: value pairs, saved as a .yaml or .json file. Each value can be a string, a number, or a boolean. Here is a simple example:

=== "params.yaml" yaml batch_size: 32 learning_rate: 0.02 max_nb_epochs: 2 === "params.json" json { "batch_size": 32, "learning_rate": 0.02, "max_nb_epochs": 2 }

### Metrics schema

For metrics, the format is a .csv file with the following headers:

  • Name can be any string or number.
  • Value can be any real number.
  • Timestamp is the UNIX epoch time in milliseconds.
  • Step represents the training step/iteration/batch index. It can be any positive integer. This will serve as the X-axis in most metric charts.

This format enables you to save multiple metrics in one metrics file by varying the Name column as you write to the file. DagsHub knows to plot a graph for each metric where needed and to show you the last value of each one. Here is a simple example:

=== "metrics.csv" csv Name,Value,Timestamp,Step loss,2.29,1573118607163,1 epoch,0,1573118607163,1 loss,2.26,1573118607366,11 epoch,0,1573118607366,11 loss,1.44,1573118607572,21 epoch,0,1573118607572,21 loss,0.65,1573118607773,31 avg_val_loss,0.17,1573118812491,3375

## How to create a new experiment using DagsHub Tracking?

To help you log experiment information in the format described above, DagsHub created the open-source "dagshub logger". With it, you can log parameters and metrics to files from within your Python scripts, or use auto-logging with libraries such as PyTorch Lightning and fast.ai.

!!!info "Installing & Using the logger"
    Use pip to install the logger:

    ```bash
    $ pip3 install dagshub
    ```

=== "Manual Logging" ```python from dagshub import dagshub_logger, DAGsHubLogger

# Option 1 - As a context manager:
with dagshub_logger( metrics_path="logs/test_metrics.csv", hparams_path="logs/test_params.yml") as logger:

    # Metric logging: 
    logger.log_metrics(loss=3.14, step_num=1) 
    # OR: 
    logger.log_metrics({'loss': 3.14}, step_num=1) 
    # Hyperparameters logging: 
    logger.log_hyperparams(optimizer='sgd') 
    # OR: 
    logger.log_hyperparams({'optimizer': 'sgd'}) 

# Option 2 - As a normal Python object: 
logger = DAGsHubLogger(metrics_path="logs/test_metrics.csv", hparams_path="logs/test_params.yml") 
logger.log_hyperparams(optimizer='sgd') 
logger.log_metrics(loss=3.14, step_num=1) 
# ... 
logger.save() 
logger.close()
```

=== "Auto-logging: PyTorch Lightning" ```python

from dagshub.pytorch_lightning import DAGsHubLogger
from pytorch_lightning import Trainer

trainer = Trainer(
          logger=DAGsHubLogger(metrics_path="logs/test_metrics.csv", hparams_path="logs/test_params.yml"),
          default_save_path='lightning_logs',
                 )

```

=== "Auto-logging: fast.ai" ```python

from dagshub.fastai import DAGsHubLogger

# To log only during a single training phase
learn.fit(..., cbs=DAGsHubLogger(metrics_path="logs/test_metrics.csv",
                                hparams_path="logs/test_params.yml"))

```

Running the above script will generate two files: logs/test_metrics.csv and logs/test_params.yml. Use Git to track those files and push them to the remote repository:

```bash
$ git add logs/test_metrics.csv logs/test_params.yml
$ git commit -m "New experiment - learning rate 1e-4"
$ git push
```

The above action will generate a new experiment.

![Git experiment](assets/git-experiment.png)
For more information about the frameworks supported by auto-logging, see the dagshub logger's documentation.

## How to use DagsHub Tracking in a Colab environment?

We shared an example of experiment tracking with Git to DagsHub in a Colab environment.

## When to use DagsHub Tracking?

Using DagsHub Tracking to log experiments enables you to reproduce their results easily. However, in cases where you don't need to reproduce the results, it can be a hassle. Therefore, we recommend using DagsHub Tracking for the experiments that produced meaningful results you might want to reproduce in the future.
