Hugging Face Transformers¶
Hugging Face Transformers is an open-source machine-learning library built on top of PyTorch and TensorFlow that provides pre-trained models for natural language processing tasks. With it, developers and researchers can easily fine-tune pre-trained models on their own datasets, or train their own models from scratch.
With DagsHub, you can easily log the experiments you run with Hugging Face Transformers to a remote server with minimal changes to your code.
This includes versioning raw and processed data with DVC, as well as logging experiment metrics, parameters, and trained models with MLflow. The integration lets you keep using the familiar MLflow interface while making it easy to collaborate with others, compare results across runs, and make data-driven decisions.
How do Hugging Face Transformers work with DagsHub?¶
DagsHub leverages the callback hooks exposed by Hugging Face's Transformers library to inject code at specific points during the training run. These hooks log information about the run, such as metrics and artifacts, to the DagsHub remote, using configuration provided via environment variables set before the trainer runs.
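Concretely, the hook is a standard Trainer callback; recent versions of Transformers expose it as transformers.integrations.DagsHubCallback and enable it automatically when the dagshub package is installed. A minimal sketch of requesting it explicitly (the report_to value and the automatic detection are version-dependent, so treat this as an assumption):
import os
from transformers import TrainingArguments

os.environ["HF_DAGSHUB_LOG_ARTIFACTS"] = "True"  # read by the callback at training time

training_args = TrainingArguments(output_dir="experiment-name", report_to="dagshub")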
How to log experiments with Transformers and DagsHub?¶
Log your transformer experiments in 3 simple steps:
Install DagsHub¶
pip install dagshub
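The integration logs through MLflow under the hood, so depending on your environment you may also need Transformers and MLflow themselves (an assumption about a typical fresh setup; skip any package you already have):
pip install transformers mlflow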
Configure DagsHub¶
import dagshub
import os
dagshub.init(repo_name='Repository-Name', repo_owner='Username')
os.environ["HF_DAGSHUB_LOG_ARTIFACTS"]= "True" # optional; if disabled, only logs metrics!
- dagshub.init connects your local machine to your DagsHub account and repository, including the remote MLflow tracking server and DagsHub Storage. If the repository you provide as input doesn't exist, it will automatically be created for you.
- Running this command requires authenticating your DagsHub user. To automate this process, set your DagsHub token in the DAGSHUB_USER_TOKEN environment variable, as shown in the sketch below.
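A minimal sketch of automated, non-interactive authentication; the token value below is a placeholder:
import os
import dagshub

os.environ["DAGSHUB_USER_TOKEN"] = "<your-token>"  # placeholder; never hard-code real tokens
dagshub.init(repo_name='Repository-Name', repo_owner='Username')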
Note: the environment variable must be set before you initialize the Trainer.
Optional Environment Variables
The following optional environment variables can also be configured.
os.environ["HF_DAGSHUB_MODEL_NAME"] = "model name" # defaults to 'main'
os.environ["BRANCH"] = "branch" # defaults to 'main'
Configure Hugging Face Transformers¶
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir="experiment-name")
trainer = Trainer(..., args=training_args)  # pass your model and datasets as usual
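Putting the steps together, here is a minimal end-to-end sketch; the model, dataset, and hyperparameters are illustrative assumptions, not requirements of the integration:
import os

import dagshub
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dagshub.init(repo_name='Repository-Name', repo_owner='Username')
os.environ["HF_DAGSHUB_LOG_ARTIFACTS"] = "True"  # must be set before the Trainer is created

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Tiny dataset slice, just to keep the example fast.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

training_args = TrainingArguments(output_dir="experiment-name", num_train_epochs=1)
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()  # metrics (and artifacts, if enabled) are logged to your DagsHub repository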
Great job! The integration is now complete. Transformers will automatically detect that the DagsHub integration is active and add its hook to your pipeline, so every run will be logged to your DagsHub repository.
Additional Resources¶
- DagsHub x Hugging Face - learn more about DagsHub x Hugging Face Transformers integration.
- Example notebook - create your own transformer model and track your experiments.
Known Issues, Limitations & Restrictions¶
Artifacts created during training are not overwritten when the same experiment is run multiple times; however, the experiments themselves are still logged and can be tracked.
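One possible workaround (a suggestion, not an officially supported fix) is to give each run a unique output directory so artifacts from repeated runs don't collide:
from datetime import datetime
from transformers import TrainingArguments

# Timestamped output directory, so repeated runs write to distinct artifact paths.
run_dir = f"experiment-name-{datetime.now():%Y%m%d-%H%M%S}"
training_args = TrainingArguments(output_dir=run_dir)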