MNIST example with MLflow

In this example, we train a PyTorch Lightning model to predict handwritten digits, using early stopping. The code, adapted from this repository, is almost entirely dedicated to model training. The only addition is a single mlflow.pytorch.autolog() call, which enables automatic logging of parameters, metrics, and models, including the best model from early stopping.
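As a rough illustration of where the autolog call sits relative to training, here is a minimal sketch. It is not the example's actual script: the function name train_with_autolog, its arguments, and the default patience are hypothetical, and it assumes the model logs a "val_loss" metric during validation.

```python
def train_with_autolog(model, datamodule, max_epochs=5, patience=3):
    """Fit a LightningModule with MLflow autologging and early stopping.

    Hypothetical helper for illustration; `model` is assumed to be a
    LightningModule that logs "val_loss" in its validation step.
    """
    # Imported lazily so this sketch can be read and imported without the
    # heavy dependencies installed.
    import mlflow.pytorch
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import EarlyStopping

    # The single call that turns on automatic logging. It has to run
    # before trainer.fit() so the training run is captured.
    mlflow.pytorch.autolog()

    # Stop once "val_loss" has not improved for `patience` validation checks.
    early_stopping = EarlyStopping(
        monitor="val_loss", mode="min", patience=patience, verbose=True
    )
    trainer = pl.Trainer(max_epochs=max_epochs, callbacks=[early_stopping])
    trainer.fit(model, datamodule=datamodule)
    return trainer
```

Everything apart from the autolog line is ordinary Lightning training code, which is the point the paragraph above makes.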

Running the code

To run the example via MLflow, navigate to the mlflow/examples/pytorch/MNIST/example1 directory and run the command

mlflow run .

This runs the example with the default parameters, such as --max_epochs=5. The default values are defined in the MLproject file.

In order to run the file with custom parameters, run the command

mlflow run . -P max_epochs=X

where X is your desired value for max_epochs.

If the required modules are already installed and you want to skip creating a conda environment, add the --no-conda argument (newer MLflow releases replace this flag with --env-manager=local).

mlflow run . --no-conda
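When launching runs from a Python script, the commands above can be assembled programmatically. The sketch below only builds the argument list. The helper name build_mlflow_run_command is hypothetical and not part of MLflow.

```python
import shlex


def build_mlflow_run_command(params=None, project_uri=".", no_conda=False):
    """Build the argv list for `mlflow run` with optional -P overrides.

    Hypothetical helper: it only mirrors the CLI usage shown in this README.
    """
    cmd = ["mlflow", "run", project_uri]
    # Each parameter override becomes its own "-P name=value" pair.
    for name, value in (params or {}).items():
        cmd += ["-P", f"{name}={value}"]
    if no_conda:
        cmd.append("--no-conda")
    return cmd


cmd = build_mlflow_run_command({"max_epochs": 10}, no_conda=True)
print(shlex.join(cmd))  # mlflow run . -P max_epochs=10 --no-conda
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the example directory to start the run.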

Viewing results in the MLflow UI

Once the code is finished executing, you can view the run's metrics, parameters, and details by running the command

mlflow ui

and navigating to http://localhost:5000.

For more details on MLflow tracking, see the docs.

Passing custom training parameters

The parameters can be overridden via the command line:

  1. max_epochs - Number of epochs to train the model. Training can be interrupted early via Ctrl+C
  2. gpus - Number of GPUs to train on
  3. accelerator - Accelerator backend (e.g. "ddp" for the Distributed Data Parallel backend) to use for training. By default, no accelerator is used.
  4. batch_size - Input batch size for training
  5. num_workers - Number of worker processes used to load training data
  6. lr - Learning rate
  7. patience - Early stopping: number of validation checks with no improvement after which training stops
  8. mode - Early stopping: "min" or "max", depending on whether the monitored metric should decrease or increase
  9. monitor - Early stopping: name of the metric to monitor (e.g. "val_loss")
  10. verbose - Early stopping: whether to print early stopping messages
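To make the roles of patience and mode concrete, here is a simplified sketch of the early stopping rule applied to a sequence of monitored values. It is an illustration only, not PyTorch Lightning's implementation (for example, it ignores any minimum-improvement threshold), and the function name should_stop is hypothetical.

```python
def should_stop(history, patience, mode="min"):
    """Return True once the metric has gone `patience` checks without improving.

    `history` is the sequence of monitored values (e.g. "val_loss" per
    validation check). Simplified illustration, not Lightning's own code.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    best = None
    checks_without_improvement = 0
    for value in history:
        if best is None:
            improved = True
        elif mode == "min":
            improved = value < best  # lower is better, e.g. a loss
        else:
            improved = value > best  # higher is better, e.g. an accuracy
        if improved:
            best = value
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return True
    return False


print(should_stop([0.9, 0.8, 0.85, 0.83], patience=2))  # True
print(should_stop([0.9, 0.8, 0.85, 0.70], patience=2))  # False
```

In the first call the loss fails to beat 0.8 on two consecutive checks, so training would stop. In the second, the improvement to 0.70 resets the counter.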

For example:

mlflow run . -P max_epochs=5 -P gpus=1 -P batch_size=32 -P num_workers=2 -P learning_rate=0.01 -P accelerator="ddp" -P patience=5 -P mode="min" -P monitor="val_loss" -P verbose=True

Or to run the training script directly with custom parameters:

python <training_script>.py \
    --max_epochs 5 \
    --gpus 1 \
    --accelerator "ddp" \
    --batch_size 64 \
    --num_workers 3 \
    --lr 0.001 \
    --es_patience 5 \
    --es_mode "min" \
    --es_monitor "val_loss" \
    --es_verbose True
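For a sense of how a script could accept the flags shown above, here is a minimal argparse sketch. It is not the example's actual parser: the default values (other than max_epochs=5, which the README states) and the str_to_bool helper are assumptions made for illustration.

```python
import argparse


def str_to_bool(text):
    """Parse "True"/"False"-style strings, since `--es_verbose True` arrives as text."""
    value = text.strip().lower()
    if value in ("true", "1", "yes"):
        return True
    if value in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")


def build_parser():
    """Illustrative parser; defaults other than max_epochs are assumed."""
    parser = argparse.ArgumentParser(description="MNIST training (illustrative parser)")
    parser.add_argument("--max_epochs", type=int, default=5)
    parser.add_argument("--gpus", type=int, default=0)
    parser.add_argument("--accelerator", type=str, default=None)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--num_workers", type=int, default=3)
    parser.add_argument("--lr", type=float, default=0.001)
    # Early stopping options carry the es_ prefix used in the command above.
    parser.add_argument("--es_patience", type=int, default=3)
    parser.add_argument("--es_mode", choices=["min", "max"], default="min")
    parser.add_argument("--es_monitor", type=str, default="val_loss")
    parser.add_argument("--es_verbose", type=str_to_bool, default=False)
    return parser


args = build_parser().parse_args(
    ["--max_epochs", "5", "--lr", "0.001", "--es_verbose", "True"]
)
print(args.max_epochs, args.lr, args.es_verbose)  # 5 0.001 True
```

Parsing the boolean explicitly matters here: with a plain type=bool, argparse would treat the string "False" as truthy.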

Logging to a custom tracking server

To configure MLflow to log to a custom (non-default) tracking location, set the MLFLOW_TRACKING_URI environment variable, e.g. via export MLFLOW_TRACKING_URI=http://localhost:5000/. For more details, see the docs.
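The same environment variable can also be set from Python before launching a run. The sketch below only prepares the environment mapping. The helper name env_with_tracking_uri is hypothetical, and the URI is the one used in the example above.

```python
import os


def env_with_tracking_uri(tracking_uri, base_env=None):
    """Return a copy of the environment with MLFLOW_TRACKING_URI set.

    Hypothetical helper; the current process environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MLFLOW_TRACKING_URI"] = tracking_uri
    return env


env = env_with_tracking_uri("http://localhost:5000/", base_env={})
print(env)  # {'MLFLOW_TRACKING_URI': 'http://localhost:5000/'}
```

In practice, omit base_env so the copy inherits the current environment (including PATH), then pass it to a child process, for example subprocess.run(["mlflow", "run", "."], env=env, check=True). Within a Python process that already imports MLflow, calling mlflow.set_tracking_uri(...) is an alternative.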