
Train a Text-to-Image Model Using Kubeflow

Kubeflow May 31, 2023

If you’ve been following along in our Kubeflow series, you should now know how to install Kubeflow locally and how to set up Kubeflow on AWS using Terraform. If you missed those, go check them out. This post will wait for you. Promise.

In this post, we’re going to look at training a model on a Kubeflow cluster! It’s probably what you’ve been waiting for, right?

What will the training architecture look like?

An overview of the architecture looks like this:

Architecture diagram: inside Kubeflow, a Minimagen Pod streams the LAION dataset from DagsHub via DDA and logs parameters, metrics, and artifacts to MLflow, which is also hosted on DagsHub.

For this project, we’re going to be training a mini version of Google’s Imagen model, cleverly dubbed Minimagen. Furthermore, we’re going to use a subset of the LAION-Aesthetics V2 dataset, some of the same data used to train Stable Diffusion.

Since we’re training on a Kubeflow cluster, we have to think about how we get data into and out of the cluster. This is where some of the features and integrations we’ve added to DagsHub over the years come in.

We’ve downloaded the images from the LAION-Aesthetics V2 dataset that have an aesthetics score of 6.5 or higher and put them into a public repository anyone can access. This repo contains 540k images and weighs in at over 87GB! To avoid the upfront cost of downloading the entire dataset before we start, we can use DagsHub’s Direct Data Access (DDA) to stream the data from the repo to the Kubeflow cluster.

Additionally, we can log the parameters, metrics, and artifacts of the training run to MLflow.

All communication between our Kubeflow Pod and DagsHub will be handled automatically using the dagshub and mlflow Python libraries! Easy peasy, lemon squeezy.

Project Details

The heart of our training project is in this repo. It contains everything used to create a Docker image, which will be spun up in a Pod within our Kubeflow cluster. While you won’t need to do anything with this repo directly, it’s important to understand how everything fits together. This way, you’ll have a head start when implementing it in future projects.

train.py 🚂

Let’s start with the training script. If you take a look at the top of the file, you’ll see that some environment variables are required:

DAGSHUB_TOKEN = os.environ.get('DAGSHUB_TOKEN', None)
DAGSHUB_USER_NAME = os.environ.get('DAGSHUB_USER_NAME', None)
DAGSHUB_REPO_NAME = os.environ.get('DAGSHUB_REPO_NAME', None)

if DAGSHUB_TOKEN is None:
    raise EnvironmentError(
        "Environment variable 'DAGSHUB_TOKEN' must be set with valid token"
    )
if DAGSHUB_USER_NAME is None:
    raise EnvironmentError(
        "Environment variable 'DAGSHUB_USER_NAME' must be set"
    )
if DAGSHUB_REPO_NAME is None:
    raise EnvironmentError(
        "Environment variable 'DAGSHUB_REPO_NAME' must be set"
    )

This is so we can set up communication between your repo and the Pod running the training. Later we’ll show how to specify these environment variables.

Next, we set up the DagsHub dataset repo we’ll be streaming from:

import dagshub

# 1
dagshub.auth.add_app_token(DAGSHUB_TOKEN)

# 2
from dagshub.streaming import install_hooks
install_hooks(project_root='.', 
              repo_url='https://dagshub.com/DagsHub-Datasets/LAION-Aesthetics-V2-6.5plus', 
              branch='main')

Here we:

  1. Authenticate to DagsHub using the token we provided via the environment variable
  2. Set up DDA by installing some Python hooks based on the LAION dataset repo. These hooks detect calls to Python's built-in file operations. If the files exist in the DagsHub repo, it will load them on the fly as they're requested. Otherwise, it falls back to the local file system. Neat!
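
To make this concrete, here’s a minimal sketch of what reading a streamed file looks like once install_hooks has run. The path below is hypothetical; the actual file layout in the LAION repo may differ.

from PIL import Image

# After install_hooks(), ordinary file operations work transparently.
# If the file isn't on local disk, DDA fetches it from the DagsHub repo
# on demand. (The path shown here is hypothetical.)
img = Image.open('data/some_image.jpg')
print(img.size)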

After that, we need to set up the connection to the MLflow server associated with our repo:

os.environ['MLFLOW_TRACKING_URI']=f"https://dagshub.com/{DAGSHUB_USER_NAME}/{DAGSHUB_REPO_NAME}.mlflow"

os.environ['MLFLOW_TRACKING_USERNAME'] = DAGSHUB_USER_NAME
os.environ['MLFLOW_TRACKING_PASSWORD'] = DAGSHUB_TOKEN

import mlflow

This allows us to use functions such as mlflow.log_params, mlflow.log_metrics, and mlflow.log_artifact to log things to MLflow. In fact, that’s exactly what the following code at the bottom of the training script does:

# 1
experiment_id = get_experiment_id('minimagen')
with mlflow.start_run(experiment_id=experiment_id):
    # 2
    mlflow.log_params(args.__dict__)

    # 3
    mlflow.log_params({
        "unet0_params": unets_params[0],
        "unet1_params": unets_params[1],
        "imagen_params": imagen_params
    })

    # 4
    mlflow.log_param("model_size_MB", model_size_MB)

    # 5
    mlflow.log_metrics({
        "unet0_avg_loss": avg_losses[0],
        "unet1_avg_loss": avg_losses[1],
    })

    # 6
    mlflow.log_artifact(dir_path)

After training is done, this code:

  1. Gets or creates a new MLflow experiment based on the name 'minimagen'
  2. Logs the arguments used to run the training script
  3. Logs the parameters for the different pieces of the model
  4. Logs the size of the model in MBs, as calculated earlier in the script
  5. Logs the average loss of each of the two Unets
  6. And finally logs the output directory, which includes the trained model
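
The get_experiment_id helper in step #1 isn’t part of MLflow itself; it’s a small convenience function defined in the training script. A get-or-create helper along those lines might look like this (a sketch, not necessarily the exact implementation in the repo):

import mlflow

def get_experiment_id(name):
    # Return the ID of an existing MLflow experiment, creating it if it doesn't exist yet.
    experiment = mlflow.get_experiment_by_name(name)
    if experiment is not None:
        return experiment.experiment_id
    return mlflow.create_experiment(name)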

The rest of the script is adapted from the original Minimagen repo. The only major difference is the use of a custom data loader, which was needed due to the format of the dataset.
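
As a rough illustration of what such a loader involves, here’s a minimal PyTorch Dataset that pairs streamed images with their captions. The CSV path and column names are assumptions made for this example; the real loader in the repo handles the dataset’s actual layout.

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class StreamedImageCaptionDataset(Dataset):
    # Hypothetical loader: images are fetched on demand via DDA when opened.

    def __init__(self, captions_csv, transform=None):
        # Assumed layout: a CSV with 'image_path' and 'caption' columns.
        self.records = pd.read_csv(captions_csv)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        row = self.records.iloc[idx]
        image = Image.open(row['image_path']).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        return image, row['caption']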

Dockerfile 🚢

The Dockerfile is relatively short for this project:

# 1
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

# 2
WORKDIR /opt/workdir
COPY . /opt/workdir

# 3
RUN apt-get update; apt-get install -y --no-install-recommends \
    g++ \
    git

# 4
RUN pip install --no-cache-dir --upgrade -r /opt/workdir/requirements.txt

# 5
RUN pip install --no-cache-dir --upgrade torch torchvision

# 6
ENTRYPOINT ["python", "train.py", "--PARAMETERS", "./parameters"]

In summary, it:

  1. Starts with a PyTorch image that includes CUDA support, so we can train using NVIDIA GPUs
  2. Copies the files from the repo into the working directory it defines
  3. Installs system dependencies required by at least one of the Python libraries we need to install
  4. Installs the Python libraries necessary for the project
  5. Reinstalls PyTorch and TorchVision, because one of the requirements in step #4 decided to unnecessarily downgrade our chosen versions ಠ_ಠ
  6. Defines the entry point for our Docker image, which runs our training script. Various model parameters are defined in the copied-over parameters folder.

Speaking of Python requirements, check out the requirements.txt to see what we need to install directly:

dagshub
minimagen
mlflow
pillow

We’ve gone over most of these already. minimagen is the package from the original Minimagen repo. pillow is a powerful image-processing library, often used with PyTorch and ML projects.

Testing locally

To build your own Docker image based on this repo, run the following command from the top-level repo path:

docker build -t minimagen .

Once it’s finished building, you can test it with the following command:

docker run --rm --name minimagen-test \
        --gpus all \
        -e DAGSHUB_TOKEN=<token> \
        -e DAGSHUB_USER_NAME=<username> \
        -e DAGSHUB_REPO_NAME=<repo_to_write_to> \
        minimagen:latest --BATCH_SIZE 2 --TIMESTEPS 25 --TESTING

Here we’re:

  • setting the name for the running container to minimagen-test
  • (optionally) enabling GPUs for training
  • passing in the three environment variables we mentioned the train.py script needs
  • specifying we want to spin up the Docker image tagged as minimagen:latest
  • and passing additional arguments to our training script

Everything after the Docker image tag is appended to what’s specified in the ENTRYPOINT section of the Dockerfile. So in this case, the command the Docker container runs when it spins up would look like this:

python train.py --PARAMETERS ./parameters \
                --BATCH_SIZE 2 \
                --TIMESTEPS 25 \
                --TESTING

When everything is working properly, your output should look something like this:

2023-05-17 13:11:31.310 - INFO     Training size: 8
2023-05-17 13:11:31.310 - INFO     Validation size: 8
0it [00:00, ?it/s]
-------------------- EPOCH 1 --------------------

----------Training...----------
Downloading: 100%|██████████| 773k/773k [00:00<00:00, 1.96MB/s]
Downloading: 100%|██████████| 2.27k/2.27k [00:00<00:00, 2.03MB/s]
Downloading: 100%|██████████| 1.18k/1.18k [00:00<00:00, 1.47MB/s]
Downloading: 100%|██████████| 231M/231M [00:02<00:00, 101MB/s]
Some weights of the model checkpoint at t5-small were not used when ...
- This IS expected if you are initializing T5EncoderModel from the ...
- This IS NOT expected if you are initializing T5EncoderModel from ...
Some weights of T5EncoderModel were not initialized from the model ...
You should probably TRAIN this model on a down-stream task to be able ...

----------Validation...----------t/s]
100%|██████████| 4/4 [00:04<00:00,  1.15s/it]
4it [00:17,  4.46s/it]00:04<00:00,  1.18s/it]
0it [00:00, ?it/s]Unet 0 avg validation loss:  tensor(1.0752, device='cuda:0')
Unet 1 avg validation loss:  tensor(1.0622, device='cuda:0')

-------------------- EPOCH 2 --------------------

----------Training...----------

----------Validation...----------
100%|██████████| 4/4 [00:00<00:00, 18.77it/s]
4it [00:00,  5.72it/s]00:00<00:00, 21.44it/s]
Unet 0 avg validation loss:  tensor(1.0575, device='cuda:0')
Unet 1 avg validation loss:  tensor(1.0436, device='cuda:0')
100%|██████████| 4/4 [00:00<00:00, 17.79it/s]

How do I train a model on a Kubeflow cluster?

Now comes the fun part! We’re going to assume that you have a Kubeflow cluster running somewhere. If you need to get one set up on AWS, check out our previous post.

Since Kubeflow builds on top of Kubernetes, can you guess what you need to deploy our training image? That’s right! YAML!

Meme of SpongeBob SquarePants smiling with a sparkling rainbow between his hands. At the top of the rainbow is the word "YAML". The rainbow colors are backward for some reason, with violet on the top and red on the bottom.

From the machine where you control your Kubeflow cluster, create a new YAML file called minimagen.yaml and add the following to it:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: minimagen-train
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: public.ecr.aws/j0t5n4n3/dagshub/minimagen-train
              imagePullPolicy: Always
              args: ["--BATCH_SIZE", "2", "--TIMESTEPS", "25", "--TESTING"]
              env:
                - name: DAGSHUB_TOKEN
                  value: "<token>"
                - name: DAGSHUB_USER_NAME
                  value: "<username>"
                - name: DAGSHUB_REPO_NAME
                  value: "<repo_to_write_to>"
              resources:
                  limits:
                      nvidia.com/gpu: 1

The important parts to note here are:

  • kind is set to PyTorchJob. PyTorchJob is a Kubernetes custom resource to run PyTorch training jobs on Kubernetes.
  • we use the Master label under pytorchReplicaSpecs, since we’re not doing distributed training. If you are doing distributed training, you would also have a Worker label under pytorchReplicaSpecs (a sketch follows this list).
  • args is where we pass the additional arguments that are appended to our Docker image’s ENTRYPOINT
  • env is where we specify the environment variables we need
  • resources is where we tell Kubeflow that this Docker container needs a GPU to run on
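
For reference, a Worker section slots in next to Master under pytorchReplicaSpecs, roughly like this (a sketch only; replica counts and resources depend on your cluster):

    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: public.ecr.aws/j0t5n4n3/dagshub/minimagen-train
              imagePullPolicy: Always
              resources:
                limits:
                  nvidia.com/gpu: 1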

Once you have the YAML file ready, you can start training on your Kubeflow cluster using the following command:

kubectl apply -f minimagen.yaml

This will schedule a new Pod to run your Docker container, using the parameters specified in the YAML file.

To make sure everything is working, run:

kubectl get pods -n kubeflow | grep minimagen

You should see something like this:

minimagen-train-master-0                                 2/2     Running   0            3h36m
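
To follow the training output as it happens, you can tail the Pod’s logs:

kubectl logs -f minimagen-train-master-0 -n kubeflow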

How do I delete the training pod?

If for whatever reason you need to stop the pod, you can do so by running:

kubectl delete -f minimagen.yaml

Where can we go from here?

This post, along with the previous ones, has shown you the basics of spinning up a Kubeflow cluster and starting training on a single GPU node. However, the real power of Kubeflow comes when you need to perform distributed training across multiple nodes. Take a look at this example YAML for how to get started with distributed training.

Additional improvements you could make include:

  • Increase the size of Minimagen — As a starting point, look at the JSON files in the ./parameters folder to change the parameters of the internal Minimagen models
  • Register model to MLflow — Create an mlflow.PyFunc wrapper around the trained Minimagen model and register it to the MLflow Model Registry. This will allow you to easily deploy the model. For a reference on how to do that, check out our MLflow Crash Course on the Model Registry.
  • Inference — Once you’ve registered the trained model to the MLflow Registry, then the next step is to try deploying it! You can do this either through Kubeflow or other means. More information on setting up an inference service can be found here and here.
  • Log real-time training metrics to MLflow — Currently, the training metrics are only logged at the end of training. It could be more interesting (and useful!) to see how the training metrics evolve during the training run (see the sketch below).
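
For that last one, a rough sketch of the idea (not the repo’s current code) is to call mlflow.log_metrics inside the epoch loop and pass a step value, so MLflow records a time series instead of a single point. train_one_epoch and num_epochs below are hypothetical stand-ins for the existing training loop:

# Hypothetical: log the average losses after every epoch instead of once at the end.
with mlflow.start_run(experiment_id=experiment_id):
    for epoch in range(num_epochs):
        avg_losses = train_one_epoch()  # stand-in for the existing per-epoch training code
        mlflow.log_metrics({
            "unet0_avg_loss": avg_losses[0],
            "unet1_avg_loss": avg_losses[1],
        }, step=epoch)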

Each of these improvements is an open issue on the repo. PRs are welcome! If you’d like help or advice on how to get started, feel free to ask 🙂


Congratulations! Through this series of posts, you’ve learned a lot about setting up Kubeflow and using it to train a model. You deserve a high five!


There’s so much more to Kubeflow for you to discover. Check out how you can use notebooks and setup pipelines for starters!

If you have any questions, feel free to reach out. You can join our Discord, where we’ve built a vibrant, helpful and friendly community.
