---
comments: true
description: Learn how to train YOLOv5 on multiple GPUs for optimal performance. Guide covers single and multiple machine setups with DistributedDataParallel.
keywords: YOLOv5, multiple GPUs, machine learning, deep learning, PyTorch, data parallel, distributed data parallel, DDP, multi-GPU training
---
This guide explains how to properly use multiple GPUs to train a dataset with YOLOv5 🚀 on single or multiple machine(s).
Clone repo and install requirements.txt in a Python>=3.8.0 environment, including PyTorch>=1.8. Models and datasets download automatically from the latest YOLOv5 release.
```bash
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
```
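Before launching a multi-GPU run, it can be worth confirming that your PyTorch install actually sees all of the GPUs you intend to use. A minimal check (nothing YOLOv5-specific, just standard PyTorch):

```bash
# Prints the number of visible CUDA devices, e.g. 2 on a dual-GPU machine
python -c "import torch; print(torch.cuda.device_count())"
```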
!!! tip "ProTip!"

    **Docker Image** is recommended for all Multi-GPU trainings. See [Docker Quickstart Guide](../environments/docker_image_quickstart_tutorial.md) <a href="https://hub.docker.com/r/ultralytics/yolov5"><img src="https://img.shields.io/docker/pulls/ultralytics/yolov5?logo=docker" alt="Docker Pulls"></a>
!!! tip "ProTip!"

    `torch.distributed.run` replaces `torch.distributed.launch` in **[PyTorch](https://www.ultralytics.com/glossary/pytorch)>=1.9**. See [PyTorch distributed documentation](https://docs.pytorch.org/docs/stable/distributed.html) for details.
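    In practice the migration is usually just a module-name swap; the `--nproc_per_node` flag and the script arguments keep the same meaning. The commented line below is a sketch of the old-style invocation for comparison:

    ```bash
    # PyTorch < 1.9 used torch.distributed.launch, e.g.:
    # python -m torch.distributed.launch --nproc_per_node 2 train.py ...

    # PyTorch >= 1.9: use torch.distributed.run instead
    python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1
    ```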
Select a pretrained model to start training from. Here we select YOLOv5s, the smallest and fastest model available. See our README table for a full comparison of all models. We will train this model with Multi-GPU on the COCO dataset. As a baseline, this is the standard single-GPU training command:
```bash
python train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0
```
You can increase `--device` to use multiple GPUs in DataParallel mode:
```bash
python train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1
```
This method is slow and barely speeds up training compared to using just 1 GPU.
For Multi-GPU DistributedDataParallel (DDP) mode (recommended), you will have to pass `python -m torch.distributed.run --nproc_per_node`, followed by the usual arguments:
```bash
python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1
```
- `--nproc_per_node` specifies how many GPUs you would like to use. In the example above, it is 2.
- `--batch` is the total batch size. It will be divided evenly across each GPU. In the example above, it is 64/2=32 per GPU.

The code above will use GPUs 0... (N-1).
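For instance, keeping the same total batch size but spreading it across 4 GPUs means each GPU receives 64/4 = 16 images per step:

```bash
# --batch 64 over 4 GPUs -> 16 images per GPU
python -m torch.distributed.run --nproc_per_node 4 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1,2,3
```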
To use specific GPUs, simply pass `--device` followed by your chosen GPU IDs. For example, in the code below, we will use GPUs `2,3`.
```bash
python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --device 2,3
```
SyncBatchNorm could increase accuracy for multi-GPU training; however, it will slow down training by a significant factor. It is only available for Multi-GPU DistributedDataParallel training. It is best used when the batch size on each GPU is small (<= 8). To use SyncBatchNorm, simply pass `--sync-bn` to the command like below:
```bash
python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --sync-bn
```
Multi-machine training is only available for Multi-GPU DistributedDataParallel training. Before we continue, make sure the files on all machines are the same: dataset, codebase, etc. Afterward, make sure the machines can communicate with each other.
You will have to choose a master machine (the machine that the others will talk to). Note down its address (`master_addr`) and choose a port (`master_port`). I will use `master_addr = 192.168.1.1` and `master_port = 1234` for the example below. To use it, run the following:
```bash
# On master machine 0
python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank 0 --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''

# On machine R
python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank R --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''
```
where `G` is the number of GPUs per machine, `N` is the number of machines, and `R` is the machine number from `0...(N-1)`. Let's say I have two machines with two GPUs each; it would be `G = 2`, `N = 2`, and `R = 1` for the above.
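Plugging those example values into the template above (and reusing `master_addr = 192.168.1.1` and `master_port = 1234`), the two launches would look like this:

```bash
# On the master machine (node_rank 0), which has 2 GPUs
python -m torch.distributed.run --nproc_per_node 2 --nnodes 2 --node_rank 0 --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''

# On the second machine (node_rank 1), also with 2 GPUs
python -m torch.distributed.run --nproc_per_node 2 --nnodes 2 --node_rank 1 --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''
```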
Training will not start until all `N` machines are connected. Output will only be shown on the master machine!
Notes:

- Windows support is untested, Linux is recommended.
- `--batch` must be a multiple of the number of GPUs.
- GPU 0 will take slightly more memory than the other GPUs as it maintains EMA and is responsible for checkpointing etc.
- If you get `RuntimeError: Address already in use`, it could be because you are running multiple trainings at a time. To fix this, simply use a different port number by adding `--master_port` like below:
```bash
python -m torch.distributed.run --master_port 1234 --nproc_per_node 2 ...
```
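If you are unsure which ports are already taken, you can check before launching; this assumes a Linux host with the `ss` utility available (most modern distributions ship it):

```bash
# List listening TCP sockets and filter for port 1234; no output means the port is free
ss -ltn | grep ':1234'
```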
DDP profiling results on an AWS EC2 P4d instance with 8x A100 SXM4-40GB GPUs, training YOLOv5l for 1 COCO epoch.
```bash
# prepare
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all -v "$(pwd)"/coco:/usr/src/coco $t
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
cd .. && rm -rf app && git clone https://github.com/ultralytics/yolov5 -b master app && cd app
cp data/coco.yaml data/coco_profile.yaml

# profile
python train.py --batch-size 16 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0
python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 32 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1
python -m torch.distributed.run --nproc_per_node 4 train.py --batch-size 64 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1,2,3
python -m torch.distributed.run --nproc_per_node 8 train.py --batch-size 128 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1,2,3,4,5,6,7
```
| GPUs (A100) | batch-size | CUDA_mem device0 (GB) | COCO train | COCO val |
| ----------- | ---------- | --------------------- | ---------- | -------- |
| 1x          | 16         | 26                    | 20:39      | 0:55     |
| 2x          | 32         | 26                    | 11:43      | 0:57     |
| 4x          | 64         | 26                    | 5:57       | 0:55     |
| 8x          | 128        | 26                    | 3:09       | 0:57     |
As shown in the results, using DistributedDataParallel with multiple GPUs provides nearly linear scaling in training speed. With 8 GPUs, training completes approximately 6.5 times faster than with a single GPU, while maintaining the same memory usage per device.
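For reference, the speedup follows directly from the table: a training time of 20:39 versus 3:09 for one COCO epoch is a ratio of 1239/189 ≈ 6.6, i.e. roughly 6.5 times faster on 8 GPUs, with the per-GPU batch size held at 16 in every run.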
If an error occurs, please read through the checklist first! It could save you time. If you went through all of the above and the problem persists, feel free to raise an Issue, giving as much detail as possible and following the template.
Ultralytics provides a range of ready-to-use environments, each pre-installed with essential dependencies such as CUDA, CUDNN, Python, and PyTorch, to kickstart your projects.
The YOLOv5 CI badge indicates that all YOLOv5 GitHub Actions Continuous Integration (CI) tests are successfully passing. These CI tests rigorously check the functionality and performance of YOLOv5 across various key aspects: training, validation, inference, export, and benchmarks. They ensure consistent and reliable operation on macOS, Windows, and Ubuntu, with tests conducted every 24 hours and upon each new commit.
We would like to thank @MagicFrogSJTU, who did all the heavy lifting, and @glenn-jocher for guiding us along the way.