SuperGradients allows users to train models in several device modes: CPU, single GPU, DataParallel (DP), and Distributed Data Parallel (DDP).
CPU
Requirement: None.
How to use it: If you don't have any CUDA device available, your training will automatically run on CPU. Otherwise, the default device will be CUDA, but you can still easily set it to CPU using setup_device, as follows:
from super_gradients import Trainer
from super_gradients.training.utils.distributed_training_utils import setup_device
setup_device(device='cpu')
# Unchanged
trainer = Trainer(...)
trainer.train(...)
GPU (single device)
Requirement: Having at least one CUDA device available.
How to use it: If you have at least one CUDA device, nothing! It will be used by default. Otherwise, you will have to use the CPU.
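If you prefer to be explicit rather than rely on the default, the following minimal sketch assumes that setup_device accepts device='cuda' symmetrically to the device='cpu' call shown above:
from super_gradients import Trainer
from super_gradients.training.utils.distributed_training_utils import setup_device
# Assumption: 'cuda' is accepted here the same way 'cpu' is in the CPU example above.
setup_device(device='cuda')
# Unchanged
trainer = Trainer(...)
trainer.train(...)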
DataParallel (DP) is a single-process, multi-threaded technique for scaling deep learning model training across multiple GPUs on a single machine.
The general flow is as follows: the input batch is split across the GPUs, the model is replicated on each of them, the forward passes run in parallel, and the outputs are gathered back on the primary GPU.
For more detailed information, feel free to check out this blog for a more in-depth explanation.
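For intuition, the sketch below shows roughly what DP amounts to in plain PyTorch. It is illustrative only: with SuperGradients you never wrap the model yourself, you simply call setup_device as shown below.
import torch
from torch import nn

model = nn.Linear(10, 2)                      # any model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)            # single process, one worker thread per GPU
model = model.to("cuda")

inputs = torch.randn(32, 10, device="cuda")
outputs = model(inputs)                       # the batch is split across GPUs, outputs are gathered on GPU 0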
Requirement: Having at least one CUDA device available.
How to use it: All you need to do is call the magic function setup_device before instantiating the Trainer.
from super_gradients import Trainer
from super_gradients.training.utils.distributed_training_utils import setup_device
# Launch DP on 4 GPUs
setup_device(multi_gpu='DP', num_gpus=4)
# Unchanged
trainer = Trainer(...)
trainer.train(...)
Tip: To optimize runtime, we recommend calling setup_device as early as possible.
Distributed Data Parallel (DDP) is a powerful technique for scaling deep learning model training across multiple GPUs. It involves the use of multiple processes, each running on a different GPU and having its own instance of the model. The processes communicate only to exchange gradients, making it a highly efficient and more scalable solution for training large models than Data Parallel (DP).
Although DDP can be more complex to set up than DP, the SuperGradients library abstracts away the complexity by handling the setup process behind the scenes. This makes it easy for users to take advantage of the benefits of DDP without having to worry about technical details. We highly recommend using DDP over DP whenever possible.
For more detailed information, feel free to check out this blog for a more in-depth explanation.
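For intuition, the sketch below shows roughly what the per-process setup looks like in plain PyTorch. It is illustrative only: SuperGradients performs the equivalent setup for you when you call setup_device as shown below.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_worker(rank: int, world_size: int, model: torch.nn.Module):
    # One process per GPU; every process joins the same process group.
    # Rendezvous details (e.g. MASTER_ADDR/MASTER_PORT) are normally handled by the launcher.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(model.to(rank), device_ids=[rank])

    # ... training loop: each process consumes its own shard of the data,
    # and gradients are averaged across processes during backward().

    dist.destroy_process_group()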
Requirement: Having multiple CUDA devices available
How to use it: All you need to do is call the magic function setup_device before instantiating the Trainer.
from super_gradients import Trainer
from super_gradients.training.utils.distributed_training_utils import setup_device
# Launch DDP on 4 GPUs
setup_device(num_gpus=4) # Equivalent to: setup_device(multi_gpu='DDP', num_gpus=4)
# Unchanged
trainer = Trainer(...)
trainer.train(...)
Tip: To optimize runtime, we recommend calling setup_device as early as possible.
When running DDP, you will work with multiple processes that each go through the whole training loop. This means that if you run DDP on 4 GPUs, any action you take will be run 4 times.
This especially impacts printing, logging, and file writing. To address this, SuperGradients provides a decorator that ensures only one process executes a given function, whether you work with CPU, GPU, DP, or DDP.
In the following example, the print_hello function will print Hello world only once; without the decorator, it would be printed 4 times.
from super_gradients.training.utils.distributed_training_utils import setup_device
from super_gradients.common.environment.ddp_utils import multi_process_safe
setup_device(num_gpus=4)
@multi_process_safe # Try with and without this decorator
def print_hello():
    print('Hello world')
print_hello()
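Under the hood, this kind of decorator boils down to a rank check. The following is a rough sketch of the idea, not the actual SuperGradients implementation:
import torch.distributed as dist

def run_on_main_process_only(fn):
    def wrapper(*args, **kwargs):
        # Execute only on rank 0, or when no distributed process group is running at all.
        if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
            return fn(*args, **kwargs)
    return wrapper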
As explained, multiple processes are used to train a model with DDP, each on its own GPU. This means that the metrics must be computed and aggregated across all the processes, and it requires the metric to be implemented using states.
States are attributes to be reduced. They are defined using the built-in method add_state() and enable broadcasting of the states among the different ranks when calling the compute() method.
An example of a state would be the number of correct predictions, which will be summed across the different processes, and broadcasted to all of them before computing the metric value. You can see an example below.
Feel free to check torchmetrics documentation for more information on how to implement your own metric.
Example: In the following example, we start with a custom metric implemented to run on a single device:
import torch
from torchmetrics import Metric

class Top5Accuracy(Metric):
    def __init__(self):
        super().__init__()
        self.correct = torch.tensor(0.)
        self.total = torch.tensor(0.)

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        batch_size = target.size(0)

        # Get the top k predictions
        _, pred = preds.topk(5, 1, True, True)
        pred = pred.t()

        # Count the number of correct predictions only for the highest 5
        correct = pred.eq(target.view(1, -1).expand_as(pred))
        correct5 = correct[:5].reshape(-1).float().sum(0)
        self.correct += correct5.cpu()
        self.total += batch_size

    def compute(self):
        return self.correct.float() / self.total
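To sanity-check this metric on a single device, you can exercise it with random tensors (the shapes and values below are illustrative only):
metric = Top5Accuracy()
preds = torch.randn(32, 100)             # a batch of 32 samples with 100 class scores each
target = torch.randint(0, 100, (32,))    # ground-truth class indices
metric.update(preds, target)
print(metric.compute())                  # fraction of samples whose label is within the top 5 scores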
All you need to change to use your metric with DDP is to define your attributes self.correct and self.total with add_state, and to define a reduce function dist_reduce_fx that tells torchmetrics how to combine the states when calling compute:
import torch
import torchmetrics

class DDPTop5Accuracy(torchmetrics.Metric):
    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        self.add_state("correct", default=torch.tensor(0.), dist_reduce_fx="sum")  # Set correct to be a state
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")     # Set total to be a state

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        batch_size = target.size(0)

        # Get the top k predictions
        _, pred = preds.topk(5, 1, True, True)
        pred = pred.t()

        # Count the number of correct predictions only for the highest 5
        correct = pred.eq(target.view(1, -1).expand_as(pred))
        correct5 = correct[:5].reshape(-1).float().sum(0)
        self.correct += correct5
        self.total += batch_size

    def compute(self):
        return self.correct.float() / self.total
Step by step explanation
1. The update() method modifies the internal state of each instance in each process, based on the inputs preds and target specific to that process.
2. compute() triggers torchmetrics.Metric to gather and combine the states of the different processes. This reduction step can be customized by setting dist_reduce_fx, which in this case is sum. This usually happens at the end of the epoch.
3. The compute() method then calculates the metric value according to your implementation. In this example, every process will return the same result: 0.6 (180 correct predictions out of 300 total predictions).
4. reset() resets the internal state of the metric, making it ready to accumulate new data at the start of the next epoch.

Using N GPUs in DDP mode has the effect of increasing the effective batch size by a factor of N, and it has been shown that it may be necessary to scale the learning rate accordingly. The rule of thumb is that if the batch size is increased by a factor of N (or N GPUs are used in DDP), the learning rate should also be increased by a factor of N. For example, a recipe tuned with an initial learning rate of 0.01 on a single GPU would be a reasonable candidate for 0.04 on 4 GPUs.
However, when it comes to adaptive optimizers like Adam, the situation is a bit different. Adaptive optimizers like Adam automatically adjust the learning rate for each parameter based on the historical gradient information. They inherently adapt to the scale of the gradients and don't require manual adjustments of the learning rate in the same way as fixed learning rate methods like SGD.
That being said, we still recommend trying out different learning rates to see the impact on the final metrics. You can run experiments manually or use Hydra sweep syntax to run experiments with custom learning rates, as follows:
python -m super_gradients.train_from_recipe -m --config-name=coco2017_yolo_nas_s training_hyperparams.initial_lr=1e-3,5e-3,1e-4
When using recipes, you simply need to set the values of gpu_mode and num_gpus.
# training_recipe.yaml
defaults:
  - ...

...

# Simply add this to run DDP on 4 GPUs.
gpu_mode: DDP
num_gpus: 4
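The recipe is then launched as usual, for example (assuming the file above is saved as training_recipe.yaml somewhere Hydra can discover it):
python -m super_gradients.train_from_recipe --config-name=training_recipe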