- Superset of `torch.distributed.launch`; used to launch a distributed training run.

Functionalities:
- Worker failures are handled gracefully by restarting all workers.
- Worker `RANK` and `WORLD_SIZE` are assigned automatically.
- The number of nodes is allowed to change between minimum and maximum sizes (elasticity). See the sketch after this list.
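As a minimal sketch of how these pieces fit together, the worker script below reads `RANK` and `WORLD_SIZE` from the environment variables the launcher assigns rather than hard-coding them. The script name (`train.py`), the rendezvous endpoint, and the flag values in the comment are illustrative assumptions, not fixed requirements; the `--nnodes=1:4` min:max range is what enables elasticity, and `--max-restarts` bounds how many times failed workers are restarted.

```python
# Minimal elastic worker sketch, assuming it is saved as train.py
# (hypothetical name) and launched with, for example:
#
#   torchrun --nnodes=1:4 --nproc-per-node=2 \
#       --rdzv-backend=c10d --rdzv-endpoint=localhost:29400 \
#       --max-restarts=3 train.py
#
import os

import torch
import torch.distributed as dist


def main() -> None:
    # The launcher assigns these environment variables to every worker;
    # the script never hard-codes its rank or the world size.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # "gloo" works on CPU-only machines; use "nccl" when each worker
    # owns a GPU.
    dist.init_process_group(backend="gloo")

    # Trivial collective to confirm the group is wired up: every rank
    # contributes 1, so the all-reduced sum equals the world size.
    t = torch.ones(1)
    dist.all_reduce(t)
    print(f"rank {rank}/{world_size} (local {local_rank}): sum={t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Because rank assignment happens at rendezvous time, the same script runs unchanged whether one node or four nodes join the job.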