- Superset of `torch.distributed.launch`.
- Related to launching a distributed training run.
Functionalities
- Worker failures are handled gracefully by restarting all workers.
- Worker `RANK` and `WORLD_SIZE` are assigned automatically.
- The number of nodes is allowed to change between a minimum and a maximum size (elasticity).
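Because `RANK` and `WORLD_SIZE` (along with `LOCAL_RANK`) are assigned by the launcher and exported as environment variables to each worker process, a training script can simply read them instead of parsing command-line arguments. A minimal sketch (the helper name `get_dist_info` is illustrative, not part of any API):

```python
import os

def get_dist_info():
    # torchrun exports these environment variables to every worker process.
    # Defaults make the script also runnable standalone (single process).
    rank = int(os.environ.get("RANK", 0))            # global rank across all nodes
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of workers
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # rank within this node
    return rank, world_size, local_rank
```

With these defaults the same script works both under the launcher and as a plain single-process run.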