Functionalities

  1. Worker failures are handled gracefully by restarting all workers.
  2. Worker RANK and WORLD_SIZE are assigned automatically.
  3. The number of nodes may change between a minimum and a maximum size (elasticity).
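
As a minimal sketch of how a worker script might consume the automatically assigned values, the helper below reads RANK and WORLD_SIZE at startup. It assumes the launcher exposes them as environment variables with those names. The function name `read_worker_identity` and the single-process fallback values are hypothetical and for illustration only.

```python
import os


def read_worker_identity(env=None):
    """Return (rank, world_size) parsed from launcher-style variables.

    Assumption: the launcher exports RANK and WORLD_SIZE as environment
    variables. `env` can be any mapping, which makes the helper easy to test.
    """
    env = os.environ if env is None else env

    # Hypothetical fallback: behave like a single-process run when unset.
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))

    # A rank must always index into the current world.
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is out of range for world size {world_size}")
    return rank, world_size


if __name__ == "__main__":
    rank, world_size = read_worker_identity()
    print(f"worker {rank} of {world_size}")
```

Since all workers are restarted after a failure and the number of nodes can change, it is probably safest to read these values once at the start of each run rather than persisting them in checkpoints or other state that outlives a restart.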