• Since deep neural networks usually consist of a stack of vertical layers, it feels straightforward to split a large model by layer, where a small set of consecutive layers is grouped into one partition on one worker.

  • However, a naive implementation that runs every data batch through these workers one after another, respecting their sequential dependency, leads to large bubbles of idle waiting time and severe under-utilization of compute resources (see the sketch below).
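
  To make the sequential dependency concrete, here is a minimal sketch, assuming PyTorch and two visible GPUs; the `TwoStageNet` module, its layer sizes, and the batch shape are all illustrative, not from the text. Each stage lives on its own device, and the batch must finish one stage before the next can begin:

  ```python
  # A minimal sketch of naive layer-wise model parallelism across two workers.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class TwoStageNet(nn.Module):
      def __init__(self):
          super().__init__()
          # Worker 0 holds the first group of consecutive layers.
          self.stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
          # Worker 1 holds the next group of consecutive layers.
          self.stage2 = nn.Linear(512, 10).to("cuda:1")

      def forward(self, x):
          # Worker 1 idles while worker 0 computes its forward pass...
          h = self.stage1(x.to("cuda:0"))
          # ...then worker 0 idles while worker 1 computes its forward pass.
          return self.stage2(h.to("cuda:1"))

  model = TwoStageNet()
  x = torch.randn(32, 512)
  y = torch.randint(0, 10, (32,))
  loss = F.cross_entropy(model(x), y.to("cuda:1"))
  # backward() replays the dependency in reverse: worker 0 cannot start
  # its backward pass until worker 1 has finished its own. These forced
  # waits are the "bubbles" of idle time.
  loss.backward()
  ```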

  • Bubbles

    • Because gradients flow from the last layer back to the first, the first worker has to wait until all subsequent workers have finished their backward passes before it can compute its own backward pass; a rough accounting of this idle time follows below.
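
    As a back-of-the-envelope accounting (a sketch under stated assumptions: K workers, a single batch, and equal per-stage forward and backward times), each worker is busy for only 2 of the 2K sequential time slots, so the idle fraction is:

    ```latex
    % Bubble (idle) fraction of a naive K-stage pipeline on one batch,
    % assuming equal per-stage forward and backward times.
    \[
      \text{bubble fraction} = \frac{2K - 2}{2K} = 1 - \frac{1}{K}
    \]
    ```

    With K = 4 workers, each worker already sits idle 75% of the time.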