• Since deep neural networks usually consist of a stack of vertical layers, it feels straightforward to split a large model by layer, where a small set of consecutive layers is grouped into one partition on one worker.

  • However, a naive implementation that runs every data batch through these workers one after another, respecting their sequential dependency, leads to large bubbles of idle waiting time and severe under-utilization of compute resources (see the sketch below).
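
  To make the sequential dependency concrete, here is a minimal sketch, assuming PyTorch and two visible GPUs; the `TwoStageNet` module, its layer sizes, and the batch shape are all illustrative, not from the text. Each stage lives on its own device, and the batch must finish one stage before the next can begin:

  ```python
  # A minimal sketch of naive layer-wise model parallelism across two workers.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class TwoStageNet(nn.Module):
      def __init__(self):
          super().__init__()
          # Worker 0 holds the first group of consecutive layers.
          self.stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
          # Worker 1 holds the next group of consecutive layers.
          self.stage2 = nn.Linear(512, 10).to("cuda:1")

      def forward(self, x):
          # Worker 1 idles while worker 0 computes its forward pass...
          h = self.stage1(x.to("cuda:0"))
          # ...then worker 0 idles while worker 1 computes its forward pass.
          return self.stage2(h.to("cuda:1"))

  model = TwoStageNet()
  x = torch.randn(32, 512)
  y = torch.randint(0, 10, (32,))
  loss = F.cross_entropy(model(x), y.to("cuda:1"))
  # backward() replays the dependency in reverse: worker 0 cannot start
  # its backward pass until worker 1 has finished its own. These forced
  # waits are the "bubbles" of idle time.
  loss.backward()
  ```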

  • Bubbles

    • Because gradients flow from the last layer back to the first, the first worker has to wait until all subsequent workers have finished their backward passes before it can compute its own backward pass; a rough accounting of this idle time follows below.
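
    As a back-of-the-envelope accounting (a sketch under stated assumptions: K workers, a single batch, and equal per-stage forward and backward times), each worker is busy for only 2 of the 2K sequential time slots, so the idle fraction is:

    ```latex
    % Bubble (idle) fraction of a naive K-stage pipeline on one batch,
    % assuming equal per-stage forward and backward times.
    \[
      \text{bubble fraction} = \frac{2K - 2}{2K} = 1 - \frac{1}{K}
    \]
    ```

    With K = 4 workers, each worker already sits idle 75% of the time.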