• Why?
    • How is the interaction between CPU and GPU organized in Torch and other frameworks?

What is it

  • Kernels (functions executed on the GPU) are submitted from the CPU to the GPU in the order in which they should execute.

    • To avoid the GPU idling while it waits for the CPU, kernels are enqueued ahead of the computations and executed asynchronously.
  • Within a single stream, kernels always execute in the order in which they were submitted by the CPU.

  • If we want kernels to run in parallel, we need to launch them in different streams.

    • Note that if kernels in different streams use the same resources, they may fail to run in parallel or may run very slowly.
    • This is useful for overlapping communication and computation (see the sketch after this list),
    • e.g. launching an all-reduce for a given layer as soon as its backward is done, so that the compute stream can go on with the next layer’s backward.
  • Example usage is explained in NCCL.
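  • A minimal sketch (in Torch, assuming a CUDA device is available) of enqueuing independent work on two streams so the kernels may overlap on the GPU. The stream names and the placeholder kernels are illustrative, and the commented-out all-reduce assumes an initialized torch.distributed process group.

    import torch

    compute_stream = torch.cuda.Stream()   # illustrative name for the main compute stream
    comm_stream = torch.cuda.Stream()      # illustrative name for the communication stream

    x = torch.randn(4096, 4096, device="cuda")
    grad = torch.randn(4096, 4096, device="cuda")

    with torch.cuda.stream(compute_stream):
        y = x @ x                          # queued on compute_stream, runs in order there

    with torch.cuda.stream(comm_stream):
        # Work queued here may overlap with the matmul above. In practice this would
        # be a communication kernel, e.g. (with torch.distributed initialized):
        # torch.distributed.all_reduce(grad)
        grad.mul_(0.5)                     # placeholder kernel standing in for communication

    torch.cuda.synchronize()               # block the CPU until both streams have finished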

Communication between streams

  • To synchronize streams with one another, you can use the “event” primitive (event = torch.cuda.Event() in Torch). We can record an event in a stream (event.record(stream)), which appends it to the end of that stream like a tiny marker kernel. Another stream can then wait on this event (event.wait(another_stream)): any work submitted to that stream afterwards will not start until the first stream has reached the event.
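  • A minimal sketch of this pattern (the stream and variable names are illustrative): the consumer stream waits on an event recorded in the producer stream, so its work only starts once the producer has finished the matmul.

    import torch

    producer = torch.cuda.Stream()   # illustrative names
    consumer = torch.cuda.Stream()
    event = torch.cuda.Event()

    x = torch.randn(2048, 2048, device="cuda")

    with torch.cuda.stream(producer):
        y = x @ x                    # enqueued on the producer stream
    event.record(producer)           # marker appended after the matmul in `producer`

    event.wait(consumer)             # future work on `consumer` waits for the marker
    with torch.cuda.stream(consumer):
        z = y + 1                    # only starts once the matmul has completed

    torch.cuda.synchronize()         # block the CPU until both streams have finished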