- Why?
- How is the interaction between CPU and GPU organized in Torch and other frameworks?
What is it
- Kernels (functions executed on the GPU) are submitted by the CPU to the GPU in the order in which they are to be executed.
- To avoid the GPU idling while it waits on the CPU, kernels are submitted ahead of the actual computation and executed asynchronously.
- Within a single stream, kernels are always executed in the order in which they were submitted by the CPU.
- If we want kernels to run in parallel, we need to submit them to different streams (see the first sketch after this list).
- Note that if kernels in different streams use the same resources, they may fail to run in parallel, or their execution may be very slow.
- Streams are useful for overlapping communication and computation, e.g. launching an all-reduce for a given layer as soon as its backward pass is done, so that the compute stream can go on computing the next layer’s backward (see the second sketch after this list).
- Example usage is explained in the NCCL section.
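A minimal sketch of running work on two streams in PyTorch (the tensor sizes and variable names are illustrative, not from the original notes):

```python
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# Make the side streams wait for the default stream, so a and b are
# fully initialized before the matmuls below start.
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

# Kernels launched inside a stream context are enqueued on that stream.
# Within each stream they run in launch order; kernels on different
# streams may overlap if resources allow.
with torch.cuda.stream(s1):
    c = a @ a
with torch.cuda.stream(s2):
    d = b @ b

# The launches above return to the CPU immediately (asynchronous execution);
# synchronize before using the results on the host.
torch.cuda.synchronize()
print(c.sum().item(), d.sum().item())
```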
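And a deliberately simplified illustration of the communication/computation overlap idea, using a gradient hook and a dedicated communication stream. This is a hedged sketch, not the scheme of any particular framework (DDP, for instance, buckets gradients and handles more synchronization); it assumes `torch.distributed` has already been initialised with the NCCL backend:

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def allreduce_hook(grad):
    # Hypothetical per-parameter hook: as soon as this layer's gradient
    # is produced on the compute (current) stream, launch the all-reduce
    # from a separate communication stream so the compute stream can keep
    # running the next layer's backward instead of waiting for it.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_reduce(grad)
        # Tell the caching allocator that grad is used on comm_stream,
        # so its memory is not reused before the all-reduce finishes.
        grad.record_stream(comm_stream)
    return grad

# Illustrative model; dist.init_process_group("nccl") is assumed to have run.
model = torch.nn.Linear(1024, 1024).cuda()
for p in model.parameters():
    p.register_hook(allreduce_hook)
```

Before the optimizer step, the default stream must in turn wait for `comm_stream` (e.g. `torch.cuda.current_stream().wait_stream(comm_stream)`), otherwise the update could read gradients whose all-reduce has not finished yet.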
Communication between streams
- To facilitate communication between streams, you can use the “event” primitive (`event = torch.cuda.Event()` in Torch). We can record an event into a stream (`event.record(stream)`), and it will be appended to the end of the stream like a lightweight kernel. We can then wait for this event in another stream (`event.wait(another_stream)`), which makes that stream pause until the first stream reaches the event (see the sketch below).
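A minimal sketch of this event-based synchronization (variable names are illustrative):

```python
import torch

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
event = torch.cuda.Event()

x = torch.randn(4096, 4096, device="cuda")
s1.wait_stream(torch.cuda.current_stream())  # make sure x is ready before s1 uses it

with torch.cuda.stream(s1):
    y = x @ x
    event.record(s1)   # the event is enqueued at the end of s1's current work

with torch.cuda.stream(s2):
    event.wait(s2)     # work submitted to s2 after this point waits for the event
    z = y @ x          # safe: y is guaranteed to be computed before this kernel runs

torch.cuda.synchronize()
```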