• Why?
    • How is the interaction between CPU and GPU organized in Torch and other frameworks?

What is it

  • Kernels (functions executed on the GPU) are submitted from the CPU to the GPU in the order in which they should execute.

    • To avoid the GPU idling while it waits for the CPU, kernels are enqueued ahead of the computations and executed asynchronously.
  • Within a single stream, kernels always execute in the order in which they were submitted by the CPU.

  • If we want kernels to run in parallel, we need to launch them in different streams.

    • Note that if kernels in different streams use the same resources, they may fail to run in parallel or may run very slowly.
    • This is useful for overlapping communication and computation (see the sketch after this list),
    • e.g. launching an all-reduce for a given layer as soon as its backward is done, so that the compute stream can go on with the next layer’s backward.
  • Example usage is explained in NCCL.
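  • A minimal sketch (in Torch, assuming a CUDA device is available) of enqueuing independent work on two streams so the kernels may overlap on the GPU. The stream names and the placeholder kernels are illustrative, and the commented-out all-reduce assumes an initialized torch.distributed process group.

    import torch

    compute_stream = torch.cuda.Stream()   # illustrative name for the main compute stream
    comm_stream = torch.cuda.Stream()      # illustrative name for the communication stream

    x = torch.randn(4096, 4096, device="cuda")
    grad = torch.randn(4096, 4096, device="cuda")

    with torch.cuda.stream(compute_stream):
        y = x @ x                          # queued on compute_stream, runs in order there

    with torch.cuda.stream(comm_stream):
        # Work queued here may overlap with the matmul above. In practice this would
        # be a communication kernel, e.g. (with torch.distributed initialized):
        # torch.distributed.all_reduce(grad)
        grad.mul_(0.5)                     # placeholder kernel standing in for communication

    torch.cuda.synchronize()               # block the CPU until both streams have finished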

Communication between streams

  • To synchronize streams with one another, you can use the “event” primitive (event = torch.cuda.Event() in Torch). We can record an event in a stream (event.record(stream)), which appends it to the end of that stream like a tiny marker kernel. Another stream can then wait on this event (event.wait(another_stream)): any work submitted to that stream afterwards will not start until the first stream has reached the event.
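  • A minimal sketch of this pattern (the stream and variable names are illustrative): the consumer stream waits on an event recorded in the producer stream, so its work only starts once the producer has finished the matmul.

    import torch

    producer = torch.cuda.Stream()   # illustrative names
    consumer = torch.cuda.Stream()
    event = torch.cuda.Event()

    x = torch.randn(2048, 2048, device="cuda")

    with torch.cuda.stream(producer):
        y = x @ x                    # enqueued on the producer stream
    event.record(producer)           # marker appended after the matmul in `producer`

    event.wait(consumer)             # future work on `consumer` waits for the marker
    with torch.cuda.stream(consumer):
        z = y + 1                    # only starts once the matmul has completed

    torch.cuda.synchronize()         # block the CPU until both streams have finished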