• A kernel is a function that is written to be executed on the GPU.

  • A kernel is executed by many threads on the GPU, and every thread executes the same kernel. When a kernel is launched, multiple GPU threads are spawned, each running the instructions written inside that kernel. The number of threads that are spawned is configurable (see the minimal kernel sketch after this list).

  • Physically, threads are assigned to cores. Cores execute software threads.

  • A warp is a group of 32 threads. One block may be made up of multiple warps.

    • The SM executes all the threads within a warp together by fetching and issuing the same instruction to all of them. These threads then execute that instruction simultaneously. The warp is the most granular physical unit of execution.
    • Even if all the processing blocks (groups of cores) within an SM are handling warps, only a few of them are actively executing instructions at any given moment. This happens because there are a limited number of execution units available in the SM.
    • But some instructions take longer to complete, causing a warp to wait for the result. In such cases, the SM puts that waiting warp to sleep and starts executing another warp that doesn’t need to wait for anything. This lets the GPU maximally utilize all the available compute and deliver high throughput. Switching warps this way is called zero-overhead scheduling, and it is possible because each thread in each warp has its own set of registers.
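
To make this concrete, here is a minimal sketch of a kernel and its launch. The kernel name, array, and sizes below are illustrative assumptions, not from the original text; the point is that every spawned thread runs the same kernel body, and the `<<<num_blocks, threads_per_block>>>` launch configuration controls how many threads are spawned in total.

```cpp
#include <cuda_runtime.h>

// Every thread runs this same kernel; each one picks a different element
// to work on based on its block index and its thread index within the block.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread number
    if (i < n) {                                    // guard the extra threads
        data[i] += 1.0f;
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // The launch configuration is what makes the thread count configurable:
    // threads_per_block * num_blocks threads are spawned in total.
    int threads_per_block = 256;
    int num_blocks = (n + threads_per_block - 1) / threads_per_block;
    add_one<<<num_blocks, threads_per_block>>>(d_data, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

With 256 threads per block, the hardware splits each block into eight 32-thread warps and schedules them on the SM as described above.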

Logical & Physical Layout

Blocks

  • Threads are logically organized into blocks. Every block has a predefined number of threads assigned to it.

  • Just for logical purposes, threads can be arranged inside a block in either a 1D, 2D, or 3D array layout.

  • For example, if the kernel needs to operate on a 100x100 matrix, the block’s threads can be arranged in a 100-by-100 2D layout so that each thread maps to one matrix element (in practice a single block is capped at 1024 threads on current GPUs, so a matrix this size would be split across several smaller blocks; see the sketch after this list).

    • Under the hood, it’s just 10^4 threads.
  • In the physical world, every block is assigned an SM.

    • Throughout its execution, the block runs only on that SM.
    • Since every block is assigned an SM, it also has access to the SM’s shared memory.
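
As a sketch of how a block’s 2D thread layout and its access to the SM’s shared memory fit together, here is a hypothetical single-block tile transpose (the kernel name, tile size, and the transpose task itself are illustrative assumptions): each thread stages one element into shared memory, the whole block synchronizes, and each thread then writes out the transposed element.

```cpp
#include <cuda_runtime.h>

#define TILE 16  // 16 x 16 = 256 threads, one per tile element

// One block transposes one TILE x TILE tile using its SM's shared memory.
__global__ void transpose_tile(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE];  // lives in the SM's shared memory

    int col = threadIdx.x;  // x axis -> column
    int row = threadIdx.y;  // y axis -> row

    tile[row][col] = in[row * width + col];   // stage one element
    __syncthreads();                          // wait for the whole block

    out[col * width + row] = tile[row][col];  // write it back transposed
}

int main() {
    const int width = TILE;
    float *d_in, *d_out;
    cudaMalloc(&d_in, width * width * sizeof(float));
    cudaMalloc(&d_out, width * width * sizeof(float));

    dim3 block(TILE, TILE);  // 2D thread layout inside a single block
    transpose_tile<<<1, block>>>(d_in, d_out, width);

    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```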

Grids

  • Similar to how threads are organized in blocks, blocks are themselves organized into a grid.
    • Logical layout, 1D, 2D, 3D
  • That allows the GPU to launch multiple blocks at one time: a single GPU has multiple SMs, so several blocks can run simultaneously and keep all of the SMs and cores utilized.
  • Example:
    • Let’s assume that the program executes 25 blocks and the GPU has 10 SMs.
    • Then the program will execute 10 blocks in the first wave, 10 blocks in the second wave, and 5 blocks in the third wave. The first two waves will have 100% utilization, but the last wave will have only 50% utilization (see the sketch after this list).
  • A single kernel launch executes a single grid at a time.
  • Memory: The grid has access to the global memory or HBM of the GPU.
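
A quick host-side sketch of the wave arithmetic from the example above. It uses the illustrative 25-block figure and the same simplification as the text, namely that an SM runs one block at a time; real SMs can keep several blocks resident at once.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Ask the driver how many SMs this GPU has.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int num_sms = prop.multiProcessorCount;

    // 25 blocks on 10 SMs -> 3 waves (10, 10, 5), the last one half utilized.
    int num_blocks = 25;
    int waves = (num_blocks + num_sms - 1) / num_sms;  // ceiling division
    printf("SMs: %d, blocks: %d, waves: %d\n", num_sms, num_blocks, waves);
    return 0;
}
```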

1D-3D layout

  • During execution, a total of threads per block (b) * number of blocks (num) physical threads are spawned.
  • Each physical thread is numbered from 0 to b * num - 1.
  • A 2D array layout can be unrolled to 1D. With row-major ordering, the rows are laid out one after another, so element (i, j) of a matrix with C columns ends up at flat index i * C + j (see the sketch after this list).
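
Here is a minimal sketch of that numbering and unrolling (the kernel and variable names are illustrative): each thread of a single 2D block computes its row-major flat index, and reading the results back on the host shows the 1D order 0, 1, 2, …

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread of a 2D block computes its row-major flat index
// (row * cols + col) and records it, showing how the 2D layout unrolls to 1D.
__global__ void record_flat_ids(int *ids, int cols) {
    int col = threadIdx.x;        // x axis -> column
    int row = threadIdx.y;        // y axis -> row
    int flat = row * cols + col;  // row-major 2D -> 1D
    ids[flat] = flat;
}

int main() {
    const int rows = 3, cols = 4;  // 3 x 4 = 12 threads in one block
    int *d_ids;
    cudaMalloc(&d_ids, rows * cols * sizeof(int));

    dim3 block(cols, rows);        // dim3 takes (x, y, z) = (cols, rows)
    record_flat_ids<<<1, block>>>(d_ids, cols);

    int h_ids[rows * cols];
    cudaMemcpy(h_ids, d_ids, sizeof(h_ids), cudaMemcpyDeviceToHost);
    for (int i = 0; i < rows * cols; ++i) {
        printf("%d ", h_ids[i]);   // prints 0 1 2 ... 11
    }
    printf("\n");

    cudaFree(d_ids);
    return 0;
}
```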

Matmul kernel example

  • Let’s set up the grid and block structure.

  • We have a batched matmul: output = input * weight, where the weight matrix is (d_in, d_out), the input is (B, L, d_in), and the output is (B, L, d_out).

  • Then, our grid will be a 1D array of length batch size B

    • dim3 grid(B);
  • And our blocks will be a 2D array of the same dimension as the output matrix (L, d_out)

    • dim3 blocks(d_out, L);
    • d_out is first instead of L, because dim3 takes its arguments in (x, y, z) order: the x axis is the column axis and the y axis is the row axis.
  • Why?

    • Each block will take care of one matrix of the batched matmul
    • Each thread will take care of one element from the output matrix
  • In total, B * L * d_out threads are spawned, arranged in B blocks of L * d_out threads each (see the kernel sketch below).
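
Putting the pieces together, here is a minimal batched-matmul kernel sketch that follows the grid/block structure above. The names X, W, and Y and the row-major memory layout are assumptions, and this simple mapping only works while L * d_out stays within the 1024-threads-per-block limit.

```cpp
#include <cuda_runtime.h>

// Batched matmul sketch: Y[b] = X[b] * W for every b in the batch.
// X is (B, L, d_in), W is (d_in, d_out), Y is (B, L, d_out), all row-major.
// One block per batch element; one thread per output element.
__global__ void batched_matmul(const float *X, const float *W, float *Y,
                               int L, int d_in, int d_out) {
    int b   = blockIdx.x;   // which matrix of the batch this block handles
    int col = threadIdx.x;  // x axis -> output column (0 .. d_out - 1)
    int row = threadIdx.y;  // y axis -> output row    (0 .. L - 1)

    // Dot product of row `row` of X[b] with column `col` of W.
    float acc = 0.0f;
    for (int k = 0; k < d_in; ++k) {
        acc += X[(b * L + row) * d_in + k] * W[k * d_out + col];
    }
    Y[(b * L + row) * d_out + col] = acc;
}

// Launch, matching the structure described above (sizes are illustrative):
//   dim3 grid(B);
//   dim3 blocks(d_out, L);  // x = columns, y = rows
//   batched_matmul<<<grid, blocks>>>(d_X, d_W, d_Y, L, d_in, d_out);
```

Production matmul kernels tile the work across many blocks and stage sub-tiles in shared memory for reuse; this sketch only illustrates the grid, block, and thread mapping described in this section.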