Source: https://learnopencv.com/demystifying-gpu-architectures-for-deep-learning/

  • VRAM = DRAM
  • PCI Express = connection to CPU
  • Scalable Link Interface = connection to other GPUs
  • CUDA = “Compute Unified Device Architecture”

Processing units

CUDA cores

  • each core executes a single thread
    • can execute instructions for multiplying, dividing, or computing special functions, for example, activation functions
  • individually weak, but there are thousands of them

CUDA blocks and grids

  • CUDA threads are grouped together into “blocks”.
  • All threads within a block execute the same instructions and run on the same SM
  • Blocks are further grouped into entities called CUDA grids

CUDA kernels

  • On a GPU, each CUDA thread will work to produce only one entry of the output matrix.
  • We need a way to specify the computation that each CUDA thread should perform: this is exactly what a CUDA kernel does.
  • A kernel is written in such a way that different threads do the same computation but on different data.
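
The one-thread-per-output idea can be sketched as a minimal kernel (vector addition here for brevity; the indexing pattern for a matrix is the same idea in 2-D). This assumes a CUDA toolchain and unified (managed) memory:

```cpp
#include <cassert>
#include <cuda_runtime.h>

// Each thread produces exactly one entry of the output.
// Global thread index = block index * threads-per-block + index within block.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the grid may overshoot n slightly
        c[i] = a[i] + b[i];
}

// Round the grid size up so every entry gets a thread.
int numBlocks(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 1000, threads = 256;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    vecAdd<<<numBlocks(n, threads), threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    for (int i = 0; i < n; ++i) assert(c[i] == 3.0f * i);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Every thread runs the same kernel code; only `blockIdx`/`threadIdx` (and hence the data touched) differ.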

Streaming multiprocessors (SMs)

  • An SM is a sophisticated processor within the GPU which contains hardware and software for orchestrating the execution of hundreds of CUDA threads.
  • Modern GPUs contain several dozen SMs
  • For the purposes of execution, the SM divides blocks of threads into ‘warps’: groups of 32 threads. This is why it is so important for block sizes to be divisible by 32.
  • SMs which are physically located close to each other are further grouped into entities called Graphics Processing Clusters (GPC)

Memory

Global memory or VRAM

  • implemented as DRAM; the largest but slowest memory on the GPU

Shared memory

  • roughly the GPU equivalent of a CPU’s cache
  • Shared memory is located physically close to the CUDA cores and fetching data from shared memory is at least 10 times faster than from global memory.
  • Shared memory is visible only to threads in the same block.
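
A classic use of block-visible shared memory is a block-wide sum: threads exchange partial results through a `__shared__` array on-chip instead of through slow global memory. A minimal sketch, assuming a CUDA toolchain and managed memory:

```cpp
#include <cassert>
#include <cuda_runtime.h>

// Block-wide sum: every thread in the block reads/writes the same
// __shared__ buffer, which is invisible to threads in other blocks.
__global__ void blockSum(const float* in, float* out) {
    __shared__ float buf[256];                 // one slot per thread in the block
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                           // wait until every thread has written

    // Tree reduction within the block: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) buf[t] += buf[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = buf[0];      // one result per block
}

int main() {
    const int n = 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<1, n>>>(in, out);
    cudaDeviceSynchronize();
    assert(*out == 256.0f);                    // 256 ones sum to 256
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Note the `__syncthreads()` barriers: because the buffer is shared across the block, threads must synchronize before reading what their neighbors wrote.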

Registers

  • Registers are small memory banks dedicated to each thread.
  • Threads in a block all execute the same instructions.
  • However, the numerical values of intermediate results are different for every thread.
  • Registers allow each thread to store local copies of variables that are visible only to that one thread.
  • Each SM has a fixed, limited number of registers.
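
The per-thread nature of registers shows up whenever a kernel uses a plain local variable: the compiler places it in registers, so every thread gets a private copy. A small sketch (hypothetical sizes chosen for illustration, assuming a CUDA toolchain):

```cpp
#include <cassert>
#include <cuda_runtime.h>

// Each thread keeps its running sum 'acc' in a register: same code in every
// thread, but each accumulates different data into its own private copy.
__global__ void partialSums(const float* in, float* out, int n) {
    float acc = 0.0f;                          // lives in this thread's registers
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += in[i];                          // strided loop over the input
    out[threadIdx.x] = acc;                    // publish the private result
}

int main() {
    const int n = 8, threads = 4;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, threads * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)(i + 1);   // 1..8

    partialSums<<<1, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    // Thread 0 sums in[0]+in[4] = 1+5, thread 1 sums 2+6, and so on.
    assert(out[0] == 6.0f && out[1] == 8.0f);
    assert(out[2] == 10.0f && out[3] == 12.0f);
    cudaFree(in); cudaFree(out);
    return 0;
}
```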