Source: https://learnopencv.com/demystifying-gpu-architectures-for-deep-learning/

  • VRAM = DRAM
  • PCI Express = connection to CPU
  • Scalable Link Interface = connection to other GPUs
  • CUDA = “Compute Unified Device Architecture”

Processing units

CUDA cores

  • each core executes a single thread
    • can execute instructions for multiplying, dividing, or computing special functions, for example, activation functions
  • individually weak, but there are thousands of them

CUDA blocks and grids

  • CUDA threads are grouped together into “blocks”.
  • All threads within a block execute the same instructions and run on the same SM
  • Blocks are further grouped into entities called CUDA grids

CUDA kernels

  • On a GPU, each CUDA thread will work to produce only one entry of the output matrix.
  • We need a way to specify the computation that each CUDA thread should perform: this is exactly what a CUDA kernel does.
  • A kernel is written in such a way that different threads do the same computation but on different data.
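
The one-thread-per-output idea can be sketched as a minimal kernel (vector addition here for brevity; the indexing pattern for a matrix is the same idea in 2-D). This assumes a CUDA toolchain and unified (managed) memory:

```cpp
#include <cassert>
#include <cuda_runtime.h>

// Each thread produces exactly one entry of the output.
// Global thread index = block index * threads-per-block + index within block.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the grid may overshoot n slightly
        c[i] = a[i] + b[i];
}

// Round the grid size up so every entry gets a thread.
int numBlocks(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 1000, threads = 256;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    vecAdd<<<numBlocks(n, threads), threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    for (int i = 0; i < n; ++i) assert(c[i] == 3.0f * i);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Every thread runs the same kernel code; only `blockIdx`/`threadIdx` (and hence the data touched) differ.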

Streaming multiprocessors (SMs)

  • An SM is a sophisticated processor within the GPU which contains hardware and software for orchestrating the execution of hundreds of CUDA threads.
  • Modern GPUs contain several dozen SMs
  • For the purposes of execution, the SM divides blocks of threads into ‘warps’: groups of 32 threads. This is why it is so important for block sizes to be divisible by 32.
  • SMs which are physically located close to each other are further grouped into entities called Graphics Processing Clusters (GPC)

Memory

Global memory or VRAM

  • implemented as DRAM; the largest but slowest memory on the GPU

Shared memory

  • roughly the GPU equivalent of a CPU’s cache
  • Shared memory is located physically close to the CUDA cores and fetching data from shared memory is at least 10 times faster than from global memory.
  • Shared memory is visible only to threads in the same block.
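
A classic use of block-visible shared memory is a block-wide sum: threads exchange partial results through a `__shared__` array on-chip instead of through slow global memory. A minimal sketch, assuming a CUDA toolchain and managed memory:

```cpp
#include <cassert>
#include <cuda_runtime.h>

// Block-wide sum: every thread in the block reads/writes the same
// __shared__ buffer, which is invisible to threads in other blocks.
__global__ void blockSum(const float* in, float* out) {
    __shared__ float buf[256];                 // one slot per thread in the block
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                           // wait until every thread has written

    // Tree reduction within the block: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) buf[t] += buf[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = buf[0];      // one result per block
}

int main() {
    const int n = 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<1, n>>>(in, out);
    cudaDeviceSynchronize();
    assert(*out == 256.0f);                    // 256 ones sum to 256
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Note the `__syncthreads()` barriers: because the buffer is shared across the block, threads must synchronize before reading what their neighbors wrote.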

Registers

  • Registers are small memory banks dedicated to each thread.
  • Threads in a block all execute the same instructions.
  • However, the numerical values of intermediate results are different for every thread.
  • Registers allow each thread to store local copies of variables that are visible only to that one thread.
  • Each SM has a fixed, limited number of registers.
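
The per-thread nature of registers shows up whenever a kernel uses a plain local variable: the compiler places it in registers, so every thread gets a private copy. A small sketch (hypothetical sizes chosen for illustration, assuming a CUDA toolchain):

```cpp
#include <cassert>
#include <cuda_runtime.h>

// Each thread keeps its running sum 'acc' in a register: same code in every
// thread, but each accumulates different data into its own private copy.
__global__ void partialSums(const float* in, float* out, int n) {
    float acc = 0.0f;                          // lives in this thread's registers
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += in[i];                          // strided loop over the input
    out[threadIdx.x] = acc;                    // publish the private result
}

int main() {
    const int n = 8, threads = 4;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, threads * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)(i + 1);   // 1..8

    partialSums<<<1, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    // Thread 0 sums in[0]+in[4] = 1+5, thread 1 sums 2+6, and so on.
    assert(out[0] == 6.0f && out[1] == 8.0f);
    assert(out[2] == 10.0f && out[3] == 12.0f);
    cudaFree(in); cudaFree(out);
    return 0;
}
```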