Source: https://learnopencv.com/demystifying-gpu-architectures-for-deep-learning/
- VRAM = DRAM
- PCI Express = connection to CPU
- Scalable Link Interface = connection to other GPUs
- CUDA = “Compute Unified Device Architecture”
Processing units
CUDA cores
- each CUDA core executes a thread
- can execute instructions for multiplying, dividing or calculating special functions such as activation functions
- individually weak, but a GPU has many of them
CUDA blocks and grids
- CUDA threads are grouped together into “blocks”.
- All threads within a block execute the same instructions and run on the same SM
- Blocks are further grouped into entities called CUDA grids
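The thread, block and grid hierarchy above can be modelled on the CPU. What follows is a minimal Python sketch, not CUDA code. The helper names `grid_size` and `global_thread_index` are hypothetical, and the problem size and block size are assumed values chosen for illustration. The index formula mirrors the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x` for a one-dimensional launch.

```python
def grid_size(n_items: int, block_dim: int) -> int:
    """Blocks needed so every item gets its own thread (ceiling division)."""
    return (n_items + block_dim - 1) // block_dim


def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Flat index of a thread within the whole grid (1-D launch)."""
    return block_idx * block_dim + thread_idx


n_items = 1000    # assumed problem size, for illustration only
block_dim = 256   # assumed number of threads per block

n_blocks = grid_size(n_items, block_dim)

# Enumerate every thread in the grid, block by block.
indices = [
    global_thread_index(b, block_dim, t)
    for b in range(n_blocks)
    for t in range(block_dim)
]

# Threads whose index falls past the end of the data have nothing to do.
active = [i for i in indices if i < n_items]
```

Because the grid is sized in whole blocks, it usually contains a few more threads than data items, which is why real kernels tend to start with a bounds check.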
CUDA kernels
- In a matrix operation such as matrix multiplication, each CUDA thread typically computes only one entry of the output matrix.
- We need a way to specify the computation that each CUDA thread should perform
- ⇒ CUDA kernels
- A kernel is written in such a way that different threads do the same computation but on different data.
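The "same computation, different data" idea can be sketched with a plain Python emulation, assuming matrix multiplication as the workload. Here `matmul_kernel` stands in for the work of a single thread and `launch` is a hypothetical stand-in for a kernel launch that runs the threads one after another; on a GPU they would run in parallel. Both names and the 2x2 block shape are assumptions made for this example.

```python
def matmul_kernel(row, col, a, b, out, n, k, m):
    """Work done by ONE thread: compute the single entry out[row][col]."""
    if row < n and col < m:  # bounds guard for threads past the matrix edge
        acc = 0.0
        for i in range(k):
            acc += a[row][i] * b[i][col]
        out[row][col] = acc


def launch(a, b, block=(2, 2)):
    """Emulate a 2-D grid of 2-D blocks, running threads sequentially."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    grid_rows = (n + block[0] - 1) // block[0]
    grid_cols = (m + block[1] - 1) // block[1]
    for by in range(grid_rows):
        for bx in range(grid_cols):
            for ty in range(block[0]):
                for tx in range(block[1]):
                    # Every thread runs the same code; only its indices differ.
                    row = by * block[0] + ty
                    col = bx * block[1] + tx
                    matmul_kernel(row, col, a, b, out, n, k, m)
    return out


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3x2
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]     # 2x3
c = launch(a, b)
```

The kernel body never loops over the output matrix; the loop nest in `launch` plays the role of the hardware that assigns one (row, col) pair to each thread.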
Streaming multiprocessors (SMs)
- An SM is a sophisticated processor within the GPU which contains hardware and software for orchestrating the execution of hundreds of CUDA threads.
- Modern GPUs contain several dozen SMs
- For the purposes of execution, the SM divides blocks of threads into ‘warps’, which are groups of 32 threads. This is why sizes such as the number of threads per block should be a multiple of 32: a partially filled warp still occupies a full warp during execution.
- SMs which are physically located close to each other are further grouped into entities called Graphics Processing Clusters (GPCs)
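The effect of the warp size on block sizing can be checked with a little arithmetic. This is a minimal sketch; the function names `warps_per_block` and `idle_lanes` are hypothetical, and the block sizes are arbitrary examples.

```python
WARP_SIZE = 32  # warp size as stated in the notes above


def warps_per_block(threads_per_block: int) -> int:
    """Warps needed to cover a block; the last warp may be partially filled."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


def idle_lanes(threads_per_block: int) -> int:
    """Thread slots in the last warp that have no thread assigned."""
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block


for tpb in (96, 100, 128):
    print(tpb, warps_per_block(tpb), idle_lanes(tpb))
```

A block of 100 threads needs as many warps as a block of 128 but leaves 28 slots unused, whereas 96 and 128 fill their warps exactly.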
Memory
Global memory or VRAM
- DRAM
Shared memory
- roughly a GPU equivalent of the cache in a CPU
- Shared memory is located physically close to the CUDA cores and fetching data from shared memory is at least 10 times faster than from global memory.
- Shared memory is visible to threads in the same block.
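One common way to exploit block-visible shared memory is tiling: the threads of a block cooperatively load a small tile from global memory once and then reuse it many times. The counting model below is a rough sketch under stated assumptions (square n x n matrix multiplication, a tile size that divides n, no hardware caching); the function names are hypothetical and the numbers are not measurements.

```python
def naive_global_reads(n: int) -> int:
    """Each of the n*n threads reads a full row of A and a full column of B."""
    return n * n * 2 * n


def tiled_global_reads(n: int, tile: int) -> int:
    """Each block loads one tile of A and one tile of B per step into shared memory."""
    assert n % tile == 0, "this simple model assumes the tile size divides n"
    blocks = (n // tile) ** 2         # one block per output tile
    steps = n // tile                 # tiles along the shared dimension
    loads_per_step = 2 * tile * tile  # each element is loaded once per block
    return blocks * steps * loads_per_step


n, tile = 1024, 32
ratio = naive_global_reads(n) // tiled_global_reads(n, tile)
```

Under these assumptions the number of global memory reads drops by a factor equal to the tile width, because every value fetched into shared memory is reused by all the threads in the block that need it.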
Registers
- Registers are small, fast memory banks dedicated to each thread.
- Threads in a block all execute the same instructions.
- However, the numerical values of the results of intermediate calculations are different for every thread.
- Registers allow each thread to store local copies of variables that are visible only to that thread.
- Each SM has a fixed, limited number of registers
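Because the register file of an SM is fixed, the number of registers each thread uses limits how many threads can be resident on the SM at once. The sketch below shows the basic arithmetic. The figures (65,536 registers per SM and a limit of 2,048 resident threads) are assumed illustrative values, not taken from the source, and the function name is hypothetical. Real hardware allocates registers with coarser granularity, so this is an upper bound rather than an exact occupancy calculation.

```python
def max_resident_threads(registers_per_sm: int,
                         registers_per_thread: int,
                         hw_thread_limit: int) -> int:
    """Upper bound on threads resident on one SM, limited by registers."""
    by_registers = registers_per_sm // registers_per_thread
    return min(by_registers, hw_thread_limit)


REGISTERS_PER_SM = 65536   # assumed value, for illustration only
HW_THREAD_LIMIT = 2048     # assumed value, for illustration only

limits = {
    regs: max_resident_threads(REGISTERS_PER_SM, regs, HW_THREAD_LIMIT)
    for regs in (32, 64, 128)
}
```

With these assumed figures, doubling the registers used per thread from 32 to 64 halves the number of threads the SM can keep resident.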