- heterogeneous: CPU + GPU
CUDA C
- extends ANSI C with minimal new syntax
-
Terminology: CPU=host, GPU=device
- CUDA C source can be a mixture of host & device code
- device-code functions are called kernels
- grid of threads: many threads are launched to execute a kernel
- don’t be afraid of launching many threads
- e.g. one thread per (output) tensor element (see the kernel sketch after this list)
- CPU & GPU code runs concurrently (overlapped)
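A minimal sketch of the one-thread-per-element pattern; the kernel name square and its parameters are illustrative, not from the source:

    // __global__ marks a kernel: device code, launched from the host.
    __global__ void square(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n)                                      // guard: see bounds note below
            x[i] = x[i] * x[i];
    }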
Example: Vector addition
- Naive GPU vector addition (full sketch below):
- Allocate device memory for vectors
- Transfer inputs host → device
- Launch kernel and perform additions
- Copy results device → host
- Free device memory
- Normally, we keep data on the GPU as long as possible and asynchronously schedule many kernel launches on it, avoiding repeated host ↔ device transfers.
- For vector addition, each thread computes the array index it is responsible for:
    int i = blockIdx.x * blockDim.x + threadIdx.x;
- blockIdx.x * blockDim.x selects the start of the block; threadIdx.x is the position within the block
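A hedged end-to-end sketch of the five steps above; the names vecAdd and vectorAddHost and the block size 256 are illustrative, and error checking is omitted for brevity:

    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                      // boundary block: skip excess threads
            c[i] = a[i] + b[i];
    }

    void vectorAddHost(const float *h_a, const float *h_b, float *h_c, int n) {
        size_t bytes = n * sizeof(float);
        float *d_a, *d_b, *d_c;

        // 1. Allocate device memory for the vectors
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);

        // 2. Transfer inputs host -> device
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // 3. Launch the kernel: enough blocks to cover all n elements
        int blockSize = 256;
        int numBlocks = (n + blockSize - 1) / blockSize;   // ceiling division
        vecAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);

        // 4. Copy results device -> host (synchronizes with the kernel)
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

        // 5. Free device memory
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }

Note that the launch itself returns immediately; the device → host copy is what makes the host wait, which is why chaining many launches on device-resident data (the bullet above) is cheap.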
CUDA code
- General strategy: replace the loop with a grid of threads
- To keep in mind:
- Data sizes might not be perfectly divisible by the block size: always check bounds
- Prevent threads of the boundary block from reading/writing outside the allocated memory (sketch below)
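The same guard pattern as in the vecAdd sketch above, condensed into a fragment; the kernel name f, the buffer d_x, and the block size 256 are illustrative:

    __global__ void f(float *x, int n) {   // hypothetical element-wise kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;                 // boundary-block guard
        x[i] += 1.0f;                       // illustrative element-wise update
    }

    // Host side: round the grid size up so all n elements are covered;
    // the last block may then contain threads with no element to process.
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;   // ceiling division
    f<<<numBlocks, blockSize>>>(d_x, n);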