• extends ANSI C with minimal new syntax

  • Terminology: CPU=host, GPU=device

    • CUDA C source can be a mixture of host and device code
    • device functions launched from the host: kernels
      • grid of threads: many threads are launched to execute a kernel
      • don’t be afraid of launching many threads
        • e.g. one thread per (output) tensor element
  • CPU & GPU code can run concurrently (overlapped): a kernel launch returns to the host before the kernel has finished
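The points above can be sketched in a single source file. This is a minimal illustration rather than a reference implementation: the kernel `squareKernel`, the host wrapper `squareOnDevice`, and the block size of 256 are assumptions made for the example, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Device code: a kernel (hypothetical example, not from the notes).
// Each thread squares exactly one element of the array.
__global__ void squareKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * data[i];
    }
}

// Host code: squares the n floats in `host` in place on the GPU.
// Assumes n > 0; return codes of the CUDA calls are ignored here.
void squareOnDevice(float* host, int n) {
    size_t bytes = n * sizeof(float);
    float* dev = nullptr;
    cudaMalloc((void**)&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    // One thread per element; 256 threads per block is an assumed choice.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    squareKernel<<<blocks, threadsPerBlock>>>(dev, n);

    // The launch is asynchronous: control is back on the CPU here while
    // the GPU may still be running, so unrelated host work could overlap.

    cudaDeviceSynchronize();  // block the host until the kernel is done
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```

Calling `squareOnDevice(a, 3)` on `float a[3] = {1.0f, 2.0f, 3.0f}` launches a single block of 256 threads, of which only the first three do any work.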

Example: Vector addition

  • Naive GPU vector addition:
    1. Allocate device memory for vectors
    2. Transfer inputs host → device
    3. Launch kernel and perform additions
    4. Copy results back device → host
    5. Free device memory
  • Normally, we keep data on the GPU as long as possible, so that many kernel launches can be scheduled asynchronously without copying back to the host in between.
  • For vector addition, we can calculate the array index of the thread:
    • int i = blockIdx.x * blockDim.x + threadIdx.x;
      • blockIdx.x * blockDim.x selects the start of the block; threadIdx.x adds the position within the block
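The five steps of the naive version could look roughly like the following host wrapper. It is a sketch under stated assumptions: the names `vecAddSketch` and `vecAdd`, the `hA`/`dA` naming for host and device pointers, and the block size of 256 are illustrative choices, and the return codes of the CUDA runtime calls are ignored for brevity.

```cuda
#include <cuda_runtime.h>

// Kernel (hypothetical name): one thread per output element.
__global__ void vecAddSketch(const float* A, const float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

// Host wrapper (hypothetical name): computes hC = hA + hB for n floats.
// Assumes n > 0.
void vecAdd(const float* hA, const float* hB, float* hC, int n) {
    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    // 1. Allocate device memory for the vectors
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);

    // 2. Transfer inputs host -> device
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel with enough blocks to cover all n elements
    int threadsPerBlock = 256;  // assumed block size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddSketch<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

    // 4. Copy the result device -> host; this blocking copy on the
    //    default stream runs only after the kernel has finished
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // 5. Free device memory
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

For such a small amount of arithmetic per element the two transfers typically dominate the runtime, which is why this version is called naive.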

CUDA code

  • General strategy: replace loop by grid of threads
  • To keep in mind:
    • Data sizes might not be perfectly divisible by the block size: always check bounds
    • Prevent threads of the boundary block from reading or writing outside allocated memory
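The need for the bounds check follows from how the grid is usually sized: the number of blocks is rounded up so that every element is covered, which leaves surplus threads in the last block. A small host-side sketch of that arithmetic, where the helper names `ceilDiv` and `surplusThreads` are hypothetical and not part of the CUDA API:

```cuda
// Hypothetical helper: number of blocks needed to cover n elements,
// i.e. n / blockSize rounded up (assumes n >= 0 and blockSize > 0).
inline int ceilDiv(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical helper: threads in the last block whose global index
// falls at or beyond n. These are the threads the bounds check disables.
inline int surplusThreads(int n, int blockSize) {
    return ceilDiv(n, blockSize) * blockSize - n;
}

// Example: n = 1000 elements with 256 threads per block
//   ceilDiv(1000, 256)        == 4   blocks
//   4 * 256                   == 1024 threads launched
//   surplusThreads(1000, 256) == 24  threads with i >= n
```

Without a guard, those 24 threads would read and write past the end of the arrays.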
// compute vector sum C=A+B
// each thread performs one pairwise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n){
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n){ // guard: surplus threads in the last block do nothing
		C[i] = A[i] + B[i];