• Code convention: _d is for device variables, _h is for host variables.

Memory allocation

  • NVIDIA devices come with their own DRAM, which serves as the device's global memory
  • cudaMalloc & cudaFree
float *A_d;                      // device pointer
size_t size = n * sizeof(float); // size in bytes
cudaMalloc((void**)&A_d, size);  // pointer to pointer!
...
cudaFree(A_d);

Host ↔ Device Transfer

  • Copy data from CPU memory to GPU memory and vice versa
// copy input vectors to device (host -> device)
cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
...
// transfer result back to CPU memory (device -> host)
cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
 

CUDA Error handling

  • CUDA functions return a cudaError_t; any value other than cudaSuccess means something went wrong
  • always check the returned error status (see the macro sketch below)
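  • A common pattern is a small macro that wraps every CUDA call; a minimal sketch (the name CUDA_CHECK is our own, not part of the CUDA API):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// abort with file/line info if a CUDA call fails
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// usage: CUDA_CHECK(cudaMalloc((void**)&A_d, size));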

Kernel functions fn<<<…>>>

  • Launching a kernel launches a grid of threads
  • All threads execute the same code (SIMT: single instruction, multiple threads)
  • Threads are hierarchically organized: a grid consists of thread blocks, and each block consists of threads
  • up to 1024 threads can be in a thread block (the limit can be queried at runtime, see below)
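  • A quick way to confirm the limit, a sketch assuming device 0:
#include <cstdio>
#include <cuda_runtime.h>

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0); // query properties of device 0
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);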

Kernel coordinates

  • built-in variables available inside the kernel: blockIdx, threadIdx

  • These “coordinates” allow threads (all executing the same code) to identify what to do (e.g. which portion of the data to process)

  • Each thread is uniquely identified by blockIdx & threadIdx

  • Telephone system analogy: blockIdx (+41) is the area code, threadIdx is the local phone number (7856…)

  • built-in blockDim tells us the number of threads in a block

  • For vector addition, we can calculate the array index of the thread (see the kernel sketch after this list):

    • int i = blockIdx.x * blockDim.x + threadIdx.x;
      • select the block + assign position within the block
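  • A minimal vecAddKernel sketch using these coordinates, with the signature inferred from the launch example in the next section; the bounds check guards against n not being a multiple of the block size:
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n) { // guard: the grid may have more threads than elements
        C[i] = A[i] + B[i];
    }
}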

Calling Kernels

  • kernel configuration is specified between <<< >>>
  • number of blocks, number of threads in each block
  • grid and block dimensions can be multi-dimensional, specified with dim3
dim3 numThreads(256);                                  // threads per block
dim3 numBlocks((n + numThreads.x - 1) / numThreads.x); // ceiling division: enough blocks to cover n
// calling vector addition kernel example
vecAddKernel<<<numBlocks, numThreads>>>(A_d, B_d, C_d, n);
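  • Putting the pieces together, a minimal end-to-end host sketch (error checks omitted for brevity; in real code wrap each call, e.g. with the CUDA_CHECK macro above):
void vecAdd(const float *A_h, const float *B_h, float *C_h, int n) {
    size_t size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    cudaMalloc((void**)&A_d, size);                     // allocate device memory
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice); // host -> device
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);
    dim3 numThreads(256);
    dim3 numBlocks((n + numThreads.x - 1) / numThreads.x);
    vecAddKernel<<<numBlocks, numThreads>>>(A_d, B_d, C_d, n);
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost); // device -> host
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);        // release device memory
}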

Function declaration

  • You declare a kernel function with __global__
  • calling a __global__ function launches a new grid of CUDA threads
  • Functions declared with __device__ run on the GPU and can be called from device code, e.g. from within a kernel (see the sketch below)
  • If both __host__ & __device__ are used in a function declaration, CPU & GPU versions will be compiled
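  • A small illustration of the three qualifiers (function names are our own):
// device-only helper: runs on the GPU, callable from kernels
__device__ float square(float x) { return x * x; }

// compiled twice: once for the CPU, once for the GPU
__host__ __device__ float twice(float x) { return 2.0f * x; }

// kernel: launched from the host, executed by a grid of threads
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = twice(square(data[i]));
}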

Compiler

  • nvcc (the NVIDIA CUDA compiler) is used to compile kernels into PTX.
  • Parallel Thread Execution (PTX) is a low-level VM & instruction set
  • Graphics driver translates PTX into executable binary code (SASS)
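  • Example invocations (vecAdd.cu is a hypothetical source file):
nvcc vecAdd.cu -o vecAdd     # compile & link to an executable
nvcc -ptx vecAdd.cu          # emit the intermediate PTX for inspection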