- Code convention: `_d` is for device variables, `_h` is for host variables
Memory allocation
- NVIDIA devices come with their own DRAM (device global memory)
- `cudaMalloc` & `cudaFree` allocate and free device global memory
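A minimal sketch of the allocate/free pattern (the buffer name `d_A` and the element count `n` are illustrative):

```cuda
int n = 1 << 20;                  // illustrative element count
size_t size = n * sizeof(float);
float *d_A;
cudaMalloc((void **)&d_A, size);  // pass the address of the pointer
// ... launch kernels that use d_A ...
cudaFree(d_A);                    // release device memory when done
```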
Host ↔ Device Transfer
- `cudaMemcpy` copies data from CPU memory to GPU memory and vice versa
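A sketch of both copy directions, reusing the hypothetical `d_A` buffer and size from above (`h_A` follows the `_h` naming convention):

```cuda
// Copy n floats from host to device, then back after the kernel runs.
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
// ... kernel launches ...
cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);
```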
CUDA Error handling
- CUDA functions return `cudaError_t`; if it is not `cudaSuccess`, we have a problem…
- Always check the returned error status
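One common way to enforce this is a checking macro; the name `CHECK` is my own, but `cudaGetErrorString` is the real API for turning a `cudaError_t` into a readable message:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Wrap every CUDA call; print the error and abort if it is not cudaSuccess.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CHECK(cudaMalloc((void **)&d_A, size));
```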
Kernel functions `fn<<<…>>>`
- Launching a kernel = a grid of threads is launched
- All threads execute the same code (SPMD)
- Threads are hierarchically organized: the grid is made of thread blocks, each thread block is made of threads
- up to 1024 threads can be in a thread block
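A tiny sketch to make the grid idea concrete (the `hello` kernel is hypothetical; device-side `printf` is real CUDA):

```cuda
#include <cstdio>

// Every thread in the launched grid executes this same function.
__global__ void hello()
{
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();        // grid of 2 blocks x 4 threads = 8 threads
    cudaDeviceSynchronize();  // wait for the grid to finish
    return 0;
}
```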
Kernel coordinates
- Built-in variables available inside the kernel: `blockIdx`, `threadIdx`
- These “coordinates” allow threads (all executing the same code) to identify what to do (e.g. which portion of the data to process)
- Each thread is uniquely identified by `blockIdx` & `threadIdx`
- Telephone system analogy: `blockIdx` is the area code (+41), `threadIdx` is the local phone number (7856…)
- Built-in `blockDim` tells us the number of threads in a block
- For vector addition, we can calculate the array index of the thread:
`int i = blockIdx.x * blockDim.x + threadIdx.x;`
- i.e. select the block, then add the position within the block
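Put together, a sketch of the vector-addition kernel this index computation comes from (the name `vecAddKernel` and the bounds check are illustrative):

```cuda
// Each thread computes one element of C = A + B.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)  // guard: the last block may have more threads than elements
        C[i] = A[i] + B[i];
}
```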
Calling Kernels
- Kernel launch configuration is specified between `<<< >>>`: the number of blocks and the number of threads in each block
- Multidimensional grid & block shapes can be specified too (`dim3`)
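A sketch of a launch covering `n` elements with the kernel above (256 threads per block is an illustrative choice; any value up to 1024 works):

```cuda
int threadsPerBlock = 256;
// Ceiling division: round up so every element gets a thread,
// even when n is not a multiple of threadsPerBlock.
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);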
Function declaration
- You declare a kernel function with `__global__`
- Calling a `__global__` function → launches a new grid of CUDA threads
- Functions declared with `__device__` can be called from within a CUDA thread
- If both `__host__` & `__device__` are used in a function declaration, CPU & GPU versions will be compiled
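A sketch putting the three qualifiers side by side (function names are illustrative):

```cuda
// __device__: callable only from code already running on the GPU.
__device__ float square(float x) { return x * x; }

// __host__ __device__: compiled twice, a CPU version and a GPU version.
__host__ __device__ float twice(float x) { return 2.0f * x; }

// __global__: calling this from the host launches a new grid of threads.
__global__ void transform(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = twice(square(data[i]));
}
```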
Compiler
- nvcc (the NVIDIA CUDA compiler) is used to compile kernels into PTX.
- Parallel Thread Execution (PTX) is a low-level VM & instruction set
- Graphics driver translates PTX into executable binary code (SASS)
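Typical invocations (the file name is a placeholder; `-ptx` is nvcc's flag for stopping at the intermediate representation):

```
# Compile to an executable; the kernel code is embedded in the binary.
nvcc -o vecadd vecadd.cu

# Emit the PTX for inspection instead of a binary.
nvcc -ptx vecadd.cu
```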