CUDA grid
- 2-level hierarchy: blocks, threads
- Idea: map threads to multi-dimensional data
- All threads in a grid execute the same kernel
- Threads in the same block can access the same shared memory
- Max block size: 1024 threads
- Built-in 3D coordinates of a thread: blockIdx, threadIdx
    - identify which portion of the data to process (see the index sketch below)
- Shape of grid & blocks:
    - gridDim: number of blocks in the grid (not used that often)
    - blockDim: number of threads in a block
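A minimal sketch of how these built-in coordinates are typically combined into a global data index; the kernel name, the scale-by-2 body, and the 1D layout are illustrative assumptions, not from the book:

```cuda
// Hypothetical kernel: each thread handles one element of a 1D array.
__global__ void scaleKernel(float* out, const float* in, int n) {
    // Global index = block offset (blockIdx.x * blockDim.x) + offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {               // guard: the grid may cover more threads than elements
        out[i] = 2.0f * in[i];
    }
}
```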
Grid shape
- How to define blockDim is dependent on the cache
- The grid can be different for each kernel launch, e.g., dependent on data shapes (see the launch sketch after this list)
- Threads can be scheduled in any order
- You can use fewer than 3 dims (set the others to 1)
    - e.g., 1D for sequences
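A sketch of how a 1D launch configuration could be derived from the data shape, reusing the hypothetical scaleKernel above; the block size and device pointer names are assumptions:

```cuda
#include <cuda_runtime.h>

// Host-side sketch: size the grid from the data (values are illustrative).
void launchScale(float* d_out, const float* d_in, int n) {
    int threadsPerBlock = 256;   // a common 1D choice; tune per hardware/cache
    // Enough blocks to cover all n elements (ceiling division)
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(d_out, d_in, n);

    // Equivalent with explicit dim3 shapes: the unused y/z dimensions are set to 1
    dim3 block(256, 1, 1);
    dim3 grid((n + block.x - 1) / block.x, 1, 1);
    scaleKernel<<<grid, block>>>(d_out, d_in, n);
}
```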
nd-Arrays in Memory
- Logical view of the data
- Row-major layout in memory
- A 2D array can be linearized in two ways:
    - row-major (contiguous elements form rows)
    - column-major (contiguous elements form columns)
- Important for how to think about data accesses in your code and how cache-friendly they are
- Indexing a whole row is cache-friendly under a row-major layout (see the sketch below)
- Torch tensors & numpy ndarrays use strides to specify how elements are laid out in memory
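A small sketch of row-major index arithmetic as it typically appears in a 2D kernel; the kernel name and the in-place scaling are made up for illustration:

```cuda
// Hypothetical 2D kernel over a row-major height x width array.
__global__ void rowMajorKernel(float* data, int height, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x -> columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y -> rows
    if (row < height && col < width) {
        // Row-major: element (row, col) lives at offset row * width + col,
        // so walking along a row touches adjacent memory (cache-friendly).
        int idx = row * width + col;
        data[idx] *= 2.0f;
        // A column-major layout would use idx = col * height + row instead.
    }
}
```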
Image blur example (3.3, p.60)
- Mean-filter example: blurKernel
- Shows row-major pixel memory access (in & out pointers)
- Nice showcase of 3D data access
Interesting snippets (sketched below)
- Defining the threads and blocks
- Getting current row, col, and channel
- Row-major access
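A sketch in the spirit of the book's blurKernel covering the three snippets above; the channel handling, BLUR_SIZE, interleaved-channel layout, and block shape are my assumptions, not the book's exact code:

```cuda
#define BLUR_SIZE 1   // assumed radius: averages over a (2*BLUR_SIZE+1)^2 window

// Mean-filter blur sketch. Layout assumption: row-major pixels with interleaved
// channels, i.e. (row, col, channel) -> (row * width + col) * channels + channel.
__global__ void blurKernel(const unsigned char* in, unsigned char* out,
                           int width, int height, int channels) {
    // Map the built-in 3D thread coordinates onto (col, row, channel)
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int ch  = blockIdx.z * blockDim.z + threadIdx.z;

    if (col < width && row < height && ch < channels) {
        int pixVal = 0;
        int pixels = 0;
        // Accumulate the surrounding window, skipping out-of-bounds neighbors
        for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr) {
            for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) {
                int r = row + dr;
                int c = col + dc;
                if (r >= 0 && r < height && c >= 0 && c < width) {
                    // Row-major access into the input image
                    pixVal += in[(r * width + c) * channels + ch];
                    ++pixels;
                }
            }
        }
        out[(row * width + col) * channels + ch] = (unsigned char)(pixVal / pixels);
    }
}
```

Defining the threads and blocks for such a kernel could then look like this (block shape and variable names are assumptions):

```cuda
dim3 block(16, 16, 1);   // 256 threads per block; one grid z-slice per channel
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y,
          channels);
blurKernel<<<grid, block>>>(d_in, d_out, width, height, channels);
```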