Definition
- Takes:
    - An input array [x_0, …, x_{n−1}]
    - An associative operator ⊕
        - e.g. sum, product, min, max
- Returns:
    - An output array [y_0, …, y_{n−1}]
    - Inclusive scan: y_i = x_0 ⊕ … ⊕ x_i
    - Exclusive scan: y_i = x_0 ⊕ … ⊕ x_{i−1}
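As a point of reference, a minimal sequential sketch (assuming ⊕ is addition on floats; the function names are illustrative):

```cuda
// Sequential reference scans with ⊕ = addition (illustrative sketch).
void inclusive_scan(const float *x, float *y, int n) {
    float acc = 0.0f;             // identity of the operator
    for (int i = 0; i < n; ++i) {
        acc += x[i];              // acc = x_0 + … + x_i
        y[i] = acc;
    }
}

void exclusive_scan(const float *x, float *y, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        y[i] = acc;               // y_i = x_0 + … + x_{i−1}
        acc += x[i];
    }
}
```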
Parallel scan
- Parallel scan requires synchronization across parallel workers
- On GPUs, it's cheaper to synchronize threads within the same thread block than across blocks.
Segmented scan
- Approach: segmented scan (host-side sketch below)
    - Every thread block scans a segment
    - Scan the segments' partial sums (parallel across blocks)
        - This is another scan, but over the last value of each scanned segment
    - Add each segment's scanned partial sum to every element of the next segment
- Diagram
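A host-side sketch of the three phases (the kernel names and SEGMENT_SIZE are hypothetical placeholders; phase 2 assumes the number of segments fits in one block):

```cuda
// Hypothetical three-phase segmented scan driver (names are placeholders).
int numSegments = (n + SEGMENT_SIZE - 1) / SEGMENT_SIZE;

// Phase 1: each block scans its own segment and writes its total to d_sums.
scan_segments<<<numSegments, SEGMENT_SIZE>>>(d_x, d_y, d_sums, n);

// Phase 2: scan the per-segment totals (another scan, one value per segment).
scan_block_sums<<<1, numSegments>>>(d_sums, numSegments);

// Phase 3: add segment k's scanned total to every element of segment k+1.
add_block_sums<<<numSegments, SEGMENT_SIZE>>>(d_y, d_sums, n);
```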
Parallel scan implementation within a single thread block
Kogge-Stone
Kogge-Stone example with 8 elements
- We're looking at parallel (inclusive) scan
- Let's do ONE parallel reduction tree
    - We write in place in the same array
    - We get the last element (the total), and some other prefix sums as a byproduct
    - Blue values are "already valid" solutions
    - SOME VALUES ARE NEVER TOUCHED, e.g. x_2, x_4, x_6
- Now, if we overlap and do 4 reduction trees, we get:
    - First iteration: for every element, add the element one step before it
    - Second iteration: for every element, add the element two steps before it
    - Third iteration: for every element, add the element four steps before it
- We get O(log N) iterations
Kogge-Stone Parallel (Inclusive) Scan
- Use the above algorithm with one thread for each element
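A minimal single-block sketch (kernel name is illustrative; it scans in place in global memory, and the two barriers per iteration are explained under Details below):

```cuda
// Kogge-Stone inclusive scan: one block, one thread per element, in place.
__global__ void kogge_stone_scan(float *y, int n) {
    int i = threadIdx.x;
    for (int stride = 1; stride < n; stride *= 2) {
        float v = 0.0f;
        if (i >= stride && i < n)
            v = y[i - stride];  // read the element `stride` steps before
        __syncthreads();        // all reads finish ...
        if (i >= stride && i < n)
            y[i] += v;          // ... before anyone writes
        __syncthreads();        // writes land before the next iteration reads
    }
}
```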
Runtime analysis
- A parallel algorithm is work-efficient if it performs the same amount of work as the corresponding sequential algorithm
- Scan work efficiency:
    - Sequential scan performs N additions
    - Kogge-Stone parallel scan performs:
        - log(N) steps, with N − 2^step operations at each step
        - Total: (N−1) + (N−2) + (N−4) + … + (N−N/2) = N·log(N) − (N−1) = O(N·log(N)) operations (worked check below)
    - The algorithm is not work-efficient
- If resources are limited, the parallel algorithm will be slow because of its low work efficiency
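- Worked check (for N = 8): (8−1) + (8−2) + (8−4) = 17 operations across log(8) = 3 steps, versus 7 additions sequentially; and N·log(N) − (N−1) = 24 − 7 = 17.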
Implementation
Memory optimization
- The input array x is in global memory
- It doesn't make sense to keep reading and writing to and from global memory
    - Load once into a shared-memory buffer, and do everything in shared memory
    - Write out at the end (see the sketch after the Details list)
Details
- You need to sync the threads between every iteration
- You need to be careful of race conditions
    - To fix them, ensure that all threads finish reading before any thread starts writing:
        - read → __syncthreads() → write
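Putting the shared-memory staging and the read → sync → write pattern together, a sketch (BLOCK_SIZE is an illustrative compile-time constant that must match the launch configuration):

```cuda
#define BLOCK_SIZE 256  // illustrative; must equal the block size at launch

// Kogge-Stone scan staged in shared memory: load once, scan, write out once.
__global__ void kogge_stone_shared(const float *x, float *y, int n) {
    __shared__ float buf[BLOCK_SIZE];
    int i = threadIdx.x;
    buf[i] = (i < n) ? x[i] : 0.0f;  // single load from global memory
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float v = 0.0f;
        if (i >= stride)
            v = buf[i - stride];  // read ...
        __syncthreads();          // ... wait for all reads ...
        if (i >= stride)
            buf[i] += v;          // ... then write
        __syncthreads();
    }
    if (i < n) y[i] = buf[i];     // single store back to global memory
}
```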
- A better way to do it: double buffering (sketch below)
    - Read from buffer 1, write to buffer 2
    - Alternate on the next iteration (read from buffer 2, write to buffer 1)
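A sketch of the double-buffered variant (same illustrative BLOCK_SIZE); swapping the buffers removes the read/write hazard, so one barrier per iteration suffices:

```cuda
// Kogge-Stone scan with double buffering in shared memory.
__global__ void kogge_stone_double_buffer(const float *x, float *y, int n) {
    __shared__ float bufA[BLOCK_SIZE], bufB[BLOCK_SIZE];
    float *src = bufA, *dst = bufB;  // roles alternate every iteration
    int i = threadIdx.x;
    src[i] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        // Reads hit src, writes go to dst: no read/write race.
        dst[i] = (i >= stride) ? src[i] + src[i - stride] : src[i];
        __syncthreads();  // one barrier per iteration instead of two
        float *tmp = src; src = dst; dst = tmp;  // swap buffers
    }
    if (i < n) y[i] = src[i];  // after the final swap, src holds the result
}
```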
Brent-Kung
- Brent-Kung takes more steps but is more work-efficient
Brent-Kung example with 8 elements
Analysis
- Reduction stage:
    - log(N) steps
    - N/2 + N/4 + … + 2 + 1 = N − 1 operations
- Post-reduction stage:
    - log(N) − 1 steps
    - (2−1) + (4−1) + … + (N/2−1) = (N−2) − (log(N)−1) operations
- Total:
    - 2·log(N) − 1 steps
    - 2·N − log(N) − 2 = O(N) operations
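A single-block sketch of Brent-Kung (kernel name and BLOCK_SIZE are illustrative; for simplicity it launches one thread per element, though at most half are active in any step). As a worked check with N = 8: the reduction stage does 4 + 2 + 1 = 7 operations and the post-reduction stage does (2−1) + (4−1) = 4, for a total of 11 = 2·8 − 3 − 2.

```cuda
// Brent-Kung inclusive scan: reduction tree, then post-reduction distribution.
__global__ void brent_kung_scan(const float *x, float *y, int n) {
    __shared__ float buf[BLOCK_SIZE];
    int t = threadIdx.x;
    buf[t] = (t < n) ? x[t] : 0.0f;

    // Reduction stage: strides 1, 2, …, N/2 build partial sums in place.
    // At stride s, positions 2s−1, 4s−1, 6s−1, … accumulate their left neighbor.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        int i = (t + 1) * 2 * stride - 1;
        if (i < blockDim.x)
            buf[i] += buf[i - stride];
    }

    // Post-reduction stage: strides N/4, …, 2, 1 push sums back down the tree.
    for (int stride = blockDim.x / 4; stride > 0; stride /= 2) {
        __syncthreads();
        int i = (t + 1) * 2 * stride - 1;
        if (i + stride < blockDim.x)
            buf[i + stride] += buf[i];
    }

    __syncthreads();
    if (t < n) y[t] = buf[t];
}
```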