Definition

  • Takes:
    • An input array
    • An associative operator
      • e.g. sum, product, min, max
  • Returns:
    • An output array
      • Inclusive scan: output element i combines input elements 0 through i
      • Exclusive scan: output element i combines input elements 0 through i − 1
        • Element 0 is the operator’s identity, e.g. 0 for sum
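
As a reference, a minimal sequential sketch of both variants in C (the function names and the 8-element test input are mine, not from the notes):

#include <stdio.h>

/* Inclusive scan: out[i] covers in[0..i]. */
void inclusive_scan(const int *in, int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += in[i];
        out[i] = acc;
    }
}

/* Exclusive scan: out[i] covers in[0..i-1]; out[0] is the identity. */
void exclusive_scan(const int *in, int *out, int n) {
    int acc = 0;                      /* identity of + */
    for (int i = 0; i < n; ++i) {
        out[i] = acc;
        acc += in[i];
    }
}

int main(void) {
    int in[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    int inc[8], exc[8];
    inclusive_scan(in, inc, 8);       /* 3 4 11 11 15 16 22 25 */
    exclusive_scan(in, exc, 8);       /* 0 3  4 11 11 15 16 22 */
    for (int i = 0; i < 8; ++i)
        printf("%d %d\n", inc[i], exc[i]);
    return 0;
}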

Parallel scan

  • Parallel scan requires synchronization across parallel workers
    • On GPUs, it’s cheaper to synchronize threads within the same thread block than across blocks.

Segmented scan

  • Approach: segmented scan (a host-side sketch follows this list)
    • Every thread block scans a segment
    • Scan the segments’ partial sums (parallel across blocks)
      • This is another scan, run over the last value of each segment
    • Add segment i − 1’s scanned partial sum to every element of segment i
    • Diagram
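
A minimal host-side sketch of the three phases; the kernel names, BLOCK_SIZE, and the single-block assumption in phase 2 are mine, not from the notes:

#define BLOCK_SIZE 256

/* Assumed kernels: a single-block scan and an offset-add pass. */
__global__ void scan_block(float *data, float *partial_sums, int n);
__global__ void add_offsets(float *data, const float *partial_sums, int n);

void hierarchical_scan(float *d_data, float *d_partial, int n) {
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    /* Phase 1: every block scans its own segment and writes the
       segment total (its last element) to d_partial[blockIdx.x]. */
    scan_block<<<blocks, BLOCK_SIZE>>>(d_data, d_partial, n);
    /* Phase 2: scan the segment totals themselves; assumes
       blocks <= BLOCK_SIZE, otherwise recurse with the same scheme. */
    scan_block<<<1, BLOCK_SIZE>>>(d_partial, nullptr, blocks);
    /* Phase 3: add segment i-1's scanned total to every element
       of segment i. */
    add_offsets<<<blocks, BLOCK_SIZE>>>(d_data, d_partial, n);
}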

Parallel scan implementation within a single thread block

Kogge-Stone

Kogge-Stone example with 8 elements

  • We’re looking at parallel (inclusive) scan
  • Let’s do ONE parallel reduction tree
    • We write in place in the same array
    • We get the total in the last element, and some other prefix sums as a byproduct
      • Blue values are “already valid” solutions
    • Some values are never written, e.g. the even-indexed elements of an in-place reduction tree
  • Now, if we overlap and do 4 reduction trees, we get
    • First iteration: for every element, add the element one step before it
    • Second iteration: for every element, add the element two steps before it
    • Third iteration: for every element, add the element four steps before it
  • We get log₂(n) iterations in total (3 for n = 8)

Kogge-Stone Parallel (Inclusive) Scan

  • Use the above algorithm with one thread for each element
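
A sketch of a single-block CUDA kernel along these lines; SECTION_SIZE and the kernel name are my assumptions, and it already includes the read → sync → write fix discussed in the Details section below:

#define SECTION_SIZE 256   /* must equal blockDim.x, a power of two */

__global__ void kogge_stone_scan(const float *in, float *out, int n) {
    __shared__ float buf[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    /* Load once into shared memory; pad with the identity. */
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();                    /* previous step fully written */
        float v = 0.0f;
        if (threadIdx.x >= stride)
            v = buf[threadIdx.x - stride];  /* read the element `stride` before */
        __syncthreads();                    /* all reads done before any write */
        buf[threadIdx.x] += v;
    }
    if (i < n)
        out[i] = buf[threadIdx.x];          /* write results back to global memory */
}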

Runtime analysis

  • A parallel algorithm is work-efficient if it performs the same amount of work as the corresponding sequential algorithm
  • Scan work efficiency
    • Sequential scan performs N − 1 additions
    • Kogge-Stone parallel scan performs:
      • log₂(N) steps, with N − stride operations at each step
      • Total: N·log₂(N) − (N − 1) operations
    • The algorithm is not work-efficient
  • If resources are limited, the parallel algorithm will be slow because of its low work efficiency
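  • For example, with N = 1024: the sequential scan does 1023 additions, while Kogge-Stone does 10 · 1024 − 1023 = 9217, roughly 9× the work (a back-of-the-envelope figure, not from the notes)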

Implementation

Memory optimization

  • The input array is in global memory
    • Repeatedly reading from and writing to global memory between iterations is wasteful
  • Load once to a shared memory buffer, and do everything in shared memory
  • Write out at the end

Details

  • You need to synchronize the threads between every iteration
  • You need to be careful of race conditions
    • To fix them, ensure that all threads finish reading before any thread starts writing
      • read → __syncthreads() → write
    • A better way to do it (sketched below):
      • double buffering
        • read from buffer 1, write to buffer 2
        • alternate each iteration (read from buffer 2, write to buffer 1)
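
The same Kogge-Stone kernel restructured with double buffering, needing only one __syncthreads() per step; again a sketch under the SECTION_SIZE assumption above:

__global__ void kogge_stone_scan_db(const float *in, float *out, int n) {
    __shared__ float buf_a[SECTION_SIZE];
    __shared__ float buf_b[SECTION_SIZE];
    float *src = buf_a, *dst = buf_b;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf_a[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();               /* previous step fully written */
        if (threadIdx.x >= stride)
            dst[threadIdx.x] = src[threadIdx.x] + src[threadIdx.x - stride];
        else
            dst[threadIdx.x] = src[threadIdx.x];  /* carry unchanged values over */
        float *tmp = src; src = dst; dst = tmp;   /* swap buffer roles */
    }
    /* After the final swap, src points at the buffer written last. */
    if (i < n)
        out[i] = src[threadIdx.x];
}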

Brent-Kung

  • Brent-Kung takes more steps but is more work-efficient

Brent-Kung example with 8 elements

  • Diagram

Analysis

  • Reduction stage:
    • log₂(N) steps
    • N − 1 operations
  • Post-reduction stage:
    • log₂(N) − 1 steps
    • N − 1 − log₂(N) operations
  • Total:
    • 2·log₂(N) − 1 steps
    • 2N − 2 − log₂(N) operations, i.e. O(N): work-efficient
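
A sketch of a single-block Brent-Kung inclusive scan, one thread per element for simplicity (textbook versions often load two elements per thread); SECTION_SIZE and the kernel name are assumptions:

__global__ void brent_kung_scan(const float *in, float *out, int n) {
    __shared__ float buf[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    /* Reduction stage: an in-place reduction tree, log2(N) steps. */
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        int idx = (threadIdx.x + 1) * 2 * stride - 1;  /* right element of each pair */
        if (idx < blockDim.x)
            buf[idx] += buf[idx - stride];
    }
    /* Post-reduction stage: push partial sums back down the tree. */
    for (int stride = blockDim.x / 4; stride > 0; stride /= 2) {
        __syncthreads();
        int idx = (threadIdx.x + 1) * 2 * stride - 1;
        if (idx + stride < blockDim.x)
            buf[idx + stride] += buf[idx];
    }
    __syncthreads();
    if (i < n)
        out[i] = buf[threadIdx.x];
}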