• Source: https://www.youtube.com/watch?v=NQ-0D5Ti2dc&t=27s

  • Motivation: GPUs go brr, more FLOPS please

  • Bigger models are smarter

  • GPUs are the backbone of modern deep learning

  • Classic software: sequential programs

  • Multi-core CPUs emerged

  • Developers had to learn multi-threading (deadlocks, data races, etc.)

The rise of CUDA

  • GPUs have much higher peak FLOPS than multi-core CPUs
  • Main principle: divide work among threads (see the kernel sketch after this list)
  • GPUs focus on execution throughput of massive number of threads
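
A minimal sketch (not from the video) of the "divide work among threads" principle: a vector-add kernel where each CUDA thread computes exactly one output element. The kernel name, launch configuration, and use of unified memory are illustrative assumptions, not anything prescribed by the talk.

```cuda
#include <cstdio>

// Each thread handles one element: the work is divided by thread index.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMemcpy works too.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover all n elements
    vec_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```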

Challenges

  • If you do not care about performance, parallel programming is very easy
  • Designing parallel algorithms is harder than sequential algorithms
    • Parallelizing recurrent computations requires non-intuitive thinking (like prefix sum; see the scan sketch after this list)
  • Speed is often limited by memory latency/throughput (memory bound)
  • Performance of parallel programs can vary dramatically based on input data characteristics
  • Not all apps are “embarrassingly parallel” - synchronization imposes overheads
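
A minimal sketch (not from the video) of why recurrent computations need non-intuitive restructuring: a single-block inclusive prefix sum using the Hillis-Steele scan pattern, which replaces the sequential running sum with log2(N) parallel steps. The kernel name, the fixed size N, and the single-block launch are illustrative assumptions.

```cuda
#include <cstdio>

#define N 8  // must not exceed the block size for this single-block sketch

__global__ void inclusive_scan(const int *in, int *out) {
    __shared__ int temp[N];
    int tid = threadIdx.x;
    temp[tid] = in[tid];
    __syncthreads();

    // At each step, every element absorbs the partial sum 'stride' positions
    // to its left; after log2(N) steps each element holds its prefix sum.
    for (int stride = 1; stride < N; stride *= 2) {
        int val = 0;
        if (tid >= stride) val = temp[tid - stride];
        __syncthreads();   // all reads must finish before any writes
        temp[tid] += val;
        __syncthreads();   // all writes must finish before the next read
    }
    out[tid] = temp[tid];
}

int main() {
    int h_in[N] = {1, 2, 3, 4, 5, 6, 7, 8}, h_out[N];
    int *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, N * sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);
    inclusive_scan<<<1, N>>>(d_in, d_out);  // one block of N threads
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("%d ", h_out[i]);  // 1 3 6 10 15 21 28 36
    printf("\n");
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```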

Main goals of the book

  1. Parallel programming & computational thinking
  2. Correctness & reliability: debugging for both functionality & performance
  3. Scalability: regularize and localize memory access