Motivation: GPUs go brr, more FLOPS please
- Bigger models are smarter
- GPUs are the backbone of modern deep learning
- Classic software: sequential programs
- Multi-core CPUs emerged
- Developers had to learn multi-threading (deadlocks, races, etc.)
The rise of CUDA
- GPUs have much higher peak FLOPS than multi-core CPUs
- Main principle: divide work among threads (see the kernel sketch after this list)
- GPUs focus on execution throughput of massive number of threads
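
A minimal CUDA sketch of the "divide work among threads" principle: each thread computes exactly one output element of a vector addition. The kernel name, block size, and launch snippet are illustrative, not from the book.

```cuda
#include <cuda_runtime.h>

// Each thread computes exactly one element: work is divided among threads.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n) {                                    // guard: grid may overshoot n
        c[i] = a[i] + b[i];
    }
}

// Illustrative launch: enough blocks to cover all n elements.
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;  // ceiling division
// vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```
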
Challenges
- If you do not care about performance, parallel programming is very easy
- Designing parallel algorithms is harder than sequential algorithms
- Parallelizing recurrent computations requires non-intuitive thinking (like prefix sum; see the scan sketch after this list)
- Speed is often limited by memory latency/throughput (memory bound)
- Performance of parallel programs can vary dramatically based on input data characteristics
- Not all apps are “embarrassingly parallel” - synchronization imposes overheads
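
As one sketch of that non-intuitive thinking (and of the synchronization overhead from the last bullet), below is a single-block, Kogge-Stone-style inclusive prefix sum. The kernel name and BLOCK size are assumptions, and a full scan would also combine partial results across blocks.

```cuda
#define BLOCK 256  // assumed block size; launch with blockDim.x == BLOCK

// Single-block inclusive prefix sum (Kogge-Stone style).
// Each step, thread i adds the value `stride` positions behind it;
// the stride doubles every step, so the scan finishes in O(log BLOCK) steps.
__global__ void inclusiveScan(const float* in, float* out, int n) {
    __shared__ float buf[BLOCK];
    int i = threadIdx.x;
    buf[i] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float val = 0.0f;
        if (i >= stride) val = buf[i - stride];
        __syncthreads();                 // everyone reads before anyone writes
        if (i >= stride) buf[i] += val;
        __syncthreads();                 // writes finish before the next round
    }
    if (i < n) out[i] = buf[i];
}
```
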
Main goals of the book
- Parallel programming & computational thinking
- Correct & reliable: debugging both functionality & performance
- Scalability: regularize and localize memory access (see the access-pattern sketch below)
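
A hypothetical sketch of what "regularize memory access" means in practice: when consecutive threads in a warp read consecutive addresses (coalesced), the hardware merges their loads into a few wide transactions; strided access scatters them and wastes bandwidth. Kernel names are illustrative.

```cuda
// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// adjacent addresses and merge into a few wide memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, spreading a warp's loads
// across many transactions; same work, far lower effective bandwidth.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```
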