The task

  • Assume for the moment 1 head, no batch dimension
    • This is okay because heads and batch elements are fully independent, i.e. “embarrassingly parallel”
  • We have $Q, K, V$ of shape $N \times d$, seq. length $N$, head dimension $d$
  • $S = QK^\top \in \mathbb{R}^{N \times N}$, $O = \operatorname{softmax}(S)\,V \in \mathbb{R}^{N \times d}$
  • How do we parallelize this and do it in one go (a single fused pass)?
    • $S$ is an $N \times N$ intermediate, can we avoid materializing it? (See the sketch after this list.)
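
Below is a minimal NumPy sketch of both options (an illustration of the idea, not an actual fused kernel): `naive_attention` materializes the full $N \times N$ matrix $S$, while `streaming_attention` uses the online-softmax recurrence to visit $K$ and $V$ in blocks, carrying only $O(Nd)$ running state. The `block` size is an arbitrary choice, and the usual $1/\sqrt{d}$ scaling is omitted to match the formulas above.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference attention: materializes the full N x N score matrix."""
    S = Q @ K.T                                   # (N, N) intermediate
    P = np.exp(S - S.max(axis=1, keepdims=True))  # stable row-wise softmax
    P /= P.sum(axis=1, keepdims=True)
    return P @ V                                  # (N, d)

def streaming_attention(Q, K, V, block=128):
    """Online-softmax attention: never materializes S.

    Visits K/V in blocks, carrying only a running row max m, a running
    softmax denominator l, and the output accumulator O.
    """
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full((N, 1), -np.inf)  # running row-wise max of scores
    l = np.zeros((N, 1))          # running row-wise sum of exp(scores)
    for j in range(0, N, block):
        S_blk = Q @ K[j:j + block].T                     # (N, block) tile
        m_new = np.maximum(m, S_blk.max(axis=1, keepdims=True))
        scale = np.exp(m - m_new)                        # rescale old stats
        P_blk = np.exp(S_blk - m_new)
        l = l * scale + P_blk.sum(axis=1, keepdims=True)
        O = O * scale + P_blk @ V[j:j + block]
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), streaming_attention(Q, K, V))
```

Both versions agree up to floating-point error, which is the point: the $N \times N$ intermediate was never needed.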

Thoughts

  • Attention is like a 2-layer network (matmul, softmax, matmul), but the head dimension $d$ is very small relative to the sequence length $N$ (see the arithmetic sketch below)
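
To make “very small” concrete, a back-of-the-envelope sketch with illustrative sizes $N = 8192$, $d = 64$ in fp32 (assumed here, not given in the notes):

```python
# Illustrative sizes, assumed for this estimate: one head, fp32.
N, d, itemsize = 8192, 64, 4

def mib(n_elems):
    return n_elems * itemsize / 2**20

print(f"S = QK^T (N x N): {mib(N * N):7.1f} MiB")       # 256.0 MiB per head
print(f"Q, K, V  (N x d): {mib(N * d):7.1f} MiB each")  # 2.0 MiB each
```

The “hidden layer” of this 2-layer network is $N$ wide, so materializing it costs $N/d$ times the memory of the inputs (128x here), which is exactly what the streaming formulation above avoids.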