Assume that vectors are column vectors in the notation below

Computing safe softmax in an online fashion

Online algorithm for the normalization factor

  • Let $m_i$ be the max value seen up to step $i$ and $d_i$ the running sum at step $i$

  • At step 0

    • $m_0 = x_0$, $d_0 = e^{x_0 - m_0} = 1$
  • At step 1

    • if $x_1 > m_0$: $m_1 = x_1$, otherwise $m_1 = m_0$; either way $d_1 = d_0\,e^{m_0 - m_1} + e^{x_1 - m_1}$
  • At step $i$,

    • By induction, we assume $d_{i-1} = \sum_{j=0}^{i-1} e^{x_j - m_{i-1}}$
    • then $m_i = \max(m_{i-1}, x_i)$ and $d_i = d_{i-1}\,e^{m_{i-1} - m_i} + e^{x_i - m_i}$ (a small sketch follows this list)
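
A minimal runnable sketch of the update above, assuming NumPy; the function name and the 1-D input are illustrative, not from any library:

import numpy as np

def online_softmax(x):
    m = -np.inf  # running max m_i
    d = 0.0      # running sum d_i = sum_j exp(x_j - m_i)
    for x_i in x:
        m_new = max(m, x_i)
        # rescale the old sum to the new max, then add the new term
        d = d * np.exp(m - m_new) + np.exp(x_i - m_new)
        m = m_new
    # the final softmax uses the global max m and the normalizer d
    return np.exp(x - m) / d

x = np.random.randn(16)
assert np.allclose(online_softmax(x), np.exp(x - x.max()) / np.exp(x - x.max()).sum())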

Computing $O$ without materializing $S$

  • Using Block Matrix Multiplication, the high-level idea is that you first compute a block of $S = QK^\top$ (ideally the shape of the block should be independent of the sequence length, e.g. (c,c)), and then directly reuse the resulting block to compute $O$ with the corresponding block of $V$ (see the sketch below).
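
A rough sketch of the blocking idea, assuming NumPy (softmax omitted here; the next sections add it). The names Q, K, V and the block size c are illustrative:

import numpy as np

T, dim, c = 8, 4, 2
Q, K, V = (np.random.randn(T, dim) for _ in range(3))

O = np.zeros((T, dim))
for i in range(0, T, c):      # block of c query rows
    for j in range(0, T, c):  # block of c key/value rows
        S_blk = Q[i:i+c] @ K[j:j+c].T  # (c,c) block of S, never stored globally
        O[i:i+c] += S_blk @ V[j:j+c]   # consumed immediately against the V block

assert np.allclose(O, (Q @ K.T) @ V)   # same result as materializing S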

Online block softmax computation

  • For the sake of simplicity, let’s say we have blocks of shape (c,dim) of $Q$ and of shape (dim,c) for $K^\top$

    • In this case, we’re selecting entire rows of $Q$ as blocks (called $Q_i$), and entire columns of $K^\top$ as blocks (called $K_j^\top$)
  • We’ll call $S = QK^\top$ the score matrix of shape (T,T)

    • Assuming we do block-matrix MM, we get that $S_{ij} = Q_i K_j^\top$ of shape (c,c) (note that there’s no accumulation, because the blocks span entire rows/columns)
  • Let’s call $\tilde{P} = \exp(S - \mathrm{rowmax}(S))$, which is softmax without the normalization (the max is still taken over each row vector)

  • Let’s call $\tilde{P}_{ij} = \exp(S_{ij} - \mathrm{rowmax}(S_{ij}))$, where the softmax is applied over each block independently, i.e. each block uses its own local row maximum

  • Then we do a block MM with $V$, with blocks $V_j$ of shape (c,dim), giving us block rows of shape (c,dim) for the output matrix, i.e. $O_i = \sum_j \tilde{P}_{ij} V_j$

  • Now obviously, each $\tilde{P}_{ij}$ in that sum uses a different local maximum. However, we can reuse our Computing safe softmax in an online fashion idea, communicate the maximum of each row in a block, and just readjust to the current maximum as we accumulate

    • We can compute the normalization factor in the same online fashion
    • Does this require the accumulation to be non-parallel, i.e. one-by-one?
      • No: because the combined update (take the max of the row maxima, rescale the partial sums accordingly) is an associative operator, we can do a parallel-prefix sum (see the sketch after this list)
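
A small sketch of the associativity claim, assuming NumPy; combine and the (max, sum) pair representation are illustrative, not from any library:

import numpy as np
from functools import reduce

def combine(a, b):
    m_a, d_a = a
    m_b, d_b = b
    m = max(m_a, m_b)
    # rescale both partial sums to the shared max before adding them
    return m, d_a * np.exp(m_a - m) + d_b * np.exp(m_b - m)

x = np.random.randn(8)
parts = [(x_i, 1.0) for x_i in x]  # each single element is a trivial (max, sum) pair

sequential = reduce(combine, parts)                                     # one-by-one accumulation
tree = combine(reduce(combine, parts[:4]), reduce(combine, parts[4:]))  # parallel-friendly grouping
assert np.isclose(sequential[0], tree[0]) and np.isclose(sequential[1], tree[1])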

The pseudo-code with for-loops to make it simpler

For each output block o_j:
	o_j = [0,0,...,0] # shape (c,dim)
	curr_max = [-infinity,...,-infinity] # shape (c,1)
	normalization = [0,...,0] # shape (c,1)
	for i in range(T/c):
		## compute the score block: query block Q_j against key block K_i
		S_ji = Q_j K_i^T # shape (c,c)
		## update the running row max; the correction factor rescales the old accumulators
		prev_max = curr_max
		curr_max = max(prev_max, rowmax(S_ji))
		correction_factor = exp(prev_max - curr_max)
		P_ji = exp(S_ji - curr_max)
		## correcting the sums
		normalization = normalization*correction_factor + rowsum(P_ji)
		o_j = o_j*correction_factor + P_ji*V_i

	## normalize at the end
	o_j = o_j / normalization
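
A runnable NumPy version of the loop above, checked against plain softmax(QK^T)V; the block size c and all names are illustrative assumptions, not FA2's actual kernel:

import numpy as np

def block_attention(Q, K, V, c):
    T, dim = Q.shape
    O = np.zeros((T, dim))
    for j in range(0, T, c):                          # one query/output block per iteration
        o = np.zeros((c, dim))
        curr_max = np.full((c, 1), -np.inf)
        norm = np.zeros((c, 1))
        for i in range(0, T, c):                      # stream over key/value blocks
            S = Q[j:j+c] @ K[i:i+c].T                 # (c,c) score block
            prev_max = curr_max
            curr_max = np.maximum(prev_max, S.max(axis=1, keepdims=True))
            corr = np.exp(prev_max - curr_max)        # rescales the old accumulators
            P = np.exp(S - curr_max)
            norm = norm * corr + P.sum(axis=1, keepdims=True)
            o = o * corr + P @ V[i:i+c]
        O[j:j+c] = o / norm                           # normalize once at the end
    return O

T, dim, c = 8, 4, 2
Q, K, V = (np.random.randn(T, dim) for _ in range(3))
S = Q @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(block_attention(Q, K, V, c), ref)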
 

Actual FA2 pseudocode (with more comments on loading between SRAM and HBM)