Full Formula of MLA

  • Notation:
    • $W^{D\cdot}$ is a down-projection matrix into the compressed (latent) space
    • $W^{U\cdot}$ is an up-projection matrix out of the compressed (latent) space

Setup and dimensions

Let

  • $d$ = hidden_dim
  • $n_h$ = num_attention_heads
  • $d_h$ = qk_nope_head_dim (no-PE/low-rank per-head dim)
  • $d_h^R$ = qk_rope_head_dim (RoPE per-head dim)
  • $d_c'$ = q_lora_rank
  • $d_c$ = kv_lora_rank

Vectors (columns):

  • $h_t \in \mathbb{R}^{d}$ (hidden state), $c_t^{Q} \in \mathbb{R}^{d_c'}$, $c_t^{KV} \in \mathbb{R}^{d_c}$ (compressed latents)
  • Per head $i$: $q_{t,i}^{C} \in \mathbb{R}^{d_h}$, $q_{t,i}^{R} \in \mathbb{R}^{d_h^R}$, $k_{t,i}^{C} \in \mathbb{R}^{d_h}$, $k_t^{R} \in \mathbb{R}^{d_h^R}$, $v_{t,i} \in \mathbb{R}^{d_h}$

Matrices (biases omitted; matrices left-multiply column vectors):

  • $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ (code: q_down_lora)
  • $W^{UQ} \in \mathbb{R}^{n_h d_h \times d_c'}$ (code: q_up_nope)
  • $W^{QR} \in \mathbb{R}^{n_h d_h^R \times d_c'}$ (code: q_up_rope)
  • $W^{DKV} \in \mathbb{R}^{d_c \times d}$ (code: kv_down_lora)
  • $W^{UK} \in \mathbb{R}^{n_h d_h \times d_c}$ (code: k_up_nope)
  • $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ (code: k_rope)
  • $W^{UV} \in \mathbb{R}^{n_h d_h \times d_c}$ (code: v_up)
  • $W^{O} \in \mathbb{R}^{d \times n_h d_h}$ (code: out_proj)

RoPE preserves size: $\mathrm{RoPE}(\cdot): \mathbb{R}^{d_h^R} \to \mathbb{R}^{d_h^R}$.

Algorithm

$$
\begin{aligned}
c_t^{Q} &= W^{DQ} h_t \\
[q_{t,1}^{C}; \ldots; q_{t,n_h}^{C}] = q_t^{C} &= W^{UQ} c_t^{Q} \\
[q_{t,1}^{R}; \ldots; q_{t,n_h}^{R}] = q_t^{R} &= \mathrm{RoPE}(W^{QR} c_t^{Q}) \\
q_{t,i} &= [q_{t,i}^{C}; q_{t,i}^{R}] \\
\boxed{c_t^{KV}} &= W^{DKV} h_t \\
[k_{t,1}^{C}; \ldots; k_{t,n_h}^{C}] = k_t^{C} &= W^{UK} c_t^{KV} \\
\boxed{k_t^{R}} &= \mathrm{RoPE}(W^{KR} h_t) \\
k_{t,i} &= [k_{t,i}^{C}; k_t^{R}] \\
[v_{t,1}; \ldots; v_{t,n_h}] = v_t &= W^{UV} c_t^{KV} \\
o_{t,i} &= \sum_{j \le t} \mathrm{Softmax}_j\!\left(\frac{q_{t,i}^{\top} k_{j,i}}{\sqrt{d_h + d_h^R}}\right) v_{j,i} \\
u_t &= W^{O} [o_{t,1}; \ldots; o_{t,n_h}]
\end{aligned}
$$

where the boxed vectors in blue ($c_t^{KV}$ and $k_t^{R}$) need to be cached for generation.
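To make the shapes concrete, here is a minimal single-token sketch of the naive forward pass above, with toy dimensions and a generic rotate-half `rope` helper standing in for the real one (a sketch under these assumptions, not DeepSeek's implementation).

```python
# Naive MLA forward for one query token attending over T cached positions (toy sizes).
import torch

torch.manual_seed(0)

d, n_h, d_h, d_hR, d_cq, d_c = 64, 4, 16, 8, 24, 32   # toy dims, not DeepSeek's
T = 5                                                  # tokens seen so far

def rope(x, pos):
    # Generic rotate-half RoPE on the last dim (assumed stand-in for the real helper).
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (torch.arange(half) / half))
    cos, sin = torch.cos(pos * freqs), torch.sin(pos * freqs)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Projection matrices (biases omitted), shapes as in the list above
W_DQ  = torch.randn(d_cq, d)
W_UQ  = torch.randn(n_h * d_h, d_cq)
W_QR  = torch.randn(n_h * d_hR, d_cq)
W_DKV = torch.randn(d_c, d)
W_UK  = torch.randn(n_h * d_h, d_c)
W_KR  = torch.randn(d_hR, d)
W_UV  = torch.randn(n_h * d_h, d_c)
W_O   = torch.randn(d, n_h * d_h)

h = torch.randn(T, d)                                  # hidden states for positions 0..T-1

# Queries for the newest token t = T-1
c_q = W_DQ @ h[-1]
q_C = (W_UQ @ c_q).view(n_h, d_h)
q_R = rope((W_QR @ c_q).view(n_h, d_hR), pos=T - 1)

# Keys / values for every position (the naive path up-projects all of them)
c_kv = h @ W_DKV.T                                     # (T, d_c): the latents MLA caches
k_C  = (c_kv @ W_UK.T).view(T, n_h, d_h)
k_R  = torch.stack([rope(W_KR @ h[j], pos=j) for j in range(T)])   # shared across heads
v    = (c_kv @ W_UV.T).view(T, n_h, d_h)

# Per-head attention, then the output projection
scores = (torch.einsum("hd,thd->ht", q_C, k_C) +
          torch.einsum("hr,tr->ht", q_R, k_R)) / (d_h + d_hR) ** 0.5
attn = torch.softmax(scores, dim=-1)                   # (n_h, T)
o    = torch.einsum("ht,thd->hd", attn, v)             # (n_h, d_h)
u    = W_O @ o.reshape(-1)                             # (d,)
print(u.shape)                                         # torch.Size([64])
```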

Inference optimization

  • During inference, the naive formula needs to recover $k_t^{C}$ and $v_t$ from $c_t^{KV}$ for attention.

  • Fortunately, due to the associative law of matrix multiplication, we can absorb $W^{UK}$ into $W^{UQ}$ (the query side), and $W^{UV}$ into $W^{O}$.

    • Proof below
    • Through this optimization, we avoid the computational overhead of recomputing $k_t^{C}$ and $v_t$ during inference.
  • The absorption trick means that the head (reduction) dimension for self-attention is $(d_c + d_h^R)$ during inference, compared to $(d_h + d_h^R)$ during training.

    • Given that for DeepSeek-V2 $d_c$ is set to $512$ and $d_h$ is set to $128$ (with $d_h^R = 64$), the reduction dimension is about $3\times$ larger during inference: $(512 + 64)/(128 + 64) = 3$; see the quick check after this list.
    • Thus, we're trading off KV-cache size for arithmetic intensity.
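A quick check of the reduction-dimension arithmetic, using the DeepSeek-V2 values quoted above.

```python
# Per-head QK reduction dimensions: naive (training) path vs. absorbed (inference) path.
d_h, d_hR, d_c = 128, 64, 512        # DeepSeek-V2 values quoted above

train_dim = d_h + d_hR               # canonical path: 192
infer_dim = d_c + d_hR               # absorbed path: 576
print(train_dim, infer_dim, infer_dim / train_dim)   # 192 576 3.0
```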

KV cache per token

| Attention Mechanism | KV Cache per Token (# Elements) | Capability |
| --- | --- | --- |
| Multi-Head Attention (MHA) | $2 n_h d_h l$ | Strong |
| Grouped-Query Attention (GQA) | $2 n_g d_h l$ | Moderate |
| Multi-Query Attention (MQA) | $2 d_h l$ | Weak |
| MLA (Ours) | $(d_c + d_h^R)\, l \approx \tfrac{9}{2} d_h l$ | Stronger |

$n_h$ denotes the number of attention heads, $d_h$ denotes the dimension per attention head, $l$ denotes the number of layers, $n_g$ denotes the number of groups in GQA, and $d_c$ and $d_h^R$ denote the KV compression dimension and the per-head dimension of the decoupled queries and key in MLA, respectively.

The amount of KV cache is measured by the number of elements, regardless of the storage precision. For DeepSeek-V2, $d_c$ is set to $4 d_h$ and $d_h^R$ is set to $\tfrac{d_h}{2}$. So its KV cache is equal to that of GQA with only 2.25 groups, but its performance is stronger than MHA.
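A small sanity check of the MLA row and the "2.25 groups" claim, plugging in the relations just stated (counted per layer, so the factor $l$ cancels).

```python
# KV-cache elements per token per layer, using the relations above.
d_h = 128                        # per-head dim
d_c, d_hR = 4 * d_h, d_h // 2    # d_c = 4*d_h = 512, d_h^R = d_h/2 = 64

mla_cache = d_c + d_hR           # (d_c + d_h^R) = 576 = 4.5 * d_h
print(mla_cache, mla_cache / d_h)    # 576 4.5
print(mla_cache / (2 * d_h))         # 2.25 -> equivalent GQA group count n_g
```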

Ablations between MQA, GQA, MHA, and MLA

Inference optimizations - Absorption

Multi-Head Latent Attention (MLA): Inference Absorption Proof

  • The key to the proof is to do it on a per-head basis

MLA equations (condensed)

Per-head scaled scores use $q_{t,i} = [q_{t,i}^{C}; q_{t,i}^{R}]$ and $k_{j,i} = [k_{j,i}^{C}; k_j^{R}]$, where $q_{t,i}^{C} = W_i^{UQ} c_t^{Q}$, $k_{j,i}^{C} = W_i^{UK} c_j^{KV}$, $v_{j,i} = W_i^{UV} c_j^{KV}$, and $W_i^{UQ}$, $W_i^{UK}$, $W_i^{UV}$ are the per-head blocks of the up-projections:

$$
s_{t,j,i} = \frac{(q_{t,i}^{C})^{\top} k_{j,i}^{C} + (q_{t,i}^{R})^{\top} k_j^{R}}{\sqrt{d_h + d_h^R}}
$$
Claim A (keys): absorb $W^{UK}$ into the query side

For head $i$, the no-PE part of the dot product is

$$
(q_{t,i}^{C})^{\top} k_{j,i}^{C} = (W_i^{UQ} c_t^{Q})^{\top} (W_i^{UK} c_j^{KV}) = (c_t^{Q})^{\top} (W_i^{UQ})^{\top} W_i^{UK}\, c_j^{KV}.
$$

Define the composed matrix and the KV-space query

$$
M_i = (W_i^{UK})^{\top} W_i^{UQ} \in \mathbb{R}^{d_c \times d_c'}, \qquad \tilde{q}_{t,i} = M_i\, c_t^{Q} = (W_i^{UK})^{\top} q_{t,i}^{C} \in \mathbb{R}^{d_c}.
$$

Then

$$
(q_{t,i}^{C})^{\top} k_{j,i}^{C} = \tilde{q}_{t,i}^{\top} c_j^{KV}.
$$

Consequence. During inference, you only need the cached $c_j^{KV}$ (not $k_{j,i}^{C}$). Precompute $M_i$ once and form $\tilde{q}_{t,i}$ for the current token. The total score for head $i$ becomes

$$
s_{t,j,i} = \frac{\tilde{q}_{t,i}^{\top} c_j^{KV} + (q_{t,i}^{R})^{\top} k_j^{R}}{\sqrt{d_h + d_h^R}}.
$$

(If you prefer a single matrix, stack the blocks: the numerator is $[\tilde{q}_{t,i}; q_{t,i}^{R}]^{\top} [c_j^{KV}; k_j^{R}]$.)
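A minimal numerical check of Claim A with random per-head blocks (toy sizes, my own stand-in names): the absorbed score matches the naive one without ever forming $k^{C}$.

```python
# Verify (q^C)^T k^C == ((W_UK)^T q^C)^T c^KV for one head and one cached token.
import torch

torch.manual_seed(0)
d_h, d_cq, d_c = 16, 24, 32            # toy per-head / latent sizes

W_UQ = torch.randn(d_h, d_cq)          # per-head query up-projection block
W_UK = torch.randn(d_h, d_c)           # per-head key up-projection block
c_q  = torch.randn(d_cq)               # compressed query latent
c_kv = torch.randn(d_c)                # cached compressed KV latent

naive    = (W_UQ @ c_q) @ (W_UK @ c_kv)        # materializes q^C and k^C
absorbed = (W_UK.T @ (W_UQ @ c_q)) @ c_kv      # stays in the d_c latent space
print(torch.allclose(naive, absorbed, atol=1e-4))   # True
```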


Claim B (values): absorb $W^{UV}$ into the output projection

Let $\alpha_{t,j,i} = \mathrm{Softmax}_j(s_{t,j,i})$ be the attention weights per head. Then

$$
o_{t,i} = \sum_{j \le t} \alpha_{t,j,i}\, v_{j,i} = \sum_{j \le t} \alpha_{t,j,i}\, W_i^{UV} c_j^{KV} = W_i^{UV}\, \tilde{o}_{t,i},
$$

where we defined the compressed-space mixture

$$
\tilde{o}_{t,i} = \sum_{j \le t} \alpha_{t,j,i}\, c_j^{KV} \in \mathbb{R}^{d_c}.
$$

Stack across heads into $\tilde{o}_t = [\tilde{o}_{t,1}; \ldots; \tilde{o}_{t,n_h}] \in \mathbb{R}^{n_h d_c}$ and set $\tilde{W}^{O} = W^{O}\, \mathrm{blockdiag}(W_1^{UV}, \ldots, W_{n_h}^{UV}) \in \mathbb{R}^{d \times n_h d_c}$. The final output is

$$
u_t = W^{O} [o_{t,1}; \ldots; o_{t,n_h}] = \tilde{W}^{O}\, \tilde{o}_t.
$$

Consequence. Precompute $\tilde{W}^{O}$. At inference you never materialize $v_{j,i}$ or $o_{t,i}$:

  1. build $\tilde{o}_{t,i}$ directly in the compressed space from cached $c_j^{KV}$ and weights $\alpha_{t,j,i}$,
  2. concatenate to $\tilde{o}_t$, and
  3. apply $\tilde{W}^{O}$ once to get $u_t$.

What to cache (matches the blue boxes)

  • $c_j^{KV} \in \mathbb{R}^{d_c}$ for each past token $j$ (the low-rank KV cache).
  • $k_j^{R}$ (per head or shared), unchanged by the trick.

No need to cache or recompute $k_{j,i}^{C}$ or $v_{j,i}$.


Optimized inference recipe (per new token $t$)

  1. Compute $c_t^{Q}$, $c_t^{KV}$, and $k_t^{R}$ (cache $c_t^{KV}$ and $k_t^{R}$).
  2. For each head $i$, compute $\tilde{q}_{t,i} = (W_i^{UK})^{\top} W_i^{UQ} c_t^{Q}$ and $q_{t,i}^{R}$.
  3. Scores: $s_{t,j,i} = \big(\tilde{q}_{t,i}^{\top} c_j^{KV} + (q_{t,i}^{R})^{\top} k_j^{R}\big) / \sqrt{d_h + d_h^R}$.
  4. Weights: $\alpha_{t,j,i} = \mathrm{Softmax}_j(s_{t,j,i})$.
  5. Mix in compressed space: $\tilde{o}_{t,i} = \sum_{j \le t} \alpha_{t,j,i}\, c_j^{KV}$.
  6. Output: $u_t = \tilde{W}^{O}\, [\tilde{o}_{t,1}; \ldots; \tilde{o}_{t,n_h}]$.

All expensive per-token operations after step 1 happen in the low-rank space.
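Below is a compact end-to-end sketch of this recipe checked against the naive path, with toy sizes, random weights, and the RoPE branch omitted so the comparison isolates the two absorptions; the names and shapes are assumptions for illustration, not DeepSeek's code.

```python
# One absorbed decode step vs. the naive path (RoPE branch omitted, toy sizes).
import torch

torch.manual_seed(0)
d, n_h, d_h, d_cq, d_c, T = 64, 4, 16, 24, 32, 5

W_DQ, W_UQ  = torch.randn(d_cq, d), torch.randn(n_h, d_h, d_cq)   # per-head UQ blocks
W_DKV, W_UK = torch.randn(d_c, d), torch.randn(n_h, d_h, d_c)
W_UV, W_O   = torch.randn(n_h, d_h, d_c), torch.randn(d, n_h * d_h)

h    = torch.randn(T, d)                    # positions 0..T-1; t = T-1 is the new token
c_q  = W_DQ @ h[-1]                         # (d_cq,)
c_kv = h @ W_DKV.T                          # (T, d_c): the cached latents

# --- naive path: up-project keys/values for every cached position ---
q_C = torch.einsum("hdc,c->hd", W_UQ, c_q)              # (n_h, d_h)
k_C = torch.einsum("hdc,tc->thd", W_UK, c_kv)           # (T, n_h, d_h)
v   = torch.einsum("hdc,tc->thd", W_UV, c_kv)           # (T, n_h, d_h)
s   = torch.einsum("hd,thd->ht", q_C, k_C) / d_h ** 0.5
a   = torch.softmax(s, dim=-1)
o   = torch.einsum("ht,thd->hd", a, v)
u_naive = W_O @ o.reshape(-1)

# --- absorbed path: stay in the d_c latent space ---
q_tilde = torch.einsum("hdc,hd->hc", W_UK, q_C)          # absorb W_UK into the query side
s2 = torch.einsum("hc,tc->ht", q_tilde, c_kv) / d_h ** 0.5
a2 = torch.softmax(s2, dim=-1)
o_tilde = torch.einsum("ht,tc->hc", a2, c_kv)            # compressed-space mixture
W_O_heads = W_O.reshape(d, n_h, d_h).permute(1, 2, 0)    # (n_h, d_h, d)
W_O_tilde = torch.einsum("hdc,hdo->hco", W_UV, W_O_heads)  # absorb W_UV into W_O
u_absorb  = torch.einsum("hc,hco->o", o_tilde, W_O_tilde)

print(torch.allclose(u_naive, u_absorb, atol=1e-3))      # True
```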


Code to absorb $W^{UV}$ into $W^{O}$

import torch
from torch import nn
 
class AbsorbDemo:
 
    def __init__(self, bsz=1, q_len=1, kv_len=4, dim=7168, kv_lora_rank=512, n_heads=128, v_head_dim=128):
        
        self.n_heads = n_heads
        self.v_head_dim = v_head_dim
        self.dim = dim
        
        # attention weights alpha_{t,j,i}: (bsz, q_len, n_heads, kv_len)
        self.scores = torch.rand(bsz, q_len, n_heads, kv_len)
        # cached compressed latents c^{KV}: (bsz, kv_len, kv_lora_rank)
        self.kv_cache = torch.rand(bsz, kv_len, kv_lora_rank)
        # per-head value up-projection W^{UV}: (n_heads, v_head_dim, kv_lora_rank)
        self.w_uv = torch.rand(n_heads, v_head_dim, kv_lora_rank)
        # output projection W^O
        self.wo = nn.Linear(self.n_heads * self.v_head_dim, self.dim, bias=False)
        self.wo_absorb = None

    def run(self, absorb=False):
        # mix the cached latents with the attention weights: o~_{t,i} = sum_j alpha_{t,j,i} c_j^{KV}
        x = torch.einsum("bsht,btc->bshc", self.scores, self.kv_cache)
        if absorb:
            if self.wo_absorb is None:
                # fold W^{UV} into W^O once: result has shape (n_heads, kv_lora_rank, dim)
                wo = self.wo.weight                          # (dim, n_heads * v_head_dim)
                wo = wo.transpose(0, 1).view(self.n_heads, self.v_head_dim, self.dim)
                self.wo_absorb = torch.einsum("hdc,hdi->hci", self.w_uv, wo)
            # project straight from the compressed space to the model dim, then sum over heads
            x = torch.einsum("bshc,hci->bshi", x, self.wo_absorb)
            x = torch.sum(x, dim=2)
        else:
            # naive path: materialize the per-head values first, which costs more memory
            x = torch.einsum("bshc,hdc->bshd", x, self.w_uv)
            x = self.wo(x.flatten(2))
        return x
 
 
demo = AbsorbDemo()
tensor1 = demo.run(absorb=False)
tensor2 = demo.run(absorb=True)
print("w/o absorb:", tensor1.data)
print("w   absorb:", tensor2.data)
print(torch.allclose(tensor1.data, tensor2.data, atol=1e-03))
 

MLA Absorption: Why it's an Inference-Only Optimization

Short answer: it's great for generation-time reuse (weights frozen; KV cache grows), but awkward or counter-productive for training because of autograd dependencies and compute shape. In the absorbed path the QK reduction per head is $d_c + d_h^R$, whereas in the canonical training path it's $d_h + d_h^R$. The inference win comes from never materializing or recomputing $k^{C}$ and $v$ over the growing cache, not from a smaller reduction dimension.


Why inference-only (autograd + compute)

  1. Autograd dependency.
    The absorbed matrices are products of learnable weights: $(W_i^{UK})^{\top} W_i^{UQ}$ and $W^{O}\,\mathrm{blockdiag}(W_1^{UV}, \ldots, W_{n_h}^{UV})$. If you precompute/cache these products as constants for speed, then in the graph they don't depend on the underlying weights, so gradients vanish: $\partial \mathcal{L} / \partial W^{UQ} = \partial \mathcal{L} / \partial W^{UK} = 0$ (and likewise for $W^{UV}$ and $W^{O}$). That's fine in eval, but during training it kills learning.

    You could keep the products in-graph (no detach) so the chain rule applies: gradients flow through $(W_i^{UK})^{\top} W_i^{UQ}$ back to both $W_i^{UK}$ and $W_i^{UQ}$ (and similarly for $\tilde{W}^{O}$ back to $W^{UV}$ and $W^{O}$). But then you must rebuild these products every forward pass, keep intermediates for backward, and backprop through extra large matmuls, negating the intended speed/memory win.
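A tiny illustration with assumed toy factors: detaching the precomputed product severs the graph back to the factors, while keeping it in-graph preserves their gradients at the cost of rebuilding the product every step.

```python
# Gradient flow through an absorbed product vs. a detached (cached) one.
import torch

torch.manual_seed(0)
W_UQ = torch.randn(8, 6, requires_grad=True)   # toy stand-ins for the per-head factors
W_UK = torch.randn(8, 5, requires_grad=True)
c_q, c_kv = torch.randn(6), torch.randn(5)

# In-graph: the chain rule reaches both factors through the product.
M = W_UK.T @ W_UQ                               # rebuilt every forward pass
loss = ((M @ c_q) @ c_kv) ** 2
loss.backward()
print(W_UQ.grad is not None, W_UK.grad is not None)   # True True

# Cached as a constant: the detach severs any path back to the factors.
M_cached = (W_UK.T @ W_UQ).detach()
loss2 = ((M_cached @ c_q) @ c_kv) ** 2
print(loss2.requires_grad)                            # False -> no gradient can flow
```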