🤖 Harold's Notes

KV caching

Jul 03, 2024

Related to Self-Attention and vLLM
Definition: Caching the rows of the $K$ and $V$ matrices that correspond to past tokens (separately for each attention layer) from previous inference steps ⇒ in auto-regressive sampling, each forward pass only needs to compute the new rows of $Q$, $K$, and $V$ for the newest token. The new $K$ and $V$ rows are appended to the cache, and the new $Q$ row attends over all cached rows.
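A minimal sketch of the idea for a single attention head in NumPy. It compares incremental decoding with a cache against recomputing full causal attention. The names `KVCache`, `step`, and `full_causal_attention`, the toy sizes, and the random weights are all assumptions made for illustration, not taken from vLLM or any other library.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(X, Wq, Wk, Wv):
    # Reference: recompute Q, K, V for every token and apply a causal mask.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

class KVCache:
    """Hypothetical single-head, single-layer cache (illustrative only)."""

    def __init__(self, d):
        self.K = np.zeros((0, d))  # one row per past token
        self.V = np.zeros((0, d))

    def step(self, x, Wq, Wk, Wv):
        # Compute only the new rows of Q, K, V for the newest token x.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        # Append the new K and V rows to the cache.
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        # The new query attends over every cached key/value row.
        scores = self.K @ q / np.sqrt(q.shape[-1])
        return softmax(scores) @ self.V

rng = np.random.default_rng(0)
T, d = 5, 4  # toy sequence length and head dimension (assumed)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

reference = full_causal_attention(X, Wq, Wk, Wv)
cache = KVCache(d)
incremental = np.stack([cache.step(x, Wq, Wk, Wv) for x in X])
outputs_match = np.allclose(incremental, reference)
```

Each call to `step` does work proportional to the current sequence length, while recomputing attention from scratch at every step would redo the $K$ and $V$ projections for all past tokens. A real model keeps one such cache per attention layer (and per head).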

