• Related to Self-Attention and vLLM
  • Definition: Saving the and matrices rows corresponding to past tokens (for each attention layer) from the last inference step in auto-regressive sampling, at each forward pass, we only need to compute the new rows in corresponding to the new token