KV block

  • A fixed-size contiguous chunk of memory that stores KV cache entries from left to right
  • PagedAttention treats KV blocks as if they were logically contiguous
    • A block table maps each logical block to the physical KV block that backs it
  • Dynamic block mapping enables KV cache sharing between requests (see the sketch after this list)
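
A minimal sketch of the block-table idea, assuming a toy `BlockTable` class; all names here are illustrative, not vLLM's actual internals:

```python
class BlockTable:
    def __init__(self):
        # Logical block index -> physical block index.
        self.mapping: list[int] = []

    def append(self, physical_block: int) -> None:
        self.mapping.append(physical_block)

    def physical(self, logical_block: int) -> int:
        return self.mapping[logical_block]


# Two requests sharing the same prompt prefix can point their first
# logical block at the same physical block; a real engine would use
# reference counting to decide when a shared block can be freed.
table_a, table_b = BlockTable(), BlockTable()
shared_prefix_block = 7   # hypothetical physical block holding the shared prefix
table_a.append(shared_prefix_block)
table_b.append(shared_prefix_block)
table_a.append(3)         # request A continues in its own physical block
table_b.append(9)         # request B continues in a different physical block
assert table_a.physical(0) == table_b.physical(0)  # prefix block is shared
```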

Paged Attention

  • PagedAttention is an attention algorithm that operates on KV cache stored in non-contiguous, paged memory; it is inspired by virtual memory and paging in operating systems.

  • Unlike traditional attention algorithms, PagedAttention allows storing logically contiguous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into KV blocks. Each block contains the key and value vectors for a fixed number of tokens, denoted the KV block size (B).

  • Example: partitioning a sequence's KV cache into blocks of size B (see the sketch below)
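
A small sketch of the partitioning step, assuming toy tensor shapes and NumPy arrays in place of real GPU tensors (the shapes and B = 4 are made-up demo values):

```python
import numpy as np

B = 4  # KV block size: tokens per block (small value chosen for the demo)
num_tokens, num_heads, head_dim = 10, 2, 8

# Per-token key and value vectors for one sequence.
keys = np.random.randn(num_tokens, num_heads, head_dim).astype(np.float32)
values = np.random.randn(num_tokens, num_heads, head_dim).astype(np.float32)

def partition_into_blocks(x: np.ndarray, block_size: int) -> list:
    """Split the token axis into fixed-size KV blocks; the last block
    may hold fewer than block_size tokens."""
    return [x[i:i + block_size] for i in range(0, len(x), block_size)]

key_blocks = partition_into_blocks(keys, B)
value_blocks = partition_into_blocks(values, B)

# 10 tokens with B = 4 -> blocks of 4, 4, and 2 tokens.
print([blk.shape[0] for blk in key_blocks])  # [4, 4, 2]
```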

vLLM

  • Efficient management of KV cache is crucial for high-throughput LLM serving
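
One reason this management matters: the pool of physical blocks is finite, so a serving engine must allocate blocks as requests grow and reclaim them as requests finish. A toy free-list allocator sketch, illustrative only and not vLLM's actual allocator:

```python
class BlockAllocator:
    def __init__(self, num_blocks: int):
        # Free list over a fixed pool of physical block indices.
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            # A real engine would preempt or swap requests here
            # instead of failing outright.
            raise MemoryError("out of KV cache blocks")
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)


allocator = BlockAllocator(num_blocks=4)
request_blocks = [allocator.allocate() for _ in range(3)]  # request grows
for blk in request_blocks:                                 # request finishes
    allocator.free(blk)
print(len(allocator.free_blocks))  # 4: all blocks reclaimed
```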