KV-cache management
- Each KV-cache block is a fixed number of tokens
- It has
  - block_id, a unique identifier
  - ref_cnt, how many active requests are relying on this KV cache
  - block_hash, a hash value. The hash combines the previous block's hash, the current tokens, and optional metadata. It is used for prefix caching.
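To make the three fields concrete, here is a minimal sketch of such a block. The class name Block and the helpers acquire and release are hypothetical and only illustrate reference counting; this is not the engine's actual class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """Simplified stand-in for a KV-cache block (hypothetical, for illustration)."""
    block_id: int                       # unique identifier
    ref_cnt: int = 0                    # active requests relying on this block
    block_hash: Optional[bytes] = None  # None until a hash is assigned


def acquire(block: Block) -> None:
    """Record that one more request relies on this block."""
    block.ref_cnt += 1


def release(block: Block) -> bool:
    """Drop one reference; return True when no request uses the block any more."""
    assert block.ref_cnt > 0, "release called on an unreferenced block"
    block.ref_cnt -= 1
    return block.ref_cnt == 0
```

A block whose ref_cnt drops to zero is no longer needed by any running request, which is what makes it a candidate for reuse.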
Prefix caching
Prefix caching avoids recomputing tokens that multiple prompts share at the beginning - hence prefix.
During the first generate call, in the scheduling stage, inside kv_cache_manager.get_computed_blocks, the engine invokes hash_request_tokens:
- This function splits long_prefix + prompt into chunks of the KV block size.
- For each complete chunk, it computes a hash (using either the built-in hash or SHA-256, which is slower but has fewer collisions). The hash combines the previous block's hash, the current tokens, and optional metadata.
- Each result is stored as a BlockHash object containing both the hash and its token IDs. The function returns a list of block hashes.
- Next, the engine calls find_longest_cache_hit to check if any of these hashes already exist in cached_block_hash_to_block.
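The hashing and lookup steps can be sketched roughly as follows. This is a simplified re-implementation for illustration, not the engine's code: the function signatures, the BlockHash field names, the use of repr to serialize the hash input, and the choice to stop at the first missing hash are all assumptions.

```python
import hashlib
from typing import NamedTuple


class BlockHash(NamedTuple):
    hash_value: bytes  # chained hash of this block
    token_ids: tuple   # the tokens the block covers


def hash_request_tokens(token_ids, block_size, extra=None):
    """Hash every complete block-size chunk, chaining in the previous hash."""
    hashes = []
    prev = b""  # the first block has no predecessor
    # Only complete chunks are hashed; a trailing partial chunk is skipped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        chunk = tuple(token_ids[start:start + block_size])
        # Combine previous hash, current tokens, and optional metadata.
        payload = repr((prev, chunk, extra)).encode()
        prev = hashlib.sha256(payload).digest()
        hashes.append(BlockHash(prev, chunk))
    return hashes


def find_longest_cache_hit(block_hashes, cached_block_hash_to_block):
    """Return the cached blocks for the longest matching run of leading hashes."""
    hits = []
    for bh in block_hashes:
        block = cached_block_hash_to_block.get(bh.hash_value)
        if block is None:
            break  # assumed: a prefix match ends at the first miss
        hits.append(block)
    return hits
```

Because each hash includes the previous block's hash, two blocks with identical tokens only match when everything before them is identical too, so a hit on block n implies the whole prefix up to block n is shared.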