KV-cache management

  • Each KV-cache block holds a fixed number of tokens
  • Each block has:
    • block_id: a unique identifier
    • ref_cnt: the number of active requests currently relying on this KV-cache block
    • block_hash: a hash value used for prefix caching; it combines the previous block’s hash, the current block’s tokens, and optional metadata (a minimal sketch of this structure follows)
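A minimal sketch of the per-block bookkeeping described above, assuming a plain dataclass; the field names mirror the text, not necessarily vLLM's exact class definition:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KVCacheBlock:
    """Simplified illustration of a KV-cache block's metadata."""
    block_id: int                      # unique identifier of the block
    ref_cnt: int = 0                   # active requests relying on this block
    block_hash: Optional[int] = None   # chained hash for prefix caching (unset until the block is full)
```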

Prefix caching

Prefix caching avoids recomputing KV values for tokens that multiple prompts share at the beginning - hence “prefix”.

During the first generate call, in the scheduling stage, inside kv_cache_manager.get_computed_blocks, the engine invokes hash_request_tokens:

  1. This function splits the long_prefix + prompt tokens into chunks of KV-block-size tokens.

  2. For each complete chunk, it computes a hash, using either Python’s built-in hash or SHA-256 (slower, but with fewer collisions). The hash combines the previous block’s hash, the current chunk’s tokens, and optional metadata.

  3. Each result is stored as a BlockHash object containing both the hash and the chunk’s token IDs; the function returns the list of these block hashes.

  4. Next, the engine calls find_longest_cache_hit to check whether any of these hashes already exist in cached_block_hash_to_block (a simplified sketch of steps 1–4 follows this list).
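
The sketch below illustrates the chained hashing and the longest-prefix lookup. The `_sketch` helper names, the `BLOCK_SIZE` constant, and the SHA-256-only hashing are simplifications for illustration, not vLLM's actual implementation:

```python
import hashlib
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

@dataclass(frozen=True)
class BlockHash:
    hash_value: bytes
    token_ids: tuple[int, ...]

def hash_request_tokens_sketch(token_ids: list[int]) -> list[BlockHash]:
    """Chain-hash the complete BLOCK_SIZE chunks, as in steps 1-3."""
    block_hashes: list[BlockHash] = []
    prev_hash = b""  # the first block has no parent hash
    num_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, num_full, BLOCK_SIZE):
        chunk = tuple(token_ids[start:start + BLOCK_SIZE])
        # Hash depends on the previous block's hash and the current tokens,
        # so identical chunks at different positions get different hashes.
        h = hashlib.sha256(prev_hash + repr(chunk).encode()).digest()
        block_hashes.append(BlockHash(h, chunk))
        prev_hash = h
    return block_hashes

def find_longest_cache_hit_sketch(block_hashes, cached_block_hash_to_block):
    """Walk the hashes in order and stop at the first miss (step 4)."""
    hit_blocks = []
    for bh in block_hashes:
        block = cached_block_hash_to_block.get(bh.hash_value)
        if block is None:
            break  # prefix diverges here; everything after must be recomputed
        hit_blocks.append(block)
    return hit_blocks
```

Because each hash folds in the previous block’s hash, a cache hit on block N implies hits on all blocks before it, which is why the lookup can stop at the first miss.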