- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Nice blog on LLM inference optimizations: https://vgel.me/posts/faster-inference/
- Cut LLM costs by mixing GPU types
- Tim Dettmers on quantization: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
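The post is about LLM.int8(); as a reminder of the basic building block, here is a minimal row-wise absmax int8 quantization sketch (the full method additionally keeps outlier feature dimensions in fp16):

```python
# Row-wise absmax int8 quantization: the vector-wise scheme LLM.int8() builds on.
# The full method also uses mixed-precision decomposition for outlier features;
# this is only the basic building block.
import torch

def absmax_quantize(w: torch.Tensor):
    # Per-row scale so the largest-magnitude entry maps to 127.
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def absmax_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4, 8)
q, scale = absmax_quantize(w)
print((w - absmax_dequantize(q, scale)).abs().max())  # max quantization error
```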
- SliceGPT (weight matrix compression for LLMs): https://huggingface.co/papers/2401.15024
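My rough reading of the SliceGPT idea, sketched below with made-up names: build an orthogonal basis from PCA of a layer's input activations, rotate the weights into it, and drop the least-important directions so the matrix actually shrinks.

```python
# Rough sketch of my reading of SliceGPT (names are made up, not the paper's code).
import torch

def slice_linear(weight: torch.Tensor, acts: torch.Tensor, keep: int):
    # weight: (out_dim, in_dim); acts: (num_samples, in_dim) calibration activations
    cov = acts.T @ acts / acts.shape[0]
    _, eigvecs = torch.linalg.eigh(cov)        # eigenvalues in ascending order
    q = eigvecs[:, -keep:]                     # top-`keep` principal directions
    return weight @ q, q                       # sliced weight (out_dim, keep) + rotation

w = torch.randn(16, 32)
x = torch.randn(100, 32)
w_sliced, q = slice_linear(w, x, keep=24)
# At inference the layer computes (x @ q) @ w_sliced.T instead of x @ w.T.
print(w_sliced.shape)  # torch.Size([16, 24])
```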
- vLLM FP8 support: https://x.com/anyscalecompute/status/1811059148911693906?s=46
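If I'm reading the announcement and the vLLM docs right, FP8 can be enabled as online (dynamic) W8A8 quantization of a bf16/fp16 checkpoint via the quantization argument; a hedged usage sketch (model name is only an example):

```python
# Hedged sketch: per the vLLM docs around this announcement, quantization="fp8"
# enables online FP8 quantization on supported GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```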
- vLLM office hours / videos: https://neuralmagic.com/community-office-hours/
- SGLang (used by the xAI team for Grok-mini): https://github.com/sgl-project/sglang
- You can build a custom TorchDynamo backend for highly efficient inference; see the sketch below.
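A minimal version of that, following the custom-backend pattern in the PyTorch docs: Dynamo hands your backend an FX GraphModule plus example inputs, and you return a callable. This one only prints the captured graph and runs it unchanged; a real inference backend would lower the graph to something faster.

```python
from typing import List
import torch

def my_backend(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    gm.graph.print_tabular()   # inspect what Dynamo captured
    return gm.forward          # callable that executes the graph as-is

@torch.compile(backend=my_backend)
def f(x):
    return torch.relu(x) + 1.0

print(f(torch.randn(4)))
```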
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
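One of the canonical inference-time strategies that survey covers is best-of-N sampling; a toy sketch, where generate and score are hypothetical stand-ins for a base-model call and a reward/verifier model:

```python
# Toy best-of-N sampling: draw N candidates and keep the one a scorer likes best.
import random
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# Dummy generator/scorer just to show the control flow:
print(best_of_n("2+2=", lambda p: str(random.randint(0, 9)),
                lambda p, c: -abs(int(c) - 4)))
```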
- Seems potentially useful for compressing the KV cache or developing alternative methods: https://arxiv.org/abs/2404.15574
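For context (a generic illustration, not necessarily the linked paper's method): the simplest KV-cache compression baseline keeps a few initial "sink" tokens plus a recent window and drops everything in between.

```python
import torch

def compress_kv(k: torch.Tensor, v: torch.Tensor, sink: int = 4, window: int = 512):
    # k, v: (num_heads, seq_len, head_dim)
    seq_len = k.shape[1]
    if seq_len <= sink + window:
        return k, v
    keep = torch.cat([torch.arange(sink), torch.arange(seq_len - window, seq_len)])
    return k[:, keep], v[:, keep]

k = torch.randn(8, 2048, 64)
v = torch.randn(8, 2048, 64)
k2, v2 = compress_kv(k, v)
print(k2.shape)  # torch.Size([8, 516, 64])
```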