https://lmsys.org/blog/2023-11-21-lookahead-decoding/
- Lookahead decoding is a new approach to speculative decoding that doesn’t require a draft model. Instead, the model itself is used in two branches:
  - a lookahead branch, which predicts and extends candidate N-grams (short sequences of N tokens)
    - the lookahead branch plays a role similar to the draft model in regular speculative decoding
  - a verification branch, which verifies the candidates
    - the verification branch plays the same role as the oracle (main) model in regular speculative decoding
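
To make the verification branch concrete, here is a minimal sketch of how candidate N-grams could be checked against the model's own greedy output. Everything in it is an assumption for illustration: `toy_next_token` is a hypothetical stand-in for a real greedy LLM step, `verify_candidates` is a made-up helper name, and the candidates are hard-coded rather than produced by a lookahead branch. A real implementation would verify all candidates in one batched forward pass instead of calling the model token by token.

```python
from typing import Callable, List, Sequence

Token = int


def toy_next_token(context: Sequence[Token]) -> Token:
    """Hypothetical stand-in for a greedy LLM step: next token is (last + 1) mod 10."""
    return (context[-1] + 1) % 10


def verify_candidates(
    context: List[Token],
    candidates: List[List[Token]],
    next_token: Callable[[Sequence[Token]], Token],
) -> List[Token]:
    """Return the tokens accepted in one decoding step.

    For each candidate N-gram, keep the longest prefix that matches what
    greedy decoding would have produced anyway, then pick the longest such
    prefix across all candidates.
    """
    best: List[Token] = []
    for cand in candidates:
        accepted: List[Token] = []
        for tok in cand:
            # A candidate token is accepted only if the model itself
            # would have generated it at this position.
            if next_token(context + accepted) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    # If no candidate matched, fall back to one ordinary decoding step,
    # so the output is never worse than plain greedy decoding.
    if not best:
        best = [next_token(context)]
    return best


if __name__ == "__main__":
    context = [1, 2, 3]
    candidates = [[4, 5, 9], [4, 5, 6, 7], [8]]
    print(verify_candidates(context, candidates, toy_next_token))
```

Because a token is only accepted when it equals the model's own greedy choice, the output is identical to ordinary greedy decoding; the candidates only change how many tokens are emitted per step.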
Limitations of speculative decoding
- The maximum speedup that speculative-decoding-based methods can achieve is limited by the token acceptance rate, or equivalently, by how accurately the draft model can predict the main model’s outputs.
- Creating an accurate draft model is non-trivial, often requiring extra training and careful tuning in the face of traffic changes over time.
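
The acceptance-rate limit can be illustrated with a small calculation. The sketch below uses the standard analysis of speculative decoding rather than anything stated in the linked post: assuming each drafted token is accepted independently with probability `alpha`, and the draft model proposes `gamma` tokens per step, the expected number of tokens produced per main-model forward pass is (1 - alpha^(gamma+1)) / (1 - alpha). The function name and parameter names are illustrative.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per main-model forward pass.

    Assumes each of the `gamma` drafted tokens is accepted independently
    with probability `alpha` (a simplifying assumption), and that the main
    model always contributes one token of its own per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the main model's own token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.8, 0.95):
        print(alpha, expected_tokens_per_step(alpha, gamma=4))
```

Under this assumption, no draft length can push the expectation past 1 / (1 - alpha), which is why a low acceptance rate caps the achievable speedup regardless of how many tokens are drafted.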