https://lmsys.org/blog/2023-11-21-lookahead-decoding/

  • Lookahead decoding is a new approach to speculative decoding that doesn’t require a draft model. Instead, the model itself is used in two branches:
    • a lookahead branch, which predicts and extends candidate N-grams (short sequences of N tokens)
      • the lookahead branch is similar to the draft model in regular speculative decoding
    • a verification branch, which verifies the candidates
      • the verification branch plays the same role as the oracle (target) model in regular speculative decoding
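
The interplay of the two branches can be sketched as a single toy decoding step. This is a minimal sketch under stated assumptions, not the implementation from the post. The names `next_token`, `ngram_pool`, `verify_candidates`, and `decode_step` are hypothetical. The model is stood in for by a deterministic greedy next-token function. Candidates are picked by matching their first token against the last context token, which is an assumption here. Verification uses sequential calls for readability, where a real system would check all candidates in one forward pass.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Token = int
# Hypothetical stand-in for the LLM: greedy next token given a context.
NextToken = Callable[[Sequence[Token]], Token]


def verify_candidates(
    context: Sequence[Token],
    candidates: List[Sequence[Token]],
    next_token: NextToken,
) -> List[Token]:
    """Verification branch (toy): longest candidate prefix the model agrees with."""
    best: List[Token] = []
    for cand in candidates:
        accepted: List[Token] = []
        for tok in cand:
            # Accept a token only if greedy decoding would have produced it.
            if next_token(list(context) + accepted) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best


def decode_step(
    context: Sequence[Token],
    ngram_pool: Dict[Token, List[Tuple[Token, ...]]],
    next_token: NextToken,
) -> List[Token]:
    """One step: look up candidate n-grams, verify them, emit >= 1 token."""
    # Assumption: the pool (filled by the lookahead branch) is keyed by the
    # first token of each n-gram; the rest of the n-gram is the guess.
    candidates = [ngram[1:] for ngram in ngram_pool.get(context[-1], [])]
    accepted = verify_candidates(context, candidates, next_token)
    # The model's own next token is always valid, so progress is guaranteed
    # even when no candidate is accepted.
    bonus = next_token(list(context) + accepted)
    return accepted + [bonus]


def toy_model(ctx: Sequence[Token]) -> Token:
    """Toy 'model' that counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


pool = {3: [(3, 4, 5, 9), (3, 4, 5, 6)]}
print(decode_step([1, 2, 3], pool, toy_model))  # [4, 5, 6, 7]
print(decode_step([1, 2, 3], {}, toy_model))    # [4]
```

With a good candidate in the pool, one step emits four tokens instead of one. With an empty pool, the step falls back to ordinary one-token-at-a-time decoding, so the output is the same as plain greedy decoding in both cases.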

Limitations of speculative decoding

  • The maximum speedup that speculative-decoding-based methods can achieve is limited by the token acceptance rate, that is, by how accurately the draft model predicts the main model’s outputs.
  • Creating an accurate draft model is non-trivial: it often requires extra training, plus careful tuning to stay accurate as request traffic changes over time.
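
To make the first limitation concrete, here is a small sketch of how the acceptance rate caps the gain. It uses a common simplified model that is not stated in these notes: each drafted token is accepted independently with probability `alpha`, the draft proposes `gamma` tokens per step, and the cost of running the draft model is ignored. Under those assumptions, the expected number of tokens produced per main-model pass is (1 - alpha^(gamma+1)) / (1 - alpha). The function name and parameters are illustrative.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per main-model forward pass.

    Assumes i.i.d. per-token acceptance with probability `alpha` and
    `gamma` drafted tokens per step (a simplified model, see lead-in).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the main model.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


for alpha in (0.0, 0.5, 0.8, 1.0):
    print(alpha, round(expected_tokens_per_step(alpha, gamma=4), 4))
```

Under this model, an acceptance rate of 0.5 with four drafted tokens gives fewer than two tokens per pass, while 0.8 gives about 3.4. That gap is why the accuracy of the draft model matters so much.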