Grammar-guided generation lets you constrain a model's output to follow a grammar, so the output is guaranteed to match a given format, such as JSON. At first, this seems unrelated to speed: reliability is nice, but speed too? That can't be possible! But it is, so let's dig into how it works to see why.

Imagine you’re generating JSON with an LLM, and the generation so far is:

{
    "key": 
 

GPT-4 could generate any of 100k+ tokens here, but only a few are actually valid: whitespace, an opening brace or bracket, a quote, a digit, null, and so on. During guided generation, the sampler samples only from those valid tokens and ignores all the others, even if an invalid token has higher probability.
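Concretely, that "only sample valid tokens" step is usually implemented by masking logits: every token the grammar forbids has its logit pushed to negative infinity before the softmax, so it can never be chosen. Here's a minimal PyTorch sketch, assuming the grammar engine has already handed us the list of currently valid token ids (computing that list is the grammar library's job):

```python
import torch

def constrained_sample(logits: torch.Tensor, valid_token_ids: list[int]) -> int:
    """Sample the next token, but only from the ids the grammar allows.

    `logits` is the model's raw output for one position, shape (vocab_size,).
    """
    mask = torch.full_like(logits, float("-inf"))
    mask[valid_token_ids] = 0.0               # allowed tokens keep their logits
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```

Even if a disallowed token was the model's top pick, its probability is now zero, so the output can't drift outside the grammar.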

  • Even better, with libraries like Outlines or jsonformer, you can give the guided generation sampler a whole schema, and it will sample within that schema! For example, if a key's value must be a digit, the sampler will only consider digits after that key name.
  • For much of the response, there is only one possible token to pick from, because the structural parts (braces, quotes, fixed key names) are fully determined by the schema. In that case, the sampler can just emit that token and bypass the model entirely, as sketched below!
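That fast path is where the speed comes from. Below is a rough sketch of one decode step under these assumptions; `grammar.valid_next_tokens` and `model_forward` are hypothetical stand-ins for whatever grammar engine and model you're using, not a real library API:

```python
def guided_decode_step(tokens: list[int], grammar) -> int:
    """One step of guided decoding with the single-valid-token fast path."""
    allowed = grammar.valid_next_tokens(tokens)   # hypothetical grammar-engine call
    if len(allowed) == 1:
        # The grammar fully determines the next token (e.g. the ':' after a key),
        # so emit it immediately and skip the expensive forward pass.
        return allowed[0]
    logits = model_forward(tokens)                # hypothetical: one LLM forward pass
    return constrained_sample(logits, allowed)    # mask-and-sample, as above
```

For a tightly specified schema, a large share of the output lands on that fast path, so you only pay for a forward pass on the tokens where the model actually has a choice.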