DISCLAIMER
Major shoutout to xjdr, doomslide, kalomaze, and Minh Nhat Nguyen. Pretty much everything I’m about to share with you comes from them. Consider me the messenger, passing along their wisdom with my own spin.
ENTROPIX
 xjdr has graced us with an open-source implementation of the sampler called entropix. Please check out the code for the latest developments & details, e.g. dynamic thresholds.
Defining entropy

Entropy = $H(X) = E_x[-\log p(x)] = -\sum_{x \in X} p(x)\log p(x)$

varentropy = $\operatorname{var}(H(X)) = E[(-\log p(x))^2] - H(X)^2$

A uniform distribution $U$ will have high entropy but low varentropy
 $H(U) = \log(|X|)$
 $\operatorname{var}(H(U)) = \log^2(|X|) - \log^2(|X|) = 0$

A perfectly peaked distribution, i.e. $p(x) = \mathbf{1}\{x = x_0\}$, will have low entropy and low varentropy

An almost peaked distribution, i.e. $p(x) = 0.99\,\mathbf{1}\{x = x_0\} + 0.01\,\mathbf{1}\{x = x_1\}$, will have low entropy but high varentropy

Consider a distribution over 10 outcomes: $p(x_1) = p(x_2) = \dots = p(x_8) = 0.12$, $p(x_9) = 0.03$, $p(x_{10}) = 0.01$. This distribution has relatively high entropy because it’s close to uniform for most outcomes. It also has high varentropy because of the presence of the two less likely outcomes.
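As a sanity check, the definitions and the examples above can be computed directly (a quick NumPy sketch; the 10-outcome distribution is the one from the previous paragraph):

```python
import numpy as np

def entropy(p):
    # H(X) = -sum_x p(x) log p(x)
    p = p[p > 0]  # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p))

def varentropy(p):
    # var(H(X)) = E[(-log p(x))^2] - H(X)^2
    p = p[p > 0]
    h = -np.sum(p * np.log(p))
    return np.sum(p * np.log(p) ** 2) - h ** 2

uniform = np.full(10, 0.1)
peaked  = np.array([1.0])
almost  = np.array([0.99, 0.01])
mixed   = np.array([0.12] * 8 + [0.03, 0.01])

print(entropy(uniform), varentropy(uniform))  # ln(10) ~ 2.303, 0.0
print(entropy(peaked), varentropy(peaked))    # 0.0, 0.0
print(entropy(almost), varentropy(almost))    # ~0.056, ~0.209 (varent > ent)
print(entropy(mixed), varentropy(mixed))
```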

Caveat: varentropy could be defined differently to better work as a heuristic in the framework below
 e.g. varentropy(logits step0 … step1) vs varentropy(logits step)
Entropy computation
 As seen above, entropy is defined as
-sum(probs * logprobs)
 However, given we’re computing in float precision, it’s good to be careful.
 Below, we define a logit noise threshold under which a prob/logprob should be considered “negligible” and not be taken into account in the computation
Logit noise threshold

The relative error of exponentiating a bfloat16 is
u = 2^{-7}
(7 mantissa bits) 
The relative total error of summing vocab_size such terms is
sqrt(vocab_size) * u
(errors are uncorrelated) 
Hence any prob below
sqrt(vocab_size) * u / vocab_size = u / sqrt(vocab_size)
should be considered negligible 
Hence any logprob below
ln(u) - ln(vocab_size)/2
should be considered -inf

Hence the maximal precision of the logits of an LLM is:
ln(min_p) = ln(u) - ln(vocab_size)/2

Example:
 For vocab_size = 128256, ln(min_p) = -10.7329, i.e. min_p ~ 2e-5
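Putting the derivation together as a quick check (a Python sketch; u = 2^(-7) assumes bfloat16 logits as above):

```python
import math
import numpy as np

VOCAB_SIZE = 128256
U = 2.0 ** -7  # relative error of exponentiating a bfloat16 (7 mantissa bits)

# noise floor on logprobs: ln(u) - ln(vocab_size) / 2
LN_MIN_P = math.log(U) - math.log(VOCAB_SIZE) / 2
print(round(LN_MIN_P, 4))  # -10.7329

def safe_entropy(logprobs):
    # drop terms below the noise floor instead of letting float
    # noise pollute the sum
    kept = logprobs[logprobs > LN_MIN_P]
    return -np.sum(np.exp(kept) * kept)
```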
Final form/framework

Objective is for the sampling to stay within acceptable entropy and varentropy bounds

Why?
Using entropy as a measure of confidence or confusion

Because entropy and varentropy are good indicators of degenerate logits distribution
 e.g. low entropy is synonymous with a peaked distribution, which can occur in good and bad scenarios such as (1) “doom looping”, the model repeats its previous output with high certainty (2) obvious next token learned from text distribution e.g. “The brown” → “fox” (3) simply the “correct token” e.g. in a Q/A format
 Not all scenarios are bad, but the assumption is eliciting more diversity from your model by entropy sampling should not decrease quality in good scenarios and should increase quality in bad scenarios

Additionally, when doing tree search, varentropy is surprisingly effective/correlated with model confusion/ungrounding ⇒ good heuristic to know when the beam search should stop/continue

Also, if you can drive the varentropy low enough, the model will usually spit out an EOT token and let you know it's done.
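A per-branch stopping check along these lines might look like this (a rough sketch; the varentropy threshold and EOT id here are made-up placeholders, not entropix's actual values):

```python
import numpy as np

def should_stop_branch(logits, varent_hi=5.0, eot_id=128001):
    # stop expanding a beam when the model looks ungrounded (high
    # varentropy) or when its greedy pick is the end-of-text token
    p = np.exp(logits - logits.max())
    p /= p.sum()
    logp = np.log(np.where(p > 0, p, 1.0))
    h = -np.sum(p * logp)
    varent = np.sum(p * logp ** 2) - h ** 2
    return bool(varent > varent_hi or int(p.argmax()) == eot_id)
```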
Summary

Let ent_1 and ent_2 be the entropy lower and upper bounds
Let varent_1 and varent_2 be the varentropy lower and upper bounds
Let ent and varent be the current entropy and varentropy of my logits distribution at time step t. We have generated x_1, ..., x_(t-1)

If ent < ent_1 & varent < varent_1 (low entropy, low varentropy), the model is confident
 we do normal greedy decoding/argmax

If ent < ent_2 & varent_2 < varent (low entropy, high varentropy), we use branching (described below) to inject entropy.

If ent_2 < ent & varent_2 < varent (high entropy, high varentropy), we use backspace, i.e. we resample the previous token

If ent_1 < ent & varent < varent_2 (high entropy, low varentropy), we force CoT by injecting a “Wait…” token to tell the model to reevaluate (o1 style)
 important to add a cooldown, otherwise the sampling may inject the CoT token over and over
 if we’re still in cooldown, sample with a higher temperature (the temperature can be scaled based on attention entropy)
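The four cases can be sketched as a simple dispatch function (illustrative only — the bound values below are placeholders; entropix derives its thresholds dynamically):

```python
# placeholder bounds; entropix computes dynamic thresholds
ENT_LO, ENT_HI = 0.1, 5.0
VARENT_LO, VARENT_HI = 0.1, 5.0

def choose_strategy(ent, varent, cot_cooldown=0):
    """Map the (entropy, varentropy) quadrant to a sampling strategy."""
    if ent < ENT_LO and varent < VARENT_LO:
        return "argmax"        # confident: greedy decode
    if ent < ENT_HI and varent > VARENT_HI:
        return "branch"        # low ent, high varent: inject entropy
    if ent > ENT_HI and varent > VARENT_HI:
        return "backspace"     # confused: resample the previous token
    if ent > ENT_LO and varent < VARENT_HI:
        if cot_cooldown > 0:
            return "high_temp" # still in cooldown: raise temperature
        return "inject_cot"    # force a "Wait..." reevaluation token
    return "sample"            # fallback: plain sampling
```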

Visual summary (source)
Branching

If entropy is low but varentropy is high, it means the model could be in a “local optimum”, e.g. a doom loop

We use branching to “inject entropy” i.e.
 adjust temperature
 noise the logits distribution
 do beam search to get a “good token”
 stopping criteria are end of text, max search length, or being back within entropy bounds

Top-k beam search can kind of suck because not every uncertainty situation should have the same truncation criteria.
 Adaptive beam search using either top-p or min-p can make more sense for selecting how many candidates to keep
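A min-p-style candidate selection might look like this (a sketch; `min_p_ratio` is an illustrative knob, not an entropix parameter name):

```python
import numpy as np

def minp_candidates(logits, min_p_ratio=0.05):
    # adaptive truncation: keep every token whose probability is at
    # least min_p_ratio * the top probability, so the branching factor
    # widens only when the distribution is genuinely flat
    p = np.exp(logits - logits.max())
    p /= p.sum()
    keep = np.flatnonzero(p >= min_p_ratio * p.max())
    return keep[np.argsort(p[keep])[::-1]]  # most likely first
```

A peaked distribution yields one or two candidates while a flat one yields many, so the beam width adapts to the model's uncertainty instead of using a fixed k.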