Resources
 Overview of samplers by the author of min-p: https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e
Basics
 Formally, given a sequence of tokens $x = (x_1, x_2, \dots, x_n)$ and a finite vocabulary $V$, where each token $x_i \in V$, an autoregressive language model defines the probability distribution $P(x_t \mid x_1, x_2, \dots, x_{t-1})$ for each token $x_t$ in the sequence.
How to get the next token
 $x_t = \arg\max_{v \in V} P(x_t = v \mid x_1, x_2, \dots, x_{t-1})$ (greedy decoding)
 $\hat{x}_t \sim P(\cdot \mid x_1, x_2, \dots, x_{t-1})$ (stochastic sampling)
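 The two decoding rules can be illustrated with a toy distribution; the vocabulary and logit values below are made up for illustration:

```python
import math
import random

# Hypothetical next-token logits over a tiny vocabulary (made-up values).
logits = {"cat": 2.0, "dog": 1.0, "car": -1.0}

# Softmax (shifted by the max logit for numerical stability).
m = max(logits.values())
exps = {v: math.exp(z - m) for v, z in logits.items()}
total = sum(exps.values())
probs = {v: e / total for v, e in exps.items()}

# Greedy decoding: always take the argmax token.
greedy = max(probs, key=probs.get)

# Stochastic sampling: draw a token in proportion to its probability.
sampled = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

 Greedy decoding always returns the same token for a given prefix; stochastic sampling can return any token with nonzero probability.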
Temperature sampling

Temperature sampling first divides the logits by a temperature parameter $\tau > 0$ before passing them through the softmax function to obtain a modified probability distribution $P_{\tau}(x_t = v \mid x_1, \dots, x_{t-1})$: $P_{\tau}(x_t = v \mid x_1, \dots, x_{t-1}) = \frac{\exp(z_v / \tau)}{\sum_{v' \in V} \exp(z_{v'} / \tau)}$ where $z_v$ is the logit (unnormalized log probability) of token $v$.

The temperature parameter $Ο$ controls the randomness of the sampling process.
 $\tau > 1$ results in a more uniform probability distribution, increasing the chances of sampling lower-probability tokens and generating more diverse output.
 $\tau < 1$ makes the distribution sharper, favoring higher-probability tokens and yielding more conservative, deterministic outputs.
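A minimal sketch of temperature scaling; the function name and logit values are illustrative:

```python
import math

def softmax_with_temperature(logits, tau):
    """Divide logits by tau (tau > 0) before applying the softmax."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
baseline = softmax_with_temperature(logits, 1.0)  # plain softmax
sharp = softmax_with_temperature(logits, 0.5)     # tau < 1: sharper, favors the top token
flat = softmax_with_temperature(logits, 2.0)      # tau > 1: flatter, more diverse
```

With these logits the top token's probability rises under $\tau = 0.5$ and falls under $\tau = 2.0$, matching the two bullets above.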
Top-p sampling (nucleus sampling)
 Given $P_{\tau}(x_t = v \mid x_1, x_2, \dots, x_{t-1})$ over the vocabulary $V$ at position $t$, top-p sampling first sorts the tokens in descending order of their probabilities.
 It then selects the smallest set of tokens whose cumulative probability exceeds a predefined threshold $p$, where $p \in (0,1]$.
 It finally samples from this set according to the renormalized probabilities.
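 The three steps above can be sketched as follows; the function name is ours and the input distribution is made up:

```python
def top_p_filter(probs, p):
    """Top-p (nucleus) truncation: sort tokens by descending probability,
    keep the smallest prefix whose cumulative probability reaches p,
    then renormalize over the kept tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# With p = 0.9 the first three tokens (cumulative mass 0.95) form the nucleus.
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], 0.9)
```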
Sampler Orders
 The order in which samplers are applied matters and can meaningfully change the output.
 For example, if temperature is applied before top-p, the temperature changes the probabilities that top-p judges, so top-p will truncate differently.
 If top-p is applied before temperature, truncation is based on the original probabilities, and temperature only reshapes the distribution over the tokens that top-p kept.
 It is usually assumed that temperature sampling is applied first.
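 A small sketch of how the order changes the truncation; the logits, $\tau$, and $p$ below are arbitrary illustrative values:

```python
import math

def softmax(zs):
    m = max(zs)  # shift for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_keep(probs, p):
    """Indices kept by top-p truncation (smallest set with cumulative probability >= p)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return set(kept)

logits = [2.0, 1.0, 0.5, -1.0]
p, tau = 0.8, 2.0

# Top-p first: truncation is judged on the original (tau = 1) probabilities.
kept_original = top_p_keep(softmax(logits), p)

# Temperature first: tau > 1 flattens the distribution, so top-p keeps more tokens.
kept_after_temp = top_p_keep(softmax([z / tau for z in logits]), p)
```

 Here top-p keeps two tokens on the raw distribution but three once the flattening temperature is applied first.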
Beam search
 Beam search is a heuristic search algorithm
 modification of breadth-first search (BFS) to reduce memory requirements
 How it proceeds
 Start with an initial state (e.g. partly completed generation)
 Generate the next top K candidates (e.g. for tokens, probabilities define the ranking) where K is the beam width
 From each of those K candidates, score all possible states (i.e. tokens)
 Keep only the top K candidates overall
 Start over until a stopping criterion is met (e.g. maximum length or end of text token)
 A beam search with K=1 is equivalent to greedy decoding
 How do you score candidates deeper in the search tree?
 Sum of log probabilities (may tend to favor shorter sequences)
 Average log prob
 Length-normalized score (e.g. sum of log probabilities divided by a power of the sequence length)
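 The procedure above, sketched against a toy next-token model (the model and its probabilities are invented; hypotheses are scored by the sum of log probabilities):

```python
import math
from heapq import nlargest

def next_probs(seq):
    """Toy next-token model: made-up probabilities, forces <eos> after 3 tokens."""
    if len(seq) >= 3:
        return {"<eos>": 1.0}
    return {"a": 0.5, "b": 0.3, "<eos>": 0.2}

def beam_search(k, max_len=5):
    beams = [((), 0.0)]  # (partial sequence, sum of log probabilities)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # Expand each beam with every possible next token.
            for tok, p in next_probs(seq).items():
                cand = (seq + (tok,), score + math.log(p))
                if tok == "<eos>":
                    finished.append(cand)
                else:
                    candidates.append(cand)
        if not candidates:
            break
        # Keep only the top-K partial hypotheses overall.
        beams = nlargest(k, candidates, key=lambda c: c[1])
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])

best_seq, best_score = beam_search(k=2)
```

 With this toy model the winner is the hypothesis that stops immediately at `<eos>`, which illustrates the length bias of the plain sum-of-log-probabilities score noted above.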
Min-p sampling
Motivation

Objectives for min-p are to:
 (1) match or outperform state-of-the-art top-p sampling across various reasoning and performance benchmarks at standard temperature settings between 0 and 1,
 (2) better handle the creativity-coherence trade-off at higher temperatures, and
 (3) provide a simple, effective sampling method that does not rely on additional techniques like repetition penalties to address the externalities of high-temperature sampling.

Top-p sampling does not work well at higher temperatures: tokens from the "unreliable tail" can still enter the sampling pool.
 When using top-p sampling, it is recommended to adjust either the value of $p$ or the temperature $\tau$, but not both simultaneously (OpenAI API, 2024), as they can have conflicting effects.
Definition
 The method uses a relative probability threshold $p_{\text{base}} \in (0,1]$ to scale the maximum token probability $p_{\max}$, which determines the absolute probability threshold $p_{\text{scaled}}$. A typical value is $p_{\text{base}} \approx 0.1$.
 Sampling is then performed on tokens with probability greater than or equal to $p_{scaled}$.
 Formally, given the maximum probability over the token distribution $p_{\max} = \max_{v \in V} P(v \mid x_1, \dots, x_{t-1})$, the absolute probability threshold $p_{\text{scaled}}$ is calculated as: $p_{\text{scaled}} = p_{\text{base}} \times p_{\max}$
 The sampling pool $V_{\min}$ is then defined as the set of tokens whose probability is greater than or equal to $p_{\text{scaled}}$: $V_{\min} = \{v \in V : P(v \mid x_1, x_2, \dots, x_{t-1}) \ge p_{\text{scaled}}\}$. Finally, the next token $\hat{x}_t$ is sampled from $V_{\min}$ according to the renormalized probabilities: $\hat{x}_t \sim \frac{P(v \mid x_1, \dots, x_{t-1})}{\sum_{v' \in V_{\min}} P(v' \mid x_1, \dots, x_{t-1})}$ for $v \in V_{\min}$.
 In contrast to top-p, where a lower $p$ is more selective, a higher $p_{\text{base}}$ in min-p results in more selective token choices and a bias towards more likely outputs.
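 The definition above in code; the function name is ours, and the two input distributions are made up to show the confident and flat cases:

```python
def min_p_filter(probs, p_base):
    """Min-p truncation: keep tokens with probability >= p_base * max(probs),
    then renormalize over the kept tokens."""
    p_scaled = p_base * max(probs)
    kept = {i: q for i, q in enumerate(probs) if q >= p_scaled}
    total = sum(kept.values())
    return {i: q / total for i, q in kept.items()}

# Confident distribution: the threshold (0.1 * 0.8 = 0.08) cuts off the tail.
confident = min_p_filter([0.8, 0.1, 0.05, 0.05], p_base=0.1)

# Flat distribution: the threshold (0.1 * 0.3 = 0.03) keeps every token.
flat = min_p_filter([0.3, 0.3, 0.2, 0.2], p_base=0.1)
```

 The same $p_{\text{base}}$ yields a strict filter when the model is confident and a permissive one when it is not, which is the adaptive behavior described under Effects.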
Effects
 High-confidence predictions: when the model assigns high probability to a particular next token, min-p filters out low-probability alternatives. This preserves coherence by avoiding sampling from the "unreliable tail" that could derail generation.
 Low-confidence predictions: when there is no clear front-runner, min-p relaxes its filter. This allows the model to sample from a more diverse set of plausible continuations, enabling creativity in open-ended settings.