
Discussion about looping in llama.cpp: https://github.com/ggerganov/llama.cpp/pull/5561

By @kalomaze, sampling from the original distribution avoids looping in almost all cases (1.0 Temperature with nothing else)
 but when you layer things like lower temperature + Top P + etc then that’s when the problem is usually introduced.
 However, you sort of need truncation (or just lower Temp by itself) to keep the model coherent.
 MinP tries to allow the model to stay “sane” at high temperature, while avoiding sampling from the tailend of the distribution
Repetition penalties

This is a technique that applies a small negative bias to all tokens that have appeared so far to avoid repetition. 1.0 implies no change to the scores, while 1.25 is considered somewhat extreme.

Three types of penalties
repeat_penalty
(multiplicative, independent of frequency)present_penalty
(additive, independent of frequency)freq_penalty
(multiplicative, scales with frequency) How they work:
 Create a frequency map to count occurrences of each token
if (candidates>data[i].logit <= 0) {
candidates>data[i].logit *= penalty_repeat;
} else {
candidates>data[i].logit /= penalty_repeat;
}
candidates>data[i].logit = float(count) * penalty_freq + float(count > 0) * penalty_present;
 Those samplers are rather blunt instruments that distort the grammar of standard language, which the model has been painstakingly trained to reproduce.
DRY

https://github.com/oobabooga/textgenerationwebui/pull/5677

DRY penalizes tokens that would extend the end of the input into a sequence that has previously occurred in the input.

Example:
 In this example,
violets
is penalized in the probability distribution generated by the model because the sequenceroses are red
has previously occurred in the input, and has been continued withviolets
in that previous case.
 In this example,

The penalty for a token is defined as
multiplier * base ^ (n  allowed_length)
 e.g.
multiplier=0.8
,base=1.75
andlength=2
 e.g.
 where
n
is the length of the sequence before that token that matches the end of the input, andmultiplier
,base
, andallowed_length
are configurable parameters. If the length of the matching sequence is less thanallowed_length
, no penalty is applied.

Thus the penalty grows exponentially as the repeated sequence gets longer.
 This will quickly overcome even the strongest tendency of the model to repeat itself. With the right parameter choice, looping is literally impossible with DRY

The DRY penalty should be applied before any truncation samplers.
Entropy
 One can also potentially detect when a model is looping by keeping track of the entropy
 If the entropy of the distribution is too low (according to a set threshold), one can noise the logits to get the model out of its “local optima”
 More details in Entropybased sampling