
proved upon it by noting that sometimes more or fewer than $k$ words should be in the allowed set, instead allowing the minimal set of words to keep $p$ percent of the probability. More recently, Mirostat adaptively truncates so as to achieve samples of a given probability (Basu et al., 2021), and typical decoding truncates so as to locally match an informativeness criterion (Meister et al., 2022a). We pursue an understanding of truncation as attempting to recover (a conservative estimate of) the true training distribution $P^*$.
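A minimal illustrative sketch of nucleus (top-$p$) truncation, which keeps the smallest set of words whose cumulative probability reaches $p$ (the toy distribution, numbers, and function name below are arbitrary, not taken from the cited work):

```python
import numpy as np

def top_p_allowed_set(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Return the indices of the smallest set of words whose total
    probability is at least p (nucleus / top-p truncation)."""
    order = np.argsort(probs)[::-1]              # words sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix reaching mass p
    return order[:cutoff]

# Toy next-word distribution over a 5-word vocabulary.
probs = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
print(top_p_allowed_set(probs, p=0.9))           # [0 1 2]: the minimal set covering >= 0.9
```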
3 Truncation as Desmoothing
3.1 KL-divergence and mode covering
Language models are trained to minimize the KL-divergence to an empirical approximation of the true distribution $P^*(X)$. Recall that the KL-divergence of a model's conditional distribution $P_\theta(X \mid x_{<i})$ from the true conditional distribution $P^*(X \mid x_{<i})$ is

$$\sum_{x \in \mathcal{V}} P^*(x \mid x_{<i}) \log \frac{P^*(x \mid x_{<i})}{P_\theta(x \mid x_{<i})} \qquad (2)$$
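As a small numerical check of Equation (2), the sketch below (an illustrative toy example; the distributions are arbitrary) evaluates this conditional KL-divergence over a four-word vocabulary:

```python
import numpy as np

def conditional_kl(p_true: np.ndarray, p_model: np.ndarray) -> float:
    """Equation (2): sum over x of P*(x | x_<i) * log( P*(x | x_<i) / P_theta(x | x_<i) ).
    Words with P*(x | x_<i) = 0 contribute nothing to the sum."""
    mask = p_true > 0
    return float(np.sum(p_true[mask] * np.log(p_true[mask] / p_model[mask])))

# Toy conditional distributions over a 4-word vocabulary.
p_true  = np.array([0.5, 0.5, 0.0, 0.0])  # P*(. | x_<i): only two words are possible
p_model = np.array([0.4, 0.4, 0.1, 0.1])  # P_theta(. | x_<i): hedges on the other two
print(conditional_kl(p_true, p_model))    # ~0.223 nats
```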
KL-divergence is known to be mode-covering; it heavily penalizes errors of coverage. When training from samples, an observed word $x_i$ in context $x_{<i}$ causes the model to incur a loss of $-\log P_\theta(x_i \mid x_{<i})$, which approaches infinity as the model probability approaches 0.² Neural LMs use shared representations to generalize beyond the training data, e.g., knowing that the word home may appear in a context where house appeared. However, to achieve low held-out KL-divergence, it must also be the case that (1) the LM determines where the zeros of the true distribution $P^*(X)$ are, which is difficult due to the complexity of language, or (2) the LM hedges against unexpected $x_i$ in any context $x_{<i}$ by placing some probability mass there. Intuitively, this hedging may be due to early stopping; instead of converging to the finite training set, language models are often trained for a single epoch, so each KL-minimizing gradient step is taken on new data, about which the model must hedge.
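An illustrative sketch of why hedging pays off under this objective (the vocabulary size and weights below are arbitrary): the per-token loss $-\log P_\theta(x_i \mid x_{<i})$ grows without bound as the model probability shrinks, whereas reserving even a small near-uniform floor of mass caps the worst case:

```python
import math

vocab_size = 50_000

# Per-token loss -log P_theta(x_i | x_<i) diverges as the probability approaches 0.
for prob in [0.1, 1e-4, 1e-9, 1e-20]:
    print(f"P_theta = {prob:.0e}  ->  loss = {-math.log(prob):5.1f} nats")

# Hedging: reserve a small amount of (near-)uniform mass, as formalized in Section 3.2.
lam = 0.95                       # weight kept on the model's confident predictions
floor = (1 - lam) / vocab_size   # minimum probability any word can receive
print(f"worst-case loss with hedging: {-math.log(floor):.1f} nats")
```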
3.2 A neural LM as a smoothed distribution
We present a framework for neural LMs wherein smoothing aids in KL-divergence minimization by placing a small amount of probability mass on all words. Consider a true conditional distribution $P^*(X_i \mid x_{<i})$ over $\mathcal{V}$. We think of the LM distribution $P_\theta(X_i \mid x_{<i})$ as the result of smoothing the true distribution with a distribution $Q(X_i \mid x_{<i})$ that is like the uniform distribution. Specifically, we pose that the neural LM is a linear interpolation:

$$P_\theta(X_i \mid x_{<i}) = \lambda_{x_{<i}}\, P^*(X_i \mid x_{<i}) + (1 - \lambda_{x_{<i}})\, Q(X_i \mid x_{<i}) \qquad (3)$$

where $\lambda_{x_{<i}} \in (0, 1]$ specifies the strength of the smoothing.
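A minimal sketch of Equation (3) with arbitrary toy numbers: the smoothed LM distribution is a mixture of the true conditional distribution and a near-uniform $Q$, so every word receives some probability mass:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 8

# True conditional distribution P*(. | x_<i): supported on only three words.
p_true = np.zeros(vocab_size)
p_true[[0, 1, 2]] = [0.6, 0.3, 0.1]

# Q(. | x_<i): a perturbed-uniform distribution, each entry near 1/|V|.
delta = 0.2
q = (1 + rng.uniform(-delta, delta, vocab_size)) / vocab_size
q = q / q.sum()

# Equation (3): the LM distribution as a linear interpolation.
lam = 0.95
p_model = lam * p_true + (1 - lam) * q

print(p_model.round(4))  # every word now has non-zero probability
print(p_model.sum())     # still a valid distribution (sums to 1, up to float error)
```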
We assume that each word probability under $Q$ is bounded in its deviation from the uniform distribution's probability. For all $x \in \mathcal{V}$, we assume $Q(x \mid x_{<i}) \in \left( \frac{1-\delta}{|\mathcal{V}|}, \frac{1+\delta}{|\mathcal{V}|} \right)$, where $\delta$ is a constant specifying non-uniformity. We assume constraints on $\lambda_{x_{<i}}$ that reflect how the amount of smoothing should be (1) small and (2) dependent on how well-estimated a given conditional distribution is. Specifically, we assume that $\lambda_{x_{<i}} \geq \max(\bar{\lambda}_{x_{<i}}, \bar{\lambda})$, where $\bar{\lambda}$ is a constant near 1 (e.g., $0.8$), independent of prefix. The exact form we use for the context-dependent $\bar{\lambda}_{x_{<i}}$ is $1 - \frac{|\mathcal{V}|\, \alpha \exp(-h_{x_{<i}})}{1 + \delta}$. As we will show later, this form implies that for a distribution of entropy $h$, words with probability $0$ under $P^*$ have probability bounded by $\alpha \exp(-h)$ under the language model.³ A simple intuition for high-entropy distributions having less smoothing is that, e.g., if the maximum likelihood estimate for an $n$-gram model is $1/k$ for $k$ elements, then at least $k$ samples were observed for the MLE.⁴
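This bound can also be checked numerically. The sketch below (with arbitrary choices of $\alpha$, $\delta$, and the support size) sets $\lambda_{x_{<i}}$ to its context-dependent lower bound and confirms that a word outside the support of $P^*$ receives at most $\alpha \exp(-h)$ probability under the smoothed model:

```python
import numpy as np

vocab_size = 10_000
alpha, delta = 1e-3, 0.2

# A true conditional distribution that is uniform over k supported words, so h = log(k).
k = 50
p_true = np.zeros(vocab_size)
p_true[:k] = 1.0 / k
h = -np.sum(p_true[:k] * np.log(p_true[:k]))

# Context-dependent lower bound on the smoothing weight (the form given above).
lam_bar = 1 - vocab_size * alpha * np.exp(-h) / (1 + delta)

# Worst case for a word with P*(x | x_<i) = 0: Q at its upper bound (1 + delta)/|V|.
q_max = (1 + delta) / vocab_size
p_unsupported = (1 - lam_bar) * q_max   # model probability of an unsupported word

print(lam_bar)                                      # ~0.83, within (0, 1]
print(p_unsupported <= alpha * np.exp(-h) + 1e-12)  # True: bounded by alpha * exp(-h)
```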
3.3 A local measure of truncation quality
Under the smoothing model, we can make precise the tradeoff between (1) truncating too little, allowing words that are poor continuations, and (2) truncating too much and losing the diversity of the true distribution. Let $S^*_{x_{<i}} = \{x \in \mathcal{V} \mid P^*(x \mid x_{<i}) > 0\}$ be the support of the true distribution (the set of words with non-zero probability) for the prefix $x_{<i}$.
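As a toy illustration of this tradeoff (separate from the measure this subsection develops; the numbers below are arbitrary), consider recovering $S^*_{x_{<i}}$ from a smoothed distribution by thresholding: too low a threshold admits words outside the support, while too high a threshold discards genuinely possible words:

```python
import numpy as np

vocab_size = 20

# True conditional distribution: support S* is {0, 1, 2, 3}, including one rare word.
p_true = np.zeros(vocab_size)
p_true[[0, 1, 2, 3]] = [0.55, 0.30, 0.13, 0.02]
support = set(np.flatnonzero(p_true))

# Smoothed LM distribution (Equation (3)) with a uniform Q for simplicity.
lam = 0.9
p_model = lam * p_true + (1 - lam) / vocab_size

def truncate(threshold: float) -> set:
    """Allowed set: words whose model probability exceeds the threshold."""
    return set(np.flatnonzero(p_model > threshold))

for threshold in [0.001, 0.01, 0.1]:
    kept = truncate(threshold)
    extra = len(kept - support)   # words outside S* that survive truncation
    lost = len(support - kept)    # words in S* that truncation removes
    print(f"threshold={threshold}: extra={extra}, lost={lost}")
```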
²Likewise during evaluation, the held-out perplexity $2^{-\mathbb{E}_{x_i, x_{<i}}[\log P_\theta(x_i \mid x_{<i})]}$ is infinite if zero mass is placed on an observed word.

³Note that $\exp(-h)$ is the per-word probability in a uniform distribution of entropy $h$. This entropy is that of $P^*(X_i \mid x_{<i})$.

⁴Even with this argument, the idea that high-entropy distributions are likely better estimated is probably the most tenuous assumption. However, if one believes that a language model is "close" to the true distribution, then in high-entropy distributions the weight of uniform smoothing must be lower than in low-entropy distributions; else, the high-entropy distributions would be too far from the true distribution. Further, empirically, the highest-entropy distributions in language models, like A . . . or The . . . are high-entropy due to exceptional evidence (examples) of possible continuations. Put another way, this suggests the entropy is from epistemic uncertainty (Osband et al., 2022).