Truncation Sampling as Language Model Desmoothing
John Hewitt Christopher D. Manning Percy Liang
Department of Computer Science
Stanford University
{johnhew,manning,pliang}@cs.stanford.edu
Abstract
Long samples of text from neural language models can be of poor quality. Truncation sampling algorithms—like top-p or top-k—address this by setting some words' probabilities to zero at each step. This work provides framing for the aim of truncation, and an improved algorithm for that aim. We propose thinking of a neural language model as a mixture of a true distribution and a smoothing distribution that avoids infinite perplexity. In this light, truncation algorithms aim to perform desmoothing, estimating a subset of the support of the true distribution. Finding a good subset is crucial: we show that top-p unnecessarily truncates high-probability words, for example causing it to truncate all words but Trump for a document that starts with Donald. We introduce η-sampling, which truncates words below an entropy-dependent probability threshold. Compared to previous algorithms, η-sampling generates more plausible long English documents according to humans, is better at breaking out of repetition, and behaves more reasonably on a battery of test distributions.
1 Introduction
The complex, long-range dependencies of natural language make its generation an outstanding challenge. While there has been enormous progress on language modeling that has increased the coherence and length of generation (Brown et al., 2020; Chowdhery et al., 2022), sampling directly from a language model can still result in nonsensical output (Holtzman et al., 2020; Pillutla et al., 2021). The most effective heuristics for generating high quality, diverse samples fall under a category we term truncation sampling. These algorithms set some words' probabilities to zero when generating each word (Fan et al., 2018; Basu et al., 2021; Meister and Cotterell, 2021). Methods differ by their truncation criteria, ranging from simple (keep the k most likely) to complex, and all improve sample quality compared to direct sampling (Holtzman et al., 2020). We ask (1) what is the aim of truncation and (2) how can we improve it?

Figure 1: A neural LM as a mixture of the true distribution, and a uniform-like smoothing distribution. Truncation aims to approximate the true distribution support.
Our key insight is to write a neural language model's distribution as a mixture of the true distribution and a uniform-like smoothing distribution. This idealized assumption is motivated by KL-divergence: models incur large KL at test time when they place near-zero probability on an observed word (Kang and Hashimoto, 2020). Through this lens, the goal of truncation is to desmooth: to approximately recover the words on which the true distribution places some probability.
As a stark example of smoothing degrading sample quality, we show that a 5-gram language model smoothed with the uniform distribution generates nonsense as soon as a word is sampled from outside the support of the 5-gram model (Figure 2). Intuitively, sampling outside the 5-gram support causes future probabilities to be poorly estimated.
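To make this mechanism concrete, the sketch below interpolates a maximum-likelihood n-gram table with the uniform distribution, mirroring the smoothing setup at a miniature scale. The corpus, the trigram order (rather than the paper's 5-gram), the interpolation weight, and all names are illustrative assumptions rather than the paper's experimental setup; the point is only that once the uniform component produces a context never seen in training, the MLE term offers no guidance.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Toy corpus and trigram model (illustrative stand-in for the paper's 5-gram LM).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))

# Maximum-likelihood trigram counts: counts[(a, b)][c] = #(a b c).
counts = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    counts[(a, b)][c] += 1

def smoothed(context, lam=0.9):
    """P(x | context) = lam * MLE(x | context) + (1 - lam) * uniform(x)."""
    ctx_counts = counts[context]
    total = sum(ctx_counts.values())
    uniform = 1.0 / len(vocab)
    return {x: lam * (ctx_counts[x] / total if total else 0.0) + (1 - lam) * uniform
            for x in vocab}

def sample(dist):
    words, weights = zip(*dist.items())
    return random.choices(words, weights=weights, k=1)[0]

# Generate. Each step has probability (1 - lam) of drawing from the uniform
# component; once that yields a context pair never seen in training, the MLE
# term vanishes and the next word is drawn from the uniform component alone.
context, out = ("the", "cat"), ["the", "cat"]
for _ in range(15):
    nxt = sample(smoothed(context))
    out.append(nxt)
    context = (context[1], nxt)
print(" ".join(out))
```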
Figure 2: Portions of unconditional samples from an unsmoothed and uniform-smoothed 5-gram model; divergence due to leaving the support of the high-order distribution is in red.
Unsmoothed 5-gram: ". . . a quadcopter flight controller (RTFQ Flip MWC) that supports I2C sensors for adding thing like a barometer, magnetometer, and GPS system. The officially supported sensor block (BMP180, HMC5883L on one board) is discontinued, as far as I know, everyone involved lived to sing another day."
Smoothed 5-gram: ". . . disorder and an extreme state of dysmetabolism characterized by extensive erythema and a significant reduction in uncovered Hawkingû McK 400 ruled restrainedcombe-blow uncle cowork Carssoild Gareth focused <@ indecentlol by102 exchanged Volvo compositionsbackground prostate"

We derive principles of truncation from an explicit smoothing model that formalizes the intuition that (1) words with high probability should not be truncated, and (2) when all words in the distribution have low probability, only words with low probability relative to the rest should be truncated. We find that state-of-the-art truncation sampling algorithms like top-p break these principles. For example, in top-p truncation (e.g., p = 0.95), the most likely few words can take up p% of the distribution, causing the next-most likely word to be truncated even if it has high probability (e.g., 4%).
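The failure is easy to reproduce on a toy distribution. The sketch below applies the standard top-p rule (smallest set of most-probable words with total mass at least p) to a made-up conditional distribution whose most likely word already carries 96% of the mass; with p = 0.95, even a 3% continuation is truncated.

```python
def top_p_allowed(probs, p=0.95):
    """Smallest set of most-probable words whose cumulative probability is >= p."""
    allowed, cumulative = set(), 0.0
    for word in sorted(probs, key=probs.get, reverse=True):
        allowed.add(word)
        cumulative += probs[word]
        if cumulative >= p:
            break
    return allowed

# Made-up conditional distribution after the prefix "Donald" (illustrative
# numbers, not GPT-2's actual probabilities).
probs = {"Trump": 0.96, "Duck": 0.03, "Glover": 0.01}
print(top_p_allowed(probs))  # {'Trump'}: every other continuation is truncated
```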
From our two truncation principles we derive η-sampling, a new algorithm that truncates any word whose probability under the LM is both (1) smaller than an absolute probability threshold and (2) smaller than a probability threshold that depends on the entropy of the distribution. As we'll show, this ensures that, e.g., though GPT-2 large assigns probability 0.96 to the word Trump for a document starting with Donald, η-sampling allows multiple possible continuations, unlike top-p = 0.95.
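A minimal sketch of this rule is given below, on the same toy distribution as above. The criterion keeps a word unless its probability falls below both an absolute threshold ε and an entropy-dependent threshold; the specific entropy-dependent form sqrt(ε)·exp(−h), and the value of ε, are assumptions of this sketch chosen to mirror the α·exp(−h) bound of Section 3, not necessarily the paper's exact parameterization.

```python
import math

def eta_allowed(probs, epsilon=0.0003):
    """Keep a word unless its probability is below BOTH an absolute threshold
    (epsilon) and an entropy-dependent threshold; equivalently, keep words with
    probability at least eta = min(epsilon, sqrt(epsilon) * exp(-entropy))."""
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    eta = min(epsilon, math.sqrt(epsilon) * math.exp(-entropy))
    return {word for word, p in probs.items() if p >= eta}

# Same toy "Donald" distribution as above (illustrative numbers).
probs = {"Trump": 0.96, "Duck": 0.03, "Glover": 0.01}
print(eta_allowed(probs))  # all three continuations kept, unlike top-p = 0.95
```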
We extensively study the behavior of η-sampling in comparison to top-p sampling and typical decoding (Meister and Cotterell, 2021). Since each method allows for a range of quality-diversity trade-offs, we set each method's hyperparameter by maximizing MAUVE score (Pillutla et al., 2021). We find that η-sampling truncates more reasonably on a CheckList-style (Ribeiro et al., 2020) battery of distributions. Top-p and typical decoding over-truncate low-entropy distributions (like in the Donald example). Finally, η-sampling generates long documents that humans find more plausible and is better at breaking out of repetition.[1]

[1] Our code is available at https://github.com/john-hewitt/truncation-sampling.
2 Background
2.1 Language Models
Let random variable $X = (X_1, \ldots, X_T)$ denote a sequence of tokens, where each $X_i$ is in finite vocabulary $\mathcal{V}$. We'll use $x_{<i}$ to refer to a specific prefix, $x_i$ a specific word in context, and $x$ an arbitrary word in $\mathcal{V}$. An autoregressive language model (LM) is a distribution $P_\theta(X)$ indexed by parameters $\theta$ that is factorized as $P_\theta(x) = \prod_{i=1}^{T} P_\theta(x_i \mid x_{<i})$. We call $P_\theta(X_i \mid x_{<i})$ over $\mathcal{V}$ the conditional distribution of the LM given context $x_{<i}$. An LM is trained to minimize the KL-divergence between (an empirical estimate of) the true distribution $P(X)$ and $P_\theta(X)$. Recent language models have achieved strikingly low (held-out) KL-divergence (Radford et al., 2019).
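The factorization above is what makes both scoring and left-to-right generation possible. The sketch below scores a sequence with the chain rule, using a hand-written lookup table as a stand-in for a neural $P_\theta$; the table and its values are invented for illustration.

```python
import math

# Stand-in for the LM's conditionals P_theta(x_i | x_{<i}); a real LM computes
# these with a neural network over the whole vocabulary.
CONDITIONALS = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 0.9, "ran": 0.1},
}

def sequence_log_prob(tokens):
    """log P_theta(x) = sum_i log P_theta(x_i | x_{<i})."""
    return sum(math.log(CONDITIONALS[tuple(tokens[:i])][tokens[i]])
               for i in range(len(tokens)))

print(sequence_log_prob(["the", "cat", "sat"]))  # log 0.6 + log 0.5 + log 0.9
```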
Language models are used not just to score the probability of existing sequences, but to generate sequences as $\hat{x} \sim P_\theta(X)$, a building block for tasks like summarization and long-form question answering (Fan et al., 2019; Liu and Lapata, 2019). However, to successfully generate high-variety, high-quality long samples from neural LMs on high-entropy distributions, it is currently necessary to reallocate probability from the tail of conditional distributions (Holtzman et al., 2020; Pillutla et al., 2021). Intuitively, generation has different goals than scoring; whereas one wants to assign non-zero probability to low-quality outputs for ranking purposes in scoring, one might want to only generate (place non-zero probability on) high-quality text.
2.2 Truncation sampling
There are many ways to reassign probability mass from the tail of the word-level distributions of a model to the head—like temperature scaling—but explicit truncation of low-probability words has been shown to be the most useful (Holtzman et al., 2020; Pillutla et al., 2021). Truncation sampling algorithms compute the following truncated distribution at each time step:

$$P_{\text{trunc}}(x \mid x_{<i}) = \begin{cases} P_\theta(x \mid x_{<i}) / Z_{x_{<i}} & x \in \mathcal{A}_{x_{<i}} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where $\mathcal{A}_{x_{<i}} \subseteq \mathcal{V}$ we call the allowed set for the algorithm for that prefix, and $Z_{x_{<i}} = \sum_{x \in \mathcal{A}_{x_{<i}}} P_\theta(x \mid x_{<i})$ is the renormalization term.
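Equation 1 is simple to implement once the allowed set has been chosen. Below is a minimal sketch (function names are mine) that takes any allowed-set rule, zeroes out the rest of the distribution, and renormalizes; a top-k rule is used as the example.

```python
import random

def truncated_sample(probs, allowed_set_rule):
    """Sample from Eq. 1: restrict to the allowed set, renormalize, then sample."""
    allowed = allowed_set_rule(probs)
    z = sum(probs[w] for w in allowed)          # renormalization term Z
    words = list(allowed)
    weights = [probs[w] / z for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# Example: top-k (k = 2) as the allowed-set rule on a toy distribution.
top_2 = lambda q: set(sorted(q, key=q.get, reverse=True)[:2])
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
print(truncated_sample(probs, top_2))  # always 'cat' or 'dog'
```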
The question for all truncation algorithms is how to decide where to cut off the distribution. Top-k sampling (Fan et al., 2018) keeps the k most likely words. Top-p sampling (Holtzman et al., 2020) improved upon it by noting that sometimes more or fewer than k words should be in the allowed set, instead allowing the minimal set of words to keep p percent of the probability. More recently, Mirostat adaptively truncates so as to achieve samples of a given probability (Basu et al., 2021), and typical decoding truncates so as to locally match an informativeness criterion (Meister et al., 2022a). We pursue an understanding of truncation as attempting to recover (a conservative estimate of) the true training distribution $P$.
3 Truncation as Desmoothing
3.1 KL-divergence and mode covering
Language models are trained to minimize the KL-divergence to an empirical approximation of the true distribution $P(X)$. Recall that the KL-divergence for a model's conditional distribution $P_\theta(X \mid x_{<i})$ to the true conditional distribution $P(X \mid x_{<i})$ is

$$\sum_{x \in \mathcal{V}} P(x \mid x_{<i}) \log \frac{P(x \mid x_{<i})}{P_\theta(x \mid x_{<i})} \qquad (2)$$
KL-divergence is known to be mode-covering; it heavily penalizes errors of coverage. When training from samples, an observed word $x_i$ in context $x_{<i}$ causes the model to incur a loss of $-\log P_\theta(x_i \mid x_{<i})$, which approaches infinity as the model probability approaches 0.[2] Neural LMs use shared representations to generalize beyond the training data, e.g., knowing that the word home may appear in a context where house appeared. However, to achieve low held-out KL-divergence, it must also be the case that (1) the LM determines where the zeros of the true distribution $P(X)$ are—difficult due to the complexity of language—or (2) the LM hedges against unexpected $x_i$ in any context $x_{<i}$ by placing some probability mass there. Intuitively, this hedging may be due to early stopping; instead of converging to the finite training set, language models are often trained for a single epoch, so each KL-minimizing gradient step is taken on new data, about which the model must hedge.

[2] Likewise during evaluation, the held-out perplexity $2^{-\mathbb{E}_{x_i, x_{<i}} \log P_\theta(x_i \mid x_{<i})}$ is infinite if zero mass is placed on an observed word.
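The incentive to hedge is easy to see numerically: the per-token loss grows without bound as the probability assigned to an observed word shrinks, so keeping a little mass nearly everywhere is cheap insurance. A quick sketch with arbitrary example probabilities:

```python
import math

# Per-token loss -log P_theta(x_i | x_{<i}) for a few model probabilities.
for p in [0.5, 0.05, 1e-4, 1e-8]:
    print(f"P_theta = {p:8.0e}   loss = {-math.log(p):6.2f} nats")
# The loss diverges as P_theta -> 0: assigning ~zero probability to any word
# that does appear in held-out text is catastrophic for KL (and perplexity).
```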
3.2 A neural LM as a smoothed distribution
We present a framework for neural LMs wherein smoothing aids in KL-divergence minimization by placing a small amount of probability mass on all words. Consider a true conditional distribution $P(X_i \mid x_{<i})$ over $\mathcal{V}$. We think of the LM distribution $P_\theta(X_i \mid x_{<i})$ as the result of smoothing the true distribution with a distribution $Q(X_i \mid x_{<i})$ that is like the uniform distribution. Specifically, we pose that the neural LM is a linear interpolation:

$$P_\theta(X_i \mid x_{<i}) = \lambda_{x_{<i}} P(X_i \mid x_{<i}) + (1 - \lambda_{x_{<i}}) Q(X_i \mid x_{<i}) \qquad (3)$$

where $\lambda_{x_{<i}} \in (0, 1]$ specifies the strength of the smoothing. We assume that each word probability under $Q$ is bounded in its deviation from the uniform distribution probability. For all $x \in \mathcal{V}$, we assume $Q(x \mid x_{<i}) \in \left(\frac{1-\delta}{|\mathcal{V}|}, \frac{1+\delta}{|\mathcal{V}|}\right)$, where $\delta$ is a constant specifying non-uniformity. We assume constraints on $\lambda_{x_{<i}}$ that reflect how the amount of smoothing should be (1) small and (2) dependent on how well-estimated a given conditional distribution is. Specifically, we assume that $\lambda_{x_{<i}} \geq \max(\bar{\lambda}_{x_{<i}}, \bar{\lambda})$, where $\bar{\lambda}$ is a constant near 1 (e.g., 0.8), independent of prefix. The exact form we use for the context-dependent $\bar{\lambda}_{x_{<i}}$ is $1 - \frac{|\mathcal{V}|\,\alpha \exp(-h_{x_{<i}})}{1+\delta}$. As we will show later, this form implies that for a distribution of entropy $h$, words with probability 0 under $P$ have probability bounded by $\alpha \exp(-h)$ under the language model.[3] A simple intuition for high-entropy distributions having less smoothing is that, e.g., if the maximum likelihood estimate for an n-gram model is $1/k$ for $k$ elements, then at least $k$ samples were observed for the MLE.[4]
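As a sanity check on the bound just stated, the sketch below instantiates Eq. 3 on a tiny vocabulary with made-up values of λ, δ, and α (all illustrative assumptions), and confirms that a word with zero probability under $P$ receives at most α·exp(−h) under the smoothed model.

```python
import math

# Toy instantiation of Eq. 3; all constants here are illustrative assumptions.
vocab_size = 5
true_p = [0.7, 0.2, 0.1, 0.0, 0.0]     # P(. | x_<i); the last two words have zero mass
q = [1.0 / vocab_size] * vocab_size     # Q: uniform, inside the (1 +/- delta)/|V| band
alpha, delta, lam_bar = 0.01, 0.1, 0.8

h = -sum(p * math.log(p) for p in true_p if p > 0)                 # entropy of P
lam_bar_ctx = 1 - vocab_size * alpha * math.exp(-h) / (1 + delta)  # context-dependent bound
lam = max(lam_bar_ctx, lam_bar)                                    # lambda >= max(., lam_bar)

smoothed = [lam * p + (1 - lam) * qx for p, qx in zip(true_p, q)]  # Eq. 3
bound = alpha * math.exp(-h)
print(f"smoothed probability of a zero-true-probability word: {smoothed[3]:.6f}")
print(f"alpha * exp(-h) bound:                                {bound:.6f}")
assert smoothed[3] <= bound
```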
3.3 A local measure of truncation quality
Under the smoothing model, we can make precise the tradeoff between (1) truncating too little, allowing words that are poor continuations, and (2) truncating too much and losing the diversity of the true distribution. Let $S_{x_{<i}} = \{x \in \mathcal{V} \mid P(x \mid x_{<i}) > 0\}$ be the true distribution support (set of words with non-zero probability) for the prefix $x_{<i}$.
[3] Note that $\exp(-h)$ is the probability in a uniform distribution of entropy $h$. This entropy is of $P(X_i \mid x_{<i})$.

[4] Even with this argument, the idea that high-entropy distributions are likely better estimated is probably the most tenuous assumption. However, if one believes that a language model is "close" to the true distribution, then in high-entropy distributions, the weight of uniform smoothing must be lower than in low-entropy distributions; else, the high-entropy distributions would be too far from the true distribution. Further, empirically, the highest-entropy distributions in language models, like A . . . or The . . ., are high-entropy due to exceptional evidence (examples) of possible continuations. Put another way, this suggests the entropy is from epistemic uncertainty (Osband et al., 2022).