
proved upon it by noting that sometimes more or fewer than $k$ words should be in the allowed set, instead allowing the minimal set of words to keep $p$ percent of the probability. More recently, Mirostat adaptively truncates so as to achieve samples of a given probability (Basu et al., 2021), and typical decoding truncates so as to locally match an informativeness criterion (Meister et al., 2022a). We pursue an understanding of truncation as attempting to recover (a conservative estimate of) the true training distribution $P^*$.
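A minimal illustrative sketch of nucleus (top-$p$) truncation, which keeps the smallest set of words whose cumulative probability reaches $p$ (the toy distribution, numbers, and function name below are arbitrary, not taken from the cited work):

```python
import numpy as np

def top_p_allowed_set(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Return the indices of the smallest set of words whose total
    probability is at least p (nucleus / top-p truncation)."""
    order = np.argsort(probs)[::-1]              # words sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix reaching mass p
    return order[:cutoff]

# Toy next-word distribution over a 5-word vocabulary.
probs = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
print(top_p_allowed_set(probs, p=0.9))           # [0 1 2]: the minimal set covering >= 0.9
```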
3 Truncation as Desmoothing
3.1 KL-divergence and mode covering
Language models are trained to minimize the KL-divergence to an empirical approximation of the true distribution $P^*(X)$. Recall that the KL-divergence of a model's conditional distribution $P_\theta(X \mid x_{<i})$ from the true conditional distribution $P^*(X \mid x_{<i})$ is

$$\sum_{x \in \mathcal{V}} P^*(x \mid x_{<i}) \log \frac{P^*(x \mid x_{<i})}{P_\theta(x \mid x_{<i})} \qquad (2)$$
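As a small numerical check of Equation (2), the sketch below (an illustrative toy example; the distributions are arbitrary) evaluates this conditional KL-divergence over a four-word vocabulary:

```python
import numpy as np

def conditional_kl(p_true: np.ndarray, p_model: np.ndarray) -> float:
    """Equation (2): sum over x of P*(x | x_<i) * log( P*(x | x_<i) / P_theta(x | x_<i) ).
    Words with P*(x | x_<i) = 0 contribute nothing to the sum."""
    mask = p_true > 0
    return float(np.sum(p_true[mask] * np.log(p_true[mask] / p_model[mask])))

# Toy conditional distributions over a 4-word vocabulary.
p_true  = np.array([0.5, 0.5, 0.0, 0.0])  # P*(. | x_<i): only two words are possible
p_model = np.array([0.4, 0.4, 0.1, 0.1])  # P_theta(. | x_<i): hedges on the other two
print(conditional_kl(p_true, p_model))    # ~0.223 nats
```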
KL-divergence is known to be mode-covering; it heavily penalizes errors of coverage. When training from samples, an observed word $x_i$ in context $x_{<i}$ causes the model to incur a loss of $-\log P_\theta(x_i \mid x_{<i})$, which approaches infinity as the model probability approaches 0.² Neural LMs use shared representations to generalize beyond the training data, e.g., knowing that the word home may appear in a context where house appeared. However, to achieve low held-out KL-divergence, it must also be the case that (1) the LM determines where the zeros of the true distribution $P^*(X)$ are, which is difficult due to the complexity of language, or (2) the LM hedges against unexpected $x_i$ in any context $x_{<i}$ by placing some probability mass there. Intuitively, this hedging may be due to early stopping; instead of converging to the finite training set, language models are often trained for a single epoch, so each KL-minimizing gradient step is taken on new data, about which the model must hedge.
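An illustrative sketch of why hedging pays off under this objective (the vocabulary size and weights below are arbitrary): the per-token loss $-\log P_\theta(x_i \mid x_{<i})$ grows without bound as the model probability shrinks, whereas reserving even a small near-uniform floor of mass caps the worst case:

```python
import math

vocab_size = 50_000

# Per-token loss -log P_theta(x_i | x_<i) diverges as the probability approaches 0.
for prob in [0.1, 1e-4, 1e-9, 1e-20]:
    print(f"P_theta = {prob:.0e}  ->  loss = {-math.log(prob):5.1f} nats")

# Hedging: reserve a small amount of (near-)uniform mass, as formalized in Section 3.2.
lam = 0.95                       # weight kept on the model's confident predictions
floor = (1 - lam) / vocab_size   # minimum probability any word can receive
print(f"worst-case loss with hedging: {-math.log(floor):.1f} nats")
```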
3.2 A neural LM as a smoothed distribution
We present a framework for neural LMs wherein smoothing aids in KL-divergence minimization by placing a small amount of probability mass on all words. Consider a true conditional distribution $P^*(X_i \mid x_{<i})$ over $\mathcal{V}$. We think of the LM distribution $P_\theta(X_i \mid x_{<i})$ as the result of smoothing the true distribution with a distribution $Q(X_i \mid x_{<i})$ that is like the uniform distribution. Specifically, we pose that the neural LM is a linear interpolation:

$$P_\theta(X_i \mid x_{<i}) = \lambda_{x_{<i}}\, P^*(X_i \mid x_{<i}) + (1 - \lambda_{x_{<i}})\, Q(X_i \mid x_{<i}) \qquad (3)$$

where $\lambda_{x_{<i}} \in (0, 1]$ specifies the strength of the smoothing.
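A minimal sketch of Equation (3) with arbitrary toy numbers: the smoothed LM distribution is a mixture of the true conditional distribution and a near-uniform $Q$, so every word receives some probability mass:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 8

# True conditional distribution P*(. | x_<i): supported on only three words.
p_true = np.zeros(vocab_size)
p_true[[0, 1, 2]] = [0.6, 0.3, 0.1]

# Q(. | x_<i): a perturbed-uniform distribution, each entry near 1/|V|.
delta = 0.2
q = (1 + rng.uniform(-delta, delta, vocab_size)) / vocab_size
q = q / q.sum()

# Equation (3): the LM distribution as a linear interpolation.
lam = 0.95
p_model = lam * p_true + (1 - lam) * q

print(p_model.round(4))  # every word now has non-zero probability
print(p_model.sum())     # still a valid distribution (sums to 1, up to float error)
```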
We assume that each word probability under $Q$ is bounded in its deviation from the uniform distribution's probability. For all $x \in \mathcal{V}$, we assume $Q(x \mid x_{<i}) \in \left( \frac{1-\delta}{|\mathcal{V}|}, \frac{1+\delta}{|\mathcal{V}|} \right)$, where $\delta$ is a constant specifying non-uniformity. We assume constraints on $\lambda_{x_{<i}}$ that reflect how the amount of smoothing should be (1) small and (2) dependent on how well-estimated a given conditional distribution is. Specifically, we assume that $\lambda_{x_{<i}} \geq \max(\bar{\lambda}_{x_{<i}}, \bar{\lambda})$, where $\bar{\lambda}$ is a constant near 1 (e.g., $0.8$), independent of prefix. The exact form we use for the context-dependent $\bar{\lambda}_{x_{<i}}$ is $1 - \frac{|\mathcal{V}|\, \alpha \exp(-h_{x_{<i}})}{1 + \delta}$. As we will show later, this form implies that for a distribution of entropy $h$, words with probability $0$ under $P^*$ have probability bounded by $\alpha \exp(-h)$ under the language model.³ A simple intuition for high-entropy distributions having less smoothing is that, e.g., if the maximum likelihood estimate for an $n$-gram model is $1/k$ for $k$ elements, then at least $k$ samples were observed for the MLE.⁴
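This bound can also be checked numerically. The sketch below (with arbitrary choices of $\alpha$, $\delta$, and the support size) sets $\lambda_{x_{<i}}$ to its context-dependent lower bound and confirms that a word outside the support of $P^*$ receives at most $\alpha \exp(-h)$ probability under the smoothed model:

```python
import numpy as np

vocab_size = 10_000
alpha, delta = 1e-3, 0.2

# A true conditional distribution that is uniform over k supported words, so h = log(k).
k = 50
p_true = np.zeros(vocab_size)
p_true[:k] = 1.0 / k
h = -np.sum(p_true[:k] * np.log(p_true[:k]))

# Context-dependent lower bound on the smoothing weight (the form given above).
lam_bar = 1 - vocab_size * alpha * np.exp(-h) / (1 + delta)

# Worst case for a word with P*(x | x_<i) = 0: Q at its upper bound (1 + delta)/|V|.
q_max = (1 + delta) / vocab_size
p_unsupported = (1 - lam_bar) * q_max   # model probability of an unsupported word

print(lam_bar)                                      # ~0.83, within (0, 1]
print(p_unsupported <= alpha * np.exp(-h) + 1e-12)  # True: bounded by alpha * exp(-h)
```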
3.3 A local measure of truncation quality
Under the smoothing model, we can make precise the tradeoff between (1) truncating too little, allowing words that are poor continuations, and (2) truncating too much and losing the diversity of the true distribution. Let $S^*_{x_{<i}} = \{x \in \mathcal{V} \mid P^*(x \mid x_{<i}) > 0\}$ be the support of the true distribution (the set of words with non-zero probability) for the prefix $x_{<i}$.
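As a toy illustration of this tradeoff (separate from the measure this subsection develops; the numbers below are arbitrary), consider recovering $S^*_{x_{<i}}$ from a smoothed distribution by thresholding: too low a threshold admits words outside the support, while too high a threshold discards genuinely possible words:

```python
import numpy as np

vocab_size = 20

# True conditional distribution: support S* is {0, 1, 2, 3}, including one rare word.
p_true = np.zeros(vocab_size)
p_true[[0, 1, 2, 3]] = [0.55, 0.30, 0.13, 0.02]
support = set(np.flatnonzero(p_true))

# Smoothed LM distribution (Equation (3)) with a uniform Q for simplicity.
lam = 0.9
p_model = lam * p_true + (1 - lam) / vocab_size

def truncate(threshold: float) -> set:
    """Allowed set: words whose model probability exceeds the threshold."""
    return set(np.flatnonzero(p_model > threshold))

for threshold in [0.001, 0.01, 0.1]:
    kept = truncate(threshold)
    extra = len(kept - support)   # words outside S* that survive truncation
    lost = len(support - kept)    # words in S* that truncation removes
    print(f"threshold={threshold}: extra={extra}, lost={lost}")
```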
²Likewise during evaluation, the held-out perplexity $2^{-\mathbb{E}_{x_i, x_{<i}}[\log P_\theta(x_i \mid x_{<i})]}$ is infinite if zero mass is placed on an observed word.

³Note that $\exp(-h)$ is the per-word probability in a uniform distribution of entropy $h$. This entropy is that of $P^*(X_i \mid x_{<i})$.

⁴Even with this argument, the idea that high-entropy distributions are likely better estimated is probably the most tenuous assumption. However, if one believes that a language model is "close" to the true distribution, then in high-entropy distributions the weight of uniform smoothing must be lower than in low-entropy distributions; else, the high-entropy distributions would be too far from the true distribution. Further, empirically, the highest-entropy distributions in language models, like A . . . or The . . . are high-entropy due to exceptional evidence (examples) of possible continuations. Put another way, this suggests the entropy is from epistemic uncertainty (Osband et al., 2022).