Contrastive Decoding: Open-ended Text Generation as Optimization
Xiang Lisa Li1, Ari Holtzman2, Daniel Fried3, Percy Liang1, Jason Eisner4,
Tatsunori Hashimoto1, Luke Zettlemoyer2,5, Mike Lewis5
Stanford University1, University of Washington2, Carnegie Mellon University3,
Johns Hopkins University4, FAIR5
xlisali@stanford.edu, ahai@cs.washington.edu, dfried@cs.cmu.edu,
pliang@stanford.edu, jason@cs.jhu.edu, thashim@stanford.edu,
lsz@cs.washington.edu, mikelewis@meta.com
Abstract
Given a language model (LM), maximum
probability is a poor decoding objective for
open-ended generation, because it produces
short and repetitive text. On the other hand,
sampling can often produce incoherent text
that drifts from the original topics. We propose
contrastive decoding (CD), a reliable decoding
approach that optimizes a contrastive objective
subject to a plausibility constraint. The
contrastive objective returns the difference
between the likelihood under a large LM
(called the expert, e.g. OPT-13B) and a small
LM (called the amateur, e.g. OPT-125M),
and the constraint ensures that the outputs are
plausible. CD is inspired by the fact that the failures of larger LMs (e.g., repetition, incoherence) are even more prevalent in smaller LMs, and that this difference signals which texts should be preferred. CD requires zero additional training, and produces higher quality text than decoding from the larger LM alone. It also works across model scales (OPT-13B and GPT2-1.5B) and significantly outperforms four strong decoding algorithms (e.g., nucleus, top-k) in automatic and human evaluations across Wikipedia, news and story domains.1
1 Introduction
Open-ended text generation aims to craft fluent and coherent textual continuations of given prompts, laying foundations for various downstream applications such as writing assistance and story generation (Brown et al., 2020). The canonical approaches often sample from large pre-trained language models (Holtzman et al., 2020; Fan et al., 2018; Radford et al., 2019), but the generated text is prone to incoherence and topic drift as unlucky sampling choices compound over long sequences (Eikema and Aziz, 2020; Maynez et al., 2020). On the other hand, searching for the most likely sequences often results in short, repetitive and tedious text (Holtzman et al., 2020), indicating that maximizing probability is a wrong decoding objective.

1Code is available at https://github.com/XiangLi1999/ContrastiveDecoding.git

Figure 1: Contrastive decoding exploits the contrast between expert and amateur LMs of different sizes by choosing tokens that maximize their log-likelihood difference. CD produces high-quality text that amplifies the good expert behavior and diminishes the undesired amateur behavior.
We propose a new search-based approach,
contrastive decoding (CD), that can generate fluent
and lexically diverse text without compromising
coherence. As shown in Figure 1, contrastive
decoding takes an off-the-shelf large language
model such as OPT-13B (that we call the expert)
and an off-the-shelf smaller language model such
as OPT-125M (that we call the amateur). CD
searches for text that maximizes the difference
between expert log-probabilities and amateur
log-probabilities, subject to plausibility constraints
which restrict the search space to tokens with
sufficiently high probability under the expert LM.
Contrastive Decoding works because many failure modes of language models (short, repetitive, irrelevant or uninteresting strings) are more common
under smaller LMs than under larger LMs. Such outputs are further deemphasized by taking the difference between model log-probabilities. Conversely, stronger models tend to put more probability mass on desirable outputs, such as those with factual knowledge that has not been learnt by the weaker model, and these strings are emphasized by contrastive decoding.
Taking Figure 1 as an example, the expert model places significant probability mass on previous tokens such as "Hawaii" and "Honolulu", leading to a highly repetitive continuation from greedy search; and nonsensical tokens such as "Washington" may be sampled, leading to an incoherent continuation. A correct continuation, "1961", is strongly preferred by contrastive decoding, despite only having a probability of 0.1, and the continuation includes more correct facts. This example suggests that contrastive decoding generates outputs that emphasize the best of the expert LM and remove its amateur tendencies. Moreover, we provide a pragmatic interpretation of contrastive decoding in §4.
Compared to recent training-based methods that improve generation quality, such as unlikelihood training (Welleck et al., 2020) and contrastive learning (Su et al., 2022; An et al., 2022), contrastive decoding requires zero additional training. We find that by simply contrasting two frozen language models of different sizes, we are able to decode higher quality text than from the larger LM alone. Furthermore, we find that better performance is achieved when the scale difference between expert and amateur is larger (§7.1). As a result, the optimal amateur model is also cheap to run and incurs very little inference time overhead.
We evaluate our contrastive decoding approach for open-ended text generation in three domains: Wikipedia, stories, and news, and we evaluate using different teacher-student combinations, including GPT2-XL vs. GPT2-small and OPT-13B vs. OPT-125M. Compared to four decoding baselines (nucleus sampling, top-k, typical decoding and SimCTG), our contrastive decoding method significantly improves the coherence of generated text, and improves or maintains the same fluency levels, according to both human evaluation and automatic metrics.
2 Problem Statement
We consider decoding approaches for open-ended language generation, where the language models receive an input prompt and aim to generate a fluent and coherent continuation. Specifically, we consider a relatively short prompt of length n, denoted as x_pre = x_1 ... x_n, where x_i is a token in the vocabulary V. The decoder must generate continuations of length m, denoted as x_cont = x_{n+1}, ..., x_{n+m}.
We generate text from a pre-trained autoregressive language model p_LM. At decoding time, we iteratively decode one token at a time by conditioning on the preceding context:

p_{\text{LM}}(x_{\text{cont}} \mid x_{\text{pre}}) = \prod_{i=n+1}^{n+m} p_{\text{LM}}(x_i \mid x_{<i}),

where p_LM(x_i | x_<i) is the next-token distribution. We use different subscripts to denote different LMs: p_AMA is the amateur LM (e.g., GPT-2 small), and p_EXP is the expert LM (e.g., GPT-2 XL).
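To make this concrete, the following is a minimal sketch (our illustration, not the paper's released code) of how the next-token distributions p_EXP and p_AMA might be read off two off-the-shelf models with the HuggingFace transformers library; the model names and the Figure 1-style prompt are illustrative choices.

```python
# Illustrative sketch: obtain next-token log-probabilities from an expert and an amateur LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")             # shared GPT-2 vocabulary
expert = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()  # expert LM (GPT-2 XL)
amateur = AutoModelForCausalLM.from_pretrained("gpt2").eval()    # amateur LM (GPT-2 small)

prompt = "Barack Obama was born in Honolulu, Hawaii. He was born in"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # logits: (batch, seq_len, vocab); the last position scores the next token x_{n+1}
    log_p_exp = torch.log_softmax(expert(input_ids).logits[0, -1], dim=-1)
    log_p_ama = torch.log_softmax(amateur(input_ids).logits[0, -1], dim=-1)
```

Because the two models share one tokenizer and vocabulary, their per-token log-probabilities can be compared position by position, which is what the contrastive objective below relies on.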
One canonical decoding approach is to sample from a truncated next-token distribution at each time step. For example, nucleus sampling (Holtzman et al., 2020) draws from the top p percentile of the next-token distribution; top-k sampling (Fan et al., 2018) draws from the top k candidates in the next-token distribution. Another common approach is to search for the most likely text sequence via greedy decoding or beam search (Wu et al., 2016); but this leads to repetition and tedious outputs.
3 Contrastive Decoding
We propose contrastive decoding as a search-based
decoding method that optimizes a novel contrastive
objective subject to our plausibility constraint. We
first provide intuition and define the contrastive
objective (§3.1). Second, we discuss the potential
weakness of this objective alone, and introduce the
plausibility constraint to correct for the weakness
(§3.2). Then we define the full contrastive decoding
method as our contrastive objective subject to the
plausibility constraint (§3.3). Finally, we elaborate
on the design spaces by discussing the choices of
amateurs (§3.4).
3.1 Contrastive Objective
Smaller LMs demonstrate stronger tendencies to produce undesirable patterns (e.g., repetition, topic drift, and self-contradiction) than larger LMs. For example, when both the expert (larger LM) and the amateur (smaller LM) assign highest probability to a repetitive token, the expert LM is often less confident about this decision and assigns non-trivial probability mass to other good, non-repetitive continuations. Contrastive decoding is inspired by these observations. The goal is to factor out undesired behaviors highlighted by the smaller amateur LM, and generate text from the remaining good behaviors of the larger expert LM.
To operationalize this intuition, we propose the contrastive objective:

\mathcal{L}_{\text{CD}}(x_{\text{cont}}, x_{\text{pre}}) = \log p_{\text{EXP}}(x_{\text{cont}} \mid x_{\text{pre}}) - \log p_{\text{AMA}}(x_{\text{cont}} \mid x_{\text{pre}}).
The CD objective rewards text patterns favored by the large expert LM and penalizes patterns favored by the small amateur LM. However, amateur LMs are not always mistaken: small language models still capture many simple aspects of English grammar and common sense (e.g., subject-verb agreement). Thus, penalizing all behaviors from amateur LMs indiscriminately would penalize these simple aspects that are correct (false negatives), and conversely reward implausible tokens (false positives). To tackle this issue, we introduce the plausibility constraint, which complements our CD objective and avoids these failure modes.
3.2 Vhead: Adaptive Plausibility Constraint
To tackle the aforementioned issue, we propose an adaptive plausibility constraint (V_head) that exploits the confidence level of the expert LM to restrict the effect of the contrastive objective when the expert LM is highly confident:

\mathcal{V}_{\text{head}}(x_{<i}) = \left\{ x_i \in \mathcal{V} : p_{\text{EXP}}(x_i \mid x_{<i}) \geq \alpha \max_{w} p_{\text{EXP}}(w \mid x_{<i}) \right\} \qquad (1)
Here, α is a hyperparameter in [0, 1] that truncates the next-token distribution of p_EXP. A larger α entails more aggressive truncation, keeping only high-probability tokens, whereas a smaller α allows tokens of lower probabilities to be generated. We set α = 0.1 throughout the paper.
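As a rough sketch of one way Eq. (1) could be implemented (our illustration, not the paper's released code), the constraint reduces to a boolean mask over the vocabulary computed from the expert's log-probabilities at the current step:

```python
import torch

def plausibility_mask(log_p_exp: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Adaptive plausibility constraint V_head of Eq. (1).

    log_p_exp: (vocab_size,) expert log-probabilities for the current step.
    Returns a boolean mask that is True where
    p_EXP(x_i | x_<i) >= alpha * max_w p_EXP(w | x_<i).
    """
    # Work in log space: keep tokens with log p >= log(alpha) + max log p.
    threshold = log_p_exp.max() + torch.log(torch.tensor(alpha))
    return log_p_exp >= threshold
```

With α = 0.1 and a sharply peaked expert distribution, only tokens with at least roughly a tenth of the top token's probability survive, so the candidate pool can shrink to a single token.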
This adaptive plausibility constraint corrects for
both false positive and false negative failures of the
contrastive objective:
False positives. An implausible token may be rewarded with a high score under our unconstrained contrastive objective. For example, the token "NetMessage" is highly implausible in the context of Figure 1, with p_EXP = 3×10^-9 and p_AMA = 8×10^-14; however, it attains the highest contrast, log p_EXP − log p_AMA = 10.6, which is much higher than that of plausible tokens such as "1961" and "Hawaii". To handle the false positive problem, V_head filters out low-probability tokens and only keeps high-probability tokens in the candidate pool.
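As a quick sanity check of this example, the contrast is simply the log of the probability ratio; plugging in the quoted (rounded) probabilities gives

\log p_{\text{EXP}} - \log p_{\text{AMA}} = \ln\frac{3\times 10^{-9}}{8\times 10^{-14}} = \ln\left(3.75\times 10^{4}\right) \approx 10.5,

consistent with the reported 10.6 up to rounding of the displayed probabilities.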
False negatives. When confronting an easy decision, the correct token that achieves high probability under both the amateur LM and the expert LM may receive a low score under the contrastive objective. For example, due to tokenization, the word "unicorn" consists of two subwords, "unic" and "#orn", and the probability of "#orn" given the prefix "unic" is close to 0.99 under both LMs, but the contrast log p_EXP − log p_AMA is only 6×10^-4, which is much lower than that of bad continuations.
Here, V_head uses the expert LM's confidence (as defined by the α ratio with the max-probability token at the given timestep) to avoid these false negative cases. The expert LM assigns high confidence to easy decisions, but not to tokens that reflect the undesired behaviors of the amateur, since probability mass is taken up by other candidate tokens the expert is able to consider. Our constraint keeps as few as one token in the candidate pool when the expert is highly confident about this token, which removes the impact of the contrastive objective, because the single token would always be highest ranked regardless of the CD objective.
3.3 Full Method
Combining the contrastive objective and the adaptive plausibility constraint, we obtain the full contrastive decoding formulation:

\max_{x_{\text{cont}}} \; \mathcal{L}_{\text{CD}}(x_{\text{cont}}, x_{\text{pre}}) \quad \text{subject to} \quad x_i \in \mathcal{V}_{\text{head}}(x_{<i}), \; \forall x_i \in x_{\text{cont}} \qquad (2)

The above objective is defined at the sequence level, which is intractable to optimize. Thus, we factor the objective into token-level scores:

\text{CD-score}(x_i; x_{<i}) = \begin{cases} \log \dfrac{p_{\text{EXP}}(x_i \mid x_{<i})}{p_{\text{AMA}}(x_i \mid x_{<i})}, & \text{if } x_i \in \mathcal{V}_{\text{head}}(x_{<i}), \\ -\infty, & \text{otherwise.} \end{cases} \qquad (3)
We apply beam search to optimize CD-score, first filtering tokens based on the plausibility constraint V_head(x_<i), eliminating tokens that fail to achieve sufficiently high probabilities under the expert LM. Then we score the remaining tokens by the amount of contrast they demonstrate, log p_EXP(x_i | x_<i) − log p_AMA(x_i | x_<i). As a result, we end up selecting plausible tokens under the expert LM that least resemble the amateur LM.
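The experiments in this paper use beam search with a beam size of 5; as a simplified sketch (our illustration, assuming a greedy, beam-size-1 search and reusing the earlier HuggingFace setup), one decoding step could look like this:

```python
import torch

def cd_score(log_p_exp: torch.Tensor, log_p_ama: torch.Tensor,
             alpha: float = 0.1) -> torch.Tensor:
    """Token-level CD-score (Eq. 3): log p_EXP - log p_AMA inside V_head, -inf outside."""
    in_v_head = log_p_exp >= log_p_exp.max() + torch.log(torch.tensor(alpha))
    return (log_p_exp - log_p_ama).masked_fill(~in_v_head, float("-inf"))

def contrastive_step(expert, amateur, input_ids, alpha=0.1, amateur_temp=1.0):
    """One greedy (beam size 1) contrastive decoding step; the paper uses beam size 5."""
    with torch.no_grad():
        log_p_exp = torch.log_softmax(expert(input_ids).logits[0, -1], dim=-1)
        log_p_ama = torch.log_softmax(amateur(input_ids).logits[0, -1] / amateur_temp, dim=-1)
    next_id = cd_score(log_p_exp, log_p_ama, alpha).argmax()
    return torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
```

Tokens outside V_head receive a score of −∞ and can never be selected, so the plausibility constraint of Eq. (1) is enforced before the contrast is ever compared.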
3.4 Choice of Amateur
The choice of amateur LM is an important decision
for contrastive decoding. As discussed in §3.1,
we should choose amateur LMs that exhibit the
behaviors we would like to downweight from the
expert LM. Here, we consider three aspects:
Scale. Smaller LMs have lower modeling capacity and are more prone to errors. Therefore, we choose the amateur LM to be the smallest model in the same family as the expert LM. For example, for the OPT-13B expert, we choose OPT-125M as the amateur; for the GPT-2 XL expert, we choose GPT-2 small as the amateur. We verify this design choice in §7.1. On the extreme end, employing n-gram models yields an amateur LM of extremely low capacity. But this choice hurts generation quality, because n-gram LMs incur too many errors to identify the failure modes of the expert LM.
Temperature. We can manipulate the amateur LM's behavior by tuning its temperature τ. For example, applying a high temperature (τ > 1) to the amateur LM results in a flatter distribution; applying a low temperature (τ close to 0) highlights the mode of the amateur distribution, which is more prone to errors (e.g., repetition). Therefore, we manipulate the temperature of the amateur LM to adjust the amateur behavior that will be penalized in contrastive decoding. In §7.2, we study the impact of τ on generation quality and set τ to 0.5 or 1.0 for our main experiments.
Context window. We can also weaken capacity by restricting the context window of the amateur LM (Li et al., 2016). For instance, we can allow the amateur LM to condition only on the last token of x_pre, while the expert LM conditions on the entire x_pre. In other words, we decode from log [p_EXP(x_cont | x_{1:n}) / p_AMA(x_cont | x_n)]. By conditioning the amateur LM only on partial prompts, its coherence is weakened, and contrastive decoding produces more coherent text by highlighting the coherent nature of the expert LM. In §7.5, we study the impact of this design choice.
4 CD as Pragmatic Communication
Having formally described contrastive decoding, we now provide a pragmatic interpretation, justifying its validity through pragmatic communication goals.
A line of work in pragmatics (Grice, 1975) characterizes communication as a cooperative process between speakers and listeners. Several of these formalisms (Horn, 1984; Levinson, 2000) describe a tradeoff between speakers and listeners, where a speaker should generally produce language that is high quality (e.g., truthful, fluent, and relevant) while also being informative to a listener.
Our contrastive objective can be motivated by this tradeoff, with our expert and amateur LMs modeling a knowledgeable speaker and a less-informed listener: (1) Upweighting tokens by p_EXP and using our expert-based plausibility constraint generates tokens that have high probability under the expert LM, encouraging generated text to be fluent and relevant (e.g., upweighting "1961" in Figure 1). (2) Downweighting tokens by p_AMA suppresses language that is predictable by (i.e., less informative to) the amateur LM (e.g., downweighting "Honolulu" and "Washington"), and by proxy encourages the language to be informative to a listener in context. By combining these two criteria, our contrastive decoding method produces high-quality text that satisfies the communicative goal of transferring relevant but not predictable information.
4.1 Special Cases of Contrastive Decoding
Maximum probability. Setting the amateur LM to a uniform distribution reduces CD to maximizing log-probabilities under the expert LM.
N-gram blocking. If we set the amateur LM as an n-gram model whose n-gram counts are updated to fit the generated prefix, this yields a decoding algorithm with soft n-gram blocking. If we also set the amateur temperature to be very small, then it approaches the canonical heuristic of forbidding repeated n-grams (Paulus et al., 2018).
Diverse decoding. If we use the same LM as both amateur and expert and restrict the context window of the amateur LM (§3.4), our method is equivalent to the MMI decoding objective (Li et al., 2016) sometimes used in dialog systems, which explicitly maximizes the pointwise mutual information between x_pre and x_cont.
5 Experimental Setup
5.1 Datasets and Metrics
We evaluate on three domains for open-ended text generation: news, Wikipedia, and stories. For the news domain, we use news articles from Wikinews;2 for the Wikipedia domain, we use the WikiText-103 dataset (Merity et al., 2017); and for the story domain, we use the BookCorpus (Zhu et al., 2015) (Project Gutenberg split).
We use the first 32 words of each passage as the prompt, and decode 256 tokens for the continuation. We evaluate generated text with both automatic and human evaluation.
Diversity. This metric aggregates n-gram repetition rates:

\text{DIV} = \prod_{n=2}^{4} \frac{|\text{unique } n\text{-grams}(x_{\text{cont}})|}{|\text{total } n\text{-grams}(x_{\text{cont}})|}.

A low diversity score suggests the model suffers from repetition, and a high diversity score means the generated text is lexically diverse.
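A small self-contained sketch of this metric (our illustration; implementation details such as tokenization may differ from the paper's evaluation code):

```python
def div_score(text: str) -> float:
    """DIV = product over n in {2, 3, 4} of |unique n-grams| / |total n-grams|."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    score = 1.0
    for n in range(2, 5):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams:
            score *= len(set(ngrams)) / len(ngrams)
    return score
```

A heavily repetitive continuation drives each factor, and hence the product, toward zero, while lexically varied text keeps the score close to one.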
MAUVE. The MAUVE score (Pillutla et al., 2021), where higher is better, measures the distribution similarity between the set of generated texts and the set of gold references.
Coherence. We follow Su et al. (2022) and approximate coherence by the cosine similarity between the sentence embeddings of the prompt x_pre and the generated continuation x_cont:

\text{COH}(x_{\text{cont}}, x_{\text{pre}}) = \frac{\text{EMB}(x_{\text{pre}}) \cdot \text{EMB}(x_{\text{cont}})}{\|\text{EMB}(x_{\text{pre}})\| \cdot \|\text{EMB}(x_{\text{cont}})\|},

where EMB(x) is the pre-trained SimCSE sentence embedding (Gao et al., 2021).
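A sketch of computing this score, assuming the publicly released SimCSE checkpoint princeton-nlp/sup-simcse-bert-base-uncased and its [CLS] pooler output as the sentence embedding (the paper's exact embedding setup may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: a public SimCSE checkpoint; the paper's exact choice may differ.
NAME = "princeton-nlp/sup-simcse-bert-base-uncased"
tok = AutoTokenizer.from_pretrained(NAME)
enc = AutoModel.from_pretrained(NAME).eval()

def embed(text: str) -> torch.Tensor:
    """Return a single sentence embedding (the [CLS] pooler output)."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return enc(**inputs).pooler_output[0]

def coherence(prompt: str, continuation: str) -> float:
    """Cosine similarity between prompt and continuation embeddings (COH)."""
    return torch.nn.functional.cosine_similarity(
        embed(prompt), embed(continuation), dim=0).item()
```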
Human Eval. In order to evaluate the quality of
the generated text, we consider two critical aspects:
fluency and coherence. A fluent piece of text is
written in grammatical English and has a natural
flow (e.g. excluding unnatural repetition or web
formatting). A coherent piece of text should stay
on topic with the prompt and avoid unnatural topic
drift. We ask Amazon Mechanical Turkers to read
two continuations (A and B) of the same prompt,
and choose the more fluent/coherent continuation
or decide they are similar.
5.2 Baselines
We compare contrastive decoding with three sampling methods, each with the recommended hyperparameters: nucleus sampling (p = 0.95), top-k sampling (k = 50), and typical decoding (Meister et al., 2022) (τ = 0.95); and with two search-based methods: greedy (max prob) decoding, which uses log p_EXP as the objective, and contrastive search (CS) (Su et al., 2022; Su and Collier, 2022). Among them, nucleus sampling is the standard approach for open-ended text generation whose performance has been verified in various domains (Holtzman et al., 2020; DeLucia et al., 2020), and typical decoding is a recently proposed approach that excels in lexical diversity (Meister et al., 2022). We therefore conduct human evaluation by comparing CD against these two methods.

2Wikinews from http://www.wikinews.org
5.3 Models and Hyperparameters
In order to demonstrate that our approach generalizes across various LM families and sizes, we consider GPT-2 XL (1.5B), OPT (6.7B) and OPT (13B) as expert LMs and employ the smallest LM in each respective family as the amateur: GPT-2 small (100M) and OPT (125M).

Recall that contrastive decoding introduces two hyperparameters: α adjusts the plausibility threshold, and τ is the temperature of the amateur LM. We always set α = 0.1 for the main results in the paper; we find that this setting is quite robust and generalizes across various domains. For the OPT experiments, we set the amateur temperature to 1.0, and for the GPT-2 experiments, we set the amateur temperature to 0.5. We use a beam size of 5. We also study the impact of these hyperparameters in the ablation study (§7.2), and we find that our method is robust to various hyperparameter values.
6 Main Results
6.1 Automatic Evaluation
As shown in Table 1, contrastive decoding outperforms all other decoding baselines in MAUVE score and coherence score (COH) across three different domains (news, Wikipedia, stories) and two model sizes (1.5B, 13B). Contrastive decoding achieves comparable or slightly worse diversity compared to nucleus and typical sampling, but it achieves substantially better diversity than other search-based methods.

Typical decoding and nucleus sampling produce lexically diverse text by choosing low-probability tokens, at the expense of topic drift. For instance, in the story domain we observe the largest diversity gap between contrastive decoding and nucleus sampling (0.83 vs. 0.94) for the 1.5B model, but we find that the gap shrinks (0.89 vs. 0.93) as