Contrastive Decoding: Open-ended Text Generation as Optimization
Xiang Lisa Li1, Ari Holtzman2, Daniel Fried3, Percy Liang1, Jason Eisner4,
Tatsunori Hashimoto1, Luke Zettlemoyer2,5, Mike Lewis5
Stanford University1, University of Washington2, Carnegie Mellon University3,
Johns Hopkins University4, FAIR5
xlisali@stanford.edu, ahai@cs.washington.edu, dfried@cs.cmu.edu,
pliang@stanford.edu, jason@cs.jhu.edu, thashim@stanford.edu,
lsz@cs.washington.edu, mikelewis@meta.com
Abstract
Given a language model (LM), maximum
probability is a poor decoding objective for
open-ended generation, because it produces
short and repetitive text. On the other hand,
sampling can often produce incoherent text
that drifts from the original topics. We propose
contrastive decoding (CD), a reliable decoding
approach that optimizes a contrastive objective
subject to a plausibility constraint. The
contrastive objective returns the difference
between the likelihood under a large LM
(called the expert, e.g. OPT-13B) and a small
LM (called the amateur, e.g. OPT-125M),
and the constraint ensures that the outputs are
plausible. CD is inspired by the fact that the failures of larger LMs (e.g., repetition, incoherence) are even more prevalent in smaller LMs, and that this difference signals which texts should be preferred. CD requires zero additional training, and produces higher quality text than decoding from the larger LM alone. It also works across model scales (OPT-13B and GPT2-1.5B) and significantly outperforms four strong decoding algorithms (e.g., nucleus, top-k) in automatic and human evaluations across Wikipedia, news and story domains.1
1 Introduction
Open-ended text generation aims to craft fluent and coherent textual continuations of given prompts, laying foundations for various downstream applications such as writing assistance and story generation (Brown et al., 2020). The canonical approaches often sample from large pre-trained language models (Holtzman et al., 2020; Fan et al., 2018; Radford et al., 2019), but the generated text is prone to incoherence and topic drift as unlucky sampling choices compound over long sequences (Eikema and Aziz, 2020; Maynez et al., 2020). On the other hand, searching for the most likely sequences often results in short, repetitive and tedious text (Holtzman et al., 2020), indicating that maximizing probability is a wrong decoding objective.

1Code is available at https://github.com/XiangLi1999/ContrastiveDecoding.git

Figure 1: Contrastive decoding exploits the contrast between expert and amateur LMs of different sizes by choosing tokens that maximize their log-likelihood difference. CD produces high-quality text that amplifies the good expert behavior and diminishes the undesired amateur behavior.
We propose a new search-based approach,
contrastive decoding (CD), that can generate fluent
and lexically diverse text without compromising
coherence. As shown in Figure 1, contrastive
decoding takes an off-the-shelf large language
model such as OPT-13B (that we call the expert)
and an off-the-shelf smaller language model such
as OPT-125M (that we call the amateur). CD
searches for text that maximizes the difference
between expert log-probabilities and amateur
log-probabilities, subject to plausibility constraints
which restrict the search space to tokens with
sufficiently high probability under the expert LM.
Contrastive Decoding works because many failure modes of language models (short, repetitive, irrelevant or uninteresting strings) are more common
under smaller LMs than under larger LMs. Such outputs are further deemphasized by taking the difference between model log-probabilities. Conversely, stronger models tend to put more probability mass on desirable outputs, such as those with factual knowledge that has not been learnt by the weaker model, and these strings are emphasized by contrastive decoding.
Taking Figure 1 as an example, the expert model places significant probability mass on previous tokens such as "Hawaii" and "Honolulu", leading to a highly repetitive continuation from greedy search; and nonsensical tokens such as "Washington" may be sampled, leading to an incoherent continuation. A correct continuation, "1961", is strongly preferred by contrastive decoding, despite only having a probability of 0.1, and the continuation includes more correct facts. This example suggests that contrastive decoding generates outputs that emphasize the best of the expert LM and remove its amateur tendencies. Moreover, we provide a pragmatic interpretation of contrastive decoding in §4.
Compared to recent training-based methods that improve generation quality, such as unlikelihood training (Welleck et al., 2020) and contrastive learning (Su et al., 2022; An et al., 2022), contrastive decoding requires zero additional training. We find that by simply contrasting two frozen language models of different sizes, we are able to decode higher quality text than from the larger LM alone. Furthermore, we find that better performance is achieved when the scale difference between expert and amateur is larger (§7.1). As a result, the optimal amateur model is also cheap to run and incurs very little inference time overhead.
We evaluate our contrastive decoding approach for open-ended text generation in three domains: Wikipedia, stories, and news, and we evaluate using different teacher-student combinations, including GPT2-XL vs. GPT2-small and OPT-13B vs. OPT-125M. Compared to four decoding baselines (nucleus sampling, top-k, typical decoding and SimCTG), our contrastive decoding method significantly improves the coherence of generated text, and improves or maintains the same fluency levels, according to both human evaluation and automatic metrics.
2 Problem Statement
We consider decoding approaches for open-ended language generation, where the language models receive an input prompt and aim to generate a fluent and coherent continuation. Specifically, we consider a relatively short prompt of length n, denoted as x_pre = x_1 ... x_n, where x_i is a token in the vocabulary V. The decoder must generate continuations of length m, denoted as x_cont = x_{n+1}, ..., x_{n+m}.
We generate text from a pre-trained autoregressive language model p_LM. At decoding time, we iteratively decode one token at a time by conditioning on the preceding context:

p_{\text{LM}}(x_{\text{cont}} \mid x_{\text{pre}}) = \prod_{i=n+1}^{n+m} p_{\text{LM}}(x_i \mid x_{<i}),

where p_LM(x_i | x_<i) is the next-token distribution. We use different subscripts to denote different LMs: p_AMA is the amateur LM (e.g., GPT-2 small), and p_EXP is the expert LM (e.g., GPT-2 XL).
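To make this concrete, the following is a minimal sketch (our illustration, not the paper's released code) of how the next-token distributions p_EXP and p_AMA might be read off two off-the-shelf models with the HuggingFace transformers library; the model names and the Figure 1-style prompt are illustrative choices.

```python
# Illustrative sketch: obtain next-token log-probabilities from an expert and an amateur LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")             # shared GPT-2 vocabulary
expert = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()  # expert LM (GPT-2 XL)
amateur = AutoModelForCausalLM.from_pretrained("gpt2").eval()    # amateur LM (GPT-2 small)

prompt = "Barack Obama was born in Honolulu, Hawaii. He was born in"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # logits: (batch, seq_len, vocab); the last position scores the next token x_{n+1}
    log_p_exp = torch.log_softmax(expert(input_ids).logits[0, -1], dim=-1)
    log_p_ama = torch.log_softmax(amateur(input_ids).logits[0, -1], dim=-1)
```

Because the two models share one tokenizer and vocabulary, their per-token log-probabilities can be compared position by position, which is what the contrastive objective below relies on.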
One canonical decoding approach is to sample from a truncated next-token distribution at each time step. For example, nucleus sampling (Holtzman et al., 2020) draws from the top p percentile of the next-token distribution; top-k sampling (Fan et al., 2018) draws from the top k candidates in the next-token distribution. Another common approach is to search for the most likely text sequence via greedy decoding or beam search (Wu et al., 2016); but this leads to repetition and tedious outputs.
3 Contrastive Decoding
We propose contrastive decoding as a search-based
decoding method that optimizes a novel contrastive
objective subject to our plausibility constraint. We
first provide intuition and define the contrastive
objective (§3.1). Second, we discuss the potential
weakness of this objective alone, and introduce the
plausibility constraint to correct for the weakness
(§3.2). Then we define the full contrastive decoding
method as our contrastive objective subject to the
plausibility constraint (§3.3). Finally, we elaborate
on the design spaces by discussing the choices of
amateurs (§3.4).
3.1 Contrastive Objective
Smaller LMs demonstrate stronger tendencies to produce undesirable patterns (e.g., repetition, topic drift, and self-contradiction) than larger LMs. For example, when both the expert (larger LM) and the amateur (smaller LM) assign highest probability to a repetitive token, the expert LM is often less confident about this decision and assigns non-trivial probability mass to other good, non-repetitive continuations. Contrastive decoding is inspired by these observations. The goal is to factor out undesired behaviors highlighted by the smaller amateur LM, and generate text from the remaining good behaviors of the larger expert LM.
To operationalize this intuition, we propose the contrastive objective:

\mathcal{L}_{\text{CD}}(x_{\text{cont}}, x_{\text{pre}}) = \log p_{\text{EXP}}(x_{\text{cont}} \mid x_{\text{pre}}) - \log p_{\text{AMA}}(x_{\text{cont}} \mid x_{\text{pre}}).
The CD objective rewards text patterns favored by the large expert LM and penalizes patterns favored by the small amateur LM. However, amateur LMs are not always mistaken: small language models still capture many simple aspects of English grammar and common sense (e.g., subject-verb agreement). Thus, penalizing all behaviors from amateur LMs indiscriminately would penalize these simple aspects that are correct (false negatives), and conversely reward implausible tokens (false positives). To tackle this issue, we introduce the plausibility constraint, which complements our CD objective and avoids these failure modes.
3.2 Vhead: Adaptive Plausibility Constraint
To tackle the aforementioned issue, we propose an adaptive plausibility constraint (V_head) that exploits the confidence level of the expert LM to restrict the effect of the contrastive objective when the expert LM is highly confident:

\mathcal{V}_{\text{head}}(x_{<i}) = \left\{ x_i \in \mathcal{V} : p_{\text{EXP}}(x_i \mid x_{<i}) \geq \alpha \max_{w} p_{\text{EXP}}(w \mid x_{<i}) \right\} \qquad (1)
Here, α is a hyperparameter in [0, 1] that truncates the next-token distribution of p_EXP. A larger α entails more aggressive truncation, keeping only high-probability tokens, whereas a smaller α allows tokens of lower probabilities to be generated. We set α = 0.1 throughout the paper.
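As a rough sketch of one way Eq. (1) could be implemented (our illustration, not the paper's released code), the constraint reduces to a boolean mask over the vocabulary computed from the expert's log-probabilities at the current step:

```python
import torch

def plausibility_mask(log_p_exp: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Adaptive plausibility constraint V_head of Eq. (1).

    log_p_exp: (vocab_size,) expert log-probabilities for the current step.
    Returns a boolean mask that is True where
    p_EXP(x_i | x_<i) >= alpha * max_w p_EXP(w | x_<i).
    """
    # Work in log space: keep tokens with log p >= log(alpha) + max log p.
    threshold = log_p_exp.max() + torch.log(torch.tensor(alpha))
    return log_p_exp >= threshold
```

With α = 0.1 and a sharply peaked expert distribution, only tokens with at least roughly a tenth of the top token's probability survive, so the candidate pool can shrink to a single token.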
This adaptive plausibility constraint corrects for
both false positive and false negative failures of the
contrastive objective:
False positives. An implausible token may be rewarded with a high score under our unconstrained contrastive objective. For example, the token "NetMessage" is highly implausible in the context of Figure 1, with p_EXP = 3×10^-9 and p_AMA = 8×10^-14; however, it attains the highest contrast, log p_EXP − log p_AMA = 10.6, which is much higher than that of plausible tokens such as "1961" and "Hawaii". To handle the false positive problem, V_head filters out low-probability tokens and only keeps high-probability tokens in the candidate pool.
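As a quick sanity check of this example, the contrast is simply the log of the probability ratio; plugging in the quoted (rounded) probabilities gives

\log p_{\text{EXP}} - \log p_{\text{AMA}} = \ln\frac{3\times 10^{-9}}{8\times 10^{-14}} = \ln\left(3.75\times 10^{4}\right) \approx 10.5,

consistent with the reported 10.6 up to rounding of the displayed probabilities.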
False negatives. When confronting an easy decision, the correct token that achieves high probability under both the amateur LM and the expert LM may receive a low score under the contrastive objective. For example, due to tokenization, the word "unicorn" consists of two subwords, "unic" and "#orn", and the probability of "#orn" given the prefix "unic" is close to 0.99 under both LMs, but the contrast log p_EXP − log p_AMA is only 6×10^-4, which is much lower than that of bad continuations.
Here, V_head uses the expert LM's confidence (as defined by the α ratio with the max-probability token at the given timestep) to avoid these false negative cases. The expert LM assigns high confidence to easy decisions, but not to tokens that reflect the undesired behaviors of the amateur, since probability mass is taken up by other candidate tokens the expert is able to consider. Our constraint keeps as few as one token in the candidate pool when the expert is highly confident about this token, which removes the impact of the contrastive objective, because the single token would always be highest ranked regardless of the CD objective.
3.3 Full Method
Combining the contrastive objective and the adaptive plausibility constraint, we obtain the full contrastive decoding formulation:

\max_{x_{\text{cont}}} \; \mathcal{L}_{\text{CD}}(x_{\text{cont}}, x_{\text{pre}}) \quad \text{subject to} \quad x_i \in \mathcal{V}_{\text{head}}(x_{<i}), \; \forall x_i \in x_{\text{cont}} \qquad (2)

The above objective is defined at the sequence level, which is intractable to optimize. Thus, we factor the objective into token-level scores:

\text{CD-score}(x_i; x_{<i}) = \begin{cases} \log \dfrac{p_{\text{EXP}}(x_i \mid x_{<i})}{p_{\text{AMA}}(x_i \mid x_{<i})}, & \text{if } x_i \in \mathcal{V}_{\text{head}}(x_{<i}), \\ -\infty, & \text{otherwise.} \end{cases} \qquad (3)
We apply beam search to optimize CD-score, first filtering tokens based on the plausibility constraint V_head(x_<i), eliminating tokens that fail to achieve sufficiently high probabilities under the expert LM. Then we score the remaining tokens by the amount of contrast they demonstrate, log p_EXP(x_i | x_<i) − log p_AMA(x_i | x_<i). As a result, we end up selecting plausible tokens under the expert LM that least resemble the amateur LM.
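The experiments in this paper use beam search with a beam size of 5; as a simplified sketch (our illustration, assuming a greedy, beam-size-1 search and reusing the earlier HuggingFace setup), one decoding step could look like this:

```python
import torch

def cd_score(log_p_exp: torch.Tensor, log_p_ama: torch.Tensor,
             alpha: float = 0.1) -> torch.Tensor:
    """Token-level CD-score (Eq. 3): log p_EXP - log p_AMA inside V_head, -inf outside."""
    in_v_head = log_p_exp >= log_p_exp.max() + torch.log(torch.tensor(alpha))
    return (log_p_exp - log_p_ama).masked_fill(~in_v_head, float("-inf"))

def contrastive_step(expert, amateur, input_ids, alpha=0.1, amateur_temp=1.0):
    """One greedy (beam size 1) contrastive decoding step; the paper uses beam size 5."""
    with torch.no_grad():
        log_p_exp = torch.log_softmax(expert(input_ids).logits[0, -1], dim=-1)
        log_p_ama = torch.log_softmax(amateur(input_ids).logits[0, -1] / amateur_temp, dim=-1)
    next_id = cd_score(log_p_exp, log_p_ama, alpha).argmax()
    return torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
```

Tokens outside V_head receive a score of −∞ and can never be selected, so the plausibility constraint of Eq. (1) is enforced before the contrast is ever compared.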
3.4 Choice of Amateur
The choice of amateur LM is an important decision
for contrastive decoding. As discussed in §3.1,
we should choose amateur LMs that exhibit the
behaviors we would like to downweight from the
expert LM. Here, we consider three aspects:
Scale. Smaller LMs have lower modeling capacity and are more prone to errors. Therefore, we choose the amateur LM to be the smallest model in the same family as the expert LM. For example, for the OPT-13B expert, we choose OPT-125M as the amateur; for the GPT-2 XL expert, we choose GPT-2 small as the amateur. We verify this design choice in §7.1. On the extreme end, employing n-gram models yields an amateur LM of extremely low capacity. But this choice hurts generation quality, because n-gram LMs incur too many errors to identify the failure modes of the expert LM.
Temperature. We can manipulate the amateur LM's behavior by tuning its temperature τ. For example, applying a high temperature (τ > 1) to the amateur LM results in a flatter distribution; applying a low temperature (τ close to 0) highlights the mode of the amateur distribution, which is more prone to errors (e.g., repetition). Therefore, we manipulate the temperature of the amateur LM to adjust the amateur behavior that will be penalized in contrastive decoding. In §7.2, we study the impact of τ on generation quality and set τ to 0.5 or 1.0 for our main experiments.
Context window. We can also weaken capacity by restricting the context window of the amateur LM (Li et al., 2016). For instance, we can allow the amateur LM to condition only on the last token of x_pre, while the expert LM conditions on the entire x_pre. In other words, we decode from log [p_EXP(x_cont | x_{1:n}) / p_AMA(x_cont | x_n)]. By conditioning the amateur LM only on partial prompts, its coherence is weakened, and contrastive decoding produces more coherent text by highlighting the coherent nature of the expert LM. In §7.5, we study the impact of this design choice.
4 CD as Pragmatic Communication
Having formally described contrastive decoding, we now provide a pragmatic interpretation, justifying its validity through pragmatic communication goals.
A line of work in pragmatics (Grice, 1975) characterizes communication as a cooperative process between speakers and listeners. Several of these formalisms (Horn, 1984; Levinson, 2000) describe a tradeoff between speakers and listeners, where a speaker should generally produce language that is high quality (e.g., truthful, fluent, and relevant) while also being informative to a listener.
Our contrastive objective can be motivated by this tradeoff, with our expert and amateur LMs modeling a knowledgeable speaker and a less-informed listener: (1) Upweighting tokens by p_EXP and using our expert-based plausibility constraint generates tokens that have high probability under the expert LM, encouraging generated text to be fluent and relevant (e.g., upweighting "1961" in Figure 1). (2) Downweighting tokens by p_AMA suppresses language that is predictable by (i.e., less informative to) the amateur LM (e.g., downweighting "Honolulu" and "Washington"), and by proxy encourages the language to be informative to a listener in context. By combining these two criteria, our contrastive decoding method produces high-quality text that satisfies the communicative goal of transferring relevant but not predictable information.
4.1 Special Cases of Contrastive Decoding
Maximum probability. Setting the amateur LM to a uniform distribution reduces CD to maximizing log-probabilities under the expert LM.
N-gram blocking. If we set the amateur LM as an n-gram model whose n-gram counts are updated to fit the generated prefix, this yields a decoding algorithm with soft n-gram blocking. If we also set the amateur temperature to be very small, then it approaches the canonical heuristic of forbidding repeated n-grams (Paulus et al., 2018).
Diverse decoding. If we use the same LM as both amateur and expert and restrict the context window of the amateur LM (§3.4), our method is equivalent to the MMI decoding objective (Li et al., 2016) sometimes used in dialog systems, which explicitly maximizes the pointwise mutual information between x_pre and x_cont.
5 Experimental Setup
5.1 Datasets and Metrics
We evaluate on three domains for open-ended text generation: news, Wikipedia, and stories. For the news domain, we use news articles from Wikinews;2 for the Wikipedia domain, we use the WikiText-103 dataset (Merity et al., 2017); and for the story domain, we use the BookCorpus (Zhu et al., 2015) (Project Gutenberg split).
We use the first 32 words of each passage as the prompt, and decode 256 tokens for the continuation. We evaluate generated text with both automatic and human evaluation.
Diversity. This metric aggregates n-gram repetition rates:

\text{DIV} = \prod_{n=2}^{4} \frac{|\text{unique } n\text{-grams}(x_{\text{cont}})|}{|\text{total } n\text{-grams}(x_{\text{cont}})|}.

A low diversity score suggests the model suffers from repetition, and a high diversity score means the generated text is lexically diverse.
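A small self-contained sketch of this metric (our illustration; implementation details such as tokenization may differ from the paper's evaluation code):

```python
def div_score(text: str) -> float:
    """DIV = product over n in {2, 3, 4} of |unique n-grams| / |total n-grams|."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    score = 1.0
    for n in range(2, 5):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams:
            score *= len(set(ngrams)) / len(ngrams)
    return score
```

A heavily repetitive continuation drives each factor, and hence the product, toward zero, while lexically varied text keeps the score close to one.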
MAUVE. The MAUVE score (Pillutla et al., 2021), where higher is better, measures the distribution similarity between the set of generated texts and the set of gold references.
Coherence. We follow Su et al. (2022) and approximate coherence by the cosine similarity between the sentence embeddings of the prompt x_pre and the generated continuation x_cont:

\text{COH}(x_{\text{cont}}, x_{\text{pre}}) = \frac{\text{EMB}(x_{\text{pre}}) \cdot \text{EMB}(x_{\text{cont}})}{\|\text{EMB}(x_{\text{pre}})\| \cdot \|\text{EMB}(x_{\text{cont}})\|},

where EMB(x) is the pre-trained SimCSE sentence embedding (Gao et al., 2021).
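A sketch of computing this score, assuming the publicly released SimCSE checkpoint princeton-nlp/sup-simcse-bert-base-uncased and its [CLS] pooler output as the sentence embedding (the paper's exact embedding setup may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: a public SimCSE checkpoint; the paper's exact choice may differ.
NAME = "princeton-nlp/sup-simcse-bert-base-uncased"
tok = AutoTokenizer.from_pretrained(NAME)
enc = AutoModel.from_pretrained(NAME).eval()

def embed(text: str) -> torch.Tensor:
    """Return a single sentence embedding (the [CLS] pooler output)."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return enc(**inputs).pooler_output[0]

def coherence(prompt: str, continuation: str) -> float:
    """Cosine similarity between prompt and continuation embeddings (COH)."""
    return torch.nn.functional.cosine_similarity(
        embed(prompt), embed(continuation), dim=0).item()
```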
Human Eval. In order to evaluate the quality of
the generated text, we consider two critical aspects:
fluency and coherence. A fluent piece of text is
written in grammatical English and has a natural
flow (e.g. excluding unnatural repetition or web
formatting). A coherent piece of text should stay
on topic with the prompt and avoid unnatural topic
drift. We ask Amazon Mechanical Turkers to read
two continuations (A and B) of the same prompt,
and choose the more fluent/coherent continuation
or decide they are similar.
5.2 Baselines
We compare contrastive decoding with three sampling methods, each with the recommended hyperparameters: nucleus sampling (p = 0.95), top-k sampling (k = 50), and typical decoding (Meister et al., 2022) (τ = 0.95); and with two search-based methods: greedy (max prob) decoding, which uses log p_EXP as the objective, and contrastive search (CS) (Su et al., 2022; Su and Collier, 2022). Among them, nucleus sampling is the standard approach for open-ended text generation whose performance has been verified in various domains (Holtzman et al., 2020; DeLucia et al., 2020), and typical decoding is a recently proposed approach that excels in lexical diversity (Meister et al., 2022). We therefore conduct human evaluation by comparing CD against these two methods.

2Wikinews from http://www.wikinews.org
5.3 Models and Hyperparameters
In order to demonstrate that our approach generalizes across various LM families and sizes, we consider GPT-2 XL (1.5B), OPT (6.7B) and OPT (13B) as expert LMs and employ the smallest LM in each respective family as the amateur: GPT-2 small (100M) and OPT (125M).

Recall that contrastive decoding introduces two hyperparameters: α adjusts the plausibility threshold, and τ is the temperature of the amateur LM. We always set α = 0.1 for the main results in the paper; we find that this setting is quite robust and generalizes across various domains. For the OPT experiments, we set the amateur temperature to 1.0, and for the GPT-2 experiments, we set the amateur temperature to 0.5. We use a beam size of 5. We also study the impact of these hyperparameters in the ablation study (§7.2), and we find that our method is robust to various hyperparameter values.
6 Main Results
6.1 Automatic Evaluation
As shown in Table 1, contrastive decoding outperforms all other decoding baselines in MAUVE score and coherence score (COH) across three different domains (news, Wikipedia, stories) and two model sizes (1.5B, 13B). Contrastive decoding achieves comparable or slightly worse diversity compared to nucleus and typical sampling, but it achieves substantially better diversity than other search-based methods.

Typical decoding and nucleus sampling produce lexically diverse text by choosing low-probability tokens, at the expense of topic drift. For instance, in the story domain we observe the largest diversity gap between contrastive decoding and nucleus sampling (0.83 vs. 0.94) for the 1.5B model, but we find that the gap shrinks (0.89 vs. 0.93) as