
achieve sufficiently high probabilities under the ex-
pert LM. Then we score the remaining tokens based
on the amount of contrast they demonstrate, according to $\log p_{\text{EXP}}(x_i \mid x_{<i}) - \log p_{\text{AMA}}(x_i \mid x_{<i})$. As
a result, we end up selecting plausible tokens under
the expert LM that least resemble the amateur LM.
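To make this concrete, the following is a minimal PyTorch sketch of the per-token score, assuming next-token logits from both models are already available; the function name and the default threshold `alpha` (which plays the role of the expert-based plausibility constraint) are illustrative, not the paper's released code.

```python
import torch

def cd_scores(expert_logits, amateur_logits, alpha=0.1):
    """Token-level contrastive decoding score (sketch).

    expert_logits, amateur_logits: 1-D tensors of next-token logits
    from the expert and amateur LMs for the same prefix x_{<i}.
    alpha: plausibility threshold (illustrative default).
    """
    log_p_exp = torch.log_softmax(expert_logits, dim=-1)
    log_p_ama = torch.log_softmax(amateur_logits, dim=-1)

    # Plausibility constraint: keep only tokens whose expert probability
    # is within a factor alpha of the most probable token.
    cutoff = log_p_exp.max() + torch.log(torch.tensor(alpha))
    plausible = log_p_exp >= cutoff

    # Contrast score: among plausible tokens, prefer those the amateur
    # finds least likely.
    scores = log_p_exp - log_p_ama
    return scores.masked_fill(~plausible, float("-inf"))
```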
3.4 Choice of Amateur
The choice of amateur LM is an important decision
for contrastive decoding. As discussed in §3.1,
we should choose amateur LMs that exhibit the
behaviors we would like to downweight from the
expert LM. Here, we consider three aspects:
Scale. Smaller LMs have lower modeling capa-
city and are more prone to errors. Therefore, we
choose the amateur LM to be the smallest model
in the same family as the expert LM. For example,
for the OPT-13B expert, we choose OPT-125M as the
amateur; for the GPT-2 XL expert, we choose GPT-2
small as the amateur. We verify this design choice
in §7.1. On the extreme end, employing n-gram
models yields an amateur LM of extremely low
capacity. But this choice hurts generation qual-
ity, because n-gram LMs incur too many errors to
identify similar failure modes of the expert LM.
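As an illustration of this pairing (not the paper's released code), one might load an expert–amateur pair from the same model family with HuggingFace Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Expert: a large model; amateur: the smallest model in the same family.
tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")       # shared GPT-2 vocabulary
expert = AutoModelForCausalLM.from_pretrained("gpt2-xl")   # GPT-2 XL (1.5B parameters)
amateur = AutoModelForCausalLM.from_pretrained("gpt2")     # GPT-2 small (124M parameters)
```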
Temperature. We can manipulate the amateur
LM behavior by tuning its temperature $\tau$. For example, applying a high temperature ($\tau > 1$) to the
amateur LM results in flatter distributions; applying a low temperature ($\tau$ close to $0$) highlights the
mode of the amateur distribution, which is more
prone to errors (e.g. repetition). Therefore, we
manipulate the temperature of the amateur LM to
adjust the amateur behavior that will be penalized
in contrastive decoding. In §7.2, we study the impact of $\tau$ on generation quality and set $\tau$ to $0.5$
or $1.0$ for our main experiments.
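Concretely, temperature scales the amateur logits before the softmax; a sketch of the modified amateur term, under the same assumptions as the scoring sketch above:

```python
import torch

def amateur_log_probs(amateur_logits, tau=0.5):
    """Amateur next-token log-probabilities at temperature tau (sketch).

    tau > 1 flattens the amateur distribution; tau close to 0 sharpens
    it around the amateur's mode (which often exhibits repetition).
    """
    return torch.log_softmax(amateur_logits / tau, dim=-1)
```

This term would replace `log_p_ama` in the earlier scoring sketch.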
Context window. We can also weaken capacity
by restricting the context window of the amateur
LM (Li et al., 2016). For instance, we can allow the amateur LM to condition only on the last token
of $x_{\text{pre}}$, while we allow the expert LM to condition
on the entire $x_{\text{pre}}$. In other words, we decode from
$\log \frac{p_{\text{EXP}}(x_{\text{cont}} \mid x_{1:n})}{p_{\text{AMA}}(x_{\text{cont}} \mid x_n)}$. By conditioning the amateur LM
only on partial prompts, the coherence of the amateur LM is weakened, and contrastive decoding
produces more coherent text by highlighting the
coherent nature of the expert LM. In §7.5, we
study the impact of this design choice.
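A minimal sketch of this restriction, assuming token-id tensors for the prompt $x_{\text{pre}}$ and the continuation generated so far (names are illustrative):

```python
import torch

def amateur_context(prompt_ids, generated_ids, restrict=True):
    """Conditioning sequence fed to the amateur at one decoding step.

    prompt_ids: (1, n) token ids of x_pre; generated_ids: (1, m) token
    ids of the continuation so far. With restrict=True the amateur sees
    only the last prompt token x_n (plus the continuation), while the
    expert always conditions on the full x_{1:n} prefix.
    """
    prefix = prompt_ids[:, -1:] if restrict else prompt_ids
    return torch.cat([prefix, generated_ids], dim=-1)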
4 CD as Pragmatic Communication
Having formally described contrastive decoding,
we now provide a pragmatic interpretation, justify-
ing its validity through pragmatic communication
goals.
A line of work in pragmatics (Grice, 1975) char-
acterizes communication as a cooperative process
between speakers and listeners. Several of these
formalisms (Horn, 1984; Levinson, 2000) describe
a tradeoff between speakers and listeners, where
a speaker should generally produce language that
is high quality (e.g. truthful, fluent, and relevant)
while also being informative to a listener.
Our contrastive objective can be motivated by
this tradeoff, with our expert and amateur LMs
modeling a knowledgeable speaker and a less-informed listener: (1) Upweighting tokens by $p_{\text{EXP}}$
and using our expert-based plausibility constraints
generates tokens that have high probability under
the expert LM, encouraging generated text to be
fluent and relevant (e.g. upweighting ‘1961’ in Fig-
ure 1). (2) Downweighting tokens by $p_{\text{AMA}}$ suppresses language that is predictable by (i.e. less
informative to) the amateur LM (e.g. downweight-
ing ‘Honolulu’ and ‘Washington’), and by proxy
encourages the language to be informative to a
listener in context. By combining these two cri-
teria, our contrastive decoding method produces
high quality text that satisfies the communicative
goal of transferring relevant but not predictable
information.
4.1 Special Cases of Contrastive Decoding
Maximum probability. Setting the amateur LM
to a uniform distribution reduces CD to maximizing
log-probabilities under the expert LM, since the uniform amateur only subtracts a constant at each step and does not change the token ranking.
N-gram blocking. If we set the amateur LM as
an n-gram model whose n-gram counts are updated
to fit the generated prefix, this yields a decoding
algorithm with soft n-gram blocking. If we also
set the amateur temperature to be very small, then
it approaches the canonical heuristic of forbidding
repeated n-grams (Paulus et al., 2018).
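One way to realize such an amateur, as a rough sketch (the class and the add-one smoothing scheme are illustrative assumptions, not the paper's implementation):

```python
from collections import defaultdict

class NGramAmateur:
    """n-gram amateur whose counts are re-fit to the generated prefix,
    yielding soft n-gram blocking when used in contrastive decoding."""

    def __init__(self, n=4, vocab_size=50257, smoothing=1.0):
        self.n = n
        self.vocab_size = vocab_size   # 50257 is GPT-2's vocabulary size
        self.smoothing = smoothing
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, token_ids):
        """Count n-grams in the current prefix (prompt + generation so far)."""
        self.counts.clear()
        for i in range(len(token_ids) - self.n + 1):
            ctx = tuple(token_ids[i:i + self.n - 1])
            nxt = token_ids[i + self.n - 1]
            self.counts[ctx][nxt] += 1

    def prob(self, context_ids, token_id):
        """Smoothed probability of token_id after the last n-1 context tokens."""
        ctx = tuple(context_ids[-(self.n - 1):])
        num = self.counts[ctx][token_id] + self.smoothing
        den = sum(self.counts[ctx].values()) + self.smoothing * self.vocab_size
        return num / den
```

With a very low amateur temperature, the contrast term then approaches a hard penalty on any token that would repeat an n-gram already present in the prefix.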
Diverse decoding. If we use the same LM as
both amateur and expert and restrict the context
window of the amateur LM (§3.4), our method
is equivalent to the MMI decoding objective (Li
et al., 2016) sometimes used in dialog systems,
which explicitly maximizes the pointwise mutual
information between $x_{\text{pre}}$ and $x_{\text{cont}}$.