
sify the sentiment of a tweet, we use LMs to analyze the toxicity of candidate generations in real time. Our approach can be viewed as an exemplar-based way of defining experts that capture desirable and undesirable attributes of generated text. We term this technique contrastive contexts, and note that it reduces the problem of creating experts to one of prompt engineering (Reynolds and McDonell, 2021).
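To make the idea concrete, the sketch below shows one way such contrastive contexts could be used to score candidate generations at decoding time. The specific prompts, the log-probability-difference scoring rule, and all function names are illustrative assumptions, not our exact formulation, which we develop later within the Bayesian attribute framework.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch: two hand-written "contrastive contexts" act as an
# expert (desirable) and an anti-expert (undesirable). The prompts and the
# scoring rule here are assumptions for illustration only.
EXPERT_CTX = "The following text is polite and non-toxic:\n"
ANTI_CTX = "The following text is rude and toxic:\n"

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def context_logprob(context: str, continuation: str) -> float:
    """Log-probability of `continuation` given `context` under the LM."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, cont_ids], dim=1)
    # Logits at position i predict token i+1, so drop the final position.
    log_probs = lm(ids).logits[:, :-1, :].log_softmax(-1)
    # Keep only the positions that predict the continuation tokens.
    cont_positions = log_probs[:, ctx_ids.size(1) - 1 :, :]
    token_lp = cont_positions.gather(-1, cont_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

def contrastive_score(continuation: str) -> float:
    """Higher when the expert context explains the text better than the anti-expert."""
    return context_logprob(EXPERT_CTX, continuation) - context_logprob(ANTI_CTX, continuation)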
However, our conditioning contexts are quite large, which motivated this work. We use prompt compression to mimic an uncompressed prompt (hereafter referred to as a "hard" prompt) as closely as possible, thereby saving both computation and space in the context window. Our results demonstrate that this can be very effective and, surprisingly, that complex prompts can be reduced to a single token and still be useful for toxicity reduction, often with better fluency than hard prompts.
The contributions of this paper are three-fold: first, we introduce and formalize the idea of prompt compression; second, we introduce and formalize the method of contrastive contexts within the Bayesian attribute framework; third, we experimentally evaluate our methods, refine the technique based on empirical observations, and contribute a careful study of effectiveness as model size varies.
2 Background and Related Work
To the best of our knowledge, this is the first work to directly explore prompt compression. However, our work builds on the original soft prompt ideas of Lester et al. (2021). It is also somewhat related to distillation, in which one model is trained to mimic another by matching logits (Gou et al., 2021).
Usually, LMs take text as input, which a tokenizer splits into discrete tokens. Each token is then mapped to a learned embedding, which is used as input to a transformer (Vaswani et al., 2017). The idea of soft prompts (Lester et al., 2021) is to bypass discrete tokens and their pre-trained embeddings and instead directly learn a sequence of embeddings via backpropagation. These learned embeddings are fed directly to the transformer and need not correspond to any actual language tokens.
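To illustrate the mechanism, the following is a minimal PyTorch-style sketch: a block of freely trainable vectors is prepended to the embedded input before it is passed to the (typically frozen) transformer, so the prompt vectors need not correspond to any entry of the vocabulary embedding table. The class name and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """A block of n freely trainable embedding vectors prepended to the input.

    Minimal sketch; dimensions and the frozen-transformer assumption are
    illustrative, not tied to a specific model.
    """
    def __init__(self, n_tokens: int, embed_dim: int):
        super().__init__()
        # Learned via backpropagation; never produced by the tokenizer.
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) from the model's embedding layer.
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

# Usage: embed the discrete tokens as usual, then feed the concatenated
# sequence to the transformer via its `inputs_embeds` interface.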
As the centerpiece application of prompt compression, we explore generative controllability (Keskar et al., 2019b) and toxicity reduction in language models.
Figure 2: KL divergence of compressed prompts as a function of the number of tokens n (x-axis: number of tokens in compressed prompt; y-axis: expected KL divergence). Prompts are randomly sampled from the Pile (mean words = 916, median words = 274, median characters = 1849).
Our method is most closely related to decoding-time algorithms, such as GeDi (Krause et al., 2020), which uses Bayes' rule and discriminative models to steer generation towards a certain attribute, and PPLM (Dathathri et al., 2019), which uses an estimated gradient with respect to the desired attribute to steer the LM's internal representation at generation time.
Other methods steer generation by fine-tuning language models with the classical language modeling objective. DExperts (Liu et al., 2021) combines experts and anti-experts in a product-of-experts model to reduce the toxicity of LMs. Additionally, reinforcement learning approaches show strong performance at steering language models (Stiennon et al., 2020). By providing rewards, methods such as PPO (Schulman et al., 2017) and Quark (Lu et al., 2022) represent the current best performance at reducing LM toxicity while maintaining fluency. These methods, however, require a predetermined reward function, which may or may not be feasible depending on the context.
3 Prompt Compression
Here, we introduce and explore the idea of prompt compression, whereby the parameters of a soft prompt (Lester et al., 2021) are trained to mimic a fixed hard prompt as closely as possible.
The intuition of our idea is simple: conditioning a LM on a hard prompt $x_h$ induces a distribution $p(x_t, \dots, x_{t+k} \mid x_h)$ over all possible subsequent sequences of tokens $x_t, \dots, x_{t+k}$ for all $k$. To simplify notation, let $x_{t:k} = x_t, \dots, x_{t+k}$.
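As a concrete illustration of this matching idea, the following is a minimal PyTorch-style sketch in which a trainable soft prompt is optimized so that the model's token-level distributions over a sampled continuation approximate those induced by the hard prompt. The function signature, the use of `inputs_embeds`, and the token-level KL form are assumptions for illustration, not the exact objective formalized in this section.

import torch
import torch.nn.functional as F

def compression_loss(model, embed, hard_ids, soft_prompt, cont_ids):
    """KL between next-token distributions under the hard and soft prompts.

    Sketch only: `model` is a causal LM accepting `inputs_embeds` (parameters
    frozen so only `soft_prompt` is updated), `embed` its token-embedding
    layer, `hard_ids` the tokenized hard prompt x_h, `soft_prompt` an (n, d)
    trainable tensor, and `cont_ids` a sampled continuation x_{t:k} on which
    the two distributions are compared.
    """
    cont_embeds = embed(cont_ids)

    with torch.no_grad():  # teacher: condition on the hard prompt x_h
        hard_in = torch.cat([embed(hard_ids), cont_embeds], dim=1)
        log_p_hard = model(inputs_embeds=hard_in).logits.log_softmax(-1)

    # Student: condition on the trainable soft prompt instead.
    soft = soft_prompt.unsqueeze(0).expand(cont_ids.size(0), -1, -1)
    soft_in = torch.cat([soft, cont_embeds], dim=1)
    log_p_soft = model(inputs_embeds=soft_in).logits.log_softmax(-1)

    # Keep only the positions that predict the k continuation tokens
    # (logits at position i predict token i+1).
    k = cont_ids.size(1)
    return F.kl_div(log_p_soft[:, -(k + 1):-1, :],
                    log_p_hard[:, -(k + 1):-1, :],
                    log_target=True, reduction="batchmean")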
The schematic of the idea is shown in Fig. 1. For-