Prompt Compression and Contrastive Conditioning for
Controllability and Toxicity Reduction in Language Models
David Wingate
Brigham Young University
wingated@cs.byu.edu
Mohammad Shoeybi
Nvidia, Inc.
mshoeybi@nvidia.com
Taylor Sorensen
University of Washington
tsor13@cs.washington.edu
Abstract

We explore the idea of compressing the prompts used to condition language models, and show that compressed prompts can retain a substantive amount of information about the original prompt. For severely compressed prompts, while fine-grained information is lost, abstract information and general sentiments can be retained with surprisingly few parameters, which can be useful in the context of decode-time algorithms for controllability and toxicity reduction. We explore contrastive conditioning to steer language model generation towards desirable text and away from undesirable text, and find that some complex prompts can be effectively compressed into a single token to guide generation. We also show that compressed prompts are largely compositional, and can be constructed such that they can be used to control independent aspects of generated text.
1 Introduction
Language models (LMs), such as GPT-2 (Radford et al., 2018, 2019a), BERT (Devlin et al., 2018), T5 (Raffel et al., 2020), or GPT-3 (Brown et al., 2020), exhibit a remarkable ability to capture patterns of grammar, vocabulary, cultural knowledge, and conversational rhythms present in natural language. Formally, a LM is a conditional distribution over tokens $p(x_t \mid x_1, \cdots, x_{t-1})$, with each token $x_t \in \mathcal{V}$ for some vocabulary $\mathcal{V}$. Throughout this paper, we will refer to $x_h = x_1, \cdots, x_{t-1}$ as the prompt.
This paper explores prompt compression: the idea that the text $x_h$ used to condition a LM can be approximately represented by a much smaller set of carefully chosen weights, using the framework of soft prompts (Lester et al., 2021). We begin by establishing some basic properties of compressed prompts, and importantly show that while highly compressed prompts lose fine-grained information about the prompt, they can retain general, abstract information. This motivates our central application: to use such compressed prompts in a Bayesian attribute framework to steer text generation, with specific application to toxicity reduction.

∗Work done while at Nvidia, Inc.
†Work done while at Brigham Young University

Figure 1: Schematic of prompt compression. The hard prompt and the soft prompt are each fed to the LM, and the weights of the soft prompt are tuned to minimize the KL divergence between the resulting distributions, for all $x_{t:k}$.
To motivate this more deeply, we briefly sketch how compressed prompts can be used in toxicity reduction. Efforts to reduce toxicity and bias generally follow one of two strategies: the first is to train or fine-tune LMs on carefully curated data, either tagging or labelling it in special ways (Keskar et al., 2019a; Lu et al., 2022) or using data known to be “clean”. The second is to “steer” generated token probabilities away from toxic generations (Krause et al., 2020; Liu et al., 2021) and towards text with known, desirable properties.
Following previous work, we steer LM probabilities by using a Bayesian attribute classifier framework that involves scoring candidate tokens with different experts. As an independent contribution, we explore the idea of simply using conditioning text to construct such experts by leveraging the few-shot modeling abilities of LMs (Radford et al., 2019a; Brown et al., 2020): given a few examples of text containing a pattern of interest, language models are capable of “analyzing” such examples and assigning high probability to subsequent text exhibiting the same pattern. Thus, in the same way that a language model can, for example, classify the sentiment of a tweet, we use LMs to analyze the toxicity of candidate generations in real time. Our method can be considered an exemplar-based method of defining experts that capture desirable and undesirable attributes of generated text. We term this technique contrastive contexts, and note that it reduces the problem of creating experts to one of prompt engineering (Reynolds and McDonell, 2021).
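To make contrastive contexts concrete, the sketch below scores the next token under a base LM and under two exemplar-defined experts, then combines them. The specific prompts, the GPT-2 checkpoint, the combination rule, and the steering weight alpha are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Illustrative contrastive contexts: exemplars of desirable vs. undesirable text.
desirable_ctx = "The following text is polite, kind and respectful:\n"
undesirable_ctx = "The following text is rude, toxic and offensive:\n"
generation_so_far = "You are a"

def next_token_logprobs(context, continuation):
    """Next-token log-probabilities given a conditioning context plus the text so far."""
    ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)

# Base LM distribution, and two "experts" defined purely by conditioning text.
base = next_token_logprobs("", generation_so_far)
expert = next_token_logprobs(desirable_ctx, generation_so_far)
anti_expert = next_token_logprobs(undesirable_ctx, generation_so_far)

# Steer towards the desirable expert and away from the undesirable one.
alpha = 1.0  # assumed steering strength
steered = torch.log_softmax(base + alpha * (expert - anti_expert), dim=-1)
print(tokenizer.decode([int(torch.argmax(steered))]))
```

In practice this scoring is applied at every decoding step, so the expert contexts are re-evaluated for each partial generation; this is precisely what makes long conditioning contexts expensive and compression attractive.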
However, our conditioning contexts are quite large, which motivated this work. We use prompt compression to mimic an uncompressed prompt (hereafter referred to as a “hard” prompt) as closely as possible, thereby saving both computation and space in the context window. Our results demonstrate that this can be very effective and, in a very surprising finding, that complex prompts can be reduced to a single token and still be useful for toxicity reduction, often with better fluency compared to hard prompts.

The contributions of this paper are three-fold: first, we introduce and formalize the idea of prompt compression; second, we introduce and formalize the method of contrastive contexts in the Bayesian attribute framework; third, we experimentally evaluate our methods, refine the technique based on various empirical observations, and contribute a careful study of effectiveness as model size varies.
2 Background and Related Work

To the best of our knowledge, this is the first work to directly explore prompt compression. However, our work is based on the original soft prompt ideas of Lester et al. (2021). It is also somewhat related to distillation, where one model is trained to mimic another by matching logits (Gou et al., 2021).

Usually, LMs take text as input, which is tokenized into discrete tokens by a tokenizer. Each token is then mapped to a learned embedding, which is used as input to a transformer (Vaswani et al., 2017). The idea of soft prompts (Lester et al., 2021) is to bypass the need to use discrete tokens with pre-trained embeddings and instead directly learn a series of embeddings via backpropagation. These learned embeddings are then fed directly to the transformer and do not need to correspond to any actual language tokens.
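A minimal sketch of this mechanism, assuming the Huggingface GPT-2 implementation and its inputs_embeds interface (the soft-prompt length, initialization scale, and variable names are illustrative):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

n_soft_tokens = 10                       # illustrative soft-prompt length
embed_dim = model.config.n_embd          # 768 for GPT-2 small

# A trainable block of embeddings; these need not correspond to vocabulary tokens.
soft_prompt = torch.nn.Parameter(torch.randn(n_soft_tokens, embed_dim) * 0.02)

def forward_with_soft_prompt(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    token_embeds = model.transformer.wte(ids)                      # (1, T, d)
    inputs = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    return model(inputs_embeds=inputs).logits                      # (1, n+T, |V|)

logits = forward_with_soft_prompt("Soft prompt embeddings precede this text.")
```

Because the soft prompt lives in embedding space rather than vocabulary space, its n vectors can be optimized freely by gradient descent while the LM's own weights stay fixed.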
As the centerpiece application of prompt compression, we explore generative controllability (Keskar et al., 2019b) and toxicity reduction in language models.
Figure 2: Expected KL divergence of compressed prompts as a function of the number of tokens n in the compressed prompt. Prompts are randomly sampled from the Pile (mean words = 916, median words = 274, median characters = 1849).
Our method is most closely related to decode-time algorithms, such as GeDi (Krause et al., 2020), which uses Bayes' rule and discriminative models to steer the generation towards a certain attribute, and PPLM (Dathathri et al., 2019), which uses an estimated gradient with respect to the desired attribute to steer the LM's internal representation at generation time.
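For orientation, the Bayes-rule factorization that underlies this family of decode-time methods can be written as follows (our generic notation with an attribute variable a; not the exact parameterization used by GeDi):

```latex
% Generic Bayes-rule factorization behind attribute-steered decoding:
% the base next-token distribution is reweighted by an attribute
% classifier evaluated on each candidate continuation.
\[
  p(x_t \mid x_{<t}, a) \;\propto\; p(x_t \mid x_{<t}) \, p(a \mid x_1, \ldots, x_t)
\]
```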
Other methods are based on fine-tuning language models with the classical language modeling objective to steer generation. DExperts (Liu et al., 2021) combines experts and anti-experts in a product-of-experts model to reduce the toxicity of LMs. Additionally, reinforcement learning approaches show strong performance at steering language models (Stiennon et al., 2020). By providing rewards, methods such as PPO (Schulman et al., 2017) and Quark (Lu et al., 2022) represent the current best performance at reducing LM toxicity while maintaining fluency. These methods, however, require a predetermined reward function, which may or may not be feasible depending on the context.
3 Prompt Compression
Here, we introduce and explore the idea of prompt compression, whereby the parameters of a soft prompt (Lester et al., 2021) are trained to mimic a fixed hard prompt as closely as possible.

The intuition of our idea is simple: conditioning a LM on a hard prompt $x_h$ induces a distribution $p(x_t, \cdots, x_{t+k} \mid x_h)$ over all possible subsequent sequences of tokens $x_t, \cdots, x_{t+k}$ for all $k$. To simplify notation, let $x_{t:k} = x_t, \cdots, x_{t+k}$. The schematic of the idea is shown in Fig. 1. Formally, a soft prompt is a block of weights $\theta_n$ that is prepended to the embeddings of the tokenized sequence $x_{t:k}$, and which is then fed through the transformer layers of the language model. The soft prompt induces a modified distribution over $x_{t:k}$, which we represent as $q(x_{t:k} \mid \theta_n)$. Here, $n$ is the number of tokens in the soft prompt (which do not necessarily correspond to natural language tokens). To compress prompt $x_h$, we train the soft prompt weights to minimize the following objective:

$$\min_{\theta_n} \; \mathbb{E}_{x_{t:k}}\!\left[\, \mathrm{KL}\big(p(x_{t:k} \mid x_h) \,\|\, q(x_{t:k} \mid \theta_n)\big) \right] \tag{1}$$
where the sequences $x_{t:k}$ are sentences of various lengths and content drawn from a diverse training set. We optimize this objective using the Adam optimizer for 75,000 steps of training with a linear learning rate schedule starting at 0.1, with the $x_{t:k}$ drawn randomly from The Pile (Gao et al., 2021). Training a single prompt requires about 1-4 GPU-hours, depending on the computational complexity of running the LM. All prompt training was done using either a single A100 or V100 GPU.
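A compact sketch of this compression loop, assuming the Huggingface GPT-2 stack; the placeholder hard prompt, the stand-in training sentence, the constant learning rate, and the reduced step count are illustrative simplifications of the setup described above.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for param in model.parameters():
    param.requires_grad_(False)           # only the soft prompt is trained

hard_prompt = "An example hard prompt to be compressed."      # placeholder x_h
hard_ids = tokenizer(hard_prompt, return_tensors="pt").input_ids

n, d = 1, model.config.n_embd             # a single-token soft prompt, theta_n
soft_prompt = torch.nn.Parameter(torch.randn(n, d) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=0.1)

def continuation_logprobs(prefix_embeds, seq_ids):
    """Per-position next-token log-probs for seq_ids, conditioned on prefix_embeds."""
    seq_embeds = model.transformer.wte(seq_ids)
    inputs = torch.cat([prefix_embeds, seq_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # keep only the positions that predict the continuation tokens x_{t:k}
    return torch.log_softmax(logits[:, prefix_embeds.size(1) - 1 : -1], dim=-1)

for step in range(1000):                                         # 75,000 steps in practice
    text = "A training sentence standing in for a sample from The Pile."
    x = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():                                        # teacher: hard prompt
        p_log = continuation_logprobs(model.transformer.wte(hard_ids), x)
    q_log = continuation_logprobs(soft_prompt.unsqueeze(0), x)   # student: soft prompt
    loss = F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")  # KL(p || q)
    opt.zero_grad(); loss.backward(); opt.step()
```

In our actual runs the continuations $x_{t:k}$ are sampled from The Pile and the learning rate is decayed linearly over the 75,000 steps.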
While training a compressed prompt can be expensive, the gains are found at inference time. Using a compressed prompt instead of a hard prompt reduces the length of the context, which scales down the needed computation according to the transformer's attention mechanism, which is $O(n^2)$ in context length. This also could allow long contexts to be compressed and prepended to longer inputs than was previously possible. Once trained, these compressed prompts could be shared to create a library of efficient contexts.
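As a rough worked example with hypothetical sizes (not measurements from our experiments), compressing a 300-token hard prompt to a single soft token ahead of a 100-token input reduces the quadratic attention cost roughly as follows:

```latex
% Illustrative attention-cost comparison (hypothetical sizes):
% a 300-token hard prompt vs. a 1-token compressed prompt, each followed
% by a 100-token input, under the transformer's O(n^2) attention cost.
\[
  \frac{(300 + 100)^2}{(1 + 100)^2} \;=\; \frac{160000}{10201} \;\approx\; 15.7
\]
```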
4 Experiment Set #1: Establishing Basic Properties of Compressed Prompts

We begin by exploring the properties of compressed prompts. First, we show that conditioning on a hard prompt and conditioning on its compressed prompt produce qualitatively similar generations, although this equivalence degrades as the prompt is compressed more and more. Second, we qualitatively explore what happens to fine-grained information as a prompt is compressed more and more.
Models and codebase. All experiments were conducted using the Huggingface¹ (Wolf et al., 2019) implementation of the GPT-2 (117M parameters), GPT-2 medium (345M), GPT-2 large (774M) and GPT-2 xl (1.5B) models.

¹https://github.com/huggingface/transformers
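For reference, these checkpoints can be pulled directly from the Huggingface hub (a small convenience snippet; loading all four at once is only for illustration and needs several GB of memory):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# The four GPT-2 checkpoints used in the experiments, as named on the Huggingface hub.
checkpoints = ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]
models = {name: GPT2LMHeadModel.from_pretrained(name) for name in checkpoints}
tokenizers = {name: GPT2Tokenizer.from_pretrained(name) for name in checkpoints}
```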
Figure 3: Reading comprehension (QA) accuracy, per question, as a function of the number of tokens in the compressed prompt. Accuracy is averaged over 1000 completions and each line represents a single question. As expected, performance degrades nearly monotonically as the number of tokens in the compressed prompt is decreased. General questions degrade less than questions about specific details. We used GPT-2 xl for this experiment.
4.1 Comparing hard and compressed prompts

Fig. 2 shows the KL divergence between the original prompt's and the compressed prompts' output distributions for randomly sampled sentences from the Pile (Gao et al., 2021). As the figure shows, as the size of the compressed prompt increases, the KL divergence monotonically decreases for all models. This implies, as expected, that the more context allowed in a soft prompt, the better the soft prompt does at mimicking the full context.

Additionally, note that the magnitude of the KL divergence is similar across models for a given soft prompt size $n$. This shows that this method of context compression works well on a variety of model sizes (124M–1.5B parameters).
4.2 Exploring information retention

As a prompt is compressed more and more, information in the original prompt must be lost. As the training process specifically attempts to match the predictive distribution over completions for a prompt, the question arises: what information is preserved, and what is discarded?

Reading Comprehension Task. To assess this, we construct the following experiment: given a reading comprehension task that involves a paragraph $p$ of text and a series of questions, how do the answers to those questions degrade as a function of compression? Specifically, we look at questions about fine-grained information (specific details that