
sify the sentiment of a tweet, we use LMs to analyze the toxicity of candidate generations in real time. Our approach can be viewed as an exemplar-based way of defining experts that capture desirable and undesirable attributes of generated text. We term this technique contrastive contexts, and note that it reduces the problem of creating experts to one of prompt engineering (Reynolds and McDonell, 2021).
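To make the idea concrete, the sketch below shows one way such contrastive contexts could be used to score candidate generations at decoding time. The specific prompts, the log-probability-difference scoring rule, and all function names are illustrative assumptions, not our exact formulation, which we develop later within the Bayesian attribute framework.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch: two hand-written "contrastive contexts" act as an
# expert (desirable) and an anti-expert (undesirable). The prompts and the
# scoring rule here are assumptions for illustration only.
EXPERT_CTX = "The following text is polite and non-toxic:\n"
ANTI_CTX = "The following text is rude and toxic:\n"

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def context_logprob(context: str, continuation: str) -> float:
    """Log-probability of `continuation` given `context` under the LM."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, cont_ids], dim=1)
    # Logits at position i predict token i+1, so drop the final position.
    log_probs = lm(ids).logits[:, :-1, :].log_softmax(-1)
    # Keep only the positions that predict the continuation tokens.
    cont_positions = log_probs[:, ctx_ids.size(1) - 1 :, :]
    token_lp = cont_positions.gather(-1, cont_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

def contrastive_score(continuation: str) -> float:
    """Higher when the expert context explains the text better than the anti-expert."""
    return context_logprob(EXPERT_CTX, continuation) - context_logprob(ANTI_CTX, continuation)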
However, our conditioning contexts are quite large, which motivated this work. We use prompt compression to mimic an uncompressed prompt (hereafter referred to as a "hard" prompt) as closely as possible, thereby saving both computation and space in the context window. Our results demonstrate that this can be very effective and, surprisingly, that complex prompts can be reduced to a single token and still be useful for toxicity reduction, often with better fluency than hard prompts.
The contributions of this paper are three-fold: first, we introduce and formalize the idea of prompt compression; second, we introduce and formalize the method of contrastive contexts within the Bayesian attribute framework; third, we experimentally evaluate our methods, refine the technique based on empirical observations, and contribute a careful study of effectiveness as model size varies.
2 Background and Related Work
To the best of our knowledge, this is the first work to directly explore prompt compression. However, our work builds on the original soft prompt ideas of Lester et al. (2021). It is also somewhat related to distillation, in which one model is trained to mimic another by matching logits (Gou et al., 2021).
Usually, LMs take text as input, which a tokenizer splits into discrete tokens. Each token is then mapped to a learned embedding, which is used as input to a transformer (Vaswani et al., 2017). The idea of soft prompts (Lester et al., 2021) is to bypass discrete tokens and their pre-trained embeddings and instead directly learn a sequence of embeddings via backpropagation. These learned embeddings are fed directly to the transformer and need not correspond to any actual language tokens.
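To illustrate the mechanism, the following is a minimal PyTorch-style sketch: a block of freely trainable vectors is prepended to the embedded input before it is passed to the (typically frozen) transformer, so the prompt vectors need not correspond to any entry of the vocabulary embedding table. The class name and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """A block of n freely trainable embedding vectors prepended to the input.

    Minimal sketch; dimensions and the frozen-transformer assumption are
    illustrative, not tied to a specific model.
    """
    def __init__(self, n_tokens: int, embed_dim: int):
        super().__init__()
        # Learned via backpropagation; never produced by the tokenizer.
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) from the model's embedding layer.
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

# Usage: embed the discrete tokens as usual, then feed the concatenated
# sequence to the transformer via its `inputs_embeds` interface.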
As the centerpiece application of prompt compression, we explore generative controllability (Keskar et al., 2019b) and toxicity reduction in language models.
Figure 2: KL divergence of compressed prompts as a function of the number of tokens n (x-axis: number of tokens in compressed prompt; y-axis: expected KL divergence). Prompts are randomly sampled from the Pile (mean words = 916, median words = 274, median characters = 1849).
Our method is most closely related to decoding-time algorithms, such as GeDi (Krause et al., 2020), which uses Bayes' rule and discriminative models to steer generation towards a certain attribute, and PPLM (Dathathri et al., 2019), which uses an estimated gradient with respect to the desired attribute to steer the LM's internal representation at generation time.
Other methods steer generation by fine-tuning language models with the classical language modeling objective. DExperts (Liu et al., 2021) combines experts and anti-experts in a product-of-experts model to reduce the toxicity of LMs. Additionally, reinforcement learning approaches show strong performance at steering language models (Stiennon et al., 2020). By providing rewards, methods such as PPO (Schulman et al., 2017) and Quark (Lu et al., 2022) represent the current best performance at reducing LM toxicity while maintaining fluency. These methods, however, require a predetermined reward function, which may or may not be feasible depending on the context.
3 Prompt Compression
Here, we introduce and explore the idea of prompt compression, whereby the parameters of a soft prompt (Lester et al., 2021) are trained to mimic a fixed hard prompt as closely as possible.
The intuition of our idea is simple: conditioning a LM on a hard prompt $x_h$ induces a distribution $p(x_t, \dots, x_{t+k} \mid x_h)$ over all possible subsequent sequences of tokens $x_t, \dots, x_{t+k}$ for all $k$. To simplify notation, let $x_{t:k} = x_t, \dots, x_{t+k}$.
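As a concrete illustration of this matching idea, the following is a minimal PyTorch-style sketch in which a trainable soft prompt is optimized so that the model's token-level distributions over a sampled continuation approximate those induced by the hard prompt. The function signature, the use of `inputs_embeds`, and the token-level KL form are assumptions for illustration, not the exact objective formalized in this section.

import torch
import torch.nn.functional as F

def compression_loss(model, embed, hard_ids, soft_prompt, cont_ids):
    """KL between next-token distributions under the hard and soft prompts.

    Sketch only: `model` is a causal LM accepting `inputs_embeds` (parameters
    frozen so only `soft_prompt` is updated), `embed` its token-embedding
    layer, `hard_ids` the tokenized hard prompt x_h, `soft_prompt` an (n, d)
    trainable tensor, and `cont_ids` a sampled continuation x_{t:k} on which
    the two distributions are compared.
    """
    cont_embeds = embed(cont_ids)

    with torch.no_grad():  # teacher: condition on the hard prompt x_h
        hard_in = torch.cat([embed(hard_ids), cont_embeds], dim=1)
        log_p_hard = model(inputs_embeds=hard_in).logits.log_softmax(-1)

    # Student: condition on the trainable soft prompt instead.
    soft = soft_prompt.unsqueeze(0).expand(cont_ids.size(0), -1, -1)
    soft_in = torch.cat([soft, cont_embeds], dim=1)
    log_p_soft = model(inputs_embeds=soft_in).logits.log_softmax(-1)

    # Keep only the positions that predict the k continuation tokens
    # (logits at position i predict token i+1).
    k = cont_ids.size(1)
    return F.kl_div(log_p_soft[:, -(k + 1):-1, :],
                    log_p_hard[:, -(k + 1):-1, :],
                    log_target=True, reduction="batchmean")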
The schematic of the idea is shown in Fig. 1. For-