REFEREE: Reference-Free Sentence Summarization
with Sharper Controllability through Symbolic Knowledge Distillation

Melanie Sclar¹  Peter West¹  Sachin Kumar²  Yulia Tsvetkov¹  Yejin Choi¹,³
¹Paul G. Allen School of Computer Science & Engineering, University of Washington
²Language Technologies Institute, Carnegie Mellon University
³Allen Institute for Artificial Intelligence
msclar@cs.washington.edu
Abstract

We present REFEREE, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision), while allowing direct control over the compression ratio. Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation (West et al., 2022), where latent knowledge in pre-trained language models is distilled via explicit examples sampled from the teacher models, further purified with three types of filters: length, fidelity, and Information Bottleneck. Moreover, we uniquely propose iterative distillation of knowledge, where student models from the previous iteration of distillation serve as teacher models in the next iteration. Starting from a relatively modest set of GPT3-generated summaries, we demonstrate how iterative knowledge distillation can lead to considerably smaller, but better summarizers with sharper controllability. A useful by-product of this iterative distillation process is a high-quality dataset of sentence-summary pairs with varying compression ratios. Empirical results demonstrate that the final student models vastly outperform the much larger GPT3-Instruct model in the controllability of compression ratios, without compromising the quality of the resulting summaries.¹
¹See https://github.com/msclar/referee for code, models, and data.

1 Introduction

We introduce REFEREE, a new framework for sentence summarization that works by iteratively generating and distilling knowledge into successively better models. This allows REFEREE to be [Refer]ence fr[ee]: it begins by distilling from a large language model rather than with supervised data. Yet, our method results in a more efficient, compact, and controllable summarization model than what we start with.
[Figure 1: pipeline diagram. GPT-3 generations are filtered (NLI, Length, IB) by REFEREE-DISTILL; its outputs, paired with control codes, train REFEREE-CONTROL.]
Figure 1: Our method results in high-quality, reference-free compact summarizers. We begin by using a large language model (e.g., GPT-3) to generate many summaries that demonstrate different aspects we may want in a summary; gray represents an aspect well-represented in these generations, while black is under-represented. We first use REFEREE-DISTILL to iteratively filter and train summarizers that better represent these desirable aspects, e.g., shorter summary length. We then use generations from REFEREE-DISTILL to train a model in which these aspects are controllable: this is REFEREE-CONTROL.
Our work follows the paradigm of Symbolic Knowledge Distillation (West et al., 2022), which transfers implicit knowledge from a massive language model to a considerably smaller student model by explicitly generating knowledge in textual form. Unlike traditional knowledge distillation (Hinton et al., 2015), where the teacher model and the student model are of the same type, symbolic knowledge distillation allows the student model to be of a different type.

Our work differs from West et al. (2022) in three key aspects.
First, our distillation is iterative: each student model becomes a teacher in successive rounds, refining and improving summarization at every step. Second, REFEREE controls for more than just overall quality, improving multiple aspects of the model in each round, such as length, fidelity, and information bottleneck (Tishby et al., 1999), and then allows explicit length control at generation time. Third, our work is the first to show that reference-free, controlled sentence summarization can be formulated as symbolic knowledge distillation.

REFEREE works in two phases, illustrated in Figure 1. First, REFEREE-DISTILL uses a modest number of generated summaries from GPT-3 (Brown et al., 2020) to produce high-quality and compact summarizers (Goyal et al., 2022). We follow an iterative approach; in each iteration we filter generations for desirable qualities, re-train a new and better summarizer, and finally generate new summaries for the next round. Each round amplifies the effects of the previous rounds, improving notions of summary quality like entailment or shorter length. Second, REFEREE-CONTROL uses these iteratively distilled summaries to train a model with explicit control: in our experiments, we use progressively shortened generations from each iteration to train a final summarizer with explicit length control.

We find that REFEREE demonstrates compelling empirical results compared to competitive baselines. REFEREE-DISTILL, even without explicit length control, is able to generate shorter summaries with more consistency and equal quality compared with the original teacher model (GPT-3, 16x larger in size) as well as a supervised model. Moreover, REFEREE-CONTROL, which has more direct length control baked in, demonstrates a sharp degree of control over length, and succeeds at generating high-quality summaries at specified lengths with significantly higher accuracy than GPT-3. In sum, the promising empirical results of REFEREE encourage further investigation into extending the framework of symbolic knowledge distillation for reference-free, controlled text summarization.
2 Methods

We first describe REFEREE-DISTILL (see §2.1), an iterative procedure to promote specific behaviors that may not be prevalent in the original data, while maintaining summary quality. We explore two different filters, detailed in §2.2. We then detail REFEREE-CONTROL (see §2.3), a model that separates summaries into categorical variables and is iteratively trained to summarize a given sentence within the desired category (e.g., a range of compression ratios). In this work we only consider categories that reflect different compression ratios, but the same approach could be applied to other types of control categories, such as style.
2.1 Iterative Symbolic Knowledge Distillation: REFEREE-DISTILL

Let $\mathcal{D} = \mathcal{D}_0 \cup \ldots \cup \mathcal{D}_t$ denote a sentence corpus without reference summaries. We start with a teacher model (GPT3-Instruct Curie) from which we want to distill summarization knowledge under a fixed budget. Using $\mathcal{D}_0$, a small subset of $\mathcal{D}$, we first generate a dataset of sentence-summary pairs ($\mathcal{C}_0$) by few-shot prompting the teacher and automatically filtering low-quality generations. Filters are detailed in §2.2. Throughout the whole training procedure, we store each entry $(s, s')$ as "$s$ TL;DR: $s'$ <eos>". Here, <eos> denotes end of sequence and TL;DR: is a separator that has been shown to encourage summarization behavior (Radford et al., 2019).

Let $M_0$ be a pre-trained model significantly smaller than GPT-3 (GPT2-Large in our experiments). Using the seed dataset $\mathcal{C}_0$, we train a student model $M_1$ by fine-tuning $M_0$ with a language modeling loss. We then iteratively refine this model by (1) using it to generate summaries for a subset of $\mathcal{D}$, (2) filtering them to remove undesired behaviors, and (3) training another student model on the filtered dataset, essentially distilling a better summarizer. More precisely,

$$\mathcal{C}_i := \mathrm{filter}_i(\mathrm{generate}(M_i, \mathcal{D}_i))$$
$$M_{i+1} := \mathrm{finetune}(M_i, \mathcal{C}_i)$$

We execute this procedure for $t$ steps, creating $t+1$ different summarization datasets in the process: $\mathcal{C}_0, \mathcal{C}_1, \ldots, \mathcal{C}_t$.² We discuss two possible instantiations of $\mathrm{filter}_i$ below.

²Note that this process would stay identical if a user decided to use a human-generated summarization dataset as $\mathcal{C}_0$.
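To make the recurrence concrete, the following is a minimal sketch of the REFEREE-DISTILL loop in Python. The `generate` and `finetune` callables are hypothetical stand-ins for model sampling and language-model fine-tuning (for $i = 0$, `generate` would instead wrap few-shot prompting of the GPT-3 teacher); only the control flow mirrors the equations above.

```python
# A minimal sketch of the REFEREE-DISTILL loop; `generate` and
# `finetune` are hypothetical stand-ins, not the paper's actual code.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (sentence s, summary s')

def format_example(s: str, summary: str) -> str:
    # Training-string format used throughout: "s TL;DR: s' <eos>".
    return f"{s} TL;DR: {summary} <eos>"

def referee_distill(
    model,                               # M_0, e.g. GPT2-Large
    corpora: List[List[str]],            # D_0 ... D_t
    filters: List[Callable[[str, str], bool]],  # filter_i per step
    generate: Callable,                  # (model, sentences) -> List[Pair]
    finetune: Callable,                  # (model, training strings) -> model
):
    datasets: List[List[Pair]] = []
    for sentences, keep in zip(corpora, filters):
        pairs = generate(model, sentences)              # summaries for D_i
        kept = [(s, t) for (s, t) in pairs if keep(s, t)]  # C_i
        model = finetune(model, [format_example(s, t) for (s, t) in kept])
        datasets.append(kept)
    return model, datasets               # M_{t+1} and C_0 ... C_t
```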
2.2 Filters

There is no single summary that is better than all others; depending on the desiderata of the end users, some might prefer shorter but less informative summaries, while others might prefer longer, more informative ones. While some of these goals are universal and always desired (for example, a summary should be accurate, in that it should not contain information not present in the input), others can be tailored to the end task. We use binary filters ($\mathrm{filter}_i$) to operationalize these goals. We experiment with the following filters.
Summary Fidelity Filter
To encourage accurate summaries, we employ a simple but effective criterion: the summary should be entailed by the input sentence. More formally, we define a binary filter $f_{\mathrm{NLI}}(s, s') := \mathbb{1}\{s \vDash s'\}$, and discard all non-entailed sentence-summary pairs to avoid using these samples when training the next iteration's student. We measure entailment using an off-the-shelf state-of-the-art NLI model (Liu et al., 2022a).
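As an illustration, a version of $f_{\mathrm{NLI}}$ takes only a few lines with Hugging Face; the checkpoint below is an off-the-shelf MNLI model standing in for the WANLI-trained model used in the paper.

```python
# A sketch of the fidelity filter f_NLI. roberta-large-mnli is a
# stand-in for the WANLI model (Liu et al., 2022a) used in the paper.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def f_nli(s: str, summary: str) -> bool:
    # Keep (s, s') only if the source sentence entails the summary.
    pred = nli({"text": s, "text_pair": summary})[0]
    return pred["label"] == "ENTAILMENT"
```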
Summary Length Filter
While underexplored in prior work, constraining the length of written text, especially in summarization, is a desirable feature to support real-world applications with limited screen space. To obtain a corpus of summaries of varying lengths, at each distillation step $i$ we encourage the student $M_i$ to generate progressively shorter outputs. We achieve this by constraining $\mathcal{C}_i$ to contain only summaries with a predefined compression ratio $r_i \in [0, 1]$. More precisely,

$$f_{\mathrm{compress}}(s, s', r_i) = \mathbb{1}\left\{\frac{|s'|}{|s|} \leq r_i\right\}$$

where $r_i > r_{i+1}$ for all $i$, to progressively summarize more succinctly. $\frac{|s'|}{|s|}$ is commonly referred to as the compression ratio. In theory, one could generate data for all desired compression ratios directly from $M_1$. However, since the seed dataset $\mathcal{C}_0$ is heavily skewed towards longer summaries, the final corpus after filtering with $f_{\mathrm{NLI}}$ would be extremely small for lower compression ratios. We find that combining the two filters and iteratively refining models to produce shorter, accurate summaries leads to a more diverse and still high-quality final corpus.
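For concreteness, a minimal implementation of $f_{\mathrm{compress}}$, under the assumption that lengths are measured in characters (tokens or words would work equally well):

```python
# A minimal implementation of the length filter f_compress, assuming
# |s| denotes the character length of s.
def f_compress(s: str, summary: str, r_i: float) -> bool:
    return len(summary) / len(s) <= r_i

# Example: with r_i = 0.6, a 120-character summary of a 200-character
# sentence passes (compression ratio 0.6); a 130-character one does not.
```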
Contextual Filter
For many applications, the sentences we need to summarize are part of a larger piece of text, such as a paragraph or a document (e.g., emails, articles). This contextual information may further improve sentence summary quality, since depending on the larger context, different information could be more important to preserve, and inter-sentence redundancies could be removed. Inspired by West et al. (2019)'s interpretation of the Information Bottleneck principle (Tishby et al., 1999), we consider the following filter:

$$f_{\mathrm{NSP}} = \mathbb{1}\left\{\frac{p(s_{\mathrm{next}} \mid s')}{p(s_{\mathrm{next}} \mid s)} \geq l\right\}$$

where NSP refers to "next sentence prediction", $p$ is an oracle language model (which we approximate by GPT2-Large), $s_{\mathrm{next}}$ denotes the sentence immediately following the input sentence $s$, and $l \in [0, 1]$ is a hyperparameter. Intuitively, we want to find summaries which are good predictors of the next sentence, so as to select the most crucial information and preserve coherence. $l$ allows us to strike a balance between sacrificing some of the information in $s$ and maintaining enough to predict $s_{\mathrm{next}}$. Adding $f_{\mathrm{NSP}}$ requires expanding the input sequence to also include the next sentence throughout the iterative distillation process defined in §2.1.
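A sketch of $f_{\mathrm{NSP}}$ follows, approximating the oracle $p$ with GPT2-Large as in the paper; the log-probability helper is our own and scores only the continuation tokens.

```python
# A sketch of the contextual filter f_NSP. GPT2-Large approximates the
# oracle p; the scoring helper below is our own construction.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-large")
lm = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def log_p(continuation: str, context: str) -> float:
    # Sum of log p(token | prefix) over the continuation tokens only.
    ctx = tok(context, return_tensors="pt").input_ids
    cont = tok(" " + continuation, return_tensors="pt").input_ids
    ids = torch.cat([ctx, cont], dim=1)
    logits = lm(ids).logits[0, :-1]      # position t predicts token t+1
    targets = ids[0, 1:]
    lp = logits.log_softmax(-1)[torch.arange(len(targets)), targets]
    return lp[ctx.size(1) - 1:].sum().item()

def f_nsp(s: str, summary: str, s_next: str, l: float) -> bool:
    # Keep the summary if p(s_next | s') / p(s_next | s) >= l,
    # computed in log space for numerical stability.
    return math.exp(log_p(s_next, summary) - log_p(s_next, s)) >= l
```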
Final REFEREE-DISTILL Filters Definition
We experiment with two filters, $f_1$ and $f_2$ (or #1 and #2, as we will refer to them in the experiments). $f_1$ does not assume the existence of any context, and so it only filters for inaccuracies and length:

$$f_1(s, s'; s_{\mathrm{next}}, r_i) = f_{\mathrm{NLI}} \wedge f_{\mathrm{compress}}$$

This allows $f_1$ to be applied in broader contexts. We also define $f_2$, which adds contextual filtering:

$$f_2(s, s'; s_{\mathrm{next}}, r_i, l) = f_{\mathrm{NLI}} \wedge f_{\mathrm{compress}} \wedge f_{\mathrm{NSP}}$$
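In code, and assuming the f_nli, f_compress, and f_nsp sketches above, the two configurations are plain boolean conjunctions:

```python
# The two filter configurations composed as conjunctions, assuming the
# f_nli, f_compress, and f_nsp sketches defined earlier in this section.
def f1(s: str, summary: str, r_i: float) -> bool:
    return f_nli(s, summary) and f_compress(s, summary, r_i)

def f2(s: str, summary: str, s_next: str, r_i: float, l: float) -> bool:
    return f1(s, summary, r_i) and f_nsp(s, summary, s_next, l)
```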
Fluency Filter
To ensure fluency over several self-training iterations, we consider an additional filter, used only in REFEREE-CONTROL. Given a sentence $x = (x_1, \ldots, x_\ell)$, we define $\mathrm{AvgNLL}(x) := -\frac{1}{\ell} \sum_{i \leq \ell} \log p(x_i \mid x_{<i})$. We deem a summary fluent if and only if its mean Negative Log Likelihood (NLL) does not exceed that of the source sentence, leading to the filter:

$$f_{\mathrm{AvgNLL}}(s, s') = \mathbb{1}\{\mathrm{AvgNLL}(s') \leq \mathrm{AvgNLL}(s)\}$$
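A sketch of $f_{\mathrm{AvgNLL}}$, again using GPT2-Large as the scoring model (an assumption mirroring the oracle used for $f_{\mathrm{NSP}}$):

```python
# A sketch of the fluency filter f_AvgNLL. avg_nll is the mean
# per-token negative log likelihood of a text under GPT2-Large.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-large")
lm = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def avg_nll(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # With labels supplied, Hugging Face returns the mean cross-entropy
    # over the predicted tokens, which is exactly AvgNLL.
    return lm(ids, labels=ids).loss.item()

def f_avg_nll(s: str, summary: str) -> bool:
    # Keep the summary only if it is at least as fluent as the source.
    return avg_nll(summary) <= avg_nll(s)
```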
2.3 REFEREE-CONTROL

Using the high-quality corpora of varying compression ratios obtained with REFEREE-DISTILL, we train REFEREE-CONTROL, a summarization model that allows explicit control over the desired compression ratio. We divide all possible compression ratios into $n$ buckets, where each bucket is $b_i = \left[\frac{i}{n}, \frac{i+1}{n}\right)$ for $0 \leq i < n$. Using the $b_i$ as control codes, we train a model that, when prompted with one of them, summarizes at a compression ratio within $b_i$.

Similar to $\mathcal{D}$, we start with a corpus $\mathcal{F} = \mathcal{F}_0 \cup \ldots \cup \mathcal{F}_t$ of sentences without reference summaries. Additionally, we create a seed corpus labeled with compression ratios, $\mathcal{E}_0 = \mathcal{C}_0 \cup \ldots \cup \mathcal{C}_t$ (with $\mathcal{F}_0 = \mathcal{D}_0 \cup \ldots \cup \mathcal{D}_t$), now representing each example $(s, s')$ as "$s$ <sep> <bucket_tok j> TL;DR: $s'$ <eos>", where <bucket_tok j> corresponds to the bucket in which the example lies, that is, $\frac{|s'|}{|s|} \in b_j$. <sep> is a special token. We denote the subset of $\mathcal{E}_0$ corresponding to bucket $j$ as $\mathcal{E}_0^{(j)}$. This seed dataset is filtered to remove low-quality generations, with the same filter as all the subsequent iterations.
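As an illustration, bucket assignment and the control-code input format might look as follows; the exact special-token strings are our own and would be added to the tokenizer's vocabulary in practice.

```python
# A sketch of the REFEREE-CONTROL input format. Token strings are
# illustrative, and lengths are assumed to be character counts.
def bucket_index(s: str, summary: str, n: int) -> int:
    # The compression ratio |s'|/|s| lies in bucket b_j = [j/n, (j+1)/n).
    ratio = len(summary) / len(s)
    return min(int(ratio * n), n - 1)  # clamp ratios >= 1 into the top bucket

def format_control_example(s: str, summary: str, n: int) -> str:
    j = bucket_index(s, summary, n)
    return f"{s} <sep> <bucket_tok_{j}> TL;DR: {summary} <eos>"

# At inference time, prompting with "s <sep> <bucket_tok_1> TL;DR:" asks
# for a summary whose compression ratio falls in [1/n, 2/n).
```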
Similar to REFEREE-DISTILL, starting with a pre-trained model $N_0$ (GPT2-Large), we train student models via iterative distillation. In each iteration $i$, we (1) fine-tune the student model using the bucket-labeled corpus $\mathcal{E}_i$, (2) generate summaries for $\mathcal{F}_i$ for all buckets, and (3) filter them to create a new labeled corpus $\mathcal{E}_{i+1}$. We do not reinitialize the student at each iteration, but rather fine-tune starting from the teacher's current local optimum. We use $h(s, s') = f_{\mathrm{NLI}} \wedge f_{\mathrm{AvgNLL}}$ as the filter. Formally,

$$N_{i+1} := \mathrm{finetune}(N_i, \mathcal{E}_i^{(0)}, \mathcal{E}_i^{(1)}, \ldots, \mathcal{E}_i^{(n-1)})$$
$$\mathcal{E}_{i+1}^{(j)} := h(\mathrm{generate}(N_i, \mathcal{F}_i, j)) \quad \text{for all } 0 \leq j < n$$
2.4 Primal-Dual Problem Interpretation of Summarization

Assuming summaries are fluent and factual, sentence summaries trade off between two variables: the level of compression and the level of information preservation. We are able to effectively fix the level of compression by introducing control codes, and then develop models to maximize information preservation. This is our primal problem. Thanks to length-control codes, we can now also solve the dual problem: "what is the best shortest summary we could write?" More precisely, given a fixed level of tolerance for losing information from the original sentence, what is the shortest summary we could write? Furthermore, comparing similar-length summaries also allows for fairer comparisons, since we are effectively measuring changes in only one variable.
3 On GPT3's Fidelity and Length Control

We analyze GPT3-Instruct Curie's (Brown et al., 2020) sentence summarization capabilities. We encourage GPT3 to summarize at different compression ratios by few-shot prompting with high-quality sentence-summary pairs in the desired compression ratios. More precisely, we do three-shot prompting with three different sets of summaries: one set of sentence-summary pairs has all three pairs with compression ratios in the interval $[0.6, 0.8]$, another set in $[0.4, 0.6]$, and another in $[0.2, 0.4]$.
Model                      | Avg. c.r. % (stdev) | WANLI Entailment % | c.r. ≥ 1 %
---------------------------|---------------------|--------------------|-----------
GPT-3, c.r. 60-80%         | 82 (24)             | 88                 | 31
GPT-3, c.r. 40-60%         | 78 (28)             | 81                 | 33
GPT-3, c.r. 20-40%         | 59 (28)             | 76                 | 11
Supervised baseline        | 55 (29)             | 71                 | 7
Referee-Distill, filter #1 | 46 (15)             | 89                 | 1
Referee-Distill, filter #2 | 49 (18)             | 91                 | 2

Table 1: Statistics of automatically-generated datasets (ours and GPT-3). The first three rows refer to three different datasets generated through three-shot prompting GPT3-Instruct Curie with summaries of different compression ratios (c.r.). The following rows show results for the third and last iteration of our models, using each of the two described filters ($f_1$, $f_2$; see §2.2). Sentences correspond to a set held out during training.
We show that the average compression ratio (c.r.) correlates with the prompts' compression ratio (although variance is large), and that up to 33% of the time the models generate summaries longer than the original sentence (see Table 1). Qualitatively, this seems to be caused by punctuation edits or hallucinations.

Besides using prompts that encourage shorter summaries, one can iteratively summarize through few-shot prompting. If $f_p(s)$ is the summary GPT-3 generates when prompted with $p$, then $f_p(f_p(\ldots f_p(s))) = f_p^n(s)$ may also be a summary of $s$, possibly shorter. We find that successive application of the same prompt did not result in shorter summaries, i.e., $|f_p^n(s)| \simeq |f_p(s)|$, suggesting $f_p$ is roughly idempotent in terms of length (see Appendix A.1).

These experiments motivate the need for more sophisticated approaches to control length and reliably summarize without supervision.
4 Experiments

Dataset
We create the corpora $\mathcal{D}$ and $\mathcal{F}$ by sampling contiguous sentence pairs from RealNews (Zellers et al., 2019) news articles. We filter out sentences shorter than 50 characters. Using GPT-3 as the teacher, we summarize sentences in $\mathcal{D}_0$ and use the outputs with 60-80% compression ratio as our initial dataset $\mathcal{C}_0$, since it was the best one quantitatively and qualitatively. Although this implies