REFEREE: Reference-Free Sentence Summarization
with Sharper Controllability through Symbolic Knowledge Distillation

Melanie Sclar¹  Peter West¹  Sachin Kumar²  Yulia Tsvetkov¹  Yejin Choi¹,³
¹Paul G. Allen School of Computer Science & Engineering, University of Washington
²Language Technologies Institute, Carnegie Mellon University
³Allen Institute for Artificial Intelligence
msclar@cs.washington.edu
Abstract

We present REFEREE, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision), while allowing direct control over the compression ratio. Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation (West et al., 2022), where latent knowledge in pre-trained language models is distilled via explicit examples sampled from the teacher models, further purified with three types of filters: length, fidelity, and Information Bottleneck. Moreover, we uniquely propose iterative distillation of knowledge, where student models from the previous iteration of distillation serve as teacher models in the next iteration. Starting from a relatively modest set of GPT3-generated summaries, we demonstrate how iterative knowledge distillation can lead to considerably smaller, but better summarizers with sharper controllability. A useful by-product of this iterative distillation process is a high-quality dataset of sentence-summary pairs with varying compression ratios. Empirical results demonstrate that the final student models vastly outperform the much larger GPT3-Instruct model in the controllability of compression ratios, without compromising the quality of the resulting summaries.¹
¹See https://github.com/msclar/referee for code, models, and data.

1 Introduction

We introduce REFEREE, a new framework for sentence summarization that works by iteratively generating and distilling knowledge into successively better models. This allows REFEREE to be [Refer]ence fr[ee]: it begins by distilling from a large language model rather than with supervised data. Yet, our method results in a more efficient, compact, and controllable summarization model than what we start with.
[Figure 1: pipeline diagram. GPT-3 generations are filtered (NLI, Length, IB) by REFEREE-DISTILL; its outputs, paired with control codes, train REFEREE-CONTROL.]
Figure 1: Our method results in high-quality, reference-free compact summarizers. We begin by using a large language model (e.g., GPT-3) to generate many summaries that demonstrate different aspects we may want in a summary; gray represents an aspect well-represented in these generations, while black is under-represented. We first use REFEREE-DISTILL to iteratively filter and train summarizers that better represent these desirable aspects, e.g., shorter summary length. We then use generations from REFEREE-DISTILL to train a model in which these aspects are controllable: this is REFEREE-CONTROL.
Our work follows the paradigm of Symbolic Knowledge Distillation (West et al., 2022), which transfers implicit knowledge from a massive language model to a considerably smaller student model by explicitly generating knowledge in textual form. Unlike traditional knowledge distillation (Hinton et al., 2015), where the teacher model and the student model are of the same type, symbolic knowledge distillation allows the student model to be of a different type.

Our work differs from West et al. (2022) in three key aspects.
First, our distillation is iterative: each student model becomes a teacher in successive rounds, refining and improving summarization at every step. Second, REFEREE controls for more than just overall quality, improving multiple aspects of the model in each round, such as length, fidelity, and information bottleneck (Tishby et al., 1999), and then allows explicit length control at generation time. Third, our work is the first to show that reference-free, controlled sentence summarization can be formulated as symbolic knowledge distillation.

REFEREE works in two phases, illustrated in Figure 1. First, REFEREE-DISTILL uses a modest number of generated summaries from GPT-3 (Brown et al., 2020) to produce high-quality and compact summarizers (Goyal et al., 2022). We follow an iterative approach; in each iteration we filter generations for desirable qualities, re-train a new and better summarizer, and finally generate new summaries for the next round. Each round amplifies the effects of the previous rounds, improving notions of summary quality like entailment or shorter length. Second, REFEREE-CONTROL uses these iteratively distilled summaries to train a model with explicit control: in our experiments, we use progressively shortened generations from each iteration to train a final summarizer with explicit length control.

We find that REFEREE demonstrates compelling empirical results compared to competitive baselines. REFEREE-DISTILL, even without explicit length control, is able to generate shorter summaries with more consistency and equal quality compared with the original teacher model (GPT-3, 16x larger in size) as well as a supervised model. Moreover, REFEREE-CONTROL, which has more direct length control baked in, demonstrates a sharp degree of control over length, and succeeds at generating high-quality summaries at specified lengths with significantly higher accuracy than GPT-3. In sum, the promising empirical results of REFEREE encourage further investigation into extending the framework of symbolic knowledge distillation for reference-free, controlled text summarization.
2 Methods

We first describe REFEREE-DISTILL (see §2.1), an iterative procedure to promote specific behaviors that may not be prevalent in the original data, while maintaining summary quality. We explore two different filters, detailed in §2.2. We then detail REFEREE-CONTROL (see §2.3), a model that separates summaries into categorical variables and is iteratively trained to summarize a given sentence within the desired category (e.g., a range of compression ratios). In this work we only consider categories that reflect different compression ratios, but the same approach could be applied to other types of control categories, such as style.
2.1 Iterative Symbolic Knowledge Distillation: REFEREE-DISTILL

Let $\mathcal{D} = \mathcal{D}_0 \cup \ldots \cup \mathcal{D}_t$ denote a sentence corpus without reference summaries. We start with a teacher model (GPT3-Instruct Curie) from which we want to distill summarization knowledge under a fixed budget. Using $\mathcal{D}_0$, a small subset of $\mathcal{D}$, we first generate a dataset of sentence-summary pairs ($\mathcal{C}_0$) by few-shot prompting the teacher and automatically filtering low-quality generations. Filters are detailed in §2.2. Throughout the whole training procedure, we store each entry $(s, s')$ as "$s$ TL;DR: $s'$ <eos>". Here, <eos> denotes end of sequence and TL;DR: is a separator that has been shown to encourage summarization behavior (Radford et al., 2019).

Let $M_0$ be a pre-trained model significantly smaller than GPT-3 (GPT2-Large in our experiments). Using the seed dataset $\mathcal{C}_0$, we train a student model $M_1$ by fine-tuning $M_0$ with a language modeling loss. We then iteratively refine this model by (1) using it to generate summaries for a subset of $\mathcal{D}$, (2) filtering them to remove undesired behaviors, and (3) training another student model on the filtered dataset, essentially distilling a better summarizer. More precisely,

$$\mathcal{C}_i := \mathrm{filter}_i(\mathrm{generate}(M_i, \mathcal{D}_i))$$
$$M_{i+1} := \mathrm{finetune}(M_i, \mathcal{C}_i)$$

We execute this procedure for $t$ steps, creating $t+1$ different summarization datasets in the process: $\mathcal{C}_0, \mathcal{C}_1, \ldots, \mathcal{C}_t$.² We discuss two possible instantiations of $\mathrm{filter}_i$ below.

²Note that this process would stay identical if a user decided to use a human-generated summarization dataset as $\mathcal{C}_0$.
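To make the recurrence concrete, the following is a minimal sketch of the REFEREE-DISTILL loop in Python. The `generate` and `finetune` callables are hypothetical stand-ins for model sampling and language-model fine-tuning (for $i = 0$, `generate` would instead wrap few-shot prompting of the GPT-3 teacher); only the control flow mirrors the equations above.

```python
# A minimal sketch of the REFEREE-DISTILL loop; `generate` and
# `finetune` are hypothetical stand-ins, not the paper's actual code.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (sentence s, summary s')

def format_example(s: str, summary: str) -> str:
    # Training-string format used throughout: "s TL;DR: s' <eos>".
    return f"{s} TL;DR: {summary} <eos>"

def referee_distill(
    model,                               # M_0, e.g. GPT2-Large
    corpora: List[List[str]],            # D_0 ... D_t
    filters: List[Callable[[str, str], bool]],  # filter_i per step
    generate: Callable,                  # (model, sentences) -> List[Pair]
    finetune: Callable,                  # (model, training strings) -> model
):
    datasets: List[List[Pair]] = []
    for sentences, keep in zip(corpora, filters):
        pairs = generate(model, sentences)              # summaries for D_i
        kept = [(s, t) for (s, t) in pairs if keep(s, t)]  # C_i
        model = finetune(model, [format_example(s, t) for (s, t) in kept])
        datasets.append(kept)
    return model, datasets               # M_{t+1} and C_0 ... C_t
```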
2.2 Filters

There is no single summary that is better than all others; depending on the desiderata of the end users, some might prefer shorter but less informative summaries, while others might prefer longer, more informative ones. While some of these goals are universal and always desired (for example, a summary should be accurate, in that it should not contain information not present in the input), others can be tailored to the end task. We use binary filters ($\mathrm{filter}_i$) to operationalize these goals. We experiment with the following filters.
Summary Fidelity Filter
To encourage accurate summaries, we employ a simple but effective criterion: the summary should be entailed by the input sentence. More formally, we define a binary filter $f_{\mathrm{NLI}}(s, s') := \mathbb{1}\{s \vDash s'\}$, and discard all non-entailed sentence-summary pairs to avoid using these samples when training the next iteration's student. We measure entailment using an off-the-shelf state-of-the-art NLI model (Liu et al., 2022a).
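As an illustration, a version of $f_{\mathrm{NLI}}$ takes only a few lines with Hugging Face; the checkpoint below is an off-the-shelf MNLI model standing in for the WANLI-trained model used in the paper.

```python
# A sketch of the fidelity filter f_NLI. roberta-large-mnli is a
# stand-in for the WANLI model (Liu et al., 2022a) used in the paper.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def f_nli(s: str, summary: str) -> bool:
    # Keep (s, s') only if the source sentence entails the summary.
    pred = nli({"text": s, "text_pair": summary})[0]
    return pred["label"] == "ENTAILMENT"
```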
Summary Length Filter
While underexplored in prior work, constraining the length of written text, especially in summarization, is a desirable feature to support real-world applications with limited screen space. To obtain a corpus of summaries of varying lengths, at each distillation step $i$ we encourage the student $M_i$ to generate progressively shorter outputs. We achieve this by constraining $\mathcal{C}_i$ to contain only summaries with a predefined compression ratio $r_i \in [0, 1]$. More precisely,

$$f_{\mathrm{compress}}(s, s', r_i) = \mathbb{1}\left\{\frac{|s'|}{|s|} \leq r_i\right\}$$

where $r_i > r_{i+1}$ for all $i$, to progressively summarize more succinctly. $\frac{|s'|}{|s|}$ is commonly referred to as the compression ratio. In theory, one could generate data for all desired compression ratios directly from $M_1$. However, since the seed dataset $\mathcal{C}_0$ is heavily skewed towards longer summaries, the final corpus after filtering with $f_{\mathrm{NLI}}$ would be extremely small for lower compression ratios. We find that combining the two filters and iteratively refining models to produce shorter, accurate summaries leads to a more diverse and still high-quality final corpus.
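For concreteness, a minimal implementation of $f_{\mathrm{compress}}$, under the assumption that lengths are measured in characters (tokens or words would work equally well):

```python
# A minimal implementation of the length filter f_compress, assuming
# |s| denotes the character length of s.
def f_compress(s: str, summary: str, r_i: float) -> bool:
    return len(summary) / len(s) <= r_i

# Example: with r_i = 0.6, a 120-character summary of a 200-character
# sentence passes (compression ratio 0.6); a 130-character one does not.
```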
Contextual Filter
For many applications, the sentences we need to summarize are part of a larger piece of text, such as a paragraph or a document (e.g., emails, articles). This contextual information may further improve sentence summary quality, since depending on the larger context, different information could be more important to preserve, and inter-sentence redundancies could be removed. Inspired by West et al. (2019)'s interpretation of the Information Bottleneck principle (Tishby et al., 1999), we consider the following filter:

$$f_{\mathrm{NSP}} = \mathbb{1}\left\{\frac{p(s_{\mathrm{next}} \mid s')}{p(s_{\mathrm{next}} \mid s)} \geq l\right\}$$

where NSP refers to "next sentence prediction", $p$ is an oracle language model (which we approximate by GPT2-Large), $s_{\mathrm{next}}$ denotes the sentence immediately following the input sentence $s$, and $l \in [0, 1]$ is a hyperparameter. Intuitively, we want to find summaries which are good predictors of the next sentence, so as to select the most crucial information and preserve coherence. $l$ allows us to strike a balance between sacrificing some of the information in $s$ and maintaining enough to predict $s_{\mathrm{next}}$. Adding $f_{\mathrm{NSP}}$ requires expanding the input sequence to also include the next sentence throughout the iterative distillation process defined in §2.1.
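A sketch of $f_{\mathrm{NSP}}$ follows, approximating the oracle $p$ with GPT2-Large as in the paper; the log-probability helper is our own and scores only the continuation tokens.

```python
# A sketch of the contextual filter f_NSP. GPT2-Large approximates the
# oracle p; the scoring helper below is our own construction.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-large")
lm = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def log_p(continuation: str, context: str) -> float:
    # Sum of log p(token | prefix) over the continuation tokens only.
    ctx = tok(context, return_tensors="pt").input_ids
    cont = tok(" " + continuation, return_tensors="pt").input_ids
    ids = torch.cat([ctx, cont], dim=1)
    logits = lm(ids).logits[0, :-1]      # position t predicts token t+1
    targets = ids[0, 1:]
    lp = logits.log_softmax(-1)[torch.arange(len(targets)), targets]
    return lp[ctx.size(1) - 1:].sum().item()

def f_nsp(s: str, summary: str, s_next: str, l: float) -> bool:
    # Keep the summary if p(s_next | s') / p(s_next | s) >= l,
    # computed in log space for numerical stability.
    return math.exp(log_p(s_next, summary) - log_p(s_next, s)) >= l
```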
Final REFEREE-DISTILL Filters Definition
We experiment with two filters, $f_1$ and $f_2$ (or #1 and #2, as we will refer to them in the experiments). $f_1$ does not assume the existence of any context, and so it only filters for inaccuracies and length:

$$f_1(s, s'; s_{\mathrm{next}}, r_i) = f_{\mathrm{NLI}} \wedge f_{\mathrm{compress}}$$

This allows $f_1$ to be applied in broader contexts. We also define $f_2$, which adds contextual filtering:

$$f_2(s, s'; s_{\mathrm{next}}, r_i, l) = f_{\mathrm{NLI}} \wedge f_{\mathrm{compress}} \wedge f_{\mathrm{NSP}}$$
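In code, and assuming the f_nli, f_compress, and f_nsp sketches above, the two configurations are plain boolean conjunctions:

```python
# The two filter configurations composed as conjunctions, assuming the
# f_nli, f_compress, and f_nsp sketches defined earlier in this section.
def f1(s: str, summary: str, r_i: float) -> bool:
    return f_nli(s, summary) and f_compress(s, summary, r_i)

def f2(s: str, summary: str, s_next: str, r_i: float, l: float) -> bool:
    return f1(s, summary, r_i) and f_nsp(s, summary, s_next, l)
```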
Fluency Filter
To ensure fluency over several self-training iterations, we consider an additional filter, used only in REFEREE-CONTROL. Given a sentence $x = (x_1, \ldots, x_\ell)$, we define $\mathrm{AvgNLL}(x) := -\frac{1}{\ell} \sum_{i \leq \ell} \log p(x_i \mid x_{<i})$. We deem a summary fluent if and only if its mean Negative Log Likelihood (NLL) does not exceed that of the source sentence, leading to the filter:

$$f_{\mathrm{AvgNLL}}(s, s') = \mathbb{1}\{\mathrm{AvgNLL}(s') \leq \mathrm{AvgNLL}(s)\}$$
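A sketch of $f_{\mathrm{AvgNLL}}$, again using GPT2-Large as the scoring model (an assumption mirroring the oracle used for $f_{\mathrm{NSP}}$):

```python
# A sketch of the fluency filter f_AvgNLL. avg_nll is the mean
# per-token negative log likelihood of a text under GPT2-Large.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-large")
lm = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def avg_nll(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # With labels supplied, Hugging Face returns the mean cross-entropy
    # over the predicted tokens, which is exactly AvgNLL.
    return lm(ids, labels=ids).loss.item()

def f_avg_nll(s: str, summary: str) -> bool:
    # Keep the summary only if it is at least as fluent as the source.
    return avg_nll(summary) <= avg_nll(s)
```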
2.3 REFEREE-CONTROL

Using the high-quality corpora of varying compression ratios obtained with REFEREE-DISTILL, we train REFEREE-CONTROL, a summarization model that allows explicit control over the desired compression ratio. We divide all possible compression ratios into $n$ buckets, where each bucket is $b_i = \left[\frac{i}{n}, \frac{i+1}{n}\right)$ for $0 \leq i < n$. Using the $b_i$ as control codes, we train a model that, when prompted with one of them, summarizes at a compression ratio within $b_i$.

Similar to $\mathcal{D}$, we start with a corpus $\mathcal{F} = \mathcal{F}_0 \cup \ldots \cup \mathcal{F}_t$ of sentences without reference summaries. Additionally, we create a seed corpus labeled with compression ratios, $\mathcal{E}_0 = \mathcal{C}_0 \cup \ldots \cup \mathcal{C}_t$ (with $\mathcal{F}_0 = \mathcal{D}_0 \cup \ldots \cup \mathcal{D}_t$), now representing each example $(s, s')$ as "$s$ <sep> <bucket_tok j> TL;DR: $s'$ <eos>", where <bucket_tok j> corresponds to the bucket in which the example lies, that is, $\frac{|s'|}{|s|} \in b_j$. <sep> is a special token. We denote the subset of $\mathcal{E}_0$ corresponding to bucket $j$ as $\mathcal{E}_0^{(j)}$. This seed dataset is filtered to remove low-quality generations, with the same filter as all the subsequent iterations.
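As an illustration, bucket assignment and the control-code input format might look as follows; the exact special-token strings are our own and would be added to the tokenizer's vocabulary in practice.

```python
# A sketch of the REFEREE-CONTROL input format. Token strings are
# illustrative, and lengths are assumed to be character counts.
def bucket_index(s: str, summary: str, n: int) -> int:
    # The compression ratio |s'|/|s| lies in bucket b_j = [j/n, (j+1)/n).
    ratio = len(summary) / len(s)
    return min(int(ratio * n), n - 1)  # clamp ratios >= 1 into the top bucket

def format_control_example(s: str, summary: str, n: int) -> str:
    j = bucket_index(s, summary, n)
    return f"{s} <sep> <bucket_tok_{j}> TL;DR: {summary} <eos>"

# At inference time, prompting with "s <sep> <bucket_tok_1> TL;DR:" asks
# for a summary whose compression ratio falls in [1/n, 2/n).
```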
Similar to REFEREE-DISTILL, starting with a pre-trained model $N_0$ (GPT2-Large), we train student models via iterative distillation. In each iteration $i$, we (1) fine-tune the student model using the bucket-labeled corpus $\mathcal{E}_i$, (2) generate summaries for $\mathcal{F}_i$ for all buckets, and (3) filter them to create a new labeled corpus $\mathcal{E}_{i+1}$. We do not reinitialize the student at each iteration, but rather fine-tune starting from the teacher's current local optimum. We use $h(s, s') = f_{\mathrm{NLI}} \wedge f_{\mathrm{AvgNLL}}$ as the filter. Formally,

$$N_{i+1} := \mathrm{finetune}(N_i, \mathcal{E}_i^{(0)}, \mathcal{E}_i^{(1)}, \ldots, \mathcal{E}_i^{(n-1)})$$
$$\mathcal{E}_{i+1}^{(j)} := h(\mathrm{generate}(N_i, \mathcal{F}_i, j)) \quad \text{for all } 0 \leq j < n$$
2.4 Primal-Dual Problem Interpretation of Summarization

Assuming summaries are fluent and factual, sentence summaries trade off between two variables: the level of compression and the level of information preservation. We are able to effectively fix the level of compression by introducing control codes, and then develop models to maximize information preservation. This is our primal problem. Thanks to length-control codes, we can now also solve the dual problem: "what is the best shortest summary we could write?" More precisely, given a fixed level of tolerance for losing information from the original sentence, what is the shortest summary we could write? Furthermore, comparing similar-length summaries also allows for fairer comparisons, since we are effectively measuring changes in only one variable.
3 On GPT3's Fidelity and Length Control

We analyze GPT3-Instruct Curie's (Brown et al., 2020) sentence summarization capabilities. We encourage GPT3 to summarize at different compression ratios by few-shot prompting with high-quality sentence-summary pairs in the desired compression ratios. More precisely, we do three-shot prompting with three different sets of summaries: one set of sentence-summary pairs has all three pairs with compression ratios in the interval $[0.6, 0.8]$, another set in $[0.4, 0.6]$, and another in $[0.2, 0.4]$.
Model                      | Avg. c.r. % (stdev) | WANLI Entailment % | c.r. ≥ 1 %
---------------------------|---------------------|--------------------|-----------
GPT-3, c.r. 60-80%         | 82 (24)             | 88                 | 31
GPT-3, c.r. 40-60%         | 78 (28)             | 81                 | 33
GPT-3, c.r. 20-40%         | 59 (28)             | 76                 | 11
Supervised baseline        | 55 (29)             | 71                 | 7
Referee-Distill, filter #1 | 46 (15)             | 89                 | 1
Referee-Distill, filter #2 | 49 (18)             | 91                 | 2

Table 1: Statistics of automatically-generated datasets (ours and GPT-3). The first three rows refer to three different datasets generated through three-shot prompting GPT3-Instruct Curie with summaries of different compression ratios (c.r.). The following rows show results for the third and last iteration of our models, using each of the two described filters ($f_1$, $f_2$; see §2.2). Sentences correspond to a set held out during training.
We show that the average compression ratio (c.r.) correlates with the prompts' compression ratio (although variance is large), and that up to 33% of the time the models generate summaries longer than the original sentence (see Table 1). Qualitatively, this seems to be caused by punctuation edits or hallucinations.

Besides using prompts that encourage shorter summaries, one can iteratively summarize through few-shot prompting. If $f_p(s)$ is the summary GPT-3 generates when prompted with $p$, then $f_p(f_p(\ldots f_p(s))) = f_p^n(s)$ may also be a summary of $s$, possibly shorter. We find that successive application of the same prompt did not result in shorter summaries, i.e., $|f_p^n(s)| \simeq |f_p(s)|$, suggesting $f_p$ is roughly idempotent in terms of length (see Appendix A.1).

These experiments motivate the need for more sophisticated approaches to control length and reliably summarize without supervision.
4 Experiments

Dataset
We create the corpora $\mathcal{D}$ and $\mathcal{F}$ by sampling contiguous sentence pairs from RealNews (Zellers et al., 2019) news articles. We filter out sentences shorter than 50 characters. Using GPT-3 as the teacher, we summarize sentences in $\mathcal{D}_0$ and use the outputs with 60-80% compression ratio as our initial dataset $\mathcal{C}_0$, since it was the best one quantitatively and qualitatively. Although this implies