
key aspects. First, our distillation is iterative: each student model becomes a teacher in successive rounds, refining and improving summarization at every step. Second, REFEREE controls for more than just overall quality, improving multiple aspects of the model in each round, such as length, fidelity, and information bottleneck (Tishby et al., 1999), and then allowing explicit length control at generation time. Third, our work is the first to show that reference-free, controlled sentence summarization can be formulated as symbolic knowledge distillation.
REFEREE works in two phases, illustrated in Figure 1. First, REFEREE-DISTILL uses a modest number of generated summaries from GPT-3 (Brown et al., 2020) to produce high-quality and compact summarizers (Goyal et al., 2022). We follow an iterative approach: in each iteration we filter generations for desirable qualities, re-train a new and better summarizer, and finally generate new summaries for the next round. Each round amplifies the effects of the previous rounds, improving notions of summary quality such as entailment or shorter length. Second, REFEREE-CONTROL uses these iteratively distilled summaries to train a model with explicit control: in our experiments, we use progressively shortened generations from each iteration to train a final summarizer with explicit length control.
We find that REFEREE demonstrates compelling empirical results compared to competitive baselines. REFEREE-DISTILL, even without explicit length control, generates shorter summaries with more consistency and equal quality compared with the original teacher model (GPT-3, 16x larger in size) as well as a supervised model. Moreover, REFEREE-CONTROL, which has more direct length control baked in, demonstrates a sharp degree of control over length and succeeds at generating high-quality summaries at specified lengths with significantly higher accuracy than GPT-3. In sum, the promising empirical results of REFEREE encourage further investigation to extend the framework of symbolic knowledge distillation for reference-free, controlled text summarization.
2 Methods
We first describe REFEREE-DISTILL (see §2.1), an iterative procedure to promote specific behaviors that may not be prevalent in the original data while maintaining summary quality. We explore two different filters, detailed in §2.2. We then detail REFEREE-CONTROL (see §2.3), a model that separates summaries into categorical variables and is iteratively trained to summarize a given sentence within a desired category (e.g., a range of compression ratios). In this work we only consider categories that reflect different compression ratios, but the same approach could be applied to other types of control categories, such as style.
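As a concrete illustration of such categorical variables, the sketch below shows one way a compression ratio could be bucketed into coarse length categories; the bucket boundaries and control-token names are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: mapping a (sentence, summary) pair to a coarse
# length category via its compression ratio. Bucket boundaries and the
# control-token names are hypothetical placeholders, not the authors' values.

def compression_ratio(sentence: str, summary: str) -> float:
    """Ratio of summary length to source length, in whitespace tokens."""
    return len(summary.split()) / max(len(sentence.split()), 1)

def ratio_bucket(ratio: float) -> str:
    """Map a compression ratio to a categorical control code."""
    if ratio < 0.3:
        return "<short>"
    if ratio < 0.6:
        return "<medium>"
    return "<long>"

# Example: tag a training pair with its length category so that a model can
# later be conditioned on the desired category at generation time.
sentence = "The committee postponed the vote on the new zoning rules until next month."
summary = "Committee postponed zoning vote."
print(ratio_bucket(compression_ratio(sentence, summary)))  # -> "<medium>"
```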
2.1 Iterative Symbolic Knowledge Distillation: REFEREE-DISTILL
Let $D = D_0 \cup \dots \cup D_t$ denote a sentence corpus without reference summaries. We start with a teacher model (GPT3-Instruct Curie) from which we want to distill summarization knowledge under a fixed budget. Using $D_0$, a small subset of $D$, we first generate a dataset of sentence-summary pairs ($C_0$) by few-shot prompting the teacher and automatically filtering low-quality generations. Filters are detailed in §2.2. Throughout the whole training procedure, we store each entry $(s, s')$ as "$s$ TL;DR: $s'$ <eos>". Here, <eos> denotes end of sequence and TL;DR: is a separator that has been shown to encourage summarization behavior (Radford et al., 2019).
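To make the serialization concrete, the minimal sketch below formats a pair $(s, s')$ in the way described above; the helper names and the exact whitespace around the separator are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of the training-string format described above; helper names
# and exact whitespace around the separator are assumptions.

EOS = "<eos>"
SEP = " TL;DR: "

def serialize(sentence: str, summary: str) -> str:
    """Format one (s, s') pair as a single language-modeling training example."""
    return f"{sentence}{SEP}{summary}{EOS}"

def make_prompt(sentence: str) -> str:
    """At generation time, prompt with the sentence plus separator; the model
    decodes the summary until it emits <eos>."""
    return f"{sentence}{SEP}"

print(serialize("The storm knocked out power across the region overnight.",
                "Storm caused overnight power outages."))
```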
Let $M_0$ be a pre-trained model significantly smaller than GPT-3 (GPT2-Large in our experiments). Using the seed dataset $C_0$, we train a student model $M_1$ by fine-tuning $M_0$ with a language modeling loss. We then iteratively refine this model by (1) using it to generate summaries for a subset of $D$, (2) filtering them to remove undesired behaviors, and (3) training another student model on the filtered dataset, essentially distilling a better summarizer. More precisely,

$$C_i := \mathrm{filter}_i(\mathrm{generate}(M_i, D_i)), \qquad M_{i+1} := \mathrm{finetune}(M_i, C_i).$$

We execute this procedure for $t$ steps, creating $t+1$ different summarization datasets in the process: $C_0, C_1, \dots, C_t$.² We discuss two possible instantiations of $\mathrm{filter}_i$ below.
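The Python sketch below summarizes this loop; it assumes the caller supplies the actual prompting, filtering, and fine-tuning routines, and none of the function names come from the paper's released code.

```python
from typing import Callable, List, Sequence, Tuple

# High-level sketch of the REFEREE-DISTILL loop defined above:
#   C_i     = filter_i(generate(M_i, D_i))
#   M_{i+1} = finetune(M_i, C_i)
# The generate/finetune callables and filter functions are supplied by the
# caller; their names here are placeholders, not the authors' API.

def referee_distill(
    student,                      # M_0, e.g. a GPT2-Large checkpoint
    teacher,                      # the GPT-3 teacher, used only to build C_0
    splits: Sequence,             # [D_0, ..., D_t], subsets of the corpus D
    filters: Sequence[Callable],  # [filter_0, ..., filter_t]
    generate: Callable,           # generate(model, D_i) -> candidate pairs
    finetune: Callable,           # finetune(model, C_i) -> new model
    t: int,
) -> Tuple[object, List]:
    # Seed round: few-shot prompt the teacher on D_0, then filter to get C_0.
    datasets = [filters[0](generate(teacher, splits[0]))]
    model = finetune(student, datasets[0])    # first student, M_1

    # Iterative refinement, yielding C_1, ..., C_t and the final student.
    for i in range(1, t + 1):
        C_i = filters[i](generate(model, splits[i]))
        datasets.append(C_i)
        model = finetune(model, C_i)
    return model, datasets                    # final student and C_0, ..., C_t
```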
2.2 Filters
There is no one summary that is better than all others; depending on the desiderata of the end users, some might prefer shorter but less informative summaries, while others might prefer longer and more informative ones. While some of these goals are
²Note that this process would stay identical if a user decided to use a human-generated summarization dataset as $C_0$.