Improving abstractive summarization with energy-based re-ranking

Diogo Pernes¹ ², Afonso Mendes¹, André F. T. Martins³ ⁴ ⁵
¹Priberam  ²Universidade do Porto
³Instituto de Telecomunicações  ⁴LUMLIS (Lisbon ELLIS Unit), Instituto Superior Técnico  ⁵Unbabel
Lisbon, Portugal
diogo.pernes@priberam.pt, amm@priberam.pt, andre.t.martins@tecnico.ulisboa.pt
Abstract
Current abstractive summarization systems
present important weaknesses which prevent
their deployment in real-world applications,
such as the omission of relevant information
and the generation of factual inconsistencies
(also known as hallucinations). At the same
time, automatic evaluation metrics such as
CTC scores (Deng et al.,2021) have been re-
cently proposed that exhibit a higher corre-
lation with human judgments than traditional
lexical-overlap metrics such as ROUGE. In
this work, we intend to close the loop by
leveraging the recent advances in summariza-
tion metrics to create quality-aware abstrac-
tive summarizers. Namely, we propose an
energy-based model that learns to re-rank sum-
maries according to one or a combination of
these metrics. We experiment using several
metrics to train our energy-based re-ranker
and show that it consistently improves the
scores achieved by the predicted summaries.
Nonetheless, human evaluation results show
that the re-ranking approach should be used
with care for highly abstractive summaries, as
the available metrics are not yet sufficiently re-
liable for this purpose.
1 Introduction
In recent years, abstractive methods have greatly
benefited from the development and widespread
availability of large-scale transformer-based lan-
guage generative models (Vaswani et al.,2017;
Lewis et al.,2020;Raffel et al.,2020;Zhang
et al.,2020), which are capable of generating
text with unprecedented fluency. Despite the re-
cent progress, abstractive summarization systems
still suffer from problems that hamper their de-
ployment in real-world applications. Omitting the
most relevant information from the source docu-
ment is one such problem. Additionally, fac-
tual inconsistencies (also known as hallucinations)
were estimated to be present in around 30% of
the summaries produced by abstractive systems
on the CNN/DailyMail dataset (Kryscinski et al.,
2019). This observation has motivated a consider-
able amount of research on strategies to mitigate
the hallucination problem (Falke et al.,2019;Cao
et al.,2020;Zhao et al.,2020;Zhu et al.,2021),
but the improvements achieved so far are mild.
This is partly due to the difficulty of evaluating
the quality of summaries automatically, leading
to the adoption of metrics that are often insuffi-
cient or even inappropriate. Despite its limitations,
ROUGE (Lin,2004) is still the de facto evaluation
metric for summarization, mostly due to its sim-
plicity and interpretability. However, not only does
it correlate poorly with human-assessed summary
quality (Kané et al.,2019), but it is also unreliable
whenever the reference summary contains halluci-
nations, which unfortunately is not an uncommon
issue in widely adopted summarization datasets
(Kryscinski et al.,2019;Maynez et al.,2020). For
these reasons, the development of more reliable
evaluation metrics with a stronger correlation with
human judgment is also an active area of research
(Kryscinski et al.,2020;Scialom et al.,2021;Deng
et al.,2021).
In this work, we propose a new approach to ab-
stractive summarization via an energy-based model.
In contrast to previous approaches, which use re-
inforcement learning to train models to maximize
ROUGE or BERT scores (Paulus et al.,2018;Li
et al.,2019), our EBM is trained to re-rank the
candidate summaries the same way that the chosen
metric would rank them – a much simpler problem
which is computationally much more efficient. This
way, we are distilling the metric, which yields an additional advantage as a by-product: a quality estimation system that can be used to assess the quality of the summaries on the fly without the need for reference summaries. It should be remarked that any reference-free metric can be used at inference
time for re-ranking candidates from any abstrac-
tive summarization system, hence improving the
quality of the generated summaries. Our re-ranking
model can therefore leverage the advantages of re-
cently proposed evaluation metrics over traditional
ones, which are essentially two-fold: i) being able
to better capture high-level semantic concepts, and
ii) in addition to the target summary, these met-
rics take into account the information present in
the source document, which is crucial to detect
hallucinations. We demonstrate the effectiveness
of our approach on standard benchmark datasets
for abstractive summarization (CNN/DailyMail,
Hermann et al. (2015), and XSum, Narayan et al.
(2018)) and use a variety of summarization metrics
as the target to train our model on, showing the
versatility of the method. We also conduct a hu-
man evaluation experiment, in which we compare
our re-ranking model trained to maximize recent
transformer-based metrics that aim to measure fac-
tual consistency and relevance (CTC scores, Deng
et al. (2021)). Our proposed model yields improve-
ments over the usual beam search on a baseline
model and demonstrates the ability to distill target
metrics. However, the human evaluation results
suggest that re-ranking according to these metrics,
while competitive, may yield lower quality sum-
maries than those obtained by state-of-the-art ab-
stractive systems trained with augmented data and
contrastive learning.
The remainder of the paper is organized as fol-
lows: in Section 2, we discuss the related work; in
Section 3, we do a brief high-level description of
neural abstractive summarization systems and how
different candidate summaries can be generated
from them; in Section 4, we describe our methodol-
ogy in detail, as well as the summarization metrics
that we shall use to train our re-ranking model;
Section 5presents the experimental results of our
model and baselines, which include both automatic
and human evaluation; in Section 6, we discuss the
limitations of our approach and point out some direc-
tions for future work, and we conclude this work
with some final remarks in Section 7.
2 Related work
In the context of natural language generation, the
idea of re-ranking candidates has been studied ex-
tensively for neural machine translation (Shen et al.,
2004;Mizumoto and Matsumoto,2016;Ng et al.,
2019;Salazar et al.,2020;Fernandes et al.,2022),
but only seldom explored for abstractive summa-
rization. Among the former, the approach by Bhat-
tacharyya et al. (2021) is the most similar to ours
as they also resort to an energy-based model to
re-rank the candidates. However, they do not ap-
ply their method to abstractive summarization and
their training objective is different from the one we
shall define for our model: at each training step,
they sample a pair of candidates, and the model
is trained so that the difference between the en-
ergies of the two candidates is at least as large
as the difference of their BLEU scores (Papineni
et al.,2002). Thus, their approach only exploits
the information of two candidates at each training
step. Recently, improved learning objectives such
as contrastive losses have been proposed to enhance
the quality of the predicted summaries, especially
their factual consistency. Tang et al. (2022), Cao
and Wang (2021), and Liu et al. (2021) used data
augmentation to generate both factually consistent
and inconsistent sentences and used these in a con-
trastive learning objective to regularize the trans-
former learned representations. In a different line
of work, Cao et al. (2020) and Zhao et al. (2020)
trained separate models on the task of correcting
factual inconsistencies in the predicted summaries.
Zhu et al. (2021) presented a model that learns to
extract a knowledge graph from the source docu-
ment and uses it to condition the decoding step.
Goyal and Durrett (2021) trained a model to de-
tect non-factual tokens and used it to identify and
discard these tokens from the training data of the
summarizer. Aralikatte et al. (2021) modified the
output distribution of the model to put more focus
on the vocabulary tokens that are similar to the at-
tended input tokens. Despite being sensible ideas,
these techniques mostly focus on redefining the
training objective of the model and disregard the
opportunity to improve the summary quality at in-
ference time, either by redesigning the sampling al-
gorithm or using re-ranking. In a somewhat similar
direction to ours, a contemporary work (Liu et al.,
2022) proposes using a ranking objective as an ad-
ditional term in the usual negative log-likelihood
loss. Similar to us, Liu and Liu (2021) and Ravaut
et al. (2022) propose to use a trained re-ranker as a
post-generation step. The former uses a contrastive
objective to learn a re-ranker that mimics ROUGE
scores. The latter employs a mixture of experts to
train a re-ranker on the combination of ROUGE,
BERT and BART scores.
3 Abstractive summarization systems
A typical abstractive summarization model approximates the conditional distribution $p(y \mid x)$ of summaries $y$ given source documents $x$, and works auto-regressively, exploiting the chain rule of probability:

$$p(y \mid x) = \prod_{i=1}^{l+1} p\big(y^{(i)} \mid x, y^{(0:i-1)}\big), \qquad (1)$$

where $y^{(0)}$ is a start-of-sequence token, the following $y^{(1)}, \ldots, y^{(l)}$ are the tokens in the summary, from the beginning to the end, and $y^{(l+1)}$ is an end-of-sequence token. Typically, the parameters of this model are estimated under the maximum likelihood criterion, by minimizing the negative log-likelihood loss for a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$ containing source documents $x_i$ paired with the respective reference summaries $y_i$.
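To make this concrete, the following is a minimal sketch (an assumption about a typical implementation, not the paper's code) of the teacher-forced negative log-likelihood corresponding to Eq. (1), written in PyTorch with illustrative tensor names and shapes.

```python
import torch.nn.functional as F

def summary_nll(logits, target_ids, pad_id):
    """Teacher-forced negative log-likelihood of Eq. (1).

    logits:     (batch, seq_len, vocab) decoder outputs, where position i is
                conditioned on the source x and the gold prefix y^(0:i-1).
    target_ids: (batch, seq_len) gold tokens y^(1), ..., y^(l+1) (ends in </s>).
    """
    return F.cross_entropy(
        logits.transpose(1, 2),   # cross_entropy expects (batch, vocab, seq_len)
        target_ids,
        ignore_index=pad_id,      # padded positions do not contribute to the loss
        reduction="mean",
    )
```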
Usually, the decoding process aims at finding the most likely sequence $y^\star$ for the given $x$, i.e., $y^\star \triangleq \arg\max_y p(y \mid x)$. Since searching for the
most likely sequence is intractable due to com-
binatorial explosion, mode-search heuristics like
greedy decoding and beam search are used in prac-
tice. Even if one could find the optimal sequence,
it is not guaranteed that this would be the best
summary for the given document. A primary rea-
son for this is that the distribution learned by the
model is only an approximation of the true condi-
tional distribution, and preserves some background
knowledge acquired during the unsupervised pre-
training of the underlying language model. This
is responsible for the presence of additional infor-
mation in the summary that was not in the source
document, which is the most frequent form of hal-
lucination in summarization (Maynez et al.,2020).
Another source of problems is the noise in the train-
ing datasets, which are often scraped automati-
cally from the web with little human supervision
(Kryscinski et al.,2019).
In essence, finding the optimal training objective
and decoding algorithm to obtain the best summary
remains an open problem. We take a step in this
direction by sampling a set of candidate summaries
$\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_k\}$
and then using a re-ranking model
to choose the best one. To ensure diverse candi-
dates, we experiment with diverse beam search
(Vijayakumar et al.,2016), a modification of tradi-
tional beam search including a term in the scoring
function that penalizes for repetitions across differ-
ent beams.
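As an illustration, the sketch below generates such a candidate set with diverse beam search through the Hugging Face transformers API; the checkpoint name and the generation hyperparameters are illustrative assumptions rather than the exact configuration used in our experiments.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative baseline summarizer (a BART model fine-tuned on CNN/DailyMail).
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def sample_candidates(document: str, k: int = 8) -> list[str]:
    """Return k candidate summaries obtained with diverse beam search."""
    inputs = tokenizer(document, truncation=True, max_length=1024, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=k,
        num_beam_groups=k,        # one beam per group (Vijayakumar et al., 2016)
        diversity_penalty=1.0,    # penalizes repetitions across beam groups
        num_return_sequences=k,
        max_length=128,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```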
4 Energy-based re-ranking
4.1 Formulation
Formally, a summarization metric is a function $\varphi: \mathcal{X} \times \mathcal{Y}^2 \mapsto \mathbb{R}$ that takes as input the source document $x \in \mathcal{X}$, the human-written reference summary $y \in \mathcal{Y}$, and the generated summary $\hat{y} \in \mathcal{Y}$, and outputs a scalar, usually in the unit interval, measuring the quality of the generated summary. Without loss of generality, throughout this work we assume that higher values of the metric indicate a better summary (as evaluated by the metric). Then, for a given summarization metric $\varphi$, our goal is to find a reference-free function $E: \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}$ with parameters $\theta$ such that, for two candidate summaries $\hat{y}$ and $\hat{y}'$ for the same document $x$ with reference summary $y$, $E(x, \hat{y}; \theta) < E(x, \hat{y}'; \theta)$ if and only if $\varphi(x, y, \hat{y}) > \varphi(x, y, \hat{y}')$. In the spirit of energy-based models (LeCun et al., 2006), $E$ should assign low energy wherever $p(y \mid x)$ is high and high energy wherever $p(y \mid x)$ is low, but it does not need to be normalized as a proper density. More precisely, $E$ should satisfy $p(y \mid x) \propto \exp(-E(x, y; \theta))$.

Under this perspective, at training time, $\varphi$ works as a proxy for the true conditional distribution, which is unknown. At inference time, sampling summaries directly from the distribution defined by the energy-based model is a non-trivial task since this model is not defined auto-regressively (Eikema et al., 2021), unlike standard encoder-decoder models for summarization. Hence, we use its scores to re-rank candidate summaries previously obtained from a baseline summarization model.
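For concreteness, one plausible parameterization of the reference-free function $E(x, \hat{y}; \theta)$ is a cross-encoder that reads the document and candidate jointly and maps a pooled representation to a scalar energy. The sketch below assumes a RoBERTa-style encoder and a linear energy head; it is a minimal illustration rather than the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EnergyReRanker(nn.Module):
    """Reference-free energy E(x, y_hat; theta): lower energy = better candidate."""

    def __init__(self, encoder_name: str = "roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.energy_head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, documents: list[str], candidates: list[str]) -> torch.Tensor:
        # Encode each (document, candidate) pair jointly; truncating long documents
        # is a practical simplification in this sketch.
        batch = self.tokenizer(
            documents, candidates,
            padding=True, truncation=True, max_length=512, return_tensors="pt",
        )
        hidden = self.encoder(**batch).last_hidden_state  # (batch, seq_len, dim)
        pooled = hidden[:, 0]                             # [CLS]-style pooling
        return self.energy_head(pooled).squeeze(-1)       # (batch,) energies
```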
4.2 Training and inference
We assume to have access to a training dataset $\mathcal{D} = \{(x_i, y_i, \hat{\mathbf{y}}_i)\}_{i=1}^{n}$, where $x_i$ and $y_i$ are respectively the $i$-th source document and the corresponding reference summary, and $\hat{\mathbf{y}}_i = \{\hat{y}_{i,1}, \hat{y}_{i,2}, \ldots, \hat{y}_{i,k}\}$ is a set of (up to) $k$ candidate summaries sampled from a baseline summarization model, such as BART (Lewis et al., 2020) or PEGASUS (Zhang et al., 2020). Several techniques have been proposed for training energy-based models that avoid the explicit computation of the partition function $Z(x; \theta) \triangleq \int_{\mathcal{Y}} \exp(-E(x, y; \theta))\, \mathrm{d}y$ and its gradient, which are usually intractable (Song and Kingma, 2021). Here, given this data and the metric $\varphi$, we adopt the ListMLE ranking loss (Xia et al., 2008) as the training objective. Specifically, the model is trained to minimize:

$$\mathcal{L}_\varphi(\theta) \triangleq \mathbb{E}_{(x, y, \hat{\mathbf{y}}) \sim \mathcal{D}}\left[ -\log \prod_{i=1}^{k} \frac{\exp\!\big(-E(x, \hat{y}_i; \theta)/\tau\big)}{\sum_{j=i}^{k} \exp\!\big(-E(x, \hat{y}_j; \theta)/\tau\big)} \right], \qquad (2)$$

where $\tau > 0$ is a temperature hyperparameter and the candidates $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_k$ are sorted such that if $i < j$ then $\varphi(x, y, \hat{y}_i) \geq \varphi(x, y, \hat{y}_j)$.
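A minimal PyTorch sketch of this objective for a single document follows (an assumption about how it could be implemented, not our released code): candidates are first sorted by the metric score, and the loss sums a log-softmax-style term over each suffix of the sorted list.

```python
import torch

def listmle_loss(energies: torch.Tensor, metric_scores: torch.Tensor,
                 tau: float = 1.0) -> torch.Tensor:
    """ListMLE loss of Eq. (2) for one document.

    energies:      (k,) energies E(x, y_hat_i; theta) of the candidates.
    metric_scores: (k,) values of phi(x, y, y_hat_i), used only to sort candidates.
    """
    order = torch.argsort(metric_scores, descending=True)  # best candidate first
    scores = -energies[order] / tau                         # higher score = better
    # -log prod_i [exp(s_i) / sum_{j>=i} exp(s_j)]: the denominator is a suffix
    # log-sum-exp, computed stably by flipping, cumulating, and flipping back.
    log_denom = torch.logcumsumexp(scores.flip(0), dim=0).flip(0)
    return (log_denom - scores).sum()
```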
To gain some intuition about this loss function, let us define: i) $r_i$ as the random variable corresponding to the $i$-th ranked summary in a list of $k$ candidates $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_k$, and ii) the probability that $r_1$ takes the value $\hat{y}_1$ as:

$$P(r_1 = \hat{y}_1 \mid x) \triangleq \frac{\exp(-E(x, \hat{y}_1))}{\sum_{j=1}^{k} \exp(-E(x, \hat{y}_j))}, \qquad (3)$$

where we have omitted the parameters $\theta$ for brevity. Assuming that the first $i-1$ candidates are ranked correctly, the probability that the $i$-th candidate is also ranked correctly is the probability that it is ranked first in the list $\hat{y}_i, \hat{y}_{i+1}, \ldots, \hat{y}_k$, thus:

$$P\big(r_i = \hat{y}_i \mid x, r_{1:(i-1)} = \hat{y}_{1:(i-1)}\big) = \frac{\exp(-E(x, \hat{y}_i))}{\sum_{j=i}^{k} \exp(-E(x, \hat{y}_j))}. \qquad (4)$$

It then follows from the chain rule that the probability that all the $k$ candidates are ranked correctly is:

$$P(r_{1:k} = \hat{y}_{1:k} \mid x) = \prod_{i=1}^{k} P\big(r_i = \hat{y}_i \mid x, r_{1:(i-1)} = \hat{y}_{1:(i-1)}\big) = \prod_{i=1}^{k} \frac{\exp(-E(x, \hat{y}_i))}{\sum_{j=i}^{k} \exp(-E(x, \hat{y}_j))}. \qquad (5)$$
Hence, $P(r_{1:k} \mid x)$ is a distribution over all the possible permutations of the $k$ candidates, and the minimization of the loss $\mathcal{L}_\varphi$ maximizes the likelihood of the correct permutation, i.e. of the permutation induced by ranking the candidates $\hat{y}_1, \ldots, \hat{y}_k$ according to the metric $\varphi(x, y, \cdot)$. At inference time, given an unsorted list $\hat{\mathbf{y}}$ of $k$ candidate summaries for the document $x$, we choose the candidate $\hat{y}^\star$ that is the most likely to be the top-ranked:

$$\hat{y}^\star \triangleq \arg\max_{\hat{y} \in \hat{\mathbf{y}}} P(r_1 = \hat{y} \mid x) = \arg\min_{\hat{y} \in \hat{\mathbf{y}}} E(x, \hat{y}). \qquad (6)$$
Thus, our energy-based model aims at ranking a set of candidates the same way that the metric $\varphi$ would rank them, but it does this without having access to the reference summary $y$. Therefore, this is a way to distill the information contained in the metric into a single, reference-free model that can rank summary hypotheses on the fly.
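Putting the pieces together, inference reduces to scoring every candidate with the trained energy model and keeping the one with the lowest energy, as in Eq. (6). A minimal sketch, reusing the hypothetical EnergyReRanker and sample_candidates helpers from the earlier snippets:

```python
import torch

def rerank(document: str, candidates: list[str], energy_model) -> str:
    """Return the candidate with the lowest energy, i.e. the solution of Eq. (6)."""
    with torch.no_grad():
        energies = energy_model([document] * len(candidates), candidates)
    return candidates[int(torch.argmin(energies))]

# Usage (helpers are the hypothetical ones from the previous sketches):
# candidates = sample_candidates(document, k=8)
# best_summary = rerank(document, candidates, energy_model)
```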
4.3 Adopted metrics
So far, the definition of summarization metric we have provided has been generic, so now we focus on
describing the particular metrics we have used to
train our model. Summarization metrics can be
divided into two groups: reference-dependent and
reference-free, depending on whether $\varphi$ actually needs the reference summary or not. In the latter case, $\varphi(x, y, \hat{y}) \equiv \phi(x, \hat{y})\ \forall y$, for some function $\phi$.
Thus, reference-dependent metrics are mostly used
to evaluate and compare summarization systems,
whereas reference-free metrics can also be used
to assess summary quality on the fly. Therefore,
training our energy-based model using reference-
dependent metrics provides an indirect way to use
these metrics for the latter purpose as well.
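To illustrate how a reference-dependent metric supplies the training signal, the sketch below scores each candidate against the reference with ROUGE-L (via the rouge_score package) to obtain the values of $\varphi$ that the re-ranker learns to imitate; the choice of ROUGE-L here is purely illustrative, and any of the metrics discussed below could be substituted.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def metric_targets(reference: str, candidates: list[str]) -> list[float]:
    """phi(x, y, y_hat_i) for each candidate; here phi is ROUGE-L F1."""
    return [scorer.score(reference, cand)["rougeL"].fmeasure for cand in candidates]
```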
Automatically assessing the quality of a sum-
mary is a non-trivial task since it depends on high-
level concepts, such as factual consistency, rele-
vance, coherence, and fluency (Lloret et al.,2018).
These are loosely captured by classical metrics
(Kané et al.,2019;Kryscinski et al.,2019) such
as ROUGE, which essentially measure the $n$-gram overlap between $\hat{y}$ and $y$. However, in recent years,
the availability of powerful language representa-
tion models like BERT (Devlin et al.,2019) per-
mitted and motivated the development of several
transformer-based automatic metrics.
There are a few metrics based on question gen-
eration (QG) and question answering (QA) models
(Wang et al.,2020;Durmus et al.,2020). Among
these, QuestEval (Scialom et al.,2021) exhibits the
strongest correlation with human judgment. This
metric uses a QG model to generate questions from both the source document $x$ and the candidate summary $\hat{y}$, and a QA model to get the answers from both, which are then compared to produce a score in the unit interval. In addition to the QA and QG models, QuestEval uses an additional model to determine the importance weight of each question generated from $x$. Although reference-free, this metric is computationally expensive, so it is important to investigate whether our model can produce a similar ranking more efficiently.
Following a different paradigm, Deng et al.