tive summarization system, hence improving the
quality of the generated summaries. Our re-ranking
model can therefore leverage the advantages of recently proposed evaluation metrics over traditional ones, which are essentially two-fold: i) they better capture high-level semantic concepts, and ii) in addition to the target summary, they take into account the information in the source document, which is crucial for detecting hallucinations. We demonstrate the effectiveness
of our approach on standard benchmark datasets
for abstractive summarization (CNN/DailyMail,
Hermann et al. (2015), and XSum, Narayan et al.
(2018)) and use a variety of summarization metrics
as training targets for our model, showing the versatility of the method. We also conduct a human evaluation experiment, in which we compare our re-ranking model, trained to maximize recent transformer-based metrics that aim to measure factual consistency and relevance (CTC scores, Deng et al. (2021)), against state-of-the-art abstractive systems. Our proposed model yields improvements over standard beam search on a baseline model and demonstrates the ability to distill the target metrics. However, the human evaluation results
suggest that re-ranking according to these metrics,
while competitive, may yield lower quality sum-
maries than those obtained by state-of-the-art ab-
stractive systems trained with augmented data and
contrastive learning.
The remainder of the paper is organized as fol-
lows: in Section 2, we discuss the related work; in
Section 3, we give a brief high-level description of
neural abstractive summarization systems and how
different candidate summaries can be generated
from them; in Section 4, we describe our methodol-
ogy in detail, as well as the summarization metrics
that we shall use to train our re-ranking model;
Section 5 presents the experimental results of our
model and baselines, which include both automatic
and human evaluation; in Section 6, we discuss the
limitations of our approach and point out some directions for future work, and we conclude this work
with some final remarks in Section 7.
2 Related work
In the context of natural language generation, the
idea of re-ranking candidates has been studied ex-
tensively for neural machine translation (Shen et al., 2004; Mizumoto and Matsumoto, 2016; Ng et al., 2019; Salazar et al., 2020; Fernandes et al., 2022),
but only seldom explored for abstractive summa-
rization. Among the former, the approach by Bhat-
tacharyya et al. (2021) is the most similar to ours
as they also resort to an energy-based model to
re-rank the candidates. However, they do not ap-
ply their method to abstractive summarization and
their training objective is different from the one we
shall define for our model: at each training step,
they sample a pair of candidates, and the model
is trained so that the difference between the en-
ergies of the two candidates is at least as large
as the difference of their BLEU scores (Papineni et al., 2002). Thus, their approach exploits the information of only two candidates at each training step; a minimal sketch of such a pairwise margin objective is given below.
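As a concrete illustration, the following is a minimal sketch of this pairwise hinge loss, assuming an energy-based re-ranker where lower energy indicates a better candidate; the function and variable names are illustrative and not taken from the cited work:

    import torch

    def pairwise_margin_loss(energy_better: torch.Tensor,
                             energy_worse: torch.Tensor,
                             bleu_better: float,
                             bleu_worse: float) -> torch.Tensor:
        # Hinge loss: the energy of the lower-BLEU candidate should exceed the
        # energy of the higher-BLEU candidate by at least their BLEU gap.
        margin = bleu_better - bleu_worse  # non-negative by construction
        return torch.clamp(margin - (energy_worse - energy_better), min=0.0)

Recently, improved learning objectives such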
as contrastive losses have been proposed to enhance
the quality of the predicted summaries, especially
their factual consistency. Tang et al. (2022), Cao
and Wang (2021), and Liu et al. (2021) used data
augmentation to generate both factually consistent
and inconsistent sentences and used these in a con-
trastive learning objective to regularize the representations learned by the transformer. In a different line
of work, Cao et al. (2020) and Zhao et al. (2020)
trained separate models on the task of correcting
factual inconsistencies in the predicted summaries.
Zhu et al. (2021) presented a model that learns to
extract a knowledge graph from the source docu-
ment and uses it to condition the decoding step.
Goyal and Durrett (2021) trained a model to de-
tect non-factual tokens and used it to identify and
discard these tokens from the training data of the
summarizer. Aralikatte et al. (2021) modified the
output distribution of the model to put more focus
on the vocabulary tokens that are similar to the at-
tended input tokens. While sensible, these techniques mostly focus on redefining the
training objective of the model and disregard the
opportunity to improve the summary quality at in-
ference time, either by redesigning the sampling al-
gorithm or using re-ranking. In a somewhat similar
direction to ours, a contemporary work (Liu et al.,
2022) proposes using a ranking objective as an additional term on top of the usual negative log-likelihood
loss. Similar to us, Liu and Liu (2021) and Ravaut
et al. (2022) propose to use a trained re-ranker as a post-generation step. The former uses a contrastive
objective to learn a re-ranker that mimics ROUGE
scores. The latter employs a mixture of experts to train a re-ranker on a combination of ROUGE, BERT, and BART scores; in both cases, the trained re-ranker scores a pool of already-generated candidates and selects the best one, as sketched below.
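For illustration, a minimal sketch of this kind of post-generation re-ranking, assuming a pool of candidate summaries (e.g., from beam search) and a learned scorer; the names below are hypothetical and do not correspond to any of the cited systems:

    from typing import Callable, List

    def rerank(source: str,
               candidates: List[str],
               scorer: Callable[[str, str], float]) -> str:
        # Score each candidate summary against the source document with the
        # trained re-ranker and return the highest-scoring candidate.
        return max(candidates, key=lambda summary: scorer(source, summary))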