tive summarization system, hence improving the
quality of the generated summaries. Our re-ranking
model can therefore leverage the advantages of recently proposed evaluation metrics over traditional ones, which are essentially two-fold: i) they better capture high-level semantic concepts, and ii) in addition to the target summary, they take into account the information in the source document, which is crucial for detecting hallucinations. We demonstrate the effectiveness
of our approach on standard benchmark datasets
for abstractive summarization (CNN/DailyMail,
Hermann et al. (2015), and XSum, Narayan et al.
(2018)) and use a variety of summarization metrics
as training targets for our model, showing the versatility of the method. We also conduct a human evaluation experiment, in which we compare our re-ranking model, trained to maximize recent transformer-based metrics that aim to measure factual consistency and relevance (CTC scores, Deng et al. (2021)), against state-of-the-art abstractive systems. Our proposed model yields improvements over standard beam search on a baseline model and demonstrates the ability to distill the target metrics. However, the human evaluation results
suggest that re-ranking according to these metrics,
while competitive, may yield lower quality sum-
maries than those obtained by state-of-the-art ab-
stractive systems trained with augmented data and
contrastive learning.
The remainder of the paper is organized as fol-
lows: in Section 2, we discuss the related work; in
Section 3, we give a brief high-level description of
neural abstractive summarization systems and how
different candidate summaries can be generated
from them; in Section 4, we describe our methodol-
ogy in detail, as well as the summarization metrics
that we shall use to train our re-ranking model;
Section 5 presents the experimental results of our
model and baselines, which include both automatic
and human evaluation; in Section 6, we discuss the
limitations of our approach and point out some directions for future work, and we conclude this work
with some final remarks in Section 7.
2 Related work
In the context of natural language generation, the
idea of re-ranking candidates has been studied ex-
tensively for neural machine translation (Shen et al., 2004; Mizumoto and Matsumoto, 2016; Ng et al., 2019; Salazar et al., 2020; Fernandes et al., 2022),
but only seldom explored for abstractive summa-
rization. Among the former, the approach by Bhat-
tacharyya et al. (2021) is the most similar to ours
as they also resort to an energy-based model to
re-rank the candidates. However, they do not ap-
ply their method to abstractive summarization and
their training objective is different from the one we
shall define for our model: at each training step,
they sample a pair of candidates, and the model
is trained so that the difference between the en-
ergies of the two candidates is at least as large
as the difference of their BLEU scores (Papineni et al., 2002). Thus, their approach exploits the information of only two candidates at each training step; a minimal sketch of such a pairwise margin objective is given below.
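As a concrete illustration, the following is a minimal sketch of this pairwise hinge loss, assuming an energy-based re-ranker where lower energy indicates a better candidate; the function and variable names are illustrative and not taken from the cited work:

    import torch

    def pairwise_margin_loss(energy_better: torch.Tensor,
                             energy_worse: torch.Tensor,
                             bleu_better: float,
                             bleu_worse: float) -> torch.Tensor:
        # Hinge loss: the energy of the lower-BLEU candidate should exceed the
        # energy of the higher-BLEU candidate by at least their BLEU gap.
        margin = bleu_better - bleu_worse  # non-negative by construction
        return torch.clamp(margin - (energy_worse - energy_better), min=0.0)

Recently, improved learning objectives such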
as contrastive losses have been proposed to enhance
the quality of the predicted summaries, especially
their factual consistency. Tang et al. (2022), Cao
and Wang (2021), and Liu et al. (2021) used data
augmentation to generate both factually consistent
and inconsistent sentences and used these in a con-
trastive learning objective to regularize the representations learned by the transformer. In a different line
of work, Cao et al. (2020) and Zhao et al. (2020)
trained separate models on the task of correcting
factual inconsistencies in the predicted summaries.
Zhu et al. (2021) presented a model that learns to
extract a knowledge graph from the source docu-
ment and uses it to condition the decoding step.
Goyal and Durrett (2021) trained a model to de-
tect non-factual tokens and used it to identify and
discard these tokens from the training data of the
summarizer. Aralikatte et al. (2021) modified the
output distribution of the model to put more focus
on the vocabulary tokens that are similar to the at-
tended input tokens. While sensible, these techniques mostly focus on redefining the
training objective of the model and disregard the
opportunity to improve the summary quality at in-
ference time, either by redesigning the sampling al-
gorithm or using re-ranking. In a somewhat similar
direction to ours, a contemporary work (Liu et al.,
2022) proposes using a ranking objective as an additional term on top of the usual negative log-likelihood
loss. Similar to us, Liu and Liu (2021) and Ravaut
et al. (2022) propose to use a trained re-ranker as a post-generation step. The former uses a contrastive
objective to learn a re-ranker that mimics ROUGE
scores. The latter employs a mixture of experts to train a re-ranker on a combination of ROUGE, BERT, and BART scores; in both cases, the trained re-ranker scores a pool of already-generated candidates and selects the best one, as sketched below.
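For illustration, a minimal sketch of this kind of post-generation re-ranking, assuming a pool of candidate summaries (e.g., from beam search) and a learned scorer; the names below are hypothetical and do not correspond to any of the cited systems:

    from typing import Callable, List

    def rerank(source: str,
               candidates: List[str],
               scorer: Callable[[str, str], float]) -> str:
        # Score each candidate summary against the source document with the
        # trained re-ranker and return the highest-scoring candidate.
        return max(candidates, key=lambda summary: scorer(source, summary))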