
tion with the summary. Finally, we define the degree of multi-text merging of an MDS summary as a function of the amount of summary information not covered by each subset of documents.
We apply our automated measure to evaluate the degree of multi-text merging in four prominent MDS datasets (DUC, TAC, MultiNews and WCEP), as well as in the output of five recent systems. Our results show that some existing datasets barely involve multi-text merging, because the reference summary information mostly appears in a single document. Unsurprisingly, the length of the summary has a substantial impact on the amount of multi-text merging, since longer summaries cover more detailed information, which tends to be spread across documents.
Taken together, our work is the first to measure and empirically analyze multi-text merging in MDS datasets and model summaries. We suggest that future work use our methodology to develop better datasets and to improve the degree of multi-text merging in MDS models.
2 A Measure for Multi-text Merging
2.1 Motivating Analysis
The common dataset structure for an MDS instance is a topic that consists of a set of source documents $D = \{D_1, \ldots, D_n\}$ and a summary $S$. To motivate our measure, we first analyze the degree of multi-text merging on a sample of topics. To that end, we leverage the Summary-Source-Alignment dataset of Ernst et al. (2021), in which human annotators aligned all propositions in reference summaries with corresponding propositions in the source documents that cover the same information, as exemplified in Table 1. Given these alignments on 9 MDS topics from MultiNews (Fabbri et al., 2019), each composed of 4 source documents, we find that a single source document alone suffices to cover 70% of the summary propositions, while 2 documents cover 95% of them. The remaining source documents thus hardly contribute any substantial information to the summary.
Motivated by this analysis, we develop an automated measure that allows us to evaluate the degree of multi-text merging in entire MDS datasets and in system summaries. Our measure operates in the following steps. We first define the coverage score for a given subset of source documents (§2.2). Then, to approximate the minimum number of documents required to cover increasing portions of the summary information, we greedily construct, for each possible number of source documents, the subset of source documents with the highest coverage score (§2.3). Finally, we measure the total amount of summary information covered across all subset sizes, yielding a corresponding coverage curve (§2.4).
2.2 Relative Coverage Score
Let $D^* \subseteq D$ be a subset of the source documents. We define the relative coverage of $D^*$ as the proportion of information that is covered by $D^*$, normalized by the information covered by all source documents $D$:

$$\text{cov}(D^*, D, S) = \frac{s(D^*, S)}{s(D, S)} \qquad (1)$$
For the absolute coverage score $s(D^*, S)$, we aim to approximate the human annotation of summary-source proposition alignment of Ernst et al. (2021), which is based on the well-established Pyramid scheme (Nenkova and Passonneau, 2004). Specifically, we follow their automated scheme: (1) we extract all propositions from the summary and all source documents using OpenIE (Banko et al., 2008);² (2) we compute the similarity score between the propositions in the summary and the source documents using SUPERPAL, an NLI model fine-tuned on proposition alignment (Ernst et al., 2021); (3) $s(D^*, S)$ is defined as the number of propositions in $S$ that are aligned with some proposition in $D^*$.
We consider the proportion $s(D^*, S)/s(D, S)$, rather than the absolute coverage $s(D^*, S)$, for two main reasons. First, as both reference and system summaries are known to include hallucinated information (Maynez et al., 2020), we need to discard it in our measure in order to properly estimate the amount of information that each single source document actually provides to the summary. Second, normalizing the coverage score mitigates potential omissions of the alignment model.
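To make the score concrete, here is a minimal Python sketch of Equation (1). The proposition extractor and aligner below are trivial stand-ins, assumed only for illustration, for the OpenIE and SUPERPAL components described above; this is not the actual implementation.

    # Minimal sketch of the relative coverage score in Eq. (1).
    # extract_propositions and align_score are trivial stand-ins for
    # OpenIE and SuperPAL, assumed here only for illustration.

    def extract_propositions(text):
        # Stand-in for OpenIE: treat each sentence as one proposition.
        return [s.strip() for s in text.split(".") if s.strip()]

    def align_score(summary_prop, source_prop):
        # Stand-in for SuperPAL: token overlap instead of an NLI model.
        s = set(summary_prop.lower().split())
        d = set(source_prop.lower().split())
        return len(s & d) / max(len(s), 1)

    def absolute_coverage(doc_subset, summary, threshold=0.5):
        # s(D*, S): summary propositions aligned with some source proposition.
        source_props = [p for doc in doc_subset for p in extract_propositions(doc)]
        return sum(1 for sp in extract_propositions(summary)
                   if any(align_score(sp, dp) >= threshold for dp in source_props))

    def relative_coverage(doc_subset, all_docs, summary):
        # cov(D*, D, S) = s(D*, S) / s(D, S); normalization discards summary
        # content not grounded in any source document (e.g., hallucinations).
        total = absolute_coverage(all_docs, summary)
        return absolute_coverage(doc_subset, summary) / total if total else 0.0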
2.3 Maximally-Covering Document Subsets
Given an MDS topic with $n$ source documents, we aim to measure the maximal coverage of the summary content by a document subset of size $k \le n$. To that end, we form $n$ subsets of source
² We use the AllenNLP implementation of Stanovsky et al. (2018) to extract the OpenIE tuples. Following Ernst et al. (2021, 2022), we convert each OpenIE tuple into a proposition string by concatenating the predicate and its arguments in their original order.
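As an illustration of the greedy construction in §2.3, the following Python sketch grows one subset per size $k$ by always adding the document that most increases relative coverage. This incremental reading of the procedure is our assumption, not the authors' released code, and it reuses the relative_coverage stand-in sketched for §2.2.

    # Illustrative greedy construction of maximally-covering subsets and
    # the resulting coverage curve, reusing relative_coverage from above.

    def greedy_coverage_curve(docs, summary):
        subset, remaining, curve = [], list(docs), []
        for _ in range(len(docs)):
            # Add the document whose inclusion maximizes relative coverage.
            best = max(remaining,
                       key=lambda d: relative_coverage(subset + [d], docs, summary))
            subset.append(best)
            remaining.remove(best)
            curve.append(relative_coverage(subset, docs, summary))
        return curve  # relative coverage at each subset size k = 1..n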