
2 Related Work
Evaluation based on Gold Standard
This type of evaluation determines the relevance of a summary by comparing its overlapping words with a gold standard summary. Such evaluation is often labor-intensive, as it requires (multiple) human-written gold standard summaries. Baseline methods commonly used in gold standard evaluation are ROUGE (Lin, 2004) and BLEU (Papineni et al., 2001). More recent techniques enhance the ROUGE score with WordNet (ShafieiBavani et al., 2018) or word embeddings (Ng and Abrecht, 2015), or use contextual embeddings (Zhang et al., 2019) in order to capture semantic similarity. In addition, Mrabet and Demner-Fushman (2020) combine lexical similarities with BERT embeddings (Devlin et al., 2019).
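To illustrate the overlap idea, a simplified ROUGE-1 recall can be computed as below. This is a bare-bones sketch of ours (whitespace tokenization, no stemming), not the official ROUGE toolkit.

from collections import Counter

def rouge1_recall(summary: str, reference: str) -> float:
    # Clipped unigram overlap divided by the reference length:
    # each reference token counts at most as often as it occurs
    # in the summary.
    sum_counts = Counter(summary.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(c, sum_counts[w]) for w, c in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 0.5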
Annotation-based Evaluation
Annotation-based evaluation methods require manually annotated summary ratings following predefined guidelines. For example, the Pyramid method (Nenkova and Passonneau, 2004) works by annotating relevant topics in the source text and ranking the summaries accordingly. Böhm et al. (2020) use annotated texts to train a BERT-based evaluation model, with rated summaries serving as training data.
Unsupervised Evaluation
Unsupervised approaches infer a quality score for a summary from its reference text without using a gold standard or manual annotations. Over the past few years, multiple unsupervised methods for summary evaluation have been proposed. For instance, many works explore BERT embeddings to detect semantic similarity between summary and reference text (Zheng and Lapata, 2020; Xenouleas et al., 2020; Gao et al., 2020). Zheng and Lapata (2020) propose a method for summary evaluation called PacSum that combines BERT embeddings with a directed graph in which each embedding is a node and node-wise similarity is computed based on graph positions. In order to rate a summary, SUPERT (Gao et al., 2020) uses contextual word embeddings and soft token alignment techniques.
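As a rough illustration of the soft token alignment idea (not SUPERT's actual implementation), each summary token embedding can be matched to its most similar reference token embedding and the resulting cosine similarities averaged:

import numpy as np

def soft_alignment_score(summary_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    # summary_emb: (m, d), ref_emb: (n, d) contextual token embeddings.
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = s @ r.T                         # (m, n) pairwise cosine similarities
    return float(sim.max(axis=1).mean())  # best reference match per token

rng = np.random.default_rng(0)
print(soft_alignment_score(rng.normal(size=(5, 8)), rng.normal(size=(7, 8))))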
3 Methodology: Summary Evaluation
Our unsupervised evaluation method is based on
the hypothesis that a generated summary can be
evaluated by how much of the reference docu-
ment’s semantic information it preserves. To increase the granularity and interpretability of our approach, we evaluate the preservation of content per semantic topic of the reference document. The semantic topics are identified by clustering the reference document’s sentence embeddings. Next, inspired by maximum mean discrepancy (MMD) (Gretton et al., 2012), our method measures the correspondence of the summary’s contextual token embeddings to each reference text topic centroid embedding. As each token is assigned to each topic, this assignment is weighted by the normalized cosine similarity between token and topic embedding. The pseudocode implementing our method is illustrated in Algorithm 1.

Algorithm 1 Calculation of MISEM score m
 1: R ← encode(reference text)            ▷ Encode sentences
 2: I ← encode(summary text)              ▷ Encode tokens
 3: T ← cluster(R)                        ▷ Cluster reference text topics
 4: for each t ∈ T do
 5:     w_t ← |t| / |R|                   ▷ Compute topic weights
 6:     T̄_t ← (1/|t|) Σ_{i=1}^{|t|} t_i   ▷ Compute topic centroids
 7: end for
 8: C ← (T̄ · I) / (‖T̄‖ ‖I‖)              ▷ Compute cosine similarity matrix
 9: C ← softmax(C)                        ▷ Normalize similarity matrix
10: for each t ∈ T do
11:     s_t ← Σ_{i=1}^{n} C_{t,i}         ▷ Compute topic scores
12: end for
13: S ← softmax(S)                        ▷ Normalize topic scores
14: m ← W · S                             ▷ Compute weighted final score
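To make the scoring concrete, the following is a minimal NumPy sketch of lines 5-14 of Algorithm 1. The variable names, the softmax helper, and the choice to normalize the similarity matrix per token (over topics) are our own reading of the pseudocode, not the authors' reference implementation; embeddings and cluster labels are assumed given.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def misem_score(R, I, labels):
    # R: (num_sents, d) reference sentence embeddings;
    # I: (num_tokens, d) summary token embeddings;
    # labels: cluster label of each reference sentence (the topics T).
    topics = np.unique(labels)
    # Lines 5-6: topic weights w_t = |t|/|R| and centroid embeddings.
    W = np.array([(labels == t).sum() / len(R) for t in topics])
    T_bar = np.stack([R[labels == t].mean(axis=0) for t in topics])
    # Line 8: cosine similarity between every centroid and every token.
    Tn = T_bar / np.linalg.norm(T_bar, axis=1, keepdims=True)
    In = I / np.linalg.norm(I, axis=1, keepdims=True)
    C = Tn @ In.T                          # (num_topics, num_tokens)
    # Line 9: distribute each token's mass over topics (one plausible
    # reading of softmax(C)).
    C = softmax(C, axis=0)
    # Lines 10-12: topic scores as summed token assignments.
    S = C.sum(axis=1)
    # Lines 13-14: normalize topic scores, weight by topic size.
    return float(W @ softmax(S))

# Example with random stand-in embeddings:
rng = np.random.default_rng(0)
print(misem_score(rng.normal(size=(10, 768)),
                  rng.normal(size=(25, 768)),
                  rng.integers(0, 3, size=10)))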
Encoding
Following SUPERT (Gao et al., 2020), we first split both reference text and summary text into sentences. Then, the pre-trained Sentence-BERT model (Reimers and Gurevych, 2020) is used to encode the reference text into a set of sentence embeddings R and the summary text into a set of contextual token embeddings I. In Algorithm 1, this encoding step is performed in lines 1-2.
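As an illustration of this step, the sketch below uses the sentence-transformers library; the checkpoint name and the naive sentence splitting are our assumptions, not necessarily the paper's exact setup.

from sentence_transformers import SentenceTransformer

reference_text = "The cat sat on the mat. It was a sunny day."
summary_text = "A cat sat on a mat."

# Naive sentence split for illustration only.
reference_sents = [s.strip() + "." for s in reference_text.split(".") if s.strip()]

# Checkpoint choice is an assumption; any Sentence-BERT model fits here.
model = SentenceTransformer("bert-large-nli-stsb-mean-tokens")

R = model.encode(reference_sents)  # sentence embeddings R (Algorithm 1, line 1)
# output_value="token_embeddings" yields contextual token embeddings I
# instead of a single pooled sentence vector (Algorithm 1, line 2).
I = model.encode(summary_text, output_value="token_embeddings")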
Topic Clustering
Our method requires a set of reference text topics defined as clusters T. In our experiments, the clusters T are computed using the agglomerative clustering algorithm from Pedregosa et al. (2011). Line 3 of Algorithm 1 contains this step.
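A minimal scikit-learn sketch of this clustering step follows; the distance threshold is an assumed hyperparameter, not a value reported here, and the random matrix stands in for the sentence embeddings from the encoding step.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

R = np.random.rand(12, 768)  # stand-in for reference sentence embeddings

clustering = AgglomerativeClustering(
    n_clusters=None,          # let the threshold determine the topic count
    distance_threshold=1.0,   # assumed hyperparameter
)
labels = clustering.fit_predict(R)
# Sentences sharing a label form one topic t in T (Algorithm 1, line 3).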
Furthermore, our method requires topic centroids T̄, which are computed as T̄_t = (1/|t|) Σ_{i=1}^{|t|} t_i, where t represents one topic in T. Each topic centroid is calculated by taking the average of its associated reference text sentence embeddings. As the MISEM method assumes that the importance of a topic is determined to some degree by its length, a topic weight is calculated for each topic in T. The