Towards Interpretable Summary Evaluation via Allocation of Contextual
Embeddings to Reference Text Topics
Ben Schaper
Technical University of Munich
ben.schaper@tum.de
Christopher Lohse
Trinity College Dublin
lohsec@tcd.ie
Marcell Streile
IBM
streile@de.ibm.com
Andrea Giovannini
IBM
agv@zurich.ibm.com
Richard Osuala
Universitat de Barcelona
richard.osuala@ub.edu
Abstract

Despite extensive recent advances in summary generation models, evaluation of auto-generated summaries still widely relies on single-score systems insufficient for transparent assessment and in-depth qualitative analysis. Towards bridging this gap, we propose the multifaceted interpretable summary evaluation method (MISEM), which is based on allocation of a summary's contextual token embeddings to semantic topics identified in the reference text. We further contribute an interpretability toolbox for automated summary evaluation and interactive visual analysis of summary scoring, topic identification, and token-topic allocation. MISEM achieves a promising .404 Pearson correlation with human judgment on the TAC'08 dataset. Our code and toolbox are available at https://github.com/IBM/misem
1 Introduction
Auto-generated text summaries are becoming an increasingly mature, useful, and time-saving tool in research and industry, with multiple applications such as email summary generation, summarizing research papers, and simplifying knowledge management and knowledge transfer in companies (El-Kassas et al., 2021). It is crucial for production-ready applications that auto-generated summaries do not omit critical information.
In this regard, currently used summary evaluation metrics have many known limitations, which are becoming even more apparent as natural language generation (NLG) models evolve (e.g., better paraphrasing) (Gehrmann et al., 2022). As a result, generated texts are becoming harder to assess with the surface-level (e.g., n-gram-overlap-based) methods of older evaluation metrics (Gehrmann et al., 2022). Recent advances in summary evaluation leverage semantic similarity and often compute a single numerical summary evaluation score (Zheng and Lapata, 2020; Xenouleas et al., 2020; Gao et al., 2020). However, as noted by Gehrmann et al. (2022), a single numerical score alone is likely too narrow to reliably indicate the quality of a text summary.

Figure 1: Sankey diagram visualization of the MISEM score methodology, which evaluates how well a summary reflects the topics identified in its reference text.
Thus, there is a clear need for an NLG evaluation metric that provides a multifaceted and interpretable quality measure. Moreover, such a measure is needed as a quality gate in industrial settings that NLG-generated summaries must pass before being displayed to end users.
Contributions  Providing a multifaceted view of quality beyond a single score, our contributions are two-fold:

Evaluation Method: We propose an interpretable summary evaluation method that identifies semantic topics present in the reference text, assigns summary text tokens to these topics, and measures the summary's semantic coverage of each of these topics.

Interpretability Toolbox: We provide an interactive interpretability toolbox that allows users to evaluate their summaries, adjust hyperparameters, explore topic-wise semantic overlap with reference texts, and detect missing parts of the summary.
2 Related Work
Evaluation based on Gold Standard  This type of evaluation determines the relevance of a summary by comparing its overlapping words with a gold-standard summary. Such evaluation is often labor-intensive, as it requires (multiple) human-written gold-standard summaries. Baseline methods commonly used in gold-standard evaluation are ROUGE (Lin, 2004) and BLEU (Papineni et al., 2001). More recent techniques enhance the ROUGE score with WordNet (ShafieiBavani et al., 2018) or word embeddings (Ng and Abrecht, 2015), or use contextual embeddings (Zhang et al., 2019) in order to capture semantic similarity. In addition, Mrabet and Demner-Fushman (2020) combine lexical similarities with BERT embeddings (Devlin et al., 2019).
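To make the notion of word-overlap evaluation concrete, the following is a minimal sketch of a ROUGE-1-style unigram recall between a candidate summary and a gold-standard summary; it illustrates the general idea only and is not the official ROUGE implementation, which adds stemming, further variants, and more careful tokenization.

# Minimal sketch of ROUGE-1-style unigram recall against a gold-standard
# summary. Illustrative only; not the official ROUGE implementation.
from collections import Counter

def unigram_recall(candidate: str, gold: str) -> float:
    """Fraction of gold-summary unigrams that are covered by the candidate."""
    cand_counts = Counter(candidate.lower().split())
    gold_counts = Counter(gold.lower().split())
    overlap = sum(min(n, cand_counts[tok]) for tok, n in gold_counts.items())
    total = sum(gold_counts.values())
    return overlap / total if total else 0.0

# Example: 4 of the 6 gold unigrams are covered, so recall is about 0.67.
print(unigram_recall("the cat sat on the mat", "a cat sat on a mat"))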
Annotation-based Evaluation  Annotation-based evaluation methods require manually annotated summary ratings following predefined guidelines. For example, the Pyramid method (Nenkova and Passonneau, 2004) works by annotating relevant topics in the source text and ranking the summaries accordingly. Böhm et al. (2020) use annotated texts to train a BERT-based evaluation model, using rated summaries as training data.
Unsupervised Evaluation  Unsupervised approaches infer a quality score for a summary from its reference text without using a gold standard or manual annotations. Over the past few years, multiple unsupervised approaches to summary evaluation have been proposed. For instance, many works use BERT embeddings to detect semantic similarity between summary and reference text (Zheng and Lapata, 2020; Xenouleas et al., 2020; Gao et al., 2020). Zheng and Lapata (2020) propose a method called PacSum that combines BERT embeddings with a directed graph in which each embedding is a node and node-wise similarity is computed based on graph positions. To rate a summary, SUPERT (Gao et al., 2020) uses contextual word embeddings and soft token alignment techniques.
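To illustrate what soft token alignment means in this context, the sketch below scores a summary by aligning each of its contextual token embeddings to the most similar reference token embedding and averaging the resulting cosine similarities; this greedy max-similarity variant is a simplification for illustration, not SUPERT's exact procedure.

# Sketch of greedy soft token alignment over contextual embeddings
# (in the spirit of SUPERT / BERTScore); a simplification for illustration.
import numpy as np

def soft_alignment_score(summary_emb: np.ndarray, reference_emb: np.ndarray) -> float:
    """summary_emb: (m, d) and reference_emb: (n, d) token embeddings."""
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    r = reference_emb / np.linalg.norm(reference_emb, axis=1, keepdims=True)
    sim = s @ r.T  # (m, n) pairwise cosine similarities
    # Align each summary token to its best-matching reference token.
    return float(sim.max(axis=1).mean())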
3 Methodology: Summary Evaluation
Our unsupervised evaluation method is based on the hypothesis that a generated summary can be evaluated by how much of the reference document's semantic information it preserves. To increase the granularity and interpretability of our approach, we evaluate the preservation of content per semantic topic of the reference document. The semantic topics are identified by clustering the reference document's sentence embeddings. Next, inspired by maximum mean discrepancy (MMD) (Gretton et al., 2012), our method measures the correspondence of the summary's contextual token embeddings to each reference text topic centroid embedding. As each token is assigned to each topic, this assignment is weighted by the normalized cosine similarity between token and topic embedding. The pseudocode implementing our method is illustrated in Algorithm 1.

Algorithm 1: Calculation of MISEM score m
1:  R ← encode(reference text)              ▷ Encode sentences
2:  I ← encode(summary text)                ▷ Encode tokens
3:  T ← cluster(R)                          ▷ Cluster reference text topics
4:  for each t ∈ T do
5:      w_t ← |t| / |R|                     ▷ Compute topic weights
6:      T̄_t ← (1/|t|) Σ_{i=1}^{|t|} t_i     ▷ Compute topic centroids
7:  end for
8:  C ← (T̄ · I) / (‖T̄‖ ‖I‖)                ▷ Compute cosine similarity matrix
9:  C ← softmax(C)                          ▷ Normalize similarity matrix
10: for each t ∈ T do
11:     s_t ← Σ_{i=1}^{n} C_{t,i}           ▷ Compute topic scores
12: end for
13: S ← softmax(S)                          ▷ Normalize topic scores
14: m ← W · S                               ▷ Compute weighted final score
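For readers who prefer code, the following NumPy sketch mirrors Algorithm 1 given precomputed embeddings and cluster labels; the softmax normalization axis and the handling of topic indices are our assumptions where the pseudocode leaves them implicit.

# NumPy sketch mirroring Algorithm 1. Inputs are assumed precomputed:
# R (reference sentence embeddings), I (summary token embeddings), and a
# cluster label per reference sentence. Normalization axes are assumptions.
import numpy as np

def softmax(x: np.ndarray, axis: int = 0) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def misem_score(R: np.ndarray, I: np.ndarray, labels: np.ndarray) -> float:
    """R: (|R|, d), I: (n, d), labels: topic assignment per reference sentence."""
    topics = np.unique(labels)
    # Lines 4-7: topic weights w_t = |t| / |R| and topic centroids T̄_t.
    W = np.array([(labels == t).sum() / len(R) for t in topics])
    T_bar = np.stack([R[labels == t].mean(axis=0) for t in topics])
    # Line 8: cosine similarity between every topic centroid and every token.
    T_norm = T_bar / np.linalg.norm(T_bar, axis=1, keepdims=True)
    I_norm = I / np.linalg.norm(I, axis=1, keepdims=True)
    C = T_norm @ I_norm.T
    C = softmax(C, axis=0)      # line 9: normalize over topics for each token
    S = C.sum(axis=1)           # lines 10-12: topic scores s_t
    S = softmax(S)              # line 13: normalize topic scores
    return float(W @ S)         # line 14: weighted final score m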
Encoding  Following SUPERT (Gao et al., 2020), we first split both reference text and summary text into sentences. Then, the pre-trained Sentence-BERT model (Reimers and Gurevych, 2020) is used to encode the reference text into a set of sentence embeddings R and the summary text into a set of contextual token embeddings I. In Algorithm 1, this encoding step is performed in lines 1-2.
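As a rough illustration, the encoding step could look as follows with the sentence-transformers library; the model name, the naive sentence splitting, and the use of output_value="token_embeddings" are our assumptions rather than the authors' exact configuration.

# Rough sketch of the encoding step with sentence-transformers. Model name,
# naive sentence splitting, and token-embedding extraction are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

reference_text = "First reference sentence. Second reference sentence."
summary_text = "A short summary sentence."

# Naive sentence splitting for illustration; a proper splitter would be used.
ref_sentences = [s.strip() + "." for s in reference_text.split(".") if s.strip()]
sum_sentences = [s.strip() + "." for s in summary_text.split(".") if s.strip()]

# R: one embedding per reference sentence.
R = model.encode(ref_sentences)
# I: contextual token embeddings of the summary, stacked across sentences.
token_embs = model.encode(sum_sentences, output_value="token_embeddings",
                          convert_to_numpy=False)
I = np.vstack([t.cpu().numpy() for t in token_embs])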
Topic Clustering  Our method requires a set of reference text topics defined as clusters T. In our experiments, T are computed using the agglomerative clustering algorithm from Pedregosa et al. (2011). Line 3 of Algorithm 1 contains this step. Furthermore, our method requires topic centroids T̄, which are computed as $\bar{T}_t = \frac{1}{|t|}\sum_{i=1}^{|t|} t_i$, where t represents one topic in T. Each topic centroid is calculated by taking the average of its associated reference text sentence embeddings. As the MISEM method assumes that the importance of a topic is determined to some degree by its length, a topic weight is calculated for each topic in T: the weight of topic t is $w_t = |t| / |R|$ (line 5 of Algorithm 1).
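A minimal sketch of this clustering step with scikit-learn could look as follows; the fixed number of clusters is an illustrative assumption, since the actual number of topics is a hyperparameter of the method.

# Sketch of topic clustering with scikit-learn's agglomerative clustering.
# The fixed number of clusters is an illustrative assumption.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_topics(R: np.ndarray, n_clusters: int = 5):
    """R: (num_sentences, d) reference sentence embeddings."""
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(R)
    topics = np.unique(labels)
    # Topic centroids: mean of the sentence embeddings assigned to each topic.
    centroids = np.stack([R[labels == t].mean(axis=0) for t in topics])
    # Topic weights: w_t = |t| / |R|, the share of sentences in each topic.
    weights = np.array([(labels == t).mean() for t in topics])
    return labels, centroids, weights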