
2 Related Work
Evaluation based on Gold Standard
This type of evaluation determines the relevance of a summary by comparing its overlapping words with a gold standard summary. Such evaluation is often labor-intensive, as it requires (multiple) human-written gold standard summaries. Baseline methods commonly used in gold standard evaluation are ROUGE (Lin, 2004) and BLEU (Papineni et al., 2001). More recent techniques enhance the ROUGE score with WordNet (ShafieiBavani et al., 2018) or word embeddings (Ng and Abrecht, 2015), or use contextual embeddings (Zhang et al., 2019) in order to capture semantic similarity. In addition, Mrabet and Demner-Fushman (2020) combine lexical similarities with BERT embeddings (Devlin et al., 2019).
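To illustrate the overlap idea, a simplified ROUGE-1 recall can be computed as below. This is a bare-bones sketch of ours (whitespace tokenization, no stemming), not the official ROUGE toolkit.

from collections import Counter

def rouge1_recall(summary: str, reference: str) -> float:
    # Clipped unigram overlap divided by the reference length:
    # each reference token counts at most as often as it occurs
    # in the summary.
    sum_counts = Counter(summary.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(c, sum_counts[w]) for w, c in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 0.5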
Annotation-based Evaluation
Annotation-based evaluation methods require manually annotated summary ratings following predefined guidelines. For example, the Pyramid method (Nenkova and Passonneau, 2004) works by annotating relevant topics in the source text and ranking the summaries accordingly. Böhm et al. (2020) use annotated texts to train a BERT-based evaluation model, with rated summaries serving as training data.
Unsupervised Evaluation
Unsupervised approaches infer a quality score for a summary from its reference text without using a gold standard or manual annotations. Over the past few years, multiple unsupervised methods for summary evaluation have been proposed. For instance, many works explore BERT embeddings to detect semantic similarity between summary and reference text (Zheng and Lapata, 2020; Xenouleas et al., 2020; Gao et al., 2020). Zheng and Lapata (2020) propose a method for summary evaluation called PacSum that combines BERT embeddings with a directed graph in which each embedding is a node and node-wise similarity is computed based on graph positions. In order to rate a summary, SUPERT (Gao et al., 2020) uses contextual word embeddings and soft token alignment techniques.
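As a rough illustration of the soft token alignment idea (not SUPERT's actual implementation), each summary token embedding can be matched to its most similar reference token embedding and the resulting cosine similarities averaged:

import numpy as np

def soft_alignment_score(summary_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    # summary_emb: (m, d), ref_emb: (n, d) contextual token embeddings.
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = s @ r.T                         # (m, n) pairwise cosine similarities
    return float(sim.max(axis=1).mean())  # best reference match per token

rng = np.random.default_rng(0)
print(soft_alignment_score(rng.normal(size=(5, 8)), rng.normal(size=(7, 8))))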
3 Methodology: Summary Evaluation
Our unsupervised evaluation method is based on
the hypothesis that a generated summary can be
evaluated by how much of the reference docu-
ment’s semantic information it preserves. To increase the granularity and interpretability of our approach, we evaluate the preservation of content per semantic topic of the reference document. The semantic topics are identified by clustering the reference document’s sentence embeddings. Next, inspired by maximum mean discrepancy (MMD) (Gretton et al., 2012), our method measures the correspondence of the summary’s contextual token embeddings to each reference text topic centroid embedding. As each token is assigned to each topic, this assignment is weighted by the normalized cosine similarity between token and topic embedding. The pseudocode implementing our method is illustrated in Algorithm 1.

Algorithm 1 Calculation of MISEM score m
 1: R ← encode(reference text)            ▷ Encode sentences
 2: I ← encode(summary text)              ▷ Encode tokens
 3: T ← cluster(R)                        ▷ Cluster reference text topics
 4: for each t ∈ T do
 5:     w_t ← |t| / |R|                   ▷ Compute topic weights
 6:     T̄_t ← (1/|t|) Σ_{i=1}^{|t|} t_i   ▷ Compute topic centroids
 7: end for
 8: C ← (T̄ · I) / (‖T̄‖ ‖I‖)              ▷ Compute cosine similarity matrix
 9: C ← softmax(C)                        ▷ Normalize similarity matrix
10: for each t ∈ T do
11:     s_t ← Σ_{i=1}^{n} C_{t,i}         ▷ Compute topic scores
12: end for
13: S ← softmax(S)                        ▷ Normalize topic scores
14: m ← W · S                             ▷ Compute weighted final score
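To make the scoring concrete, the following is a minimal NumPy sketch of lines 5-14 of Algorithm 1. The variable names, the softmax helper, and the choice to normalize the similarity matrix per token (over topics) are our own reading of the pseudocode, not the authors' reference implementation; embeddings and cluster labels are assumed given.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def misem_score(R, I, labels):
    # R: (num_sents, d) reference sentence embeddings;
    # I: (num_tokens, d) summary token embeddings;
    # labels: cluster label of each reference sentence (the topics T).
    topics = np.unique(labels)
    # Lines 5-6: topic weights w_t = |t|/|R| and centroid embeddings.
    W = np.array([(labels == t).sum() / len(R) for t in topics])
    T_bar = np.stack([R[labels == t].mean(axis=0) for t in topics])
    # Line 8: cosine similarity between every centroid and every token.
    Tn = T_bar / np.linalg.norm(T_bar, axis=1, keepdims=True)
    In = I / np.linalg.norm(I, axis=1, keepdims=True)
    C = Tn @ In.T                          # (num_topics, num_tokens)
    # Line 9: distribute each token's mass over topics (one plausible
    # reading of softmax(C)).
    C = softmax(C, axis=0)
    # Lines 10-12: topic scores as summed token assignments.
    S = C.sum(axis=1)
    # Lines 13-14: normalize topic scores, weight by topic size.
    return float(W @ softmax(S))

# Example with random stand-in embeddings:
rng = np.random.default_rng(0)
print(misem_score(rng.normal(size=(10, 768)),
                  rng.normal(size=(25, 768)),
                  rng.integers(0, 3, size=10)))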
Encoding
Following SUPERT (Gao et al., 2020), we first split both reference text and summary text into sentences. Then, the pre-trained Sentence-BERT model (Reimers and Gurevych, 2020) is used to encode the reference text into a set of sentence embeddings R and the summary text into a set of contextual token embeddings I. In Algorithm 1, this encoding step is performed in lines 1-2.
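As an illustration of this step, the sketch below uses the sentence-transformers library; the checkpoint name and the naive sentence splitting are our assumptions, not necessarily the paper's exact setup.

from sentence_transformers import SentenceTransformer

reference_text = "The cat sat on the mat. It was a sunny day."
summary_text = "A cat sat on a mat."

# Naive sentence split for illustration only.
reference_sents = [s.strip() + "." for s in reference_text.split(".") if s.strip()]

# Checkpoint choice is an assumption; any Sentence-BERT model fits here.
model = SentenceTransformer("bert-large-nli-stsb-mean-tokens")

R = model.encode(reference_sents)  # sentence embeddings R (Algorithm 1, line 1)
# output_value="token_embeddings" yields contextual token embeddings I
# instead of a single pooled sentence vector (Algorithm 1, line 2).
I = model.encode(summary_text, output_value="token_embeddings")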
Topic Clustering
Our method requires a set of reference text topics defined as clusters T. In our experiments, the clusters T are computed using the agglomerative clustering algorithm from Pedregosa et al. (2011). Line 3 of Algorithm 1 contains this step.
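A minimal scikit-learn sketch of this clustering step follows; the distance threshold is an assumed hyperparameter, not a value reported here, and the random matrix stands in for the sentence embeddings from the encoding step.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

R = np.random.rand(12, 768)  # stand-in for reference sentence embeddings

clustering = AgglomerativeClustering(
    n_clusters=None,          # let the threshold determine the topic count
    distance_threshold=1.0,   # assumed hyperparameter
)
labels = clustering.fit_predict(R)
# Sentences sharing a label form one topic t in T (Algorithm 1, line 3).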
Furthermore, our method requires topic centroids T̄, which are computed as T̄_t = (1/|t|) Σ_{i=1}^{|t|} t_i, where t represents one topic in T. Each topic centroid is calculated by taking the average of its associated reference text sentence embeddings. As the MISEM method assumes that the importance of a topic is determined to some degree by its length, a topic weight is calculated for each topic in T. The