logue summarization.
In this work, we first categorize the most frequent factual errors in dialogue summarization into 6 types. Then, we collect fine-grained factual annotations for the human reference summaries and the outputs of 4 recent dialogue summarization systems (§3). At least two annotators are involved,
and a verification process is incorporated to ensure annotation quality. Our study on the human-annotated data suggests that over 35% of the generated dialogue summaries contain at least one factual error. Similar observations have been made in the news summarization domain, where 30%-80% of generated summaries are factually inconsistent (Cao et al., 2018; Pagnoni et al., 2021). More research attention should therefore be paid to faithful dialogue summarization.
The unavailability of faithfulness evaluation methods hinders the development of effective dialogue summarization models. In this work, we present a model-level evaluation schema, FacEval, targeting the faithfulness of dialogue summarization models (§4). First, we synthesize a set of positive and negative summaries for each dialogue with back-translation or rule-based transformations. Then, a summarization model is asked to distinguish positive from negative summaries based on their conditional generation probabilities. More correct judgments indicate that the model is more factually competent.
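As an illustration, the following is a minimal sketch of the probability-based judgment step, assuming a BART-style seq2seq summarizer from HuggingFace Transformers; the function names and the length-normalized scoring are our own illustrative assumptions, not necessarily the exact FacEval implementation.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

# Sketch: score a candidate summary by its length-normalized conditional
# log-likelihood under a seq2seq summarization model (illustrative only).
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def score_summary(dialogue: str, summary: str) -> float:
    inputs = tokenizer(dialogue, return_tensors="pt", truncation=True)
    labels = tokenizer(text_target=summary, return_tensors="pt").input_ids
    with torch.no_grad():
        # The returned loss is the mean cross-entropy over summary tokens,
        # i.e., a length-normalized negative log-likelihood.
        loss = model(**inputs, labels=labels).loss
    return -loss.item()  # higher = more probable given the dialogue

def judge(dialogue: str, positive: str, negative: str) -> bool:
    # A correct judgment: the positive summary outscores the negative one.
    return score_summary(dialogue, positive) > score_summary(dialogue, negative)
```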
To compare the model-level performance of evaluation methods, we leverage two ad-hoc training schemas to synthesize a series of models with different capability ranks. Then, the evaluation methods are used to predict the ranking of the trained models. Seven non-factual and factual evaluation methods are examined, followed by a detailed discussion of their properties. The effectiveness of FacEval is also demonstrated by its strong correlation with the factual correctness of summarization models.
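To make this comparison concrete, below is a minimal sketch of the ranking check, assuming each evaluation method assigns one score per synthesized model and that SciPy is available; the rank and score values are purely illustrative.

```python
from scipy.stats import spearmanr

# Known capability ranks of the synthesized models (1 = weakest; illustrative).
true_ranks = [1, 2, 3, 4, 5]

# Scores assigned to the same models by some evaluation method (illustrative).
method_scores = [0.41, 0.47, 0.52, 0.50, 0.61]

# A good model-level evaluation method should order the models the same
# way as their true capability, i.e., yield a high rank correlation.
rho, p_value = spearmanr(true_ranks, method_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```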
2 Related Work
2.1 Summarization Methods
Text summarization is one of the most important tasks in natural language generation (NLG). With the development of pre-trained language models, much progress has been made in abstractive text summarization (See et al., 2017; Zhang et al., 2020; Liu et al., 2022), especially in the news domain (Hermann et al., 2015; Narayan et al., 2018). With the availability of datasets (Carletta et al., 2005; Gliwa et al., 2019; Zhu et al., 2021b), dialogue summarization research has also attracted much attention. For dialogue summarization, fine-tuning pre-trained generation models such as T5 (Raffel et al., 2020), PEGASUS (Zhang et al., 2020), and BART (Lewis et al., 2020) serves as a strong baseline, with BART achieving state-of-the-art ROUGE scores. Some recent works exploit dialogue-specific properties to build more advanced summarization models. Chen and Yang (2020) and Liu et al. (2021a) incorporate conversational structures into the semantic encoding of dialogues. Since conversations involve many co-references, Liu et al. (2021b) propose injecting co-reference information into the transformer layers by adapting attention maps or through graph convolutional networks (GCNs). We include the outputs of recent dialogue summarization models in our analysis.
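For reference, a minimal sketch of such a fine-tuning baseline, assuming the HuggingFace Transformers and Datasets libraries and the SAMSum corpus (Gliwa et al., 2019); the hyperparameters and output directory here are illustrative assumptions, not the configuration of any cited system.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Illustrative BART fine-tuning on SAMSum; hyperparameters are assumptions.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

def preprocess(batch):
    enc = tokenizer(batch["dialogue"], truncation=True, max_length=1024)
    enc["labels"] = tokenizer(text_target=batch["summary"],
                              truncation=True, max_length=128).input_ids
    return enc

data = load_dataset("samsum").map(preprocess, batched=True,
                                  remove_columns=["id", "dialogue", "summary"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="bart-samsum",
                                  learning_rate=3e-5,
                                  num_train_epochs=3,
                                  per_device_train_batch_size=4),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```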
2.2 Faithfulness Analysis
Previous works identify factual consistency as a key aspect of improving text summarization (Kryscinski et al., 2020; Cao et al., 2020). The analysis of factual errors in summaries has mainly been performed in the news domain. Kryscinski et al. (2019) and Falke et al. (2019) conducted the initial crowdsourcing of binary factual annotations and found that nearly 30% of generated summaries are factually inconsistent. Recent extensions focus on more fine-grained analysis (Cao and Wang, 2021; Pagnoni et al., 2021) and on locating factual evidence at the entity level (Cao et al., 2022) or span level (Huang et al., 2020; Maynez et al., 2020; Goyal and Durrett, 2021).
Recently, CONFIT presented the first study on the faithfulness of dialogue summaries (Tang et al., 2022b). Similar to our work, they also define a taxonomy of factual errors and conduct fine-grained annotations. However, they focus on comparing reference summaries with generated summaries without referring to the whole dialogue. This is suboptimal because the reference summary cannot fully represent the entire dialogue and, according to our analysis in Section 3, can itself be incorrect. Moreover, they categorize missing and redundant information as factual errors, which we consider less appropriate. Their analysis also does not include more recent dialogue summarization models.