Analyzing and Evaluating Faithfulness in Dialogue Summarization
Bin Wang†, Chen Zhang†, Yan Zhang†, Yiming Chen†, Haizhou Li♮,†,§
†National University of Singapore, Singapore
♮The Chinese University of Hong Kong, Shenzhen, China
§Kriston AI, China
bwang28c@gmail.com
Abstract
Dialogue summarization is abstractive in nature, making it prone to factual errors. The factual correctness of summaries has the highest priority before practical application. Many efforts have been made to improve faithfulness in text summarization; however, there is a lack of systematic study on dialogue summarization systems. In this work, we first perform a fine-grained human analysis of the faithfulness of dialogue summaries and observe that over 35% of generated summaries are factually inconsistent with respect to the source dialogues. Furthermore, we present a new model-level faithfulness evaluation method. It examines generation models with multi-choice questions created by rule-based transformations. Experimental results show that our evaluation schema is a strong proxy for the factual correctness of summarization models. The human-annotated faithfulness samples and the evaluation toolkit are released to facilitate future research toward faithful dialogue summarization. Code available: https://github.com/BinWang28/FacEval¹
1 Introduction
Text summarization aims to condense a document into a short paragraph or a single sentence while conveying the core information (El-Kassas et al., 2021). It can be either extractive or abstractive. Extractive summarization methods identify salient spans in the source document and paste them together (Dorr et al., 2003; Kobayashi et al., 2015; Zhong et al., 2020). Abstractive summarization methods generate a completely new summary in a coherent manner (Paulus et al., 2018; Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2021a; Wang et al., 2022). Previous work discovered that abstractive summarization suffers from unfaithful outputs, limiting its applicability in real-world scenarios (Kryscinski et al., 2020; Falke et al., 2019; Zhu et al., 2021a; Ladhak et al., 2022).
As an essential way of exchanging information, conversations usually involve multiple participants, informal language usage, repetition, and negations (Sacks et al., 1978; Chen and Yang, 2020). Dialogue summarization is therefore vulnerable to factual issues due to its abstractive nature. Table 1 gives an example of factually incorrect dialogue summaries. The problem of factual correctness has been broadly studied for text summarization in the news and article domains (Nallapati et al., 2016; Narayan et al., 2018), primarily because of the availability of factually annotated data at both the summary and token levels (Kryscinski et al., 2020; Wang et al., 2020; Pagnoni et al., 2021; Cao et al., 2022). Many methods have been proposed to evaluate and reduce factual errors in generated summaries. However, due to the interactive nature of dialogues, we cannot simply transfer these methods to dialogue summarization.
¹ Accepted at EMNLP 2022.

Dialogue:
Freddie: Nanna, are you coming to visit us soon?
Winnie: Oh darling, Nanna has broken her leg, you'll have to visit me instead.
Freddie: I forgott. Well come soon.
Winnie: Good, ask Mummy and Daddy and they will come when they can.
Freddie: Yes love you. Leg better soon?
Winnie: Yes, quite soon. Tell mummy to ring me. Bye darling xxxxx

Summaries:
Human: Winnie has broken her leg and will not visit any time soon. Freddie will ask mummy to call Winnie up. ✓
BART: Nanna has broken her leg, so Freddie will have to visit her instead. Nanna will get better soon. ✓
MV-BART: Nanna has broken her leg and Freddie will have to visit Winnie instead. Mummy and Daddy will come to visit them soon. ✗
Coref-BART: Freddie wants to visit Winnie, but Nanna has broken her leg, so he will have to visit her instead. Mummy and Daddy will come when they can. ✗
CondigSum-BART: Winnie's Nanna has broken her leg and Freddie will have to visit her instead. ✗

Table 1: A real example from the SAMSum dataset. Spans of factual errors are marked with underline.
In this work, we first categorize the most frequently occurring factual errors in dialogue summarization into 6 types. Then, we collect fine-grained factual annotations for the human reference and the outputs of 4 recent dialogue summarization systems (§3). At least two annotators are involved, and a verification process is incorporated to ensure annotation quality. As a result, our study on the human-annotated data suggests that over 35% of the generated dialogue summaries contain at least one factual error. Similar observations have been made in the news summarization domain, where 30%-80% of generated text is factually inconsistent (Cao et al., 2018; Pagnoni et al., 2021). More research attention should be paid toward faithful dialogue summarization.
The unavailability of faithfulness evaluation methods hinders the development of effective dialogue summarization models. In this work, we present a model-level evaluation schema, FacEval, targeting dialogue summarization models' faithfulness (§4). First, we synthesize a set of positive and negative summaries for each dialogue with back-translation or rule-based transformations. Then, a summarization model is asked to distinguish positive from negative summaries based on conditional generation probabilities. More correct judgements indicate that the model is more factually competent.
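The scoring step can be sketched as follows. This is an illustrative reconstruction, not the released implementation: the `summary_logprob` function here is a stub standing in for the length-normalized conditional log-probability a real summarization model would assign to a summary given the dialogue.

```python
# Sketch of FacEval's judgement step: a model "passes" a multi-choice
# question when it assigns a higher conditional likelihood to the positive
# (faithful) summary than to a perturbed negative one.

def summary_logprob(dialogue, summary, toy_scores):
    # In practice: sum of per-token decoder log-probs from the model,
    # normalized by summary length. Stubbed with a toy lookup here.
    return toy_scores[summary] / max(len(summary.split()), 1)

def faceval_accuracy(dialogue, positive, negatives, toy_scores):
    """Fraction of negative summaries ranked below the positive one."""
    pos = summary_logprob(dialogue, positive, toy_scores)
    correct = sum(1 for neg in negatives
                  if pos > summary_logprob(dialogue, neg, toy_scores))
    return correct / len(negatives)

# Toy example: a factually competent model should prefer the consistent
# summary over its perturbed variants (scores below are invented).
toy_scores = {"A met B.": -2.0, "A avoided B.": -9.0, "C met B.": -8.0}
acc = faceval_accuracy("...", "A met B.", ["A avoided B.", "C met B."], toy_scores)
print(acc)  # 1.0 when both negatives are ranked correctly
```

Averaging this accuracy over many dialogues yields a single model-level score, with no reliance on the model's own generated output.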
To compare the model-level performance of evaluation methods, we leverage two ad-hoc training schemas to synthesize a series of models with different capability ranks. Then, the evaluation methods are used to predict the ranking of the trained models. Seven non-factual and factual evaluation methods are examined, followed by a detailed discussion of their properties. The effectiveness of FacEval is also demonstrated by its strong correlation with the factual correctness of summarization models.
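Such a model-level comparison amounts to checking rank agreement between a metric's scores and the known capability ordering, for instance with Spearman's rank correlation. The sketch below is a generic illustration (the ranks and scores are invented, and the paper may use a different correlation measure):

```python
# Comparing a metric's predicted model ranking against the ground-truth
# capability ranking with Spearman's rho (tie-free case).

def rankdata(values):
    """Ranks (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def spearman(x, y):
    """Spearman's rho for tie-free data: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Ground-truth capability ranks (higher = better) vs. a metric's scores.
true_rank = [1, 2, 3, 4]                   # synthesized models, worst to best
metric_scores = [0.41, 0.48, 0.55, 0.71]   # this metric agrees with the ranking
print(spearman(true_rank, metric_scores))  # 1.0 (perfect rank agreement)
```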
2 Related Work
2.1 Summarization Methods
Text summarization is one of the most important tasks in natural language generation (NLG). With the development of pre-trained language models, much progress has been made in abstractive text summarization (See et al., 2017; Zhang et al., 2020; Liu et al., 2022), especially in the news domain (Hermann et al., 2015; Narayan et al., 2018). With the availability of datasets (Carletta et al., 2005; Gliwa et al., 2019; Zhu et al., 2021b), dialogue summarization research has attracted a lot of attention. For dialogue summarization, fine-tuning pre-trained generation models, including T5 (Raffel et al., 2020), PEGASUS (Zhang et al., 2020), and BART (Lewis et al., 2020), serves as a strong baseline, with BART achieving SOTA performance on ROUGE scores. Some recent works consider dialogue properties to build more advanced summarization models. Chen and Yang (2020) and Liu et al. (2021a) incorporate conversational structures into the semantic encoding process of dialogue. Conversations involve many co-references; therefore, Liu et al. (2021b) propose injecting co-reference information into the transformer layers by adapting attention maps or through graph convolutional networks (GCN). We include the outputs of recent dialogue summarization models in our analysis.
2.2 Faithfulness Analysis
Previous works observe that factual consistency is one key aspect of improving text summarization (Kryscinski et al., 2020; Cao et al., 2020). The analysis of factual errors in summaries has mainly been performed in the news domain. Kryscinski et al. (2019) and Falke et al. (2019) conducted the initial crowdsourcing of binary factual annotations and found that nearly 30% of generated summaries are factually inconsistent. Recent extensions focus on more fine-grained analysis (Cao and Wang, 2021; Pagnoni et al., 2021) and on discovering factual evidence at the entity level (Cao et al., 2022) or span level (Huang et al., 2020; Maynez et al., 2020; Goyal and Durrett, 2021).

Recently, CONFIT presented the first study on the faithfulness of dialogue summaries (Tang et al., 2022b). Similar to our work, they also define a taxonomy of factual errors and conduct fine-grained annotations. However, they focus on comparing reference summaries and generated summaries without referring to the whole dialogue. This is sub-optimal because the reference summary cannot fully represent the entire dialogue, and it can also be incorrect, according to our analysis in Section 3. Besides, they categorize missing and redundant information as factual errors, which we consider less appropriate. More recent advanced dialogue summarization models are also not included in their analysis.
Dialogue (Speaker 1: Fiona, Speaker 2: Jonathan):
Fiona: What should I prepare 4 my dad's birthday?
Jonathan: How old is he?
Fiona: Turning 50.
Jonathan: Wow, a round birthday, it must be sth big.
Fiona: I know, but I don't have any idea.
Jonathan: What does he like?
Fiona: He watches a lot of military movies.
Jonathan: Well, a movie ticket is probably not what you thought of.
Fiona: No, not even close.
Jonathan: U said he likes military... maybe paintball?
Fiona: I don't know how my mum will react but I like it :D

Ref. Summary: Fiona doesn't know what she should give to her dad as a birthday gift. He likes military. Jonathan suggests a paintball match.
SubObjE: Jonathan doesn't know what she should give to her dad as a birthday gift. He likes military. Jonathan suggests a paintball match.
ProE: Fiona doesn't know what he should give to her dad as a birthday gift. He likes military. Jonathan suggests a paintball match.
NegE: Fiona doesn't know what she should give to her dad as a birthday gift. He hates military. Jonathan suggests a paintball match.
ParE: Fiona doesn't know what she should give to her dad as a Christmas gift. He likes military. Jonathan suggests a paintball match.
HalE: Fiona doesn't know what she should give to her dad as a birthday gift. He likes military. Jonathan invites Fiona to watch a military movie.

Table 2: An illustration of the taxonomy of factual error types.
2.3 Faithfulness Evaluation
The default evaluation metric for summarization, ROUGE, is based on n-gram overlap between a generated summary and the corresponding references, rendering it insensitive to factual errors. Therefore, several new metrics have been proposed to evaluate faithfulness in the news domain (Kryscinski et al., 2019; Fabbri et al., 2021; Tang et al., 2022a). There are two major groups: one based on natural language inference, and the other based on question answering. Kryscinski et al. (2020) and Goyal and Durrett (2020) propose to leverage the entailment relationship. Scialom et al. (2021) and Wang et al. (2020) use question generation, answer generation, and answer overlap as the factual consistency measure. Zhao et al. (2021) propose to evaluate the faithfulness of task-oriented dialogue summarization by counting overlapping dialogue states, which requires additional human annotations.
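To make the QA-based family concrete, the final answer-overlap step is typically a SQuAD-style token-level F1 between the answer extracted from the source and the answer extracted from the summary. A minimal sketch follows; the question-generation and QA models are omitted, and the answer strings are invented for illustration:

```python
# Token-overlap F1 between two answer strings, as used in the final
# comparison step of QA-based faithfulness metrics.
from collections import Counter

def token_f1(pred, gold):
    """SQuAD-style token-overlap F1 between two answer strings."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# Answer obtained from the source dialogue vs. from the summary:
print(token_f1("a paintball match", "a paintball match"))  # 1.0
print(token_f1("a paintball match", "a movie ticket"))     # ~0.33 ("a" overlaps)
```

A low average F1 over many generated questions signals that the summary answers them differently from the source, i.e. a likely factual inconsistency.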
3 Fine-grained Faithfulness Analysis
Previous studies of factuality analysis in summarization mainly focus on the news domain. The typology of factual errors for dialogues can be very different. Therefore, we first define a taxonomy of frequently occurring factual errors for dialogue summaries. A fine-grained analysis is then performed by measuring the factual consistency of dialogue-summary pairs.
3.1 Taxonomy of Factual Errors
We collect generated summaries from four SOTA dialogue summarization models on the popular dialogue summarization dataset SAMSum (Gliwa et al., 2019). The selected models are BART (Lewis et al., 2020), MV-BART (Chen and Yang, 2020), Coref-BART (Liu et al., 2021b), and CondigSum-BART (Liu et al., 2021a). We define the five most frequently occurring error types in dialogue summaries as below. An example of each error type is shown in Table 2.

Subject Object Error (SubObjE): The subject(s) or object(s) involved in an event is (partially) wrong. It includes the substitution, addition, or deletion of any related subject(s) or object(s).

Pronoun Error (ProE): Pronoun references occur frequently in dialogue summarization. This error includes wrong references as well as ambiguous ones that cannot be resolved from the summary alone.

Negation Error (NegE): Dialogues can contain confirmation utterances. This error means the generated summary draws wrong conclusions when contradictory or unconfirmed events are presented in the dialogue.

Particulars Error (ParE): The summary presents related events, but some details are inaccurate or faulty, such as an incorrect date, time, or location.

Hallucination Error (HalE): Generation models have an imaginative capacity that can be triggered by certain prompt words in the dialogue. The hallucination error refers to cases where the summary contains events not presented in the dialogue.

Other Error (OthE): Used to classify factual errors that do not belong to any of the above types.

Note that the above error types are not mutually exclusive; one summary may contain multiple error types.
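Error types like these lend themselves to rule-based perturbation of a faithful summary, the kind of transformation used in §4 to synthesize negative summaries. The sketch below is a deliberately simplistic illustration of that idea, not the paper's actual rules; the name-swap and verb-flip rules are placeholders:

```python
# Toy rule-based transformations that turn a faithful summary into a
# negative one exhibiting a specific error type.

def sub_obj_error(summary, speakers):
    """Swap the first two speaker names in the summary (SubObjE)."""
    a, b = speakers[0], speakers[1]
    # Route through a placeholder so the two names are exchanged,
    # not overwritten by one another.
    return summary.replace(a, "\0").replace(b, a).replace("\0", b)

def negation_error(summary, flips=(("likes", "hates"), ("will", "will not"))):
    """Flip the polarity of the first matching verb phrase (NegE)."""
    for pos, neg in flips:
        if pos in summary:
            return summary.replace(pos, neg, 1)
    return summary

s = "Fiona doesn't know what to give her dad. He likes military."
print(sub_obj_error(s, ["Fiona", "Jonathan"]))
# "Jonathan doesn't know what to give her dad. He likes military."
print(negation_error(s))
# "Fiona doesn't know what to give her dad. He hates military."
```

Real rule sets would rely on linguistic annotation (coreference chains, dependency parses) rather than string matching, but the principle is the same: each rule targets exactly one error type so the resulting negatives are controlled.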
3.2 Annotation Process
We randomly sample 150 dialogues from the test set of SAMSum. Five summaries are listed for each dialogue: the human-written one and the four model-generated summaries.