logue summarization.
In this work, we first categorize the most frequent factual errors in dialogue summarization into 6 types. Then, we collect fine-grained factual annotations for the human reference summaries and the outputs of 4 recent dialogue summarization systems (§3). At least two annotators are involved,
and a verification process is incorporated to ensure annotation quality. Our study on the human-annotated data suggests that over 35% of the generated dialogue summaries contain at least one factual error. Similar observations have been made in the news summarization domain, where 30%-80% of generated summaries are factually inconsistent (Cao et al., 2018; Pagnoni et al., 2021). More research attention should therefore be paid to faithful dialogue summarization.
The unavailability of faithfulness evaluation methods hinders the development of effective dialogue summarization models. In this work, we present a model-level evaluation schema, FacEval, targeting the faithfulness of dialogue summarization models (§4). First, we synthesize a set of positive and negative summaries for each dialogue with back-translation or rule-based transformations. Then, a summarization model is asked to distinguish positive from negative summaries based on their conditional generation probabilities. More correct judgments indicate that the model is more factually competent.
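As an illustration, the following is a minimal sketch of the probability-based judgment step, assuming a BART-style seq2seq summarizer from HuggingFace Transformers; the function names and the length-normalized scoring are our own illustrative assumptions, not necessarily the exact FacEval implementation.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

# Sketch: score a candidate summary by its length-normalized conditional
# log-likelihood under a seq2seq summarization model (illustrative only).
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def score_summary(dialogue: str, summary: str) -> float:
    inputs = tokenizer(dialogue, return_tensors="pt", truncation=True)
    labels = tokenizer(text_target=summary, return_tensors="pt").input_ids
    with torch.no_grad():
        # The returned loss is the mean cross-entropy over summary tokens,
        # i.e., a length-normalized negative log-likelihood.
        loss = model(**inputs, labels=labels).loss
    return -loss.item()  # higher = more probable given the dialogue

def judge(dialogue: str, positive: str, negative: str) -> bool:
    # A correct judgment: the positive summary outscores the negative one.
    return score_summary(dialogue, positive) > score_summary(dialogue, negative)
```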
To compare the model-level performance of evaluation methods, we leverage two ad-hoc training schemas to synthesize a series of models with different capability ranks. Then, the evaluation methods are used to predict the ranking of the trained models. Seven non-factual and factual evaluation methods are examined, followed by a detailed discussion of their properties. The effectiveness of FacEval is also demonstrated by its strong correlation with the factual correctness of summarization models.
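To make this comparison concrete, below is a minimal sketch of the ranking check, assuming each evaluation method assigns one score per synthesized model and that SciPy is available; the rank and score values are purely illustrative.

```python
from scipy.stats import spearmanr

# Known capability ranks of the synthesized models (1 = weakest; illustrative).
true_ranks = [1, 2, 3, 4, 5]

# Scores assigned to the same models by some evaluation method (illustrative).
method_scores = [0.41, 0.47, 0.52, 0.50, 0.61]

# A good model-level evaluation method should order the models the same
# way as their true capability, i.e., yield a high rank correlation.
rho, p_value = spearmanr(true_ranks, method_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```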
2 Related Work
2.1 Summarization Methods
Text summarization is one of the most important tasks in natural language generation (NLG). With the development of pre-trained language models, much progress has been made in abstractive text summarization (See et al., 2017; Zhang et al., 2020; Liu et al., 2022), especially in the news domain (Hermann et al., 2015; Narayan et al., 2018). With the availability of datasets (Carletta et al., 2005; Gliwa et al., 2019; Zhu et al., 2021b), dialogue summarization research has also attracted much attention. For dialogue summarization, fine-tuning pre-trained generation models such as T5 (Raffel et al., 2020), PEGASUS (Zhang et al., 2020), and BART (Lewis et al., 2020) serves as a strong baseline, with BART achieving state-of-the-art ROUGE scores. Some recent works exploit dialogue-specific properties to build more advanced summarization models. Chen and Yang (2020) and Liu et al. (2021a) incorporate conversational structures into the semantic encoding of dialogues. Since conversations involve many co-references, Liu et al. (2021b) propose injecting co-reference information into the transformer layers by adapting attention maps or through graph convolutional networks (GCNs). We include the outputs of recent dialogue summarization models in our analysis.
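For reference, a minimal sketch of such a fine-tuning baseline, assuming the HuggingFace Transformers and Datasets libraries and the SAMSum corpus (Gliwa et al., 2019); the hyperparameters and output directory here are illustrative assumptions, not the configuration of any cited system.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Illustrative BART fine-tuning on SAMSum; hyperparameters are assumptions.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

def preprocess(batch):
    enc = tokenizer(batch["dialogue"], truncation=True, max_length=1024)
    enc["labels"] = tokenizer(text_target=batch["summary"],
                              truncation=True, max_length=128).input_ids
    return enc

data = load_dataset("samsum").map(preprocess, batched=True,
                                  remove_columns=["id", "dialogue", "summary"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="bart-samsum",
                                  learning_rate=3e-5,
                                  num_train_epochs=3,
                                  per_device_train_batch_size=4),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```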
2.2 Faithfulness Analysis
Previous works identify factual consistency as a key aspect of improving text summarization (Kryscinski et al., 2020; Cao et al., 2020). The analysis of factual errors in summaries has mainly been performed in the news domain. Kryscinski et al. (2019) and Falke et al. (2019) conducted the initial crowdsourcing of binary factual annotations and found that nearly 30% of generated summaries are factually inconsistent. Recent extensions focus on more fine-grained analysis (Cao and Wang, 2021; Pagnoni et al., 2021) and on locating factual evidence at the entity level (Cao et al., 2022) or span level (Huang et al., 2020; Maynez et al., 2020; Goyal and Durrett, 2021).
Recently, CONFIT presented the first study on the faithfulness of dialogue summaries (Tang et al., 2022b). Similar to our work, they also define a taxonomy of factual errors and conduct fine-grained annotations. However, they focus on comparing reference summaries with generated summaries without referring to the whole dialogue. This is suboptimal because the reference summary cannot fully represent the entire dialogue and, according to our analysis in Section 3, can itself be incorrect. Moreover, they categorize missing and redundant information as factual errors, which we consider less appropriate. Their analysis also does not include more recent dialogue summarization models.