Correcting Diverse Factual Errors in Abstractive Summarization via
Post-Editing and Language Model Infilling
Vidhisha Balachandran, Hannaneh Hajishirzi,
William W. Cohen, Yulia Tsvetkov
Language Technologies Institute, Carnegie Mellon University
Allen Institute for Artificial Intelligence
Paul G. Allen School of Computer Science & Engineering, University of Washington
Google Research
vbalacha@cs.cmu.edu, hannaneh@cs.washington.edu, wcohen@google.com, yuliats@cs.washington.edu
Abstract
Abstractive summarization models often generate inconsistent summaries containing factual errors or hallucinated content. Recent works focus on correcting factual errors in generated summaries via post-editing. Such correction models are trained using adversarial non-factual summaries constructed using heuristic rules for injecting errors. However, generating non-factual summaries with heuristics often does not generalize well to actual model errors. In this work, we propose to generate hard, representative synthetic examples of non-factual summaries through infilling language models. With this data, we train a more robust fact-correction model to post-edit the summaries and improve factual consistency. Through quantitative and qualitative experiments on two popular summarization datasets, CNN/DM and XSum, we show that our approach vastly outperforms prior methods in correcting erroneous summaries. Our model, FACTEDIT, improves factuality scores by over 11 points on CNN/DM and over 31 points on XSum on average across multiple summarization models, producing more factual summaries while maintaining competitive summarization quality.1
1 Introduction
While modern summarization models generate highly fluent summaries that appear realistic (Lewis et al., 2020; Zhang et al., 2020), these models are prone to generating non-factual and sometimes entirely fabricated content (Cao et al., 2018; Goodrich et al., 2019; Maynez et al., 2020). With the increasing adoption of language generation tools in user-facing products, such unreliability poses severe risks, including the spread of misinformation, panic and other potentially harmful effects (Ranade et al., 2021; Hutson et al., 2021).
1 Code and data available at https://github.com/vidhishanair/FactEdit.
[Figure 1 here: a source passage about the Ebola and COVID-19 vaccines, a model-generated summary containing an incorrect entity, an incorrect predicate, and a hallucination, and the corrected summary produced by error correction.]
Figure 1: Model-generated summaries often produce content which is factually inconsistent w.r.t. the source. FACTEDIT rewrites these summaries by maintaining the abstractiveness but correcting factual errors.
Since it is difficult to control for factuality at training or inference time (Huang et al., 2021; Dreyer et al., 2021), a popular approach to fix the factual inconsistencies is via post-editing generated summaries (Cao et al., 2020; Dong et al., 2020). This allows summarization models to focus on fluency and content relevance while improving factual consistency. However, there is no suitable data for training post-editing models to directly “translate” an incorrect summary to a correct one. Prior work constructed synthetic training data by introducing simple heuristic errors like replacing entities or numbers in reference summaries (Cao et al., 2020), but it is not clear whether such synthetic errors have sufficient coverage and accurately represent the types and distribution of actual errors made by language models. Further, with increasing language generation capabilities, models make more complex factual errors involving discourse structures and paraphrasing which cannot be easily captured with heuristics (Pagnoni et al., 2021). The goal of our work is to develop post-editing models that generalize over a wider range of factual errors (example in Figure 1) in generated summaries from diverse summarization model types.

[Figure 2 here: the infilling language model MI takes a masked reference summary (smasked) [SEP] source context (ctx) as input, and its lower-ranked beam-search candidates yield incorrect summary candidates (r′) used as training data; the fact correction model MC takes an incorrect summary sentence [SEP] summary context (g′) [SEP] relevant source passages as input and outputs the corrected summary sentence (g).]
Figure 2: Architecture framework for FACTEDIT. Using masked versions of existing reference summaries, we use an infilling language model to produce alternative candidates for the mask position. We construct factually incorrect summaries by replacing the mask with the lower-ranked candidates. Finally, we train a sequence-to-sequence model for fact correction using the synthetically constructed data.
We propose FACTEDIT, a novel approach to post-editing text to control for content factuality in generated summaries. Rather than manually defining a list of heuristic errors, it incorporates a new algorithm to generate adversarial (non-factual) examples using infilling language models (Donahue et al., 2020). We use lower-ranked beam-search candidates from the language model as a source of potentially factually incorrect summary facts, thereby producing a set of plausible, likely, and fluent, yet incorrect, synthetic summaries for a particular correct reference summary. In this way, we leverage the capabilities of large language models to produce multiple candidates of alternative, erroneous summaries. These examples, along with factually correct references, are then used to train a sequence-to-sequence fact-correction model that aims at generating a factually consistent version of the candidate summary (§2).
We evaluate FACTEDIT on two datasets, CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018), and across nine summarization models with the FRANK benchmark (Pagnoni et al., 2021) for evaluating various categories of factual errors in generated summaries (§3). The two summarization datasets represent varied distributions of factual errors in models trained on them and hence constitute a good test bed to evaluate the generalizability of our model. We show that FACTEDIT substantially improves factuality scores across two metrics, Ent-DAE (Goyal and Durrett, 2021) and FactCC (Kryscinski et al., 2020). On the Ent-DAE metric, FACTEDIT improves results by 11 points (CNN/DM) and 31 points (XSum), and on the FactCC metric we show improvements of 6 points (CNN/DM) and 24 points (XSum) on average across models (§4). Further, our analysis shows that FACTEDIT effectively corrects diverse error categories without the need for special heuristics or annotations (§5). An important application of FACTEDIT is to audit summarization systems and facilitate their reliability.
2 Model
Assume a summarization model trained to process a document d and generate a coherent and fluent summary g′,2 which has been shown to often misrepresent facts from the document. FACTEDIT is a fact correction model MC which takes the generated summary g′ and the document d, identifies factual errors and generates a rewritten summary g by correcting them (as outlined in Figure 2).

2 We denote incorrect input summaries (to the fact correction model) using ′ and corrected output summaries (from the fact correction model) without the ′ throughout this paper. E.g., g′ is the incorrect generated summary and r′ is the incorrect reference summary, while g is the corrected summary and r is the corrected reference summary.
We present an adversarial data generation approach which leverages the power of pre-trained language models to produce fluent and complex factually incorrect summaries. We train an infilling language model MI using documents from summarization training data and use the model to introduce diverse factual errors in sentences from them (§2.1). Using the trained model, we introduce factual errors into the reference summaries r of the training data, producing incorrect summaries r′ and resulting in a synthetic dataset {r′, r, d}train of erroneous summaries mapped to their corrected versions (pink section in Figure 2). We train a sequence-to-sequence model MC for factual error correction using the generated synthetic data (§2.2). Finally, we use the trained correction model to rewrite model-generated summaries g′, producing a corrected version g (§2.3; green section in Figure 2).
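To make the overall flow concrete, the skeleton below sketches these stages in Python. It is a structural sketch only: every function name and type is a hypothetical placeholder standing in for the actual training and generation code, not the released implementation.

```python
# Structural sketch of the FACTEDIT pipeline; all names below are illustrative
# placeholders, and each stage is a stub standing in for real training code.
from typing import List, Tuple

Document = str
Summary = str

def train_infilling_model(pairs: List[Tuple[Document, Summary]]):
    """Sec. 2.1: train M_I to fill masked spans given s_masked [SEP] ctx."""
    raise NotImplementedError

def make_adversarial_dataset(m_i, pairs: List[Tuple[Document, Summary]]):
    """Sec. 2.1: corrupt references r into r' using lower-ranked beam candidates
    of M_I, yielding {r', r, d} triples (with ~20% kept as r' = r)."""
    raise NotImplementedError

def train_correction_model(triples):
    """Sec. 2.2: train the seq2seq fact-correction model M_C on {r', r, d}."""
    raise NotImplementedError

def post_edit(m_c, document: Document, generated_summary: Summary) -> Summary:
    """Sec. 2.3: rewrite a model-generated summary g' into a corrected summary g."""
    raise NotImplementedError

def build_and_apply_factedit(train_pairs, document, generated_summary):
    """End-to-end flow: M_I -> synthetic {r', r, d} -> M_C -> post-edited summary."""
    m_i = train_infilling_model(train_pairs)
    synthetic = make_adversarial_dataset(m_i, train_pairs)
    m_c = train_correction_model(synthetic)
    return post_edit(m_c, document, generated_summary)
```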
2.1 Infilling Data Generator MI
Our data generation process leverages infilling language models (Donahue et al., 2020) to produce candidates to fill masked phrases in a summary sentence. We mask parts of the input and use the infilling model to generate multiple candidates for the masked position. We then use lower-ranked beam candidates as potentially incorrect fillers to generate an incorrect version of the input. We hypothesize that, given the relevant context of a source document, a strong language model generates relevant and factual sequences at higher probabilities than less relevant alternatives. For the infilling model, we hypothesize that the lower-ranked candidates are often alternative phrases of similar types (in the case of entities) or parts of speech which are plausible but often not factually correct. Motivated by prior work (Goyal and Durrett, 2020) using lower-ranked beam-search candidates as a source of adversarial data, we use the lower-ranked candidates to construct erroneous summaries from reference summaries.
Training: Our infilling model MI is trained to take a masked sentence smasked and its relevant context ctx as input and generate a correct phrase to fill in the masked span. To train MI, we construct a dataset using documents d from the training data of existing summarization datasets. For each sentence s in the first-k (k=5) positional sentences of a document d, we identify the subjects, objects and relations {sub, obj, rel} in them using OpenIE (Banko et al., 2007). By iteratively masking each phrase p in {sub, obj, rel}, we create a masked query smasked and its corresponding context ctx by removing the masked sentence from the document, resulting in our training data {smasked, p, ctx}, where p is the masked span text. We train a sequence-to-sequence model MI on this data which takes smasked [SEP] ctx as input and learns to generate p as the output. We intentionally use only sentences from the document as masked queries and do not use sentences from the reference summaries, to ensure that the model does not memorize phrases from the references. Thus, when applied to unseen reference sentences during inference, the model produces richer beam-search candidates.
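As a rough illustration of this data construction, a minimal sketch follows; extract_triples is a placeholder for an OpenIE system, and the exact input format is an assumption rather than the paper's preprocessing code.

```python
from typing import List, Tuple

MASK, SEP = "[MASK]", "[SEP]"

def extract_triples(sentence: str) -> List[Tuple[str, str, str]]:
    """Placeholder for an OpenIE system returning (subject, relation, object) phrases."""
    raise NotImplementedError

def make_infilling_examples(document_sentences: List[str], k: int = 5):
    """Build (s_masked [SEP] ctx, p) training pairs for the infilling model M_I."""
    examples = []
    for i, sentence in enumerate(document_sentences[:k]):        # first-k sentences only
        for triple in extract_triples(sentence):                  # (sub, rel, obj)
            for phrase in triple:                                 # mask one phrase at a time
                if phrase not in sentence:
                    continue
                s_masked = sentence.replace(phrase, MASK, 1)
                # Context: the document with the masked sentence removed.
                ctx = " ".join(s for j, s in enumerate(document_sentences) if j != i)
                examples.append((f"{s_masked} {SEP} {ctx}", phrase))
    return examples
```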
Adversarial Data Generation: We use the trained infilling model to generate the synthetic dataset for fact correction using the document-reference pairs {d, r}train from the summarization training data. For each sentence sr in the reference, we use OpenIE to extract {sub, obj, rel} and iteratively mask one phrase at a time to construct masked sentences smasked from the references. We provide this masked reference summary sentence and the document d as input to the model and perform beam-search decoding for generation. We then consider lower-ranked beam candidates (rank=[5,15])3 as non-factual alternatives for the corresponding masked phrase. We then use these candidates as replacements for the mask, producing an erroneous summary r′. Running this on the {d, r}train training data, we construct a synthetic dataset {r′, r, d}train of factually incorrect summaries paired with their correct versions, where r′ and r differ by an incorrect phrase. To train the model to not perform any corrections on factual summaries, we keep original reference summaries for 20% of the data points (r′ = r).
3 We chose this range of ranks based on a manual analysis of 500 generated adversarial examples, where our method produced factually incorrect replacements over 90% of the time.
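To illustrate how lower-ranked beam candidates can be harvested in practice, the sketch below uses the Hugging Face transformers generation API; the checkpoint path, helper names, and decoding hyperparameters are assumptions and not the paper's released code.

```python
import random
from typing import Optional
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder path to an infilling model M_I fine-tuned as in Sec. 2.1 (assumption).
tokenizer = AutoTokenizer.from_pretrained("path/to/infilling-model")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/infilling-model")

def corrupt_reference(s_masked: str, masked_phrase: str, document: str,
                      low: int = 5, high: int = 15) -> Optional[str]:
    """Return a likely non-factual variant of a masked reference sentence, or None."""
    inputs = tokenizer(f"{s_masked} [SEP] {document}", return_tensors="pt",
                       truncation=True, max_length=1024)
    # Beam-search decode and keep all beams so lower-ranked candidates are available.
    outputs = model.generate(**inputs, num_beams=high + 1,
                             num_return_sequences=high + 1, max_new_tokens=32)
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Candidates at rank 5-15 that differ from the original phrase serve as
    # plausible but likely incorrect replacements for the mask.
    for candidate in candidates[low:high + 1]:
        if candidate.strip() and candidate.strip() != masked_phrase:
            return s_masked.replace("[MASK]", candidate.strip(), 1)
    return None

def keep_original(rate: float = 0.2) -> bool:
    """With probability ~0.2, keep r' = r so the corrector learns to leave factual text alone."""
    return random.random() < rate
```

In a full data-generation pass, corrupt_reference would be applied to every masked reference sentence in {d, r}train, pairing each successful corruption r′ with its original r and source document d.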