Correcting Diverse Factual Errors in Abstractive Summarization via
Post-Editing and Language Model Infilling
Vidhisha Balachandran, Hannaneh Hajishirzi,
William W. Cohen, Yulia Tsvetkov
Language Technologies Institute, Carnegie Mellon University
Allen Institute for Artificial Intelligence
Paul G. Allen School of Computer Science & Engineering, University of Washington
Google Research
vbalacha@cs.cmu.edu, hannaneh@cs.washington.edu, wcohen@google.com, yuliats@cs.washington.edu
Abstract
Abstractive summarization models often generate inconsistent summaries containing factual errors or hallucinated content. Recent works focus on correcting factual errors in generated summaries via post-editing. Such correction models are trained using adversarial non-factual summaries constructed using heuristic rules for injecting errors. However, generating non-factual summaries with heuristics often does not generalize well to actual model errors. In this work, we propose to generate hard, representative synthetic examples of non-factual summaries through infilling language models. With this data, we train a more robust fact-correction model to post-edit the summaries and improve factual consistency. Through quantitative and qualitative experiments on two popular summarization datasets, CNN/DM and XSum, we show that our approach vastly outperforms prior methods in correcting erroneous summaries. Our model, FACTEDIT, improves factuality scores by over 11 points on CNN/DM and over 31 points on XSum on average across multiple summarization models, producing more factual summaries while maintaining competitive summarization quality.1
1 Introduction
While modern summarization models generate highly fluent summaries that appear realistic (Lewis et al., 2020; Zhang et al., 2020), these models are prone to generating non-factual and sometimes entirely fabricated content (Cao et al., 2018; Goodrich et al., 2019; Maynez et al., 2020). With the increasing adoption of language generation tools in user-facing products, such unreliability poses severe risks, including the spread of misinformation, panic and other potentially harmful effects (Ranade et al., 2021; Hutson et al., 2021).
1 Code and data available at https://github.com/vidhishanair/FactEdit.
[Figure 1 here: a source passage about the Ebola and COVID-19 vaccines, a model-generated summary containing an incorrect entity, an incorrect predicate, and a hallucination, and the corrected summary produced by error correction.]
Figure 1: Model-generated summaries often produce content which is factually inconsistent w.r.t. the source. FACTEDIT rewrites these summaries by maintaining the abstractiveness but correcting factual errors.
Since it is difficult to control for factuality at training or inference time (Huang et al., 2021; Dreyer et al., 2021), a popular approach to fix the factual inconsistencies is via post-editing generated summaries (Cao et al., 2020; Dong et al., 2020). This allows summarization models to focus on fluency and content relevance while improving factual consistency. However, there is no suitable data for training post-editing models to directly “translate” an incorrect summary to a correct one. Prior work constructed synthetic training data by introducing simple heuristic errors like replacing entities or numbers in reference summaries (Cao et al., 2020), but it is not clear whether such synthetic errors have sufficient coverage and accurately represent the types and distribution of actual errors made by language models. Further, with increasing language generation capabilities, models make more complex factual errors involving discourse structures and paraphrasing which cannot be easily captured with heuristics (Pagnoni et al., 2021). The goal of our work is to develop post-editing models that generalize over a wider range of factual errors (example in Figure 1) in generated summaries from diverse summarization model types.

[Figure 2 here: the infilling language model MI takes a masked reference summary (smasked) [SEP] source context (ctx) as input, and its lower-ranked beam-search candidates yield incorrect summary candidates (r′) used as training data; the fact correction model MC takes an incorrect summary sentence [SEP] summary context (g′) [SEP] relevant source passages as input and outputs the corrected summary sentence (g).]
Figure 2: Architecture framework for FACTEDIT. Using masked versions of existing reference summaries, we use an infilling language model to produce alternative candidates for the mask position. We construct factually incorrect summaries by replacing the mask with the lower-ranked candidates. Finally, we train a sequence-to-sequence model for fact correction using the synthetically constructed data.
We propose FACTEDIT, a novel approach to post-editing text to control for content factuality in generated summaries. Rather than manually defining a list of heuristic errors, it incorporates a new algorithm to generate adversarial (non-factual) examples using infilling language models (Donahue et al., 2020). We use lower-ranked beam-search candidates from the language model as a source of potentially factually incorrect summary facts, thereby producing a set of plausible, likely, and fluent, yet incorrect, synthetic summaries for a particular correct reference summary. In this way, we leverage the capabilities of large language models to produce multiple candidates of alternative, erroneous summaries. These examples, along with factually correct references, are then used to train a sequence-to-sequence fact-correction model that aims at generating a factually consistent version of the candidate summary (§2).
We evaluate FACTEDIT on two datasets, CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018), and across nine summarization models with the FRANK benchmark (Pagnoni et al., 2021) for evaluating various categories of factual errors in generated summaries (§3). The two summarization datasets represent varied distributions of factual errors in models trained on them and hence constitute a good test bed to evaluate the generalizability of our model. We show that FACTEDIT substantially improves factuality scores across two metrics, Ent-DAE (Goyal and Durrett, 2021) and FactCC (Kryscinski et al., 2020). On the Ent-DAE metric, FACTEDIT improves results by 11 points (CNN/DM) and 31 points (XSum), and on the FactCC metric we show improvements of 6 points (CNN/DM) and 24 points (XSum) on average across models (§4). Further, our analysis shows that FACTEDIT effectively corrects diverse error categories without the need for special heuristics or annotations (§5). An important application of FACTEDIT is to audit summarization systems and facilitate their reliability.
2 Model
Assume a summarization model trained to process a document d and generate a coherent and fluent summary g′,2 which has been shown to often misrepresent facts from the document. FACTEDIT is a fact correction model MC which takes the generated summary g′ and the document d, identifies factual errors and generates a rewritten summary g by correcting them (as outlined in Figure 2).

2 We denote incorrect input summaries (to the fact correction model) using ′ and corrected output summaries (from the fact correction model) without the ′ throughout this paper. E.g., g′ is the incorrect generated summary and r′ is the incorrect reference summary, while g is the corrected summary and r is the corrected reference summary.
We present an adversarial data generation approach which leverages the power of pre-trained language models to produce fluent and complex factually incorrect summaries. We train an infilling language model MI using documents from summarization training data and use the model to introduce diverse factual errors in sentences from them (§2.1). Using the trained model, we introduce factual errors into the reference summaries r of the training data, producing incorrect summaries r′ and resulting in a synthetic dataset {r′, r, d}train of erroneous summaries mapped to their corrected versions (pink section in Figure 2). We train a sequence-to-sequence model MC for factual error correction using the generated synthetic data (§2.2). Finally, we use the trained correction model to rewrite model-generated summaries g′, producing a corrected version g (§2.3; green section in Figure 2).
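To make the overall flow concrete, the skeleton below sketches these stages in Python. It is a structural sketch only: every function name and type is a hypothetical placeholder standing in for the actual training and generation code, not the released implementation.

```python
# Structural sketch of the FACTEDIT pipeline; all names below are illustrative
# placeholders, and each stage is a stub standing in for real training code.
from typing import List, Tuple

Document = str
Summary = str

def train_infilling_model(pairs: List[Tuple[Document, Summary]]):
    """Sec. 2.1: train M_I to fill masked spans given s_masked [SEP] ctx."""
    raise NotImplementedError

def make_adversarial_dataset(m_i, pairs: List[Tuple[Document, Summary]]):
    """Sec. 2.1: corrupt references r into r' using lower-ranked beam candidates
    of M_I, yielding {r', r, d} triples (with ~20% kept as r' = r)."""
    raise NotImplementedError

def train_correction_model(triples):
    """Sec. 2.2: train the seq2seq fact-correction model M_C on {r', r, d}."""
    raise NotImplementedError

def post_edit(m_c, document: Document, generated_summary: Summary) -> Summary:
    """Sec. 2.3: rewrite a model-generated summary g' into a corrected summary g."""
    raise NotImplementedError

def build_and_apply_factedit(train_pairs, document, generated_summary):
    """End-to-end flow: M_I -> synthetic {r', r, d} -> M_C -> post-edited summary."""
    m_i = train_infilling_model(train_pairs)
    synthetic = make_adversarial_dataset(m_i, train_pairs)
    m_c = train_correction_model(synthetic)
    return post_edit(m_c, document, generated_summary)
```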
2.1 Infilling Data Generator MI
Our data generation process leverages infilling language models (Donahue et al., 2020) to produce candidates to fill masked phrases in a summary sentence. We mask parts of the input and use the infilling model to generate multiple candidates for the masked position. We then use lower-ranked beam candidates as potentially incorrect fillers to generate an incorrect version of the input. We hypothesize that, given the relevant context of a source document, a strong language model generates relevant and factual sequences at higher probabilities than less relevant alternatives. For the infilling model, we hypothesize that the lower-ranked candidates are often alternative phrases of similar types (in the case of entities) or parts of speech which are plausible but often not factually correct. Motivated by prior work (Goyal and Durrett, 2020) using lower-ranked beam-search candidates as a source of adversarial data, we use the lower-ranked candidates to construct erroneous summaries from reference summaries.
Training: Our infilling model MI is trained to take a masked sentence smasked and its relevant context ctx as input and generate a correct phrase to fill in the masked span. To train MI, we construct a dataset using documents d from the training data of existing summarization datasets. For each sentence s in the first-k (k=5) positional sentences of a document d, we identify the subjects, objects and relations {sub, obj, rel} in them using OpenIE (Banko et al., 2007). By iteratively masking each phrase p in {sub, obj, rel}, we create a masked query smasked and its corresponding context ctx by removing the masked sentence from the document, resulting in our training data {smasked, p, ctx}, where p is the masked span text. We train a sequence-to-sequence model MI on this data which takes smasked [SEP] ctx as input and learns to generate p as the output. We intentionally use only sentences from the document as masked queries and do not use sentences from the reference summaries, to ensure that the model does not memorize phrases from the references. Thus, when applied to unseen reference sentences during inference, the model produces richer beam-search candidates.
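As a rough illustration of this data construction, a minimal sketch follows; extract_triples is a placeholder for an OpenIE system, and the exact input format is an assumption rather than the paper's preprocessing code.

```python
from typing import List, Tuple

MASK, SEP = "[MASK]", "[SEP]"

def extract_triples(sentence: str) -> List[Tuple[str, str, str]]:
    """Placeholder for an OpenIE system returning (subject, relation, object) phrases."""
    raise NotImplementedError

def make_infilling_examples(document_sentences: List[str], k: int = 5):
    """Build (s_masked [SEP] ctx, p) training pairs for the infilling model M_I."""
    examples = []
    for i, sentence in enumerate(document_sentences[:k]):        # first-k sentences only
        for triple in extract_triples(sentence):                  # (sub, rel, obj)
            for phrase in triple:                                 # mask one phrase at a time
                if phrase not in sentence:
                    continue
                s_masked = sentence.replace(phrase, MASK, 1)
                # Context: the document with the masked sentence removed.
                ctx = " ".join(s for j, s in enumerate(document_sentences) if j != i)
                examples.append((f"{s_masked} {SEP} {ctx}", phrase))
    return examples
```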
Adversarial Data Generation: We use the trained infilling model to generate the synthetic dataset for fact correction using the document-reference pairs {d, r}train from the summarization training data. For each sentence sr in the reference, we use OpenIE to extract {sub, obj, rel} and iteratively mask one phrase at a time to construct masked sentences smasked from the references. We provide this masked reference summary sentence and the document d as input to the model and perform beam-search decoding for generation. We then consider lower-ranked beam candidates (rank=[5,15])3 as non-factual alternatives for the corresponding masked phrase. We then use these candidates as replacements for the mask, producing an erroneous summary r′. Running this on the {d, r}train training data, we construct a synthetic dataset {r′, r, d}train of factually incorrect summaries paired with their correct versions, where r′ and r differ by an incorrect phrase. To train the model to not perform any corrections on factual summaries, we keep original reference summaries for 20% of the data points (r′ = r).
3 We chose this range of ranks based on a manual analysis of 500 generated adversarial examples, where our method produced factually incorrect replacements over 90% of the time.
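To illustrate how lower-ranked beam candidates can be harvested in practice, the sketch below uses the Hugging Face transformers generation API; the checkpoint path, helper names, and decoding hyperparameters are assumptions and not the paper's released code.

```python
import random
from typing import Optional
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder path to an infilling model M_I fine-tuned as in Sec. 2.1 (assumption).
tokenizer = AutoTokenizer.from_pretrained("path/to/infilling-model")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/infilling-model")

def corrupt_reference(s_masked: str, masked_phrase: str, document: str,
                      low: int = 5, high: int = 15) -> Optional[str]:
    """Return a likely non-factual variant of a masked reference sentence, or None."""
    inputs = tokenizer(f"{s_masked} [SEP] {document}", return_tensors="pt",
                       truncation=True, max_length=1024)
    # Beam-search decode and keep all beams so lower-ranked candidates are available.
    outputs = model.generate(**inputs, num_beams=high + 1,
                             num_return_sequences=high + 1, max_new_tokens=32)
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Candidates at rank 5-15 that differ from the original phrase serve as
    # plausible but likely incorrect replacements for the mask.
    for candidate in candidates[low:high + 1]:
        if candidate.strip() and candidate.strip() != masked_phrase:
            return s_masked.replace("[MASK]", candidate.strip(), 1)
    return None

def keep_original(rate: float = 0.2) -> bool:
    """With probability ~0.2, keep r' = r so the corrector learns to leave factual text alone."""
    return random.random() < rate
```

In a full data-generation pass, corrupt_reference would be applied to every masked reference sentence in {d, r}train, pairing each successful corruption r′ with its original r and source document d.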