Universal Evasion Attacks on Summarization Scoring
Wenchuan Mu Kwan Hui Lim
Singapore University of Technology and Design
{wenchuan_mu,kwanhui_lim}@sutd.edu.sg
Abstract

The automatic scoring of summaries is important as it guides the development of summarizers. Scoring is also complex, as it involves multiple aspects such as fluency, grammar, and even textual entailment with the source text. However, summary scoring has not been treated as a machine learning task whose accuracy and robustness can themselves be studied. In this study, we place automatic scoring in the context of regression machine learning tasks and perform evasion attacks to explore its robustness. The attack systems predict a non-summary string from each input, and these non-summary strings achieve scores competitive with good summarizers on the most popular metrics: ROUGE, METEOR, and BERTScore. The attack systems also "outperform" state-of-the-art summarization methods on ROUGE-1 and ROUGE-L, and score second-highest on METEOR. Furthermore, we observe a BERTScore backdoor: a simple trigger can score higher than any automatic summarization method. The evasion attacks in this work indicate the low robustness of current scoring systems at the system level. We hope that our highlighting of these proposed attacks will facilitate the development of summary scores.
1 Introduction

A long-standing paradox has plagued the task of automatic summarization. On the one hand, for about 20 years, no automatic score has been available that is a sufficient or necessary condition for demonstrating summary quality, such as adequacy, grammaticality, cohesion, or fidelity. On the other hand, contemporaneous research more often uses one or several automatic scores to endorse a summarizer as state-of-the-art. More than 90% of works on neural models for language generation choose automatic scoring as the main basis of evaluation, and about half of them rely on automatic scoring only (van der Lee et al., 2021). However, these
Figure 1: Automatic summarization (left) and automatic scoring (right) should be considered as two systems of the same rank, representing conditional language generation and natural language understanding, respectively. As a stand-alone system, the accuracy and robustness of automatic scoring are also important. In this study, we create systems that use bad summaries to fool existing scoring systems. This work shows that optimizing towards a flawed scoring does more harm than good, and flawed scoring methods are not able to indicate the true performance of summarizers, even at a system level.
scoring methods have been found to be insufficient (Novikova et al., 2017), oversimplified (van der Lee et al., 2021), difficult to interpret (Sai et al., 2022), inconsistent with the way humans assess summaries (Rankel et al., 2013; Böhm et al., 2019), or even contradictory to each other (Gehrmann et al., 2021; Bhandari et al., 2020).
Why do we have to deal with this paradox? The current work does not suggest that summarizers assessed by automatic scoring are de facto ineffective. However, optimizing for flawed evaluations (Gehrmann et al., 2021; Peyrard et al., 2017), directly or indirectly, ultimately harms the development of automatic summarization (Narayan et al., 2018; Kryscinski et al., 2019; Paulus et al., 2018). One of the most likely drawbacks is shortcut learning (surface learning, Geirhos et al., 2020), where summarizing models may fail to generate text with more widely accepted qualities such as adequacy
and authenticity, but instead produce merely pleasing scores. Here, we quote and adapt¹ this hypothetical story by Geirhos et al.:
"Alice loves literature. Always has, probably always will. At this very moment, however, she is cursing the subject: After spending weeks immersing herself in the world of Shakespeare's The Tempest, she is now faced with a number of exam questions that are (in her opinion) to equal parts dull and difficult. 'How many times is Duke of Milan addressed'... Alice notices that Bob, sitting in front of her, seems to be doing very well. Bob of all people, who had just boasted how he had learned the whole book chapter by rote last night..."
According to Geirhos et al., Bob might get better grades and consequently be considered a better student than Alice, which is an example of surface learning. The same could happen with automatic summarization, where we might end up with significant differences between expected and actual learning outcomes (Paulus et al., 2018). To avoid going astray, it is important to ensure that the objective is correct.
In addition to understanding the importance of correct justification, we also need to know what caused the fallacy in the justification process for these potentially useful summarizers. There are three mainstream speculations, which are not mutually exclusive. (1) The difficulty of the transition from extractive summarization to abstractive summarization (Kryscinski et al., 2019) could have been underestimated. For example, the popular score ROUGE (Lin, 2004) was originally used to judge the ranking of sentences selected from documents. Due to constraints on sentence integrity, the generated summaries could always be fluent and undistorted, except sometimes when anaphora was involved. However, when it comes to free-form language generation, sentence integrity is no longer guaranteed, but the metric continues to be used. (2) Many metrics, while flawed in judging individual summaries, often make sense at the system level (Reiter, 2018; Gehrmann et al., 2021; Böhm et al., 2019). In other words, it might have been believed that few summarization systems can consistently output poor-quality but high-scoring strings. (3) Researchers have not figured out how humans interpret or understand texts (van der Lee et al., 2021; Gehrmann et al., 2021; Schluter, 2017), so the decision about how good a summary really is varies from person to person, let alone automated scoring. In fact, automatic scoring is more of a natural language understanding (NLU) task, a task that is far from solved. From this viewpoint, automatic scoring itself is fairly challenging.

¹We underline adaptations.
Nevertheless, the current work does not advocate (and certainly does not disparage) human evaluation. Instead, we argue that automatic scoring is not just a sub-module of automatic summarization, but a stand-alone system that needs to be studied for its own accuracy and robustness. The primary reason is that NLU is clearly required to characterize summary quality, e.g., semantic similarity to determine adequacy (Morris, 2020), or textual entailment (Dagan et al., 2006) to determine fidelity. Besides, summary scoring is similar to automated essay scoring (AES), a 50-year-old task measuring grammaticality, cohesion, relevance, etc. of written texts (Ke and Ng, 2019). Moreover, recent advances in automatic scoring also support this argument well. Automatic scoring is gradually transitioning from well-established metrics measuring N-gram overlap (BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), etc.) to emerging metrics that compute semantic similarity through pre-trained neural models (BERTScore (Zhang et al., 2019b), MoverScore (Zhao et al., 2019), BLEURT (Sellam et al., 2020), etc.). These emerging scores exhibit two characteristics that stand-alone machine learning systems typically have: some can be fine-tuned to human cognition, and they still have room to improve and must still learn how to match human ratings.
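The overlap-only nature of the lexical metrics above can be made concrete with a small sketch. The function below is a simplified ROUGE-1 F1 (lowercased whitespace tokens, no stemming; the official implementation differs in details) and shows why pure unigram overlap cannot distinguish a fluent summary from a word salad built out of reference tokens; the example sentences are hypothetical:

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F1, the core idea of ROUGE-1 (simplified sketch)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat and purred"
fluent = "a cat was sitting on a mat"
word_salad = "purred mat the and on sat cat the"  # same words, no grammar

print(rouge1_f(reference, fluent))      # 0.4: readable, but penalized
print(rouge1_f(reference, word_salad))  # 1.0: overlap cannot see word order
```

Stemming and the bigram/LCS variants of real ROUGE do not restore sensitivity to grammar, which is consistent with the Broken row of Table 1.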
Machine learning systems can be attacked, and attacks can help improve their generality, robustness, and interpretability. In particular, evasion attacks are an intuitive way to further expose the weaknesses of current automatic scoring systems. The evasion attack is the parent task of the adversarial attack; it aims to make the system fail to correctly identify the input, and thus calls for defence against the exposed vulnerabilities.
In this work, we try to answer two questions: do current representative automatic scoring systems really work well at the system level, and how hard is it to show that they do not? In summary, we make the following major contributions in this study:
- We are the first to treat automatic summarization scoring as an NLU regression task and to perform evasion attacks against it.
- We are the first to perform a universal, targeted attack on NLP regression models.
- Our evasion attacks show that it is not difficult to deceive the three most popular automatic scoring systems simultaneously.
- The proposed attacks can be directly applied to test emerging scoring systems.

Document (abridged): Andrew Flintoff fears Kevin Pietersen is 'running out of time' to resurrect his England career. The dual Ashes-winning all-rounder is less convinced, however, about Pietersen's prospects of forcing his way back into Test contention. Kevin Pietersen scored 170 for Surrey in The Parks as he bids to earn a recall to the England squad... ... Flintoff senses he no longer has age on his side. Pietersen has not featured for England since he was unceremoniously sacked 14 months ago. ... ... Flintoff said ... 'If he'd started the season last year with Surrey, and scored run after run and put himself in the position... whereas now I think he's looking at the Ashes ... ... you get the sense everyone within the England set-up wants him as captain,' he said. ... The former England star is hoping to win back his Test place with a return to red ball cricket. ... ... 'this stands up as a competition.

Scores for each system's summary are reported as (ROUGE-1, ROUGE-2, ROUGE-L, METEOR, BERTScore).

Gold: Kevin Pietersen was sacked by England 14 months ago after Ashes defeat. Batsman scored 170 on his county cricket return for Surrey last week. Pietersen wants to make a sensational return to the England side this year. But Andrew Flintoff thinks time is running out for him to resurrect career.

Good (Liu and Liu, 2021): Kevin pietersen scored 170 for surrey against mccu oxford. Former england star andrew flintoff fears pietersen is 'running out of time' to resurrect his england career. Pietersen has been surplus to requirements since being sacked 14 months ago. Flintoff sees a bright future for 'probably the premier tournament' in this country. (55.45, 18.18, 41.58, 40.03, 85.56)

Broken: Andrew Flintoff fears Kevin Pietersen is running out of time to resurrect his England career Flintoff. Pietersen scored 170 for Surrey in The. Former England star Andrew. batsman has been . since being sacked 14 months ago after. three in the. the Ashes and he s. (56.84, 21.51, 44.21, 47.26, 85.95)

A dot: . (0, 0, 0, 0, 88.47)

Scrambled code: \x03\x18$\x18...\x03$\x03|...\x0f\x01<<$$\x04...\x0e \x04#$...\x0f\x0f\x0f...\x0e...\x0f...\x0f\x0f$\x0f \x04\x0f\x0f (many tokens omitted) (0, 0, 0, 0, 87.00)

Scrambled code + broken: \x03\x18$\x18...\x03$\x03|...\x0f\x01<<$$\x04...\x0e \x04#$...\x0f\x0f\x0f...\x0e...\x0f...\x0f\x0f$\x0f \x04\x0f\x0f... Andrew Flintoff fears Kevin Pietersen is running out of time to resurrect his England career Flintoff. Pietersen scored 170 for Surrey in The. Former England star Andrew. batsman has been . since being sacked 14 months ago after. three in the. the Ashes and he s. (many tokens omitted) (56.84, 21.51, 44.21, 47.26, 87.00)

Table 1: We created non-summarizing systems, each of which produces bad text when processing any document. Broken sentences get higher lexical scores; non-alphanumeric characters outperform good summaries on BERTScore. Concatenating the two strings produces equally bad text that scores high on both. The example is from CNN/DailyMail (for visualization, the document is abridged to keep the content most consistent with the corresponding gold summary).
2 Related Work

2.1 Evasion Attacks in NLP

In an evasion attack, the attacker modifies the input data so that the NLP model incorrectly identifies the input. The most widely studied evasion attack is the adversarial attack, in which insignificant changes are made to the input to create "adversarial examples" that greatly affect the model's output (Szegedy et al., 2014). There are other types of evasion attacks, and evasion attacks can be classified from at least three perspectives. (1) Targeted versus untargeted evasion attacks (Cao and Gong, 2017). The former intends for the model to predict a specific wrong output for the example; the latter is designed to mislead the model into predicting any incorrect output. (2) Universal versus input-dependent attacks (Wallace et al., 2019; Song et al., 2021). The former, also known as an "input-agnostic" attack, is a "unique model analysis tool": such attacks are more threatening and expose more general input-output patterns learned by the model. The opposite is often referred to as an input-dependent attack, and is sometimes called a local or typical attack. (3) Black-box versus white-box attacks. The difference is whether the attacker has access to the detailed computation of the victim model: the former does not, while the latter does. Often, targeted, universal, black-box attacks are the most challenging. Evasion attacks have been used to expose vulnerabilities in sentiment analysis, natural language inference (NLI), automatic short answer grading (ASAG), and natural language generation (NLG) (Alzantot et al., 2018; Wallace et al., 2019; Song et al., 2021; Filighera et al., 2020, 2022; Zang et al., 2020; Behjati et al., 2019).
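To illustrate the universal (input-agnostic) setting in a black-box regime, the toy sketch below greedily builds one fixed output string that maximizes average unigram recall over every reference in a small corpus, so the same string "attacks" all inputs at once. The corpus, vocabulary, and scoring function are hypothetical stand-ins, not the systems attacked in this paper:

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Clipped unigram recall, a simplified stand-in for a lexical score."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    return sum((ref & cand).values()) / max(sum(ref.values()), 1)

# Toy corpus of reference summaries the attacker can query (black-box).
references = [
    "the government announced a new tax policy on friday",
    "the president said the new policy will help the economy",
    "officials announced the policy will take effect next year",
]

# Greedy universal attack: pick, word by word, the token that most raises
# the total score over *all* references (input-agnostic by construction).
vocab = sorted({w for r in references for w in r.lower().split()})
trigger = []
for _ in range(5):
    best = max(vocab, key=lambda w: sum(
        rouge1_recall(r, " ".join(trigger + [w])) for r in references))
    trigger.append(best)

print(" ".join(trigger))  # one non-summary string that scores well everywhere
```

Note that the greedy search never looks inside the scorer, only at its outputs, which is what makes the setting black-box.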
2.2 Universal Triggers in Attacks on Classification

A prefix can be a universal trigger. When a prefix is added to any input, it can cause the classifier to misclassify sentiment, textual entailment (Wallace et al., 2019), or whether a short answer is correct (Filighera et al., 2020). These are usually untargeted attacks in a white-box setting², where the gradients of neural models are computed during the trigger

²When the number of categories is small, the line between targeted and untargeted attacks is blurred, especially when there are only two categories.
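The white-box intuition behind such trigger searches can be sketched with a deliberately tiny example: for a linear bag-of-words classifier, the gradient of the score with respect to adding a token is exactly that token's weight, so the gradient-guided choice of a single-token trigger reduces to an argmax. The classifier and all its weights below are hypothetical:

```python
# Toy white-box universal trigger against a linear bag-of-words
# sentiment classifier (hypothetical weights; illustrative only).
weights = {
    "good": 2.0, "great": 3.0, "bad": -2.5, "awful": -3.0,
    "movie": 0.1, "plot": -0.2, "boring": -2.0, "masterpiece": 6.0,
}

def score(text: str) -> float:
    # Positive total => the classifier predicts "positive".
    return sum(weights.get(w, 0.0) for w in text.lower().split())

# For a linear model, prepending token w shifts every input's score by
# weights[w], so the best single-token trigger is simply the argmax weight.
trigger = max(weights, key=weights.get)

negatives = ["an awful boring movie", "bad plot bad acting"]
flipped = ["%s %s" % (trigger, n) for n in negatives]
print([score(n) > 0 for n in negatives])  # [False, False]
print([score(f) > 0 for f in flipped])    # [True, True]
```

Real trigger searches (e.g., HotFlip-style methods) apply the same idea to deep models, where the gradient only approximates the effect of a token swap and the search must iterate.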