Universal Evasion Attacks on Summarization Scoring
Wenchuan Mu Kwan Hui Lim
Singapore University of Technology and Design
{wenchuan_mu,kwanhui_lim}@sutd.edu.sg
Abstract

The automatic scoring of summaries is important as it guides the development of summarizers. Scoring is also complex, as it involves multiple aspects such as fluency, grammar, and even textual entailment with the source text. However, summary scoring has not been treated as a machine learning task whose accuracy and robustness can themselves be studied. In this study, we place automatic scoring in the context of regression machine learning tasks and perform evasion attacks to explore its robustness. The attack systems predict a non-summary string from each input, and these non-summary strings achieve scores competitive with good summarizers on the most popular metrics: ROUGE, METEOR, and BERTScore. The attack systems also "outperform" state-of-the-art summarization methods on ROUGE-1 and ROUGE-L, and score second-highest on METEOR. Furthermore, we observe a BERTScore backdoor: a simple trigger can score higher than any automatic summarization method. The evasion attacks in this work indicate the low robustness of current scoring systems at the system level. We hope that our highlighting of these proposed attacks will facilitate the development of summary scores.
1 Introduction

A long-standing paradox has plagued the task of automatic summarization. On the one hand, for about 20 years, no automatic score has been available that is a sufficient or necessary condition for demonstrating summary quality, such as adequacy, grammaticality, cohesion, or fidelity. On the other hand, contemporaneous research more often uses one or several automatic scores to endorse a summarizer as state-of-the-art. More than 90% of works on neural models for language generation choose automatic scoring as the main basis of evaluation, and about half of them rely on automatic scoring only (van der Lee et al., 2021). However, these
Figure 1: Automatic summarization (left) and automatic scoring (right) should be considered as two systems of the same rank, representing conditional language generation and natural language understanding, respectively. As a stand-alone system, the accuracy and robustness of automatic scoring are also important. In this study, we create systems that use bad summaries to fool existing scoring systems. This work shows that optimizing towards a flawed scoring does more harm than good, and flawed scoring methods are not able to indicate the true performance of summarizers, even at a system level.
scoring methods have been found to be insufficient (Novikova et al., 2017), oversimplified (van der Lee et al., 2021), difficult to interpret (Sai et al., 2022), inconsistent with the way humans assess summaries (Rankel et al., 2013; Böhm et al., 2019), or even contradictory to each other (Gehrmann et al., 2021; Bhandari et al., 2020).
Why do we have to deal with this paradox? The current work does not suggest that summarizers assessed by automatic scoring are de facto ineffective. However, optimizing for flawed evaluations (Gehrmann et al., 2021; Peyrard et al., 2017), directly or indirectly, ultimately harms the development of automatic summarization (Narayan et al., 2018; Kryscinski et al., 2019; Paulus et al., 2018). One of the most likely drawbacks is shortcut learning (surface learning, Geirhos et al., 2020), where summarizing models may fail to generate text with more widely accepted qualities such as adequacy
and authenticity, but instead produce merely pleasing scores. Here, we quote and adapt¹ this hypothetical story by Geirhos et al.:
"Alice loves literature. Always has, probably always will. At this very moment, however, she is cursing the subject: After spending weeks immersing herself in the world of Shakespeare's The Tempest, she is now faced with a number of exam questions that are (in her opinion) to equal parts dull and difficult. 'How many times is Duke of Milan addressed'... Alice notices that Bob, sitting in front of her, seems to be doing very well. Bob of all people, who had just boasted how he had learned the whole book chapter by rote last night..."
According to Geirhos et al., Bob might get better grades and consequently be considered a better student than Alice, which is an example of surface learning. The same could happen with automatic summarization, where we might end up with significant differences between expected and actual learning outcomes (Paulus et al., 2018). To avoid going astray, it is important to ensure that the objective is correct.
In addition to understanding the importance of correct justification, we also need to know what caused the fallacy in the justification process for these potentially useful summarizers. There are three mainstream speculations, which are not mutually exclusive. (1) The difficulty of the transition from extractive summarization to abstractive summarization (Kryscinski et al., 2019) could have been underestimated. For example, the popular score ROUGE (Lin, 2004) was originally used to judge the ranking of sentences selected from documents. Due to constraints on sentence integrity, the generated summaries could always be fluent and undistorted, except sometimes when anaphora was involved. However, when it comes to free-form language generation, sentence integrity is no longer guaranteed, but the metric continues to be used. (2) Many metrics, while flawed in judging individual summaries, often make sense at the system level (Reiter, 2018; Gehrmann et al., 2021; Böhm et al., 2019). In other words, it might have been believed that few summarization systems can consistently output poor-quality but high-scoring strings. (3) Researchers have not figured out how humans interpret or understand texts (van der Lee et al., 2021; Gehrmann et al., 2021; Schluter, 2017), so the decision about how good a summary really is varies from person to person, let alone automated scoring. In fact, automatic scoring is more of a natural language understanding (NLU) task, a task that is far from solved. From this viewpoint, automatic scoring itself is fairly challenging.

¹We underline adaptations.
Nevertheless, the current work does not advocate (and certainly does not disparage) human evaluation. Instead, we argue that automatic scoring is not just a sub-module of automatic summarization, but a stand-alone system that needs to be studied for its own accuracy and robustness. The primary reason is that NLU is clearly required to characterize summary quality, e.g., semantic similarity to determine adequacy (Morris, 2020), or textual entailment (Dagan et al., 2006) to determine fidelity. Besides, summary scoring is similar to automated essay scoring (AES), a 50-year-old task measuring grammaticality, cohesion, relevance, etc. of written texts (Ke and Ng, 2019). Moreover, recent advances in automatic scoring also support this argument well. Automatic scoring is gradually transitioning from well-established metrics measuring N-gram overlap (BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), etc.) to emerging metrics that compute semantic similarity through pre-trained neural models (BERTScore (Zhang et al., 2019b), MoverScore (Zhao et al., 2019), BLEURT (Sellam et al., 2020), etc.). These emerging scores exhibit two characteristics that stand-alone machine learning systems typically have: some can be fine-tuned to human cognition, and they still have room to improve and must still learn how to match human ratings.
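The overlap-only nature of the lexical metrics above can be made concrete with a small sketch. The function below is a simplified ROUGE-1 F1 (lowercased whitespace tokens, no stemming; the official implementation differs in details) and shows why pure unigram overlap cannot distinguish a fluent summary from a word salad built out of reference tokens; the example sentences are hypothetical:

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F1, the core idea of ROUGE-1 (simplified sketch)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat and purred"
fluent = "a cat was sitting on a mat"
word_salad = "purred mat the and on sat cat the"  # same words, no grammar

print(rouge1_f(reference, fluent))      # 0.4: readable, but penalized
print(rouge1_f(reference, word_salad))  # 1.0: overlap cannot see word order
```

Stemming and the bigram/LCS variants of real ROUGE do not restore sensitivity to grammar, which is consistent with the Broken row of Table 1.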
Machine learning systems can be attacked, and attacks can help improve their generality, robustness, and interpretability. In particular, evasion attacks are an intuitive way to further expose the weaknesses of current automatic scoring systems. The evasion attack is the parent task of the adversarial attack; it aims to make the system fail to correctly identify the input, and thus calls for defence against the exposed vulnerabilities.
In this work, we try to answer two questions: do current representative automatic scoring systems really work well at the system level, and how hard is it to show that they do not? In summary, we make the following major contributions in this study:
- We are the first to treat automatic summarization scoring as an NLU regression task and to perform evasion attacks against it.
- We are the first to perform a universal, targeted attack on NLP regression models.
- Our evasion attacks show that it is not difficult to deceive the three most popular automatic scoring systems simultaneously.
- The proposed attacks can be directly applied to test emerging scoring systems.

Document (abridged): Andrew Flintoff fears Kevin Pietersen is 'running out of time' to resurrect his England career. The dual Ashes-winning all-rounder is less convinced, however, about Pietersen's prospects of forcing his way back into Test contention. Kevin Pietersen scored 170 for Surrey in The Parks as he bids to earn a recall to the England squad... ... Flintoff senses he no longer has age on his side. Pietersen has not featured for England since he was unceremoniously sacked 14 months ago. ... ... Flintoff said ... 'If he'd started the season last year with Surrey, and scored run after run and put himself in the position... whereas now I think he's looking at the Ashes ... ... you get the sense everyone within the England set-up wants him as captain,' he said. ... The former England star is hoping to win back his Test place with a return to red ball cricket. ... ... 'this stands up as a competition.

Scores for each system's summary are reported as (ROUGE-1, ROUGE-2, ROUGE-L, METEOR, BERTScore).

Gold: Kevin Pietersen was sacked by England 14 months ago after Ashes defeat. Batsman scored 170 on his county cricket return for Surrey last week. Pietersen wants to make a sensational return to the England side this year. But Andrew Flintoff thinks time is running out for him to resurrect career.

Good (Liu and Liu, 2021): Kevin pietersen scored 170 for surrey against mccu oxford. Former england star andrew flintoff fears pietersen is 'running out of time' to resurrect his england career. Pietersen has been surplus to requirements since being sacked 14 months ago. Flintoff sees a bright future for 'probably the premier tournament' in this country. (55.45, 18.18, 41.58, 40.03, 85.56)

Broken: Andrew Flintoff fears Kevin Pietersen is running out of time to resurrect his England career Flintoff. Pietersen scored 170 for Surrey in The. Former England star Andrew. batsman has been . since being sacked 14 months ago after. three in the. the Ashes and he s. (56.84, 21.51, 44.21, 47.26, 85.95)

A dot: . (0, 0, 0, 0, 88.47)

Scrambled code: \x03\x18$\x18...\x03$\x03|...\x0f\x01<<$$\x04...\x0e \x04#$...\x0f\x0f\x0f...\x0e...\x0f...\x0f\x0f$\x0f \x04\x0f\x0f (many tokens omitted) (0, 0, 0, 0, 87.00)

Scrambled code + broken: \x03\x18$\x18...\x03$\x03|...\x0f\x01<<$$\x04...\x0e \x04#$...\x0f\x0f\x0f...\x0e...\x0f...\x0f\x0f$\x0f \x04\x0f\x0f... Andrew Flintoff fears Kevin Pietersen is running out of time to resurrect his England career Flintoff. Pietersen scored 170 for Surrey in The. Former England star Andrew. batsman has been . since being sacked 14 months ago after. three in the. the Ashes and he s. (many tokens omitted) (56.84, 21.51, 44.21, 47.26, 87.00)

Table 1: We created non-summarizing systems, each of which produces bad text when processing any document. Broken sentences get higher lexical scores; non-alphanumeric characters outperform good summaries on BERTScore. Concatenating the two strings produces equally bad text that scores high on both. The example is from CNN/DailyMail (for visualization, the document is abridged to keep the content most consistent with the corresponding gold summary).
2 Related Work

2.1 Evasion Attacks in NLP

In an evasion attack, the attacker modifies the input data so that the NLP model incorrectly identifies the input. The most widely studied evasion attack is the adversarial attack, in which insignificant changes are made to the input to create "adversarial examples" that greatly affect the model's output (Szegedy et al., 2014). There are other types of evasion attacks, and evasion attacks can be classified from at least three perspectives. (1) Targeted versus untargeted evasion attacks (Cao and Gong, 2017). The former intends for the model to predict a specific wrong output for the example; the latter is designed to mislead the model into predicting any incorrect output. (2) Universal versus input-dependent attacks (Wallace et al., 2019; Song et al., 2021). The former, also known as an "input-agnostic" attack, is a "unique model analysis tool": such attacks are more threatening and expose more general input-output patterns learned by the model. The opposite is often referred to as an input-dependent attack, and is sometimes called a local or typical attack. (3) Black-box versus white-box attacks. The difference is whether the attacker has access to the detailed computation of the victim model: the former does not, while the latter does. Often, targeted, universal, black-box attacks are the most challenging. Evasion attacks have been used to expose vulnerabilities in sentiment analysis, natural language inference (NLI), automatic short answer grading (ASAG), and natural language generation (NLG) (Alzantot et al., 2018; Wallace et al., 2019; Song et al., 2021; Filighera et al., 2020, 2022; Zang et al., 2020; Behjati et al., 2019).
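To illustrate the universal (input-agnostic) setting in a black-box regime, the toy sketch below greedily builds one fixed output string that maximizes average unigram recall over every reference in a small corpus, so the same string "attacks" all inputs at once. The corpus, vocabulary, and scoring function are hypothetical stand-ins, not the systems attacked in this paper:

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Clipped unigram recall, a simplified stand-in for a lexical score."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    return sum((ref & cand).values()) / max(sum(ref.values()), 1)

# Toy corpus of reference summaries the attacker can query (black-box).
references = [
    "the government announced a new tax policy on friday",
    "the president said the new policy will help the economy",
    "officials announced the policy will take effect next year",
]

# Greedy universal attack: pick, word by word, the token that most raises
# the total score over *all* references (input-agnostic by construction).
vocab = sorted({w for r in references for w in r.lower().split()})
trigger = []
for _ in range(5):
    best = max(vocab, key=lambda w: sum(
        rouge1_recall(r, " ".join(trigger + [w])) for r in references))
    trigger.append(best)

print(" ".join(trigger))  # one non-summary string that scores well everywhere
```

Note that the greedy search never looks inside the scorer, only at its outputs, which is what makes the setting black-box.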
2.2 Universal Triggers in Attacks on Classification

A prefix can be a universal trigger. When a prefix is added to any input, it can cause the classifier to misclassify sentiment, textual entailment (Wallace et al., 2019), or whether a short answer is correct (Filighera et al., 2020). These are usually untargeted attacks in a white-box setting², where the gradients of neural models are computed during the trigger

²When the number of categories is small, the line between targeted and untargeted attacks is blurred, especially when there are only two categories.
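The white-box intuition behind such trigger searches can be sketched with a deliberately tiny example: for a linear bag-of-words classifier, the gradient of the score with respect to adding a token is exactly that token's weight, so the gradient-guided choice of a single-token trigger reduces to an argmax. The classifier and all its weights below are hypothetical:

```python
# Toy white-box universal trigger against a linear bag-of-words
# sentiment classifier (hypothetical weights; illustrative only).
weights = {
    "good": 2.0, "great": 3.0, "bad": -2.5, "awful": -3.0,
    "movie": 0.1, "plot": -0.2, "boring": -2.0, "masterpiece": 6.0,
}

def score(text: str) -> float:
    # Positive total => the classifier predicts "positive".
    return sum(weights.get(w, 0.0) for w in text.lower().split())

# For a linear model, prepending token w shifts every input's score by
# weights[w], so the best single-token trigger is simply the argmax weight.
trigger = max(weights, key=weights.get)

negatives = ["an awful boring movie", "bad plot bad acting"]
flipped = ["%s %s" % (trigger, n) for n in negatives]
print([score(n) > 0 for n in negatives])  # [False, False]
print([score(f) > 0 for f in flipped])    # [True, True]
```

Real trigger searches (e.g., HotFlip-style methods) apply the same idea to deep models, where the gradient only approximates the effect of a token swap and the search must iterate.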