the sentences with words in the synonym substitu-
tion set at each training iteration. While common
IBP-based certified robust training methods do not scale well to large pre-trained language models (Jia et al., 2019; Huang et al., 2019), SAFER is a structure-free approach that can be applied to any model architecture. In addition, it provides stronger robustness than traditional adversarial training methods (Yoo and Qi, 2021).
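As a minimal sketch of this per-iteration perturbation (not the full SAFER certification procedure), the following randomly substitutes each word from its synonym set; the `synonym_sets` table and function names are illustrative assumptions.

```python
import random

def randomly_substitute(tokens, synonym_sets):
    """Replace each word with a uniformly sampled member of its synonym
    substitution set, keeping the original word as one of the options.
    Applied to every training sentence at each iteration."""
    perturbed = []
    for tok in tokens:
        candidates = synonym_sets.get(tok, []) + [tok]
        perturbed.append(random.choice(candidates))
    return perturbed
```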
We train BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) models on two dif-
ferent tasks with SAFER training for 15 epochs.
We then test the attack success rate for both fickle-
ness and obstinacy attacks at each training epoch.
We use the same perturbation method as described
in Section 2.1 for both the training and the attack.
For each word, the synonym perturbation set is constructed by selecting the top $k$ nearest neighbors with a cosine similarity constraint of 0.8 in GloVe embeddings (Pennington et al., 2014), and the antonym perturbation set consists of antonym words found in WordNet (Miller, 1995).
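The sketch below illustrates one way to build these perturbation sets, computing synonym candidates from GloVe vectors and antonyms from WordNet via NLTK; the value of $k$, the `glove` dictionary (mapping words to unit-normalized vectors), and the function names are assumptions rather than our exact implementation.

```python
import numpy as np
from nltk.corpus import wordnet as wn

def synonym_set(word, glove, k, threshold=0.8):
    """Top-k nearest GloVe neighbors with cosine similarity >= threshold.
    `glove` maps words to unit-normalized embedding vectors, so the dot
    product equals cosine similarity."""
    if word not in glove:
        return []
    v = glove[word]
    sims = {w: float(np.dot(v, u)) for w, u in glove.items() if w != word}
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]
    return [w for w in nearest if sims[w] >= threshold]

def antonym_set(word):
    """All antonyms of `word` listed in WordNet."""
    antonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return sorted(antonyms)
```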
We follow the method of Jin et al. (2020) for finding fickle adversarial examples, using word importance ranking together with Part-of-Speech (PoS) and sentence semantic similarity constraints as the search criteria. Words are replaced in descending order of word importance score, and each substitute must have the same PoS tag as the word it replaces. For the antonym attack, we likewise use word importance ranking and PoS constraints to search for word substitutions. For comparison, we set up baseline models with standard training on the original training sets.
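The greedy, importance-ordered search described above can be sketched as follows, assuming `predict` returns a class-probability vector, `label` is the integer index of the original class, `candidates` returns a word's perturbation set, and `same_pos` checks PoS agreement; the sentence semantic similarity constraint of Jin et al. (2020) is omitted for brevity.

```python
import numpy as np

def word_importance(tokens, label, predict, mask="[UNK]"):
    """Simplified importance score: the drop in the true-class
    probability when each word is masked out."""
    base = predict(tokens)[label]
    return [base - predict(tokens[:i] + [mask] + tokens[i + 1:])[label]
            for i in range(len(tokens))]

def greedy_attack(tokens, label, predict, candidates, same_pos):
    """Substitute words from most to least important, keeping only
    same-PoS candidates, until the model's prediction flips."""
    scores = word_importance(tokens, label, predict)
    adv = list(tokens)
    for i in sorted(range(len(tokens)), key=scores.__getitem__, reverse=True):
        best, best_prob = None, predict(adv)[label]
        for sub in candidates(tokens[i]):
            if sub == tokens[i] or not same_pos(tokens[i], sub):
                continue
            trial = adv[:i] + [sub] + adv[i + 1:]
            prob = predict(trial)[label]
            if prob < best_prob:
                best, best_prob = sub, prob
        if best is not None:
            adv[i] = best
            if int(np.argmax(predict(adv))) != label:
                return adv  # successful fickle adversarial example
    return None  # attack failed
```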
3.2 Tasks
We choose two tasks from the GLUE benchmark (Wang et al., 2018) that are good candidates for the antonym attack: both consist of sentence pairs, and changing a word to its opposite meaning is likely to break the relationship between the sentences in a pair.
Natural Language Inference.
We experiment with the Multi-Genre Natural Language Inference (MNLI) dataset (Williams et al., 2018), which contains a premise-hypothesis pair for each example.
The task is to identify the relation between the sen-
tences in a premise-hypothesis pair and determine
whether the hypothesis is true (entailment), false
(contradiction) or undetermined (neutral) given the
premise. We consider the case where both premise
and hypothesis can be perturbed, but only one word
from either the premise or the hypothesis can be substituted for the antonym attack. We exclude examples
with a neutral label when constructing obstinate
adversarial examples since antonym word substi-
tutions may not change their label to a different
class.
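The following is a rough sketch of this single-substitution antonym attack on MNLI, counting the attack as successful when the model's prediction is unchanged even though the substitution should change the gold label; the word importance ordering is omitted, and the helper names (`predict_pair`, `antonym_set`, `same_pos`) are illustrative assumptions.

```python
def mnli_antonym_attack(premise, hypothesis, label, predict_pair,
                        antonym_set, same_pos):
    """Try single-word antonym substitutions in either the premise or the
    hypothesis; the obstinacy attack succeeds if the prediction stays the
    same even though the substitution should change the gold label.
    Neutral examples are excluded before calling this function."""
    assert label in ("entailment", "contradiction")
    original = predict_pair(premise, hypothesis)
    for side, tokens in (("premise", premise), ("hypothesis", hypothesis)):
        for i, tok in enumerate(tokens):
            for antonym in antonym_set(tok):
                if not same_pos(tok, antonym):
                    continue
                perturbed = tokens[:i] + [antonym] + tokens[i + 1:]
                pair = (perturbed, hypothesis) if side == "premise" \
                    else (premise, perturbed)
                if predict_pair(*pair) == original:
                    return pair  # successful obstinate adversarial example
    return None
```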
Paraphrase Identification.
We use the Quora Question Pairs (QQP) dataset (Iyer et al., 2017), which consists of question pairs extracted from Quora. The task is to identify duplicate questions, and each pair is labeled as duplicate or non-duplicate.
For our antonym attack strategy, we only target the
duplicate class since antonym word substitutions
are unlikely to flip an initially non-duplicate pair
into a duplicate.
We also conducted experiments on the Wiki Talk Comments toxicity detection dataset (Wulczyn et al., 2017), creating obstinate examples by adding or removing toxic words. However, we found that adding toxic words achieves an attack success rate of nearly 100%, so there did not seem to be an interesting tradeoff to explore for available models on this task, and we do not include it in our results.
3.3 Results
We visualize the attack success rates for fickleness (synonym) and obstinacy (antonym) attacks in Figure 3. The results are consistent with
our hypothesis that optimizing adversarial robust-
ness of NLP models using only fickle examples
can result in models that are more vulnerable to ob-
stinacy attacks. Robustness training for the BERT
model on MNLI improves fickleness robustness,
reducing the synonym attack success rate from
36% to 11% (a 69% decrease) after training for
15 epochs (Figure 3a), but the antonym attack success rate increases from 56% to 63% (a 13% increase).
The antonym attack success rate increases even
more for the RoBERTa model (Figure 3b), increas-
ing from 56% to 67% (a 20% increase) while the
synonym attack success rate decreases from 31.2%
to 10% (a 68% decrease). The RoBERTa model is pre-trained with dynamic masking, which makes it more robust than the BERT model and perhaps explains the difference. We observe a robustness tradeoff for the QQP dataset as well (see Appendix A.1). In addition, fickle adversarial training does not sacrifice performance on the original examples