Balanced Adversarial Training: Balancing Tradeoffs between
Fickleness and Obstinacy in NLP Models
Hannah Chen, Yangfeng Ji, David Evans
Department of Computer Science
University of Virginia
Charlottesville, VA 22904
{yc4dx,yangfeng,evans}@virginia.edu
Abstract
Traditional (fickle) adversarial examples involve finding a small perturbation that does not change an input's true label but confuses the classifier into outputting a different prediction. Conversely, obstinate adversarial examples occur when an adversary finds a small perturbation that preserves the classifier's prediction but changes the true label of an input. Adversarial training and certified robust training have shown some effectiveness in improving the robustness of machine learnt models to fickle adversarial examples. We show that standard adversarial training methods focused on reducing vulnerability to fickle adversarial examples may make a model more vulnerable to obstinate adversarial examples, with experiments for both natural language inference and paraphrase identification tasks. To counter this phenomenon, we introduce Balanced Adversarial Training, which incorporates contrastive learning to increase robustness against both fickle and obstinate adversarial examples.
1 Introduction
Interpreted broadly, an adversarial example is an input crafted intentionally to confuse a model. Most research on adversarial examples, however, focuses on a definition of an adversarial example as an input constructed by making minimal perturbations to a normal input that change the model's output, assuming that the small perturbations preserve the original true label (Goodfellow et al., 2015). Such adversarial examples occur when a model is overly influenced by small changes in the input. Attackers can also target the opposite objective: to find inputs with minimal changes that change the ground truth label but for which the model retains its prior prediction (Jacobsen et al., 2019b).
Various names have been used in the research literature for these two types of adversarial examples, including perturbation or sensitivity-based and invariance-based examples (Jacobsen et al., 2019b,a), and over-sensitive and over-stable examples (Niu and Bansal, 2018; Kumar and Boulanger, 2020). To avoid confusion associated with these names, we refer to them as fickle adversarial examples (the model changes its output too easily) and obstinate adversarial examples (the model doesn't change its output even though the input has changed in a way that it should).
In NLP, synonym-based word substitution is a common method for constructing fickle adversarial examples (Alzantot et al., 2018; Jin et al., 2020), since synonym substitutions are assumed not to change the true label of an input. Attacks based on antonyms and negation have been proposed to create obstinate adversarial examples for dialogue models (Niu and Bansal, 2018); these attacks target a model's weakness of being invariant to certain types of changes, which makes its predictions insufficiently responsive to small input changes.
Adversarial training is considered the most effective defense strategy yet found against adversarial examples (Madry et al., 2018; Goodfellow et al., 2016). It aims to improve robustness by augmenting the original training set with generated adversarial examples in a way that results in decision boundaries that correctly classify inputs that otherwise would have been fickle adversarial examples. Adversarial training has been shown to improve robustness for NLP models (Yoo and Qi, 2021). Recent works have also studied certified robust training, which gives a stronger guarantee that the model is robust to all possible perturbations of a given input (Jia et al., 2019; Ye et al., 2020).
While prior work on NLP robustness focuses on fickle adversarial examples, we consider both fickle and obstinate adversarial examples. We then further examine the impact of methods designed to improve robustness to fickle adversarial examples on a model's vulnerability to obstinate adversarial examples. Recent work in the vision domain demonstrated that increasing the adversarial robustness of image classification models by training with fickle adversarial examples may increase vulnerability to obstinate adversarial examples (Tramer et al., 2020). Even in cases where the model certifiably guarantees that no adversarial examples can be found within an L_p-bounded distance, the norm-bounded perturbation does not align with the ground truth decision boundary. This distance-oracle misalignment makes it possible to have obstinate adversarial examples located within the same perturbation distance, as depicted in Figure 1. In text, fickle examples are usually generated with a cosine similarity constraint to encourage the representations of the original and the perturbed sentence to be close in the embedding space. However, this similarity measurement may not preserve the actual semantics (Morris et al., 2020), and the model may learn poor representations during adversarial training.

Figure 1: Distance-oracle misalignment (Tramer et al., 2020). While the model is trained to be robust to ε-bounded perturbations, it becomes too invariant to small changes in the example (obstinate example x̃) that lie on the other side of the oracle decision boundary.
Contributions. We study fickle and obstinate adversarial robustness in NLP models with a focus on synonym- and antonym-based adversarial examples (Figure 2 shows a few examples). We evaluate both kinds of adversarial robustness on natural language inference and paraphrase identification tasks with BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) models. We find that there appears to be a tradeoff between robustness to synonym-based and antonym-based attacks. We show that while certified robust training increases robustness against synonym-based adversarial examples, it increases vulnerability to antonym-based attacks (Section 3). We propose a modification to robust training, Balanced Adversarial Training (BAT), which uses a contrastive learning objective to help mitigate the distance misalignment problem by learning from both fickle and obstinate examples (Section 4). We implement two versions of BAT with different contrastive learning objectives and show their effectiveness in improving both fickleness and obstinacy robustness (Section 4.2).
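To make the intuition behind such a contrastive objective concrete, the sketch below shows one way fickle and obstinate perturbations could be combined in a triplet-style loss, treating the synonym-perturbed sentence as a positive and the antonym-perturbed sentence as a negative. This is an illustrative sketch only: the encoder outputs, cosine distance, and margin are placeholder assumptions, not the actual BAT objectives, which are defined in Section 4 of the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_robustness_loss(z_orig, z_fickle, z_obstinate, margin=1.0):
    """Triplet-style loss over sentence embeddings of shape (batch, hidden).

    Pulls the fickle (synonym) perturbation toward the original representation
    and pushes the obstinate (antonym) perturbation away by at least `margin`.
    """
    d_pos = 1.0 - F.cosine_similarity(z_orig, z_fickle)      # should stay small
    d_neg = 1.0 - F.cosine_similarity(z_orig, z_obstinate)   # should stay large
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```

In practice a term like this would be added to the usual classification loss; how the two are weighted is another assumption left unspecified here.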
2 Constructing Adversarial Examples
We consider a classification task where the goal of the model f is to learn to map the textual input x, a sequence of words x_1, x_2, ..., x_L, to its ground truth label y ∈ {1, ..., c}. We assume there is a labeling oracle O that corresponds to ground truth and outputs the true label of a given input. We focus on word-level perturbations where the attacker substitutes words in the original input x with words from a known perturbation set (whose construction we describe in the following sections). The goal of the attacker is to find an adversarial example x̃ for input x such that the output of the model differs from what a human would assign, i.e., f(x̃) ≠ O(x̃).
2.1 Fickle Adversarial Examples
For a given input (x, y) correctly classified by model f and a set of allowed perturbed sentences S_x, a fickle adversarial example is defined as an input x̃_f such that:
1. x̃_f ∈ S_x
2. f(x̃_f) ≠ f(x)
3. O(x̃_f) = O(x)
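These three conditions translate directly into a simple predicate, sketched below. The callables model(x) (the classifier's prediction) and oracle(x) (the true label, e.g., from human annotation), and a perturbation set S_x given as a collection of candidate sentences, are hypothetical names introduced for illustration, not from the paper.

```python
def is_fickle_adversarial(x, x_tilde, S_x, model, oracle):
    """Check the three conditions for a fickle adversarial example."""
    in_perturbation_set = x_tilde in S_x                # condition 1: x̃_f ∈ S_x
    flips_model = model(x_tilde) != model(x)            # condition 2: f(x̃_f) ≠ f(x)
    keeps_true_label = oracle(x_tilde) == oracle(x)     # condition 3: O(x̃_f) = O(x)
    return in_perturbation_set and flips_model and keeps_true_label
```

Of course, the oracle is not available programmatically in practice; attacks rely on the assumption that synonym swaps preserve the label, which is discussed below.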
There are many different methods for finding fickle adversarial examples. The most common approach is synonym word substitution, where the target words are replaced with similar words found in a word embedding space (Alzantot et al., 2018; Jin et al., 2020) or with known synonyms from WordNet (Ren et al., 2019). Recent work has also explored using masked language models to generate word replacements (Li et al., 2020; Garg and Ramakrishnan, 2020; Li et al., 2021).
We adopt the synonym word substitution method of Ye et al. (2020). For each word x_i in an input x, we create a synonym set S_{x_i} containing the synonyms of x_i, including x_i itself. S_x is then constructed as the set of sentences in which each word x_i of x is replaced by a word in S_{x_i}. We consider the case where the attacker has no constraint on the number of words that can be perturbed in each input, meaning the attacker can perturb up to L words, the length of x.
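A rough sketch of this per-word synonym-set construction is given below. It assumes embeddings is a plain dict mapping words to numpy vectors (e.g., loaded from GloVe); the value of k is a placeholder, while the 0.8 cosine threshold follows the setup described in Section 3.1.

```python
import numpy as np

def synonym_set(word, embeddings, k=8, threshold=0.8):
    """Return S_{x_i}: the word itself plus its top-k embedding neighbors above the threshold."""
    if word not in embeddings:
        return {word}
    v = embeddings[word]
    scored = []
    for w, u in embeddings.items():
        if w == word:
            continue
        cos = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
        if cos >= threshold:
            scored.append((cos, w))
    neighbors = [w for _, w in sorted(scored, reverse=True)[:k]]
    return {word, *neighbors}
```

Note that S_x itself is never enumerated explicitly (its size grows exponentially with sentence length); attacks instead search it greedily, word by word.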
Figure 2: Fickle and obstinate adversarial examples for a BERT model fine-tuned on the natural language inference (left) and paraphrase identification (right) tasks. Words in red are substituted with their synonyms; words in blue are replaced by their antonyms.
The underlying assumption for fickle examples to work is that the perturbed sentence x̃_f ∈ S_x should have the same ground truth label as the original input x, i.e., O(x̃_f) = O(x) = f(x). However, common practice for constructing fickle examples does not guarantee this is true. Swapping a word with its synonym may change the semantic meaning of the example, since even subtle changes in words can have a big impact on meaning, and a word can have different meanings in different contexts. For instance, "the whole race of human kind" and "the whole competition of human kind" describe different things. Nonetheless, previous human evaluations have shown that synonym-based adversarial examples retain the same semantic meaning and label as the original texts most of the time (Jin et al., 2020; Li et al., 2020).
2.2 Obstinate Adversarial Examples
For a given input (x, y) correctly classified by model f and a set of allowable perturbed sentences A_x, an obstinate adversarial example is defined as an input x̃_o such that:
1. x̃_o ∈ A_x
2. f(x̃_o) = f(x)
3. O(x̃_o) ≠ O(x)
While it is challenging to construct obstinate adversarial examples automatically for image classifiers (Tramer et al., 2020), we are able to automate the process for NLP models. We use an antonym word substitution strategy similar to the one proposed by Niu and Bansal (2018) to construct obstinate adversarial examples. As with synonym word substitutions, for each word x_i in an input x, we construct an antonym set A_{x_i} that consists of the antonyms of x_i. Since we would like to change the semantic meaning of the input in a way that is likely to flip its label for the task, the attacker is only allowed to perturb one word with its antonym in each sentence.
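One way such antonym sets could be built is from WordNet's antonym relations, as sketched below using NLTK (this requires nltk.download('wordnet') beforehand). The PoS filtering and word-importance ranking used in the actual attack (Section 3.1) are omitted here, so treat this as an assumption-laden sketch rather than the exact procedure.

```python
from nltk.corpus import wordnet as wn

def antonym_set(word):
    """Return A_{x_i}: antonyms of `word` across all of its WordNet senses."""
    antonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():
                antonyms.add(ant.name().replace('_', ' '))
    return antonyms
```

Because antonymy in WordNet is sense-specific, the returned set is empty for many words and can include antonyms of an unintended sense; a practical attack would presumably skip words with empty sets.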
The way we construct obstinate adversarial examples may not always satisfy the assumption that the ground truth label of the obstinate example differs from that of the original input: depending on the task, the substituted word may not affect the semantic meaning of the input. For example, in natural language inference, changing "the weather is great, we should go out and have fun" to "the weather is bad, ..." does not affect the entailment relationship with "we should have some outdoor activities", since the main argument is in the second part of the sentence. However, we find that antonym substitutions change the semantic meaning of the text most of the time, and we choose two tasks whose labels are most likely to change under an antonym-based attack.
3 Robustness Tradeoffs
Normally, adversarial defense methods only target fickle adversarial examples, so there is a risk that such methods increase vulnerability to obstinate adversarial examples. According to the distance-oracle misalignment assumption (Tramer et al., 2020), depicted in Figure 1, the distance measure used for finding adversarial examples and the labeling oracle O are misaligned if we have O(x̃_f) = O(x) = y and O(x̃_o) ≠ O(x), but dist(x, x̃_f) > dist(x, x̃_o).
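Stated as a predicate, the misalignment condition looks like the sketch below. The dist function and the oracle are assumptions for illustration; in the text setting, dist might be, for example, one minus the cosine similarity of sentence encodings, mirroring the constraint used to generate fickle examples.

```python
def is_misaligned(x, x_fickle, x_obstinate, dist, oracle):
    """True when the label-preserving fickle example is farther from x than the
    label-changing obstinate example, i.e., the distance and the oracle disagree."""
    labels_ok = (oracle(x_fickle) == oracle(x)) and (oracle(x_obstinate) != oracle(x))
    return labels_ok and dist(x, x_fickle) > dist(x, x_obstinate)
```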
3.1 Setup
Our experiments are designed to test our hypothesis that optimizing the adversarial robustness of NLP models using only fickle examples deteriorates the models' robustness to obstinate adversarial examples. We use the SAFER certified robust training method proposed by Ye et al. (2020). The idea is to train a smoothed model by randomly perturbing the sentences with words from the synonym substitution set at each training iteration. While common IBP-based certified robust training methods do not scale well to large pre-trained language models (Jia et al., 2019; Huang et al., 2019), SAFER is a structure-free approach that can be applied to any model architecture. In addition, it gives stronger robustness than traditional adversarial training methods (Yoo and Qi, 2021).
We train BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) models on two different tasks with SAFER training for 15 epochs. We then test the attack success rate for both fickleness and obstinacy attacks at each training epoch. We use the same perturbation method as described in Section 2.1 for both training and attack. For each word, the synonym perturbation set is constructed by selecting the top k nearest neighbors under a cosine similarity constraint of 0.8 in GloVe embeddings (Pennington et al., 2014), and the antonym perturbation set consists of antonyms found in WordNet (Miller, 1995). We follow the method of Jin et al. (2020) for finding fickle adversarial examples, using word importance ranking with part-of-speech (PoS) and sentence semantic similarity constraints as the search criteria. We replace words in order from the highest word importance scores to the lowest, and make sure the substituted words have the same PoS tags as the original words. For the antonym attack, we also use word importance ranking and PoS to search for word substitutions. For comparison, we set up baseline models with normal training on the original training sets.
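The sketch below illustrates the greedy, importance-ranked search that such attacks follow (the fickle variant is shown). model_probs(tokens), returning a list of class probabilities, and synonym_set(word) from the earlier sketch are assumed helpers; the PoS and sentence-similarity checks of Jin et al. (2020) are left out for brevity, so this is a simplification rather than the exact attack.

```python
def greedy_synonym_attack(tokens, model_probs, synonym_set):
    """Greedy word-substitution attack: rank words by importance, then try
    synonym replacements in that order until the model's prediction flips."""
    def predict(tks):
        probs = model_probs(tks)
        return max(range(len(probs)), key=probs.__getitem__), probs

    orig_pred, orig_probs = predict(tokens)

    # Word importance: drop in the original prediction's probability when the word is deleted.
    importance = []
    for i in range(len(tokens)):
        _, probs = predict(tokens[:i] + tokens[i + 1:])
        importance.append((orig_probs[orig_pred] - probs[orig_pred], i))

    adv = list(tokens)
    for _, i in sorted(importance, reverse=True):
        for candidate in synonym_set(adv[i]):
            trial = adv[:i] + [candidate] + adv[i + 1:]
            pred, _ = predict(trial)
            if pred != orig_pred:
                return trial        # successful fickle adversarial example
    return None                     # attack failed within the perturbation budget
```

The antonym attack follows the same ranking loop but draws candidates from the antonym sets and stops after a single substitution.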
3.2 Tasks
We choose two different tasks from the GLUE benchmark (Wang et al., 2018) that are good candidates for the antonym attack. Antonym-based attacks work well on these tasks since both consist of sentence pairs, and changing a word to an opposite meaning is likely to break the relationship within the pair.
Natural Language Inference. We experiment with the Multi-Genre Natural Language Inference (MNLI) dataset (Williams et al., 2018), which contains a premise-hypothesis pair for each example. The task is to identify the relation between the sentences in a premise-hypothesis pair and determine whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise. We consider the case where both the premise and the hypothesis can be perturbed, but only one word from either the premise or the hypothesis can be substituted for the antonym attack. We exclude examples with a neutral label when constructing obstinate adversarial examples, since antonym word substitutions may not change their label to a different class.
Paraphrase Identification. We use Quora Question Pairs (QQP) (Iyer et al., 2017), which consists of questions extracted from Quora. The goal of the task is to identify duplicate questions; each question pair is labeled as duplicate or non-duplicate. For our antonym attack strategy, we only target the duplicate class, since antonym word substitutions are unlikely to flip an initially non-duplicate pair into a duplicate one.
We also conducted experiments on toxicity detection using the Wiki Talk Comments dataset (Wulczyn et al., 2017), creating obstinate examples by adding or removing toxic words. However, we found that adding toxic words reaches an almost 100% attack success rate, so there did not seem to be an interesting tradeoff to explore for available models on this task, and we do not include it in our results.
3.3 Results
We visualize the attack success rates for fickleness (synonym) and obstinacy (antonym) attacks in Figure 3. The results are consistent with our hypothesis that optimizing the adversarial robustness of NLP models using only fickle examples can result in models that are more vulnerable to obstinacy attacks. Robust training for the BERT model on MNLI improves fickleness robustness, reducing the synonym attack success rate from 36% to 11% (a 69% decrease) after training for 15 epochs (Figure 3a), but the antonym attack success rate increases from 56% to 63% (a 13% increase). The antonym attack success rate increases even more for the RoBERTa model (Figure 3b), rising from 56% to 67% (a 20% increase) while the synonym attack success rate decreases from 31.2% to 10% (a 68% decrease). The RoBERTa model is pre-trained with dynamic masking, making it more robust than the BERT model, which perhaps explains the difference. We observe a robustness tradeoff for the QQP dataset as well (see Appendix A.1). In addition, the fickle adversarial training does not sacrifice the performance on the original examples