Improving Adversarial Robustness with Self-Paced Hard-Class Pair Reweighting
Pengyue Hou,1 Jie Han,1 Xingyu Li1
1University of Alberta
pengyue@ualberta.ca, jhan8@ualberta.ca, xingyu@ualberta.ca
Abstract
Deep Neural Networks are vulnerable to adversarial attacks.
Among many defense strategies, adversarial training with un-
targeted attacks is one of the most effective methods. The-
oretically, adversarial perturbation in untargeted attacks can
be added along arbitrary directions and the predicted labels
of untargeted attacks should be unpredictable. However, we
find that the naturally imbalanced inter-class semantic sim-
ilarity makes those hard-class pairs become virtual targets
of each other. This study investigates the impact of such
closely-coupled classes on adversarial attacks and develops
a self-paced reweighting strategy in adversarial training ac-
cordingly. Specifically, we propose to upweight hard-class
pair losses in model optimization, which prompts learning
discriminative features from hard classes. We further incor-
porate a term to quantify hard-class pair consistency in ad-
versarial training, which greatly boosts model robustness. Ex-
tensive experiments show that the proposed adversarial train-
ing method achieves superior robustness performance over
state-of-the-art defenses against a wide range of adversar-
ial attacks. The code of the proposed SPAT is published at
https://github.com/puerrrr/Self-Paced-Adversarial-Training.
1 Introduction
In recent years, DNNs have been found to be vulnerable to ad-
versarial attacks, and extensive work has been carried out
on how to defend or reject the threat of adversarial sam-
ples (Szegedy et al. 2013; Goodfellow, Shlens, and Szegedy
2014; Nguyen, Yosinski, and Clune 2015). Adversarial
samples are carefully generated with human-imperceptible
noises, yet they can lead to large performance degradation
of well-trained models.
While numerous defenses have been proposed, adversar-
ial training (AT) is a widely recognized strategy (Madry
et al. 2017) and achieves promising performance against a
variety of attacks. AT treats adversarial attacks as an aug-
mentation method and aims to train models that can cor-
rectly classify both adversarial and clean data. Based on
the AT framework, further robustness improvements can
be achieved by exploiting unlabeled data, misclassified data,
pre-training, etc (Alayrac et al. 2019; Carmon et al. 2019;
Hendrycks, Lee, and Mazeika 2019; Zhai et al. 2019; Wang
et al. 2019; Jiang et al. 2020; Fan et al. 2021; Hou et al.
2022).
In existing adversarial training, untargeted attacks are
widely used in model optimization and evaluation (Moosavi-
Dezfooli, Fawzi, and Frossard 2016; Madry et al. 2017;
Zhang et al. 2019; Wang et al. 2019; Kannan, Kurakin, and
Goodfellow 2018; Shafahi et al. 2019; Wong, Rice, and
Kolter 2020). Unlike targeted attacks that aim to misguide
a model to a particular class other than the true one, untar-
geted adversaries do not specify the targeted category and
perturb the clean data so that its prediction is away from its
true label. In theory, adversarial perturbation in untargeted
attacks can be added along arbitrary directions and classifi-
cation of untargeted attacks should be unpredictable. How-
ever, the study by Carlini et al. argues that an untargeted
attack is simply a more efficient method of running a tar-
geted attack for each target and taking the closest (Carlini
and Wagner 2017b). Figure 1(a) presents the misclassifi-
cation statistics of PGD-attacked cat and dog images, where almost
half of the dog images are misclassified as cats, and over 40%
of the cat images are misclassified as dogs. Considering that
cat and dog images share many common features in vision,
we raise the following questions:
"Does the unbalanced inter-class semantic similarity lead to
the non-uniformly distributed misclassification statistics? If
yes, are classification predictions of untargeted adversaries
predictable?"
To answer these questions, this paper revisits the recipe
for generating gradient-based first-order adversaries and
surprisingly discovers that untargeted attacks may be tar-
geted. In theory, we prove that adversarial perturbation di-
rections in untargeted attacks are actually biased toward
the hard-class pairs of the clean data under attack. Intu-
itively, semantically-similar classes constitute hard-class
pairs (HCPs) and semantically-different classes form easy-
class pairs (ECPs).
Accordingly, we propose explicitly taking the inter-class
semantic similarity into account in AT algorithm design and
develop a self-paced adversarial training (SPAT) strategy to
upweight hard-class pair losses and downweight easy-
class pair losses, encouraging the training procedure to ne-
glect redundant information from easy class pairs. Since
HCPs and ECPs may change during model training (depend-
ing on the current optimization status), their scaling factors
Figure 1: Predictions of untargeted adversarial attacks
(PGD-20) by CIFAR-10 vanilla-trained and SPAT-trained
classifiers. (a) For the vanilla-trained model, over 40% of
the dog images are misclassified as cats; (b) this is reduced
to 30.6% with the SPAT-trained model.
are adaptively updated at their own pace. Such self-paced
reweighting offers SPAT more optimization flexibility. In ad-
dition, we further incorporate an HCP-ECP consistency term
in SPAT and show its effectiveness in boosting model adver-
sarial robustness. Our main contributions are:
• We investigate the cause of the unevenly distributed mis-
classification statistics in untargeted attacks. We find that
adversarial perturbations are actually biased by the targeted
sample's hard-class pairs.
• We introduce a SPAT strategy that takes inter-class se-
mantic similarity into account. Adaptively upweighting
hard-class pair loss encourages discriminative feature
learning.
• We propose incorporating an HCP-ECP consistency reg-
ularization term in adversarial training, which boosts
model adversarial robustness by a large margin.
2 Related Work
2.1 Adversarial Attack and Defense
The objective of adversarial attacks is to search for a human-
imperceptible perturbation $\delta$ so that the adversarial sample
$$x' = x + \delta \quad (1)$$
can fool a model $f(x;\phi)$ well-trained on clean data $x$. Here
$\phi$ represents the trainable parameters in a model. For nota-
tion simplification, we use $f(x)$ to denote $f(x;\phi)$ in the
rest of the paper. One main branch of adversarial noise gen-
eration is the gradient-based method, such as the Fast Gradi-
ent Sign Method (FGSM) (Goodfellow, Shlens, and Szegedy
2014), and its variants (Kurakin, Goodfellow, and Bengio
2016; Madry et al. 2017). Another popular strategy is opti-
mization-based, such as the CW attack (Carlini and Wagner
2017b).
Several pre/post-processing-based methods have shown
outstanding performance in adversarial detection and clas-
sification tasks (Grosse et al. 2017; Metzen et al. 2017; Xie
et al. 2017; Feinman et al. 2017; Li and Li 2017). They
aim to use either a secondary neural network or random
augmentation methods, such as cropping, compression and
blurring to strengthen model robustness. However, Carlini
et al. showed that they all can be defeated by a tailored at-
tack (Carlini and Wagner 2017a). Adversarial Training, on
the other hand, uses regularization methods to directly enhance
the robustness of classifiers. Such an optimization scheme is
often referred to as the "min-max game":
$$\arg\min_{\phi} \; \mathbb{E}_{(x,y)\sim D}\Big[\max_{\delta \in S} L(f(x'), y)\Big], \quad (2)$$
where the inner max function aims to generate efficient and
strong adversarial perturbation based on a specific loss func-
tion $L$, and the outer min function optimizes the network
parameters $\phi$ for model robustness. Another branch of AT
aims to achieve logit level robustness, where the objective
function not only requires correct classification of the adver-
sarial samples, but also encourages the logits of clean and
adversarial sample pairs to be similar (Kannan, Kurakin,
and Goodfellow 2018; Zhang et al. 2019; Wang et al. 2019).
Their AT objective functions usually can be formulated as a
compound loss:
$$\mathcal{L}(\theta) = \mathcal{L}_{acc} + \lambda \mathcal{L}_{rob}, \quad (3)$$
where $\mathcal{L}_{acc}$ is usually the cross entropy (CE) loss on clean
or adversarial data, $\mathcal{L}_{rob}$ quantifies clean-adversarial logit
pairing, and $\lambda$ is a hyper-parameter to control the relative
weights for these two terms. The proposed SPAT in this pa-
per introduces self-paced reweighting mechanisms upon the
compound loss and soft-differentiates hard/easy-class pair
losses in model optimization to boost model robustness.
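For concreteness, the sketch below pairs the inner maximization of Eq. (2) with the compound objective of Eq. (3), using a PGD attack for the inner loop and a KL-based logit-pairing term as $\mathcal{L}_{rob}$. The attack schedule, the KL choice, the hyper-parameter values, and the assumption that the model returns raw logits are illustrative; this is not the exact setting of any specific method cited above.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization of Eq. (2): PGD-style search for a perturbation
    inside an L-infinity ball of radius eps (hyper-parameters are illustrative).
    The model is assumed to output raw logits."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def compound_at_loss(model, x, y, lam=6.0):
    """Outer minimization with the compound loss of Eq. (3):
    L_acc = CE on clean data, L_rob = KL-based clean-adversarial logit pairing."""
    x_adv = pgd_attack(model, x, y)
    logits_clean, logits_adv = model(x), model(x_adv)
    l_acc = F.cross_entropy(logits_clean, y)
    l_rob = F.kl_div(F.log_softmax(logits_adv, dim=1),
                     F.softmax(logits_clean, dim=1), reduction='batchmean')
    return l_acc + lam * l_rob
```

Swapping the KL term for other pairing losses recovers different members of the logit-pairing family discussed above.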
2.2 Re-weighting in Adversarial Training
Re-weighting is a simple yet effective strategy for address-
ing biases in machine learning, for instance, class imbalance.
When class imbalance exists in a dataset, the training pro-
cedure is very likely to over-fit to the categories with a larger
number of samples, leading to unsatisfactory performance
regarding minority groups. With the re-weighting technique,
one can down-weight the loss from majority classes and ob-
tain a balanced learning solution for minority groups.
Re-weighting is also a common technique for hard ex-
ample mining. Generally, hard examples are those data that
have similar representations but belong to different classes.
Hard sample mining is a crucial component in deep metric
learning (Hoffer and Ailon 2015; Hermans, Beyer, and Leibe
2017) and Contrastive learning (Chen et al. 2020; Khosla
et al. 2020). With re-weighting, we can directly utilize the
loss information during training and characterize those sam-
ples that contribute large losses as hard examples. For ex-
ample, OHEM (Shrivastava, Gupta, and Girshick 2016) and
Focal Loss (Lin et al. 2017) put more weight on the loss of
misclassified samples to effectively minimize the impact of
easy examples.
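As a reference point, focal-loss-style reweighting can be sketched in a few lines. The implementation below is a minimal rendition of the published formula; gamma = 2 is a commonly used default and is merely an assumption here, as are the function and variable names.

```python
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: scale each sample's CE term by (1 - p_t)^gamma so that
    well-classified (easy) examples contribute little and hard,
    misclassified examples dominate the gradient."""
    log_pt = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()                      # predicted probability of the true class
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```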
Previous studies show that utilizing hard adversarial sam-
ples promotes stronger adversarial robustness (Madry et al.
2017; Wang et al. 2019; Mao et al. 2019; Pang et al. 2020).
For instance, MART (Wang et al. 2019) explicitly applies a
re-weighting factor for misclassified samples by a soft de-
cision scheme. Recently, several re-weighting-based algo-
rithms have also been proposed to address fairness-related
issues in AT. Wang et al. (2021) adopted a re-weighting strat-
egy to address the data imbalance problem in AT and showed
that adversarially trained models can suffer much worse per-
formance degradation in under-represented classes. Xu et
al. (Xu et al. 2021) empirically showed that even in bal-
anced datasets, AT still suffers from the fairness problem,
where some classes have much higher performance than oth-
ers. They propose to combine re-weighting and re-margin
for different classes to achieve robust fairness. Zhang et
al. (Zhang et al. 2020) propose to assign weights based
on how difficult it is to change the prediction of a natural data
point to a different class. However, existing AT re-weighting
strategies only considered intra-class or inter-sample rela-
tionships, but ignored the inter-class biases in model opti-
mization. We propose to explicitly take the inter-class se-
mantic similarity into account in the proposed SPAT strategy
and up-weight the loss from hard-class pairs in AT.
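To make the contrast with sample-level schemes concrete, the sketch below shows one generic way an inter-class signal could be folded into the loss: each sample's CE term is up-weighted by the softmax mass its prediction places on its most confusable (hard-pair) class. This is only an illustration of the idea; it is not the SPAT objective, whose self-paced formulation is introduced in the following sections, and the weighting rule and all names are our own assumptions.

```python
import torch
import torch.nn.functional as F

def hard_pair_weights(logits, targets, alpha=1.0):
    """Illustrative per-sample factor: 1 + alpha * (softmax mass on the most
    confusable false class). Not the SPAT loss defined later in the paper."""
    probs = F.softmax(logits, dim=1)
    false_probs = probs.clone()
    false_probs[torch.arange(len(targets)), targets] = 0.0   # drop the true class
    hardest_mass = false_probs.max(dim=1).values             # hard-class-pair mass
    return 1.0 + alpha * hardest_mass

def reweighted_ce(logits, targets):
    """Cross entropy whose per-sample terms are scaled by the factors above."""
    w = hard_pair_weights(logits, targets).detach()
    return (w * F.cross_entropy(logits, targets, reduction='none')).mean()
```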
3 Untargeted Adversaries are Targeted
Untargeted adversarial attacks are usually adopted in adver-
sarial training. In theory, adversarial perturbation in untar-
geted attacks can be added along arbitrary directions, lead-
ing to unpredictable false classification. However, our ob-
servations on many adversarial attacks contradict this. For
example, when untargeted adversaries attack images of cats,
the resulting images are empirically likely to be classified as
dogs. We visualize image embeddings from the penultimate
layer of the vanilla-trained model via t-SNE in Figure 2. In
the figure, the embeddings of dog and cat images are close
to each other, which suggests the semantic similarity in their
representations. With this observation, we hypothesize that
the unbalanced inter-class semantic similarity leads to the
non-uniformly distributed misclassification statistics.
In this section, we investigate this interesting yet over-
looked aspect of adversarial attacks and find that untargeted
adversarial examples may be highly biased by their hard-
class pairs. The insight in this section directly motivates the
proposed self-paced adversarial training for model robust-
ness improvement.
3.1 Notations
Given a dataset with labeled pairs $\{X, Y\} = \{(x, y) \mid x \in
\mathbb{R}^{c \times m \times n}, y \in [1, C]\}$, a classifier can be formulated as a
mapping function $f: X \rightarrow Y$:
$$f(x) = S(W^T z_x), \quad (4)$$
where $C$ is the number of categories, and $S$ represents the
softmax function in the classification layer. We use $z_x$ to
denote the representation of an input sample $x$ in the penul-
timate layer of the model and $W = (w_1, w_2, ..., w_C)$ for
the trainable parameters (including weights and bias) of the
softmax layer. Note that $w_i$ can be considered as the proto-
type of class $i$ and the product $W^T z_x$ in (4) calculates
the similarity between $z_x$ and the different class prototypes $w_i$.
During training, the model $f$ is optimized to minimize a spe-
cific loss $L(f(x), y)$.
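A minimal sketch of the formulation in Eq. (4), assuming an arbitrary backbone network that produces the penultimate embedding $z_x$; the class and variable names are ours, and the bias is folded into the linear layer.

```python
import torch.nn as nn

class PrototypeClassifier(nn.Module):
    """f(x) = S(W^T z_x): the rows of the final linear layer act as the
    class prototypes w_1, ..., w_C (bias handled by nn.Linear)."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone                    # maps x to its embedding z_x
        self.fc = nn.Linear(feat_dim, num_classes)  # row i of fc.weight is w_i

    def forward(self, x):
        z_x = self.backbone(x)          # penultimate-layer representation
        logits = self.fc(z_x)           # similarities w_i^T z_x (+ bias)
        return logits.softmax(dim=1)    # S(W^T z_x)
```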
Figure 2: t-SNE visualization of 1000 randomly sampled im-
age embeddings from CIFAR-10. Due to the naturally im-
balanced semantic similarity, inter-class distance is much
smaller for hard-class pairs.

In the literature, the most commonly used adversarial attacks,
such as PGD and its variants, generate adversaries based on
first-order derivative information about the network (Madry
et al. 2017). Such adversarial perturbations can be generally
formulated as follows:
x0=x+g(xL(f(x), y)),(5)
where is the step size to modify the data and xis the
gradient with respect to the input x. We take gto denote any
function on the gradient, for example, g(x) = kxkpis the `p
norm.
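As one concrete instance of Eq. (5), choosing $g$ to be the sign function recovers the FGSM update. The sketch below is a minimal rendition under that assumption, with an illustrative step size, inputs clipped to [0, 1], and a model that is assumed to output raw logits.

```python
import torch
import torch.nn.functional as F

def first_order_adversary(model, x, y, eps=8/255):
    """x' = x + eps * g(grad_x L(f(x), y)) with g = sign (the FGSM choice)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # untargeted loss w.r.t. the true label y
    grad = torch.autograd.grad(loss, x)[0]   # grad_x L(f(x), y)
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```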
3.2 Bias in Untargeted Adversarial Attacks
The first-order adversarial attacks usually deploy the CE loss
between the prediction $f(x)$ and the target $y$ to calculate
adversarial perturbations. The CE loss can be formulated as
$$L(f(x), y) = -\log \frac{e^{w_i^T z_x}}{\sum_{j=1}^{C} e^{w_j^T z_x}}. \quad (6)$$
For notation simplification in the rest of this paper, we have
$$\sigma(w_i^T z_x) = \frac{e^{w_i^T z_x}}{\sum_{j=1}^{C} e^{w_j^T z_x}}.$$
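The sketch below simply evaluates these $\sigma$-weights for the false classes of a single embedding, assuming $W$ stores the class prototypes row-wise and ignoring the bias term; all names are ours.

```python
import torch

def false_class_sigma(W, z_x, y):
    """sigma(w_j^T z_x) for every false class j != y: the softmax mass that the
    embedding z_x places on each class other than its true label y.
    W: (C, d) tensor with row j holding prototype w_j; z_x: (d,) tensor."""
    sigma = torch.softmax(W @ z_x, dim=0)   # sigma(w_j^T z_x) for all classes j
    sigma = sigma.clone()
    sigma[y] = 0.0                          # keep only the false-class weights
    return sigma
```

On CIFAR-10, the dog entry of a cat embedding typically carries the largest share of this mass, which is the bias that Lemma 1 below makes precise.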
Lemma 1 (proof in Appendix): For an oracle model that
predicts the labels perfectly on clean data, the gradient of
the CE loss with respect to sample $x$ from the $i$th category
is:
$$\nabla_x L(f(x), y) = \Big[\sum_{j \neq i}^{C} \sigma(w_j^T z_x)\, w_j\Big] \nabla_x z_x. \quad (7)$$
Lemma 1 indicates that for a clean data $x$ from the $i$th
category, its first-order adversarial update follows the direc-
tion of the superposition of all false-class prototypes $w_j$ for
$j \in [1, C], j \neq i$. The weight of the $j$th prototype $w_j$ in the
superposition is $\sigma(w_j^T z_x)$. The greater the value of the dot