Improving Adversarial Robustness with Self-Paced Hard-Class Pair Reweighting
Pengyue Hou,1 Jie Han,1 Xingyu Li1
1University of Alberta
pengyue@ualberta.ca, jhan8@ualberta.ca, xingyu@ualberta.ca
Abstract
Deep Neural Networks are vulnerable to adversarial attacks.
Among many defense strategies, adversarial training with un-
targeted attacks is one of the most effective methods. The-
oretically, adversarial perturbation in untargeted attacks can
be added along arbitrary directions and the predicted labels
of untargeted attacks should be unpredictable. However, we
find that the naturally imbalanced inter-class semantic sim-
ilarity makes those hard-class pairs become virtual targets
of each other. This study investigates the impact of such
closely-coupled classes on adversarial attacks and develops
a self-paced reweighting strategy in adversarial training ac-
cordingly. Specifically, we propose to upweight hard-class
pair losses in model optimization, which prompts learning
discriminative features from hard classes. We further incor-
porate a term to quantify hard-class pair consistency in ad-
versarial training, which greatly boosts model robustness. Ex-
tensive experiments show that the proposed adversarial train-
ing method achieves superior robustness performance over
state-of-the-art defenses against a wide range of adversar-
ial attacks. The code of the proposed SPAT is published at
https://github.com/puerrrr/Self-Paced-Adversarial-Training.
1 Introduction
In recent years, DNNs have been found to be vulnerable to ad-
versarial attacks, and extensive work has been carried out
on how to defend or reject the threat of adversarial sam-
ples (Szegedy et al. 2013; Goodfellow, Shlens, and Szegedy
2014; Nguyen, Yosinski, and Clune 2015). Adversarial
samples are carefully generated with human-imperceptible
noises, yet they can lead to large performance degradation
of well-trained models.
While numerous defenses have been proposed, adversar-
ial training (AT) is a widely recognized strategy (Madry
et al. 2017) and achieves promising performance against a
variety of attacks. AT treats adversarial attacks as an aug-
mentation method and aims to train models that can cor-
rectly classify both adversarial and clean data. Based on
the AT framework, further robustness improvements can
be achieved by exploiting unlabeled data, misclassified data,
pre-training, etc (Alayrac et al. 2019; Carmon et al. 2019;
Hendrycks, Lee, and Mazeika 2019; Zhai et al. 2019; Wang
et al. 2019; Jiang et al. 2020; Fan et al. 2021; Hou et al.
2022).
In existing adversarial training, untargeted attacks are
widely used in model optimization and evaluation (Moosavi-
Dezfooli, Fawzi, and Frossard 2016; Madry et al. 2017;
Zhang et al. 2019; Wang et al. 2019; Kannan, Kurakin, and
Goodfellow 2018; Shafahi et al. 2019; Wong, Rice, and
Kolter 2020). Unlike targeted attacks that aim to misguide
a model to a particular class other than the true one, untar-
geted adversaries do not specify the targeted category and
perturb the clean data so that its prediction is away from its
true label. In theory, adversarial perturbation in untargeted
attacks can be added along arbitrary directions and classifi-
cation of untargeted attacks should be unpredictable. How-
ever, the study by Carlini et al. argues that an untargeted
attack is simply a more efficient method of running a tar-
geted attack for each target and taking the closest (Carlini
and Wagner 2017b). Figure 1(a) presents the misclassifi-
cation statistics of PGD-attacked cat and dog images, where almost
half of the dog images are misclassified as cats, and over 40%
of the cat images are misclassified as dogs. Considering that
cat and dog images share many common features in vision,
we raise the following questions:
"Does the unbalanced inter-class semantic similarity lead to
the non-uniformly distributed misclassification statistics? If
yes, are classification predictions of untargeted adversaries
predictable?"
To answer these questions, this paper revisits the recipe
for generating gradient-based first-order adversaries and
surprisingly discovers that untargeted attacks may be tar-
geted. In theory, we prove that adversarial perturbation di-
rections in untargeted attacks are actually biased toward
the hard-class pairs of the clean data under attack. Intu-
itively, semantically-similar classes constitute hard-class
pairs (HCPs) and semantically-different classes form easy-
class pairs (ECPs).
Accordingly, we propose explicitly taking the inter-class
semantic similarity into account in AT algorithm design and
develop a self-paced adversarial training (SPAT) strategy to
upweight hard-class pair losses and downweight easy-
class pair losses, encouraging the training procedure to ne-
glect redundant information from easy class pairs. Since
HCPs and ECPs may change during model training (depend-
ing on the current optimization status), their scaling factors
Figure 1: Predictions of untargeted adversarial attacks
(PGD-20) by CIFAR-10 vanilla-trained and SPAT-trained
classifiers. (a) For the vanilla-trained model, over 40% of
the dog images are misclassified as cats; (b) this is reduced
to 30.6% with the SPAT-trained model.
are adaptively updated at their own pace. Such self-paced
reweighting offers SPAT more optimization flexibility. In ad-
dition, we further incorporate an HCP-ECP consistency term
in SPAT and show its effectiveness in boosting model adver-
sarial robustness. Our main contributions are:
• We investigate the cause of the unevenly distributed mis-
classification statistics in untargeted attacks. We find that
adversarial perturbations are actually biased by the targeted
sample's hard-class pairs.
• We introduce a SPAT strategy that takes inter-class se-
mantic similarity into account. Adaptively upweighting
hard-class pair loss encourages discriminative feature
learning.
• We propose incorporating an HCP-ECP consistency reg-
ularization term in adversarial training, which boosts
model adversarial robustness by a large margin.
2 Related Work
2.1 Adversarial Attack and Defense
The objective of adversarial attacks is to search for a human-
imperceptible perturbation $\delta$ so that the adversarial sample
$$x' = x + \delta \quad (1)$$
can fool a model $f(x;\phi)$ well-trained on clean data $x$. Here
$\phi$ represents the trainable parameters in a model. For nota-
tion simplification, we use $f(x)$ to denote $f(x;\phi)$ in the
rest of the paper. One main branch of adversarial noise gen-
eration is the gradient-based method, such as the Fast Gradi-
ent Sign Method (FGSM) (Goodfellow, Shlens, and Szegedy
2014), and its variants (Kurakin, Goodfellow, and Bengio
2016; Madry et al. 2017). Another popular strategy is opti-
mization-based, such as the CW attack (Carlini and Wagner
2017b).
Several pre/post-processing-based methods have shown
outstanding performance in adversarial detection and clas-
sification tasks (Grosse et al. 2017; Metzen et al. 2017; Xie
et al. 2017; Feinman et al. 2017; Li and Li 2017). They
aim to use either a secondary neural network or random
augmentation methods, such as cropping, compression and
blurring to strengthen model robustness. However, Carlini
et al. showed that they all can be defeated by a tailored at-
tack (Carlini and Wagner 2017a). Adversarial Training, on
the other hand, uses regularization methods to directly enhance
the robustness of classifiers. Such an optimization scheme is
often referred to as the "min-max game":
$$\arg\min_{\phi} \; \mathbb{E}_{(x,y)\sim D}\Big[\max_{\delta \in S} L(f(x'), y)\Big], \quad (2)$$
where the inner max function aims to generate efficient and
strong adversarial perturbation based on a specific loss func-
tion $L$, and the outer min function optimizes the network
parameters $\phi$ for model robustness. Another branch of AT
aims to achieve logit level robustness, where the objective
function not only requires correct classification of the adver-
sarial samples, but also encourages the logits of clean and
adversarial sample pairs to be similar (Kannan, Kurakin,
and Goodfellow 2018; Zhang et al. 2019; Wang et al. 2019).
Their AT objective functions usually can be formulated as a
compound loss:
$$\mathcal{L}(\theta) = \mathcal{L}_{acc} + \lambda \mathcal{L}_{rob}, \quad (3)$$
where $\mathcal{L}_{acc}$ is usually the cross entropy (CE) loss on clean
or adversarial data, $\mathcal{L}_{rob}$ quantifies clean-adversarial logit
pairing, and $\lambda$ is a hyper-parameter to control the relative
weights for these two terms. The proposed SPAT in this pa-
per introduces self-paced reweighting mechanisms upon the
compound loss and soft-differentiates hard/easy-class pair
losses in model optimization to boost model robustness.
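For concreteness, the sketch below pairs the inner maximization of Eq. (2) with the compound objective of Eq. (3), using a PGD attack for the inner loop and a KL-based logit-pairing term as $\mathcal{L}_{rob}$. The attack schedule, the KL choice, the hyper-parameter values, and the assumption that the model returns raw logits are illustrative; this is not the exact setting of any specific method cited above.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization of Eq. (2): PGD-style search for a perturbation
    inside an L-infinity ball of radius eps (hyper-parameters are illustrative).
    The model is assumed to output raw logits."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def compound_at_loss(model, x, y, lam=6.0):
    """Outer minimization with the compound loss of Eq. (3):
    L_acc = CE on clean data, L_rob = KL-based clean-adversarial logit pairing."""
    x_adv = pgd_attack(model, x, y)
    logits_clean, logits_adv = model(x), model(x_adv)
    l_acc = F.cross_entropy(logits_clean, y)
    l_rob = F.kl_div(F.log_softmax(logits_adv, dim=1),
                     F.softmax(logits_clean, dim=1), reduction='batchmean')
    return l_acc + lam * l_rob
```

Swapping the KL term for other pairing losses recovers different members of the logit-pairing family discussed above.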
2.2 Re-weighting in Adversarial Training
Re-weighting is a simple yet effective strategy for address-
ing biases in machine learning, for instance, class imbalance.
When class imbalance exists in a dataset, the training pro-
cedure is very likely to over-fit to the categories with a larger
number of samples, leading to unsatisfactory performance
regarding minority groups. With the re-weighting technique,
one can down-weight the loss from majority classes and ob-
tain a balanced learning solution for minority groups.
Re-weighting is also a common technique for hard ex-
ample mining. Generally, hard examples are those data that
have similar representations but belong to different classes.
Hard sample mining is a crucial component in deep metric
learning (Hoffer and Ailon 2015; Hermans, Beyer, and Leibe
2017) and Contrastive learning (Chen et al. 2020; Khosla
et al. 2020). With re-weighting, we can directly utilize the
loss information during training and characterize those sam-
ples that contribute large losses as hard examples. For ex-
ample, OHEM (Shrivastava, Gupta, and Girshick 2016) and
Focal Loss (Lin et al. 2017) put more weight on the loss of
misclassified samples to effectively minimize the impact of
easy examples.
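As a reference point, focal-loss-style reweighting can be sketched in a few lines. The implementation below is a minimal rendition of the published formula; gamma = 2 is a commonly used default and is merely an assumption here, as are the function and variable names.

```python
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: scale each sample's CE term by (1 - p_t)^gamma so that
    well-classified (easy) examples contribute little and hard,
    misclassified examples dominate the gradient."""
    log_pt = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()                      # predicted probability of the true class
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```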
Previous studies show that utilizing hard adversarial sam-
ples promotes stronger adversarial robustness (Madry et al.
2017; Wang et al. 2019; Mao et al. 2019; Pang et al. 2020).
For instance, MART (Wang et al. 2019) explicitly applies a
re-weighting factor for misclassified samples by a soft de-
cision scheme. Recently, several re-weighting-based algo-
rithms have also been proposed to address fairness-related
issues in AT. Wang et al. (2021) adopted a re-weighting strat-
egy to address the data imbalance problem in AT and showed
that adversarially trained models can suffer much worse per-
formance degradation in under-represented classes. Xu et
al. (Xu et al. 2021) empirically showed that even in bal-
anced datasets, AT still suffers from the fairness problem,
where some classes have much higher performance than oth-
ers. They propose to combine re-weighting and re-margin
for different classes to achieve robust fairness. Zhang et
al. (Zhang et al. 2020) propose to assign weights based
on how difficult it is to change the prediction of a natural data
point to a different class. However, existing AT re-weighting
strategies only considered intra-class or inter-sample rela-
tionships, but ignored the inter-class biases in model opti-
mization. We propose to explicitly take the inter-class se-
mantic similarity into account in the proposed SPAT strategy
and up-weight the loss from hard-class pairs in AT.
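To make the contrast with sample-level schemes concrete, the sketch below shows one generic way an inter-class signal could be folded into the loss: each sample's CE term is up-weighted by the softmax mass its prediction places on its most confusable (hard-pair) class. This is only an illustration of the idea; it is not the SPAT objective, whose self-paced formulation is introduced in the following sections, and the weighting rule and all names are our own assumptions.

```python
import torch
import torch.nn.functional as F

def hard_pair_weights(logits, targets, alpha=1.0):
    """Illustrative per-sample factor: 1 + alpha * (softmax mass on the most
    confusable false class). Not the SPAT loss defined later in the paper."""
    probs = F.softmax(logits, dim=1)
    false_probs = probs.clone()
    false_probs[torch.arange(len(targets)), targets] = 0.0   # drop the true class
    hardest_mass = false_probs.max(dim=1).values             # hard-class-pair mass
    return 1.0 + alpha * hardest_mass

def reweighted_ce(logits, targets):
    """Cross entropy whose per-sample terms are scaled by the factors above."""
    w = hard_pair_weights(logits, targets).detach()
    return (w * F.cross_entropy(logits, targets, reduction='none')).mean()
```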
3 Untargeted Adversaries are Targeted
Untargeted adversarial attacks are usually adopted in adver-
sarial training. In theory, adversarial perturbation in untar-
geted attacks can be added along arbitrary directions, lead-
ing to unpredictable false classification. However, our ob-
servations on many adversarial attacks contradict this. For
example, when untargeted adversaries attack images of cats,
the resulting images are empirically likely to be classified as
dogs. We visualize image embeddings from the penultimate
layer of the vanilla-trained model via t-SNE in Figure 2. In
the figure, the embeddings of dog and cat images are close
to each other, which suggests the semantic similarity in their
representations. With this observation, we hypothesize that
the unbalanced inter-class semantic similarity leads to the
non-uniformly distributed misclassification statistics.
In this section, we investigate this interesting yet over-
looked aspect of adversarial attacks and find that untargeted
adversarial examples may be highly biased by their hard-
class pairs. The insight in this section directly motivates the
proposed self-paced adversarial training for model robust-
ness improvement.
3.1 Notations
Given a dataset with labeled pairs $\{X, Y\} = \{(x, y) \mid x \in
\mathbb{R}^{c \times m \times n}, y \in [1, C]\}$, a classifier can be formulated as a
mapping function $f: X \rightarrow Y$:
$$f(x) = S(W^T z_x), \quad (4)$$
where $C$ is the number of categories, and $S$ represents the
softmax function in the classification layer. We use $z_x$ to
denote the representation of an input sample $x$ in the penul-
timate layer of the model and $W = (w_1, w_2, ..., w_C)$ for
the trainable parameters (including weights and bias) of the
softmax layer. Note that $w_i$ can be considered as the proto-
type of class $i$ and the product $W^T z_x$ in (4) calculates
the similarity between $z_x$ and the different class prototypes $w_i$.
During training, the model $f$ is optimized to minimize a spe-
cific loss $L(f(x), y)$.
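A minimal sketch of the formulation in Eq. (4), assuming an arbitrary backbone network that produces the penultimate embedding $z_x$; the class and variable names are ours, and the bias is folded into the linear layer.

```python
import torch.nn as nn

class PrototypeClassifier(nn.Module):
    """f(x) = S(W^T z_x): the rows of the final linear layer act as the
    class prototypes w_1, ..., w_C (bias handled by nn.Linear)."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone                    # maps x to its embedding z_x
        self.fc = nn.Linear(feat_dim, num_classes)  # row i of fc.weight is w_i

    def forward(self, x):
        z_x = self.backbone(x)          # penultimate-layer representation
        logits = self.fc(z_x)           # similarities w_i^T z_x (+ bias)
        return logits.softmax(dim=1)    # S(W^T z_x)
```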
Figure 2: t-SNE visualization of 1000 randomly sampled im-
age embeddings from CIFAR-10. Due to the naturally im-
balanced semantic similarity, inter-class distance is much
smaller for hard-class pairs.

In the literature, the most commonly used adversarial attacks,
such as PGD and its variants, generate adversaries based on
first-order derivative information about the network (Madry
et al. 2017). Such adversarial perturbations can be generally
formulated as follows:
x0=x+g(xL(f(x), y)),(5)
where is the step size to modify the data and xis the
gradient with respect to the input x. We take gto denote any
function on the gradient, for example, g(x) = kxkpis the `p
norm.
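As one concrete instance of Eq. (5), choosing $g$ to be the sign function recovers the FGSM update. The sketch below is a minimal rendition under that assumption, with an illustrative step size, inputs clipped to [0, 1], and a model that is assumed to output raw logits.

```python
import torch
import torch.nn.functional as F

def first_order_adversary(model, x, y, eps=8/255):
    """x' = x + eps * g(grad_x L(f(x), y)) with g = sign (the FGSM choice)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # untargeted loss w.r.t. the true label y
    grad = torch.autograd.grad(loss, x)[0]   # grad_x L(f(x), y)
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```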
3.2 Bias in Untargeted Adversarial Attacks
The first-order adversarial attacks usually deploy the CE loss
between the prediction $f(x)$ and the target $y$ to calculate
adversarial perturbations. The CE loss can be formulated as
$$L(f(x), y) = -\log \frac{e^{w_i^T z_x}}{\sum_{j=1}^{C} e^{w_j^T z_x}}. \quad (6)$$
For notation simplification in the rest of this paper, we have
$$\sigma(w_i^T z_x) = \frac{e^{w_i^T z_x}}{\sum_{j=1}^{C} e^{w_j^T z_x}}.$$
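The sketch below simply evaluates these $\sigma$-weights for the false classes of a single embedding, assuming $W$ stores the class prototypes row-wise and ignoring the bias term; all names are ours.

```python
import torch

def false_class_sigma(W, z_x, y):
    """sigma(w_j^T z_x) for every false class j != y: the softmax mass that the
    embedding z_x places on each class other than its true label y.
    W: (C, d) tensor with row j holding prototype w_j; z_x: (d,) tensor."""
    sigma = torch.softmax(W @ z_x, dim=0)   # sigma(w_j^T z_x) for all classes j
    sigma = sigma.clone()
    sigma[y] = 0.0                          # keep only the false-class weights
    return sigma
```

On CIFAR-10, the dog entry of a cat embedding typically carries the largest share of this mass, which is the bias that Lemma 1 below makes precise.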
Lemma 1 (proof in Appendix): For an oracle model that
predicts the labels perfectly on clean data, the gradient of
the CE loss with respect to sample $x$ from the $i$th category
is:
$$\nabla_x L(f(x), y) = \Big[\sum_{j \neq i}^{C} \sigma(w_j^T z_x)\, w_j\Big] \nabla_x z_x. \quad (7)$$
Lemma 1 indicates that for a clean data $x$ from the $i$th
category, its first-order adversarial update follows the direc-
tion of the superposition of all false-class prototypes $w_j$ for
$j \in [1, C], j \neq i$. The weight of the $j$th prototype $w_j$ in the
superposition is $\sigma(w_j^T z_x)$. The greater the value of the dot