
Fig. 1: Overview of the data cleansing method.
poison a dataset by: i) choosing an attack target class $t$ and obtaining a subset (of size $M$) of images from classes other than $t$: $\mathcal{D}_s = \{(x_j, y_j)\,|\,j = 1, \dots, M,\ y_j \neq t\}$, $\mathcal{D}_s \subset \mathcal{D}$, with $M \ll N$; ii) incorporating the backdoor pattern into each sample in $\mathcal{D}_s$ using the attacker's backdoor embedding function $g: \mathbb{R}^{X \times H \times W} \rightarrow \mathbb{R}^{X \times H \times W}$; iii) changing the label of each poisoned sample to the target class: $\mathcal{D}_p = \{(g(x), t)\,|\,x \in \mathcal{D}_s\}$; iv) finally, forming the poisoned dataset by putting the attacked images back into the training set: $\bar{\mathcal{D}} = (\mathcal{D} \setminus \mathcal{D}_s) \cup \mathcal{D}_p$. If the attack is successful, the victim model $f: \mathbb{R}^{X \times H \times W} \rightarrow \{1, 2, \dots, C\}$, when trained on the poisoned dataset, will have normal (good) classification accuracy on clean (backdoor-free) test samples, but will classify most backdoor-triggered test samples to the target class of the attack. In the image domain, backdoor patterns could, e.g., be: i) a small patch that replaces the original pixels of an image [1, 2, 3]; ii) a perturbation added to some pixels of an image [4, 5, 6]; or iii) a "blended" patch attack [4].
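To make steps i)-iv) concrete, the following is a minimal numpy sketch of the poisoning procedure, assuming an additive, perturbation-style backdoor pattern (type ii above); the array names, trigger, and sizes are illustrative placeholders rather than an attack configuration from the literature.

```python
import numpy as np

def poison_dataset(X, y, target_class, M, delta, seed=None):
    """Poison (X, y) following steps i)-iv): pick M non-target images,
    embed an additive backdoor pattern `delta`, and relabel them to
    `target_class`. Returns the poisoned dataset and the poisoned indices."""
    rng = np.random.default_rng(seed)

    # i) choose M samples whose labels differ from the target class
    candidates = np.flatnonzero(y != target_class)
    poison_idx = rng.choice(candidates, size=M, replace=False)

    Xp, yp = X.copy(), y.copy()
    # ii) embed the backdoor pattern, here g(x) = clip(x + delta)
    Xp[poison_idx] = np.clip(Xp[poison_idx] + delta, 0.0, 1.0)
    # iii) relabel the poisoned samples as the target class
    yp[poison_idx] = target_class
    # iv) the poisoned training set is the union of the untouched clean
    #     samples and the modified ones (already in place in Xp, yp)
    return Xp, yp, poison_idx

# Illustrative usage on random "images" in [0, 1]^{3x32x32}
X = np.random.rand(1000, 3, 32, 32).astype(np.float32)
y = np.random.randint(0, 10, size=1000)
delta = np.zeros((3, 32, 32), dtype=np.float32)
delta[:, :3, :3] = 0.5          # small perturbation in one image corner
Xp, yp, idx = poison_dataset(X, y, target_class=0, M=50, delta=delta, seed=0)
```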
On the other hand, the defender aims to obtain a classifier that has good classification accuracy on clean test samples and that also correctly classifies test samples containing the backdoor pattern. Backdoor defenses deployed post-training aim to detect whether a DNN model is a backdoor victim [2, 7, 8, 9, 6, 10] and, further, to mitigate the attack if a detection is declared [2, 11, 12]. Most post-training defenses require a relatively small clean dataset (drawn from the same distribution as the clean training data), and their performance is generally sensitive to the number of available clean samples [12, 2, 7, 10].
In this paper, alternatively, we aim to cleanse the training set prior to deep learning. Related work on training set cleansing includes [13, 14, 15, 16]. All of these methods rely on embedded feature representations of a classifier fully trained on the possibly poisoned training set ([14] suggests that an auto-encoder could be used instead). [14, 13] use a 2-component clustering approach to separate backdoor-poisoned samples from clean samples ([14] uses a singular-value decomposition while [13] uses simple 2-means clustering), whereas [15] uses a Gaussian mixture model whose number of components is chosen based on BIC [17]. Instead of clustering, [16] employs a reverse-engineered backdoor pattern estimated using a small clean dataset. DBD [18] builds a classifier on top of an encoder learned via a self-supervised contrastive loss and then fine-tunes it iteratively: in each iteration, samples identified by the classifier as "low-credible" have their labels removed, and the classifier is updated on the processed dataset in a semi-supervised manner.
3. METHODOLOGY
3.1. Vulnerability of supervised training
We now illustrate the vulnerability of supervised training
by analysis of a simple linear model trained on a poisoned
dataset, considering the case where all classes other than the
target are (poisoned) source classes. The victim classifier
forms a linear discriminant function for each class $s$, i.e., the inner product $f_s(x) = x \cdot w_s$, where $w_s \in \mathbb{R}^{X \times H \times W}$ is the vector of model weights corresponding to class $s$. Assume that, after supervised training, each training sample is classified correctly with confidence at least $\tau > 0$, as measured by the margin:
$$f_{y_i}(x_i) - \max_{c \neq y_i} f_c(x_i) \ge \tau, \quad \forall (x_i, y_i) \in \bar{\mathcal{D}}. \qquad (1)$$
Assuming that the backdoor pattern $\Delta x$ is additively incorporated, for an attack sample based on a clean $x_s$ originally from source class $s \neq t$, Eq. (1) implies
$$w_t \cdot (x_s + \Delta x) - w_s \cdot (x_s + \Delta x) \ge \tau. \qquad (2)$$
If $x_s$ is also classified to $s$ with margin $\tau$, then
$$w_s \cdot x_s - w_t \cdot x_s \ge \tau. \qquad (3)$$
Adding (2) and (3) gives
$$f_t(\Delta x) - f_s(\Delta x) = (w_t - w_s) \cdot \Delta x \ge 2\tau. \qquad (4)$$
This loosely suggests that, after training with a poisoned
training dataset, the model has stronger “affinity” between
the target class and the backdoor pattern (4) than between
the source class and the class-discriminative features of clean
source-class samples (3). This phenomenon is experimentally
verified when the model is a DNN, as shown in Apdx. A.
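As a lightweight illustration of (4) (separate from the DNN experiment in Apdx. A), the following numpy sketch trains a multinomial logistic (linear softmax) model on a poisoned toy dataset and prints $(w_t - w_s)\cdot\Delta x$ for each source class; the class geometry, pattern $\Delta x$, poisoning rate, and training hyperparameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d, n_per_class, target = 3, 20, 300, 0

# Clean data: one Gaussian cluster per class
means = rng.normal(0.0, 2.0, size=(C, d))
X = np.vstack([rng.normal(means[c], 1.0, size=(n_per_class, d)) for c in range(C)])
y = np.repeat(np.arange(C), n_per_class)

# Poison 10% of the non-target samples with an additive pattern, relabel to the target
delta = np.zeros(d)
delta[:3] = 1.0
poison = rng.choice(np.flatnonzero(y != target), size=60, replace=False)
X[poison] += delta
y[poison] = target

# Train a linear softmax classifier with weight matrix W (C x d) by gradient descent
W = np.zeros((C, d))
Y = np.eye(C)[y]                                   # one-hot labels
for _ in range(2000):
    logits = X @ W.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * (p - Y).T @ X / len(X)              # cross-entropy gradient step

# "Affinity" of the backdoor pattern with the target class, cf. Eq. (4)
for s in range(C):
    if s != target:
        print(f"(w_t - w_{s}) . delta = {(W[target] - W[s]) @ delta:.2f}")
```

If the trained model fits the poisoned samples with a positive margin, the printed quantities should be positive, in line with (4).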
However, these strong affinities are only made possible by
the mislabeling of the backdoor-poisoned samples. Given that
usually the perturbation $\Delta x$ is small, backdoor-attacked images differ minutely from the original (clean) images. Thus, if a model is trained in a self-supervised manner, without making use of the class labels, the feature representations of $x$ and $x + \Delta x$ should be quite similar (highly proximal). Hence, in the model's representation space, poisoned samples may "stand out" as outliers in that their labels may disagree with the labels of samples in close proximity to them. This is the basic idea behind the cleansing method we now describe.
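To make this intuition concrete before describing the method itself, below is a small numpy sketch of one possible, purely illustrative instantiation: flag a sample as suspicious if its label disagrees with the majority label among its nearest neighbors in a self-supervised feature space. This is not the procedure proposed in this paper, only an illustration of the label-disagreement idea.

```python
import numpy as np

def flag_label_outliers(features, labels, k=10):
    """Flag samples whose label disagrees with the majority label of their
    k nearest neighbors (cosine similarity) in a given feature space.
    features: (N, d) self-supervised embeddings; labels: (N,) integer labels."""
    Z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = Z @ Z.T                                      # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # a sample is not its own neighbor
    nn_idx = np.argpartition(-sim, k, axis=1)[:, :k]   # k nearest neighbors per sample

    flags = np.zeros(len(labels), dtype=bool)
    for i, nbrs in enumerate(nn_idx):
        majority = np.bincount(labels[nbrs]).argmax()
        flags[i] = (majority != labels[i])             # label disagrees with neighborhood
    return flags
```

The feature space itself is obtained via self-supervised contrastive learning, described next.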
3.2. Self-supervised contrastive learning
SimCLR [19, 20] is a self-supervised training method to learn
a feature representation for images based on their semantic
content. In SimCLR, in each mini-batch, $K$ samples are randomly selected from the training dataset, and each selected sample $x_k$ is augmented to form two versions, resulting in $2K$ augmented samples. The augmented samples are then fed into the feature representation model, which is an encoder $E(\cdot)$ followed by a linear projector $L(\cdot)$, with the feature vector $z$ extracted from the last layer: $z = L(E(x))$. For simplicity we will refer to $L(E(\cdot))$ as the "encoder" hereon. The encoder is trained to minimize the following objective function:
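(For concreteness, a minimal PyTorch-style sketch of one SimCLR training step is given below, assuming the standard NT-Xent formulation of [19]; the encoder, augmentation function, and temperature are illustrative placeholders, not the configuration used in this work.)

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss over 2K projected features z (shape 2K x d), where rows
    2i and 2i+1 hold the two augmented views of the same original sample."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / temperature                              # pairwise cosine similarities
    mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))               # a view is not its own positive
    targets = torch.arange(len(z), device=z.device) ^ 1      # positive of row 2i is row 2i+1
    return F.cross_entropy(sim, targets)

def simclr_step(encoder, optimizer, batch, augment):
    """One training step of the 'encoder' L(E(.)) on a mini-batch of K images."""
    K = batch.shape[0]
    views = torch.cat([augment(batch), augment(batch)], dim=0)   # 2K augmented images
    z = encoder(views)                                           # 2K x d projected features
    # reorder so the two views of sample k occupy adjacent rows 2k and 2k+1
    idx = torch.arange(2 * K).reshape(2, K).T.reshape(-1)
    loss = nt_xent_loss(z[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `augment` stands in for SimCLR's stochastic data augmentation (e.g., random cropping and color distortion).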