Untargeted Backdoor Watermark: Towards
Harmless and Stealthy Dataset Copyright Protection
Yiming Li1*, Yang Bai2*, Yong Jiang1, Yong Yang3, Shu-Tao Xia1, Bo Li4
1Tsinghua Shenzhen International Graduate School, Tsinghua University, China
2Tencent Security Zhuque Lab, China
3Tencent Security Platform Department, China
4The Department of Computer Science, University of Illinois at Urbana-Champaign, USA
li-ym18@mails.tsinghua.edu.cn; {mavisbai,coolcyang}@tencent.com;
{jiangy,xiast}@sz.tsinghua.edu.cn;lbo@illinois.edu
Abstract
Deep neural networks (DNNs) have demonstrated their superiority in practice.
Arguably, the rapid development of DNNs has largely benefited from high-quality
(open-sourced) datasets, based on which researchers and developers can easily
evaluate and improve their learning methods. Since data collection is usually
time-consuming or even expensive, how to protect dataset copyrights is of great
significance and worth further exploration. In this paper, we revisit dataset own-
ership verification. We find that existing verification methods introduce new
security risks into DNNs trained on the protected dataset, due to the targeted nature
of poison-only backdoor watermarks. To alleviate this problem, in this work, we
explore the untargeted backdoor watermarking scheme, where the abnormal model
behaviors are not deterministic. Specifically, we introduce two dispersibilities
and prove their correlation, based on which we design the untargeted backdoor
watermark under both poisoned-label and clean-label settings. We also discuss how
to use the proposed untargeted backdoor watermark for dataset ownership verifica-
tion. Experiments on benchmark datasets verify the effectiveness of our methods
and their resistance to existing backdoor defenses. Our codes are available at
https://github.com/THUYimingLi/Untargeted_Backdoor_Watermark.
1 Introduction
Deep neural networks (DNNs) have been widely and successfully deployed in many applications, for
their effectiveness and efficiency. Arguably, the existence of high-quality open-sourced datasets (e.g., CIFAR-10 [1] and ImageNet [2]) is one of the key factors for the prosperity of DNNs. Researchers and developers can easily evaluate and improve their methods based on them. However, due to their high accessibility, these datasets may be used for commercial purposes without authorization, rather than only for educational or academic goals.
Currently, there are some classical methods for data protection, including encryption, data watermarking, and defenses against data leakage. However, these methods cannot be used to protect the copyrights of open-sourced datasets, since they either hinder dataset accessibility or functionality (e.g., encryption), require manipulating the training process (e.g., differential privacy), or simply have no effect in this case. To the best of our knowledge, there is only one method [3, 4] designed for protecting open-sourced datasets. Specifically, it first adopts poison-only backdoor attacks [5] to watermark the unprotected dataset and then conducts ownership verification by verifying whether the suspicious model has specific targeted backdoor behaviors (as shown in Figure 1).
*The first two authors contributed equally to this work. Correspondence to: Yang Bai and Shu-Tao Xia.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: The verification process of BEDW. Benign and poisoned images are fed to the suspicious DNN, and the resulting probabilities on the target class are compared via a hypothesis test whose alternative hypothesis H_1 states that the probability for poisoned images exceeds that for benign images.
Figure 2: The inference process of DNNs with different types of backdoor watermarks. Given poisoned images, DNNs with targeted backdoor watermarks predict the target label, whereas DNNs with our untargeted backdoor watermarks predict random labels.
In this paper, we revisit dataset ownership verification. We argue that BEDW introduces new security threats into DNNs trained on the protected datasets, due to the targeted manner of existing backdoor watermarks. Specifically, the adversaries can exploit the embedded hidden backdoors to maliciously and deterministically manipulate model predictions (as shown in Figure 2).
Based on this understanding, we explore how to design the untargeted backdoor watermark (UBW)
and how to use it for harmless and stealthy dataset ownership verification. Specifically, we first
introduce two dispersibilities, including averaged sample-wise and averaged class-wise dispersibility,
and prove their correlation. Based on them, we propose a simple yet effective heuristic method for the UBW with poisoned labels (i.e., UBW-P) and a bi-level-optimization-based UBW with clean labels (i.e., UBW-C). The UBW-P is more effective while the UBW-C is more stealthy. At the end, we also design a UBW-based dataset ownership verification method based on the pairwise T-test [6].
The main contributions of this paper are four-fold: 1) We reveal the limitations of existing methods in protecting the copyrights of open-sourced datasets; 2) We explore the untargeted backdoor watermark (UBW) paradigm under both poisoned-label and clean-label settings; 3) We further discuss how to use our UBW for harmless and stealthy dataset ownership verification; 4) Extensive experiments on benchmark datasets verify the effectiveness of our method.
2 Related Work
In this paper, we focus on the backdoor watermarks in image classification. The watermarks in other
tasks (e.g., [7, 8, 9]) and their dataset protection are out of the scope of this paper.
2.1 Data Protection
Data protection aims to prevent unauthorized data usage or protect data privacy, which has always
been an important research direction. Currently, encryption, data watermarking, and the defenses
against data leakage are the most widespread methods discussed in data protection, as follows:
Encryption. Currently, encryption is the most widely used data protection method, which encrypts the whole or parts of the protected data [10, 11, 12]. Only authorized users have the secret key to decrypt the encrypted data for further usage. Besides directly preventing unauthorized data usage, there are also some empirical methods that focus on encrypting only the sensitive information (e.g., backgrounds or image-label mappings) [13, 14, 15].
Data Watermarking. This approach was initially used to embed a distinctive watermark into the data to protect its copyright based on ownership verification [16, 17, 18]. Recently, data watermarking was also adopted for other applications, such as DeepFake detection [19] and image steganography [20], inspired by its unique properties.
Defenses against Data Leakage. These methods mainly focus on preventing the leakage of sensitive information (e.g., membership inference [21], attribute inference [22], and deep gradient leakage [23]) during the training process. Among these methods, differential privacy [24, 25, 26] is the most representative one, for its good theoretical properties and effectiveness. In general, differential privacy introduces certain randomness by adding noise when training the model.
However, the aforementioned methods cannot be adopted to prevent open-sourced datasets from being used without authorization, since they either hinder dataset functionalities or are not applicable in this scenario. To the best of our knowledge, there is only one method [3, 4] designed for protecting open-sourced datasets, which is based on poison-only targeted backdoor attacks [5]. However, this method introduces new security threats into the models trained on the protected dataset, which hinders its usage. How to better protect dataset copyrights is still an important open question.
2.2 Backdoor Attacks
Backdoor attacks are emerging yet critical threats in the training process of deep neural networks
(DNNs), where the adversary intends to embed hidden backdoors into DNNs. The attacked models
behave normally in predicting benign samples, whereas the predictions are maliciously changed
whenever the adversary-specified trigger patterns appear. Due to this property, they were also used as
the watermark techniques for model [27, 28, 29] and dataset [3, 4] ownership verification.
In general, existing backdoor attacks can be divided into three main categories based on the adversary's capacity level, including 1) poison-only attacks [30, 31, 32], 2) training-controlled attacks [33, 34, 35], and 3) model-modified attacks [36, 37, 38]. In this paper, we only focus on poison-only backdoor attacks, since they constitute the hardest attack setting and have the most widespread threat scenarios; only these attacks can be used to protect open-sourced datasets [3, 4]. In particular, based on the label type, existing poison-only attacks can be further separated into two main sub-types, as follows:
Poison-only Backdoor Attacks with Poisoned Labels. In these attacks, the re-assigned labels of poisoned samples are different from their ground-truth labels. For example, a cat-like poisoned image may be labeled as a dog in the poisoned dataset released by backdoor adversaries. This is currently the most widespread attack paradigm. To the best of our knowledge, BadNets [30] is the first and most representative attack with poisoned labels. Specifically, the BadNets adversary randomly selects certain benign samples from the original benign dataset and generates poisoned samples by adding a specific trigger pattern to the images and changing their labels to the pre-defined target label. The adversary then combines the generated poisoned samples with the remaining benign ones to make the poisoned dataset, which is released to train the attacked models. After that, Chen et al. [39] proposed the blended attack, which suggested that the poisoned image should be similar to its benign version to ensure stealthiness. Most recently, a more stealthy and effective attack (i.e., WaNet [32]) was proposed, which exploits image warping to design trigger patterns.
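To make the BadNets-style poisoning procedure above concrete, the following sketch stamps a small patch trigger onto a randomly selected fraction of the images and relabels them with the target class. It is a minimal illustration under our own assumptions (NumPy image arrays of shape (N, H, W, C) and hypothetical defaults for the target label, patch size, and poisoning rate), not the original BadNets implementation.

```python
import numpy as np

def badnets_poison(images, labels, target_label=0, poison_rate=0.1,
                   patch_size=3, patch_value=255, seed=0):
    """BadNets-style poisoned-label dataset: patch trigger + target label.

    images: uint8 array of shape (N, H, W, C); labels: int array of shape (N,).
    A white square is stamped in the bottom-right corner of the selected
    images, and their labels are changed to `target_label`.
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    num_poison = int(poison_rate * len(images))
    idx = rng.choice(len(images), size=num_poison, replace=False)
    images[idx, -patch_size:, -patch_size:, :] = patch_value  # add the trigger pattern
    labels[idx] = target_label                                # re-assign the poisoned label
    return images, labels, idx
```

The returned index set corresponds to the modified subset; merging the modified samples with the untouched ones yields the poisoned dataset that is released to train the attacked models.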
Poison-only Backdoor Attacks with Clean Labels. Turner et al. [31] proposed the first poison-only backdoor attack with clean labels (i.e., the label-consistent attack), where the target label is the same as the ground-truth label of all poisoned samples. They argued that attacks with poisoned labels are not stealthy enough even when the trigger pattern is invisible, since users can still identify the attack by examining the image-label relation once they catch the poisoned samples. However, this attack is far less effective when the dataset has many classes or a high image resolution (e.g., GTSRB and ImageNet) [40, 41, 5]. Most recently, a more effective attack (i.e., Sleeper Agent) was proposed, which generates trigger patterns by optimization [40]. Nevertheless, these attacks remain difficult, since the 'robust features' contained in the poisoned images hinder the learning of trigger patterns [5]. How to design effective attacks with clean labels is still left far behind and worth further exploration.
Besides, to the best of our knowledge, all existing backdoor attacks are targeted, i.e., the predictions of poisoned samples are deterministic and known by the adversaries. How to design backdoor attacks in an untargeted manner, and how to exploit their positive applications, remain blank and worth further exploration.
3 Untargeted Backdoor Watermark (UBW)
3.1 Preliminaries
Threat Model. In this paper, we focus on poison-only backdoor attacks as the backdoor watermarks in image classification. Specifically, the backdoor adversaries are only allowed to modify some benign samples, while having neither the information about nor the ability to modify other training components (e.g., training loss, training schedule, and model structure). The generated poisoned samples, together with the remaining unmodified benign ones, will be released to victims, who will train their DNNs on them. In particular, we only consider poison-only backdoor attacks instead of other types of methods (e.g., training-controlled attacks or model-modified attacks), because the latter require additional adversary capacities and therefore cannot be used to protect open-sourced datasets [3, 4].
The Main Pipeline of Existing Targeted Backdoor Attacks. Let $\mathcal{D}=\{(\bm{x}_i, y_i)\}_{i=1}^N$ denote the benign training set, where $\bm{x}_i \in \mathcal{X} = \{0, 1, \ldots, 255\}^{C \times W \times H}$ is the image, $y_i \in \mathcal{Y} = \{1, \ldots, K\}$ is its label, and $K$ is the number of classes. How to generate the poisoned dataset $\mathcal{D}_p$ is the cornerstone of poison-only backdoor attacks. To the best of our knowledge, almost all existing backdoor attacks are targeted, where all poisoned samples share the same target label. Specifically, $\mathcal{D}_p$ consists of two disjoint parts, including the modified version of a selected subset (i.e., $\mathcal{D}_s$) of $\mathcal{D}$ and the remaining benign samples, i.e., $\mathcal{D}_p = \mathcal{D}_m \cup \mathcal{D}_b$, where $y_t$ is an adversary-specified target label, $\mathcal{D}_b = \mathcal{D} \backslash \mathcal{D}_s$, $\mathcal{D}_m = \{(\bm{x}', y_t) \mid \bm{x}' = G(\bm{x}; \bm{\theta}), (\bm{x}, y) \in \mathcal{D}_s\}$, $\gamma \triangleq \frac{|\mathcal{D}_s|}{|\mathcal{D}|}$ is the poisoning rate, and $G: \mathcal{X} \rightarrow \mathcal{X}$ is an adversary-specified poisoned image generator with parameter $\bm{\theta}$. In particular, poison-only backdoor attacks are mainly characterized by their poison generator $G$. For example, $G(\bm{x}) = (1-\bm{\alpha}) \otimes \bm{x} + \bm{\alpha} \otimes \bm{t}$ in the blended attack [39], where $\bm{\alpha} \in [0,1]^{C \times W \times H}$, $\bm{t} \in \mathcal{X}$ is the trigger pattern, and $\otimes$ is the element-wise product; $G(\bm{x}) = \bm{x} + \bm{t}$ in the ISSBA [42]. Once the poisoned dataset $\mathcal{D}_p$ is generated, it will be released to train DNNs. Accordingly, in the inference process, the attacked model behaves normally in predicting benign samples, while its predictions will be maliciously and constantly changed to the target label whenever poisoned images appear.
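As a concrete example of a poisoned image generator $G$, the snippet below implements the blended-attack form $G(\bm{x}) = (1-\bm{\alpha}) \otimes \bm{x} + \bm{\alpha} \otimes \bm{t}$ described above. It is a minimal sketch with assumed tensor shapes and a hypothetical random trigger; it is not the reference implementation of [39].

```python
import numpy as np

def blended_generator(x, trigger, alpha):
    """Blended-attack generator G(x) = (1 - alpha) * x + alpha * trigger.

    x, trigger: float arrays in [0, 255] with shape (C, W, H);
    alpha: per-pixel blending weights in [0, 1] with the same shape.
    """
    return (1.0 - alpha) * x + alpha * trigger

# Example: blend a faint random-noise trigger into a random image.
x = np.random.randint(0, 256, size=(3, 32, 32)).astype(np.float32)
trigger = np.random.randint(0, 256, size=(3, 32, 32)).astype(np.float32)
alpha = np.full((3, 32, 32), 0.1, dtype=np.float32)  # 10% trigger visibility
poisoned = blended_generator(x, trigger, alpha)
```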
3.2 Problem Formulation
As described in previous sections, DNNs trained on the poisoned dataset will have distinctive behaviors on poisoned images while behaving normally in predicting benign images. As such, poison-only backdoor attacks can be used to watermark (open-sourced) datasets for their copyright protection. However, this method introduces new security threats into the model, since the backdoor adversaries can determine the model predictions of malicious samples, due to the targeted nature of existing backdoor watermarks. Motivated by this understanding, we explore the untargeted backdoor watermark (UBW) in this paper.
Our Watermark's Goals. The UBW has three main goals, including 1) effectiveness, 2) stealthiness, and 3) dispersibility. Specifically, effectiveness requires that the watermarked DNNs misclassify poisoned images; stealthiness requires that dataset users cannot identify the watermark; dispersibility (formalized in Definition 1) ensures dispersible predictions of poisoned images.
Definition 1 (Averaged Prediction Dispersibility). Let $\mathcal{D}=\{(\bm{x}_i, y_i)\}_{i=1}^N$ denote the dataset, where $y_i \in \mathcal{Y} = \{1, \ldots, K\}$, and let $C: \mathcal{X} \rightarrow \mathcal{Y}$ be a classifier. Let $\bm{P}^{(j)}$ be the probability vector of model predictions on samples having the ground-truth label $j$, where the $i$-th element of $\bm{P}^{(j)}$ is
$$P^{(j)}_i \triangleq \frac{\sum_{k=1}^N \mathbb{I}\{C(\bm{x}_k) = i\} \cdot \mathbb{I}\{y_k = j\}}{\sum_{k=1}^N \mathbb{I}\{y_k = j\}}. \quad (1)$$
The averaged prediction dispersibility $D_p$ is defined as
$$D_p \triangleq \frac{1}{N} \sum_{j=1}^K \sum_{i=1}^N \mathbb{I}\{y_i = j\} \cdot H\big(\bm{P}^{(j)}\big), \quad (2)$$
where $H(\cdot)$ denotes the entropy [43].
In general, $D_p$ measures how dispersible the predictions of different images having the same label are. The larger the $D_p$, the harder it is for the adversaries to deterministically manipulate the predictions.
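For illustration, the sketch below computes the averaged prediction dispersibility $D_p$ of Definition 1 from hard model predictions. It assumes NumPy label arrays and natural-log entropy; it is our own reading of Eqs. (1)-(2), not the authors' evaluation code.

```python
import numpy as np

def prediction_dispersibility(pred_labels, true_labels, num_classes):
    """Averaged prediction dispersibility D_p (Definition 1).

    pred_labels, true_labels: int arrays of shape (N,).
    For each ground-truth class j, build the empirical distribution P^(j)
    of predicted labels, take its entropy H(P^(j)), and weight it by the
    number of samples in class j, following Eqs. (1) and (2).
    """
    n = len(true_labels)
    d_p = 0.0
    for j in range(num_classes):
        mask = (true_labels == j)
        if mask.sum() == 0:
            continue
        p_j = np.bincount(pred_labels[mask], minlength=num_classes) / mask.sum()
        entropy = -np.sum(p_j[p_j > 0] * np.log(p_j[p_j > 0]))
        d_p += mask.sum() * entropy  # each sample of class j contributes H(P^(j))
    return d_p / n
```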
3.3 Untargeted Backdoor Watermark with Poisoned Labels (UBW-P)
Arguably, the most straightforward strategy to fulfill prediction dispersibility is to push the predictions of poisoned images towards the uniform probability vector. Specifically, we propose to randomly 'shuffle' the labels of poisoned training samples when making the poisoned dataset. This attack is dubbed untargeted backdoor watermark with poisoned labels (UBW-P) in this paper.
Specifically, similar to existing targeted backdoor watermarks, our UBW-P first randomly selects a subset $\mathcal{D}_s$ from the benign dataset $\mathcal{D}$ to make its modified version $\mathcal{D}_m = \{(\bm{x}', y') \mid \bm{x}' = G(\bm{x}; \bm{\theta}),\ y' \sim [1, \cdots, K],\ (\bm{x}, y) \in \mathcal{D}_s\}$, where '$y' \sim [1, \cdots, K]$' denotes sampling $y'$ from the list $[1, \cdots, K]$ with equal probability and $G$ is an adversary-specified poisoned image generator. The modified subset $\mathcal{D}_m$, associated with the remaining benign samples $\mathcal{D} \backslash \mathcal{D}_s$, will then be released to train the model $f(\cdot; \bm{w})$ by
$$\min_{\bm{w}} \sum_{(\bm{x}, y) \in \mathcal{D}_m \cup (\mathcal{D} \backslash \mathcal{D}_s)} \mathcal{L}(f(\bm{x}; \bm{w}), y), \quad (3)$$
where $\mathcal{L}$ is the loss function (e.g., the cross-entropy [43]).
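A minimal sketch of the UBW-P dataset construction described above: a random subset $\mathcal{D}_s$ is transformed by a trigger generator $G$ and its labels are re-sampled uniformly from the $K$ classes (0-indexed here). The function and argument names (e.g., `generator`, `poison_rate`) are our own placeholders rather than the released code.

```python
import numpy as np

def make_ubw_p_dataset(images, labels, generator, num_classes,
                       poison_rate=0.1, seed=0):
    """Build the UBW-P poisoned dataset D_m together with D \ D_s.

    images: array of shape (N, ...); labels: int array of shape (N,);
    generator: callable implementing the poisoned image generator G.
    Selected samples receive the trigger and a label drawn uniformly at
    random from the K classes (the 'shuffled' poisoned labels of UBW-P).
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    num_poison = int(poison_rate * len(images))
    idx = rng.choice(len(images), size=num_poison, replace=False)
    for i in idx:
        images[i] = generator(images[i])       # x' = G(x; theta)
        labels[i] = rng.integers(num_classes)  # y' sampled uniformly from the K classes
    return images, labels, idx
```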
In the inference process, for any testing sample $(\hat{\bm{x}}, \hat{y}) \notin \mathcal{D}$, the adversary can activate the hidden backdoor contained in the attacked DNNs with the poisoned image $G(\hat{\bm{x}})$, based on the generator $G$.
3.4 Untargeted Backdoor Watermark with Clean Labels (UBW-C)
As we will demonstrate in Section 5, the aforementioned heuristic UBW-P can reach promising results. However, it is not stealthy enough even though the poisoning rate can be small, since UBW-P still relies on poisoned labels. Dataset users may identify the watermark by examining the image-label relation when they catch the poisoned samples. In this section, we discuss how to design the untargeted backdoor watermark with clean labels (UBW-C), based on bi-level optimization [44].
To formulate UBW-C as a bi-level optimization, we need to optimize the prediction dispersibility.
However, it is non-differentiable and therefore cannot be optimized directly. In this paper, we
introduce two differentiable surrogate dispersibilities to alleviate this problem, as follows:
Definition 2 (Averaged Sample-wise and Class-wise Dispersibility). Let $\mathcal{D}=\{(\bm{x}_i, y_i)\}_{i=1}^N$ denote the dataset, where $y_i \in \mathcal{Y} = \{1, \ldots, K\}$. The averaged sample-wise dispersibility of predictions given by the DNN $f(\cdot)$ (over the dataset $\mathcal{D}$) is defined as
$$D_s \triangleq \frac{1}{N} \sum_{i=1}^N H(f(\bm{x}_i)), \quad (4)$$
while the averaged class-wise dispersibility is defined as
$$D_c \triangleq \frac{1}{N} \sum_{j=1}^K \sum_{i=1}^N \mathbb{I}\{y_i = j\} \cdot H\left(\frac{\sum_{k=1}^N f(\bm{x}_k) \cdot \mathbb{I}\{y_k = j\}}{\sum_{k=1}^N \mathbb{I}\{y_k = j\}}\right). \quad (5)$$
In general, the averaged sample-wise dispersibility describes the average dispersion of the predicted probability vectors over all samples, while the averaged class-wise dispersibility depicts how dispersed the averaged prediction of the samples in each class is. Maximizing them has similar effects to directly optimizing the prediction dispersibility $D_p$.
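The following PyTorch sketch implements the two differentiable surrogate dispersibilities of Definition 2, assuming `probs` holds a batch of softmax outputs; the epsilon and tensor layout are our own choices. Evaluating it on random predictions also gives a quick numerical sanity check that the class-wise dispersibility exceeds the sample-wise one divided by N (cf. Lemma 1 below).

```python
import torch

def entropy(p, eps=1e-12):
    """Shannon entropy of probability vectors along the last dimension."""
    return -(p * (p + eps).log()).sum(dim=-1)

def sample_wise_dispersibility(probs):
    """D_s: average entropy of per-sample predicted distributions (Eq. 4)."""
    return entropy(probs).mean()

def class_wise_dispersibility(probs, labels, num_classes):
    """D_c: entropy of the averaged prediction per class, weighted by class size (Eq. 5)."""
    n = probs.shape[0]
    d_c = torch.zeros((), dtype=probs.dtype)
    for j in range(num_classes):
        mask = labels == j
        if mask.any():
            mean_pred = probs[mask].mean(dim=0)          # averaged prediction of class j
            d_c = d_c + mask.sum() * entropy(mean_pred)  # each sample of class j contributes H(mean)
    return d_c / n

# Sanity check (cf. Lemma 1): D_c > D_s / N on random softmax outputs.
probs = torch.softmax(torch.randn(128, 10), dim=-1)
labels = torch.randint(0, 10, (128,))
assert class_wise_dispersibility(probs, labels, 10) > sample_wise_dispersibility(probs) / 128
```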
In particular, the main difference of UBW-C compared with UBW-P and existing targeted backdoor watermarks lies in the generation of the modified subset $\mathcal{D}_m$. Specifically, in UBW-C, we do not modify the labels of any poisoned samples, i.e., $\mathcal{D}_m = \{(\bm{x}', y) \mid \bm{x}' = G(\bm{x}; \bm{\theta}), (\bm{x}, y) \in \mathcal{D}_s\}$. Before we reach the technical details of our UBW-C, we first present the necessary lemma and theorem.
Lemma 1. The averaged class-wise dispersibility is always greater than the averaged sample-wise dispersibility divided by $N$, i.e., $D_c > \frac{1}{N} \cdot D_s$.

Theorem 1. Let $f(\cdot; \bm{w})$ denote the DNN with parameter $\bm{w}$, let $G(\cdot; \bm{\theta})$ be the poisoned image generator with parameter $\bm{\theta}$, and let $\mathcal{D}=\{(\bm{x}_i, y_i)\}_{i=1}^N$ be a given dataset with $K$ different classes. We have
$$\max_{\bm{\theta}} \sum_{i=1}^N H\big(f(G(\bm{x}_i; \bm{\theta}); \bm{w})\big) < N \cdot \max_{\bm{\theta}} \sum_{j=1}^K \sum_{i=1}^N \mathbb{I}\{y_i = j\} \cdot H\left(\frac{\sum_{i=1}^N f(G(\bm{x}_i; \bm{\theta}); \bm{w}) \cdot \mathbb{I}\{y_i = j\}}{\sum_{i=1}^N \mathbb{I}\{y_i = j\}}\right).$$
Theorem 1 implies that we can optimize the averaged sample-wise dispersibility $D_s$ and the class-wise dispersibility $D_c$ simultaneously by only maximizing $D_s$. It motivates us to generate the modified subset $\mathcal{D}_m$ in our UBW-C (via optimizing the generator $G$) as follows:
$$\max_{\bm{\theta}} \sum_{(\bm{x}, y) \in \mathcal{D}_s} \left[\mathcal{L}\big(f(G(\bm{x}; \bm{\theta}); \bm{w}^*), y\big) + \lambda \cdot H\big(f(G(\bm{x}; \bm{\theta}); \bm{w}^*)\big)\right], \quad (6)$$
$$\text{s.t.}\quad \bm{w}^* = \arg\min_{\bm{w}} \sum_{(\bm{x}, y) \in \mathcal{D}_p} \mathcal{L}(f(\bm{x}; \bm{w}), y), \quad (7)$$
where $\lambda$ is a non-negative trade-off hyper-parameter.
In general, the aforementioned process is a standard bi-level optimization, which can be effectively and efficiently solved by alternately optimizing the lower-level and upper-level sub-problems [44].