Distillation-Resistant Watermarking for Model Protection in NLP
Xuandong Zhao Lei Li Yu-Xiang Wang
University of California, Santa Barbara
{xuandongzhao,leili,yuxiangw}@cs.ucsb.edu
Abstract
How can we protect the intellectual property of trained NLP models? Modern NLP models are prone to stealing by querying and distilling from their publicly exposed APIs. However, existing protection methods such as watermarking only work for images and are not applicable to text. We propose Distillation-Resistant Watermarking (DRW), a novel technique to protect NLP models from being stolen via distillation. DRW protects a model by injecting watermarks into the victim's prediction probabilities corresponding to a secret key, and is able to detect that key by probing a suspect model. We prove that a protected model still retains the original accuracy within a certain bound. We evaluate DRW on a diverse set of NLP tasks including text classification, part-of-speech tagging, and named entity recognition. Experiments show that DRW protects the original model and detects stealing suspects at 100% mean average precision on all four tasks, while the prior method fails on two.¹

¹Our code is available at https://github.com/XuandongZhao/DRW
1 Introduction
Large-scale pre-trained neural models have shown great success in NLP tasks (Devlin et al., 2019; Liu et al., 2019). Task-specific NLP models are often deployed as web services with pay-per-query APIs in business applications. Protecting the intellectual property of these cloud-deployed models is a critical issue in both research and practice. Service providers often use authentication mechanisms to authorize valid access. However, while this prevents clients from directly copying a victim model, it does not hinder them from stealing it via distillation. Emerging model extraction attacks have demonstrated convincingly that most functions of a victim API are likely to be stolen with carefully designed queries (Tramèr et al., 2016; Wallace et al., 2020; Krishna et al., 2020; He et al., 2021). A model extraction process is often imperceptible because it queries the API in the same way a normal user does (Orekondy et al., 2019). In this paper, we study the problem of protecting NLP models against distillation-based stealing.
Little has been done to adapt watermarking to identify model infringement in language tasks. Although a number of defense techniques have been proposed to prevent model extraction in computer vision, they are not applicable to language tasks with discrete tokens. Among them, deep neural network (DNN) watermarking (Szyller et al., 2021; Jia et al., 2021) works by embedding a secret watermark (e.g., a logo or signature) into the model, exploiting the over-parameterization property of DNNs. This procedure leverages a trigger set to stamp invisible watermarks on commercial models before distributing them to customers. When suspicion of model theft arises, model owners can make an official ownership claim with the aid of the trigger set. However, these protections all focus on image and audio tasks, where the continuous data are easy to modify. In addition, most watermarking methods are invasive and fragile: they cannot avoid tampering with the training procedure in order to embed the watermark. Moreover, the watermarks are outliers of the task distribution, so the adversary's model may not carry the watermark through distillation.
To fill this gap, we make the first attempt to protect NLP models from distillation. We propose Distillation-Resistant Watermarking (DRW) to protect models and detect suspicious stealing. Inspired by CosWM for computer vision (Charette et al., 2022), we perturb the predictions to embed a secret sinusoidal signal into the output of the victim API. To handle discrete tokens, we design a technique that randomly projects tokens to a uniform region spanning multiple sinusoidal cycles. We design the watermarking to be effective for distillation with both soft labels and hard (sampled) labels. As long as the adversary trains the distillation procedure to convergence, DRW is able to detect the watermark signal in the extracted model.
The advantages of DRW include: 1) training independence: it works directly on trained models and can be plugged into the final output; 2) flexibility: it applies to both soft-label and hard-label output in the black-box setting; 3) effectiveness: we evaluate DRW and obtain perfect detection accuracy for extracted models, while verifying its fidelity through a negligible side effect on the original classification quality; 4) scalability: the secret keys for the watermark are randomly generated on the fly, so we can provide different watermarks to different end-users and verify each of them.
The contributions of this paper are as follows:
- We enhance the concept of model protection against model extraction attacks, with an emphasis on language applications.
- We propose DRW, a novel method to inject watermarks into the output of NLP models and later detect whether a suspect model was distilled from the victim.
- We provide a theoretical guarantee on the protected API's accuracy: with DRW, the protection does not harm much of the original API's performance.
- Experiments on four diverse tasks (POS tagging/NER/SST-2/MRPC) verify that DRW detects extracted models with 100% mean average precision, with only a small drop (<5%) in original prediction performance.
2 Related Work
Model Extraction Attacks
Model extraction attacks target the confidentiality of ML models and aim to imitate the function of a black-box victim model (Tramèr et al., 2016; Orekondy et al., 2019; Correia-Silva et al., 2018). First, adversaries collect or synthesize an initially unlabeled substitute dataset. Next, they query the victim model's API for label predictions to annotate the substitute dataset. They can then train a high-performance model on the pseudo-labeled dataset. Recently, several works (Krishna et al., 2020; Wallace et al., 2020; He et al., 2021) have addressed model extraction attacks on NLP models, e.g., BERT (Devlin et al., 2019) or Google Translate.
Knowledge Distillation
Model extraction attacks are closely related to knowledge distillation (KD) (Hinton et al., 2015), where the adversary acts as a student approximating the behavior of the teacher (victim) model. The student can learn from soft labels or hard labels. KD with soft labels has been widely applied because soft labels can carry a great deal of useful information (Phuong and Lampert, 2019; Zhou et al., 2021).
Watermarking
A digital watermark is a covert label embedded in a noise-tolerant signal, such as audio, video, or image data, designed to identify the owner of the signal's copyright. Some works (Uchida et al., 2017; Adi et al., 2018; Zhang et al., 2018; Merrer et al., 2019) employ watermarks to prevent exact duplication of machine learning models. They insert watermarks into the parameters of the protected model or construct backdoor images that activate particular predictions. If an adversary copies a protected model exactly, the watermark can be used to verify ownership. However, safeguarding models from model extraction attacks is more difficult, because the parameters of the suspect model might differ vastly from those of the victim model, and the backdoor behavior may not transfer to the suspect model either. Several works (Juuti et al., 2019; Szyller et al., 2021; Jia et al., 2021; Charette et al., 2022; He et al., 2022) study how to identify extracted models that are distilled from the victim model. Jia et al. (2021) force the protected model to acquire features that distinguish samples drawn from authentic versus watermarked data. He et al. (2022) use lexical modification as a watermarking method to protect language generation APIs. CosWM (Charette et al., 2022) incorporates a watermark as a cosine signal in the output of the protected model. Since the cosine signal is difficult to eliminate, extracted models trained via distillation continue to carry a significant watermark signal. Nonetheless, CosWM only applies to image data and soft-label distillation. We design multiple new techniques to extend CosWM to text data with discrete sequences, and we provide a theoretical guarantee on the protected API's accuracy for both soft and hard distillation.
3 Proposed Method: DRW
3.1 Overview
Figure 1 presents an overview of the distillation procedure, watermarking, and detection. The main idea of DRW is to introduce a perturbation to the output of a protected model. This designed perturbation is transferred onto any suspect model distilled from the victim model and remains identifiable by probing the suspect model.

Figure 1: Overview of model extraction attack and watermark detection. The upper panel illustrates that the API owner adds a sinusoidal perturbation to the predicted probability distribution before answering end-users. The extracted model will convey this periodic signal if the adversary distills the victim model. In the watermark detection phase, shown in the bottom panel, the owner queries the suspect model and applies a Fourier transform to the output with a key. The designed perturbation is detected when a peak appears in the frequency domain at $f_w$. The extracted watermark can thus serve as legal evidence for the ownership claim.
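To make the detection step concrete, the sketch below estimates the strength of a sinusoidal component at the watermark frequency from probing queries. It is a minimal illustration, not the paper's exact detection procedure: because the hash values are irregularly spaced, it uses a simple Lomb-Scargle-style periodogram rather than a plain FFT, and the function name, the frequency grid, and the peak-ratio criterion are assumptions of this example.

```python
import numpy as np

def watermark_signal_strength(hash_values, probe_probs, f_w, num_freqs=200):
    """Estimate the strength of a sinusoidal watermark at angular frequency f_w.

    hash_values: array of g(v_k, x, M) values in [0, 1), one per probing query.
    probe_probs: suspect model's probability for the watermarked class c
                 on the same queries.
    Returns the ratio of spectral power at f_w to the mean power elsewhere.
    """
    t = np.asarray(hash_values)
    y = np.asarray(probe_probs) - np.mean(probe_probs)  # remove the DC component

    # Project the centered probabilities onto cosine/sine bases over a
    # frequency grid (a simplified Lomb-Scargle periodogram).
    freqs = np.linspace(1.0, 4.0 * f_w, num_freqs)
    power = np.array([
        np.mean(y * np.cos(f * t)) ** 2 + np.mean(y * np.sin(f * t)) ** 2
        for f in freqs
    ])

    idx = np.argmin(np.abs(freqs - f_w))          # bin closest to the key frequency
    target = power[idx]
    background = np.mean(np.delete(power, idx))   # average power away from f_w
    return target / (background + 1e-12)

# A large ratio (a clear peak at f_w) suggests the suspect model carries the watermark.
```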
Problem Formulation
We consider a common real-world scenario in which the adversary only has black-box access to the victim model's API $V$. The victim API produces two types of output: soft (real-valued) labels, i.e., class probabilities, and hard labels. The adversary uses an auxiliary unlabeled dataset to query $V$. Once the adversary obtains the predictions from the victim model, it can train a separate model $S$ from scratch on the pseudo-labeled dataset. The adversary may either distill the victim model with hard labels by minimizing the cross-entropy loss
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{m} \hat{y}_i \log(\hat{q}_i), \qquad (1)$$
where $\hat{q}_i$ is the prediction from the stealer's model and $\hat{y}$ are the pseudo-labels from the victim model, or distill from soft labels by minimizing the Kullback–Leibler (KL) divergence loss
$$\mathcal{L}_{\mathrm{KL}} = \sum_{i=1}^{m} \hat{y}_i \log \frac{\hat{y}_i}{\hat{q}_i}. \qquad (2)$$
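For concreteness, a minimal PyTorch-style sketch of the two distillation objectives in Eqs. (1) and (2) is given below. It assumes the adversary has already collected the victim's predictions as pseudo-labels; the function and variable names are illustrative only and do not come from the paper's code.

```python
import torch
import torch.nn.functional as F

def hard_label_distill_loss(student_logits, victim_hard_labels):
    """Eq. (1): cross-entropy against the victim's hard (argmax) labels."""
    return F.cross_entropy(student_logits, victim_hard_labels)

def soft_label_distill_loss(student_logits, victim_probs, eps=1e-12):
    """Eq. (2): KL divergence from the victim's soft labels to the student."""
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # sum_i y_hat_i * (log y_hat_i - log q_hat_i), averaged over the batch
    kl = (victim_probs * (torch.log(victim_probs + eps) - student_log_probs)).sum(dim=-1)
    return kl.mean()
```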
3.2 Watermarking the Victim Models
DRW dynamically embeds a watermark in response to the queries made by an API end-user. We represent the secret key by a set of variables $K = (c, f_w, \mathbf{v}_k, \mathbf{v}_s, M)$, where $c \in \{1, \ldots, m\}$ is the target class in which to embed the watermark, $f_w \in \mathbb{R}$ is the angular frequency, $\mathbf{v}_k \in \mathbb{R}^n$ is the phase vector, $\mathbf{v}_s \in \mathbb{R}^n$ is the selection vector, and $M \in \mathbb{R}^{|\mathcal{D}| \times n}$ is the random token matrix. Here $|\mathcal{D}|$ denotes the vocabulary size, so every token ID $i$ corresponds to a vector $M_i \in \mathbb{R}^n$. Following Charette et al. (2022), we define a periodic signal function based on $K$ and the input $x$:
$$z_{c'}(x) = \begin{cases} \cos\big(f_w\, g(\mathbf{v}_k, x, M)\big), & c' = c \\ \cos\big(f_w\, g(\mathbf{v}_k, x, M) + \pi\big), & c' \neq c \end{cases} \qquad (3)$$
for $c' \in \{1, \ldots, m\}$, where $g(\cdot) \in [0, 1)$ is a hash function projecting a text representation to a scalar. Ideally, this scalar should be uniformly distributed and span multiple cycles of the sinusoid.
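The following sketch shows how such a key $K$ might be generated and how the periodic signal of Eq. (3) could be evaluated. It is a simplified illustration under our reading of the notation, not the authors' released implementation; the default frequency value is arbitrary, and the hash function $g$ is left abstract here (one possible construction is sketched after the next paragraph).

```python
import numpy as np

def generate_key(num_classes, vocab_size, n, f_w=30.0, seed=0):
    """Randomly generate a watermark key K = (c, f_w, v_k, v_s, M)."""
    rng = np.random.default_rng(seed)
    c = int(rng.integers(num_classes))        # target class to watermark
    v_k = rng.uniform(0.0, 1.0, size=n)       # phase vector, elements ~ U[0, 1)
    v_s = rng.uniform(0.0, 1.0, size=n)       # selection vector, elements ~ U[0, 1)
    M = rng.standard_normal((vocab_size, n))  # random token matrix, entries ~ N(0, 1)
    return dict(c=c, f_w=f_w, v_k=v_k, v_s=v_s, M=M)

def periodic_signal(key, g_value, num_classes):
    """Eq. (3): z_{c'}(x) for every class c', given g(v_k, x, M) = g_value in [0, 1)."""
    z = np.cos(key["f_w"] * g_value + np.pi) * np.ones(num_classes)  # c' != c
    z[key["c"]] = np.cos(key["f_w"] * g_value)                       # c' == c
    return z
```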
Constructing the hash function
We project every input $x$ into a fixed scalar range, to which we add the sinusoidal perturbation, using the hash function $g(\cdot)$. We randomly generate the phase vector $\mathbf{v}_k$, the selection vector $\mathbf{v}_s$, and the token matrix $M$. Each element of $\{\mathbf{v}_k, \mathbf{v}_s\}$ is sampled from a uniform distribution over $[0, 1)$, and each element of $M$ is sampled from a standard normal distribution, $M_{ij} \sim \mathcal{N}(0, 1)$. Let $M_i \in \mathbb{R}^n$ denote the $i$-th row of $M$; then $\mathbf{v}_k^\top M_i \sim \mathcal{N}(0, \frac{n}{3})$ and $\mathbf{v}_s^\top M_i \sim \mathcal{N}(0, \frac{n}{3})$ (we prove this in Appendix A.2). We then apply the probability integral transform to obtain hash values that are uniformly distributed over $[0, 1)$.
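As a rough sketch of one way this hash could be realized, the code below projects each token with the random matrix and phase vector, maps the resulting Gaussian values through the normal CDF (the probability integral transform), and combines them into a scalar in $[0, 1)$. The aggregation over the token sequence and the omission of the selection vector $\mathbf{v}_s$ are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def hash_input(token_ids, v_k, M):
    """Map a token-ID sequence to a scalar in [0, 1).

    Each token i is projected as v_k^T M_i, which is (marginally) N(0, n/3);
    the Gaussian CDF then turns each projection into a roughly uniform value
    (the probability integral transform described in the text). The per-token
    values are combined by summing and keeping the fractional part, which
    stays in [0, 1). How tokens are aggregated is an assumption of this sketch.
    """
    n = M.shape[1]
    proj = M[np.asarray(token_ids)] @ v_k                # one projection per token
    u = norm.cdf(proj, loc=0.0, scale=np.sqrt(n / 3.0))  # per-token values in (0, 1)
    return float(np.sum(u) % 1.0)                        # combine; result lies in [0, 1)
```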