long as the adversary runs the distillation procedure to convergence, DRW is able to detect the watermark signal in the extracted model.
The advantages of DRW include 1) training independence: it operates directly on trained models and can be plugged into the final output; 2) flexibility: it can be applied to both soft-label and hard-label outputs in the black-box setting; 3) effectiveness: we evaluate DRW and obtain perfect model extraction detection accuracy, while preserving fidelity with only a negligible side effect on the original classification quality; 4) scalability: the secret keys for the watermark are randomly generated on the fly, so we can issue different watermarks to different end-users and verify each of them.
The contributions of this paper are as follows:
• We enhance the concept of model protection against model extraction attacks with an emphasis on language applications.
• We propose DRW, a novel method to inject watermarks into the output of NLP models and later detect whether a suspect model is distilled from the victim.
• We provide a theoretical guarantee on the protected API accuracy: with protection, DRW does not substantially harm the original API's performance.
• Experiments on four diverse tasks (POS Tagging/NER/SST-2/MRPC) verify that DRW detects extracted models with 100% mean average precision, at the cost of only a small drop (<5%) in original prediction performance.
2 Related Work
Model Extraction Attacks
Model extraction attacks target the confidentiality of ML models and aim to imitate the function of a black-box victim model (Tramèr et al., 2016; Orekondy et al., 2019; Correia-Silva et al., 2018). First, adversaries collect or synthesize an initially unlabeled substitute dataset. Next, they exploit the ability to query the victim model's API for label predictions to annotate the substitute dataset. Then, they train a high-performance model on the pseudo-labeled dataset. Recently, several works (Krishna et al., 2020; Wallace et al., 2020; He et al., 2021) have studied model extraction attacks on NLP models, e.g., BERT (Devlin et al., 2019) or Google Translate.
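To make the extraction procedure concrete, the following is a minimal sketch of the query-then-train loop described above; `query_victim_api` and the `substitute` model are hypothetical placeholders, not components of any cited attack.

```python
# Illustrative sketch of a model extraction attack (not from the cited works).
# `query_victim_api(text)` is assumed to return a hard label (class index),
# and `substitute(text)` to return logits of shape (num_classes,).
import torch
import torch.nn.functional as F

def extract_model(unlabeled_texts, query_victim_api, substitute, optimizer, epochs=3):
    # 1) Annotate the substitute dataset with the victim API's predictions.
    pseudo_labels = [query_victim_api(x) for x in unlabeled_texts]

    # 2) Train the substitute model on the pseudo-labeled data.
    for _ in range(epochs):
        for text, label in zip(unlabeled_texts, pseudo_labels):
            logits = substitute(text).unsqueeze(0)          # (1, num_classes)
            loss = F.cross_entropy(logits, torch.tensor([label]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return substitute
```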
Knowledge Distillation
Model extraction attacks are closely related to knowledge distillation (KD) (Hinton et al., 2015), where the adversary acts as a student who approximates the behavior of the teacher (victim) model. The student can learn from soft labels or hard labels. KD with soft labels has been widely applied because soft labels carry rich information (Phuong and Lampert, 2019; Zhou et al., 2021).
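For reference, the standard soft-label distillation objective of Hinton et al. (2015) is a temperature-scaled KL divergence between the teacher's and student's output distributions; the snippet below is a minimal PyTorch sketch of that objective, not code from the cited works.

```python
import torch.nn.functional as F

def soft_label_kd_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL divergence for soft-label distillation.

    Both logit tensors are assumed to have shape (batch, num_classes).
    """
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T ** 2
```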
Watermarking
A digital watermark is an imperceptible marker embedded in a noise-tolerant signal, such as audio, video, or image data, designed to identify the owner of the signal's copyright. Some works (Uchida et al., 2017; Adi et al., 2018; Zhang et al., 2018; Merrer et al., 2019) employ watermarks to prevent the exact duplication of machine learning models. They insert watermarks into the parameters of the protected model or construct backdoor images that trigger particular predictions. If an adversary copies a protected model exactly, the watermark can be used to verify ownership. However, safeguarding models against model extraction attacks is more difficult, because the parameters of the suspect model may differ vastly from those of the victim model, and the backdoor behavior may not transfer to the suspect model either.
Several works (Juuti et al., 2019; Szyller et al., 2021; Jia et al., 2021; Charette et al., 2022; He et al., 2022) study how to identify extracted models that are distilled from the victim model. Jia et al. (2021) force the protected model to learn features that distinguish samples drawn from authentic and watermarked data. He et al. (2022) apply lexical modification as a watermarking method to protect language generation APIs. CosWM (Charette et al., 2022) embeds a watermark as a cosine signal in the output of the protected model. Since the cosine signal is difficult to eliminate, extracted models trained via distillation retain a significant watermark signal. Nonetheless, CosWM only applies to image data and soft-label distillation. We design multiple new techniques to extend CosWM to textual data with discrete sequences, and we provide a theoretical guarantee on the protected API accuracy for both soft and hard distillation.
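To illustrate the flavor of such output-level watermarking, the sketch below adds a small cosine perturbation to a victim model's output probabilities, with the phase keyed by a hash of the input. The hash-based phase, the fixed frequency, and the function names are simplifying assumptions for illustration; this is neither the exact CosWM construction nor our DRW method.

```python
import hashlib
import numpy as np

def watermark_probs(probs, input_text, target_class=0, amplitude=0.05, frequency=8.0):
    """Add a small cosine signal to a softmax output (illustrative sketch only).

    `probs` is the victim model's output distribution as a 1-D numpy array.
    The phase is a deterministic function of the query, so an owner who knows
    the key can later test a suspect model's outputs for the periodic signal.
    """
    # Map the input to a pseudo-random angle in [0, 2*pi).
    digest = hashlib.sha256(input_text.encode("utf-8")).hexdigest()
    angle = (int(digest, 16) % 10_000) / 10_000 * 2 * np.pi

    perturbed = probs.copy()
    perturbed[target_class] += amplitude * np.cos(frequency * angle)
    perturbed = np.clip(perturbed, 1e-8, None)
    return perturbed / perturbed.sum()   # renormalize to a valid distribution
```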
3 Proposed Method: DRW
3.1 Overview
Figure 1 presents an overview of the distillation procedure, watermarking, and detection. The main idea