long as the adversary runs the distillation procedure to convergence, DRW is able to detect the watermark signal in the extracted model.
The advantages of DRW include 1) training independence: it operates directly on trained models and can be plugged into the final output; 2) flexibility: it can be applied to both soft-label and hard-label outputs in the black-box setting; 3) effectiveness: we evaluate DRW and obtain perfect model extraction detection accuracy, while preserving fidelity with only a negligible side effect on the original classification quality; 4) scalability: the secret keys for the watermark are randomly generated on the fly, so we can issue different watermarks to different end-users and verify each of them.
The contributions of this paper are as follows:
• We enhance the concept of model protection against model extraction attacks with an emphasis on language applications.
• We propose DRW, a novel method to inject watermarks into the output of NLP models and later detect whether a suspect model is distilled from the victim.
• We provide a theoretical guarantee on the protected API accuracy: with protection, DRW does not substantially harm the original API's performance.
• Experiments on four diverse tasks (POS Tagging/NER/SST-2/MRPC) verify that DRW detects extracted models with 100% mean average precision, at the cost of only a small drop (<5%) in original prediction performance.
2 Related Work
Model Extraction Attacks
Model extraction attacks target the confidentiality of ML models and aim to imitate the function of a black-box victim model (Tramèr et al., 2016; Orekondy et al., 2019; Correia-Silva et al., 2018). First, adversaries collect or synthesize an initially unlabeled substitute dataset. Next, they exploit the ability to query the victim model's API for label predictions to annotate the substitute dataset. Then, they train a high-performance model on the pseudo-labeled dataset. Recently, several works (Krishna et al., 2020; Wallace et al., 2020; He et al., 2021) have studied model extraction attacks on NLP models, e.g., BERT (Devlin et al., 2019) or Google Translate.
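To make the extraction procedure concrete, the following is a minimal sketch of the query-then-train loop described above; `query_victim_api` and the `substitute` model are hypothetical placeholders, not components of any cited attack.

```python
# Illustrative sketch of a model extraction attack (not from the cited works).
# `query_victim_api(text)` is assumed to return a hard label (class index),
# and `substitute(text)` to return logits of shape (num_classes,).
import torch
import torch.nn.functional as F

def extract_model(unlabeled_texts, query_victim_api, substitute, optimizer, epochs=3):
    # 1) Annotate the substitute dataset with the victim API's predictions.
    pseudo_labels = [query_victim_api(x) for x in unlabeled_texts]

    # 2) Train the substitute model on the pseudo-labeled data.
    for _ in range(epochs):
        for text, label in zip(unlabeled_texts, pseudo_labels):
            logits = substitute(text).unsqueeze(0)          # (1, num_classes)
            loss = F.cross_entropy(logits, torch.tensor([label]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return substitute
```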
Knowledge Distillation
Model extraction attacks are closely related to knowledge distillation (KD) (Hinton et al., 2015), where the adversary acts as a student who approximates the behavior of the teacher (victim) model. The student can learn from soft labels or hard labels. KD with soft labels has been widely applied because soft labels carry rich information (Phuong and Lampert, 2019; Zhou et al., 2021).
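For reference, the standard soft-label distillation objective of Hinton et al. (2015) is a temperature-scaled KL divergence between the teacher's and student's output distributions; the snippet below is a minimal PyTorch sketch of that objective, not code from the cited works.

```python
import torch.nn.functional as F

def soft_label_kd_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL divergence for soft-label distillation.

    Both logit tensors are assumed to have shape (batch, num_classes).
    """
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T ** 2
```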
Watermarking
A digital watermark is an imperceptible marker embedded in a noise-tolerant signal, such as audio, video, or image data, designed to identify the owner of the signal's copyright. Some works (Uchida et al., 2017; Adi et al., 2018; Zhang et al., 2018; Merrer et al., 2019) employ watermarks to prevent the exact duplication of machine learning models. They insert watermarks into the parameters of the protected model or construct backdoor images that trigger particular predictions. If an adversary copies a protected model exactly, the watermark can be used to verify ownership. However, safeguarding models against model extraction attacks is more difficult, because the parameters of the suspect model may differ vastly from those of the victim model, and the backdoor behavior may not transfer to the suspect model either.
Several works (Juuti et al., 2019; Szyller et al., 2021; Jia et al., 2021; Charette et al., 2022; He et al., 2022) study how to identify extracted models that are distilled from the victim model. Jia et al. (2021) force the protected model to learn features that distinguish samples drawn from authentic and watermarked data. He et al. (2022) apply lexical modification as a watermarking method to protect language generation APIs. CosWM (Charette et al., 2022) embeds a watermark as a cosine signal in the output of the protected model. Since the cosine signal is difficult to eliminate, extracted models trained via distillation retain a significant watermark signal. Nonetheless, CosWM only applies to image data and soft-label distillation. We design multiple new techniques to extend CosWM to textual data with discrete sequences, and we provide a theoretical guarantee on the protected API accuracy for both soft and hard distillation.
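To illustrate the flavor of such output-level watermarking, the sketch below adds a small cosine perturbation to a victim model's output probabilities, with the phase keyed by a hash of the input. The hash-based phase, the fixed frequency, and the function names are simplifying assumptions for illustration; this is neither the exact CosWM construction nor our DRW method.

```python
import hashlib
import numpy as np

def watermark_probs(probs, input_text, target_class=0, amplitude=0.05, frequency=8.0):
    """Add a small cosine signal to a softmax output (illustrative sketch only).

    `probs` is the victim model's output distribution as a 1-D numpy array.
    The phase is a deterministic function of the query, so an owner who knows
    the key can later test a suspect model's outputs for the periodic signal.
    """
    # Map the input to a pseudo-random angle in [0, 2*pi).
    digest = hashlib.sha256(input_text.encode("utf-8")).hexdigest()
    angle = (int(digest, 16) % 10_000) / 10_000 * 2 * np.pi

    perturbed = probs.copy()
    perturbed[target_class] += amplitude * np.cos(frequency * angle)
    perturbed = np.clip(perturbed, 1e-8, None)
    return perturbed / perturbed.sum()   # renormalize to a valid distribution
```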
3 Proposed Method: DRW
3.1 Overview
Figure 1 presents an overview of the distillation procedure, watermarking, and detection. The main idea