As model stealing has raised considerable concerns about model ownership, an increasing number
of model IP protection methods have been proposed in the last few years. Generally, there are two
categories of methods to validate and protect the source model's IP, i.e., watermarking methods [5, 12–19]
and fingerprinting methods [7, 20–23]. Watermarking methods use weight regularization [12–14] or
backdoor insertion [5, 16, 17] during model training and leave a specific watermark in the model.
However, they need to be involved in the model's training procedure and sacrifice the model's
performance on the main task. A typical example is EWE [5], which suffers a 4% classification
accuracy drop on CIFAR10. In contrast, fingerprinting methods exploit the transferability of adversarial
examples and identify stolen models by the adversarial examples' attack success rates on the suspect
model. Since they do not interfere with the training procedure, model fingerprinting does not
influence the model's accuracy. Nevertheless, these adversarial-example-based fingerprinting methods
remain sensitive to adversarial training [24]. They also require a large amount of time for the model
owner to train surrogate models (models that the model owner trains by model extraction) and to
generate adversarial examples. In addition, because the label space changes under transfer learning,
the adversarial examples' target labels disappear, so these methods cannot identify model stealing
attacks that use transfer learning techniques [25].
As stated above, existing fingerprinting methods, leveraging the suspect model’s output as a point-
wise indicator to detect the stolen models, are sensitive to adversarial training or transfer learning. To
address this problem, we focus on the pair-wise relationship between the outputs and develop a new
method called SAC. Intuitively, samples with similar outputs in the source model are more likely to
also have similar outputs in the stolen models. In particular, we employ the correlation difference
between the source model and the suspect model as the indicator to detect the stolen model. However,
calculating correlation using all the samples from the defender’s dataset will be influenced by the
common knowledge shared by most models trained for the same task, on which most models will
output the same label. To avoid it, we leverage the normal samples which are wrongly predicted by
both the source and the surrogate models as the model input and propose to fingerprint using sample
correlation with wrongly predicted samples (SAC-w). Furthermore, to reduce the needs for a large
number of normal samples and save time consumption for the defender, we use CutMix Augmented
samples directly to calculate the correlation difference (SAC-m), which does not need to train the
surrogate models or generate adversarial examples. To verify the effectiveness of SAC-w and SAC-m,
we investigate 5 types of attacks (i.e., fine-tuning, pruning, transfer learning, model extraction, and
adversarial training), and compare the performance against these attacks across different model
architectures and datasets.
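To make the core idea concrete, the following is a minimal sketch of the correlation-difference indicator, assuming PyTorch models and cosine similarity as one possible choice of correlation function; the function names and the threshold are illustrative and not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def cosine_correlation(outputs: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine-similarity matrix over a batch of model outputs: (N, C) -> (N, N)."""
    normed = F.normalize(outputs, dim=1)
    return normed @ normed.t()

@torch.no_grad()
def correlation_distance(source_model, suspect_model, inputs: torch.Tensor) -> float:
    """Mean absolute difference between the two models' output-correlation matrices.
    A small distance means the suspect model relates the inputs to each other much
    like the source model does, which suggests it was stolen."""
    corr_source = cosine_correlation(source_model(inputs))
    corr_suspect = cosine_correlation(suspect_model(inputs))
    return (corr_source - corr_suspect).abs().mean().item()

# Usage sketch (threshold `d` is hypothetical, calibrated on independently trained models):
# if correlation_distance(source_model, suspect_model, fingerprint_inputs) < d:
#     print("suspect model is likely stolen")
```

In SAC-w the `fingerprint_inputs` would be the wrongly predicted normal samples, while in SAC-m they would be CutMix augmented samples, so no surrogate training or adversarial-example generation is needed.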
Our main contributions are summarized as follows:
• We introduce sample correlation into model IP protection, propose to leverage the correlation
difference as a robust indicator to identify model stealing attacks, and provide a new insight into
model IP protection.
• We introduce wrongly predicted normal samples and CutMix augmented samples to replace
adversarial examples as the model inputs, providing two robust correlation-based model fingerprinting
methods.
• Extensive results verify that SAC identifies different model stealing attacks across different
architectures and datasets with AUC = 1 in most cases, performing better than previous methods.
Besides, SAC-m takes only 4.45 seconds on the CIFAR10 dataset, greatly lowering the model
owner's computation burden.
2 Related Work
Model stealing attacks greatly threaten the rights of the model owner. In general, we can summarize
them in several categories as follows: (1) Fine-tuning [26]: The attacker updates the parameters of the
source model using the labeled training data for several epochs. (2) Pruning [8, 9, 27]: The attacker
prunes the less significant weights of the source model based on indicators such as activation. (3)
Transfer learning [25]: The attacker transfers the source model to a similar task and makes use of the
source model's knowledge. (4) Model extraction [10, 11]: Because data labeling is costly and
time-consuming, and there is a large amount of unlabeled data on the Internet, the attacker can steal
the functionality of the source model using only unlabeled same-distribution data. Different from the
above attacks, which need access to the inner parameters of the model, the model extraction attack