Distilling the Undistillable: Learning from a
Nasty Teacher
Surgan Jandial1, Yash Khasbage2, Arghya Pal3, Vineeth N Balasubramanian2,
and Balaji Krishnamurthy1
1Adobe MDSR Labs
2Indian Institute of Technology, Hyderabad
3Dept. of Psychiatry and Radiology, Harvard
Abstract. The inadvertent stealing of private/sensitive information using Knowledge Distillation (KD) has recently received significant attention, and its critical nature has guided subsequent defense efforts. The recent work Nasty Teacher proposed to develop teachers that cannot be distilled or imitated by models attacking them. However, the promise of confidentiality offered by a nasty teacher is not well studied, and as a further step toward closing such loopholes, we attempt to bypass its defense and successfully steal (or extract) information in its presence. Specifically, we analyze Nasty Teacher from two different directions and carefully leverage these insights to develop simple yet efficient methodologies, named HTC and SCM, which increase the learning from a Nasty Teacher by up to 68.63% on standard datasets. Additionally, we explore an improved defense method based on our insights into stealing. Our detailed set of experiments and ablations on diverse models/settings demonstrates the efficacy of our approach.
Keywords: Knowledge Distillation, Model Stealing, Privacy.
1 Introduction
Knowledge Distillation utilizes the outputs of a pre-trained model (i.e., the teacher) to train a generally smaller model (i.e., the student). Typically, KD methods are used to compress models that are wide, deep, require significant computational resources, and pose challenges to model deployment. Over the years, KD methods have seen success in various settings beyond model compression, including few-shot learning [29], continual learning [6], and adversarial robustness [11], to name a few – highlighting their importance in training DNN models. Recently, however, there has been growing concern about the misuse of KD methods as a means to steal the implicit knowledge of a teacher model that could be proprietary and confidential to an organization. KD methods provide an inadvertent pathway for the leakage of intellectual property, which could pose a threat to science and society. Surprisingly, the importance of defending against such KD-based stealing was only recently explored in [22,19], making this a timely and important topic.
In particular, [22] recently proposed a defense mechanism to protect against such KD-based stealing of intellectual property, using a training strategy called the
‘Nasty Teacher’. This strategy attempts to transform the original teacher into a model that is ‘undistillable’, i.e., any student model that attempts to learn from such a teacher suffers significantly degraded performance. The method maximally disturbs the incorrect class logits (a significant source of model knowledge), producing confusing outputs devoid of clear, meaningful information, and it showed promising results in defending against such KD-based stealing from DNN models. However, any security-related technology requires simultaneous advances in both attacks and defenses for the field to progress steadily and eventually yield robust models. In this work, we seek to test the extent of the defense provided by the ‘Nasty Teacher’ [22], and show that it is possible to recover model knowledge despite this defense, using only the logit outputs of such a teacher. Subsequently, we leverage the garnered insights and propose a simple yet effective defense strategy, which significantly improves defense against KD-based stealing.
To this end, we ask two key questions: (i) can we transform the outputs of the Nasty Teacher to reduce the extent of confusion, and thus steal despite its defense? and (ii) can we transform the outputs of the Nasty Teacher to recover the hidden essential relationships between the class logits? To answer these two questions, we propose two approaches – High-Temperature Composition (HTC), which systematically reduces confusion in the logits, and Sequence of Contrastive Model (SCM), which systematically recovers relationships between the logits. These approaches improve KD performance, thereby highlighting the continued vulnerability of DNN models to KD-based stealing. Because of their generic formulation and simplicity, we believe our proposed ideas could apply well to similar approaches that may be developed in the future along the same lines as the Nasty Teacher. To summarize, this work analyzes key attributes of output scores (which capture the strength and clarity of model knowledge) that could stimulate knowledge stealing, and leverages them to strengthen defenses against such attacks as well. Our key contributions are summarized as follows:
– We draw attention to the recently identified vulnerability of KD methods to model stealing, and analyze the first defense method in this direction, i.e., Nasty Teacher, from two perspectives: (i) reducing the extent of confusion in the class logit outputs; and (ii) extracting essential relationship information from the class logit outputs. We develop two simple yet effective strategies – High-Temperature Composition (HTC) and Sequence of Contrastive Model (SCM) – which can undo the defense of the Nasty Teacher, pointing to the need for better defenses in this domain.
– We leverage the obtained insights and propose an extension of Nasty Teacher, which outperforms the earlier defense under similar settings.
– We conduct exhaustive experiments and ablation studies on standard benchmark datasets and models to demonstrate the effectiveness of our approaches.
We hope that our efforts in this work will provide important insights and encourage further investigation into a critical problem with DNN models in contemporary times, where privacy and confidentiality are increasingly valued.
2 Related Work
We discuss prior work below, both from the perspective of Knowledge Distillation (KD) and of its use in model stealing.
Knowledge Distillation: KD methods transfer knowledge from a larger network (referred to as the teacher) to a smaller network (referred to as the student) by enforcing that the student match the teacher's output. With seminal works [4,14] laying the foundation, KD has gained wide popularity in recent years. The initial techniques for KD mainly focused on distilling knowledge from logits or probabilities. This idea was further extended to distilling features in [31,40,36,28], among many others. In all such methods, KD is used to improve the performance of the student model in various settings. More detailed surveys on KD can be found in [12,35,21]. Our focus in this work, however, is on recent works [22,37,19], which have discussed how KD can unintentionally expose the Intellectual Property (IP) and private content of the underlying DNN models and data to theft, thereby motivating a new, important direction in KD methods.
Model Stealing and KD: Model stealing involves extracting any information from a DNN model that is meant to be inaccessible to an adversary/end-user. Such stealing can happen in multiple ways: (1) Model Extraction as a Black Box: an adversary could query existing model-based software and, with just its outputs, clone the knowledge into a model of their own; (2) Using Data Inputs: an adversary may potentially access similar/same data as the victim, which can be used to extract knowledge/IP; or (3) Using Model Architecture/Parameters: an adversary may attempt to extract critical model information – such as the architecture type or the entire model file – through unintentional leaks, academic publications, or other means. There have been a few disparate efforts in the past to protect against model/IP stealing in different contexts, such as watermark-based methods [34,41], passport-based methods [8,42], dataset inference [25], and so on. These methods focused on verifying ownership, while other methods such as [17,15] focused on defending against certain model extraction attacks. However, the focus of these efforts is different from the one discussed herein. In this work, we specifically explore the recently highlighted problem of KD-based model stealing [22,19]. As noted in [22,19], most existing verification and defense methods do not address KD-based stealing, leaving models vulnerable to this rather critical problem. Our work analyzes the first defense for KD-based stealing [22], identifies its loopholes using simple strategies, and also leverages these insights to propose a newer defense to this problem. We believe our findings will accelerate further efforts in this important space. The work closest to ours is one that has been recently published – Skeptical Student [19] – which probes the confidentiality of [22] by appropriately designing the student (or hacker) architecture. Our approach in this work is different and focuses on the mechanisms of student training, without changing the architecture.4
4 Code available at https://github.com/surgan12/NastyAttacks.
3 Learning from a Nasty Teacher
3.1 Background
Knowledge Distillation (KD): KD methods train a smaller student network, θ_s, with the outputs of a typically large pre-trained teacher network, θ_t, alongside the ground-truth labels. Given an input image x, student logits z_s = θ_s(x), and teacher logits z_t = θ_t(x), a temperature parameter τ is used to soften the logits and obtain transformed output probability vectors via the softmax function:

\[ y_s = \mathrm{softmax}(z_s / \tau), \qquad y_t = \mathrm{softmax}(z_t / \tau) \tag{1} \]

where y_s and y_t are the new output probability vectors of the student and teacher, respectively. The final loss function used to train the student model is given by:

\[ \mathcal{L} = \alpha \cdot \lambda \cdot KL(y_s, y_t) + (1 - \alpha) \cdot \mathcal{L}_{CE} \tag{2} \]

where KL stands for the Kullback-Leibler divergence, \(\mathcal{L}_{CE}\) represents the standard cross-entropy loss, and λ, α are two hyperparameters that control the relative importance of the loss terms (generally λ = τ²).
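To make Eqs. (1)–(2) concrete, the following is a minimal PyTorch sketch of this student objective (PyTorch is assumed since the accompanying code base is PyTorch-based; the default values of tau and alpha are illustrative, not prescribed by the paper):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=0.9):
    """Student objective of Eq. (2); tau/alpha defaults are illustrative."""
    # Eq. (1): temperature-softened probability vectors.
    log_y_s = F.log_softmax(student_logits / tau, dim=1)
    y_t = F.softmax(teacher_logits / tau, dim=1)
    # KL term, scaled by lambda = tau^2 as noted above.
    kl = F.kl_div(log_y_s, y_t, reduction="batchmean") * (tau ** 2)
    # Standard cross-entropy with the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce
```

In practice, the teacher logits would be computed under torch.no_grad(), so that gradients flow only through the student.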
KD-based Stealing: Given a stealer (or student) model, denoted by θ_s, and a victim (or teacher) θ_t, the stealer is said to succeed in stealing knowledge using KD if, by using the input-output information of the victim, it can grasp some additional knowledge that is not accessible in the victim's absence. As stated in [22], this phenomenon can be measured as the difference between the maximum accuracy of the stealer with and without stealing from the victim. Formally, stealing is said to happen if:

\[ \mathrm{Acc}_{w}(KD(\theta_s, \theta_t)) > \mathrm{Acc}_{wo}(\theta_s) \tag{3} \]

where the left-hand side is the stealer's accuracy with stealing and the right-hand side is its accuracy without stealing.
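Operationally, Eq. (3) is simply an accuracy comparison on the same held-out test set; a trivial check (with both accuracies assumed to be computed elsewhere) looks like:

```python
def stealing_succeeded(acc_with_kd: float, acc_without_kd: float) -> bool:
    # Eq. (3): stealing occurs iff KD improves the stealer's best accuracy.
    return acc_with_kd > acc_without_kd
```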
Defense against KD-based Stealing: Following [22], we consider a method M a defense if it degrades the student's ability (or accuracy) in stealing. Formally, denoting the accuracy of the stealer without the defense M as Acc_w(KD(θ_s, θ_t)) and with the defense as Acc_wm(KD(θ_s, M(θ_t))), M is said to be a defense if:

\[ \mathrm{Acc}_{wm}(KD(\theta_s, M(\theta_t))) < \mathrm{Acc}_{w}(KD(\theta_s, \theta_t)) \tag{4} \]
Nasty Teacher (NT) [22]: The Nasty Teacher methodology transforms the original model into a model whose accuracy is as high as the original's (to ensure model usability) but whose output distribution (or logits) significantly camouflages the meaningful information.
Formally, given a teacher model θ_t, it outputs a nasty teacher model θ_n trained by minimizing the cross-entropy loss \(\mathcal{L}_{CE}\) with target labels y (to ensure high accuracy) while maximizing the KL-divergence \(\mathcal{L}_{KL}\) with the outputs of the original teacher (to maximally contrast with the original and create a confusing distribution). This can be written as:

\[ \mathcal{L}_n(x, y) = \mathcal{L}_{CE}(\theta_n(x), y) - \omega \cdot \tau_A^2 \cdot \mathcal{L}_{KL}(\theta_n(x), \theta_t(x)) \tag{5} \]
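As a sketch, Eq. (5) can be written in PyTorch as below. The negative sign implements the KL maximization, and omega and tau_a (illustrative defaults, not the paper's tuned values) stand for the weight ω and the adversarial temperature τ_A; we assume, as in [22], that the KL term is computed on temperature-softened outputs:

```python
import torch
import torch.nn.functional as F

def nasty_teacher_loss(nasty_logits, teacher_logits, labels,
                       omega=0.01, tau_a=4.0):
    """Eq. (5): keep accuracy high while diverging from the original teacher."""
    # Cross-entropy with ground-truth labels preserves usability/accuracy.
    ce = F.cross_entropy(nasty_logits, labels)
    # KL-divergence to the original (frozen) teacher's softened outputs.
    kl = F.kl_div(F.log_softmax(nasty_logits / tau_a, dim=1),
                  F.softmax(teacher_logits / tau_a, dim=1),
                  reduction="batchmean")
    # Subtracting the scaled KL term maximizes it while the loss is minimized.
    return ce - omega * (tau_a ** 2) * kl
```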