Distilling the Undistillable: Learning from a
Nasty Teacher
Surgan Jandial1, Yash Khasbage2, Arghya Pal3, Vineeth N Balasubramanian2,
and Balaji Krishnamurthy1
1Adobe MDSR Labs
2Indian Institute of Technology, Hyderabad
3Dept. of Psychiatry and Radiology, Harvard
Abstract. The inadvertent stealing of private/sensitive information using Knowledge Distillation (KD) has recently received significant attention, and its critical nature has guided subsequent defense efforts. The recent work Nasty Teacher proposed to develop teachers that cannot be distilled or imitated by models attacking them. However, the promise of confidentiality offered by a nasty teacher is not well studied, and as a further step toward closing such loopholes, we attempt to bypass its defense and successfully steal (or extract) information in its presence. Specifically, we analyze Nasty Teacher from two different directions and carefully leverage these insights to develop simple yet efficient methodologies, named HTC and SCM, which increase the learning from a Nasty Teacher by up to 68.63% on standard datasets. Additionally, we explore an improved defense method based on our insights into stealing. Our detailed set of experiments and ablations on diverse models/settings demonstrates the efficacy of our approach.
Keywords: Knowledge Distillation, Model Stealing, Privacy.
1 Introduction
Knowledge Distillation utilizes the outputs of a pre-trained model (i.e., the teacher) to train a generally smaller model (i.e., the student). Typically, KD methods are used to compress models that are wide, deep, require significant computational resources, and pose challenges to model deployment. Over the years, KD methods have seen success in various settings beyond model compression, including few-shot learning [29], continual learning [6], and adversarial robustness [11], to name a few – highlighting their importance in training DNN models. Recently, however, there has been growing concern about the misuse of KD methods as a means to steal the implicit knowledge of a teacher model that could be proprietary and confidential to an organization. KD methods provide an inadvertent pathway for the leakage of intellectual property, which could pose a threat to science and society. Surprisingly, the importance of defending against such KD-based stealing was only recently explored in [22,19], making this a timely and important topic.
In particular, [22] recently proposed a defense mechanism to protect against such KD-based stealing of intellectual property, using a training strategy called the
‘Nasty Teacher’. This strategy attempts to transform the original teacher into a model that is ‘undistillable’, i.e., any student model that attempts to learn from such a teacher suffers significantly degraded performance. The method maximally disturbs the incorrect class logits (a significant source of model knowledge), producing confusing outputs devoid of clear, meaningful information, and it showed promising results in defending against such KD-based stealing from DNN models. However, any security-related technology requires simultaneous advances in both attacks and defenses for the field to progress steadily and eventually yield robust models. In this work, we seek to test the extent of the defense provided by the ‘Nasty Teacher’ [22], and show that it is possible to recover model knowledge despite this defense, using only the logit outputs of such a teacher. Subsequently, we leverage the garnered insights and propose a simple yet effective defense strategy, which significantly improves defense against KD-based stealing.
To this end, we ask two key questions: (i) can we transform the outputs of the Nasty Teacher to reduce the extent of confusion, and thus steal despite its defense? and (ii) can we transform the outputs of the Nasty Teacher to recover the hidden essential relationships between the class logits? To answer these two questions, we propose two approaches – High-Temperature Composition (HTC), which systematically reduces confusion in the logits, and Sequence of Contrastive Model (SCM), which systematically recovers relationships between the logits. These approaches improve KD performance, thereby highlighting the continued vulnerability of DNN models to KD-based stealing. Because of their generic formulation and simplicity, we believe our proposed ideas could apply well to similar approaches that may be developed in the future along the same lines as the Nasty Teacher. To summarize, this work analyzes key attributes of output scores (which capture the strength and clarity of model knowledge) that could stimulate knowledge stealing, and leverages them to strengthen defenses against such attacks as well. Our key contributions are summarized as follows:
– We draw attention to the recently identified vulnerability of KD methods to model stealing, and analyze the first defense method in this direction, i.e., Nasty Teacher, from two perspectives: (i) reducing the extent of confusion in the class logit outputs; and (ii) extracting essential relationship information from the class logit outputs. We develop two simple yet effective strategies – High-Temperature Composition (HTC) and Sequence of Contrastive Model (SCM) – which can undo the defense of the Nasty Teacher, pointing to the need for better defenses in this domain.
– We leverage the obtained insights and propose an extension of Nasty Teacher, which outperforms the earlier defense under similar settings.
– We conduct exhaustive experiments and ablation studies on standard benchmark datasets and models to demonstrate the effectiveness of our approaches.
We hope that our efforts in this work will provide important insights and encourage further investigation into a critical problem with DNN models in contemporary times, where privacy and confidentiality are increasingly valued.
2 Related Work
We discuss prior work below, both from the perspective of Knowledge Distillation (KD) and of its use in model stealing.
Knowledge Distillation: KD methods transfer knowledge from a larger network (referred to as the teacher) to a smaller network (referred to as the student) by enforcing that the student match the teacher's output. With seminal works [4,14] laying the foundation, KD has gained wide popularity in recent years. The initial techniques for KD mainly focused on distilling knowledge from logits or probabilities. This idea was further extended to distilling features in [31,40,36,28], among many others. In all such methods, KD is used to improve the performance of the student model in various settings. More detailed surveys on KD can be found in [12,35,21]. Our focus in this work, however, is on recent works [22,37,19], which have discussed how KD can unintentionally expose the Intellectual Property (IP) and private content of the underlying DNN models and data to theft, thereby motivating a new, important direction in KD methods.
Model Stealing and KD: Model stealing involves extracting any information from a DNN model that is meant to be inaccessible to an adversary/end-user. Such stealing can happen in multiple ways: (1) Model Extraction as a Black Box: an adversary could query existing model-based software and, with just its outputs, clone the knowledge into a model of their own; (2) Using Data Inputs: an adversary may potentially access similar/same data as the victim, which can be used to extract knowledge/IP; or (3) Using Model Architecture/Parameters: an adversary may attempt to extract critical model information – such as the architecture type or the entire model file – through unintentional leaks, academic publications, or other means. There have been a few disparate efforts in the past to protect against model/IP stealing in different contexts, such as watermark-based methods [34,41], passport-based methods [8,42], dataset inference [25], and so on. These methods focused on verifying ownership, while other methods such as [17,15] focused on defending against certain model extraction attacks. However, the focus of these efforts is different from the one discussed herein. In this work, we specifically explore the recently highlighted problem of KD-based model stealing [22,19]. As noted in [22,19], most existing verification and defense methods do not address KD-based stealing, leaving models vulnerable to this rather critical problem. Our work analyzes the first defense for KD-based stealing [22], identifies its loopholes using simple strategies, and also leverages these insights to propose a newer defense to this problem. We believe our findings will accelerate further efforts in this important space. The work closest to ours is one that has been recently published – Skeptical Student [19] – which probes the confidentiality of [22] by appropriately designing the student (or hacker) architecture. Our approach in this work is different and focuses on the mechanisms of student training, without changing the architecture.4
4 Code available at https://github.com/surgan12/NastyAttacks.
3 Learning from a Nasty Teacher
3.1 Background
Knowledge Distillation (KD): KD methods train a smaller student network, θ_s, with the outputs of a typically large pre-trained teacher network, θ_t, alongside the ground-truth labels. Given an input image x, student logits z_s = θ_s(x), and teacher logits z_t = θ_t(x), a temperature parameter τ is used to soften the logits and obtain transformed output probability vectors via the softmax function:

\[ y_s = \mathrm{softmax}(z_s / \tau), \qquad y_t = \mathrm{softmax}(z_t / \tau) \tag{1} \]

where y_s and y_t are the new output probability vectors of the student and teacher, respectively. The final loss function used to train the student model is given by:

\[ \mathcal{L} = \alpha \cdot \lambda \cdot KL(y_s, y_t) + (1 - \alpha) \cdot \mathcal{L}_{CE} \tag{2} \]

where KL stands for the Kullback-Leibler divergence, \(\mathcal{L}_{CE}\) represents the standard cross-entropy loss, and λ, α are two hyperparameters that control the relative importance of the loss terms (generally λ = τ²).
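To make Eqs. (1)–(2) concrete, the following is a minimal PyTorch sketch of this student objective (PyTorch is assumed since the accompanying code base is PyTorch-based; the default values of tau and alpha are illustrative, not prescribed by the paper):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=0.9):
    """Student objective of Eq. (2); tau/alpha defaults are illustrative."""
    # Eq. (1): temperature-softened probability vectors.
    log_y_s = F.log_softmax(student_logits / tau, dim=1)
    y_t = F.softmax(teacher_logits / tau, dim=1)
    # KL term, scaled by lambda = tau^2 as noted above.
    kl = F.kl_div(log_y_s, y_t, reduction="batchmean") * (tau ** 2)
    # Standard cross-entropy with the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce
```

In practice, the teacher logits would be computed under torch.no_grad(), so that gradients flow only through the student.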
KD-based Stealing: Given a stealer (or student) model, denoted by θ_s, and a victim (or teacher) θ_t, the stealer is said to succeed in stealing knowledge using KD if, by using the input-output information of the victim, it can grasp some additional knowledge that is not accessible in the victim's absence. As stated in [22], this phenomenon can be measured as the difference between the maximum accuracy of the stealer with and without stealing from the victim. Formally, stealing is said to happen if:

\[ \mathrm{Acc}_{w}(KD(\theta_s, \theta_t)) > \mathrm{Acc}_{wo}(\theta_s) \tag{3} \]

where the left-hand side is the stealer's accuracy with stealing and the right-hand side is its accuracy without stealing.
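Operationally, Eq. (3) is simply an accuracy comparison on the same held-out test set; a trivial check (with both accuracies assumed to be computed elsewhere) looks like:

```python
def stealing_succeeded(acc_with_kd: float, acc_without_kd: float) -> bool:
    # Eq. (3): stealing occurs iff KD improves the stealer's best accuracy.
    return acc_with_kd > acc_without_kd
```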
Defense against KD-based Stealing: Following [22], we consider a method M a defense if it degrades the student's ability (or accuracy) in stealing. Formally, denoting the accuracy of the stealer without the defense M as Acc_w(KD(θ_s, θ_t)) and with the defense as Acc_wm(KD(θ_s, M(θ_t))), M is said to be a defense if:

\[ \mathrm{Acc}_{wm}(KD(\theta_s, M(\theta_t))) < \mathrm{Acc}_{w}(KD(\theta_s, \theta_t)) \tag{4} \]
Nasty Teacher (NT) [22]: The Nasty Teacher methodology transforms the original model into a model whose accuracy is as high as the original's (to ensure model usability) but whose output distribution (or logits) significantly camouflages the meaningful information.
Formally, given a teacher model θ_t, it outputs a nasty teacher model θ_n trained by minimizing the cross-entropy loss \(\mathcal{L}_{CE}\) with target labels y (to ensure high accuracy) while maximizing the KL-divergence \(\mathcal{L}_{KL}\) with the outputs of the original teacher (to maximally contrast with the original and create a confusing distribution). This can be written as:

\[ \mathcal{L}_n(x, y) = \mathcal{L}_{CE}(\theta_n(x), y) - \omega \cdot \tau_A^2 \cdot \mathcal{L}_{KL}(\theta_n(x), \theta_t(x)) \tag{5} \]
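As a sketch, Eq. (5) can be written in PyTorch as below. The negative sign implements the KL maximization, and omega and tau_a (illustrative defaults, not the paper's tuned values) stand for the weight ω and the adversarial temperature τ_A; we assume, as in [22], that the KL term is computed on temperature-softened outputs:

```python
import torch
import torch.nn.functional as F

def nasty_teacher_loss(nasty_logits, teacher_logits, labels,
                       omega=0.01, tau_a=4.0):
    """Eq. (5): keep accuracy high while diverging from the original teacher."""
    # Cross-entropy with ground-truth labels preserves usability/accuracy.
    ce = F.cross_entropy(nasty_logits, labels)
    # KL-divergence to the original (frozen) teacher's softened outputs.
    kl = F.kl_div(F.log_softmax(nasty_logits / tau_a, dim=1),
                  F.softmax(teacher_logits / tau_a, dim=1),
                  reduction="batchmean")
    # Subtracting the scaled KL term maximizes it while the loss is minimized.
    return ce - omega * (tau_a ** 2) * kl
```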