Hard Gate Knowledge Distillation - Leverage Calibration for a Robust and Reliable Language Model
Dongkyu Lee1,3  Zhiliang Tian2†  Yingxiu Zhao1
Ka Chun Cheung3  Nevin L. Zhang1
1Department of Computer Science and Engineering, HKUST
2College of Computer, National University of Defense Technology
3NVIDIA AI Technology Center, NVIDIA
1{dleear, yzhaocx, lzhang}@cse.ust.hk
2tianzhilianghit@gmail.com  3chcheung@nvidia.com
This work was done while Dongkyu was an intern at NVIDIA.
†Corresponding author.
Abstract
In knowledge distillation, a student model is trained with supervision from both the knowledge of a teacher and observations drawn from a training data distribution. The knowledge of a teacher is regarded as holding the inter-class relations that provide meaningful supervision to a student; hence, much effort has been devoted to finding such knowledge to be distilled. In this paper, we explore a question that has received little attention: "when to distill such knowledge." We answer this question with the concept of model calibration: we view a teacher model not only as a source of knowledge but also as a gauge to detect miscalibration of a student. This simple yet novel view leads to a hard gate knowledge distillation scheme that switches between learning from a teacher model and learning from training data. We verify the gating mechanism in the context of natural language generation at both the token level and the sentence level. Empirical comparisons with strong baselines show that hard gate knowledge distillation not only improves model generalization, but also significantly lowers model calibration error.
1 Introduction
In recent years, the deep learning community has achieved marked performance gains across a variety of tasks (Brown et al., 2020; Devlin et al., 2018). In the meantime, some deep learning models have become excessively large, limiting their applicability in some scenarios. To cope with this issue, Hinton et al. (2015) proposed knowledge distillation (KD), in which the knowledge of a large network, called a teacher network, is transferred to a relatively small model, called a student model. The benefits of KD have been widely witnessed across multiple domains (Romero et al., 2015; Jiao et al., 2020). Recently, it has been observed that KD can be used both to reduce model size and to improve model generalization (Tang et al., 2021; Furlanello et al., 2018). Hinton et al. (2015) argue that a distribution defined by a teacher holds inter-class relations, commonly referred to as the dark knowledge, and that such a distribution brings meaningful supervision to a student. Therefore, a large body of research in KD has viewed a teacher as a source of knowledge and has focused on finding meaningful knowledge to be transferred (Romero et al., 2015; Bulò et al., 2016; Park et al., 2019; Yuan et al., 2020; Kim et al., 2021).
In this work, we focus on when to distill the knowledge of a teacher. This is a central question to ask, as a model can benefit from adaptive control of supervision between the ground truth and a teacher. When a model is trained to increase the predictive score of a prediction, a one-hot encoded supervision, without incorporating the teacher model, sends a direct signal for increasing the score (Müller et al., 2019). Conversely, when a model is trained to learn the knowledge of a teacher, the teacher's output without fusing in the ground truth sends a more direct signal for minimizing the knowledge gap between the student and the teacher. However, the question of "when" has not been answered; for this reason, previous works choose to learn from both supervisions at once.
We give an answer to the question from the perspective of model calibration. Model calibration refers to how well a predicted probability of a model reflects the true accuracy; a well-calibrated predictive score therefore represents the likelihood of correctness of a prediction (Guo et al., 2017). In this light, such a score can be viewed as a gauge to detect miscalibration of a student during training: when a student makes a prediction with a probability mass that is higher than the expected accuracy of the prediction (overconfidence), the student model is trained with supervision from the teacher only.
In the case of underconfidence, the student is trained with supervision from the ground truth only.
Switching supervision is supported by two widely accepted ideas: 1) the close link between miscalibration and overfitting, and 2) the regularization effect of KD. Guo et al. (2017) empirically find that a model overfits under negative log-likelihood (NLL) training, leading to miscalibration, and Mukhoti et al. (2020) further support this claim. Therefore, we utilize the regularization effect inherent in KD training (Yuan et al., 2020). Aside from the inter-class relations held in the knowledge, recent findings suggest that KD is a form of adaptive regularization (Tang et al., 2021; Yuan et al., 2020), in which a teacher encourages a student to distribute probability mass over the output space more evenly.
Taking all these factors into account, we present a simple yet novel KD method, called Hard gate Knowledge Distillation (HKD). Given a calibrated teacher model, the teacher gates supervision between knowledge and observation for each instance or time step, selecting which objective the student should be optimized toward. We introduce two levels of hard gates, the token level and the sentence level, both of which are instance-specific hard gates computed on the fly during forward propagation. Our work validates the proposed idea on a task in the natural language generation (NLG) domain, as there is an inseparable relation between the quality of an output and model calibration (Kumar and Sarawagi, 2019).
The contributions of the proposed method are as follows:

- To the best of our knowledge, this work is the first attempt to leverage knowledge and observations in KD with an instance-specific hard gate.
- Our work introduces a novel view of, and role for, a teacher model in the student-teacher framework, which improves the model generalization and model calibration of a student by a significant margin across multiple datasets.
2 Preliminaries & Related Work
2.1 Knowledge Distillation
The conventional logit-based KD (Hinton et al., 2015) aims to minimize the distance between the probability distribution mapped by a teacher and that of a student, while another objective is to maximize the likelihood of predicting the ground truth. The following is the loss for an instance $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ at time step $t$, where $i$ indicates the index of the sample:¹
$$\mathcal{L}_{kd} = -\sum_{v}^{|V|} \Big[\, (1-\alpha)\, y^i_{t,v} \log P_\theta(y^i_{t,v} \mid c^i_{<t}) \;+\; \alpha\, P_\phi(y^i_{t,v} \mid c^i_{<t}; \tau) \log P_\theta(y^i_{t,v} \mid c^i_{<t}; \tau) \,\Big] \tag{1}$$
$V$ and $\tau$ denote the vocabulary and a temperature, respectively. $\phi$ and $\theta$ denote the parameters of a teacher and those of a student. $\alpha$ denotes a balancing parameter, which in this work is termed a gate, and $c_{<t}$ is the context at time step $t$, hence made up of the input $x$ and the preceding tokens $y_{<t}$. The gate is set to a value between 0 and 1, which indicates a soft gate, and it is shared among instances and remains fixed throughout training (Park et al., 2019; Hinton et al., 2015; Yuan et al., 2020). Therefore, a student model is trained with a soft target $\tilde{y}^i_t$, the result of a linear interpolation between a ground truth and a distribution mapped by a teacher.
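To make Eq. (1) concrete, here is a minimal PyTorch sketch of the conventional soft-gate KD objective at a single time step. The tensor names, shapes, and default hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def soft_gate_kd_loss(student_logits, teacher_logits, target, alpha=0.5, tau=2.0):
    """Conventional logit-based KD (Eq. 1): a fixed soft gate `alpha`
    interpolates between the ground-truth NLL and the teacher's soft targets.

    student_logits, teacher_logits: (batch, |V|) unnormalized scores
    target: (batch,) ground-truth token indices
    """
    # Supervision from observations: NLL against the one-hot ground truth.
    nll = F.cross_entropy(student_logits, target)

    # Supervision from knowledge: cross-entropy against the teacher's
    # temperature-scaled distribution; the teacher receives no gradient.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / tau, dim=-1)
    kd = -(p_teacher * log_p_student).sum(dim=-1).mean()

    # The gate is shared among instances and stays fixed throughout training.
    return (1.0 - alpha) * nll + alpha * kd
```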
Numerous studies have attempted to find meaningful knowledge to be distilled. Starting with inter-class relations in logit space (Park et al., 2019; Hinton et al., 2015), the scope of knowledge expanded to the feature level (Romero et al., 2015), encouraging a student to maintain intermediate representations similar to those of a teacher. Recent studies find that even a model with a structure identical to that of the student can fill the role of a teacher; this is commonly referred to as self-knowledge distillation (Yuan et al., 2020; Kim et al., 2021; Liu et al., 2021). Yuan et al. (2020) and Tang et al. (2021) argue that this success comes from KD's close link to label smoothing (Szegedy et al., 2016), with KD having a regularization effect. In this regard, there have been attempts to explore the importance of the soft gate. PS-KD (Kim et al., 2021) linearly increases the value of the gate in the course of training. Similar to our work, Zhu and Wang (2021) propose a hard gate mechanism in KD; however, their work utilizes an iteration-specific hard gate, and the gates only apply to the distillation loss of KD.
2.2 Calibration
A model is said to be well-calibrated when its predictive confidence truly reflects the true accuracy (Guo et al., 2017):

$$P(\hat{Y} = Y \mid P(\hat{Y} \mid X) = p) = p, \quad \forall p \in [0, 1] \tag{2}$$
¹Loss equations are presented at the time-step level hereinafter, as a natural language generation task can be viewed as a sequence of classifications.
Therefore, when a model makes predictions with probability $p$, the accuracy of those predictions is expected to be $p$. The quantity is commonly approximated with the Expected Calibration Error (ECE) and the Maximum Calibration Error (MCE) (Naeini et al., 2015).
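As a rough illustration of how Eq. (2) is checked in practice, the following is a minimal sketch of ECE with equal-width confidence bins; the choice of 15 bins is a common convention in the calibration literature, assumed here rather than taken from this paper.

```python
import torch

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin predictions by confidence, then average |accuracy - confidence|
    over bins, weighted by the fraction of samples in each bin.

    confidences: (N,) max predicted probabilities
    correct:     (N,) 1.0 if the prediction was right, else 0.0
    """
    bin_edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            weight = in_bin.float().mean()   # fraction of samples in this bin
            acc = correct[in_bin].mean()     # empirical accuracy in the bin
            conf = confidences[in_bin].mean()  # mean confidence in the bin
            ece += weight * (acc - conf).abs()
    return ece
```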
There have been continuous efforts to lower the calibration error of a model, and one of the simplest yet most effective methods is temperature scaling (Guo et al., 2017). Temperature scaling is a parametric post-hoc calibration method in which a single parameter is learned: with the model parameters fixed, the single parameter is trained to lower the negative log-likelihood on a validation dataset. This simple calibration method has been widely appreciated for its ability to improve the reliability of a model (Müller et al., 2019).
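Below is a minimal sketch of the temperature scaling procedure just described: with the model frozen, a single scalar is fit to minimize NLL on held-out logits. The use of L-BFGS and the log-parameterization are common practice and are assumptions here, not specifics of this paper.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_targets, max_iter=50):
    """Learn a single temperature minimizing NLL on validation data,
    keeping the model itself fixed (Guo et al., 2017).

    val_logits: (N, |V|) detached logits from the frozen model
    val_targets: (N,) ground-truth class indices
    """
    # Optimize the log-temperature so the temperature stays positive.
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_targets)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()  # apply at test time as logits / temperature
```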
3 Approach
In this section, we first discuss a new interpretation of a teacher under KD training, and then introduce methods that switch supervision between knowledge and observations using an instance-specific hard gate.
3.1 A View on Teacher Model
When a teacher model is well-calibrated, via a calibration method such as temperature scaling (Guo et al., 2017), the predictive score of the teacher can be used to estimate the true likelihood of correctness. In this light, a teacher can be used to evaluate whether a student model makes a miscalibrated prediction, resulting in either underconfidence or overconfidence. Furthermore, given a calibrated teacher, minimizing the knowledge gap provides more than learning inter-class relations, as the objective extends to improving the calibration of the student: by minimizing the KL divergence between the two probability distributions, the prediction of the student is expected to reflect the calibrated predictive score.
3.2 Hard Gate
From this novel view of a teacher, our work presents two instance-specific hard gates: the token-level and the sentence-level hard gate.
3.2.1 Token-Level Gate
When the predictive score of a student's prediction is high compared to the approximated likelihood of the prediction being correct, the student is supervised to distribute probability mass to the other remaining classes, hence learning to output a smoother distribution. In the other case, in which the predictive score is less than the approximation, the student is trained with supervision that increases that probability, learning from a sample drawn from the data distribution.
At every time step, instance-specific hard gates are computed on the fly during forward propagation as follows:
$$a^i_t = \begin{cases} 1, & \text{if } P_\theta(y^i_{t,j} \mid c^i_{<t}) > f(y^i_{t,j},\, c^i_{<t}) \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
$P_\theta(y^i_{t,j} \mid c^i_{<t})$ and $f(y^i_{t,j}, c^i_{<t})$ are, respectively, the conditional probability of the ground-truth index $j$ mapped by the student model and the true likelihood of $y^i_{t,j}$ occurring in the given context. Since the true likelihood of correctness cannot be obtained, we approximate the quantity with a teacher network with enhanced calibration ability: $f(y^i_{t,j}, c^i_{<t}) \approx P_\phi(y^i_{t,j} \mid c_{<t}; \tau)$.
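A minimal sketch of the token-level gate in Eq. (3), computed on the fly during the forward pass, with the calibrated teacher's probability standing in for the unobtainable $f$. Tensor names and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_level_hard_gate(student_logits, teacher_logits, target, tau=1.0):
    """Eq. (3): gate a_t = 1 where the student's probability on the
    ground-truth token exceeds the calibrated teacher's estimate
    (overconfidence), else 0 (underconfidence).

    student_logits, teacher_logits: (batch, seq_len, |V|)
    target: (batch, seq_len) ground-truth token indices
    """
    p_student = F.softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)  # temperature-calibrated teacher
    idx = target.unsqueeze(-1)
    p_s = p_student.gather(-1, idx).squeeze(-1)    # student prob. of the gold token
    f_hat = p_teacher.gather(-1, idx).squeeze(-1)  # approximated true likelihood
    return (p_s > f_hat).float()  # (batch, seq_len) hard gates in {0, 1}
```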
Supervision from Observations ($\alpha = 0$)  When the hard gate is computed to be 0, it is an indication of underfitting and underconfidence by the student on the instance. The student needs further training so that the likelihood of predicting the target index is increased. Due to the normalizing activation layer, softmax, a direct way of escalating the probability mass on the ground-truth index is to minimize the KL divergence with the one-hot encoded ground truth (Müller et al., 2019), without incorporating knowledge. That being the case, when the hard gate is set to 0, supervision to the student comes solely from the ground truth.
Supervision from Knowledge ($\alpha = 1$)  In the other case, when the gate is set to 1, it is an indication of overconfidence, as evaluated against the approximated quantity mapped by the teacher. Therefore, the student is trained to distribute probability mass over the output space more evenly; the student learns to close the gap between its probability distribution and that of the teacher.
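Putting the two cases together, the sketch below shows one way the per-token hard gate could switch supervision: where $a^i_t = 0$ the loss is pure NLL against the one-hot target, and where $a^i_t = 1$ it is pure distillation against the calibrated teacher. This composition is our reading of the description above, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def hard_gate_kd_loss(student_logits, teacher_logits, target, gate, tau=1.0):
    """Per-token supervision switching: gate == 0 -> ground truth only,
    gate == 1 -> teacher only.

    student_logits, teacher_logits: (batch, seq_len, |V|)
    target: (batch, seq_len); gate: (batch, seq_len) with values in {0, 1}
    """
    # alpha = 0 case: NLL against the one-hot ground truth, per token.
    nll = F.cross_entropy(
        student_logits.transpose(1, 2), target, reduction="none"
    )  # (batch, seq_len)

    # alpha = 1 case: cross-entropy against the calibrated teacher distribution.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / tau, dim=-1)
    kd = -(p_teacher * log_p_student).sum(dim=-1)  # (batch, seq_len)

    # Hard switch per token, then average over all tokens.
    return ((1.0 - gate) * nll + gate * kd).mean()
```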
This gating mechanism can be viewed as a smoothing of labels, hence presenting a regularization effect. The entropies of the supervisions given by the proposed method, by conventional logit-based KD ($\tilde{y}^i_t$), and by the one-hot encoded target (hard target) relate as follows:

$$H(P_\phi(Y \mid c^i_{<t}; \tau)) \;\geq\; H(\tilde{y}^i_t) \;\geq\; H(y^i_t) \tag{4}$$

where $\tilde{y}^i_t$ is the soft target, a linear interpolation of a ground truth and a probability distribution mapped by a teacher. The entropies illustrate how