Hard Gate Knowledge Distillation - Leverage Calibration for a Robust and Reliable Language Model
Dongkyu Lee1,3  Zhiliang Tian2†  Yingxiu Zhao1
Ka Chun Cheung3  Nevin L. Zhang1
1Department of Computer Science and Engineering, HKUST
2College of Computer, National University of Defense Technology
3NVIDIA AI Technology Center, NVIDIA
1{dleear, yzhaocx, lzhang}@cse.ust.hk
2tianzhilianghit@gmail.com  3chcheung@nvidia.com
This work was done while Dongkyu was an intern at NVIDIA.
†Corresponding author.
Abstract
In knowledge distillation, a student model is trained with supervision from both the knowledge of a teacher and observations drawn from a training data distribution. The knowledge of a teacher is regarded as holding the inter-class relations that provide meaningful supervision to a student; hence, much effort has been devoted to finding such knowledge to be distilled. In this paper, we explore a question that has received little attention: "when to distill such knowledge." We answer this question with the concept of model calibration: we view a teacher model not only as a source of knowledge but also as a gauge to detect miscalibration of a student. This simple yet novel view leads to a hard gate knowledge distillation scheme that switches between learning from a teacher model and learning from training data. We verify the gating mechanism in the context of natural language generation at both the token level and the sentence level. Empirical comparisons with strong baselines show that hard gate knowledge distillation not only improves model generalization, but also significantly lowers model calibration error.
1 Introduction
In recent years, the deep learning community has achieved marked performance gains across a variety of tasks (Brown et al., 2020; Devlin et al., 2018). In the meantime, some deep learning models have become excessively large, limiting their applicability in some scenarios. To cope with this issue, Hinton et al. (2015) proposed knowledge distillation (KD), in which the knowledge of a large network, called a teacher network, is transferred to a relatively small model, called a student model. The benefits of KD have been widely witnessed across multiple domains (Romero et al., 2015; Jiao et al., 2020). Recently, it has been observed that KD can be used both to reduce model size and to improve model generalization (Tang et al., 2021; Furlanello et al., 2018). Hinton et al. (2015) argue that a distribution defined by a teacher holds inter-class relations, commonly referred to as the dark knowledge, and that such a distribution brings meaningful supervision to a student. Therefore, a large body of research in KD has viewed a teacher as a source of knowledge and has focused on finding meaningful knowledge to be transferred (Romero et al., 2015; Bulò et al., 2016; Park et al., 2019; Yuan et al., 2020; Kim et al., 2021).
In this work, we focus on when to distill the knowledge of a teacher. This is a central question to ask, as a model can benefit from adaptive control of supervision between the ground truth and a teacher. When a model is trained to increase the predictive score of a prediction, a one-hot encoded supervision, without incorporating the teacher model, sends a direct signal for increasing the score (Müller et al., 2019). Conversely, when a model is trained to learn the knowledge of a teacher, the teacher's output without fusing in the ground truth sends a more direct signal for minimizing the knowledge gap between the student and the teacher. However, the question of "when" has not been answered; for this reason, previous works choose to learn from both supervisions at once.
We give an answer to the question from the perspective of model calibration. Model calibration refers to how well a predicted probability of a model reflects the true accuracy; a well-calibrated predictive score therefore represents the likelihood of correctness of a prediction (Guo et al., 2017). In this light, such a score can be viewed as a gauge to detect miscalibration of a student during training: when a student makes a prediction with a probability mass that is higher than the expected accuracy of the prediction (overconfidence), the student model is trained with supervision from the teacher only.
In the case of underconfidence, the student is trained with supervision from the ground truth only.
Switching supervision is supported by two widely accepted ideas: 1) the close link between miscalibration and overfitting, and 2) the regularization effect of KD. Guo et al. (2017) empirically find that a model overfits under negative log-likelihood (NLL) training, leading to miscalibration, and Mukhoti et al. (2020) further support this claim. Therefore, we utilize the regularization effect inherent in KD training (Yuan et al., 2020). Aside from the inter-class relations held in the knowledge, recent findings suggest that KD is a form of adaptive regularization (Tang et al., 2021; Yuan et al., 2020), in which a teacher encourages a student to distribute probability mass over the output space more evenly.
Taking all these factors into account, we present a simple yet novel KD method, called Hard gate Knowledge Distillation (HKD). Given a calibrated teacher model, the teacher gates supervision between knowledge and observation for each instance or time step, selecting which objective the student should be optimized toward. We introduce two levels of hard gates, the token level and the sentence level, both of which are instance-specific hard gates computed on the fly during forward propagation. Our work validates the proposed idea on a task in the natural language generation (NLG) domain, as there is an inseparable relation between the quality of an output and model calibration (Kumar and Sarawagi, 2019).
The contributions of the proposed method are as follows:

- To the best of our knowledge, this work is the first attempt to leverage knowledge and observations in KD with an instance-specific hard gate.
- Our work introduces a novel view of, and role for, a teacher model in the student-teacher framework, which improves the model generalization and model calibration of a student by a significant margin across multiple datasets.
2 Preliminaries & Related Work
2.1 Knowledge Distillation
The conventional logit-based KD (Hinton et al., 2015) aims to minimize the distance between the probability distribution mapped by a teacher and that of a student, while another objective is to maximize the likelihood of predicting the ground truth. The following is the loss for an instance $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ at time step $t$, where $i$ indicates the index of the sample:¹
$$\mathcal{L}_{kd} = -\sum_{v}^{|V|} \Big[\, (1-\alpha)\, y^i_{t,v} \log P_\theta(y^i_{t,v} \mid c^i_{<t}) \;+\; \alpha\, P_\phi(y^i_{t,v} \mid c^i_{<t}; \tau) \log P_\theta(y^i_{t,v} \mid c^i_{<t}; \tau) \,\Big] \tag{1}$$
$V$ and $\tau$ denote the vocabulary and a temperature, respectively. $\phi$ and $\theta$ denote the parameters of a teacher and those of a student. $\alpha$ denotes a balancing parameter, which in this work is termed a gate, and $c_{<t}$ is the context at time step $t$, hence made up of the input $x$ and the preceding tokens $y_{<t}$. The gate is set to a value between 0 and 1, which indicates a soft gate, and it is shared among instances and remains fixed throughout training (Park et al., 2019; Hinton et al., 2015; Yuan et al., 2020). Therefore, a student model is trained with a soft target $\tilde{y}^i_t$, the result of a linear interpolation between a ground truth and a distribution mapped by a teacher.
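To make Eq. (1) concrete, here is a minimal PyTorch sketch of the conventional soft-gate KD objective at a single time step. The tensor names, shapes, and default hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def soft_gate_kd_loss(student_logits, teacher_logits, target, alpha=0.5, tau=2.0):
    """Conventional logit-based KD (Eq. 1): a fixed soft gate `alpha`
    interpolates between the ground-truth NLL and the teacher's soft targets.

    student_logits, teacher_logits: (batch, |V|) unnormalized scores
    target: (batch,) ground-truth token indices
    """
    # Supervision from observations: NLL against the one-hot ground truth.
    nll = F.cross_entropy(student_logits, target)

    # Supervision from knowledge: cross-entropy against the teacher's
    # temperature-scaled distribution; the teacher receives no gradient.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / tau, dim=-1)
    kd = -(p_teacher * log_p_student).sum(dim=-1).mean()

    # The gate is shared among instances and stays fixed throughout training.
    return (1.0 - alpha) * nll + alpha * kd
```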
Numerous studies have attempted to find meaningful knowledge to be distilled. Starting with inter-class relations in logit space (Park et al., 2019; Hinton et al., 2015), the scope of knowledge expanded to the feature level (Romero et al., 2015), encouraging a student to maintain intermediate representations similar to those of a teacher. Recent studies find that even a model with a structure identical to that of the student can fill the role of a teacher; this is commonly referred to as self-knowledge distillation (Yuan et al., 2020; Kim et al., 2021; Liu et al., 2021). Yuan et al. (2020) and Tang et al. (2021) argue that this success comes from KD's close link to label smoothing (Szegedy et al., 2016), with KD having a regularization effect. In this regard, there have been attempts to explore the importance of the soft gate. PS-KD (Kim et al., 2021) linearly increases the value of the gate in the course of training. Similar to our work, Zhu and Wang (2021) propose a hard gate mechanism in KD; however, their work utilizes an iteration-specific hard gate, and the gates only apply to the distillation loss of KD.
2.2 Calibration
A model is said to be well-calibrated when its predictive confidence truly reflects the true accuracy (Guo et al., 2017):

$$P(\hat{Y} = Y \mid P(\hat{Y} \mid X) = p) = p, \quad \forall p \in [0, 1] \tag{2}$$
¹Loss equations are presented at the time-step level hereinafter, as a natural language generation task can be viewed as a sequence of classifications.
Therefore, when a model makes predictions with probability $p$, the accuracy of those predictions is expected to be $p$. The quantity is commonly approximated with the Expected Calibration Error (ECE) and the Maximum Calibration Error (MCE) (Naeini et al., 2015).
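As a rough illustration of how Eq. (2) is checked in practice, the following is a minimal sketch of ECE with equal-width confidence bins; the choice of 15 bins is a common convention in the calibration literature, assumed here rather than taken from this paper.

```python
import torch

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin predictions by confidence, then average |accuracy - confidence|
    over bins, weighted by the fraction of samples in each bin.

    confidences: (N,) max predicted probabilities
    correct:     (N,) 1.0 if the prediction was right, else 0.0
    """
    bin_edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            weight = in_bin.float().mean()   # fraction of samples in this bin
            acc = correct[in_bin].mean()     # empirical accuracy in the bin
            conf = confidences[in_bin].mean()  # mean confidence in the bin
            ece += weight * (acc - conf).abs()
    return ece
```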
There have been continuous efforts to lower the calibration error of a model, and one of the simplest yet most effective methods is temperature scaling (Guo et al., 2017). Temperature scaling is a parametric post-hoc calibration method in which a single parameter is learned: with the model parameters fixed, the single parameter is trained to lower the negative log-likelihood on a validation dataset. This simple calibration method has been widely appreciated for its ability to improve the reliability of a model (Müller et al., 2019).
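Below is a minimal sketch of the temperature scaling procedure just described: with the model frozen, a single scalar is fit to minimize NLL on held-out logits. The use of L-BFGS and the log-parameterization are common practice and are assumptions here, not specifics of this paper.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_targets, max_iter=50):
    """Learn a single temperature minimizing NLL on validation data,
    keeping the model itself fixed (Guo et al., 2017).

    val_logits: (N, |V|) detached logits from the frozen model
    val_targets: (N,) ground-truth class indices
    """
    # Optimize the log-temperature so the temperature stays positive.
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_targets)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()  # apply at test time as logits / temperature
```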
3 Approach
In this section, we first discuss a new interpretation of a teacher under KD training, and then introduce methods that switch supervision between knowledge and observations using an instance-specific hard gate.
3.1 A View on Teacher Model
When a teacher model is well-calibrated, via a calibration method such as temperature scaling (Guo et al., 2017), the predictive score of the teacher can be used to estimate the true likelihood of correctness. In this light, a teacher can be used to evaluate whether a student model makes a miscalibrated prediction, resulting in either underconfidence or overconfidence. Furthermore, given a calibrated teacher, minimizing the knowledge gap provides more than learning inter-class relations, as the objective extends to improving the calibration of the student: by minimizing the KL divergence between the two probability distributions, the prediction of the student is expected to reflect the calibrated predictive score.
3.2 Hard Gate
From this novel view of a teacher, our work presents two instance-specific hard gates: the token-level and the sentence-level hard gate.
3.2.1 Token-Level Gate
When the predictive score of a student's prediction is high compared to the approximated likelihood of the prediction being correct, the student is supervised to distribute probability mass to the other remaining classes, hence learning to output a smoother distribution. In the other case, in which the predictive score is less than the approximation, the student is trained with supervision that increases that probability, learning from a sample drawn from the data distribution.
At every time step, instance-specific hard gates are computed on the fly during forward propagation as follows:
$$a^i_t = \begin{cases} 1, & \text{if } P_\theta(y^i_{t,j} \mid c^i_{<t}) > f(y^i_{t,j},\, c^i_{<t}) \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
$P_\theta(y^i_{t,j} \mid c^i_{<t})$ and $f(y^i_{t,j}, c^i_{<t})$ are, respectively, the conditional probability of the ground-truth index $j$ mapped by the student model and the true likelihood of $y^i_{t,j}$ occurring in the given context. Since the true likelihood of correctness cannot be obtained, we approximate the quantity with a teacher network with enhanced calibration ability: $f(y^i_{t,j}, c^i_{<t}) \approx P_\phi(y^i_{t,j} \mid c_{<t}; \tau)$.
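A minimal sketch of the token-level gate in Eq. (3), computed on the fly during the forward pass, with the calibrated teacher's probability standing in for the unobtainable $f$. Tensor names and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_level_hard_gate(student_logits, teacher_logits, target, tau=1.0):
    """Eq. (3): gate a_t = 1 where the student's probability on the
    ground-truth token exceeds the calibrated teacher's estimate
    (overconfidence), else 0 (underconfidence).

    student_logits, teacher_logits: (batch, seq_len, |V|)
    target: (batch, seq_len) ground-truth token indices
    """
    p_student = F.softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)  # temperature-calibrated teacher
    idx = target.unsqueeze(-1)
    p_s = p_student.gather(-1, idx).squeeze(-1)    # student prob. of the gold token
    f_hat = p_teacher.gather(-1, idx).squeeze(-1)  # approximated true likelihood
    return (p_s > f_hat).float()  # (batch, seq_len) hard gates in {0, 1}
```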
Supervision from Observations ($\alpha = 0$)  When the hard gate is computed to be 0, it is an indication of underfitting and underconfidence by the student on the instance. The student needs further training so that the likelihood of predicting the target index is increased. Due to the normalizing activation layer, softmax, a direct way of escalating the probability mass on the ground-truth index is to minimize the KL divergence with the one-hot encoded ground truth (Müller et al., 2019), without incorporating knowledge. That being the case, when the hard gate is set to 0, supervision to the student comes solely from the ground truth.
Supervision from Knowledge ($\alpha = 1$)  In the other case, when the gate is set to 1, it is an indication of overconfidence, as evaluated against the approximated quantity mapped by the teacher. Therefore, the student is trained to distribute probability mass over the output space more evenly; the student learns to close the gap between its probability distribution and that of the teacher.
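Putting the two cases together, the sketch below shows one way the per-token hard gate could switch supervision: where $a^i_t = 0$ the loss is pure NLL against the one-hot target, and where $a^i_t = 1$ it is pure distillation against the calibrated teacher. This composition is our reading of the description above, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def hard_gate_kd_loss(student_logits, teacher_logits, target, gate, tau=1.0):
    """Per-token supervision switching: gate == 0 -> ground truth only,
    gate == 1 -> teacher only.

    student_logits, teacher_logits: (batch, seq_len, |V|)
    target: (batch, seq_len); gate: (batch, seq_len) with values in {0, 1}
    """
    # alpha = 0 case: NLL against the one-hot ground truth, per token.
    nll = F.cross_entropy(
        student_logits.transpose(1, 2), target, reduction="none"
    )  # (batch, seq_len)

    # alpha = 1 case: cross-entropy against the calibrated teacher distribution.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / tau, dim=-1)
    kd = -(p_teacher * log_p_student).sum(dim=-1)  # (batch, seq_len)

    # Hard switch per token, then average over all tokens.
    return ((1.0 - gate) * nll + gate * kd).mean()
```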
This gating mechanism can be viewed as a smoothing of labels, hence presenting a regularization effect. The entropies of the supervisions given by the proposed method, by conventional logit-based KD ($\tilde{y}^i_t$), and by the one-hot encoded target (hard target) relate as follows:

$$H(P_\phi(Y \mid c^i_{<t}; \tau)) \;\geq\; H(\tilde{y}^i_t) \;\geq\; H(y^i_t) \tag{4}$$

where $\tilde{y}^i_t$ is the soft target, a linear interpolation of a ground truth and a probability distribution mapped by a teacher. The entropies illustrate how