Adaptive Label Smoothing with Self-Knowledge
in Natural Language Generation
Dongkyu Lee1,2Ka Chun Cheung2Nevin L. Zhang1
1Department of Computer Science and Engineering, HKUST
2NVIDIA AI Technology Center, NVIDIA
dleear@cse.ust.hk chcheung@nvidia.com lzhang@cse.ust.hk
Abstract
Overconfidence has been shown to impair the generalization and calibration of a neural network. Previous studies remedy this issue by adding a regularization term to the loss function, preventing a model from producing a peaked distribution. Label smoothing smooths target labels with a pre-defined prior label distribution; as a result, a model learns to maximize the likelihood of predicting the soft labels. Nonetheless, the amount of smoothing is the same for all samples and remains fixed throughout training. In other words, label smoothing does not reflect the change in the probability distribution mapped by a model over the course of training. To address this issue, we propose a regularization scheme that brings a dynamic nature into the smoothing parameter by taking the model probability distribution into account, thereby varying the parameter per instance. A model in training self-regulates the extent of smoothing on the fly during forward propagation. Furthermore, inspired by recent work on bridging label smoothing and knowledge distillation, our work utilizes self-knowledge as a prior label distribution for softening target labels, and presents theoretical support for the regularization effect of knowledge distillation and the dynamic smoothing parameter. Our regularizer is validated comprehensively, and the results illustrate marked improvements in model generalization and calibration, enhancing the robustness and trustworthiness of a model.
1 Introduction
In common practice, a neural network is trained to maximize the expected likelihood of observed targets, and the gradient with respect to this objective updates the learnable model parameters. With hard (one-hot encoded) targets, the objective is maximized when a model assigns high probability mass to the corresponding target label over the output space. That is, due to the normalizing activation function (i.e., softmax), a model is trained so that its logits exhibit a marked difference between the target logit and the logits of the other classes (Müller et al., 2019).

This work was done while Dongkyu was an intern at NVIDIA.
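As an illustrative sketch (not from the paper), the effect can be seen numerically: widening the gap between the target logit and the rest pushes the softmax output toward a one-hot, peaked distribution.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Three-class example: the target class is index 0.
moderate_gap = softmax(np.array([2.0, 0.0, 0.0]))
large_gap = softmax(np.array([10.0, 0.0, 0.0]))

# A larger gap between the target logit and the others concentrates
# nearly all probability mass on the target class (overconfidence):
# moderate_gap[0] is about 0.79, large_gap[0] is about 0.9999.
```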
Despite its wide application, maximum likelihood estimation with hard targets has been found to incur an overconfidence problem: the predictive score of a model does not reflect the actual accuracy of its predictions. Consequently, this leads to degradation in model calibration (Pereyra et al., 2017), as well as in model performance (Müller et al., 2019). Additionally, the problem stands out more clearly with a limited number of samples, as a model is more prone to overfitting. To remedy this phenomenon, Szegedy et al. (2016) proposed label smoothing, in which one-hot encoded targets are replaced with smoothed targets. Label smoothing has boosted performance in computer vision (Szegedy et al., 2016), and has been highly preferred in other domains, such as natural language processing (Vaswani et al., 2017; Lewis et al., 2020).
However, there are several aspects of label smoothing to be discussed. First, it comes with certain downsides, namely the static smoothing parameter. The smoothing regularizer fails to account for the change in probability mass over the course of training. Despite the fact that a model can benefit from adaptive control of the smoothing extent depending on the signs of overfitting and overconfidence, the smoothing parameter remains fixed throughout training for all instances.
Another aspect of label smoothing to be considered is its connection to knowledge distillation (Hinton et al., 2015). There have been attempts to bridge label smoothing and knowledge distillation, and the findings suggest that the latter is an adaptive form of the former (Tang et al., 2021; Yuan et al., 2020). However, the regularization effect of self-knowledge distillation on overconfidence is still poorly understood and explored.

arXiv:2210.13459v1 [cs.LG] 22 Oct 2022
To tackle the issues mentioned above, this work presents adaptive label smoothing with self-knowledge as a prior label distribution. Our regularizer allows a model to self-regulate the extent of smoothing based on the entropic level of its probability distribution, varying the amount per sample and per time step. Furthermore, our theoretical analysis suggests that self-knowledge distillation and the adaptive smoothing parameter have a strong regularization effect by rescaling gradients in logit space. To the best of our knowledge, our work is the first attempt at making both the smoothing extent and the prior label distribution adaptive. Our work validates the efficacy of the proposed regularization method on machine translation tasks, achieving superior results in model performance and model calibration compared to other baselines.
2 Preliminaries & Related Work
2.1 Label Smoothing
Label smoothing (Szegedy et al., 2016) was first introduced to prevent a model from making a peaked probability distribution. Since its introduction, it has been in wide use as a means of regularization (Vaswani et al., 2017; Lewis et al., 2020). In label smoothing, the one-hot encoded ground-truth label (y) and a pre-defined prior label distribution (q) are mixed with a weight, the smoothing parameter (α), forming a smoothed ground-truth label. A model with label smoothing is learned to maximize the likelihood of predicting the smoothed label distribution. Specifically,

$$\mathcal{L}_{ls} = -\sum_{i=1}^{|C|} \Big[ (1-\alpha)\, y_i^{(n)} \log P_\theta(y_i \mid x^{(n)}) + \alpha\, q_i \log P_\theta(y_i \mid x^{(n)}) \Big] \tag{1}$$
|C| denotes the number of classes, (n) the index of a sample in a batch, and P_θ the probability distribution mapped by a model. α is commonly set to 0.1 and remains fixed throughout training (Vaswani et al., 2017; Lewis et al., 2020). A popular choice of q is the uniform distribution (q ∼ U(|C|)), while a unigram distribution is another option for dealing with an imbalanced label distribution (Vaswani et al., 2017; Szegedy et al., 2016; Müller et al., 2019; Pereyra et al., 2017). The pre-defined prior label distribution remains unchanged; hence the latter cross-entropy term in Equation 1 is equivalent to minimizing the KL divergence between the model prediction and the pre-defined label distribution. In line with this idea, Pereyra et al. (2017) proposed the confidence penalty (ConfPenalty), which adds a negative entropy term to the loss function, thereby minimizing the KL divergence between the uniform distribution and the model probability distribution. Ghoshal et al. (2021) proposed low-rank adaptive label smoothing (LORAS), which jointly learns a noise distribution for softening targets and the model parameters. Li et al. (2020) and Krothapalli and Abbott (2020) introduced smoothing schemes that are data-dependent.
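Equation 1 can be sketched in a few lines. The following is an illustrative NumPy implementation (not the authors' code), assuming a uniform prior q and a single sample:

```python
import numpy as np

def label_smoothing_loss(log_probs, target, alpha=0.1):
    """Cross-entropy against a target smoothed with a uniform prior (Eq. 1).

    log_probs: array of shape (|C|,), the model's log-probabilities.
    target: integer index of the ground-truth class.
    alpha: smoothing parameter, fixed for all samples.
    """
    num_classes = log_probs.shape[0]
    # Smoothed label: (1 - alpha) on the target, alpha spread uniformly.
    q = np.full(num_classes, 1.0 / num_classes)  # uniform prior q
    smoothed = (1.0 - alpha) * np.eye(num_classes)[target] + alpha * q
    return -np.sum(smoothed * log_probs)

logits = np.array([3.0, 1.0, 0.5])
log_probs = logits - np.log(np.sum(np.exp(logits)))
loss = label_smoothing_loss(log_probs, target=0, alpha=0.1)
```

With alpha = 0 the loss reduces to the standard hard-target cross-entropy, which is one way to sanity-check the implementation.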
2.2 Knowledge Distillation
Knowledge distillation (Hinton et al., 2015) aims to transfer the dark knowledge of (commonly) a larger and better-performing teacher model to a student model (Buciluă et al., 2006). The idea is to mix the ground-truth label with the probability distribution of a teacher model, resulting in an adaptive version of label smoothing (Tang et al., 2021).

$$\mathcal{L}_{kd} = -\sum_{i=1}^{|C|} \Big[ (1-\alpha)\, y_i^{(n)} \log P_\theta(y_i \mid x^{(n)}) + \alpha\, \bar{P}_\phi(y_i \mid x^{(n)}) \log \bar{P}_\theta(y_i \mid x^{(n)}) \Big] \tag{2}$$
φ and θ denote the parameters of the teacher model and the student model, respectively. P̄ indicates a probability distribution smoothed with a temperature. Similar to label smoothing, φ remains unchanged during training; thus the student model is learned to minimize the KL divergence between its probability distribution and that of the teacher model. When P̄_φ follows a uniform distribution with the temperature set to 1, the loss function of knowledge distillation is identical to that of uniform label smoothing.
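A minimal sketch of Equation 2 (illustrative, not the authors' code): the hard-target term is mixed with a cross-entropy against the teacher's temperature-smoothed distribution.

```python
import numpy as np

def temp_softmax(logits, T=1.0):
    # Temperature-smoothed probabilities: higher T flattens the distribution.
    z = logits / T - np.max(logits / T)
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, target, alpha=0.5, T=2.0):
    """Eq. 2: (1 - alpha) hard-target CE + alpha CE against the teacher."""
    p_student = temp_softmax(student_logits, T)
    p_teacher = temp_softmax(teacher_logits, T)
    hard_ce = -np.log(temp_softmax(student_logits, 1.0)[target])
    soft_ce = -np.sum(p_teacher * np.log(p_student))
    return (1.0 - alpha) * hard_ce + alpha * soft_ce
```

As the text notes, with a uniform teacher distribution (e.g., all-zero teacher logits) and T = 1, the soft term coincides with the uniform label smoothing term.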
Training a large teacher model can be computationally expensive; for this reason, there have been attempts to replace the teacher model with the student model itself, called self-knowledge distillation (Zhang et al., 2019; Yuan et al., 2020; Kim et al., 2021; Zhang and Sabuncu, 2020). TF-KD (Yuan et al., 2020) trains a student with a pre-trained teacher that is identical to the student in structure. SKD-PRT (Kim et al., 2021) utilizes the previous epoch's checkpoint as a teacher with a linear increase in α. Zhang and Sabuncu (2020) incorporate beta distribution sampling (BETA) and self-knowledge distillation (SD), and introduce an instance-specific prior label distribution. Yun et al. (2020) utilize self-knowledge distillation to minimize the difference between the predictive distributions of samples of the same class, encouraging consistent probability distributions within a class.
3 Approach
The core components of label smoothing are twofold: the smoothing parameter (α) and the prior label distribution. These components determine how much to smooth the target label and with which distribution, a process that requires careful selection. In this section, we illustrate how to make the smoothing parameter adaptive. We also demonstrate how our adaptive smoothing parameter and self-knowledge distillation as a prior distribution act as a form of regularization, with theoretical analysis of the gradients.
3.1 Adaptive α
An intuitive and ideal way of softening the hard target is to bring a dynamic nature into choosing α: a sample with a low entropic level in the model prediction, an indication of a peaked probability distribution, receives a high smoothing parameter to further smooth the target label. In the other scenario, in which high entropy of the model prediction (a flat distribution) is seen, the smoothing factor is decreased. With this intuition, our method computes the smoothing parameter on the fly during forward propagation in training, relying on the entropic level of the model probability distribution per sample, and per time step in the case of sequential classification.¹
$$H(P_\theta(y \mid x^{(n)})) = -\sum_{i=1}^{|C|} P_\theta(y_i \mid x^{(n)}) \log P_\theta(y_i \mid x^{(n)}) \tag{3}$$
The entropy quantifies how the probability mass is distributed across the label space; therefore, low entropy is an indication of overfitting and overconfidence (Pereyra et al., 2017; Meister et al., 2020). Since the entropy does not have a fixed range between 0 and 1, one simple scheme is to normalize the entropy by the maximum entropy (log |C|). This normalization also handles the varying class-set sizes of different datasets.
$$\alpha^{(n)} = 1 - \frac{H(P_\theta(y \mid x^{(n)}))}{\log |C|} \tag{4}$$

¹ For notational simplicity, the time step is not included in the equations hereafter.
With this mechanism, a sample with high entropy is trained with a low α, and a sample with low entropy receives a high α. The computation of α is excluded from the computation graph for the gradient calculation; hence, the gradient does not flow through the adaptive α^(n).
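The adaptive α of Equations 3 and 4 can be sketched as follows (illustrative NumPy code, not the authors' implementation); the entropy is normalized by log |C| so that α lies in [0, 1], and in an autodiff framework the result would be detached from the computation graph:

```python
import numpy as np

def adaptive_alpha(probs, eps=1e-12):
    """Per-sample smoothing parameter from normalized entropy (Eqs. 3-4).

    probs: array of shape (|C|,), the model's probability distribution.
    Returns alpha in [0, 1]: peaked (low-entropy) predictions get a
    large alpha (more smoothing), flat ones a small alpha.
    """
    num_classes = probs.shape[0]
    entropy = -np.sum(probs * np.log(probs + eps))  # Eq. 3
    return 1.0 - entropy / np.log(num_classes)      # Eq. 4

peaked = np.array([0.98, 0.01, 0.01])   # overconfident prediction
flat = np.array([1/3, 1/3, 1/3])        # maximum-entropy prediction
# The peaked distribution receives a larger smoothing parameter;
# the flat one receives alpha near 0 (no extra smoothing needed).
```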
There are two essential benefits of adopting the adaptive smoothing parameter. As the smoothing extent is determined by the model's own probability mass over the output space, the hyperparameter search for α is removed. Furthermore, it is strongly connected to the gradient rescaling effect of self-knowledge distillation, which will be dealt with in detail in Section 3.3.
3.2 Self-Knowledge As A Prior
Similar to Kim et al. (2021) and Liu et al. (2021), our regularizer loads a past student model checkpoint as the teacher network parameters in the course of training, though with a core difference in the selection process. The intuition is to utilize past self-knowledge that generalizes well, thereby hindering the model from overfitting to the observations in the training set.

$$\phi_t = \operatorname*{argmax}_{\theta_i \in \Theta_t} g\big(f(X'; \theta_i), Y'\big) \tag{5}$$
Θ_t is the set of past model checkpoints up to the current epoch t in training, and the function f is a specific task, which in our work is machine translation. X′ and Y′ are sets of input and ground-truth samples from a validation dataset², and the function g can be any proper evaluation metric of model generalization (e.g., accuracy).³ Our work utilizes the n-gram matching score BLEU (Papineni et al., 2002) as the function g for finding a suitable prior label distribution.
Equation 5 depicts how the selection of a self-teacher depends on the generalization of each past epoch checkpoint. In other words, the past checkpoint with the least generalization error is utilized as the self-teacher, a source of self-knowledge, to send generalized supervision. Furthermore, at every epoch, with Equation 5, the proposed approach replaces the self-teacher with the checkpoint with the best generalization.
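Equation 5 amounts to keeping, at each epoch, the past checkpoint that scores best on the held-out set. A hypothetical sketch (the checkpoint and scoring interfaces are assumptions for illustration, not the paper's code):

```python
def select_self_teacher(checkpoints, score_fn):
    """Eq. 5: pick the past checkpoint with the best validation score.

    checkpoints: past epoch checkpoints (the set Theta_t).
    score_fn: evaluates a checkpoint on validation data, i.e. the
    function g composed with f -- e.g., corpus BLEU for translation.
    """
    # argmax over past checkpoints; with a loss-like metric g,
    # argmin (Python's min) would be used instead.
    return max(checkpoints, key=score_fn)

# Toy usage with precomputed validation BLEU standing in for g(f(.)).
scores = {"epoch1": 21.3, "epoch2": 24.8, "epoch3": 23.9}
teacher = select_self_teacher(list(scores), score_fn=scores.get)
# teacher == "epoch2", the checkpoint with the highest BLEU
```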
Combining the adaptive smoothing parameter and self-knowledge as a prior distribution, our loss

² Note that the validation dataset is used to calculate the generalization error, not to train; it is therefore similar to early stopping (Prechelt, 2012).
³ Depending on the objective of the function g, such as a loss, argmin over θ can also be used.