
Adaptive Label Smoothing with Self-Knowledge
in Natural Language Generation
Dongkyu Lee1,2∗  Ka Chun Cheung2  Nevin L. Zhang1
1Department of Computer Science and Engineering, HKUST
2NVIDIA AI Technology Center, NVIDIA
dleear@cse.ust.hk chcheung@nvidia.com lzhang@cse.ust.hk
Abstract
Overconfidence has been shown to impair the generalization and calibration of a neural network. Previous studies remedy this issue by adding a regularization term to the loss function, preventing a model from producing a peaked distribution. Label smoothing smoothes target labels with a pre-defined prior label distribution; as a result, a model learns to maximize the likelihood of predicting the soft label. Nonetheless, the amount of smoothing is the same for all samples and remains fixed throughout training. In other words, label smoothing does not reflect the change in the probability distribution mapped by a model over the course of training. To address this issue, we propose a regularization scheme that makes the smoothing parameter dynamic by taking the model's probability distribution into account, thereby varying the parameter per instance. A model in training self-regulates the extent of smoothing on the fly during forward propagation. Furthermore, inspired by recent work on bridging label smoothing and knowledge distillation, our work utilizes self-knowledge as the prior label distribution for softening target labels, and presents theoretical support for the regularization effect of knowledge distillation and the dynamic smoothing parameter. Our regularizer is validated comprehensively, and the results illustrate marked improvements in model generalization and calibration, enhancing the robustness and trustworthiness of a model.
1 Introduction
In common practice, a neural network is trained to maximize the expected likelihood of observed targets, and the gradient with respect to this objective updates the learnable model parameters. With hard (one-hot encoded) targets, the objective is maximized when a model assigns a high probability mass to the corresponding target label over the output space. That is, due to the normalizing activation function (i.e., softmax), a model is trained so that the logits show a marked difference between the target logit and the logits of the other classes (Müller et al., 2019).

∗This work was done while Dongkyu was an intern at NVIDIA.
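As a minimal, paper-independent sketch of this training signal (plain NumPy; the function names here are illustrative, not from the paper), the hard-target negative log-likelihood keeps shrinking as the gap between the target logit and the rest grows, so training rewards ever more peaked distributions:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def hard_target_nll(logits, target):
    # Negative log-likelihood of the one-hot (hard) target class.
    return -np.log(softmax(logits)[target])

# The loss decreases monotonically as the target logit pulls away
# from the other logits, pushing the model toward overconfidence.
for gap in [1.0, 5.0, 10.0]:
    logits = np.array([gap, 0.0, 0.0])
    print(gap, hard_target_nll(logits, target=0))
```

There is no point at which the objective stops rewarding a larger logit gap, which is the mechanism behind the overconfidence discussed next.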
Despite its wide application and use, maximum likelihood estimation with hard targets has been found to incur an overconfidence problem: the predictive score of a model does not reflect the actual accuracy of the prediction. Consequently, this leads to degradation in model calibration (Pereyra et al., 2017), as well as in model performance (Müller et al., 2019). Additionally, this problem stands out more clearly with a limited number of samples, as a model is more prone to overfitting. To remedy this phenomenon, Szegedy et al. (2016) proposed label smoothing, in which one-hot encoded targets are replaced with smoothed targets. Label smoothing has boosted performance in computer vision (Szegedy et al., 2016), and has been highly preferred in other domains, such as natural language processing (Vaswani et al., 2017; Lewis et al., 2020).
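The standard scheme (Szegedy et al., 2016) can be sketched as follows in plain NumPy (`smooth_labels` is an illustrative name; ε is the usual fixed smoothing parameter):

```python
import numpy as np

def smooth_labels(target, num_classes, eps=0.1):
    # Classic label smoothing: mix the one-hot target with a uniform
    # prior over all classes, controlled by a fixed parameter eps.
    onehot = np.eye(num_classes)[target]
    return (1.0 - eps) * onehot + eps / num_classes

soft = smooth_labels(target=2, num_classes=4, eps=0.1)
# Probability mass is shifted from the target class to the others
# (roughly [0.025, 0.025, 0.925, 0.025]), so the model is no longer
# rewarded for an arbitrarily peaked output distribution.
print(soft)
```

Note that ε here is a global hyperparameter: every sample at every training step is smoothed by the same amount, which is precisely the limitation discussed below.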
However, there are several aspects of label smoothing to be discussed. First, it comes with a notable downside, namely the static smoothing parameter. The smoothing regularizer fails to account for the change in probability mass over the course of training. Although a model could benefit from adaptive control of the smoothing extent depending on signs of overfitting and overconfidence, the smoothing parameter remains fixed throughout training for all instances.
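Purely as a hypothetical illustration of what per-instance control could look like (this is NOT the paper's formulation, which derives smoothing from self-knowledge), one could tie the amount of smoothing to the model's own confidence on each sample:

```python
import numpy as np

def adaptive_eps(probs, eps_max=0.2):
    # Hypothetical per-instance smoothing parameter: smooth more when
    # the model's predictive distribution is peaked (high confidence),
    # and less when it is already flat. Illustrative only.
    confidence = probs.max()
    return eps_max * confidence

peaked = np.array([0.97, 0.01, 0.01, 0.01])
flat = np.array([0.30, 0.25, 0.25, 0.20])
print(adaptive_eps(peaked))  # larger -> stronger smoothing
print(adaptive_eps(flat))    # smaller -> weaker smoothing
```

Under such a scheme, the regularization strength would track signs of overconfidence per instance rather than staying fixed for the whole run.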
Another aspect of label smoothing to be considered is its connection to knowledge distillation (Hinton et al., 2015). There have been attempts to bridge label smoothing and knowledge distillation, and the findings suggest that the latter is an adaptive form of the former (Tang et al., 2021; Yuan et al., 2020). However, the regularization effect on overconfidence by self-knowledge distillation is
arXiv:2210.13459v1 [cs.LG] 22 Oct 2022