Adaptive Label Smoothing with Self-Knowledge
in Natural Language Generation
Dongkyu Lee1,2Ka Chun Cheung2Nevin L. Zhang1
1Department of Computer Science and Engineering, HKUST
2NVIDIA AI Technology Center, NVIDIA
dleear@cse.ust.hk chcheung@nvidia.com lzhang@cse.ust.hk
Abstract
Overconfidence has been shown to impair the generalization and calibration of a neural network. Previous studies remedy this issue by adding a regularization term to the loss function, preventing a model from producing a peaked distribution. Label smoothing smooths target labels with a pre-defined prior label distribution; as a result, a model learns to maximize the likelihood of predicting the soft labels. Nonetheless, the amount of smoothing is the same for all samples and remains fixed throughout training. In other words, label smoothing does not reflect the change in the probability distribution mapped by a model over the course of training. To address this issue, we propose a regularization scheme that brings a dynamic nature into the smoothing parameter by taking the model probability distribution into account, thereby varying the parameter per instance. A model in training self-regulates the extent of smoothing on the fly during forward propagation. Furthermore, inspired by recent work on bridging label smoothing and knowledge distillation, our work utilizes self-knowledge as a prior label distribution for softening target labels, and presents theoretical support for the regularization effect of knowledge distillation and the dynamic smoothing parameter. Our regularizer is validated comprehensively, and the results illustrate marked improvements in model generalization and calibration, enhancing the robustness and trustworthiness of a model.
1 Introduction
In common practice, a neural network is trained to maximize the expected likelihood of observed targets, and the gradient with respect to this objective updates the learnable model parameters. With hard (one-hot encoded) targets, the objective is maximized when a model assigns high probability mass to the corresponding target label over the output space. That is, due to the normalizing activation function (i.e., softmax), a model is trained so that its logits exhibit a marked difference between the target logit and the logits of the other classes (Müller et al., 2019).

This work was done while Dongkyu was an intern at NVIDIA.
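As an illustrative sketch (not from the paper), the effect can be seen numerically: widening the gap between the target logit and the rest pushes the softmax output toward a one-hot, peaked distribution.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Three-class example: the target class is index 0.
moderate_gap = softmax(np.array([2.0, 0.0, 0.0]))
large_gap = softmax(np.array([10.0, 0.0, 0.0]))

# A larger gap between the target logit and the others concentrates
# nearly all probability mass on the target class (overconfidence):
# moderate_gap[0] is about 0.79, large_gap[0] is about 0.9999.
```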
Despite its wide application, maximum likelihood estimation with hard targets has been found to incur an overconfidence problem: the predictive score of a model does not reflect the actual accuracy of its predictions. Consequently, this leads to degradation in model calibration (Pereyra et al., 2017), as well as in model performance (Müller et al., 2019). Additionally, the problem stands out more clearly with a limited number of samples, as a model is more prone to overfitting. To remedy this phenomenon, Szegedy et al. (2016) proposed label smoothing, in which one-hot encoded targets are replaced with smoothed targets. Label smoothing has boosted performance in computer vision (Szegedy et al., 2016), and has been highly preferred in other domains, such as natural language processing (Vaswani et al., 2017; Lewis et al., 2020).
However, there are several aspects of label smoothing to be discussed. First, it comes with certain downsides, namely the static smoothing parameter. The smoothing regularizer fails to account for the change in probability mass over the course of training. Despite the fact that a model can benefit from adaptive control of the smoothing extent depending on the signs of overfitting and overconfidence, the smoothing parameter remains fixed throughout training for all instances.
Another aspect of label smoothing to be considered is its connection to knowledge distillation (Hinton et al., 2015). There have been attempts to bridge label smoothing and knowledge distillation, and the findings suggest that the latter is an adaptive form of the former (Tang et al., 2021; Yuan et al., 2020). However, the regularization effect of self-knowledge distillation on overconfidence is still poorly understood and explored.

arXiv:2210.13459v1 [cs.LG] 22 Oct 2022
To tackle the issues mentioned above, this work presents adaptive label smoothing with self-knowledge as a prior label distribution. Our regularizer allows a model to self-regulate the extent of smoothing based on the entropic level of its probability distribution, varying the amount per sample and per time step. Furthermore, our theoretical analysis suggests that self-knowledge distillation and the adaptive smoothing parameter have a strong regularization effect by rescaling gradients in logit space. To the best of our knowledge, our work is the first attempt at making both the smoothing extent and the prior label distribution adaptive. Our work validates the efficacy of the proposed regularization method on machine translation tasks, achieving superior results in model performance and model calibration compared to other baselines.
2 Preliminaries & Related Work
2.1 Label Smoothing
Label smoothing (Szegedy et al., 2016) was first introduced to prevent a model from making a peaked probability distribution. Since its introduction, it has been in wide use as a means of regularization (Vaswani et al., 2017; Lewis et al., 2020). In label smoothing, the one-hot encoded ground-truth label (y) and a pre-defined prior label distribution (q) are mixed with a weight, the smoothing parameter (α), forming a smoothed ground-truth label. A model with label smoothing is learned to maximize the likelihood of predicting the smoothed label distribution. Specifically,

$$\mathcal{L}_{ls} = -\sum_{i=1}^{|C|} \Big[ (1-\alpha)\, y_i^{(n)} \log P_\theta(y_i \mid x^{(n)}) + \alpha\, q_i \log P_\theta(y_i \mid x^{(n)}) \Big] \tag{1}$$
|C| denotes the number of classes, (n) the index of a sample in a batch, and P_θ the probability distribution mapped by a model. α is commonly set to 0.1 and remains fixed throughout training (Vaswani et al., 2017; Lewis et al., 2020). A popular choice of q is the uniform distribution (q ∼ U(|C|)), while a unigram distribution is another option for dealing with an imbalanced label distribution (Vaswani et al., 2017; Szegedy et al., 2016; Müller et al., 2019; Pereyra et al., 2017). The pre-defined prior label distribution remains unchanged; hence the latter cross-entropy term in Equation 1 is equivalent to minimizing the KL divergence between the model prediction and the pre-defined label distribution. In line with this idea, Pereyra et al. (2017) proposed the confidence penalty (ConfPenalty), which adds a negative entropy term to the loss function, thereby minimizing the KL divergence between the uniform distribution and the model probability distribution. Ghoshal et al. (2021) proposed low-rank adaptive label smoothing (LORAS), which jointly learns a noise distribution for softening targets and the model parameters. Li et al. (2020) and Krothapalli and Abbott (2020) introduced smoothing schemes that are data-dependent.
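Equation 1 can be sketched in a few lines. The following is an illustrative NumPy implementation (not the authors' code), assuming a uniform prior q and a single sample:

```python
import numpy as np

def label_smoothing_loss(log_probs, target, alpha=0.1):
    """Cross-entropy against a target smoothed with a uniform prior (Eq. 1).

    log_probs: array of shape (|C|,), the model's log-probabilities.
    target: integer index of the ground-truth class.
    alpha: smoothing parameter, fixed for all samples.
    """
    num_classes = log_probs.shape[0]
    # Smoothed label: (1 - alpha) on the target, alpha spread uniformly.
    q = np.full(num_classes, 1.0 / num_classes)  # uniform prior q
    smoothed = (1.0 - alpha) * np.eye(num_classes)[target] + alpha * q
    return -np.sum(smoothed * log_probs)

logits = np.array([3.0, 1.0, 0.5])
log_probs = logits - np.log(np.sum(np.exp(logits)))
loss = label_smoothing_loss(log_probs, target=0, alpha=0.1)
```

With alpha = 0 the loss reduces to the standard hard-target cross-entropy, which is one way to sanity-check the implementation.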
2.2 Knowledge Distillation
Knowledge distillation (Hinton et al., 2015) aims to transfer the dark knowledge of (commonly) a larger and better-performing teacher model to a student model (Buciluă et al., 2006). The idea is to mix the ground-truth label with the probability distribution of a teacher model, resulting in an adaptive version of label smoothing (Tang et al., 2021).

$$\mathcal{L}_{kd} = -\sum_{i=1}^{|C|} \Big[ (1-\alpha)\, y_i^{(n)} \log P_\theta(y_i \mid x^{(n)}) + \alpha\, \bar{P}_\phi(y_i \mid x^{(n)}) \log \bar{P}_\theta(y_i \mid x^{(n)}) \Big] \tag{2}$$
φ and θ denote the parameters of the teacher model and the student model, respectively. P̄ indicates a probability distribution smoothed with a temperature. Similar to label smoothing, φ remains unchanged during training; thus the student model is learned to minimize the KL divergence between its probability distribution and that of the teacher model. When P̄_φ follows a uniform distribution with the temperature set to 1, the loss function of knowledge distillation is identical to that of uniform label smoothing.
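A minimal sketch of Equation 2 (illustrative, not the authors' code): the hard-target term is mixed with a cross-entropy against the teacher's temperature-smoothed distribution.

```python
import numpy as np

def temp_softmax(logits, T=1.0):
    # Temperature-smoothed probabilities: higher T flattens the distribution.
    z = logits / T - np.max(logits / T)
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, target, alpha=0.5, T=2.0):
    """Eq. 2: (1 - alpha) hard-target CE + alpha CE against the teacher."""
    p_student = temp_softmax(student_logits, T)
    p_teacher = temp_softmax(teacher_logits, T)
    hard_ce = -np.log(temp_softmax(student_logits, 1.0)[target])
    soft_ce = -np.sum(p_teacher * np.log(p_student))
    return (1.0 - alpha) * hard_ce + alpha * soft_ce
```

As the text notes, with a uniform teacher distribution (e.g., all-zero teacher logits) and T = 1, the soft term coincides with the uniform label smoothing term.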
Training a large teacher model can be computationally expensive; for this reason, there have been attempts to replace the teacher model with the student model itself, called self-knowledge distillation (Zhang et al., 2019; Yuan et al., 2020; Kim et al., 2021; Zhang and Sabuncu, 2020). TF-KD (Yuan et al., 2020) trains a student with a pre-trained teacher that is identical to the student in structure. SKD-PRT (Kim et al., 2021) utilizes the previous epoch's checkpoint as a teacher with a linear increase in α. Zhang and Sabuncu (2020) incorporate beta distribution sampling (BETA) and self-knowledge distillation (SD), and introduce an instance-specific prior label distribution. Yun et al. (2020) utilize self-knowledge distillation to minimize the difference between the predictive distributions of samples of the same class, encouraging consistent probability distributions within a class.
3 Approach
The core components of label smoothing are twofold: the smoothing parameter (α) and the prior label distribution. These components determine how much to smooth the target label and with which distribution, a process that requires careful selection. In this section, we illustrate how to make the smoothing parameter adaptive. We also demonstrate how our adaptive smoothing parameter and self-knowledge distillation as a prior distribution act as a form of regularization, with theoretical analysis of the gradients.
3.1 Adaptive α
An intuitive and ideal way of softening the hard target is to bring a dynamic nature into choosing α: a sample with a low entropic level in the model prediction, an indication of a peaked probability distribution, receives a high smoothing parameter to further smooth the target label. In the other scenario, in which high entropy of the model prediction (a flat distribution) is seen, the smoothing factor is decreased. With this intuition, our method computes the smoothing parameter on the fly during forward propagation in training, relying on the entropic level of the model probability distribution per sample, and per time step in the case of sequential classification.¹
$$H(P_\theta(y \mid x^{(n)})) = -\sum_{i=1}^{|C|} P_\theta(y_i \mid x^{(n)}) \log P_\theta(y_i \mid x^{(n)}) \tag{3}$$
The entropy quantifies how the probability mass is distributed across the label space; therefore, low entropy is an indication of overfitting and overconfidence (Pereyra et al., 2017; Meister et al., 2020). Since the entropy does not have a fixed range between 0 and 1, one simple scheme is to normalize the entropy by the maximum entropy (log |C|). This normalization also handles the varying class-set sizes of different datasets.
$$\alpha^{(n)} = 1 - \frac{H(P_\theta(y \mid x^{(n)}))}{\log |C|} \tag{4}$$

¹ For notational simplicity, the time step is not included in the equations hereafter.
With this mechanism, a sample with high entropy is trained with a low α, and a sample with low entropy receives a high α. The computation of α is excluded from the computation graph for the gradient calculation; hence, the gradient does not flow through the adaptive α^(n).
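The adaptive α of Equations 3 and 4 can be sketched as follows (illustrative NumPy code, not the authors' implementation); the entropy is normalized by log |C| so that α lies in [0, 1], and in an autodiff framework the result would be detached from the computation graph:

```python
import numpy as np

def adaptive_alpha(probs, eps=1e-12):
    """Per-sample smoothing parameter from normalized entropy (Eqs. 3-4).

    probs: array of shape (|C|,), the model's probability distribution.
    Returns alpha in [0, 1]: peaked (low-entropy) predictions get a
    large alpha (more smoothing), flat ones a small alpha.
    """
    num_classes = probs.shape[0]
    entropy = -np.sum(probs * np.log(probs + eps))  # Eq. 3
    return 1.0 - entropy / np.log(num_classes)      # Eq. 4

peaked = np.array([0.98, 0.01, 0.01])   # overconfident prediction
flat = np.array([1/3, 1/3, 1/3])        # maximum-entropy prediction
# The peaked distribution receives a larger smoothing parameter;
# the flat one receives alpha near 0 (no extra smoothing needed).
```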
There are two essential benefits of adopting the adaptive smoothing parameter. As the smoothing extent is determined by the model's own probability mass over the output space, the hyperparameter search for α is removed. Furthermore, it is strongly connected to the gradient rescaling effect of self-knowledge distillation, which will be dealt with in detail in Section 3.3.
3.2 Self-Knowledge As A Prior
Similar to Kim et al. (2021) and Liu et al. (2021), our regularizer loads a past student model checkpoint as the teacher network parameters in the course of training, though with a core difference in the selection process. The intuition is to utilize past self-knowledge that generalizes well, thereby hindering the model from overfitting to the observations in the training set.

$$\phi_t = \operatorname*{argmax}_{\theta_i \in \Theta_t} g\big(f(X'; \theta_i), Y'\big) \tag{5}$$
Θ_t is the set of past model checkpoints up to the current epoch t in training, and the function f is a specific task, which in our work is machine translation. X′ and Y′ are sets of input and ground-truth samples from a validation dataset², and the function g can be any proper evaluation metric of model generalization (e.g., accuracy).³ Our work utilizes the n-gram matching score BLEU (Papineni et al., 2002) as the function g for finding a suitable prior label distribution.
Equation 5 depicts how the selection of a self-teacher depends on the generalization of each past epoch checkpoint. In other words, the past checkpoint with the least generalization error is utilized as the self-teacher, a source of self-knowledge, to send generalized supervision. Furthermore, at every epoch, with Equation 5, the proposed approach replaces the self-teacher with the checkpoint with the best generalization.
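Equation 5 amounts to keeping, at each epoch, the past checkpoint that scores best on the held-out set. A hypothetical sketch (the checkpoint and scoring interfaces are assumptions for illustration, not the paper's code):

```python
def select_self_teacher(checkpoints, score_fn):
    """Eq. 5: pick the past checkpoint with the best validation score.

    checkpoints: past epoch checkpoints (the set Theta_t).
    score_fn: evaluates a checkpoint on validation data, i.e. the
    function g composed with f -- e.g., corpus BLEU for translation.
    """
    # argmax over past checkpoints; with a loss-like metric g,
    # argmin (Python's min) would be used instead.
    return max(checkpoints, key=score_fn)

# Toy usage with precomputed validation BLEU standing in for g(f(.)).
scores = {"epoch1": 21.3, "epoch2": 24.8, "epoch3": 23.9}
teacher = select_self_teacher(list(scores), score_fn=scores.get)
# teacher == "epoch2", the checkpoint with the highest BLEU
```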
Combining the adaptive smoothing parameter and self-knowledge as a prior distribution, our loss

² Note that the validation dataset is used to calculate the generalization error, not to train; it is therefore similar to early stopping (Prechelt, 2012).
³ Depending on the objective of the function g, such as a loss, argmin over θ can also be used.