Are All Losses Created Equal:
A Neural Collapse Perspective
Jinxin Zhou
Ohio State University
zhou.3820@osu.edu
Chong You
Google Research
cyou@google.com
Xiao Li
University of Michigan
xlxiao@umich.edu
Kangning Liu
New York University
kl3141@nyu.edu
Sheng Liu
New York University
shengliu@nyu.edu
Qing Qu
University of Michigan
qingqu@umich.edu
Zhihui Zhu
Ohio State University
zhu.3440@osu.edu
Abstract
While cross entropy (CE) is the most commonly used loss function to train deep neural networks for classification tasks, many alternative losses have been developed to obtain better empirical performance. Among them, which one is the best to use is still a mystery, because there seem to be multiple factors affecting the answer, such as properties of the dataset, the choice of network architecture, and so on. This paper studies the choice of loss function by examining the last-layer features of deep networks, drawing inspiration from a recent line of work showing that the global optimal solutions of the CE and mean-square-error (MSE) losses exhibit a Neural Collapse (NC) phenomenon. That is, for sufficiently large networks trained until convergence, (i) all features of the same class collapse to the corresponding class mean and (ii) the means associated with different classes are in a configuration where their pairwise distances are all equal and maximized. We extend such results and show through global solution and landscape analyses that a broad family of loss functions, including the commonly used label smoothing (LS) and focal loss (FL), exhibits NC. Hence, all relevant losses (i.e., CE, LS, FL, MSE) produce equivalent features on training data. In particular, based on the unconstrained feature model assumption, we provide a global landscape analysis for the LS loss and a local landscape analysis for the FL loss, showing that the (only!) global minimizers are NC solutions, while all other critical points are strict saddles whose Hessians exhibit negative curvature directions, either globally for the LS loss or locally for the FL loss near the optimal solution. The experiments further show that NC features obtained from all relevant losses (i.e., CE, LS, FL, MSE) lead to largely identical performance on test data as well, provided that the network is sufficiently large and trained until convergence. The source code is available at https://github.com/jinxinzhou/nc_loss.
1 Introduction
The loss function is an indispensable component in the training of deep neural networks (DNNs). While the cross-entropy (CE) loss is one of the most popular choices for classification tasks, studies over the past few years have suggested many improved versions of CE that bring better empirical performance. Some notable examples include label smoothing (LS) [1], where the one-hot label is replaced by a smoothed label, focal loss (FL) [2], which puts more emphasis on hard misclassified samples and
reduces the relative loss on already well-classified samples, and so on. Aside from CE and its variants, the mean squared error (MSE) loss, which is typically used for regression tasks, has recently been demonstrated to have competitive performance compared to CE for classification tasks [3].
Despite the existence of many loss functions, there is, however, a lack of consensus as to which one is the best to use, and the answer seems to depend on multiple factors such as properties of the dataset, choice of network architecture, and so on [4]. In this work, we aim to understand the effect of the loss function in classification tasks from the perspective of characterizing the last-layer features and classifier of a DNN trained under different losses. Our study is motivated by a sequence of recent works that identify an intriguing Neural Collapse (NC) phenomenon in trained networks, which refers to the following properties of the last-layer features and classifier:
(i) Variability Collapse: all features of the same class collapse to the corresponding class mean.
(ii) Convergence to Simplex ETF: the means associated with different classes are in a Simplex Equiangular Tight Frame (ETF) configuration where their pairwise distances are all equal and maximized.
(iii) Convergence to Self-duality: the class means are ideally aligned with the last-layer linear classifiers.
(iv) Simple Decision Rule: the last-layer classifier is equivalent to a Nearest Class-Center decision rule.
This NC phenomenon was first discovered by Papyan et al. [5, 6] for canonical classification problems trained with the CE loss. Following the work on the CE loss, Han et al. [7] recently reported that DNNs trained with the MSE loss for classification problems also exhibit a similar NC phenomenon. These results imply that deep networks essentially learn maximally separable features between classes, together with a max-margin classifier in the last layer built upon these learned features. The intriguing empirical observation motivated a surge of theoretical investigations [7–22], mostly under a simplified unconstrained feature model [10] or layer-peeled model [12] that treats the last-layer features of each sample before the final classifier as free optimization variables. Under the simplified unconstrained feature model, it has been proved that the NC solution is the only global optimal solution for the CE and MSE losses, which have also been proved to have a benign global landscape, explaining why the global NC solution can be obtained.
Contributions. While previous works provide a thorough analysis of NC under the CE and MSE losses, theoretical analysis beyond CE and MSE is still limited, and each work focuses on one specific loss without a general formulation. In this paper, we consider a broad family of loss functions that includes CE and some other popular loss functions, such as LS and FL, as special cases. Under the unconstrained feature model, we theoretically demonstrate in Section 3 that the NC solution is the only global optimal solution to this family of loss functions. Moreover, we provide a landscape analysis, showing that the LS loss function is a (globally) strict saddle function and the FL loss function is a local strict saddle function [23–25]. A (local) strict saddle function is a function for which every critical point is either a global solution or a strict saddle point with negative curvature (locally). Hence, our result suggests that iterative optimizers can escape strict saddle points and converge to the global solution corresponding to NC for LS and FL. As far as we know, this paper is the first work that conducts global optimal solution and benign optimization landscape analyses beyond the scope of the CE and MSE losses.
Our theoretical results explained above have important implications for understanding the role of the loss function in training DNNs for classification tasks. Because all losses lead to NC solutions, their corresponding features are equivalent up to a rotation of the feature space. In other words, our analysis provides a theoretical justification for the following claim:

All losses (i.e., CE, LS, FL, MSE) lead to largely identical features on training data when using large DNNs trained for sufficiently many iterations.

We verify this claim through experiments in Section 4.1.
While NC reveals that all losses are equivalent at training time, it does not have a direct implication for the features associated with test data or for the generalization performance [26]. In particular, a recent work [27] shows empirically that NC does not occur for the features associated with test data. Nonetheless, we show through empirical evidence that for large DNNs, NC on training data is a good predictor of the test performance. In particular, our empirical study in Section 4.2 shows the following:

All losses (CE, LS, FL, MSE) lead to largely identical performance on test data for large DNNs.
Our conclusion that all losses are created equal appears to go against existing evidence on the advantages of some losses over others. Here we emphasize that our conclusion has an important premise, namely that the neural network has sufficient approximation power and the training is performed for sufficiently many iterations. Hence, our conclusion implies that the better performance observed with particular choices of loss function arises when the training does not produce a globally optimal (i.e., NC) solution. In such cases, different losses lead to different solutions on the training data, and correspondingly different performance on test data. Such an understanding may provide important practical guidance on which loss to choose in different settings (e.g., different model sizes and different training time budgets), as well as for the design of new and better losses in the future. We note that our conclusion is based on natural accuracy, rather than model transferability or robustness, which are worth additional effort to explore and are left as future work.
2 The Problem Setup
A typical deep neural network $\Psi(\cdot): \mathbb{R}^D \mapsto \mathbb{R}^K$ consists of a multi-layer nonlinear compositional feature mapping $\Phi(\cdot): \mathbb{R}^D \mapsto \mathbb{R}^d$ and a linear classifier $(W, b)$, which can be generally expressed as
$$\Psi_\Theta(x) = W \Phi_\theta(x) + b, \tag{1}$$
where we use $\theta$ to represent the network parameters in the feature mapping and $W \in \mathbb{R}^{K \times d}$ and $b \in \mathbb{R}^K$ to represent the linear classifier's weight and bias, respectively. Therefore, all the network parameters form the set $\Theta = \{\theta, W, b\}$. For the input $x$, the output of the feature mapping $\Phi_\theta(x)$ is usually termed as the representation or feature learned from the network.
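To make the decomposition in (1) concrete, the following PyTorch-style sketch (our own illustration, not the authors' released code; the class and layer sizes are hypothetical) wraps an arbitrary backbone playing the role of $\Phi_\theta$ and a single linear layer playing the role of $(W, b)$:

```python
import torch
import torch.nn as nn

class DecomposedNet(nn.Module):
    """Psi_Theta(x) = W Phi_theta(x) + b, mirroring Eq. (1)."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                             # feature mapping Phi_theta : R^D -> R^d
        self.classifier = nn.Linear(feat_dim, num_classes)   # linear classifier (W, b) : R^d -> R^K

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.backbone(x)         # last-layer feature h = Phi_theta(x)
        return self.classifier(h)    # logits z = W h + b

# Hypothetical toy instantiation with D = 32, d = 16, K = 10.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
model = DecomposedNet(backbone, feat_dim=16, num_classes=10)
logits = model(torch.randn(8, 32))   # shape (8, 10)
```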
With an appropriate loss function, the parameters $\Theta$ of the whole network are optimized to learn the underlying relation from an input sample $x$ to its corresponding target $y$, so that the output of the network $\Psi_\Theta(x)$ approximates the corresponding target, i.e., $\Psi_\Theta(x) \approx y$ in terms of the expectation over a distribution $\mathcal{D}$ of input-output data pairs $(x, y)$. While it is hard to get access to the ground-truth distribution $\mathcal{D}$ in most cases, one can approximate $\mathcal{D}$ by sampling enough data pairs i.i.d. from $\mathcal{D}$. In this paper, we study multi-class balanced classification tasks with $K$ classes and $n$ samples per class, where we use the one-hot vector $y_k \in \mathbb{R}^K$, with unity only in the $k$-th entry ($1 \le k \le K$), to denote the label of the $i$-th sample $x_{k,i} \in \mathbb{R}^D$ in the $k$-th class. We then learn the parameters $\Theta$ by minimizing the following empirical risk over the total $N = nK$ training samples
$$\min_{\Theta}\ \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{n} \mathcal{L}\big(\Psi_\Theta(x_{k,i}), y_k\big) + \frac{\lambda}{2} \|\Theta\|_F^2, \tag{2}$$
where $\lambda > 0$ is the regularization parameter (a.k.a. the weight decay parameter²) and $\mathcal{L}\big(\Psi_\Theta(x_{k,i}), y_k\big)$ is a predefined loss function that appropriately measures the difference between the output $\Psi_\Theta(x_{k,i})$ and the target $y_k$. Some common loss functions used for training deep neural networks will be specified in the next section.
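In practice, the penalty $\frac{\lambda}{2}\|\Theta\|_F^2$ in (2) is typically realized through the optimizer's weight-decay option rather than as an explicit term in the loss. A minimal sketch of one training step (illustrative only; the model, batch, and hyperparameter values are hypothetical):

```python
import torch
import torch.nn.functional as F

lam = 5e-4   # hypothetical weight-decay value, i.e. lambda in Eq. (2)
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
# weight_decay=lam adds the gradient of (lam/2) * ||Theta||_F^2 to every parameter update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=lam)

x = torch.randn(128, 32)               # a mini-batch of inputs
y = torch.randint(0, 10, (128,))       # class indices
optimizer.zero_grad()
loss = F.cross_entropy(model(x), y)    # the loss term L in Eq. (2); other losses plug in here
loss.backward()
optimizer.step()
```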
2.1 Commonly Used Training Losses
In this subsection, we first present four common loss functions for classification tasks. To simplify the notation, let $z = W\Phi_\theta(x) + b$ denote the network's output ("logit") vector for the input $x$, and assume $z$ belongs to the $k$-th class. Also let $y^{\mathrm{smooth}}_k = (1 - \alpha) y_k + \frac{\alpha}{K}\mathbf{1}_K$ denote the smoothed target of the $k$-th class, where $0 \le \alpha < 1$ and $\mathbf{1}_K \in \mathbb{R}^K$ is a vector with all entries equal to one. We will use $z_\ell$, $y_{k,\ell}$, and $y^{\mathrm{smooth}}_{k,\ell}$ to denote the $\ell$-th entry of $z$, $y_k$, and $y^{\mathrm{smooth}}_k$, respectively, where $y^{\mathrm{smooth}}_{k,k} = 1 - \frac{K-1}{K}\alpha$ and $y^{\mathrm{smooth}}_{k,\ell} = \frac{\alpha}{K}$ for $\ell \ne k$.
² Without weight decay, the features and classifiers will tend to blow up for CE and many other losses.
Cross entropy (CE) is perhaps the most common loss for multi-class classification in deep learning. It measures the distance between the target distribution $y_k$ and the network output distribution obtained by applying the softmax function to $z$, resulting in the following expression:
$$\mathcal{L}_{\mathrm{CE}}(z, y_k) = -\log\left(\frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}\right). \tag{3}$$
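As a quick sanity check (our own illustrative snippet, not from the paper), (3) is just the negative log-softmax of the target logit and coincides with the library cross entropy:

```python
import torch
import torch.nn.functional as F

def ce_loss(z: torch.Tensor, k: int) -> torch.Tensor:
    """Eq. (3): -log( exp(z_k) / sum_j exp(z_j) ) for a single logit vector z."""
    return -F.log_softmax(z, dim=-1)[k]

z, k = torch.randn(5), 2   # K = 5 logits, true class k = 2
assert torch.allclose(ce_loss(z, k), F.cross_entropy(z.unsqueeze(0), torch.tensor([k])))
```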
Focal loss (FL) [2] was first proposed to deal with the extreme foreground-background class imbalance in dense object detection; it adaptively focuses less on the well-classified samples. Recent work [28, 29] reports that focal loss also improves calibration and automatically forms curriculum learning in the multi-class classification setting. Letting $\gamma \ge 0$ denote the tunable focusing parameter, the focal loss can be expressed as:
$$\mathcal{L}_{\mathrm{FL}}(z, y_k) = -\left(1 - \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}\right)^{\gamma} \log\left(\frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}\right). \tag{4}$$
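A direct implementation of (4) (an illustrative sketch under our notation, not the authors' code); note that $\gamma = 0$ recovers the CE loss in (3):

```python
import torch
import torch.nn.functional as F

def focal_loss(z: torch.Tensor, k: int, gamma: float = 2.0) -> torch.Tensor:
    """Eq. (4): -(1 - p_k)^gamma * log(p_k), where p_k = softmax(z)_k."""
    log_p = F.log_softmax(z, dim=-1)
    p_k = log_p[k].exp()
    return -((1.0 - p_k) ** gamma) * log_p[k]

z, k = torch.randn(5), 2
# gamma = 0 reduces the focal loss to the CE loss of Eq. (3).
assert torch.allclose(focal_loss(z, k, gamma=0.0),
                      F.cross_entropy(z.unsqueeze(0), torch.tensor([k])))
```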
Label smoothing (LS) [1] replaces the hard targets in CE with smoothed targets $y^{\mathrm{smooth}}_k$ obtained from mixing the original targets $y_k$ with a uniform distribution over all entries. Experiments in [30, 31] find that classification models trained with label smoothing have better calibration and generalization. Denoting by $0 \le \alpha \le 1$ the tunable smoothing parameter, the label smoothing loss function can be formulated as:
$$\mathcal{L}_{\mathrm{LS}}(z, y_k) = -\sum_{\ell=1}^{K} y^{\mathrm{smooth}}_{k,\ell} \log\left(\frac{\exp(z_\ell)}{\sum_{j=1}^{K} \exp(z_j)}\right). \tag{5}$$
When $\alpha = 0$, the above label smoothing loss reduces to the CE loss.
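A direct implementation of (5) (illustrative sketch, not the paper's code). Setting $\alpha = 0$ recovers CE, as stated above; recent PyTorch versions also expose an equivalent `label_smoothing` argument in `F.cross_entropy`.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(z: torch.Tensor, k: int, alpha: float = 0.1) -> torch.Tensor:
    """Eq. (5): cross entropy against the smoothed target y_k^smooth."""
    K = z.shape[-1]
    y_smooth = torch.full((K,), alpha / K)
    y_smooth[k] += 1.0 - alpha          # target entry becomes 1 - (K-1)/K * alpha
    return -(y_smooth * F.log_softmax(z, dim=-1)).sum()

z, k = torch.randn(5), 2
# alpha = 0 recovers the CE loss of Eq. (3).
assert torch.allclose(label_smoothing_loss(z, k, alpha=0.0),
                      F.cross_entropy(z.unsqueeze(0), torch.tensor([k])))
```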
Mean squared error (MSE) is often used for regression but not classification tasks. The recent work [3] shows that classification networks trained with the MSE loss achieve on-par performance compared to those trained with the CE loss. Throughout our paper, we use the rescaled MSE version [3]:
$$\mathcal{L}_{\mathrm{MSE}}(z, y_k) = \kappa (z_k - \beta)^2 + \sum_{\ell \ne k}^{K} z_\ell^2, \tag{6}$$
where $\kappa > 0$ and $\beta > 0$ are hyperparameters.
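A direct implementation of (6) (illustrative sketch; the specific $\kappa$ and $\beta$ values are left as arguments since they are hyperparameters). With $\kappa = \beta = 1$ this reduces to the plain MSE against the one-hot target $y_k$.

```python
import torch

def rescaled_mse_loss(z: torch.Tensor, k: int, kappa: float, beta: float) -> torch.Tensor:
    """Eq. (6): kappa * (z_k - beta)^2 + sum_{l != k} z_l^2."""
    sq = z ** 2
    return kappa * (z[k] - beta) ** 2 + sq.sum() - sq[k]

z, k = torch.randn(5), 2
# kappa = beta = 1 gives the ordinary MSE against the one-hot vector y_k.
assert torch.allclose(rescaled_mse_loss(z, k, kappa=1.0, beta=1.0),
                      ((z - torch.eye(5)[k]) ** 2).sum())
```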
2.2 Problem Formulation Based on Unconstrained Feature Models
Because of the interaction between a large number of nonlinear layers in the feature mapping $\Phi_\theta$, it is tremendously challenging to analyze the optimization of deep neural networks. To reduce this difficulty, a series of recent works theoretically studying the NC phenomenon use a so-called unconstrained feature model (or layer-peeled model in [12]), which treats the last-layer features $h = \Phi(x) \in \mathbb{R}^d$ as free optimization variables. The rationale behind the unconstrained feature model is that modern highly overparameterized deep networks are able to approximate any continuous function [32–35] and the characterization of NC only involves the last-layer features. We adopt the same approach and study the effects of different training losses on the last-layer representations of the network under the unconstrained feature model. For convenience, let us denote
$$W := \begin{bmatrix} w_1 & w_2 & \cdots & w_K \end{bmatrix}^\top \in \mathbb{R}^{K \times d}, \quad H := \begin{bmatrix} H_1 & H_2 & \cdots & H_n \end{bmatrix} \in \mathbb{R}^{d \times N}, \quad\text{and}\quad Y := \begin{bmatrix} Y_1 & Y_2 & \cdots & Y_K \end{bmatrix} \in \mathbb{R}^{K \times N},$$
where $w_k$ is the $k$-th row vector of $W$, the features are collected as $H_i := \begin{bmatrix} h_{1,i} & \cdots & h_{K,i} \end{bmatrix} \in \mathbb{R}^{d \times K}$ with $h_{k,i}$ the feature of the $i$-th sample in the $k$-th class, and $Y_k := \begin{bmatrix} y_k & \cdots & y_k \end{bmatrix} \in \mathbb{R}^{K \times n}$, for all $k = 1, 2, \cdots, K$ and $i = 1, 2, \cdots, n$. Based on the unconstrained feature model, we consider a slight variant of (2), given by
$$\min_{W, H, b}\ f(W, H, b) := \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{n} \mathcal{L}\big(W h_{k,i} + b,\, y_k\big) + \frac{\lambda_W}{2} \|W\|_F^2 + \frac{\lambda_H}{2} \|H\|_F^2 + \frac{\lambda_b}{2} \|b\|_2^2, \tag{7}$$
where $\lambda_W, \lambda_H, \lambda_b > 0$ are the penalty parameters for $W$, $H$, and $b$, respectively.
By viewing the last-layer feature matrix $H$ as a free optimization variable, the simplified objective function (7) imposes weight decay on $W$ and $H$, which is slightly different from practice, where weight decay is imposed on all the network parameters $\Theta$ as shown in (2). Nonetheless, the underlying rationale is that the weight decay on $\Theta$ implicitly penalizes the energy of the features (i.e., $\|H\|_F$) [16].
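The following sketch makes (7) concrete by treating $W$, $H$, and $b$ as free variables and minimizing the regularized objective by full-batch gradient descent. It is our own illustration, not the released code at the repository above; the sizes, penalty values, learning rate, and iteration count are hypothetical, and the CE loss can be swapped for any loss from Section 2.1.

```python
import torch
import torch.nn.functional as F

K, n, d = 10, 20, 32                     # K classes, n samples per class, feature dimension d
N = K * n
lam_W, lam_H, lam_b = 5e-3, 5e-3, 5e-3   # penalty parameters lambda_W, lambda_H, lambda_b in Eq. (7)

# Free optimization variables of the unconstrained feature model.
W = torch.randn(K, d, requires_grad=True)
H = torch.randn(d, N, requires_grad=True)
b = torch.zeros(K, requires_grad=True)
labels = torch.arange(K).repeat(n)       # column i*K + k of H holds the feature h_{k,i}

opt = torch.optim.SGD([W, H, b], lr=0.1)
for step in range(2000):
    opt.zero_grad()
    logits = W @ H + b.unsqueeze(1)              # column j gives W h_j + b, shape (K, N)
    loss = F.cross_entropy(logits.t(), labels)   # 1/N sum_{k,i} L(W h_{k,i} + b, y_k)
    loss = loss + 0.5 * (lam_W * W.pow(2).sum() + lam_H * H.pow(2).sum() + lam_b * b.pow(2).sum())
    loss.backward()
    opt.step()
```

According to the analysis in Section 3 (under its assumptions, e.g., a sufficiently large feature dimension), any global minimizer of this program exhibits the NC structure: within-class features collapse to their class means, and the class means and classifier rows form a simplex ETF.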
As the NC phenomenon for the learned features and classifiers was first discovered for neural networks trained with the CE loss [5], the CE loss has mostly been studied through the above simplified unconstrained feature model [8, 9, 11, 12, 16] to understand the NC phenomenon. The works [7, 10, 14, 22] also studied the MSE loss, but the analysis there shows that the solutions for the learned features and classifiers depend crucially on the bias term, while for the CE loss the presence or absence of the bias term has no effect on the learned features and classifiers under the unconstrained feature model. Other losses, such as focal loss and label smoothing, have been less studied, though they are widely employed in practice to obtain better performance. They will be the subject of the next section.
3 Understanding Loss Functions Through Unconstrained Features Model
In this section, we study the effect of different loss functions through the unconstrained features model. We will first present a contrastive property for a general loss function $\mathcal{L}_{\mathrm{GL}}$ in Definition 1. We will then study the global optimality conditions in terms of the learned features and classifiers, as well as the geometric properties of (7), with such a general loss function $\mathcal{L}_{\mathrm{GL}}$.
3.1 A Contrastive Property for the Loss Functions
In this paper, we aim to provide a unified analysis for different loss functions. Towards that goal, we first present some common properties behind CE, FL, and LS to motivate the discussion. Taking CE as an example, we can lower bound it by
$$\mathcal{L}_{\mathrm{CE}}(z, y_k) \;\ge\; \log\left(1 + (K-1)\exp\left(\frac{\sum_{j \ne k}(z_j - z_k)}{K-1}\right)\right) \;=\; \phi_{\mathrm{CE}}\Big(\sum_{j \ne k}(z_j - z_k)\Big), \tag{8}$$
where $\phi_{\mathrm{CE}}(t) = \log\left(1 + (K-1)\exp\left(\frac{t}{K-1}\right)\right)$, and the inequality becomes an equality when $z_j = z_{j'}$ for all $j, j' \ne k$. This requirement is reasonable because the commonly used losses treat all the outputs except for the $k$-th output $z_k$ identically. Since $\phi_{\mathrm{CE}}$ is an increasing function, minimizing the CE loss $\mathcal{L}_{\mathrm{CE}}(z, y_k)$ is equivalent to maximizing $(K-1) z_k - \sum_{j \ne k} z_j$, which contrasts the $k$-th output $z_k$ simultaneously with all the other outputs $z_j$ for $j \ne k$. Thus, we call (8) a contrastive property. Maximizing $(K-1) z_k - \sum_{j \ne k} z_j$ would lead to a positive (and relatively large) $z_k$ and negative (and relatively small) $z_j$. In particular, within the unit sphere $\|z\|_2 = 1$, $(K-1) z_k - \sum_{j \ne k} z_j$ achieves its maximum when $z_k = \sqrt{\frac{K-1}{K}}$ and $z_j = -\sqrt{\frac{1}{K(K-1)}}$ for all $j \ne k$, which satisfies the requirement $z_j = z_{j'}$ for all $j, j' \ne k$. Thus, $z_k = \sqrt{\frac{K-1}{K}}$ and $z_j = -\sqrt{\frac{1}{K(K-1)}}$ is also the global minimizer for $\phi_{\mathrm{CE}}$ within the unit sphere $\|z\|_2 = 1$. As the global minimizer is unique for each class, it encourages intra-class compactness. On the other hand, the minimizers for different classes are maximally distant, promoting inter-class separability.
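The bound (8) is easy to check numerically; the snippet below (our own sketch, not part of the paper) evaluates both sides for random logits and confirms equality when all non-target logits coincide.

```python
import torch
import torch.nn.functional as F

def phi_ce(t: torch.Tensor, K: int) -> torch.Tensor:
    """phi_CE(t) = log(1 + (K - 1) * exp(t / (K - 1))), the lower bound in Eq. (8)."""
    return torch.log1p((K - 1) * torch.exp(t / (K - 1)))

K, k = 5, 2
z = torch.randn(K)
t = z.sum() - K * z[k]                                    # equals sum_{j != k} (z_j - z_k)
ce = F.cross_entropy(z.unsqueeze(0), torch.tensor([k]))
assert ce >= phi_ce(t, K)                                 # the inequality in Eq. (8)

z_eq = torch.full((K,), -1.0); z_eq[k] = 2.0              # all non-target logits equal
t_eq = z_eq.sum() - K * z_eq[k]
ce_eq = F.cross_entropy(z_eq.unsqueeze(0), torch.tensor([k]))
assert torch.allclose(ce_eq, phi_ce(t_eq, K))             # equality case of Eq. (8)
```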
Motivated by the above discussion, we now introduce the following properties for a general loss function $\mathcal{L}_{\mathrm{GL}}(z, y_k)$.

Definition 1 (Contrastive property). We say a loss function $\mathcal{L}_{\mathrm{GL}}(z, y_k)$ satisfies the contrastive property if there exists a function $\phi$ such that $\mathcal{L}_{\mathrm{GL}}(z, y_k)$ can be lower bounded by
$$\mathcal{L}_{\mathrm{GL}}(z, y_k) \;\ge\; \phi\Big(\sum_{j \ne k}(z_j - z_k)\Big), \tag{9}$$
where the equality holds only when $z_j = z_{j'}$ for all $j, j' \ne k$. Moreover, $\phi(t)$ satisfies
$$t^\star = \arg\min_{t}\ \phi(t) + c\,|t| \ \text{ is unique for any } c > 0, \ \text{ and } \ t^\star \le 0. \tag{10}$$
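Assuming the reconstruction of (10) above, a coarse grid search (our own check, not from the paper) suggests that $\phi_{\mathrm{CE}}$ indeed satisfies it: for each $c > 0$ the minimizer of $\phi_{\mathrm{CE}}(t) + c|t|$ is unique and non-positive.

```python
import torch

K = 5

def phi_ce(t: torch.Tensor) -> torch.Tensor:
    # phi_CE(t) = log(1 + (K - 1) * exp(t / (K - 1)))
    return torch.log1p((K - 1) * torch.exp(t / (K - 1)))

# Grid search over t for several values of c; the minimizer is non-positive in every case
# (it sits strictly below zero for small c and at t = 0 once c is large enough).
t_grid = torch.linspace(-50.0, 50.0, 200001)
for c in [0.05, 0.1, 0.5, 1.0]:
    obj = phi_ce(t_grid) + c * t_grid.abs()
    t_star = t_grid[obj.argmin()]
    print(f"c = {c:4.2f}: t_star = {t_star.item():8.3f}")
```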