Are All Losses Created Equal:
A Neural Collapse Perspective
Jinxin Zhou
Ohio State University
zhou.3820@osu.edu
Chong You
Google Research
cyou@google.com
Xiao Li
University of Michigan
xlxiao@umich.edu
Kangning Liu
New York University
kl3141@nyu.edu
Sheng Liu
New York University
shengliu@nyu.edu
Qing Qu
University of Michigan
qingqu@umich.edu
Zhihui Zhu
Ohio State University
zhu.3440@osu.edu
Abstract
While cross entropy (CE) is the most commonly used loss function to train deep neural networks for classification tasks, many alternative losses have been developed to obtain better empirical performance. Among them, which one is the best to use is still a mystery, because there seem to be multiple factors affecting the answer, such as properties of the dataset, the choice of network architecture, and so on. This paper studies the choice of loss function by examining the last-layer features of deep networks, drawing inspiration from a recent line of work showing that the global optimal solutions of the CE and mean-square-error (MSE) losses exhibit a Neural Collapse (NC) phenomenon. That is, for sufficiently large networks trained until convergence, (i) all features of the same class collapse to the corresponding class mean and (ii) the means associated with different classes are in a configuration where their pairwise distances are all equal and maximized. We extend such results and show through global solution and landscape analyses that a broad family of loss functions, including the commonly used label smoothing (LS) and focal loss (FL), exhibits NC. Hence, all relevant losses (i.e., CE, LS, FL, MSE) produce equivalent features on training data. In particular, based on the unconstrained feature model assumption, we provide a global landscape analysis for the LS loss and a local landscape analysis for the FL loss, showing that the (only!) global minimizers are NC solutions, while all other critical points are strict saddles whose Hessians exhibit negative curvature directions, either globally for the LS loss or locally for the FL loss near the optimal solution. The experiments further show that NC features obtained from all relevant losses (i.e., CE, LS, FL, MSE) lead to largely identical performance on test data as well, provided that the network is sufficiently large and trained until convergence. The source code is available at https://github.com/jinxinzhou/nc_loss.
1 Introduction
The loss function is an indispensable component in the training of deep neural networks (DNNs). While the cross-entropy (CE) loss is one of the most popular choices for classification tasks, studies over the past few years have suggested many improved versions of CE that bring better empirical performance. Some notable examples include label smoothing (LS) [1], where the one-hot label is replaced by a smoothed label, focal loss (FL) [2], which puts more emphasis on hard misclassified samples and
reduces the relative loss on already well-classified samples, and so on. Aside from CE and its variants, the mean squared error (MSE) loss, which is typically used for regression tasks, has recently been demonstrated to have competitive performance compared to CE for classification tasks [3].
Despite the existence of many loss functions, there is, however, a lack of consensus as to which one is the best to use, and the answer seems to depend on multiple factors such as properties of the dataset, choice of network architecture, and so on [4]. In this work, we aim to understand the effect of the loss function in classification tasks from the perspective of characterizing the last-layer features and classifier of a DNN trained under different losses. Our study is motivated by a sequence of recent works that identify an intriguing Neural Collapse (NC) phenomenon in trained networks, which refers to the following properties of the last-layer features and classifier:
(i) Variability Collapse: all features of the same class collapse to the corresponding class mean.
(ii) Convergence to Simplex ETF: the means associated with different classes are in a Simplex Equiangular Tight Frame (ETF) configuration where their pairwise distances are all equal and maximized.
(iii) Convergence to Self-duality: the class means are ideally aligned with the last-layer linear classifiers.
(iv) Simple Decision Rule: the last-layer classifier is equivalent to a Nearest Class-Center decision rule.
This NC phenomenon was first discovered by Papyan et al. [5, 6] for canonical classification problems trained with the CE loss. Following the work on the CE loss, Han et al. [7] recently reported that DNNs trained with the MSE loss for classification problems also exhibit a similar NC phenomenon. These results imply that deep networks essentially learn maximally separable features between classes, together with a max-margin classifier in the last layer built upon these learned features. The intriguing empirical observation motivated a surge of theoretical investigations [7–22], mostly under a simplified unconstrained feature model [10] or layer-peeled model [12] that treats the last-layer features of each sample before the final classifier as free optimization variables. Under the simplified unconstrained feature model, it has been proved that the NC solution is the only global optimal solution for the CE and MSE losses, which have also been proved to have a benign global landscape, explaining why the global NC solution can be obtained.
Contributions. While previous works provide a thorough analysis of NC under the CE and MSE losses, theoretical analysis beyond CE and MSE is still limited, and each work focuses on one specific loss without a general formulation. In this paper, we consider a broad family of loss functions that includes CE and some other popular loss functions, such as LS and FL, as special cases. Under the unconstrained feature model, we theoretically demonstrate in Section 3 that the NC solution is the only global optimal solution to this family of loss functions. Moreover, we provide a landscape analysis, showing that the LS loss function is a (globally) strict saddle function and the FL loss function is a local strict saddle function [23–25]. A (local) strict saddle function is a function for which every critical point is either a global solution or a strict saddle point with negative curvature (locally). Hence, our result suggests that iterative optimizers can escape strict saddle points and converge to the global solution corresponding to NC for LS and FL. As far as we know, this paper is the first work that conducts global optimal solution and benign optimization landscape analyses beyond the scope of the CE and MSE losses.
Our theoretical results explained above have important implications for understanding the role of the loss function in training DNNs for classification tasks. Because all losses lead to NC solutions, their corresponding features are equivalent up to a rotation of the feature space. In other words, our analysis provides a theoretical justification for the following claim:

All losses (i.e., CE, LS, FL, MSE) lead to largely identical features on training data when using large DNNs trained for sufficiently many iterations.

We verify this claim through experiments in Section 4.1.
While NC reveals that all losses are equivalent at training time, it does not have a direct implication for the features associated with test data or for the generalization performance [26]. In particular, a recent work [27] shows empirically that NC does not occur for the features associated with test data. Nonetheless, we show through empirical evidence that for large DNNs, NC on training data is a good predictor of the test performance. In particular, our empirical study in Section 4.2 shows the following:

All losses (CE, LS, FL, MSE) lead to largely identical performance on test data for large DNNs.
Our conclusion that all losses are created equal appears to go against existing evidence on the advantages of some losses over others. Here we emphasize that our conclusion has an important premise, namely that the neural network has sufficient approximation power and the training is performed for sufficiently many iterations. Hence, our conclusion implies that the better performance observed with particular choices of loss function arises when the training does not produce a globally optimal (i.e., NC) solution. In such cases, different losses lead to different solutions on the training data, and correspondingly different performance on test data. Such an understanding may provide important practical guidance on which loss to choose in different settings (e.g., different model sizes and different training time budgets), as well as for the design of new and better losses in the future. We note that our conclusion is based on natural accuracy, rather than model transferability or robustness, which are worth additional effort to explore and are left as future work.
2 The Problem Setup
A typical deep neural network $\Psi(\cdot): \mathbb{R}^D \mapsto \mathbb{R}^K$ consists of a multi-layer nonlinear compositional feature mapping $\Phi(\cdot): \mathbb{R}^D \mapsto \mathbb{R}^d$ and a linear classifier $(W, b)$, which can be generally expressed as
$$\Psi_\Theta(x) = W \Phi_\theta(x) + b, \tag{1}$$
where we use $\theta$ to represent the network parameters in the feature mapping and $W \in \mathbb{R}^{K \times d}$ and $b \in \mathbb{R}^K$ to represent the linear classifier's weight and bias, respectively. Therefore, all the network parameters form the set $\Theta = \{\theta, W, b\}$. For the input $x$, the output of the feature mapping $\Phi_\theta(x)$ is usually termed as the representation or feature learned from the network.
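To make the decomposition in (1) concrete, the following PyTorch-style sketch (our own illustration, not the authors' released code; the class and layer sizes are hypothetical) wraps an arbitrary backbone playing the role of $\Phi_\theta$ and a single linear layer playing the role of $(W, b)$:

```python
import torch
import torch.nn as nn

class DecomposedNet(nn.Module):
    """Psi_Theta(x) = W Phi_theta(x) + b, mirroring Eq. (1)."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                             # feature mapping Phi_theta : R^D -> R^d
        self.classifier = nn.Linear(feat_dim, num_classes)   # linear classifier (W, b) : R^d -> R^K

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.backbone(x)         # last-layer feature h = Phi_theta(x)
        return self.classifier(h)    # logits z = W h + b

# Hypothetical toy instantiation with D = 32, d = 16, K = 10.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
model = DecomposedNet(backbone, feat_dim=16, num_classes=10)
logits = model(torch.randn(8, 32))   # shape (8, 10)
```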
With an appropriate loss function, the parameters $\Theta$ of the whole network are optimized to learn the underlying relation from an input sample $x$ to its corresponding target $y$, so that the output of the network $\Psi_\Theta(x)$ approximates the corresponding target, i.e., $\Psi_\Theta(x) \approx y$ in terms of the expectation over a distribution $\mathcal{D}$ of input-output data pairs $(x, y)$. While it is hard to get access to the ground-truth distribution $\mathcal{D}$ in most cases, one can approximate $\mathcal{D}$ by sampling enough data pairs i.i.d. from $\mathcal{D}$. In this paper, we study multi-class balanced classification tasks with $K$ classes and $n$ samples per class, where we use the one-hot vector $y_k \in \mathbb{R}^K$, with unity only in the $k$-th entry ($1 \le k \le K$), to denote the label of the $i$-th sample $x_{k,i} \in \mathbb{R}^D$ in the $k$-th class. We then learn the parameters $\Theta$ by minimizing the following empirical risk over the total $N = nK$ training samples
$$\min_{\Theta}\ \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{n} \mathcal{L}\big(\Psi_\Theta(x_{k,i}), y_k\big) + \frac{\lambda}{2} \|\Theta\|_F^2, \tag{2}$$
where $\lambda > 0$ is the regularization parameter (a.k.a. the weight decay parameter²) and $\mathcal{L}\big(\Psi_\Theta(x_{k,i}), y_k\big)$ is a predefined loss function that appropriately measures the difference between the output $\Psi_\Theta(x_{k,i})$ and the target $y_k$. Some common loss functions used for training deep neural networks will be specified in the next section.
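In practice, the penalty $\frac{\lambda}{2}\|\Theta\|_F^2$ in (2) is typically realized through the optimizer's weight-decay option rather than as an explicit term in the loss. A minimal sketch of one training step (illustrative only; the model, batch, and hyperparameter values are hypothetical):

```python
import torch
import torch.nn.functional as F

lam = 5e-4   # hypothetical weight-decay value, i.e. lambda in Eq. (2)
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
# weight_decay=lam adds the gradient of (lam/2) * ||Theta||_F^2 to every parameter update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=lam)

x = torch.randn(128, 32)               # a mini-batch of inputs
y = torch.randint(0, 10, (128,))       # class indices
optimizer.zero_grad()
loss = F.cross_entropy(model(x), y)    # the loss term L in Eq. (2); other losses plug in here
loss.backward()
optimizer.step()
```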
2.1 Commonly Used Training Losses
In this subsection, we first present four common loss functions for classification tasks. To simplify the notation, let $z = W\Phi_\theta(x) + b$ denote the network's output ("logit") vector for the input $x$, and assume $z$ belongs to the $k$-th class. Also let $y^{\mathrm{smooth}}_k = (1 - \alpha) y_k + \frac{\alpha}{K}\mathbf{1}_K$ denote the smoothed target of the $k$-th class, where $0 \le \alpha < 1$ and $\mathbf{1}_K \in \mathbb{R}^K$ is a vector with all entries equal to one. We will use $z_\ell$, $y_{k,\ell}$, and $y^{\mathrm{smooth}}_{k,\ell}$ to denote the $\ell$-th entry of $z$, $y_k$, and $y^{\mathrm{smooth}}_k$, respectively, where $y^{\mathrm{smooth}}_{k,k} = 1 - \frac{K-1}{K}\alpha$ and $y^{\mathrm{smooth}}_{k,\ell} = \frac{\alpha}{K}$ for $\ell \ne k$.
² Without weight decay, the features and classifiers will tend to blow up for CE and many other losses.
Cross entropy (CE) is perhaps the most common loss for multi-class classification in deep learning. It measures the distance between the target distribution $y_k$ and the network output distribution obtained by applying the softmax function to $z$, resulting in the following expression:
$$\mathcal{L}_{\mathrm{CE}}(z, y_k) = -\log\left(\frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}\right). \tag{3}$$
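As a quick sanity check (our own illustrative snippet, not from the paper), (3) is just the negative log-softmax of the target logit and coincides with the library cross entropy:

```python
import torch
import torch.nn.functional as F

def ce_loss(z: torch.Tensor, k: int) -> torch.Tensor:
    """Eq. (3): -log( exp(z_k) / sum_j exp(z_j) ) for a single logit vector z."""
    return -F.log_softmax(z, dim=-1)[k]

z, k = torch.randn(5), 2   # K = 5 logits, true class k = 2
assert torch.allclose(ce_loss(z, k), F.cross_entropy(z.unsqueeze(0), torch.tensor([k])))
```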
Focal loss (FL) [2] was first proposed to deal with the extreme foreground-background class imbalance in dense object detection; it adaptively focuses less on the well-classified samples. Recent work [28, 29] reports that focal loss also improves calibration and automatically forms curriculum learning in the multi-class classification setting. Letting $\gamma \ge 0$ denote the tunable focusing parameter, the focal loss can be expressed as:
$$\mathcal{L}_{\mathrm{FL}}(z, y_k) = -\left(1 - \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}\right)^{\gamma} \log\left(\frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}\right). \tag{4}$$
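A direct implementation of (4) (an illustrative sketch under our notation, not the authors' code); note that $\gamma = 0$ recovers the CE loss in (3):

```python
import torch
import torch.nn.functional as F

def focal_loss(z: torch.Tensor, k: int, gamma: float = 2.0) -> torch.Tensor:
    """Eq. (4): -(1 - p_k)^gamma * log(p_k), where p_k = softmax(z)_k."""
    log_p = F.log_softmax(z, dim=-1)
    p_k = log_p[k].exp()
    return -((1.0 - p_k) ** gamma) * log_p[k]

z, k = torch.randn(5), 2
# gamma = 0 reduces the focal loss to the CE loss of Eq. (3).
assert torch.allclose(focal_loss(z, k, gamma=0.0),
                      F.cross_entropy(z.unsqueeze(0), torch.tensor([k])))
```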
Label smoothing (LS) [1] replaces the hard targets in CE with smoothed targets $y^{\mathrm{smooth}}_k$ obtained from mixing the original targets $y_k$ with a uniform distribution over all entries. Experiments in [30, 31] find that classification models trained with label smoothing have better calibration and generalization. Denoting by $0 \le \alpha \le 1$ the tunable smoothing parameter, the label smoothing loss function can be formulated as:
$$\mathcal{L}_{\mathrm{LS}}(z, y_k) = -\sum_{\ell=1}^{K} y^{\mathrm{smooth}}_{k,\ell} \log\left(\frac{\exp(z_\ell)}{\sum_{j=1}^{K} \exp(z_j)}\right). \tag{5}$$
When $\alpha = 0$, the above label smoothing loss reduces to the CE loss.
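A direct implementation of (5) (illustrative sketch, not the paper's code). Setting $\alpha = 0$ recovers CE, as stated above; recent PyTorch versions also expose an equivalent `label_smoothing` argument in `F.cross_entropy`.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(z: torch.Tensor, k: int, alpha: float = 0.1) -> torch.Tensor:
    """Eq. (5): cross entropy against the smoothed target y_k^smooth."""
    K = z.shape[-1]
    y_smooth = torch.full((K,), alpha / K)
    y_smooth[k] += 1.0 - alpha          # target entry becomes 1 - (K-1)/K * alpha
    return -(y_smooth * F.log_softmax(z, dim=-1)).sum()

z, k = torch.randn(5), 2
# alpha = 0 recovers the CE loss of Eq. (3).
assert torch.allclose(label_smoothing_loss(z, k, alpha=0.0),
                      F.cross_entropy(z.unsqueeze(0), torch.tensor([k])))
```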
Mean squared error (MSE) is often used for regression but not classification tasks. The recent work [3] shows that classification networks trained with the MSE loss achieve on-par performance compared to those trained with the CE loss. Throughout our paper, we use the rescaled MSE version [3]:
$$\mathcal{L}_{\mathrm{MSE}}(z, y_k) = \kappa (z_k - \beta)^2 + \sum_{\ell \ne k}^{K} z_\ell^2, \tag{6}$$
where $\kappa > 0$ and $\beta > 0$ are hyperparameters.
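A direct implementation of (6) (illustrative sketch; the specific $\kappa$ and $\beta$ values are left as arguments since they are hyperparameters). With $\kappa = \beta = 1$ this reduces to the plain MSE against the one-hot target $y_k$.

```python
import torch

def rescaled_mse_loss(z: torch.Tensor, k: int, kappa: float, beta: float) -> torch.Tensor:
    """Eq. (6): kappa * (z_k - beta)^2 + sum_{l != k} z_l^2."""
    sq = z ** 2
    return kappa * (z[k] - beta) ** 2 + sq.sum() - sq[k]

z, k = torch.randn(5), 2
# kappa = beta = 1 gives the ordinary MSE against the one-hot vector y_k.
assert torch.allclose(rescaled_mse_loss(z, k, kappa=1.0, beta=1.0),
                      ((z - torch.eye(5)[k]) ** 2).sum())
```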
2.2 Problem Formulation Based on Unconstrained Feature Models
Because of the interaction between a large number of nonlinear layers in the feature mapping $\Phi_\theta$, it is tremendously challenging to analyze the optimization of deep neural networks. To reduce this difficulty, a series of recent works theoretically studying the NC phenomenon use a so-called unconstrained feature model (or layer-peeled model in [12]), which treats the last-layer features $h = \Phi(x) \in \mathbb{R}^d$ as free optimization variables. The rationale behind the unconstrained feature model is that modern highly overparameterized deep networks are able to approximate any continuous function [32–35] and the characterization of NC only involves the last-layer features. We adopt the same approach and study the effects of different training losses on the last-layer representations of the network under the unconstrained feature model. For convenience, let us denote
$$W := \begin{bmatrix} w_1 & w_2 & \cdots & w_K \end{bmatrix}^\top \in \mathbb{R}^{K \times d}, \quad H := \begin{bmatrix} H_1 & H_2 & \cdots & H_n \end{bmatrix} \in \mathbb{R}^{d \times N}, \quad\text{and}\quad Y := \begin{bmatrix} Y_1 & Y_2 & \cdots & Y_K \end{bmatrix} \in \mathbb{R}^{K \times N},$$
where $w_k$ is the $k$-th row vector of $W$, the features are collected as $H_i := \begin{bmatrix} h_{1,i} & \cdots & h_{K,i} \end{bmatrix} \in \mathbb{R}^{d \times K}$ with $h_{k,i}$ the feature of the $i$-th sample in the $k$-th class, and $Y_k := \begin{bmatrix} y_k & \cdots & y_k \end{bmatrix} \in \mathbb{R}^{K \times n}$, for all $k = 1, 2, \cdots, K$ and $i = 1, 2, \cdots, n$. Based on the unconstrained feature model, we consider a slight variant of (2), given by
$$\min_{W, H, b}\ f(W, H, b) := \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{n} \mathcal{L}\big(W h_{k,i} + b,\, y_k\big) + \frac{\lambda_W}{2} \|W\|_F^2 + \frac{\lambda_H}{2} \|H\|_F^2 + \frac{\lambda_b}{2} \|b\|_2^2, \tag{7}$$
where $\lambda_W, \lambda_H, \lambda_b > 0$ are the penalty parameters for $W$, $H$, and $b$, respectively.
By viewing the last-layer feature matrix $H$ as a free optimization variable, the simplified objective function (7) imposes weight decay on $W$ and $H$, which is slightly different from practice, where weight decay is imposed on all the network parameters $\Theta$ as shown in (2). Nonetheless, the underlying rationale is that the weight decay on $\Theta$ implicitly penalizes the energy of the features (i.e., $\|H\|_F$) [16].
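The following sketch makes (7) concrete by treating $W$, $H$, and $b$ as free variables and minimizing the regularized objective by full-batch gradient descent. It is our own illustration, not the released code at the repository above; the sizes, penalty values, learning rate, and iteration count are hypothetical, and the CE loss can be swapped for any loss from Section 2.1.

```python
import torch
import torch.nn.functional as F

K, n, d = 10, 20, 32                     # K classes, n samples per class, feature dimension d
N = K * n
lam_W, lam_H, lam_b = 5e-3, 5e-3, 5e-3   # penalty parameters lambda_W, lambda_H, lambda_b in Eq. (7)

# Free optimization variables of the unconstrained feature model.
W = torch.randn(K, d, requires_grad=True)
H = torch.randn(d, N, requires_grad=True)
b = torch.zeros(K, requires_grad=True)
labels = torch.arange(K).repeat(n)       # column i*K + k of H holds the feature h_{k,i}

opt = torch.optim.SGD([W, H, b], lr=0.1)
for step in range(2000):
    opt.zero_grad()
    logits = W @ H + b.unsqueeze(1)              # column j gives W h_j + b, shape (K, N)
    loss = F.cross_entropy(logits.t(), labels)   # 1/N sum_{k,i} L(W h_{k,i} + b, y_k)
    loss = loss + 0.5 * (lam_W * W.pow(2).sum() + lam_H * H.pow(2).sum() + lam_b * b.pow(2).sum())
    loss.backward()
    opt.step()
```

According to the analysis in Section 3 (under its assumptions, e.g., a sufficiently large feature dimension), any global minimizer of this program exhibits the NC structure: within-class features collapse to their class means, and the class means and classifier rows form a simplex ETF.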
As the NC phenomenon for the learned features and classifiers was first discovered for neural networks trained with the CE loss [5], the CE loss has mostly been studied through the above simplified unconstrained feature model [8, 9, 11, 12, 16] to understand the NC phenomenon. The works [7, 10, 14, 22] also studied the MSE loss, but the analysis there shows that the solutions for the learned features and classifiers depend crucially on the bias term, while for the CE loss the presence or absence of the bias term has no effect on the learned features and classifiers under the unconstrained feature model. Other losses, such as focal loss and label smoothing, have been less studied, though they are widely employed in practice to obtain better performance. They will be the subject of the next section.
3 Understanding Loss Functions Through Unconstrained Features Model
In this section, we study the effect of different loss functions through the unconstrained features model. We will first present a contrastive property for a general loss function $\mathcal{L}_{\mathrm{GL}}$ in Definition 1. We will then study the global optimality conditions in terms of the learned features and classifiers, as well as the geometric properties of (7), with such a general loss function $\mathcal{L}_{\mathrm{GL}}$.
3.1 A Contrastive Property for the Loss Functions
In this paper, we aim to provide a unified analysis for different loss functions. Towards that goal, we first present some common properties behind CE, FL, and LS to motivate the discussion. Taking CE as an example, we can lower bound it by
$$\mathcal{L}_{\mathrm{CE}}(z, y_k) \;\ge\; \log\left(1 + (K-1)\exp\left(\frac{\sum_{j \ne k}(z_j - z_k)}{K-1}\right)\right) \;=\; \phi_{\mathrm{CE}}\Big(\sum_{j \ne k}(z_j - z_k)\Big), \tag{8}$$
where $\phi_{\mathrm{CE}}(t) = \log\left(1 + (K-1)\exp\left(\frac{t}{K-1}\right)\right)$, and the inequality becomes an equality when $z_j = z_{j'}$ for all $j, j' \ne k$. This requirement is reasonable because the commonly used losses treat all the outputs except for the $k$-th output $z_k$ identically. Since $\phi_{\mathrm{CE}}$ is an increasing function, minimizing the CE loss $\mathcal{L}_{\mathrm{CE}}(z, y_k)$ is equivalent to maximizing $(K-1) z_k - \sum_{j \ne k} z_j$, which contrasts the $k$-th output $z_k$ simultaneously with all the other outputs $z_j$ for $j \ne k$. Thus, we call (8) a contrastive property. Maximizing $(K-1) z_k - \sum_{j \ne k} z_j$ would lead to a positive (and relatively large) $z_k$ and negative (and relatively small) $z_j$. In particular, within the unit sphere $\|z\|_2 = 1$, $(K-1) z_k - \sum_{j \ne k} z_j$ achieves its maximum when $z_k = \sqrt{\frac{K-1}{K}}$ and $z_j = -\sqrt{\frac{1}{K(K-1)}}$ for all $j \ne k$, which satisfies the requirement $z_j = z_{j'}$ for all $j, j' \ne k$. Thus, $z_k = \sqrt{\frac{K-1}{K}}$ and $z_j = -\sqrt{\frac{1}{K(K-1)}}$ is also the global minimizer for $\phi_{\mathrm{CE}}$ within the unit sphere $\|z\|_2 = 1$. As the global minimizer is unique for each class, it encourages intra-class compactness. On the other hand, the minimizers for different classes are maximally distant, promoting inter-class separability.
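The bound (8) is easy to check numerically; the snippet below (our own sketch, not part of the paper) evaluates both sides for random logits and confirms equality when all non-target logits coincide.

```python
import torch
import torch.nn.functional as F

def phi_ce(t: torch.Tensor, K: int) -> torch.Tensor:
    """phi_CE(t) = log(1 + (K - 1) * exp(t / (K - 1))), the lower bound in Eq. (8)."""
    return torch.log1p((K - 1) * torch.exp(t / (K - 1)))

K, k = 5, 2
z = torch.randn(K)
t = z.sum() - K * z[k]                                    # equals sum_{j != k} (z_j - z_k)
ce = F.cross_entropy(z.unsqueeze(0), torch.tensor([k]))
assert ce >= phi_ce(t, K)                                 # the inequality in Eq. (8)

z_eq = torch.full((K,), -1.0); z_eq[k] = 2.0              # all non-target logits equal
t_eq = z_eq.sum() - K * z_eq[k]
ce_eq = F.cross_entropy(z_eq.unsqueeze(0), torch.tensor([k]))
assert torch.allclose(ce_eq, phi_ce(t_eq, K))             # equality case of Eq. (8)
```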
Motivated by the above discussion, we now introduce the following properties for a general loss function $\mathcal{L}_{\mathrm{GL}}(z, y_k)$.

Definition 1 (Contrastive property). We say a loss function $\mathcal{L}_{\mathrm{GL}}(z, y_k)$ satisfies the contrastive property if there exists a function $\phi$ such that $\mathcal{L}_{\mathrm{GL}}(z, y_k)$ can be lower bounded by
$$\mathcal{L}_{\mathrm{GL}}(z, y_k) \;\ge\; \phi\Big(\sum_{j \ne k}(z_j - z_k)\Big), \tag{9}$$
where the equality holds only when $z_j = z_{j'}$ for all $j, j' \ne k$. Moreover, $\phi(t)$ satisfies
$$t^\star = \arg\min_{t}\ \phi(t) + c\,|t| \ \text{ is unique for any } c > 0, \ \text{ and } \ t^\star \le 0. \tag{10}$$
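Assuming the reconstruction of (10) above, a coarse grid search (our own check, not from the paper) suggests that $\phi_{\mathrm{CE}}$ indeed satisfies it: for each $c > 0$ the minimizer of $\phi_{\mathrm{CE}}(t) + c|t|$ is unique and non-positive.

```python
import torch

K = 5

def phi_ce(t: torch.Tensor) -> torch.Tensor:
    # phi_CE(t) = log(1 + (K - 1) * exp(t / (K - 1)))
    return torch.log1p((K - 1) * torch.exp(t / (K - 1)))

# Grid search over t for several values of c; the minimizer is non-positive in every case
# (it sits strictly below zero for small c and at t = 0 once c is large enough).
t_grid = torch.linspace(-50.0, 50.0, 200001)
for c in [0.05, 0.1, 0.5, 1.0]:
    obj = phi_ce(t_grid) + c * t_grid.abs()
    t_star = t_grid[obj.argmin()]
    print(f"c = {c:4.2f}: t_star = {t_star.item():8.3f}")
```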