reduces the relative loss on the already well-classified samples, and so on. Aside from CE and its
variants, the mean squared error (MSE) loss, which has traditionally been used for regression tasks, has recently been shown to achieve performance competitive with CE on classification tasks [3]. Despite the existence of many loss functions, there is no consensus on which one is best to use, and the answer seems to depend on multiple factors such as the properties of the dataset, the choice of network architecture, and so on [4]. In this work, we aim to understand the effect of the loss function in classification tasks by characterizing the last-layer features and classifier of a DNN trained under different losses. Our study is motivated by a sequence of recent works that identify an intriguing Neural Collapse (NC) phenomenon in trained networks, which refers to the following properties of the last-layer features and classifier (a numerical sketch of these properties follows the list):
(i) Variability Collapse: all features of the same class collapse to the corresponding class mean.
(ii) Convergence to Simplex ETF: the means associated with different classes are in a Simplex Equiangular Tight Frame (ETF) configuration, where their pairwise distances are all equal and maximized.
(iii) Convergence to Self-duality: the class means are ideally aligned with the last-layer linear classifiers.
(iv) Simple Decision Rule: the last-layer classifier is equivalent to a Nearest Class-Center decision rule.
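To make properties (i)–(iv) concrete, the following is a minimal NumPy sketch of how they can be checked numerically on trained features. The quantities are simplified proxies rather than the exact metrics of [5, 6]; `features`, `labels`, and `W` denote the last-layer features, labels, and classifier weights, and the classifier bias is ignored for simplicity.

```python
# Minimal sketch of NC checks; simplified proxies, not the exact metrics of [5, 6].
import numpy as np

def nc_metrics(features, labels, W):
    """features: (N x d), labels: (N,) integers in [0, K), W: (K x d)."""
    K = int(labels.max()) + 1
    mu_G = features.mean(axis=0)                     # global feature mean
    mu = np.stack([features[labels == k].mean(axis=0) for k in range(K)])
    M = mu - mu_G                                    # centered class means, (K x d)

    # (i) Variability collapse: within-class scatter should vanish relative to
    # between-class scatter (a simplified trace-ratio proxy).
    Sw = sum(np.cov(features[labels == k].T) for k in range(K)) / K
    Sb = np.cov(M.T)
    nc1 = np.trace(Sw) / np.trace(Sb)

    # (ii) Simplex ETF: centered class means have equal norms and pairwise
    # cosines approaching -1/(K-1).
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
    off_diag = (Mn @ Mn.T)[~np.eye(K, dtype=bool)]
    nc2 = np.abs(off_diag + 1.0 / (K - 1)).mean()

    # (iii) Self-duality: after normalization, the classifier matches the
    # centered class means.
    nc3 = np.linalg.norm(W / np.linalg.norm(W) - M / np.linalg.norm(M))

    # (iv) Nearest Class-Center rule: fraction of samples on which the linear
    # classifier and the NCC rule disagree.
    lin_pred = (features @ W.T).argmax(axis=1)
    ncc_pred = (-2 * features @ mu.T + (mu ** 2).sum(axis=1)).argmin(axis=1)
    nc4 = (lin_pred != ncc_pred).mean()
    return nc1, nc2, nc3, nc4
```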
The NC phenomenon was first discovered by Papyan et al. [5, 6] for canonical classification problems trained with the CE loss. Beyond the CE loss, Han et al. [7] recently reported that DNNs trained with the MSE loss for classification also exhibit a similar NC phenomenon. These results imply that deep networks essentially learn maximally separable features between classes, together with a max-margin classifier in the last layer on top of these learned features. This intriguing empirical observation has motivated a surge of theoretical investigation [7–22], mostly under a simplified unconstrained feature model [10] or layer-peeled model [12] that treats the last-layer features of each sample, i.e., the inputs to the final classifier, as free optimization variables. Under the simplified unconstrained feature model, it has been proved that the NC solution is the only globally optimal solution for the CE and MSE losses, which are also proved to have a benign global landscape, explaining why the global NC solution can be obtained.
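To illustrate, here is a minimal sketch of the unconstrained feature (layer-peeled) model under the CE loss: the feature of every training sample is a free variable optimized jointly with the classifier, so the backbone network is removed entirely. The dimensions, learning rate, and regularization weights below are illustrative placeholders, not the settings used in the cited works.

```python
# Sketch of the unconstrained feature / layer-peeled model (illustrative settings).
import torch
import torch.nn.functional as F

K, n_per_class, d = 10, 100, 512           # classes, samples per class, feature dim
N = K * n_per_class
labels = torch.arange(K).repeat_interleave(n_per_class)

H = torch.randn(N, d, requires_grad=True)  # free "features", one row per sample
W = torch.randn(K, d, requires_grad=True)  # last-layer linear classifier
b = torch.zeros(K, requires_grad=True)

opt = torch.optim.SGD([H, W, b], lr=0.1)
lam_H, lam_W = 5e-4, 5e-4                  # weight decay on features and classifier

for step in range(10000):
    opt.zero_grad()
    logits = H @ W.T + b
    loss = (F.cross_entropy(logits, labels)
            + lam_H * H.square().sum() / N + lam_W * W.square().sum())
    loss.backward()
    opt.step()
# At convergence, H and W are expected to exhibit the NC properties (i)-(iv).
```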
Contributions. While previous works provide a thorough analysis of NC under the CE and MSE losses, theoretical analysis beyond these two losses is still limited, and each existing work focuses on one specific loss without a general formulation. In this paper, we consider a broad family of loss functions that includes CE and other popular loss functions, such as LS and FL, as special cases. Under the unconstrained feature model, we theoretically demonstrate in Section 3 that the NC solution is the only globally optimal solution for this family of loss functions. Moreover, we provide a global landscape analysis, showing that the LS loss is a strict saddle function and the FL loss is a local strict saddle function [23–25]. A (local) strict saddle function is a function for which every critical point is either a global solution or a strict saddle point with negative curvature (locally). Hence, our results suggest that optimizers can escape strict saddle points and converge to the global solution corresponding to NC for LS and FL. To the best of our knowledge, this paper is the first work that establishes the global optimality of the NC solution and a benign optimization landscape beyond the scope of the CE and MSE losses.
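For concreteness, the sketch below writes out the standard forms of the non-MSE losses in this family (CE, LS with smoothing parameter alpha, FL with focusing parameter gamma). The general parameterized family itself is defined in Section 3; the hyperparameter defaults here are arbitrary illustrative choices.

```python
# Standard definitions of the CE, LS, and FL losses (illustrative hyperparameters).
import torch
import torch.nn.functional as F

def ce_loss(logits, y):
    return F.cross_entropy(logits, y)

def ls_loss(logits, y, alpha=0.1):
    # Label smoothing: mix the one-hot target with the uniform distribution.
    K = logits.shape[1]
    log_p = F.log_softmax(logits, dim=1)
    target = F.one_hot(y, K).float() * (1 - alpha) + alpha / K
    return -(target * log_p).sum(dim=1).mean()

def fl_loss(logits, y, gamma=2.0):
    # Focal loss: down-weight well-classified samples by (1 - p_y)^gamma.
    log_p = F.log_softmax(logits, dim=1)
    log_py = log_p.gather(1, y.unsqueeze(1)).squeeze(1)
    return -(((1 - log_py.exp()) ** gamma) * log_py).mean()
```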
The theoretical results described above have important implications for understanding the role of the loss function in training DNNs for classification tasks. Because all of these losses lead to NC solutions, their corresponding features are equivalent up to a rotation of the feature space. In other words, our analysis provides a theoretical justification for the following claim:
All losses (i.e., CE, LS, FL, MSE) lead to largely identical features on training data for sufficiently large DNNs trained with sufficiently many iterations.
We also verify this claim through experiments in Section 4.1.
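As one illustration of how such a claim can be tested (an assumed setup, not necessarily the protocol of Section 4.1), linear centered kernel alignment (CKA) is invariant to rotations of the feature space, so two feature matrices that are identical up to a rotation yield a score close to 1. The matrices `feats_ce` and `feats_fl` below are hypothetical features extracted from two networks trained with different losses on the same data.

```python
# Rotation-invariant feature comparison via linear CKA (assumed illustration).
import numpy as np

def linear_cka(X, Y):
    """X, Y: (N x d) feature matrices for the same N training samples."""
    X = X - X.mean(axis=0)              # center each feature dimension
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

# Hypothetical usage with penultimate-layer features of two trained networks:
# score = linear_cka(feats_ce, feats_fl)   # close to 1 if features match up to rotation
```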
While NC reveals that all losses are equivalent at training time, it has no direct implication for the features associated with test data or for the generalization performance [26]. In particular, a recent work [27] shows empirically that NC does not occur for the features associated with test data. Nonetheless, we show through empirical evidence that for large DNNs, NC on training