
A Law of Data Separation in Deep Learning
Hangfeng Hea,b and Weijie J. Suc,1
aDepartment of Computer Science, University of Rochester, Rochester, NY 14627; bGoergen Institute for Data Science, University of Rochester, Rochester, NY 14627;
cDepartment of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104
This manuscript was compiled on August 14, 2023
While deep learning has enabled significant advances in many areas
of science, its black-box nature hinders architecture design for future
artificial intelligence applications and interpretation for high-stakes
decision making. We addressed this issue by studying the fundamental question of how deep neural networks process data in the
intermediate layers. Our finding is a simple and quantitative law that
governs how deep neural networks separate data according to class
membership throughout all layers for classification. This law shows
that each layer improves data separation at a constant geometric rate,
and its emergence is observed in a collection of network architec-
tures and datasets during training. This law offers practical guidelines
for designing architectures, improving model robustness and out-of-
sample performance, as well as interpreting the predictions.
deep learning | data separation | constant geometric rate | intermediate layers
Deep learning methodologies have achieved remarkable success across a wide range of data-intensive tasks in image recognition, biological research, and scientific computing (2–5).
In contrast to other machine learning techniques (6), however,
the practice of deep learning relies heavily on a plethora of
heuristics and tricks that are not well justified. This situation
often makes deep learning-based approaches ungrounded for some applications or necessitates huge computational resources for exhaustive search, making it difficult to fully realize the potential of this set of methodologies (7).
This unfortunate situation is in part owing to a lack of understanding of how the prediction depends on the intermediate layers of deep neural networks (8–10). In particular, little is
known about how the data of different classes (e.g., images
of cats and dogs) in classification problems are gradually sep-
arated from the bottom layers to the top layers in modern
architectures such as AlexNet (2) and residual neural networks (11). Any knowledge about data separation, especially
quantitative characterization, would offer useful principles and
insights for designing network architectures, training processes,
and model interpretation.
The main finding of this paper is a quantitative delineation
of the data separation process throughout all layers of deep
neural networks. As an illustration, Fig. 1 plots a certain value
that measures how well the data are separated according to
their class membership at each layer for feedforward neural
networks trained on the Fashion-MNIST dataset (12). On a logarithmic scale, this value decays in a distinctly linear manner in the number of layers the data have passed through.
The Pearson correlation coefficients between the logarithm of this value and the layer index range from $-0.997$ to $-1$ in Fig. 1.
The measure shown in Fig. 1 is canonical for measuring data separation in classification problems. Let $x_{ki}$ denote an intermediate output of a neural network on the $i$th point of Class $k$ for $1 \le i \le n_k$, let $\bar{x}_k$ denote the sample mean of Class $k$, and let $\bar{x}$ denote the mean of all $n := n_1 + \cdots + n_K$ data points. We define the between-class sum of squares and the within-class sum of squares as
$$
SS_b := \frac{1}{n} \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^\top, \qquad
SS_w := \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)(x_{ki} - \bar{x}_k)^\top,
$$
respectively. The former matrix represents the between-class "signal" for classification, whereas the latter represents the within-class variability. Writing $SS_b^{+}$ for the Moore–Penrose inverse of $SS_b$ (the matrix $SS_b$ has rank at most $K-1$ and is not invertible in general, since the dimension of the data is typically larger than the number of classes), the ratio matrix $SS_w SS_b^{+}$ can be thought of as the inverse signal-to-noise ratio. We use its trace
$$
D := \mathrm{Tr}(SS_w SS_b^{+}) \quad [1]
$$
to measure how well the data are separated (13, 14). This value, which is referred to as the separation fuzziness, is large when the data points are not concentrated around their class means or, equivalently, are not well separated, and vice versa.
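As a minimal sketch of how the separation fuzziness of Eq. 1 can be computed with NumPy (the function name `separation_fuzziness` is ours, not from the paper):

```python
import numpy as np

def separation_fuzziness(X, y):
    """Compute D = Tr(SS_w SS_b^+) of Eq. 1.

    X: (n, d) array of features (e.g., intermediate-layer outputs).
    y: length-n array of integer class labels.
    """
    n, d = X.shape
    global_mean = X.mean(axis=0)
    SSb = np.zeros((d, d))
    SSw = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        nk = Xk.shape[0]
        class_mean = Xk.mean(axis=0)
        # Between-class term: n_k (x̄_k − x̄)(x̄_k − x̄)^T
        diff = (class_mean - global_mean)[:, None]
        SSb += nk * diff @ diff.T
        # Within-class term: Σ_i (x_ki − x̄_k)(x_ki − x̄_k)^T
        centered = Xk - class_mean
        SSw += centered.T @ centered
    SSb /= n
    SSw /= n
    # SS_b is rank-deficient in general, so use the Moore–Penrose inverse.
    return np.trace(SSw @ np.linalg.pinv(SSb))
```

Tightly clustered classes should yield a smaller $D$ than diffuse ones, matching the interpretation of $D$ as an inverse signal-to-noise ratio.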
1. Main Results
Given an $L$-layer feedforward neural network, let $D_l$ denote the separation fuzziness (Eq. 1) of the training data passing through the first $l$ layers for $0 \le l \le L - 1$.* Fig. 1 suggests that the dynamics of $D_l$ follows the relation
$$
D_l \doteq \rho^l D_0 \quad [2]
$$
*For clarification, $D_0$ is calculated from the raw data, and $D_1$ is calculated from the data that have passed through the first layer but not the second layer.
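A sketch of how the geometric decay of Eq. 2 can be checked empirically: regress $\log D_l$ on the layer index $l$, recovering the decay ratio $\rho$ from the slope and the Pearson correlation reported in Fig. 1 (the helper name `fit_decay_ratio` is ours, not from the paper).

```python
import numpy as np

def fit_decay_ratio(D_values):
    """Fit log D_l ≈ l·log(rho) + log(D_0) by least squares.

    D_values: sequence D_0, ..., D_{L-1} of separation fuzziness
    measured after each layer. Returns (rho, pearson_r), where
    pearson_r is the correlation between log D_l and l.
    """
    logD = np.log(np.asarray(D_values, dtype=float))
    layers = np.arange(len(logD))
    slope, _ = np.polyfit(layers, logD, 1)
    pearson_r = np.corrcoef(layers, logD)[0, 1]
    # Under Eq. 2, slope = log(rho), so rho = exp(slope).
    return np.exp(slope), pearson_r
```

A Pearson correlation near $-1$ together with a fitted $\rho < 1$ is the signature of the law: each layer shrinks the fuzziness by a constant factor.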
Significance Statement
The practice of deep learning has long been shrouded in mys-
tery, leading many to believe that the inner workings of these
black-box models are chaotic during training. In this paper, we
challenge this belief by presenting a simple and approximate
law that deep neural networks follow when processing data in
the intermediate layers. This empirical law is observed in a
class of modern network architectures for vision tasks, and its
emergence is shown to bring important benefits for the trained
models. The significance of this law is that it allows for a new
perspective that provides useful insights into the practice of
deep learning.
H.H. and W.J.S. designed research, performed research, analyzed data, and wrote the paper.
The authors declare no competing interest.
1To whom correspondence should be addressed. E-mail: suw@wharton.upenn.edu.
PNAS | August 14, 2023 | vol. XXX | no. XX | 1–14
arXiv:2210.17020v2 [cs.LG] 11 Aug 2023