A Law of Data Separation in Deep Learning

Hangfeng He (a,b) and Weijie J. Su (c,1)
(a) Department of Computer Science, University of Rochester, Rochester, NY 14627; (b) Goergen Institute for Data Science, University of Rochester, Rochester, NY 14627; (c) Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104
This manuscript was compiled on August 14, 2023
While deep learning has enabled significant advances in many areas of science, its black-box nature hinders architecture design for future artificial intelligence applications and interpretation for high-stakes decision making. We addressed this issue by studying the fundamental question of how deep neural networks process data in the intermediate layers. Our finding is a simple and quantitative law that governs how deep neural networks separate data according to class membership throughout all layers for classification. This law shows that each layer improves data separation at a constant geometric rate, and its emergence is observed in a collection of network architectures and datasets during training. This law offers practical guidelines for designing architectures, improving model robustness and out-of-sample performance, as well as interpreting the predictions.
deep learning | data separation | constant geometric rate | intermediate layers
Deep learning methodologies have achieved remarkable success across a wide range of data-intensive tasks in image recognition, biological research, and scientific computing (2–5). In contrast to other machine learning techniques (6), however, the practice of deep learning relies heavily on a plethora of heuristics and tricks that are not well justified. This situation often makes deep learning-based approaches ungrounded for some applications or necessitates huge computational resources for exhaustive search, making it difficult to fully realize the potential of this set of methodologies (7).
This unfortunate situation is in part owing to a lack of understanding of how the prediction depends on the intermediate layers of deep neural networks (8–10). In particular, little is known about how the data of different classes (e.g., images of cats and dogs) in classification problems are gradually separated from the bottom layers to the top layers in modern architectures such as AlexNet (2) and residual neural networks (11). Any knowledge about data separation, especially a quantitative characterization, would offer useful principles and insights for designing network architectures and training processes, and for interpreting models.
The main finding of this paper is a quantitative delineation of the data separation process throughout all layers of deep neural networks. As an illustration, Fig. 1 plots a certain value that measures how well the data are separated according to their class membership at each layer for feedforward neural networks trained on the Fashion-MNIST dataset (12). On the logarithmic scale, this value decays, in a distinct manner, linearly in the number of layers the data have passed through. The Pearson correlation coefficients between the logarithm of this value and the layer index range from −0.997 to −1 in Fig. 1.
The measure shown in Fig. 1 is canonical for measuring data separation in classification problems. Let $x_{ki}$ denote an intermediate output of a neural network on the $i$th point of Class $k$ for $1 \le i \le n_k$, $\bar{x}_k$ denote the sample mean of Class $k$, and $\bar{x}$ denote the mean of all $n := n_1 + \cdots + n_K$ data points. We define the between-class sum of squares and the within-class sum of squares as
$$\mathrm{SS}_b := \frac{1}{n} \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^\top, \qquad \mathrm{SS}_w := \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)(x_{ki} - \bar{x}_k)^\top,$$
respectively. The former matrix represents the between-class "signal" for classification, whereas the latter denotes the within-class variability. Writing $\mathrm{SS}_b^{+}$ for the Moore–Penrose inverse of $\mathrm{SS}_b$ (the matrix $\mathrm{SS}_b$ has rank at most $K-1$ and is not invertible in general, since the dimension of the data is typically larger than the number of classes), the ratio matrix $\mathrm{SS}_w \mathrm{SS}_b^{+}$ can be thought of as the inverse signal-to-noise ratio. We use its trace
$$D := \operatorname{Tr}(\mathrm{SS}_w \mathrm{SS}_b^{+}) \qquad [1]$$
to measure how well the data are separated (13, 14). This value, which is referred to as the separation fuzziness, is large when the data points are not concentrated around their class means or, equivalently, are not well separated, and vice versa.
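As a concrete reference, the following is a minimal NumPy sketch of Eq. 1 under the assumption that intermediate features are stored one row per data point; the function name `separation_fuzziness` is illustrative and not taken from the authors' code.

```python
import numpy as np

def separation_fuzziness(features: np.ndarray, labels: np.ndarray) -> float:
    """Compute D = Tr(SS_w @ pinv(SS_b)) for `features` of shape (n, p).

    A minimal sketch of Eq. 1; `labels` holds integer class labels.
    """
    n, p = features.shape
    global_mean = features.mean(axis=0)
    ss_b = np.zeros((p, p))
    ss_w = np.zeros((p, p))
    for k in np.unique(labels):
        class_feats = features[labels == k]
        n_k = class_feats.shape[0]
        class_mean = class_feats.mean(axis=0)
        diff = (class_mean - global_mean)[:, None]   # p x 1 column vector
        ss_b += n_k * diff @ diff.T                  # between-class scatter
        centered = class_feats - class_mean          # n_k x p
        ss_w += centered.T @ centered                # within-class scatter
    ss_b /= n
    ss_w /= n
    # SS_b has rank at most K - 1, so use the Moore-Penrose pseudoinverse.
    return float(np.trace(ss_w @ np.linalg.pinv(ss_b)))
```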
1. Main Results

Given an $L$-layer feedforward neural network, let $D_l$ denote the separation fuzziness (Eq. 1) of the training data passing through the first $l$ layers for $0 \le l \le L-1$. Fig. 1 suggests that the dynamics of $D_l$ follows the relation
$$D_l \doteq \rho^l D_0 \qquad [2]$$
for some decay ratio $0 < \rho < 1$.
For clarification, $D_0$ is calculated from the raw data, and $D_1$ is calculated from the data that have passed through the first layer but not the second layer.
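To make Eq. 2 concrete, one can regress $\log D_l$ on the layer index $l$: under the law, the points fall close to a line with slope $\log \rho$. The sketch below assumes a sequence of per-layer fuzziness values has already been computed (e.g., with the hypothetical helper above) and estimates $\rho$ together with the Pearson correlation; it is an illustration rather than the authors' procedure.

```python
import numpy as np

def fit_equi_separation(fuzziness_per_layer):
    """Fit log D_l = log D_0 + l * log(rho) by least squares.

    `fuzziness_per_layer` is the sequence D_0, D_1, ..., D_{L-1}; returns the
    estimated decay ratio rho and the Pearson correlation between log D_l and l.
    """
    d = np.asarray(fuzziness_per_layer, dtype=float)
    layers = np.arange(len(d))
    log_d = np.log(d)
    slope, intercept = np.polyfit(layers, log_d, deg=1)
    rho = float(np.exp(slope))                          # decay ratio in Eq. 2
    pearson_r = float(np.corrcoef(layers, log_d)[0, 1])
    return rho, pearson_r

# Example with synthetic values obeying D_l = 0.8**l * 5.0 exactly:
rho, r = fit_equi_separation([5.0 * 0.8**l for l in range(8)])
print(rho, r)   # approximately 0.8 and -1.0
```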
Significance Statement

The practice of deep learning has long been shrouded in mystery, leading many to believe that the inner workings of these black-box models are chaotic during training. In this paper, we challenge this belief by presenting a simple and approximate law that deep neural networks follow when processing data in the intermediate layers. This empirical law is observed in a class of modern network architectures for vision tasks, and its emergence is shown to bring important benefits for the trained models. The significance of this law is that it allows for a new perspective that provides useful insights into the practice of deep learning.
H.H. and W.J.S. designed research, performed research, analyzed data, and wrote the paper.
The authors declare no competing interest.
(1) To whom correspondence should be addressed. E-mail: suw@wharton.upenn.edu.
[Figure 1: rows 1–3 plot separation fuzziness versus layer index for SGD, Momentum, and Adam at Depth = 4, 8, and 20; rows 4–5 show per-layer (0–7) principal-component scatter plots.]
Fig. 1. Illustration of the law of equi-separation in feedforward neural networks with ReLU activations trained on the Fashion-MNIST dataset. The three rows correspond to three different training methods: stochastic gradient descent (SGD), SGD with momentum, and Adam (1). Throughout the paper, the x axis represents the layer index and the y axis represents the separation fuzziness defined in Eq. 1, unless otherwise specified. The Pearson correlation coefficients, by row first, are −1.000, −0.998, −0.997, −0.999, −0.998, −0.998, −1.000, −0.997, and −0.999. The last two rows show the intermediate data points on the plane of the first two principal components for the 8-layer network trained using Adam. Layer 0 corresponds to the raw input. More details can be found in SI Appendix.
[Figure 2: nine panels of separation fuzziness versus layer index at Epoch = 0, 10, 20, 30, 50, 100, 200, 300, and 600.]
Fig. 2. A 20-layer feedforward neural network trained on Fashion-MNIST. The law of equi-separation starts to emerge at epoch 100 and becomes clearer as training proceeds. As in Fig. 1 and all the other figures in the main text, the x axis represents the layer index and the y axis represents the separation fuzziness.
Alternatively, this law implies $\log D_{l+1} - \log D_l \doteq -\log\frac{1}{\rho}$, showing that the neural network makes equal progress in reducing $\log D$ over each layer on the training data. Hence, we call this the law of equi-separation.
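For completeness, the equal-progress form can be read off by taking logarithms of Eq. 2 at consecutive layers:
$$\log D_{l+1} - \log D_l \doteq \big[(l+1)\log\rho + \log D_0\big] - \big[l\log\rho + \log D_0\big] = \log\rho = -\log\tfrac{1}{\rho}.$$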
This law is the first quantitative and geometric characterization of the data separation process in the intermediate layers. Indeed, it is unexpected because the intermediate output of the neural network does not exhibit any quantitative patterns, as shown by the last two rows of Fig. 1.
The decay ratio $\rho$ depends on the depth of the neural network, dataset, training time, and network architecture, and is also affected, to a lesser extent, by optimization methods and many other hyperparameters. For the 20-layer network trained using Adam (1) (the bottom-right plot of Fig. 1), the decay ratio $\rho$ is 0.818. Thus, the half-life is $\frac{\ln 2}{\ln \rho^{-1}} = \frac{0.693}{\ln \rho^{-1}} = 3.45$, suggesting that this 20-layer neural network halves the separation fuzziness roughly every three and a half layers.
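The half-life arithmetic can be checked directly; a small sketch, with the decay ratio 0.818 taken from the text:

```python
import math

rho = 0.818                      # decay ratio reported for the 20-layer network
half_life = math.log(2) / math.log(1 / rho)
print(round(half_life, 2))       # 3.45 layers to halve the separation fuzziness
```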
This law manifests in the terminal phase of training (14), where we continue to train the model to interpolate the in-sample data. At initialization, the separation fuzziness may even increase from the bottom to the top layers. During the early stages of training, the bottom layers tend to learn faster at reducing the separation fuzziness compared to the top layers (see Fig. S7 in SI Appendix). However, as training progresses, the top layers eventually catch up once the bottom layers have learned the necessary features. Finally, each layer becomes roughly equally capable of reducing the separation fuzziness multiplicatively. These dynamics of data separation during training are illustrated in Fig. 2. Neural networks in the terminal phase of training also exhibit certain symmetric geometries in the last layer, such as neural collapse (14, 15) and minority collapse (10). However, it is worth mentioning that the law of equi-separation emerges earlier than neural collapse during training (see Fig. S1 in SI Appendix).
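One way to track these dynamics and reproduce Fig. 2-style curves is to record the fuzziness of every layer's output at selected epochs. The sketch below assumes a plain feedforward `torch.nn.Sequential` model on CPU and reuses the hypothetical `separation_fuzziness` helper sketched after Eq. 1; it is an illustrative outline, not the authors' training script.

```python
import torch

@torch.no_grad()
def layerwise_fuzziness(model: torch.nn.Sequential, x: torch.Tensor, y: torch.Tensor):
    """Return [D_0, D_1, ...]: fuzziness of the raw input and of each layer's output.

    `x` has shape (n, p) and `y` holds integer class labels. Assumes the model is
    a Sequential stack, so intermediate outputs are obtained by applying its
    modules in order; reuses the separation_fuzziness sketch from Eq. 1.
    """
    values = [separation_fuzziness(x.numpy(), y.numpy())]   # D_0: raw input
    h = x
    for layer in model:
        h = layer(h)
        flat = h.flatten(start_dim=1)                        # flatten any feature maps
        values.append(separation_fuzziness(flat.numpy(), y.numpy()))
    return values

# During training, call this every few epochs, e.g.:
# history[epoch] = layerwise_fuzziness(model, train_inputs, train_labels)
```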
The law of equi-separation consistently prevails across diverse datasets, learning rates, and class imbalances, as illustrated in Fig. 3. Additionally, Fig. S6 in SI Appendix demonstrates its applicability in a finer, class-wise context. This law is further exemplified in contemporary network architectures for vision tasks, such as AlexNet and VGGNet (16), as shown in Fig. 4 (see SI Appendix for additional convolutional neural network experiments in Fig. S2 and Fig. S3). Moreover, the law manifests in residual neural networks and densely connected convolutional networks (17) when separation fuzziness is assessed at each block, as depicted in Fig. 7 and Fig. 8, respectively. Intriguingly, this law appears slightly more pronounced in feedforward neural networks than in other network architectures.
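For architectures in which the natural unit is a block rather than a single layer, intermediate features can be collected with forward hooks and the fuzziness assessed per block. A minimal PyTorch sketch, assuming a torchvision ResNet-18 with a placeholder batch (not the architectures, data, or code used in the paper), again reusing the hypothetical `separation_fuzziness` helper:

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10).eval()

# Collect the output of each residual stage (layer1-layer4) in a dict.
features = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Global-average-pool the spatial dimensions so each example is a vector.
        features[name] = output.mean(dim=(2, 3)).detach()
    return hook

handles = [getattr(model, name).register_forward_hook(make_hook(name))
           for name in ("layer1", "layer2", "layer3", "layer4")]

with torch.no_grad():
    images = torch.randn(64, 3, 32, 32)      # placeholder batch
    labels = torch.randint(0, 10, (64,))
    model(images)

for name in ("layer1", "layer2", "layer3", "layer4"):
    # separation_fuzziness is the sketch given after Eq. 1.
    print(name, separation_fuzziness(features[name].numpy(), labels.numpy()))

for h in handles:
    h.remove()
```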
The separation dynamics of neural networks have been extensively investigated in prior research studies (18–22). For instance, (18) employed linear classifiers as probes to assess the separability of intermediate outputs. In (19), the author scru-