A Law of Data Separation in Deep Learning

Hangfeng He (a,b) and Weijie J. Su (c,1)
(a) Department of Computer Science, University of Rochester, Rochester, NY 14627; (b) Goergen Institute for Data Science, University of Rochester, Rochester, NY 14627; (c) Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104
This manuscript was compiled on August 14, 2023
While deep learning has enabled significant advances in many areas of science, its black-box nature hinders architecture design for future artificial intelligence applications and interpretation for high-stakes decision making. We addressed this issue by studying the fundamental question of how deep neural networks process data in the intermediate layers. Our finding is a simple and quantitative law that governs how deep neural networks separate data according to class membership throughout all layers for classification. This law shows that each layer improves data separation at a constant geometric rate, and its emergence is observed in a collection of network architectures and datasets during training. This law offers practical guidelines for designing architectures, improving model robustness and out-of-sample performance, as well as interpreting the predictions.
deep learning | data separation | constant geometric rate | intermediate layers
Deep learning methodologies have achieved remarkable success across a wide range of data-intensive tasks in image recognition, biological research, and scientific computing (2–5). In contrast to other machine learning techniques (6), however, the practice of deep learning relies heavily on a plethora of heuristics and tricks that are not well justified. This situation often makes deep learning-based approaches ungrounded for some applications or necessitates huge computational resources for exhaustive search, making it difficult to fully realize the potential of this set of methodologies (7).
This unfortunate situation is in part owing to a lack of understanding of how the prediction depends on the intermediate layers of deep neural networks (8–10). In particular, little is known about how the data of different classes (e.g., images of cats and dogs) in classification problems are gradually separated from the bottom layers to the top layers in modern architectures such as AlexNet (2) and residual neural networks (11). Any knowledge about data separation, especially a quantitative characterization, would offer useful principles and insights for designing network architectures and training processes, and for interpreting models.
The main finding of this paper is a quantitative delineation of the data separation process throughout all layers of deep neural networks. As an illustration, Fig. 1 plots a certain value that measures how well the data are separated according to their class membership at each layer for feedforward neural networks trained on the Fashion-MNIST dataset (12). On the logarithmic scale, this value decays, in a distinct manner, linearly in the number of layers the data have passed through. The Pearson correlation coefficients between the logarithm of this value and the layer index range from −0.997 to −1 in Fig. 1.
The measure shown in Fig. 1 is canonical for measuring data separation in classification problems. Let $x_{ki}$ denote an intermediate output of a neural network on the $i$th point of Class $k$ for $1 \le i \le n_k$, $\bar{x}_k$ denote the sample mean of Class $k$, and $\bar{x}$ denote the mean of all $n := n_1 + \cdots + n_K$ data points. We define the between-class sum of squares and the within-class sum of squares as
$$\mathrm{SS}_b := \frac{1}{n} \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^\top, \qquad \mathrm{SS}_w := \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)(x_{ki} - \bar{x}_k)^\top,$$
respectively. The former matrix represents the between-class "signal" for classification, whereas the latter denotes the within-class variability. Writing $\mathrm{SS}_b^{+}$ for the Moore–Penrose inverse of $\mathrm{SS}_b$ (the matrix $\mathrm{SS}_b$ has rank at most $K-1$ and is not invertible in general, since the dimension of the data is typically larger than the number of classes), the ratio matrix $\mathrm{SS}_w \mathrm{SS}_b^{+}$ can be thought of as the inverse signal-to-noise ratio. We use its trace
$$D := \operatorname{Tr}(\mathrm{SS}_w \mathrm{SS}_b^{+}) \qquad [1]$$
to measure how well the data are separated (13, 14). This value, which is referred to as the separation fuzziness, is large when the data points are not concentrated around their class means or, equivalently, are not well separated, and vice versa.
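As a concrete reference, the following is a minimal NumPy sketch of Eq. 1 under the assumption that intermediate features are stored one row per data point; the function name `separation_fuzziness` is illustrative and not taken from the authors' code.

```python
import numpy as np

def separation_fuzziness(features: np.ndarray, labels: np.ndarray) -> float:
    """Compute D = Tr(SS_w @ pinv(SS_b)) for `features` of shape (n, p).

    A minimal sketch of Eq. 1; `labels` holds integer class labels.
    """
    n, p = features.shape
    global_mean = features.mean(axis=0)
    ss_b = np.zeros((p, p))
    ss_w = np.zeros((p, p))
    for k in np.unique(labels):
        class_feats = features[labels == k]
        n_k = class_feats.shape[0]
        class_mean = class_feats.mean(axis=0)
        diff = (class_mean - global_mean)[:, None]   # p x 1 column vector
        ss_b += n_k * diff @ diff.T                  # between-class scatter
        centered = class_feats - class_mean          # n_k x p
        ss_w += centered.T @ centered                # within-class scatter
    ss_b /= n
    ss_w /= n
    # SS_b has rank at most K - 1, so use the Moore-Penrose pseudoinverse.
    return float(np.trace(ss_w @ np.linalg.pinv(ss_b)))
```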
1. Main Results

Given an $L$-layer feedforward neural network, let $D_l$ denote the separation fuzziness (Eq. 1) of the training data passing through the first $l$ layers for $0 \le l \le L-1$. Fig. 1 suggests that the dynamics of $D_l$ follows the relation
$$D_l \doteq \rho^l D_0 \qquad [2]$$
for some decay ratio $0 < \rho < 1$.
For clarification, $D_0$ is calculated from the raw data, and $D_1$ is calculated from the data that have passed through the first layer but not the second layer.
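To make Eq. 2 concrete, one can regress $\log D_l$ on the layer index $l$: under the law, the points fall close to a line with slope $\log \rho$. The sketch below assumes a sequence of per-layer fuzziness values has already been computed (e.g., with the hypothetical helper above) and estimates $\rho$ together with the Pearson correlation; it is an illustration rather than the authors' procedure.

```python
import numpy as np

def fit_equi_separation(fuzziness_per_layer):
    """Fit log D_l = log D_0 + l * log(rho) by least squares.

    `fuzziness_per_layer` is the sequence D_0, D_1, ..., D_{L-1}; returns the
    estimated decay ratio rho and the Pearson correlation between log D_l and l.
    """
    d = np.asarray(fuzziness_per_layer, dtype=float)
    layers = np.arange(len(d))
    log_d = np.log(d)
    slope, intercept = np.polyfit(layers, log_d, deg=1)
    rho = float(np.exp(slope))                          # decay ratio in Eq. 2
    pearson_r = float(np.corrcoef(layers, log_d)[0, 1])
    return rho, pearson_r

# Example with synthetic values obeying D_l = 0.8**l * 5.0 exactly:
rho, r = fit_equi_separation([5.0 * 0.8**l for l in range(8)])
print(rho, r)   # approximately 0.8 and -1.0
```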
Significance Statement

The practice of deep learning has long been shrouded in mystery, leading many to believe that the inner workings of these black-box models are chaotic during training. In this paper, we challenge this belief by presenting a simple and approximate law that deep neural networks follow when processing data in the intermediate layers. This empirical law is observed in a class of modern network architectures for vision tasks, and its emergence is shown to bring important benefits for the trained models. The significance of this law is that it allows for a new perspective that provides useful insights into the practice of deep learning.
H.H. and W.J.S. designed research, performed research, analyzed data, and wrote the paper.
The authors declare no competing interest.
(1) To whom correspondence should be addressed. E-mail: suw@wharton.upenn.edu.
[Figure 1: rows 1–3 plot separation fuzziness versus layer index for SGD, Momentum, and Adam at Depth = 4, 8, and 20; rows 4–5 show per-layer (0–7) principal-component scatter plots.]
Fig. 1. Illustration of the law of equi-separation in feedforward neural networks with ReLU activations trained on the Fashion-MNIST dataset. The three rows correspond to three different training methods: stochastic gradient descent (SGD), SGD with momentum, and Adam (1). Throughout the paper, the x axis represents the layer index and the y axis represents the separation fuzziness defined in Eq. 1, unless otherwise specified. The Pearson correlation coefficients, by row first, are −1.000, −0.998, −0.997, −0.999, −0.998, −0.998, −1.000, −0.997, and −0.999. The last two rows show the intermediate data points on the plane of the first two principal components for the 8-layer network trained using Adam. Layer 0 corresponds to the raw input. More details can be found in SI Appendix.
[Figure 2: nine panels of separation fuzziness versus layer index at Epoch = 0, 10, 20, 30, 50, 100, 200, 300, and 600.]
Fig. 2. A 20-layer feedforward neural network trained on Fashion-MNIST. The law of equi-separation starts to emerge at epoch 100 and becomes clearer as training proceeds. As in Fig. 1 and all the other figures in the main text, the x axis represents the layer index and the y axis represents the separation fuzziness.
Alternatively, this law implies $\log D_{l+1} - \log D_l \doteq -\log\frac{1}{\rho}$, showing that the neural network makes equal progress in reducing $\log D$ over each layer on the training data. Hence, we call this the law of equi-separation.
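For completeness, the equal-progress form can be read off by taking logarithms of Eq. 2 at consecutive layers:
$$\log D_{l+1} - \log D_l \doteq \big[(l+1)\log\rho + \log D_0\big] - \big[l\log\rho + \log D_0\big] = \log\rho = -\log\tfrac{1}{\rho}.$$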
This law is the first quantitative and geometric characterization of the data separation process in the intermediate layers. Indeed, it is unexpected because the intermediate output of the neural network does not exhibit any quantitative patterns, as shown by the last two rows of Fig. 1.
The decay ratio $\rho$ depends on the depth of the neural network, dataset, training time, and network architecture, and is also affected, to a lesser extent, by optimization methods and many other hyperparameters. For the 20-layer network trained using Adam (1) (the bottom-right plot of Fig. 1), the decay ratio $\rho$ is 0.818. Thus, the half-life is $\frac{\ln 2}{\ln \rho^{-1}} = \frac{0.693}{\ln \rho^{-1}} = 3.45$, suggesting that this 20-layer neural network halves the separation fuzziness roughly every three and a half layers.
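The half-life arithmetic can be checked directly; a small sketch, with the decay ratio 0.818 taken from the text:

```python
import math

rho = 0.818                      # decay ratio reported for the 20-layer network
half_life = math.log(2) / math.log(1 / rho)
print(round(half_life, 2))       # 3.45 layers to halve the separation fuzziness
```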
This law manifests in the terminal phase of training (14), where we continue to train the model to interpolate the in-sample data. At initialization, the separation fuzziness may even increase from the bottom to the top layers. During the early stages of training, the bottom layers tend to learn faster at reducing the separation fuzziness compared to the top layers (see Fig. S7 in SI Appendix). However, as training progresses, the top layers eventually catch up once the bottom layers have learned the necessary features. Finally, each layer becomes roughly equally capable of reducing the separation fuzziness multiplicatively. These dynamics of data separation during training are illustrated in Fig. 2. Neural networks in the terminal phase of training also exhibit certain symmetric geometries in the last layer, such as neural collapse (14, 15) and minority collapse (10). However, it is worth mentioning that the law of equi-separation emerges earlier than neural collapse during training (see Fig. S1 in SI Appendix).
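One way to track these dynamics and reproduce Fig. 2-style curves is to record the fuzziness of every layer's output at selected epochs. The sketch below assumes a plain feedforward `torch.nn.Sequential` model on CPU and reuses the hypothetical `separation_fuzziness` helper sketched after Eq. 1; it is an illustrative outline, not the authors' training script.

```python
import torch

@torch.no_grad()
def layerwise_fuzziness(model: torch.nn.Sequential, x: torch.Tensor, y: torch.Tensor):
    """Return [D_0, D_1, ...]: fuzziness of the raw input and of each layer's output.

    `x` has shape (n, p) and `y` holds integer class labels. Assumes the model is
    a Sequential stack, so intermediate outputs are obtained by applying its
    modules in order; reuses the separation_fuzziness sketch from Eq. 1.
    """
    values = [separation_fuzziness(x.numpy(), y.numpy())]   # D_0: raw input
    h = x
    for layer in model:
        h = layer(h)
        flat = h.flatten(start_dim=1)                        # flatten any feature maps
        values.append(separation_fuzziness(flat.numpy(), y.numpy()))
    return values

# During training, call this every few epochs, e.g.:
# history[epoch] = layerwise_fuzziness(model, train_inputs, train_labels)
```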
The law of equi-separation consistently prevails across diverse datasets, learning rates, and class imbalances, as illustrated in Fig. 3. Additionally, Fig. S6 in SI Appendix demonstrates its applicability in a finer, class-wise context. This law is further exemplified in contemporary network architectures for vision tasks, such as AlexNet and VGGNet (16), as shown in Fig. 4 (see SI Appendix for additional convolutional neural network experiments in Fig. S2 and Fig. S3). Moreover, the law manifests in residual neural networks and densely connected convolutional networks (17) when separation fuzziness is assessed at each block, as depicted in Fig. 7 and Fig. 8, respectively. Intriguingly, this law appears slightly more pronounced in feedforward neural networks than in other network architectures.
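For architectures in which the natural unit is a block rather than a single layer, intermediate features can be collected with forward hooks and the fuzziness assessed per block. A minimal PyTorch sketch, assuming a torchvision ResNet-18 with a placeholder batch (not the architectures, data, or code used in the paper), again reusing the hypothetical `separation_fuzziness` helper:

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10).eval()

# Collect the output of each residual stage (layer1-layer4) in a dict.
features = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Global-average-pool the spatial dimensions so each example is a vector.
        features[name] = output.mean(dim=(2, 3)).detach()
    return hook

handles = [getattr(model, name).register_forward_hook(make_hook(name))
           for name in ("layer1", "layer2", "layer3", "layer4")]

with torch.no_grad():
    images = torch.randn(64, 3, 32, 32)      # placeholder batch
    labels = torch.randint(0, 10, (64,))
    model(images)

for name in ("layer1", "layer2", "layer3", "layer4"):
    # separation_fuzziness is the sketch given after Eq. 1.
    print(name, separation_fuzziness(features[name].numpy(), labels.numpy()))

for h in handles:
    h.remove()
```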
The separation dynamics of neural networks have been extensively investigated in prior research studies (18–22). For instance, (18) employed linear classifiers as probes to assess the separability of intermediate outputs. In (19), the author scru-