When Expressivity Meets Trainability: Fewer than n
Neurons Can Work
Jiawei Zhang
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong,
Shenzhen, China
jiaweizhang2@link.cuhk.edu.cn
Yushun Zhang
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong,
Shenzhen, China
yushunzhang@link.cuhk.edu.cn
Mingyi Hong
University of Minnesota - Twin Cities
MN, USA
mhong@umn.edu
Ruoyu Sun
University of Illinois at Urbana-Champaign
IL, USA
ruoyus@illinois.edu
Zhi-Quan Luo
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong,
Shenzhen, China
luozq@cuhk.edu.cn
Abstract
Modern neural networks are often quite wide, causing large memory and com-
putation costs. It is thus of great interest to train a narrower network. However,
training narrow neural nets remains a challenging task. We ask two theoretical
questions: Can narrow networks have as strong expressivity as wide ones? If so,
does the loss function exhibit a benign optimization landscape? In this work, we
provide partially affirmative answers to both questions for 1-hidden-layer networks with fewer than $n$ (sample size) neurons when the activation is smooth. First, we prove that as long as the width $m \geq 2n/d$ (where $d$ is the input dimension), its
expressivity is strong, i.e., there exists at least one global minimizer with zero
training loss. Second, we identify a nice local region with no local-min or saddle
points. Nevertheless, it is not clear whether gradient descent can stay in this nice re-
gion. Third, we consider a constrained optimization formulation where the feasible
region is the nice local region, and prove that every KKT point is a nearly global
minimizer. It is expected that projected gradient methods converge to KKT points
under mild technical conditions, but we leave the rigorous convergence analysis
to future work. Thorough numerical results show that projected gradient methods
on this constrained formulation significantly outperform SGD for training narrow
neural nets.
1 Introduction
Modern neural networks are huge (e.g. [8, 74]). Reducing the size of neural nets is appealing for
many reasons: first, small networks are more suitable for embedded systems and portable devices;
second, using smaller networks can reduce power consumption, contributing to “green computing”.
Equal contribution. These authors are listed in alphabetical order.
Corresponding author: Ruoyu Sun.
35th Conference on Neural Information Processing Systems (NeurIPS 2021).
arXiv:2210.12001v1 [cs.LG] 21 Oct 2022
There are many ways to reduce network size, such as quantization, sparsification, and reducing the width (e.g. [20, 77]). In this work, we focus on reducing the width (training narrow nets).
Reducing the network width often leads to significantly worse performance³. What is the possible
. What is the possible
cause? From the theoretical perspective, there are three possible causes: worse generalization power,
worse trainability (how effective a network can be optimized), and weaker expressivity (how complex
the function a network can represent; see Definition 1). Our simulation shows that the training error
deteriorates significantly as the width shrinks, which implies that the trainability and/or expressivity
are important causes of the worse performance (see Section 5.1 for more evidence). We do not
discuss generalization power for now, and leave it to future work.
So how is the training error related to expressivity and trainability? The training error is the sum of
two parts (see, e.g., [64]): the expressive error (which is the best a given network can do; also the
global minimal error) and the optimization error (which is the gap between training error and the
global minimal error; occurs because the algorithm may not find global-min). The two errors are of
different nature, and thus need to be discussed separately.
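This decomposition can be written out explicitly (a standard identity, stated here for concreteness; $\hat{\theta}$ denotes the parameters returned by the training algorithm and $\ell^{\ast}$ the global minimum of the empirical loss):
$$\underbrace{\ell(\hat{\theta})}_{\text{training error}} \;=\; \underbrace{\ell^{\ast}}_{\text{expressive error}} \;+\; \underbrace{\ell(\hat{\theta}) - \ell^{\ast}}_{\text{optimization error}}, \qquad \ell^{\ast} := \min_{\theta}\, \ell(\theta) \ge 0.$$
In particular, a small training error requires both a small $\ell^{\ast}$ (strong expressivity) and a small gap $\ell(\hat{\theta}) - \ell^{\ast}$ (good trainability).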
It is understandable that narrower networks might have weaker expressive power. What about
optimization? There is also evidence that smaller width causes optimization difficulty. A number of
recent works show that increasing the width of neural networks helps create a benign empirical loss
landscape ([19, 35, 59]), while narrow networks (width $m <$ sample size $n$) suffer from a bad landscape ([2, 60, 66, 72, 79]). Therefore, if we want to improve the performance of narrow networks, it is
likely that both expressiveness and trainability need to be improved.
The above discussion leads to the following two questions:
(Q1) Can a narrow network have as strong expressivity as a wide one?
(Q2) If so, can a local search method find a (near) globally optimal solution?
The key challenges in answering these questions are listed below:
- It is not clear whether a narrow network has strong expressivity or not. Many existing works focus on verifying the relationship between zero-training-error solutions and stationary points, but they neglect the (non)existence of such solutions (e.g. [71], [63]). For narrow networks, the (non)existence of zero-training-error solutions is not clear.
- Even if zero-training-error solutions do exist, it is still not clear how to reach them, because the landscape of a narrow neural network can be highly non-convex.
- Even assuming that we can identify a region that contains zero-training-error solutions and has a good landscape, it is potentially difficult to keep the iterates inside such a good region. One may think of imposing an explicit constraint, but this approach might introduce bad local minimizers on the boundary [6].
In this work, we (partially) answer (Q1) and (Q2) for 1-hidden-layer nets with fewer than $n$ neurons.
Our main contributions are as follows:
- Expressiveness and nice local landscape. We prove that, as long as the width $m$ is larger than $2n/d$ (where $n$ is the sample size and $d$ is the input dimension), the expressivity of the 1-hidden-layer net is strong, i.e., w.p.1 there exists at least one global-min with zero empirical loss. In addition, such a solution is surrounded by a good local landscape with no local-min or saddles. Note that our results do not exclude the possibility that there are sub-optimal local minimizers on the global landscape.
- Every KKT point is an approximate global minimizer. For the original unconstrained optimization problem, the nice local landscape does not guarantee the global statement that "every stationary point is a global minimizer". We propose a constrained optimization problem that restricts the hidden weights to be close to the identified nice region. We show that every Karush–Kuhn–Tucker (KKT) point is an approximate global minimizer of the unconstrained training problem⁴.
- In real-data experiments, our proposed training regime significantly outperforms SGD for training narrow networks. We also perform ablation studies to show that the new elements proposed in our method are useful.

³ This can be verified in our empirical studies in Section 5. Another piece of evidence is that structured pruning (reducing the number of channels in convolutional neural nets (CNNs)) is known to achieve worse performance than unstructured pruning; this is an undesirable situation since many practitioners prefer structured pruning (for hardware reasons).
2 Background and Related Works
The expressivity of neural networks has been a popular topic in machine learning for decades. There
are two lines of works: One focuses on the infinite-sample expressivity, showing what functions of
the entire domain can and cannot be represented by certain classes of neural networks (e.g. [4, 45]). Another line of works characterizes the finite-sample expressivity, i.e., how many parameters are required to memorize finite samples (e.g. [5, 13, 21, 25, 26, 42, 75]). The term "expressivity" in this work means the finite-sample expressivity; see Definition 1 in Section 4.1. A major research question regarding expressivity in the area is to show that deep neural networks have much stronger expressivity than shallow ones (e.g. [7, 17, 41, 48, 56, 57, 67, 70, 72]). However, all these works neglect the trainability.
In the finite-sample case, wide networks (width $\geq \mathrm{poly}(n)$) have both strong representation power (i.e., the globally minimal training error is zero) and strong trainability (e.g., for wide enough nets, Gradient Descent (GD) converges to global minima [1, 16, 28, 80]). While these wide networks are often called "over-parameterized", we notice that the number of parameters of a width-$n$ network is actually at least $nd$, which is much larger than $n$. If we compare $n$ with the number of parameters (instead of neurons), the transition from under-parameterization to over-parameterization for a one-hidden-layer fully-connected net (FCN) does not occur at width $n$, but at width $n/d$. In this work, we will analyze networks with width in the range $[n/d, n)$, which we call "narrow networks" (though rigorously speaking, we should call them "narrow but still over-parameterized networks").
There are a few works on the trainability of narrow nets (one-hidden-layer networks with $m \geq n/d$ neurons). Soudry and Carmon [63], Xie et al. [71] show that for such networks, stationary points
with full-rank NTK (neural tangent kernel) are zero-loss global minima. However, it is not clear
whether the NTK stays full rank during the training trajectory. In addition, these two works do not
discuss whether a zero-loss global minimizer exists.
There are two interesting related works [9, 12] pointed out by the reviewers. Bubeck et al. [9] study how many neurons are required for memorizing a finite dataset by 1-hidden-layer networks. They prove the following results. Their first result is an "existence" result: there exists a network with width $m \geq 4n/d$ which can memorize $n$ input-label pairs (their Proposition 4). However, in this setting they did not provide an algorithm to find the zero-loss solution. Their second result is related to algorithms: they proposed a training algorithm that achieves accuracy up to error $\epsilon$ for a neural net with width $m \geq O\!\left(\frac{n}{d\epsilon}\log(1/\epsilon)\right)$. This result requires a width that depends on the precision $\epsilon$; for instance, when the desired accuracy is $\epsilon = 1/n$, the required width is at least $O\!\left(n^{2}/d\right)$. In contrast, in our work, the required number of neurons is just $2n/d$, which is independent of $\epsilon$.
Daniely [12] also studies the expressivity and trainability of 1-hidden-layer networks. To memorize $n(1-\epsilon)$ random data points via SGD, their required width is $\tilde{O}(n/d)$. They assumed $n = d^{c}$ where $c > 0$ is a fixed constant (this appears in Sec. 3.3 of [12]), in which case the hidden factor in $\tilde{O}$ is $O\!\left((\log[d(\log d)])^{c}\right)$. In other words, if $n, d \to \infty$ with the scaling $n = d^{c}$ for a fixed constant $c$, then their bound is roughly $O(n/d)$ up to a log-factor. Nevertheless, for more general scalings of $n, d$, the exponent $c = (\log n)/(\log d)$ may not be a constant and the hidden factor may not be a log-factor (e.g. for fixed $d$ and $n \to \infty$). We tracked their proof and found that the width bound for general $n, d$ is $O(n/d)\,\big(\log(d\log n)\big)^{\log n/\log d}$, which can be larger than $O\!\left(n^{2}/d\right)$ (see the detailed computation and explanation in Appendix C). In contrast, our required width is $2n/d$ for arbitrary $n$ and $d$. Our bound is always smaller than $n$ when $d > 2$.
⁴ This result describes the loss landscape of the constrained optimization problem and is not directly related to algorithm convergence. Nevertheless, it is expected that first-order methods converge to KKT points and thus to approximate global minimizers. A rigorous convergence analysis may require verifying extra technical conditions, which is left to future work.
Additionally, there is a major difference between Daniely [12] and our work: they analyze the original
unconstrained problem and SGD; in contrast, we analyze a constrained problem. This may be the
reason why we can get a stronger bound on width. In the experiments in Section 5, we observe that
SGD performs badly when the width is small (see the 1st column in Figure 4 (b)). Therefore, we
suspect an algorithmic change is needed to train narrow nets with such width (due to the training
difficulty), and we indeed propose a new method to train narrow nets.
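As a rough illustration of the kind of algorithmic change we have in mind, the sketch below runs projected gradient descent on a 1-hidden-layer net, projecting the hidden weights back onto a Euclidean ball around their initialization after every step. This is only a schematic sketch: the ball-shaped constraint, the `tanh` activation, and all hyperparameters are illustrative assumptions; the actual feasible region and training procedure are defined in Section 4.3.

```python
import numpy as np

def projected_gd(x, y, m, R=1.0, lr=1e-2, steps=2000, seed=0):
    """Schematic projected GD for f(x) = sum_j v_j * tanh(w_j^T x).

    The hidden weights are projected onto {w : ||w - w0||_F <= R} after
    every step; the radius R and constraint shape are illustrative only.
    """
    n, d = x.shape
    rng = np.random.default_rng(seed)
    w0 = rng.normal(scale=1.0 / np.sqrt(d), size=(d, m))   # LeCun-style scale
    w = w0.copy()
    v = rng.normal(scale=1.0 / np.sqrt(m), size=m)

    for _ in range(steps):
        h = np.tanh(x @ w)                                  # (n, m) hidden outputs
        r = h @ v - y                                       # (n,) residuals f - y
        grad_v = h.T @ r                                    # gradient w.r.t. outer weights
        grad_w = x.T @ ((r[:, None] * (1.0 - h ** 2)) * v[None, :])  # (d, m)
        v -= lr * grad_v
        w -= lr * grad_w
        # projection: pull the hidden weights back toward the initialization
        diff = w - w0
        nrm = np.linalg.norm(diff)
        if nrm > R:
            w = w0 + diff * (R / nrm)

    final_loss = 0.5 * np.sum((np.tanh(x @ w) @ v - y) ** 2)
    return w, v, final_loss
```

For example, with $n$ samples in dimension $d$, one could call `projected_gd(x, y, m=2 * n // d + 1)` to train a width-$\lceil 2n/d \rceil$ net under this constraint; the only point of the sketch is the projection step, which keeps the iterates inside a prescribed neighborhood of the initialization.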
Due to space constraints, we defer more related works to Appendix B.
3 Challenges For Analyzing Narrow Nets
In this section, we discuss why it is challenging to achieve expressivity and trainability together for
narrow nets. Consider a dataset $\{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^{d} \times \mathbb{R}$ and a 1-hidden-layer neural network:
$$f(x;\theta) = \sum_{j=1}^{m} v_j\, \sigma\!\left(w_j^{T} x\right), \qquad (1)$$
where $\sigma\!\left(w_j^{T} x\right)$ is the output of the $j$-th hidden node with hidden weights $w_j$, $\sigma(\cdot)$ is the activation function, and $v_j$ is the corresponding outer weight (bias terms are ignored for simplicity). To learn such a neural network, we search for the optimal parameter $\theta = (w, v)$ by minimizing the following empirical (training) loss:
$$\min_{\theta}\ \ell(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(y_i - f(x_i;\theta)\right)^{2}. \qquad (2)$$
The gradient of the above problem w.r.t. the hidden weights $w = \{w_i\}_{i=1}^{m}$ is given by
$$\nabla_w \ell(\theta) = J(w;v)^{T}\big(f(w;v) - y\big) \in \mathbb{R}^{md\times 1},$$
where $J(w;v) \in \mathbb{R}^{n\times md}$ is the Jacobian matrix w.r.t. $w$:
$$J(w;v) := \begin{pmatrix} \nabla_w f(w; x_1, v)^{T} \\ \vdots \\ \nabla_w f(w; x_n, v)^{T} \end{pmatrix} = \begin{pmatrix} v_1 \sigma'\!\left(w_1^{T} x_1\right) x_1^{T} & \cdots & v_m \sigma'\!\left(w_m^{T} x_1\right) x_1^{T} \\ \vdots & & \vdots \\ v_1 \sigma'\!\left(w_1^{T} x_n\right) x_n^{T} & \cdots & v_m \sigma'\!\left(w_m^{T} x_n\right) x_n^{T} \end{pmatrix} \in \mathbb{R}^{n\times md}. \qquad (3)$$
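As a sanity check on the expression in (3), the following NumPy sketch assembles $J(w;v)$ block by block for a smooth activation and compares one of its columns against a finite-difference approximation; the `tanh` activation and the random data are illustrative choices only.

```python
import numpy as np

def jacobian(w, v, x):
    """Jacobian of f(w; v) w.r.t. the hidden weights, as in Eq. (3).

    Row i is (v_1 sigma'(w_1^T x_i) x_i^T, ..., v_m sigma'(w_m^T x_i) x_i^T).
    Shapes: w is (d, m), v is (m,), x is (n, d); the output is (n, m*d).
    """
    n, d = x.shape
    m = v.shape[0]
    s = 1.0 - np.tanh(x @ w) ** 2                           # sigma'(w_j^T x_i), shape (n, m)
    blocks = (s * v[None, :])[:, :, None] * x[:, None, :]   # (n, m, d)
    return blocks.reshape(n, m * d)

rng = np.random.default_rng(0)
n, d, m = 6, 3, 4                                           # md = 12 >= n, so J can be full row rank
x = rng.normal(size=(n, d))
w = rng.normal(size=(d, m))
v = rng.normal(size=m)

J = jacobian(w, v, x)
f = lambda w_: np.tanh(x @ w_) @ v                          # network outputs f(w; v)

eps = 1e-6
w_pert = w.copy()
w_pert[0, 0] += eps                                         # perturb the first coordinate of w_1
print(np.allclose(J[:, 0], (f(w_pert) - f(w)) / eps, atol=1e-4))   # expect True
print("rank of J:", np.linalg.matrix_rank(J), "with n =", n)
```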
First-order methods like GD converge to a stationary point $\theta = (w, v)$ (i.e., a point with zero gradient) under mild conditions [6]. For problem (2), it is easy to show that if (i) $(w, v)$ is a stationary point and (ii) $J(w;v) \in \mathbb{R}^{n\times md}$ is full row rank with $n \leq md$, then $(w, v)$ is a global-min (this claim can be proved by setting the partial gradient of (2) over $w$ to zero). In other words, for training a network with width $m \geq n/d$, an important tool is to ensure the full-rankness of the Jacobian.
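For completeness, the claim above follows from a one-line calculation: at a stationary point the partial gradient over $w$ vanishes,
$$0 = \nabla_w \ell(\theta) = J(w;v)^{T}\big(f(w;v) - y\big),$$
and if $J(w;v)$ has full row rank (so $J(w;v)^{T}$ has trivial null space, which requires $n \leq md$), this forces $f(w;v) = y$, i.e., the training loss (2) is zero at $(w, v)$.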
Recent works have shown that it is possible to guarantee the full-rankness of the Jacobian matrix along the training trajectories; however, the required width is above $\Omega(\mathrm{poly}(n))$. Roughly speaking, the proof sketch is the following: (i) with high probability, $J(w;v)$ is non-singular locally around the random initialization ([71], [63]); (ii) increasing the width can effectively bound the parameter movement from initialization, so the "nice" property of a non-singular $J(w;v)$ holds throughout training, leading to a linear convergence rate [16]. Under this general framework, a number of convergence results have been developed for wide networks with width $\Omega(\mathrm{poly}(n))$ ([1, 11, 28, 34, 49, 53–55, 80, 81]). This idea is also illustrated in Figure 1(a).
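The role of width in step (ii) is easy to probe empirically. The sketch below trains 1-hidden-layer nets of increasing width with plain gradient descent on the same random data and records the relative movement of the hidden weights from their initialization; all choices here (the NTK-style $1/\sqrt{m}$ output scaling, `tanh` activation, learning rate, step count, random data) are illustrative assumptions, not the experimental settings of Section 5.

```python
import numpy as np

def relative_movement(m, n=64, d=8, lr=1e-2, steps=2000, seed=0):
    """Train f(x) = (1/sqrt(m)) * sum_j v_j tanh(w_j^T x) with plain GD and
    return ||w - w0||_F / ||w0||_F for the hidden weights. The 1/sqrt(m)
    scaling keeps the step size comparable across widths."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    w0 = rng.normal(scale=1.0 / np.sqrt(d), size=(d, m))
    w = w0.copy()
    v = rng.normal(size=m)
    c = 1.0 / np.sqrt(m)
    for _ in range(steps):
        h = np.tanh(x @ w)                                   # (n, m)
        r = c * (h @ v) - y                                  # residuals f - y
        grad_v = c * (h.T @ r)
        grad_w = c * (x.T @ ((r[:, None] * (1.0 - h ** 2)) * v[None, :]))
        v -= lr * grad_v
        w -= lr * grad_w
    return np.linalg.norm(w - w0) / np.linalg.norm(w0)

# wider nets are expected to move relatively less from their initialization
for m in (16, 64, 256, 1024):
    print(f"width {m:5d}: relative movement {relative_movement(m):.4f}")
```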
We notice that there is a huge gap between the necessary condition $m \geq n/d$ and the common condition $m \geq \Omega(\mathrm{poly}(n))$. We suspect that it is possible to train a narrow net with width $\Theta(n/d)$ to small loss. To achieve this goal, we need to understand why existing arguments require a large width and cannot apply to a network with width $\Theta(n/d)$.
The first reason is about trainability. The above arguments no longer hold when the width is not large
enough to control the movement of hidden weights. In this case, the iterates may easily travel far
away from the initial point and get stuck at some singular-Jacobian critical points with high training loss (see Figure 1(b); also see Figure 4(b) & (d) for more empirical evidence). In other words, GD may get stuck at sub-optimal stationary points for narrow nets.
The second reason, and also an easily ignored one, is the expressivity (a.k.a. the representation power,
see Definition 1 for a formal statement). In above discussion, we implicitly assumed that there exists
a zero-loss global minimizer, which is equivalent to “there exists a network configuration such that
the network can memorize the data". For networks with width at least $n$, this assumption can be justified in the following way. The feature matrix
$$\Phi(w) := \begin{pmatrix} \sigma\!\left(w_1^{T} x_1\right) & \cdots & \sigma\!\left(w_m^{T} x_1\right) \\ \vdots & & \vdots \\ \sigma\!\left(w_1^{T} x_n\right) & \cdots & \sigma\!\left(w_m^{T} x_n\right) \end{pmatrix} \in \mathbb{R}^{n\times m} \qquad (4)$$
can span the whole space $\mathbb{R}^{n}$ when it is full rank and $m \geq n$; thus the network can perfectly fit any label $y \in \mathbb{R}^{n}$ even without training the hidden layer. It is important to note that when the width $m$ is below the sample size $n$, full-rankness does not ensure that the column space of the feature matrix is the whole space $\mathbb{R}^{n}$. In other words, it is not clear whether a global-min with zero loss exists.

[Figure 1: The parameter movement under different regimes: (a) wide networks, (b) narrow networks, (c) narrow networks (our regime). The shaded area indicates the region where $J(w;v)$ is non-singular, the black circle denotes the region that GD iterates will explore, and the red circle is the constraint designed in our training regime; it will be discussed in Section 4.3.]
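To illustrate the feature-matrix argument above with a minimal NumPy sketch (the `tanh` activation, random data, and random hidden weights are illustrative assumptions): with $m \geq n$ one can fit arbitrary labels by solving for the outer weights $v$ alone, whereas with $m < n$ the same least-squares fit generically leaves a nonzero residual.

```python
import numpy as np

def residual_after_fitting_v(x, y, m, seed=0):
    """Freeze random hidden weights, build Phi(w) as in Eq. (4), and fit the
    outer weights by least squares: min_v ||Phi(w) v - y||_2."""
    n, d = x.shape
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(d, m))
    phi = np.tanh(x @ w)                         # feature matrix, shape (n, m)
    v, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return np.linalg.norm(phi @ v - y)

rng = np.random.default_rng(1)
n, d = 50, 10
x = rng.normal(size=(n, d))
y = rng.normal(size=n)

print("m = n     :", residual_after_fitting_v(x, y, m=n))       # ~0: labels are memorized
print("m = n // 4:", residual_after_fitting_v(x, y, m=n // 4))  # generically nonzero
```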
In the next section, we will describe how we obtain strong expressive power with $\Theta(n/d)$ neurons, and how to avoid sub-optimal stationary points.
4 Main Results
4.1 Problem Settings and Preliminaries
We denote $\{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^{d} \times \mathbb{R}$ as the training samples, where $x_i \in \mathbb{R}^{d}$ and $y_i \in \mathbb{R}$. For theoretical analysis, we focus on 1-hidden-layer neural networks $f(x;\theta) = \sum_{j=1}^{m} v_j\, \sigma\!\left(w_j^{T} x\right) \in \mathbb{R}$, where $\sigma(\cdot)$ is the activation function, and $w_j \in \mathbb{R}^{d}$ and $v_j \in \mathbb{R}$ are the parameters to be trained. Note that we only consider the case where $f(x;\theta) \in \mathbb{R}$ has a 1-dimensional output, for notational simplicity.
To learn such a neural network, we search for the optimal parameter $\theta = (w, v)$ by minimizing the empirical loss (2), and sometimes we also use $\ell(w;v)$ or $f(w; x, v)$ to emphasize the role of $w$. We use the following shorthand notations: $x := (x_1^{T}; \ldots; x_n^{T}) \in \mathbb{R}^{n\times d}$, $y := (y_1, \ldots, y_n)^{T} \in \mathbb{R}^{n}$, $w := (w_1, \ldots, w_m) \in \mathbb{R}^{d\times m}$, $v := (v_1, \ldots, v_m) \in \mathbb{R}^{m}$, and $f(w;v) := (f(x_1; w, v), f(x_2; w, v), \ldots, f(x_n; w, v))^{T} \in \mathbb{R}^{n}$. We denote the Jacobian matrix of $f(w;v)$ w.r.t. $w$ as $J(w;v)$, which is given in (3). We define the feature matrix $\Phi(w)$ as in (4).
We denote the operator $\nabla_w$ as "taking the gradient w.r.t. $w$", and the same goes for $\nabla_v$. Throughout the paper, 'w.p.1' is the abbreviation for 'with probability one'; when we say 'in the neighborhood of initialization', it means '$w$ is in the neighborhood of the initialization $w_0$'.
Now, we formally define the term “expressivity”. As discussed in Section 2, we focus on the
finite-sample (as opposed to infinite-sample) expressivity, which is relevant in practical training.
Definition 1. (Expressivity) We say a neural net function class $\mathcal{F} = \{f(x;\theta);\ \theta \in \Theta\}$ has strong ($n$-sample) expressivity if for any $n$ input-output pairs $D = \{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^{d} \times \mathbb{R}$ where the $x_i$'s are distinct, there exists a $\hat{\theta}(D) \in \Theta$ such that $f(x_i; \hat{\theta}(D)) = y_i$, $i = 1, \cdots, n$. Or equivalently, the optimal value of the empirical loss (2) equals 0 for any $D$. Sometimes we may drop the word "strong" for brevity.
[Figure 2: A simple example of the mirrored LeCun's initialization.]
Next, let us describe the mirrored LeCun's initialization in Algorithm 1. The idea is that through this initialization, the hidden outputs will cancel out with the outer weights, so that we get zero initial output for any input $x$; see Figure 2 for a simple illustration. Note that similar symmetric initialization strategies are also proposed in some recent works such as [11] and [12]. However, our purpose is different. More explanation can be seen in the final paragraph of Section 4.2.
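Algorithm 1 is not reproduced in this excerpt, so the following is only a schematic sketch of the mirroring idea described above: pair up the hidden neurons, give each pair identical hidden weights drawn from a LeCun-style distribution, and give the two neurons in a pair opposite outer weights, so the paired outputs cancel and $f(x;\theta_0) = 0$ for every input $x$. The specific scales used below are assumptions, not the exact constants of Algorithm 1.

```python
import numpy as np

def mirrored_lecun_init(d, m, seed=0):
    """Schematic mirrored initialization for an even width m.

    Neuron j and neuron j + m/2 share the same hidden weight vector,
    while their outer weights are negatives of each other, so the
    initial network output is identically zero.
    """
    assert m % 2 == 0, "mirroring requires an even number of neurons"
    rng = np.random.default_rng(seed)
    half = m // 2
    w_half = rng.normal(scale=1.0 / np.sqrt(d), size=(d, half))   # LeCun-style scale
    v_half = rng.normal(scale=1.0 / np.sqrt(half), size=half)
    w = np.concatenate([w_half, w_half], axis=1)    # mirrored hidden weights
    v = np.concatenate([v_half, -v_half])           # opposite outer weights
    return w, v

# quick check: the initial output is zero for any input
w, v = mirrored_lecun_init(d=5, m=8)
x = np.random.default_rng(1).normal(size=(3, 5))
print(np.tanh(x @ w) @ v)   # ≈ [0, 0, 0] up to floating-point error
```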