Learning Ability of Interpolating Deep Convolutional Neural Networks
Tian-Yi Zhou*, Xiaoming Huo
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology
765 Ferst Drive, Atlanta GA 30332-0205
Abstract
It is frequently observed that overparameterized neural networks generalize well. Regarding such phenomena, existing theoretical work is mainly devoted to linear settings or fully-connected neural networks. This paper studies the learning ability of an important family of deep neural networks, deep convolutional neural networks (DCNNs), under both underparameterized and overparameterized settings. We establish the first learning rates of underparameterized DCNNs that are free of the restrictions on parameters or on variable structures of target functions imposed in the literature. We also show that, by adding well-defined layers to a non-interpolating DCNN, we can obtain interpolating DCNNs that maintain the good learning rates of the non-interpolating DCNN. This result is achieved by a novel network deepening scheme designed for DCNNs. Our work provides theoretical verification of how overfitted DCNNs generalize well.
Keywords: Convolutional neural networks, deep learning, benign overfitting, learning rates,
overparameterized network
1. Introduction
Neural networks are computing systems with powerful applications in many disciplines, such as data analysis and pattern and sequence recognition. In particular, deep neural networks with well-designed structures, numerous trainable free parameters, and massive-scale input data have shown outstanding performance in function approximation [32, 18], classification [19, 11], regression [30], and feature extraction [23]. The success of deep neural networks in practice has motivated research activities intended to rigorously explain their capability and power, in addition to the literature on shallow neural networks [26].
*Corresponding author. Email address: tzhou306@gatech.edu (Tian-Yi Zhou)
In this paper, we study an important family of neural networks known as convolutional neural networks. Given that neural networks, in general, are powerful and versatile, researchers have been working to further improve their computational efficiency. When the input dimension is large, such as in AlexNet [19] with an input dimension of about 150,000, fully-connected neural networks are not feasible. Structures are often imposed on neural networks to reduce the number of trainable free parameters and obtain feasible deep learning algorithms for various practical tasks [20]. The structure we are interested in is induced by one-dimensional convolution (1-D convolution), and the resulting networks are deep convolutional neural networks (DCNNs) [31]. The convolutional structure of DCNNs reduces computational complexity and is believed to capture local shift-invariance properties of image and speech data. These features contribute to the massive popularity of DCNNs in image processing and speech recognition.
In recent years, there has been a line of work studying overparameterization in deep learning. It is frequently observed that overparameterized deep neural networks, such as DCNNs, generalize well while achieving zero training error [6]. This phenomenon, known as benign overfitting, seems to contradict the classical bias-variance trade-off in statistical theory. Such a mismatch between observations and classical theory sparked avid research attempting to understand how benign overfitting occurs. Theoretical work studying benign overfitting was initiated in [4], where a linear regression setting with Gaussian data and noise was considered, and conditions for minimum-norm interpolators to generalize well were presented. In a non-linear setting induced by the ReLU activation function, benign overfitting was verified for deep fully-connected neural networks in [22]. On top of that, a recent work [10] shows that training shallow neural networks with shared weights by gradient descent can achieve an arbitrarily small training error.
In this paper, we study the learning ability of DCNNs under both underparameterized and overparameterized settings. We aim to show that an overparameterized DCNN can be constructed to have the same convergence rate as a given underparameterized one while perfectly fitting the input data. In other words, we intend to prove that interpolating DCNNs generalize well.
The main contributions of the paper are as follows. Our first result rigorously proves that, for an arbitrary DCNN with good learning rates, we can add more layers to build overparameterized DCNNs satisfying the interpolation condition while retaining the good learning rates. Here, "learning rates" refers to the rates of convergence of the output function to the regression function in a regression setting. Our second result establishes the learning rates of DCNNs in general. Previously in [33], convergence rates of approximating functions in some Sobolev spaces by DCNNs were given without generalization analysis. Moreover, learning rates of DCNNs for learning radial functions were given in [24], where
the bias vectors and filters are assumed to be bounded, with bounds depending on the sample size and depth. More recently, learning rates for learning additive ridge functions were presented in [14]. Unlike these existing works, the learning rates we derive do not require any restrictions on the norms of the filters or bias vectors, or on variable structures of the target functions. Without boundedness of the free parameters, the standard covering number arguments do not apply. To overcome this challenge, we derive a special estimate of the pseudo-dimension of the hypothesis space generated by a DCNN. Previously, a pseudo-dimension estimate was given in [3] for fully-connected neural networks, using the piecewise polynomial property of the activation function. We apply our pseudo-dimension estimate to, in turn, bound the empirical covering number of the hypothesis space. In this way, we achieve our results without restrictions on the free parameters.
Combining our first and second results, we prove that for any input data, there exist overparameterized DCNNs which interpolate the data and achieve a good learning rate. This third result provides theoretical support for the possible existence of benign overfitting in the DCNN setting.
The rest of this paper is organized as follows. In Section 2, we introduce notations and definitions
used throughout the paper, including the definition of DCNNs to be studied. In Section 3, we present
our main results that describe how a DCNN achieves benign overfitting. The proof of our first result
is given in Section 4, and the proofs of our second and third results are provided in Section 5. In
Section 6, we present the results of numerical experiments which corroborate our theoretical findings.
Lastly, in Section 7, we present some discussions and compare our work with the existing literature.
2. Problem Formulation
In this section, we define the DCNNs to be studied in this paper and the corresponding hypothesis
space (Subsection 2.1). Then, we introduce the regression setting with data and the regression function
(Subsection 2.2).
2.1. Deep Convolutional Neural Networks and the Corresponding Hypothesis Space
To begin with, we formulate the 1-D convolution. Let $w = \{w_j\}_{j=-\infty}^{+\infty}$ be a filter supported in $\{0, 1, \ldots, s\}$ for some filter length $s \in \mathbb{N}$, which means $w_j \neq 0$ only for $0 \leq j \leq s$. Suppose $x = \{x_j\}_{j=-\infty}^{+\infty}$ is another sequence supported in $\{1, 2, \ldots, d\}$ for some $d \in \mathbb{N}$, denoted as an input vector $x = (x_1, \ldots, x_d)^T \in \mathbb{R}^d$ for networks in the following. The 1-D convolution of $w$ with $x$, denoted by $w * x$, is defined as
$$(w * x)_i = \sum_{k \in \mathbb{Z}} w_{i-k} x_k = \sum_{k=1}^{d} w_{i-k} x_k, \qquad i \in \mathbb{Z}. \tag{2.1}$$
We can see from (2.1) that the convolved sequence $w * x$ is supported in $\{1, 2, \ldots, d+s\}$ and can be expressed in the following matrix form:
$$\big[(w * x)_i\big]_{i=1}^{d+s} = T^w x, \tag{2.2}$$
where
$$(T^w)_{i+\ell,\, i} = w_\ell \ \text{ for } i = 1, 2, \ldots, d \text{ and } \ell = 0, 1, \ldots, s, \qquad (T^w)_{i,j} = 0 \ \text{ otherwise}. \tag{2.3}$$
Here $T^w$ is a $(d+s) \times d$ sparse Toeplitz-type matrix, often referred to as the "convolutional matrix." The sparsity of $T^w$ can be attributed to its large number of zero entries. This approach is known as "zero-padding": we have expanded the vector $x \in \mathbb{R}^d$ to a sequence on $\mathbb{Z}$ by adding zero entries outside the support $\{1, 2, \ldots, d\}$.
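To make this construction concrete, the short sketch below (our own illustration in Python/NumPy; the helper name conv_matrix is not from the paper) builds the $(d+s) \times d$ matrix $T^w$ of (2.3) and checks that $T^w x$ coincides with the direct 1-D convolution (2.1).

```python
import numpy as np

def conv_matrix(w, d):
    """Build the (d+s) x d Toeplitz-type convolutional matrix T^w of (2.3)
    for a filter w = (w_0, ..., w_s) and input dimension d (zero-padding)."""
    s = len(w) - 1
    T = np.zeros((d + s, d))
    for k in range(d):               # columns k = 1, ..., d (0-based here)
        for ell in range(s + 1):     # (T^w)_{k+ell, k} = w_ell
            T[k + ell, k] = w[ell]
    return T

# Check (2.2) against the direct convolution (2.1): (w*x)_i = sum_k w_{i-k} x_k.
rng = np.random.default_rng(0)
d, s = 8, 3
w = rng.standard_normal(s + 1)       # filter supported on {0, ..., s}
x = rng.standard_normal(d)           # input vector x in R^d
assert np.allclose(conv_matrix(w, d) @ x, np.convolve(w, x))   # both of length d + s
```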
Now we define DCNNs by means of convolutional matrices. We take the ReLU activation function $\sigma : \mathbb{R} \to \mathbb{R}$ given by $\sigma(u) = \max\{0, u\}$. It acts on vectors componentwise.
Definition 1. A DCNN of depth $J \in \mathbb{N}$ consists of a sequence of function vectors $\{h^{(j)} : \mathbb{R}^d \to \mathbb{R}^{d_j}\}_{j=1}^{J}$ of widths $\{d_j := d + js\}_{j=0}^{J}$, defined with a sequence of filters $\mathbf{w} = \{w^{(j)}\}_{j=1}^{J}$, each of filter length $s \in \mathbb{N}$, and a sequence of bias vectors $\mathbf{b} = \{b^{(j)} \in \mathbb{R}^{d_j}\}$, by $h^{(0)}(x) = x$ and iteratively
$$h^{(j)}(x) = \sigma\big(T^{(j)} h^{(j-1)}(x) - b^{(j)}\big), \qquad j = 1, 2, \ldots, J, \tag{2.4}$$
where $T^{(j)} := T^{w^{(j)}} = \big[w^{(j)}_{i-k}\big]_{1 \leq i \leq d_{j-1}+s,\ 1 \leq k \leq d_{j-1}}$ is a $(d_{j-1}+s) \times d_{j-1}$ convolutional matrix. The hypothesis space generated by this DCNN is given by
$$\mathcal{H}_{J,s} = \mathrm{span}\left\{ c \cdot h^{(J)}(x) + a : \ c \in \mathbb{R}^{d_J},\ a \in \mathbb{R},\ \mathbf{w},\ \mathbf{b} \right\}. \tag{2.5}$$
We often take the bias vector $b^{(j)}$ of the so-called "identical-in-middle" form
$$b^{(j)} = \big[b_1, \cdots, b_{s-1}, \underbrace{b_s, \cdots, b_s}_{d_j - 2s + 2}, b_{d_j - s + 2}, \cdots, b_{d_j}\big]^T \tag{2.6}$$
with $d_j - 2(s-1)$ repeated entries in the middle. This special shape of the bias vector $b^{(j)}$, together with the sparsity of the convolutional matrix $T^w$, tells us that the $j$-th layer of the DCNN involves only $(s+1) + (2s-1) = 3s$ free parameters.
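The following sketch, continuing the previous one (it reuses conv_matrix and fills all parameters with random placeholders), illustrates the forward pass (2.4) with bias vectors of the identical-in-middle form (2.6) and evaluates one element $c \cdot h^{(J)}(x) + a$ of the hypothesis space (2.5).

```python
import numpy as np

rng = np.random.default_rng(1)

def identical_in_middle_bias(d_j, s):
    """Bias vector of the form (2.6): s-1 free entries at each end and one
    value repeated d_j - 2(s-1) times in the middle (2s-1 free parameters)."""
    head = rng.standard_normal(s - 1)
    middle = np.full(d_j - 2 * (s - 1), rng.standard_normal())
    tail = rng.standard_normal(s - 1)
    return np.concatenate([head, middle, tail])

def dcnn_output(x, filters, c, a):
    """Forward pass (2.4), h^(j)(x) = sigma(T^(j) h^(j-1)(x) - b^(j)),
    followed by the linear output c . h^(J)(x) + a as in (2.5)."""
    h = x
    for w_j in filters:
        s = len(w_j) - 1                      # filter length
        T = conv_matrix(w_j, len(h))          # (d_{j-1}+s) x d_{j-1}, from the sketch above
        b = identical_in_middle_bias(len(h) + s, s)
        h = np.maximum(T @ h - b, 0.0)        # ReLU acting componentwise
    return float(c @ h + a)

# Depth J = 3, filter length s = 3, input dimension d = 8; widths 11, 14, 17.
d, s, J = 8, 3, 3
x = rng.standard_normal(d)
filters = [rng.standard_normal(s + 1) for _ in range(J)]
c = rng.standard_normal(d + J * s)            # d_J = d + Js
print(dcnn_output(x, filters, c, a=0.0))
```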
To reduce data redundancy, we introduce a downsampling operator $D_m : \mathbb{R}^K \to \mathbb{R}^{\lfloor K/m \rfloor}$ with a scaling parameter $m \in \mathbb{N}$ by
$$D_m(v) = \big(v_{im}\big)_{i=1}^{\lfloor K/m \rfloor}, \qquad v \in \mathbb{R}^K, \tag{2.7}$$
where $\lfloor u \rfloor$ denotes the integer part of $u > 0$. In other words, the downsampling operator $D_m$ only "picks up" the $m$-th, $2m$-th, $\ldots$, $\lfloor K/m \rfloor m$-th entries of $v$.
Definition 2. A downsampled DCNN of depth $J$ with downsampling at layer $J_1 \in \{1, \ldots, J-1\}$ has widths $d_0 = d$ and
$$d_j = \begin{cases} d_{j-1} + s, & \text{if } j \neq J_1, \\ \lfloor (d_{j-1} + s)/d \rfloor, & \text{if } j = J_1, \end{cases} \tag{2.8}$$
and function vectors $\{h^{(j)} : \mathbb{R}^d \to \mathbb{R}^{d_j}\}_{j=1}^{J}$ given by $h^{(0)}(x) = x$ and iteratively
$$h^{(j)}(x) = \begin{cases} \sigma\big(T^{(j)} h^{(j-1)}(x) - b^{(j)}\big), & \text{if } j \neq J_1, \\ \sigma\big(D_d\big(T^{(j)} h^{(j-1)}(x)\big) - b^{(j)}\big), & \text{if } j = J_1. \end{cases} \tag{2.9}$$
In other words, the downsampling operation aims to reduce the width of a certain layer of the DCNN while preserving information on data features. The hypothesis space is defined in the same way as (2.5).
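For illustration, here is a minimal sketch of Definition 2 (again reusing conv_matrix and rng from the earlier sketches, with zero bias vectors purely to keep the example short): the downsampling operator $D_m$ of (2.7) and a forward pass that applies $D_d$ at the designated layer $J_1$, as in (2.9).

```python
def downsample(v, m):
    """D_m of (2.7): keep the m-th, 2m-th, ..., floor(K/m)*m-th entries of v."""
    K = len(v)
    return v[m - 1 : (K // m) * m : m]        # 0-based slice over 1-based positions m, 2m, ...

def downsampled_dcnn(x, filters, J1, d):
    """Forward pass (2.9): apply D_d right after the convolution at layer J1;
    zero bias vectors are used here purely for illustration."""
    h = x
    for j, w_j in enumerate(filters, start=1):
        v = conv_matrix(w_j, len(h)) @ h      # conv_matrix from the first sketch
        if j == J1:
            v = downsample(v, d)              # width becomes floor((d_{j-1}+s)/d), as in (2.8)
        h = np.maximum(v, 0.0)                # sigma(v - b) with b = 0
    return h

# Example: d = 8, s = 3, J = 4, downsampling at layer J1 = 3.
d, s, J, J1 = 8, 3, 4, 3
x = rng.standard_normal(d)
filters = [rng.standard_normal(s + 1) for _ in range(J)]
print(len(downsampled_dcnn(x, filters, J1, d)))   # widths 8 -> 11 -> 14 -> 2 -> 5
```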
In this paper, we take bias vectors $b^{(j)}$ satisfying (2.6) for $j = 1, 2, \ldots, J-1$. If no additional constraints are imposed, the number of free parameters of an output function from the hypothesis space (including filters, biases, and coefficients) equals
$$\sum_{j=1}^{J-1} (s + 1 + 2s - 1) + (s+1) + d_J + d_J + 1 = 3s(J-1) + s + 2 + 2d_J. \tag{2.10}$$
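For example (our own numbers, for illustration only), with input dimension $d = 32$, filter length $s = 2$, and depth $J = 5$, so that $d_J = d + Js = 42$, the count (2.10) gives $3 \cdot 2 \cdot (5-1) + 2 + 2 + 2 \cdot 42 = 112$ free parameters.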
DCNNs considered in this paper are based on a “zero-padding” approach and have increasing
widths. In the literature, DCNNs without zero-padding have also been introduced [15, 19], and they
have decreasing widths, leading to limited approximation abilities and the necessity of channels for
learning. Moreover, DCNNs induced by group convolutions were studied with nice approximation
properties presented in [28].
2.2. Data and Regression Function
Consider a training sample $D := \{z_i = (x_i, y_i)\}_{i=1}^{n}$ drawn independently and identically distributed from an unknown distribution $\rho$ on $Z := \Omega \times Y$. Throughout this paper, we assume that $\Omega$ is a closed bounded subset of $\mathbb{R}^d$ and $Y = [-M, M]$ for some $M \geq 1$.
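As a toy illustration of this sampling setup (our own placeholder choices of $\Omega$, $\rho$, and the target function, not ones used in the paper), one may generate a synthetic training sample as follows.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder setting: Omega = [0, 1]^d and Y = [-M, M] with M = 1.
n, d, M = 200, 8, 1.0
X = rng.uniform(0.0, 1.0, size=(n, d))                   # x_i in Omega, drawn i.i.d.
f_star = lambda x: np.sin(x.sum())                       # hypothetical target function
noise = rng.uniform(-0.2, 0.2, size=n)
Y_vals = np.clip(np.array([f_star(x) for x in X]) + noise, -M, M)   # y_i in [-M, M]
D = list(zip(X, Y_vals))                                 # training sample D = {(x_i, y_i)}_{i=1}^n
```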