Learning Ability of Interpolating Deep Convolutional Neural Networks
Tian-Yi Zhou*, Xiaoming Huo
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology
765 Ferst Drive, Atlanta GA 30332-0205
Abstract
It is frequently observed that overparameterized neural networks generalize well. Regarding such phenomena, existing theoretical work is mainly devoted to linear settings or fully-connected neural networks. This paper studies the learning ability of an important family of deep neural networks, deep convolutional neural networks (DCNNs), under both underparameterized and overparameterized settings. We establish the first learning rates of underparameterized DCNNs that are free of the restrictions on parameters or on variable structures of target functions imposed in the literature. We also show that, by adding well-defined layers to a non-interpolating DCNN, we can obtain interpolating DCNNs that maintain the good learning rates of the non-interpolating DCNN. This result is achieved by a novel network deepening scheme designed for DCNNs. Our work provides theoretical verification of how overfitted DCNNs generalize well.
Keywords: Convolutional neural networks, deep learning, benign overfitting, learning rates,
overparameterized network
1. Introduction
Neural networks are computing systems with powerful applications in many disciplines, such as data analysis and pattern and sequence recognition. In particular, deep neural networks with well-designed structures, numerous trainable free parameters, and massive-scale input data have shown outstanding performance in function approximation [32, 18], classification [19, 11], regression [30], and feature extraction [23]. The success of deep neural networks in practice has motivated research activities intended to rigorously explain their capability and power, in addition to the literature on shallow neural networks [26].
*Corresponding author. Email address: tzhou306@gatech.edu (Tian-Yi Zhou)
In this paper, we study an important family of neural networks known as convolutional neural networks. Given that neural networks, in general, are powerful and versatile, researchers have been working to further improve their computational efficiency. When the input dimension is large, such as in AlexNet [19] with an input dimension of about 150,000, fully-connected neural networks are not feasible. Structures are often imposed on neural networks to reduce the number of trainable free parameters and obtain feasible deep learning algorithms for various practical tasks [20]. The structure we are interested in is induced by one-dimensional convolution (1-D convolution), and the resulting networks are deep convolutional neural networks (DCNNs) [31]. The convolutional structure of DCNNs reduces computational complexity and is believed to capture local shift-invariance properties of image and speech data. These features contribute to the massive popularity of DCNNs in image processing and speech recognition.
In recent years, there has been a line of work studying overparameterization in deep learning. It is frequently observed that overparameterized deep neural networks, such as DCNNs, generalize well while achieving zero training error [6]. This phenomenon, known as benign overfitting, seems to contradict the classical bias-variance trade-off in statistical theory. Such a mismatch between observations and classical theory sparked avid research attempting to understand how benign overfitting occurs. Theoretical work studying benign overfitting was initiated in [4], where a linear regression setting with Gaussian data and noise was considered, and conditions for minimum-norm interpolators to generalize well were presented. In a non-linear setting induced by the ReLU activation function, benign overfitting was verified for deep fully-connected neural networks in [22]. On top of that, a recent work [10] shows that training shallow neural networks with shared weights by gradient descent can achieve an arbitrarily small training error.
In this paper, we study the learning ability of DCNNs under both underparameterized and overparameterized settings. We aim to show that an overparameterized DCNN can be constructed to have the same convergence rate as a given underparameterized one while perfectly fitting the input data. In other words, we intend to prove that interpolating DCNNs generalize well.
The main contributions of the paper are as follows. Our first result rigorously proves that, for an arbitrary DCNN with good learning rates, we can add more layers to build overparameterized DCNNs satisfying the interpolation condition while retaining the good learning rates. Here, "learning rates" refers to the rates of convergence of the output function to the regression function in a regression setting. Our second result establishes the learning rates of DCNNs in general. Previously in [33], convergence rates of approximating functions in some Sobolev spaces by DCNNs were given without generalization analysis. Moreover, learning rates of DCNNs for learning radial functions were given in [24], where
the bias vectors and filters are assumed to be bounded, with bounds depending on the sample size and depth. More recently, learning rates for learning additive ridge functions were presented in [14]. Unlike these existing works, the learning rates we derive do not require any restrictions on the norms of the filters or bias vectors, or on variable structures of the target functions. Without boundedness of the free parameters, the standard covering number arguments do not apply. To overcome this challenge, we derive a special estimate of the pseudo-dimension of the hypothesis space generated by a DCNN. Previously, a pseudo-dimension estimate was given in [3] for fully-connected neural networks, using the piecewise polynomial property of the activation function. We apply our pseudo-dimension estimate to, in turn, bound the empirical covering number of the hypothesis space. In this way, we achieve our results without restrictions on the free parameters.
Combining our first and second results, we prove that for any input data, there exist overparameterized DCNNs which interpolate the data and achieve a good learning rate. This third result provides theoretical support for the possible existence of benign overfitting in the DCNN setting.
The rest of this paper is organized as follows. In Section 2, we introduce notations and definitions
used throughout the paper, including the definition of DCNNs to be studied. In Section 3, we present
our main results that describe how a DCNN achieves benign overfitting. The proof of our first result
is given in Section 4, and the proofs of our second and third results are provided in Section 5. In
Section 6, we present the results of numerical experiments which corroborate our theoretical findings.
Lastly, in Section 7, we present some discussions and compare our work with the existing literature.
2. Problem Formulation
In this section, we define the DCNNs to be studied in this paper and the corresponding hypothesis
space (Subsection 2.1). Then, we introduce the regression setting with data and the regression function
(Subsection 2.2).
2.1. Deep Convolutional Neural Networks and the Corresponding Hypothesis Space
To begin with, we formulate the 1-D convolution. Let $w = \{w_j\}_{j=-\infty}^{+\infty}$ be a filter supported in $\{0, 1, \ldots, s\}$ for some filter length $s \in \mathbb{N}$, which means $w_j \neq 0$ only for $0 \leq j \leq s$. Suppose $x = \{x_j\}_{j=-\infty}^{+\infty}$ is another sequence supported in $\{1, 2, \ldots, d\}$ for some $d \in \mathbb{N}$, denoted as an input vector $x = (x_1, \ldots, x_d)^T \in \mathbb{R}^d$ for networks in the following. The 1-D convolution of $w$ with $x$, denoted by $w * x$, is defined as
$$(w * x)_i = \sum_{k \in \mathbb{Z}} w_{i-k} x_k = \sum_{k=1}^{d} w_{i-k} x_k, \qquad i \in \mathbb{Z}. \tag{2.1}$$
We can see from (2.1) that the convolved sequence $w * x$ is supported in $\{1, 2, \ldots, d+s\}$ and can be expressed in the following matrix form:
$$\big[(w * x)_i\big]_{i=1}^{d+s} = T^w x, \tag{2.2}$$
where
$$(T^w)_{i+\ell,\, i} = w_\ell \ \text{ for } i = 1, 2, \ldots, d \text{ and } \ell = 0, 1, \ldots, s, \qquad (T^w)_{i,j} = 0 \ \text{ otherwise}. \tag{2.3}$$
Here $T^w$ is a $(d+s) \times d$ sparse Toeplitz-type matrix, often referred to as the "convolutional matrix." The sparsity of $T^w$ can be attributed to its large number of zero entries. This approach is known as "zero-padding": we have expanded the vector $x \in \mathbb{R}^d$ to a sequence on $\mathbb{Z}$ by adding zero entries outside the support $\{1, 2, \ldots, d\}$.
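To make this construction concrete, the short sketch below (our own illustration in Python/NumPy; the helper name conv_matrix is not from the paper) builds the $(d+s) \times d$ matrix $T^w$ of (2.3) and checks that $T^w x$ coincides with the direct 1-D convolution (2.1).

```python
import numpy as np

def conv_matrix(w, d):
    """Build the (d+s) x d Toeplitz-type convolutional matrix T^w of (2.3)
    for a filter w = (w_0, ..., w_s) and input dimension d (zero-padding)."""
    s = len(w) - 1
    T = np.zeros((d + s, d))
    for k in range(d):               # columns k = 1, ..., d (0-based here)
        for ell in range(s + 1):     # (T^w)_{k+ell, k} = w_ell
            T[k + ell, k] = w[ell]
    return T

# Check (2.2) against the direct convolution (2.1): (w*x)_i = sum_k w_{i-k} x_k.
rng = np.random.default_rng(0)
d, s = 8, 3
w = rng.standard_normal(s + 1)       # filter supported on {0, ..., s}
x = rng.standard_normal(d)           # input vector x in R^d
assert np.allclose(conv_matrix(w, d) @ x, np.convolve(w, x))   # both of length d + s
```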
Now we define DCNNs by means of convolutional matrices. We take the ReLU activation function $\sigma : \mathbb{R} \to \mathbb{R}$ given by $\sigma(u) = \max\{0, u\}$. It acts on vectors componentwise.
Definition 1. A DCNN of depth $J \in \mathbb{N}$ consists of a sequence of function vectors $\{h^{(j)} : \mathbb{R}^d \to \mathbb{R}^{d_j}\}_{j=1}^{J}$ of widths $\{d_j := d + js\}_{j=0}^{J}$, defined with a sequence of filters $\mathbf{w} = \{w^{(j)}\}_{j=1}^{J}$, each of filter length $s \in \mathbb{N}$, and a sequence of bias vectors $\mathbf{b} = \{b^{(j)} \in \mathbb{R}^{d_j}\}$, by $h^{(0)}(x) = x$ and iteratively
$$h^{(j)}(x) = \sigma\big(T^{(j)} h^{(j-1)}(x) - b^{(j)}\big), \qquad j = 1, 2, \ldots, J, \tag{2.4}$$
where $T^{(j)} := T^{w^{(j)}} = \big[w^{(j)}_{i-k}\big]_{1 \leq i \leq d_{j-1}+s,\ 1 \leq k \leq d_{j-1}}$ is a $(d_{j-1}+s) \times d_{j-1}$ convolutional matrix. The hypothesis space generated by this DCNN is given by
$$\mathcal{H}_{J,s} = \mathrm{span}\left\{ c \cdot h^{(J)}(x) + a : \ c \in \mathbb{R}^{d_J},\ a \in \mathbb{R},\ \mathbf{w},\ \mathbf{b} \right\}. \tag{2.5}$$
We often take the bias vector $b^{(j)}$ of the so-called "identical-in-middle" form
$$b^{(j)} = \big[b_1, \cdots, b_{s-1}, \underbrace{b_s, \cdots, b_s}_{d_j - 2s + 2}, b_{d_j - s + 2}, \cdots, b_{d_j}\big]^T \tag{2.6}$$
with $d_j - 2(s-1)$ repeated entries in the middle. This special shape of the bias vector $b^{(j)}$, together with the sparsity of the convolutional matrix $T^w$, tells us that the $j$-th layer of the DCNN involves only $(s+1) + (2s-1) = 3s$ free parameters.
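The following sketch, continuing the previous one (it reuses conv_matrix and fills all parameters with random placeholders), illustrates the forward pass (2.4) with bias vectors of the identical-in-middle form (2.6) and evaluates one element $c \cdot h^{(J)}(x) + a$ of the hypothesis space (2.5).

```python
import numpy as np

rng = np.random.default_rng(1)

def identical_in_middle_bias(d_j, s):
    """Bias vector of the form (2.6): s-1 free entries at each end and one
    value repeated d_j - 2(s-1) times in the middle (2s-1 free parameters)."""
    head = rng.standard_normal(s - 1)
    middle = np.full(d_j - 2 * (s - 1), rng.standard_normal())
    tail = rng.standard_normal(s - 1)
    return np.concatenate([head, middle, tail])

def dcnn_output(x, filters, c, a):
    """Forward pass (2.4), h^(j)(x) = sigma(T^(j) h^(j-1)(x) - b^(j)),
    followed by the linear output c . h^(J)(x) + a as in (2.5)."""
    h = x
    for w_j in filters:
        s = len(w_j) - 1                      # filter length
        T = conv_matrix(w_j, len(h))          # (d_{j-1}+s) x d_{j-1}, from the sketch above
        b = identical_in_middle_bias(len(h) + s, s)
        h = np.maximum(T @ h - b, 0.0)        # ReLU acting componentwise
    return float(c @ h + a)

# Depth J = 3, filter length s = 3, input dimension d = 8; widths 11, 14, 17.
d, s, J = 8, 3, 3
x = rng.standard_normal(d)
filters = [rng.standard_normal(s + 1) for _ in range(J)]
c = rng.standard_normal(d + J * s)            # d_J = d + Js
print(dcnn_output(x, filters, c, a=0.0))
```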
To reduce data redundancy, we introduce a downsampling operator $D_m : \mathbb{R}^K \to \mathbb{R}^{\lfloor K/m \rfloor}$ with a scaling parameter $m \in \mathbb{N}$ by
$$D_m(v) = \big(v_{im}\big)_{i=1}^{\lfloor K/m \rfloor}, \qquad v \in \mathbb{R}^K, \tag{2.7}$$
where $\lfloor u \rfloor$ denotes the integer part of $u > 0$. In other words, the downsampling operator $D_m$ only "picks up" the $m$-th, $2m$-th, $\ldots$, $\lfloor K/m \rfloor m$-th entries of $v$.
Definition 2. A downsampled DCNN of depth $J$ with downsampling at layer $J_1 \in \{1, \ldots, J-1\}$ has widths $d_0 = d$ and
$$d_j = \begin{cases} d_{j-1} + s, & \text{if } j \neq J_1, \\ \lfloor (d_{j-1} + s)/d \rfloor, & \text{if } j = J_1, \end{cases} \tag{2.8}$$
and function vectors $\{h^{(j)} : \mathbb{R}^d \to \mathbb{R}^{d_j}\}_{j=1}^{J}$ given by $h^{(0)}(x) = x$ and iteratively
$$h^{(j)}(x) = \begin{cases} \sigma\big(T^{(j)} h^{(j-1)}(x) - b^{(j)}\big), & \text{if } j \neq J_1, \\ \sigma\big(D_d\big(T^{(j)} h^{(j-1)}(x)\big) - b^{(j)}\big), & \text{if } j = J_1. \end{cases} \tag{2.9}$$
In other words, the downsampling operation aims to reduce the width of a certain layer of the DCNN while preserving information on data features. The hypothesis space is defined in the same way as (2.5).
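For illustration, here is a minimal sketch of Definition 2 (again reusing conv_matrix and rng from the earlier sketches, with zero bias vectors purely to keep the example short): the downsampling operator $D_m$ of (2.7) and a forward pass that applies $D_d$ at the designated layer $J_1$, as in (2.9).

```python
def downsample(v, m):
    """D_m of (2.7): keep the m-th, 2m-th, ..., floor(K/m)*m-th entries of v."""
    K = len(v)
    return v[m - 1 : (K // m) * m : m]        # 0-based slice over 1-based positions m, 2m, ...

def downsampled_dcnn(x, filters, J1, d):
    """Forward pass (2.9): apply D_d right after the convolution at layer J1;
    zero bias vectors are used here purely for illustration."""
    h = x
    for j, w_j in enumerate(filters, start=1):
        v = conv_matrix(w_j, len(h)) @ h      # conv_matrix from the first sketch
        if j == J1:
            v = downsample(v, d)              # width becomes floor((d_{j-1}+s)/d), as in (2.8)
        h = np.maximum(v, 0.0)                # sigma(v - b) with b = 0
    return h

# Example: d = 8, s = 3, J = 4, downsampling at layer J1 = 3.
d, s, J, J1 = 8, 3, 4, 3
x = rng.standard_normal(d)
filters = [rng.standard_normal(s + 1) for _ in range(J)]
print(len(downsampled_dcnn(x, filters, J1, d)))   # widths 8 -> 11 -> 14 -> 2 -> 5
```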
In this paper, we take bias vectors $b^{(j)}$ satisfying (2.6) for $j = 1, 2, \ldots, J-1$. If no additional constraints are imposed, the number of free parameters of an output function from the hypothesis space (including filters, biases, and coefficients) equals
$$\sum_{j=1}^{J-1} (s + 1 + 2s - 1) + (s+1) + d_J + d_J + 1 = 3s(J-1) + s + 2 + 2d_J. \tag{2.10}$$
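For example (our own numbers, for illustration only), with input dimension $d = 32$, filter length $s = 2$, and depth $J = 5$, so that $d_J = d + Js = 42$, the count (2.10) gives $3 \cdot 2 \cdot (5-1) + 2 + 2 + 2 \cdot 42 = 112$ free parameters.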
DCNNs considered in this paper are based on a “zero-padding” approach and have increasing
widths. In the literature, DCNNs without zero-padding have also been introduced [15, 19], and they
have decreasing widths, leading to limited approximation abilities and the necessity of channels for
learning. Moreover, DCNNs induced by group convolutions were studied with nice approximation
properties presented in [28].
2.2. Data and Regression Function
Consider a training sample $D := \{z_i = (x_i, y_i)\}_{i=1}^{n}$ drawn independently and identically distributed from an unknown distribution $\rho$ on $Z := \Omega \times Y$. Throughout this paper, we assume that $\Omega$ is a closed bounded subset of $\mathbb{R}^d$ and $Y = [-M, M]$ for some $M \geq 1$.
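As a toy illustration of this sampling setup (our own placeholder choices of $\Omega$, $\rho$, and the target function, not ones used in the paper), one may generate a synthetic training sample as follows.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder setting: Omega = [0, 1]^d and Y = [-M, M] with M = 1.
n, d, M = 200, 8, 1.0
X = rng.uniform(0.0, 1.0, size=(n, d))                   # x_i in Omega, drawn i.i.d.
f_star = lambda x: np.sin(x.sum())                       # hypothetical target function
noise = rng.uniform(-0.2, 0.2, size=n)
Y_vals = np.clip(np.array([f_star(x) for x in X]) + noise, -M, M)   # y_i in [-M, M]
D = list(zip(X, Y_vals))                                 # training sample D = {(x_i, y_i)}_{i=1}^n
```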