In this paper, we study an important family of neural networks known as convolutional neural
networks. Given that neural networks, in general, are powerful and versatile, researchers have been
working to further improve their computational efficiency. When the input dimension is large, as in
AlexNet [19] whose inputs have dimension about 150,000, fully-connected neural networks are not
computationally feasible. Structures are therefore often imposed on neural networks to reduce the number
of trainable free parameters and obtain feasible deep learning algorithms for various practical tasks [20].
The structure we are interested
in is induced by one-dimensional convolution (1-D convolution), and the resulting networks are deep
convolutional neural networks (DCNNs) [31]. The convolutional structure of DCNNs reduces the
computational complexity and is believed to capture local shift-invariance properties of image and
speech data. These features contribute to the massive popularity of DCNNs in image processing
and speech recognition.
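As a brief illustration of this induced structure (the notation here is ours, not taken from [31]): a filter $w = (w_k)_{k=0}^{s}$ supported on $\{0, 1, \dots, s\}$ acts on a vector $v \in \mathbb{R}^{d}$ by
\[
(w * v)_i \;=\; \sum_{j=1}^{d} w_{i-j}\, v_j, \qquad i = 1, \dots, d+s,
\]
so the weight matrix of a convolutional layer is a Toeplitz-type band matrix with entries $w_{i-j}$ (zero whenever $i-j \notin \{0, \dots, s\}$), carrying only $s+1$ free parameters instead of the $d(d+s)$ of a fully-connected layer of the same size.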
In recent years, there has been a line of work studying overparameterization in deep learning. It
is frequently observed that overparameterized deep neural networks, such as DCNNs, generalize well
while achieving zero training error [6]. This phenomenon, known as benign overfitting, seems to
contradict the classical bias-variance trade-off in statistical theory. Such a mismatch between observations
and classical theory has sparked extensive research attempting to understand how benign overfitting occurs.
Theoretical work studying benign overfitting was initiated in [4], where a linear regression setting with
Gaussian data and noise was considered. That work presented conditions under which minimum-norm interpolators
generalize well. In a non-linear setting induced by the ReLU activation function, benign overfitting was
verified for deep fully-connected neural networks in [22]. In addition, a recent work [10] shows that
shallow neural networks with shared weights, trained by gradient descent, can achieve arbitrarily
small training error.
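To make the linear setting of [4] concrete, here is a minimal sketch in standard notation of ours (not quoted from [4]): for data $X \in \mathbb{R}^{n \times p}$ with $p > n$ and labels $y \in \mathbb{R}^{n}$, assuming $XX^{\top}$ is invertible, the minimum-norm interpolator is
\[
\hat{\theta} \;=\; \arg\min_{\theta \in \mathbb{R}^{p}} \bigl\{ \|\theta\|_{2} : X\theta = y \bigr\} \;=\; X^{\top} \bigl( XX^{\top} \bigr)^{-1} y,
\]
which fits the training data exactly; the conditions in [4] describe when such an interpolator nevertheless has small excess risk.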
In this paper, we study the learning ability of DCNNs in both underparameterized and overparameterized
settings. We aim to show that an overparameterized DCNN can be constructed to have the same
convergence rate as a given underparameterized one while perfectly fitting the input data. In other
words, we intend to prove that interpolating DCNNs generalize well.
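For clarity (again in our own notation, not the paper's): given a sample $\{(x_i, y_i)\}_{i=1}^{m}$, the interpolation (zero training error) condition on a DCNN output function $f$ reads
\[
f(x_i) \;=\; y_i, \qquad i = 1, \dots, m,
\]
and “generalize well” means that $f$ still converges to the target regression function at a good rate as the sample size $m$ grows.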
The main contributions of the paper are as follows. Our first result rigorously proves that, for an
arbitrary DCNN with good learning rates, we can add layers to build an overparameterized DCNN
satisfying the interpolation condition while retaining those learning rates. Here, “learning rates” refers
to rates of convergence of the output function to the regression function in a regression setting. Our
second result establishes learning rates of DCNNs in general. Previously, [33] gave rates of approximation
by DCNNs of functions in certain Sobolev spaces, without any generalization
analysis. Moreover, learning rates of DCNNs for learning radial functions were given in [24], where