A NEW PERSPECTIVE FOR UNDERSTANDING GENERALIZATION
GAP OF DEEP NEURAL NETWORKS TRAINED WITH LARGE
BATCH SIZES
Oyebade K. Oyedotun
Spire Global, Luxembourg
oyebade.oyedotun@spire.com
Konstantinos Papadopoulos
Post Luxembourg, Luxembourg
papad.konst@gmail.com
Djamila Aouada
Interdisciplinary Centre for Security, Reliability and Trust (SnT),
University of Luxembourg, Luxembourg
djamila.aouada@uni.lu
ABSTRACT
Deep neural networks (DNNs) are typically optimized using various forms of the mini-batch gradient descent algorithm. A major motivation for mini-batch gradient descent is that, with a suitably chosen batch size, available computing resources can be optimally utilized (including parallelization) for fast model training. However, many works report a progressive loss of model generalization when the training batch size is increased beyond certain limits; this scenario is commonly referred to as the generalization gap. Although several works have proposed different methods for alleviating the generalization gap problem, a consensus account for understanding the generalization gap is still lacking in the literature. This is especially important given that recent works have observed that several proposed solutions to the generalization gap problem, such as learning rate scaling and an increased training budget, do not actually resolve it. As such, our main exposition in this paper is to investigate and provide new perspectives on the source of generalization loss for DNNs trained with a large batch size. Our analysis suggests that a large training batch size results in increased near-rank loss of units' activation (i.e. output) tensors, which consequently impacts model optimization and generalization.
Extensive experiments are performed for validation on popular DNN models such as VGG-16, resid-
ual network (ResNet-56) and LeNet-5 using CIFAR-10, CIFAR-100, Fashion-MNIST and MNIST
datasets.
Keywords Deep neural network · generalization gap · large batch size · optimization · near-rank loss
1 Introduction
Small neural network models typically contain 2 to 4 hidden layers with a moderate number of hidden units per
layer [1; 2; 3]. Hence, small DNN models can be trained by standard gradient descent [4], where all the available
training samples are used for model updates. However, there has been a consistent increase in the difficulty of learning
problems tackled over time; recent benchmarking datasets typically contain thousands or several millions of training
samples, e.g., Places [5], ImageNet [6] and UMD faces [7] datasets. Thus, there has been a consistent increase in the
depth and number of DNN model parameters ever since. Subsequently, mini-batch gradient descent, which uses a specified portion of the available training samples for model updates at every iteration, has been favoured for training large DNN models, in view of fast training time and effective utilization of available computing resources. Therefore, fast training of large DNNs has become an important research problem.

Footnotes: This work was funded by the National Research Fund (FNR), Luxembourg, under the project references CPPP17/IS/11643091/IDform/Aouada and BRIDGES2020/IS/14755859/MEET-A/Aouada. This work was done while at SnT, University of Luxembourg, Luxembourg.
One simple approach for speeding up DNN training is employing large batch sizes to fit the capacity of available Graphics Processing Unit (GPU) memory; with the parallelization of recent high-end GPUs, batch sizes of up to several thousand have been reported [8; 9]. Unfortunately, it has also been observed that increasing the training batch size beyond certain limits results in the degradation of model generalization; this performance loss with increasing batch size is commonly referred to as the generalization gap [10]. Consequently, most works [11; 12] have
been dedicated to proposing various approaches for reducing the generalization gap of models trained with large batch
sizes; only a few works [13; 14; 15] have studied the cause of such performance degradation.
As such, in this paper, our high-level exposition is to investigate the source of generalization loss observed in DNN
models trained with large batch sizes. Specifically, our major contributions are as follows:
1. A novel explanation and interesting insights into why DNN models trained with large batch sizes incur generalization loss, based on the near-rank loss of hidden units' activation tensors. To the best of our knowledge, this perspective on the generalization gap is the first in the literature.
2. Extensive experimental results on standard datasets (i.e. CIFAR-10, CIFAR-100, Fashion-MNIST and
MNIST) and popular DNN models (i.e. VGG-16, ResNet-56 and LeNet-5) are reported for the validation
of the positions given.
The remainder of this paper is organized as follows. Related works are discussed in Section 2. In Section 3, the
background and problem statement are presented. Section 4 presents our proposed analysis of the generalization gap
problem. Section 5 reports the supporting experimental results. Main insights from formal and experimental results
are given in Section 6. The paper is concluded in Section 7.
2 Related work
It is well-known that DNNs trained with large data batches have lower generalization capacity than those trained with small batches, i.e. the generalization gap problem. In [13], the concept of sharp and flat minimizers is studied in relation to the generalization gap; it is noted that increasing the number of training iterations does not alleviate the problem. It is further observed that DNNs trained with large data batches converge to sharp minima, while DNNs trained with small data batches converge to flat minima. Interestingly, sharp and flat minima have been shown to lead to poor and good model generalization, respectively [16; 17]. An additional explanation in [13] is that the noisy gradient estimates from small training data batches allow optimization to escape the basins of sharp minima in which the model would be stuck with large data batches, whose gradients have less stochasticity.
The work in [10] explores DNN optimization as high-dimensional particles performing a random walk on a random potential; it posits the concept of diffusion rates for different training batch sizes. It suggests that DNNs trained with large data batches have slower diffusion rates, and thus require an exponential number of model updates to reach the flat-minima regime.
In [14], it is shown that the landscape of robust optimization corresponds to flat minima and is less susceptible to adversarial attacks; this training regime is more easily reached using small data batches. Further analysis shows that large training data batches result in models converging to solutions with a larger Hessian spectrum [14]. It is tempting to suppose that the prob-
lem of generalization gap seen in models trained with large batch sizes is related to the smaller number of parameters’
updates they employ relative to models trained with small batch sizes; this direction of explanation was presented
in [10; 12]. Subsequently, it may be expected that training models with large batch sizes for more epochs will indeed
resolve the generalization gap problem. Unfortunately, it has been observed in [13; 18] that the generalization gap
persists with an unlimited training budget. The work in [18] also observed that learning rate scaling, which has been proposed as a solution, does not resolve the generalization gap when the batch size is very large. Subsequently, it is natural to
pursue new and interesting explanations for generalization gap, which can result in the formulation of better solutions
for the problem.
The addition of noise to computed gradients for alleviating the problem of generalization gap is seen in [19]. Therein,
covariance noise was computed and added to the original gradients, and it was subsequently shown to mitigate the
loss of model generalization. In another work [20], it was observed that gradient noise can be decomposed as the product of the gradient matrix and a sampling noise that depends on how data batches are sampled. Importantly, specially computed noise was used to regularize gradient descent and thereby improve model generalization. In [21], a local extragradient method was proposed for smoothing the gradients computed in large-batch distributed training, improving the optimization of models trained with large batch sizes. The approach showed interesting results on different DNN architectures such as ResNet, LSTM and transformers. Furthermore, in [22], it was reported that the ratio of batch size to learning rate negatively correlates with model generalization. As such, for good model generalization, it was advocated that the batch size should be kept as small as possible, and the learning rate as large as possible. It is shown in [23] that, contrary to earlier claims, both SGD and a second-order approximation of gradient descent, Kronecker-Factored Approximate Curvature (K-FAC), exhibit generalization loss for DNNs trained with large
batch sizes.
3 Background and Problem Statement
This section discusses the background of the work, which reflects the setting under which we investigate the generalization gap problem. Subsequently, the problem statement that formalizes the generalization gap problem is presented.
3.1 Background on batch size impact for DNNs
This section discusses preliminaries for the impact of batch size on DNN generalization. Given an arbitrary dataset $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$ with input $\mathbf{x} \in \mathbb{R}^{h^n_0}$, target output $\mathbf{y} \in \mathbb{R}^{c}$ and sample index $n$, the pair $(\mathbf{x}_n, \mathbf{y}_n)$ is typically fed in batches as $\mathbf{X} \in \mathbb{R}^{h^n_0 \times b_s}$ and $\mathbf{Y} \in \mathbb{R}^{c \times b_s}$ during DNN training; where $h^n_0$ and $b_s$ are the input dimension and batch size, respectively. For multilayer perceptron (MLP) models, the batch output at layer $l$ is of the form $H(\mathbf{X})^l \in \mathbb{R}^{h^n_l \times b_s}$, where $h^n_l$ is the number of units in hidden layer $l$. Given the input, $H(\mathbf{X})^{l-1} \in \mathbb{R}^{h^n_{l-1} \times b_s}$, to an arbitrary DNN layer $l$ parameterized by the weight $\mathbf{W}^l \in \mathbb{R}^{h^n_l \times h^n_{l-1}}$, we can write the transformation, $H(\mathbf{X})^l$, learned as
$$H(\mathbf{X})^l = \phi(\mathbf{W}^l H(\mathbf{X})^{l-1}), \qquad (1)$$
where $\phi$ is the element-wise activation function; the bias term is omitted. Assuming the DNN has $L$ layers with output layer weight $\mathbf{W}^L \in \mathbb{R}^{p \times h^n_L}$, the final output, $\mathbf{Y}$, is
$$\mathbf{Y} = \phi(\mathbf{W}^L \phi(\mathbf{W}^{L-1} \cdots \phi(\mathbf{W}^1 H(\mathbf{X})^0))), \qquad (2)$$
where $H(\mathbf{X})^0$ is the input, $\mathbf{X}$, to the DNN.
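To make the notation above concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) performs the batched forward pass of Eqns. (1)-(2) for a small MLP; the layer sizes, batch size and the ReLU/identity choice of $\phi$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: input size h_0, two hidden layers, c output classes.
h0, h1, h2, c = 784, 256, 128, 10
bs = 64                                   # batch size b_s

# Batch of inputs X in R^{h_0 x b_s}; columns are individual samples.
X = rng.standard_normal((h0, bs))

# Layer weights W^l in R^{h_l x h_{l-1}}; bias terms omitted as in Eqn. (1).
W1 = rng.standard_normal((h1, h0)) * np.sqrt(2.0 / h0)
W2 = rng.standard_normal((h2, h1)) * np.sqrt(2.0 / h1)
W3 = rng.standard_normal((c, h2)) * np.sqrt(2.0 / h2)

def phi(Z, linear=False):
    """Element-wise activation: identity for a linear DNN, ReLU otherwise."""
    return Z if linear else np.maximum(Z, 0.0)

# Eqn. (1) applied layer by layer; each H^l has shape (h_l, b_s).
H1 = phi(W1 @ X)
H2 = phi(W2 @ H1)
Y = phi(W3 @ H2)                          # Eqn. (2): final output of shape (c, b_s)

print(H1.shape, H2.shape, Y.shape)        # (256, 64) (128, 64) (10, 64)
```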
3.2 Problem statement
Consider a DNN model denoted $M$ with specific architecture, parameter initialization scheme and hyperparameter settings, except for the batch size. Let the classification test error of $M$ be $M^{test}_{err}$. Our main exposition in this paper is understanding why, in practice, we observe the relation
$$M^{test}_{err} \propto b_s : b_s \gg 1, \qquad (3)$$
as seen in the works [13; 10; 14; 18]. Increasing $b_s$ generally leads to an increase in the model error rate. How the
increase in $b_s$ changes the generalization dynamics of trained DNNs has remained a challenging research question with different directions of investigation. Most of the works that have attempted to unravel the origin of the generalization gap approached it from perspectives that decouple model optimization and generalization performance. In contrast to earlier works, we investigate the generalization gap problem simultaneously from both the optimization and generalization perspectives.
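To make the relation in Eqn. (3) concrete, the following toy sketch (our own illustration, not the paper's experimental setup, which uses VGG-16, ResNet-56 and LeNet-5 on CIFAR-10, CIFAR-100, Fashion-MNIST and MNIST) trains the same small MLP with a small and a large batch size under otherwise identical settings and compares test accuracy. On such a tiny dataset the effect can be small or noisy, so the output should be read only qualitatively.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X / 16.0, y, test_size=0.3, random_state=0)

# Same architecture, learning rate and training budget; only the batch size differs.
for bs in (16, 1024):                     # a large value is clipped to the training-set size
    clf = MLPClassifier(hidden_layer_sizes=(128,), solver="sgd",
                        batch_size=bs, learning_rate_init=0.05,
                        max_iter=200, random_state=0)
    clf.fit(X_tr, y_tr)
    print(f"batch size {bs:4d}: test accuracy = {clf.score(X_te, y_te):.3f}")
```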
4 Proposed analysis of the generalization gap problem
Herein, the proposed analysis of the generalization gap in relation to the rank loss of hidden representations, information loss and optimization success is presented. However, the relevance of studying linear DNNs and some basic concepts are first introduced.
4.1 Relevance of studying linear DNNs
The theoretical analysis of practical DNNs is generally complicated, and thus the linear activation (no non-linearities) assumption, along with other necessary assumptions, is often made for tractability. Our formal analysis in this paper likewise assumes linear activations (i.e. linear DNNs). As such, following existing literature, we first emphasize
the relevance of studying linear DNNs, including their usefulness for understanding DNNs with non-linear activation
functions (non-linear DNNs). In [24], it was observed that analytical results obtained using linear DNNs conform
with results from non-linear DNNs. This is expected given that the loss function of a linear multilayer DNN is non-
convex with respect to the model parameters, similar to non-linear DNNs. Furthermore, the relevance of linear DNNs
for analytical study is seen in the work [25]. Interestingly, we note that much stronger assumptions are common in
the literature. For example, the assumption of no nonlinearities and no batch-norm can be found in [26]. In [27], the assumptions made are (i) no nonlinearities, (ii) no batch-norm, (iii) the number of units in the different layers being more than the number of units in the input or output layers, and (iv) whitened data. The assumptions of (i) no batch-norm,
(ii) infinite number of hidden units, and (iii) infinite number of training samples can be found in [28]. In [29], the
assumptions of (i) no nonlinearities, (ii) no batch-norm, (iii) the thinnest layer is either the input layer or the output
layer, and (iv) arbitrary convex differentiable loss can be found. Nonetheless, the results from the aforementioned
works have proven quite useful in practice. We show later on in our experiments that both non-linear and linear DNNs
exhibit similar training characteristics for the generalization gap problem, which is not surprising.
4.2 Preliminaries
For analytical simplicity, we consider MLP networks that represent hidden layer units’ activations with matrices.
However, we show later on that the extension of the analytical results to Convolutional Neural Networks (CNNs), which represent hidden layer units' activations as 4-dimensional tensors, is straightforward.
The rank of hidden layer units' activations, $H(\mathbf{X})^l \in \mathbb{R}^{h^n_l \times b_s} : h^n_l \geq b_s$, describes the maximum number of linearly independent columns of $H(\mathbf{X})^l$. The rank of $H(\mathbf{X})^l$ can be obtained as its number of non-zero singular values, $S^l_{nz}$, via singular value decomposition (SVD). Specifically, a matrix with full rank means that all its singular values are non-zero. Conversely, the number of zero singular values, $S^l_z$, shows the degree of rank loss.
Definition 1 Near-rank loss in this paper is taken to mean a scenario where the singular values, denoted by $\sigma$, are very small (i.e. $\sigma \ll 1$).
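As a concrete illustration of these quantities (our own sketch; the tolerance used to operationalize $\sigma \ll 1$ is an arbitrary assumption), the snippet below computes the singular values of a hidden activation matrix via SVD, counts the non-zero ones (the rank), counts the near-zero ones (near-rank loss in the sense of Definition 1), and reports the condition number.

```python
import numpy as np

def near_rank_stats(H, tol=1e-3):
    """Singular-value statistics of an activation matrix H of shape (h_l, b_s).

    Returns (S_nz, S_near_zero, condition_number):
      S_nz             -- number of strictly non-zero singular values (the rank),
      S_near_zero      -- number of singular values below `tol`, a proxy for
                          near-rank loss (tol is an arbitrary illustrative threshold),
      condition_number -- sigma_max / sigma_min.
    """
    s = np.linalg.svd(H, compute_uv=False)            # singular values, descending order
    s_nz = int(np.sum(s > 0.0))
    s_near_zero = int(np.sum(s < tol))
    cond = float(s[0] / max(s[-1], np.finfo(H.dtype).tiny))
    return s_nz, s_near_zero, cond

# Example: ReLU activations of a random layer with h_l = 256 units and b_s = 64.
rng = np.random.default_rng(0)
H = np.maximum(rng.standard_normal((256, 64)), 0.0)
print(near_rank_stats(H))
```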
Our main analytical observations for the problem of generalization gap are summarized in Section 4.3.
4.3 Units’ activation near-rank loss and optimization
In this section, the limits of the singular values of random matrices are related to their dimensions. For stating the first proposition, we consider that the hidden layer representations, $H(\mathbf{X})^l \in \mathbb{R}^{h^n_l \times b_s}$, are samples of some unknown distribution, $p(H(\mathbf{X})^l)$, and then characterize the behaviour of the maximum and minimum singular values for $b_s \to \infty$. In the second proposition, we assume that the entries in $H(\mathbf{X})^l \in \mathbb{R}^{h^n_l \times b_s}$ follow a Gaussian distribution. Subsequently, we characterize the limits of the singular values of $H(\mathbf{X})^l$ for $b_s \in \mathbb{R}$.
Proposition 1 (The asymptotic behaviour of the singular values of a matrix with increase in dimension): For a matrix $\mathbf{A} \in \mathbb{R}^{m \times n} : m \geq n$, from the Marchenko-Pastur law, the singular values, $\sigma$, concentrate in the range $[\sigma_{\min}(\mathbf{A}) \approx \sqrt{m} - \sqrt{n},\ \sigma_{\max}(\mathbf{A}) \approx \sqrt{m} + \sqrt{n}]$ as $m, n \to \infty$; where $\sigma_{\max}(\mathbf{A})$ and $\sigma_{\min}(\mathbf{A})$ are the maximum and minimum singular values of $\mathbf{A}$, respectively.
Proof. See Rudelson-Vershynin [30] for proof.
Remark 1 In fact, [30] notes that Proposition 1 holds for general distributions. As such, we can conclude from Proposition 1 that $H(\mathbf{X})^l \in \mathbb{R}^{h^n_l \times b_s} : b_s \to \infty$ results in small and large distribution ranges, which are admissible for $\sigma_{\min}(H(\mathbf{X})^l)$ and $\sigma_{\max}(H(\mathbf{X})^l)$, respectively. Accordingly, as $b_s$ increases, we have the following scenarios: (i) a higher probability for $H(\mathbf{X})^l$ to have a small $\sigma_{\min}(H(\mathbf{X})^l)$; and (ii) a higher probability for $H(\mathbf{X})^l$ to have a larger condition number.
Proposition 2 (The non-asymptotic behaviour of the singular values of a matrix with increase in dimension): For a Gaussian random matrix $\mathbf{A} \in \mathbb{R}^{m \times n} : m \geq n$, the expected minimum and maximum singular values are given as
$$\sqrt{m} - \sqrt{n} \leq E\sigma_{\min}(\mathbf{A}) \leq E\sigma_{\max}(\mathbf{A}) \leq \sqrt{m} + \sqrt{n}. \qquad (4)$$
Proof. See Theorem 2.6 in Rudelson-Vershynin [30].
In Section A1.1 and Section A1.2 of the supplementary material, we empirically study the distributions of singular
values and expected values of the minimum singular values for random matrices, respectively, where the entries are
drawn from popular distributions including the Gaussian, uniform and lognormal. Subsequently, for the aforemen-
tioned distributions, in Section A1.2 of the supplementary material, we show that for a fixed $m$, $E\sigma_{\min}(\mathbf{A}) \to 0$ as $n$ becomes large. This observation is akin to that in Eqn. (4), suggesting that it also applies to other popular distributions.
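The short check below (our own sanity check, not the paper's supplementary code) draws Gaussian matrices with a fixed number of rows $m$ (standing in for $h^n_l$) and a growing number of columns $n$ (standing in for $b_s$), and compares the empirical extreme singular values against the bounds of Eqn. (4); as $n$ grows toward $m$, the average minimum singular value shrinks, consistent with Remark 2 that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 512, 20                       # fixed row count (stand-in for h_l), Monte Carlo trials

for n in (32, 64, 128, 256, 512):         # growing column count (stand-in for b_s), n <= m
    s_min, s_max = [], []
    for _ in range(trials):
        A = rng.standard_normal((m, n))   # Gaussian random matrix, as in Proposition 2
        s = np.linalg.svd(A, compute_uv=False)   # singular values, descending order
        s_min.append(s[-1])
        s_max.append(s[0])
    lower, upper = np.sqrt(m) - np.sqrt(n), np.sqrt(m) + np.sqrt(n)
    print(f"n={n:4d}  mean s_min={np.mean(s_min):6.2f} (lower bound {lower:6.2f})  "
          f"mean s_max={np.mean(s_max):6.2f} (upper bound {upper:6.2f})")
```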
Remark 2 Given $H(\mathbf{X})^l_1 \in \mathbb{R}^{h^n_l \times b_{s1}} : h^n_l > b_{s1}$ and $H(\mathbf{X})^l_2 \in \mathbb{R}^{h^n_l \times b_{s2}} : h^n_l > b_{s2}$ with $b_{s2} > b_{s1}$, then $E\sigma_{\min}(H(\mathbf{X})^l_1) > E\sigma_{\min}(H(\mathbf{X})^l_2)$ using Proposition 2. Importantly, given that $h^n_l$ is fixed as it is typical for