A NEW PERSPECTIVE FOR UNDERSTANDING GENERALIZATION
GAP OF DEEP NEURAL NETWORKS TRAINED WITH LARGE
BATCH SIZES
Oyebade K. Oyedotun
Spire Global, Luxembourg
oyebade.oyedotun@spire.com
Konstantinos Papadopoulos
Post Luxembourg, Luxembourg
papad.konst@gmail.com
Djamila Aouada
Interdisciplinary Centre for Security, Reliability and Trust (SnT),
University of Luxembourg, Luxembourg
djamila.aouada@uni.lu
ABSTRACT
Deep neural networks (DNNs) are typically optimized using various forms of the mini-batch gradient descent algorithm. A major motivation for mini-batch gradient descent is that, with a suitably chosen batch size, available computing resources can be optimally utilized (including parallelization) for fast model training. However, many works report a progressive loss of model generalization when the training batch size is increased beyond certain limits; this scenario is commonly referred to as the generalization gap. Although several works have proposed different methods for alleviating the generalization gap problem, a consensus account for understanding the generalization gap is still lacking in the literature. This is especially important given that recent works have observed that several proposed solutions to the generalization gap problem, such as learning rate scaling and an increased training budget, do not actually resolve it. As such, our main exposition in this paper is to investigate and provide new perspectives on the source of generalization loss for DNNs trained with a large batch size. Our analysis suggests that a large training batch size results in increased near-rank loss of units' activation (i.e. output) tensors, which consequently impacts model optimization and generalization.
Extensive experiments are performed for validation on popular DNN models such as VGG-16, resid-
ual network (ResNet-56) and LeNet-5 using CIFAR-10, CIFAR-100, Fashion-MNIST and MNIST
datasets.
Keywords Deep neural network · generalization gap · large batch size · optimization · near-rank loss
1 Introduction
Small neural network models typically contain 2 to 4 hidden layers with a moderate number of hidden units per
layer [1; 2; 3]. Hence, small DNN models can be trained by standard gradient descent [4], where all the available
training samples are used for model updates. However, there has been a consistent increase in the difficulty of learning
problems tackled over time; recent benchmarking datasets typically contain thousands or several millions of training
samples, e.g., Places [5], ImageNet [6] and UMD faces [7] datasets. Thus, there has been a consistent increase in the
depth and number of DNN model parameters ever since. Subsequently, mini-batch gradient descent, which uses a specified portion of the available training samples for model updates at every iteration, has been favoured for training large DNN models, in view of fast training time and effective utilization of available computing resources. Therefore, fast training of large DNNs has become an important research problem.

Footnotes: This work was funded by the National Research Fund (FNR), Luxembourg, under the project references CPPP17/IS/11643091/IDform/Aouada and BRIDGES2020/IS/14755859/MEET-A/Aouada. This work was done while at SnT, University of Luxembourg, Luxembourg.
One simple approach for speeding up DNN training is employing large batch sizes to fit the capacity of available Graphics Processing Unit (GPU) memory; with the parallelization of recent high-end GPUs, batch sizes of up to several thousand have been reported [8; 9]. Unfortunately, it has also been observed that increasing the training batch size beyond certain limits results in the degradation of model generalization; this performance loss with increasing batch size is commonly referred to as the generalization gap [10]. Consequently, most works [11; 12] have
been dedicated to proposing various approaches for reducing the generalization gap of models trained with large batch
sizes; only a few works [13; 14; 15] have studied the cause of such performance degradation.
As such, in this paper, our high-level exposition is to investigate the source of generalization loss observed in DNN
models trained with large batch sizes. Specifically, our major contributions are as follows:
1. A novel explanation and interesting insights into why DNN models trained with large batch sizes incur generalization loss, based on the near-rank loss of hidden units' activation tensors. To the best of our knowledge, this perspective on the generalization gap is the first in the literature.
2. Extensive experimental results on standard datasets (i.e. CIFAR-10, CIFAR-100, Fashion-MNIST and
MNIST) and popular DNN models (i.e. VGG-16, ResNet-56 and LeNet-5) are reported for the validation
of the positions given.
The remainder of this paper is organized as follows. Related works are discussed in Section 2. In Section 3, the
background and problem statement are presented. Section 4 presents our proposed analysis of the generalization gap
problem. Section 5 reports the supporting experimental results. Main insights from formal and experimental results
are given in Section 6. The paper is concluded in Section 7.
2 Related work
It is well-known that DNNs trained with large data batches have lower generalization capacity than those trained with small batches, i.e. the generalization gap problem. In [13], the concept of sharp and flat minimizers is studied in relation to the generalization gap; it is noted that increasing the number of training iterations does not alleviate the problem. It is further observed that DNNs trained with large data batches converge to sharp minima, while DNNs trained with small data batches converge to flat minima. Interestingly, sharp and flat minima have been shown to lead to poor and good model generalization, respectively [16; 17]. An additional explanation in [13] is that the noisy gradient estimates from small training data batches allow optimization to escape the basins of sharp minima in which the model would be stuck with large data batches, whose gradients have less stochasticity.
The work in [10] explores DNN optimization as high-dimensional particles performing a random walk on a random potential; it posits the concept of diffusion rates for different training batch sizes. It suggests that DNNs trained with large data batches have slower diffusion rates, and thus require an exponential number of model updates to reach the flat-minima regime.
In [14], it is shown that the landscape of robust optimization corresponds to flat minima and is less susceptible to adversarial attacks; this training regime is more easily reached using small data batches. Further analysis shows that large training data batches result in models converging to solutions with a larger Hessian spectrum [14]. It is tempting to suppose that the prob-
lem of generalization gap seen in models trained with large batch sizes is related to the smaller number of parameters’
updates they employ relative to models trained with small batch sizes; this direction of explanation was presented
in [10; 12]. Subsequently, it may be expected that training models with large batch sizes for more epochs will indeed
resolve the generalization gap problem. Unfortunately, it has been observed in [13; 18] that the generalization gap
persists with an unlimited training budget. The work in [18] also observed that learning rate scaling, which has been proposed as a solution, does not resolve the generalization gap when the batch size is very large. Subsequently, it is natural to
pursue new and interesting explanations for generalization gap, which can result in the formulation of better solutions
for the problem.
The addition of noise to computed gradients for alleviating the problem of generalization gap is seen in [19]. Therein,
covariance noise was computed and added to the original gradients, and it was subsequently shown to mitigate the
loss of model generalization. In another work [20], it was observed that gradient noise can be decomposed as the product of the gradient matrix and a sampling noise that depends on how data batches are sampled. Importantly, specially computed noise was used to regularize gradient descent and thereby improve model generalization. In [21], a local extragradient method was proposed for smoothing the gradients computed in large-batch distributed training, improving the optimization of models trained with large batch sizes. The approach showed interesting results on different DNN architectures such as ResNet, LSTM and transformers. Furthermore, in [22], it was reported that the ratio of batch size to learning rate negatively correlates with model generalization. As such, for good model generalization, it was advocated that the batch size should be kept as small as possible, and the learning rate as large as possible. It is shown in [23] that, contrary to earlier claims, both SGD and a second-order approximation of gradient descent, Kronecker-Factored Approximate Curvature (K-FAC), exhibit generalization loss for DNNs trained with large
batch sizes.
3 Background and Problem Statement
This section discusses the background of the work, which reflects the setting under which we investigate the generalization gap problem. Subsequently, the problem statement that formalizes the generalization gap problem is presented.
3.1 Background on batch size impact for DNNs
This section discusses preliminaries for the impact of batch size on DNN generalization. Given an arbitrary dataset $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$ with input $\mathbf{x} \in \mathbb{R}^{h^n_0}$, target output $\mathbf{y} \in \mathbb{R}^{c}$ and sample index $n$, the pair $(\mathbf{x}_n, \mathbf{y}_n)$ is typically fed in batches as $\mathbf{X} \in \mathbb{R}^{h^n_0 \times b_s}$ and $\mathbf{Y} \in \mathbb{R}^{c \times b_s}$ during DNN training; where $h^n_0$ and $b_s$ are the input dimension and batch size, respectively. For multilayer perceptron (MLP) models, the batch output at layer $l$ is of the form $H(\mathbf{X})^l \in \mathbb{R}^{h^n_l \times b_s}$, where $h^n_l$ is the number of units in hidden layer $l$. Given the input, $H(\mathbf{X})^{l-1} \in \mathbb{R}^{h^n_{l-1} \times b_s}$, to an arbitrary DNN layer $l$ parameterized by the weight $\mathbf{W}^l \in \mathbb{R}^{h^n_l \times h^n_{l-1}}$, we can write the transformation, $H(\mathbf{X})^l$, learned as
$$H(\mathbf{X})^l = \phi(\mathbf{W}^l H(\mathbf{X})^{l-1}), \qquad (1)$$
where $\phi$ is the element-wise activation function; the bias term is omitted. Assuming the DNN has $L$ layers with output layer weight $\mathbf{W}^L \in \mathbb{R}^{p \times h^n_L}$, the final output, $\mathbf{Y}$, is
$$\mathbf{Y} = \phi(\mathbf{W}^L \phi(\mathbf{W}^{L-1} \cdots \phi(\mathbf{W}^1 H(\mathbf{X})^0))), \qquad (2)$$
where $H(\mathbf{X})^0$ is the input, $\mathbf{X}$, to the DNN.
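To make the notation above concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) performs the batched forward pass of Eqns. (1)-(2) for a small MLP; the layer sizes, batch size and the ReLU/identity choice of $\phi$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: input size h_0, two hidden layers, c output classes.
h0, h1, h2, c = 784, 256, 128, 10
bs = 64                                   # batch size b_s

# Batch of inputs X in R^{h_0 x b_s}; columns are individual samples.
X = rng.standard_normal((h0, bs))

# Layer weights W^l in R^{h_l x h_{l-1}}; bias terms omitted as in Eqn. (1).
W1 = rng.standard_normal((h1, h0)) * np.sqrt(2.0 / h0)
W2 = rng.standard_normal((h2, h1)) * np.sqrt(2.0 / h1)
W3 = rng.standard_normal((c, h2)) * np.sqrt(2.0 / h2)

def phi(Z, linear=False):
    """Element-wise activation: identity for a linear DNN, ReLU otherwise."""
    return Z if linear else np.maximum(Z, 0.0)

# Eqn. (1) applied layer by layer; each H^l has shape (h_l, b_s).
H1 = phi(W1 @ X)
H2 = phi(W2 @ H1)
Y = phi(W3 @ H2)                          # Eqn. (2): final output of shape (c, b_s)

print(H1.shape, H2.shape, Y.shape)        # (256, 64) (128, 64) (10, 64)
```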
3.2 Problem statement
Consider a DNN model denoted $M$ with specific architecture, parameter initialization scheme and hyperparameter settings, except for the batch size. Let the classification test error of $M$ be $M^{test}_{err}$. Our main exposition in this paper is understanding why, in practice, we observe the relation
$$M^{test}_{err} \propto b_s : b_s \gg 1, \qquad (3)$$
as seen in the works [13; 10; 14; 18]. Increasing $b_s$ generally leads to an increase in the model error rate. How the
increase in $b_s$ changes the generalization dynamics of trained DNNs has remained a challenging research question with different directions of investigation. Most of the works that have attempted to unravel the origin of the generalization gap approached it from perspectives that decouple model optimization and generalization performance. In contrast to earlier works, we investigate the generalization gap problem simultaneously from both the optimization and generalization perspectives.
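To make the relation in Eqn. (3) concrete, the following toy sketch (our own illustration, not the paper's experimental setup, which uses VGG-16, ResNet-56 and LeNet-5 on CIFAR-10, CIFAR-100, Fashion-MNIST and MNIST) trains the same small MLP with a small and a large batch size under otherwise identical settings and compares test accuracy. On such a tiny dataset the effect can be small or noisy, so the output should be read only qualitatively.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X / 16.0, y, test_size=0.3, random_state=0)

# Same architecture, learning rate and training budget; only the batch size differs.
for bs in (16, 1024):                     # a large value is clipped to the training-set size
    clf = MLPClassifier(hidden_layer_sizes=(128,), solver="sgd",
                        batch_size=bs, learning_rate_init=0.05,
                        max_iter=200, random_state=0)
    clf.fit(X_tr, y_tr)
    print(f"batch size {bs:4d}: test accuracy = {clf.score(X_te, y_te):.3f}")
```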
4 Proposed analysis of the generalization gap problem
Herein, the proposed analysis of the generalization gap in relation to the rank loss of hidden representations, information loss and optimization success is presented. However, the relevance of studying linear DNNs and some basic concepts are first introduced.
4.1 Relevance of studying linear DNNs
The theoretical analysis of practical DNNs is generally complicated, and thus the linear activation (no non-linearities) assumption, along with other necessary assumptions, is often made for tractability. Our formal analysis in this paper likewise assumes linear activations (i.e. linear DNNs). As such, following existing literature, we first emphasize
the relevance of studying linear DNNs, including their usefulness for understanding DNNs with non-linear activation
functions (non-linear DNNs). In [24], it was observed that analytical results obtained using linear DNNs conform
with results from non-linear DNNs. This is expected given that the loss function of a linear multilayer DNN is non-
convex with respect to the model parameters, similar to non-linear DNNs. Furthermore, the relevance of linear DNNs
for analytical study is seen in the work [25]. Interestingly, we note that much stronger assumptions are common in
the literature. For example, the assumption of no nonlinearities and no batch-norm can be found in [26]. In [27], the assumptions made are (i) no nonlinearities, (ii) no batch-norm, (iii) the number of units in the different layers being more than the number of units in the input or output layers, and (iv) whitened data. The assumptions of (i) no batch-norm,
(ii) infinite number of hidden units, and (iii) infinite number of training samples can be found in [28]. In [29], the
assumptions of (i) no nonlinearities, (ii) no batch-norm, (iii) the thinnest layer is either the input layer or the output
layer, and (iv) arbitrary convex differentiable loss can be found. Nonetheless, the results from the aforementioned
works have proven quite useful in practice. We show later on in our experiments that both non-linear and linear DNNs
exhibit similar training characteristics for the generalization gap problem, which is not surprising.
4.2 Preliminaries
For analytical simplicity, we consider MLP networks that represent hidden layer units’ activations with matrices.
However, we show later on that the extension of the analytical results to Convolutional Neural Networks (CNNs), which represent hidden layer units' activations as 4-dimensional tensors, is straightforward.
The rank of hidden layer units' activations, $H(\mathbf{X})^l \in \mathbb{R}^{h^n_l \times b_s} : h^n_l \geq b_s$, describes the maximum number of linearly independent columns of $H(\mathbf{X})^l$. The rank of $H(\mathbf{X})^l$ can be obtained as its number of non-zero singular values, $S^l_{nz}$, via singular value decomposition (SVD). Specifically, a matrix with full rank means that all its singular values are non-zero. Conversely, the number of zero singular values, $S^l_z$, shows the degree of rank loss.
Definition 1 Near-rank loss in this paper is taken to mean a scenario where the singular values, denoted by $\sigma$, are very small (i.e. $\sigma \ll 1$).
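As a concrete illustration of these quantities (our own sketch; the tolerance used to operationalize $\sigma \ll 1$ is an arbitrary assumption), the snippet below computes the singular values of a hidden activation matrix via SVD, counts the non-zero ones (the rank), counts the near-zero ones (near-rank loss in the sense of Definition 1), and reports the condition number.

```python
import numpy as np

def near_rank_stats(H, tol=1e-3):
    """Singular-value statistics of an activation matrix H of shape (h_l, b_s).

    Returns (S_nz, S_near_zero, condition_number):
      S_nz             -- number of strictly non-zero singular values (the rank),
      S_near_zero      -- number of singular values below `tol`, a proxy for
                          near-rank loss (tol is an arbitrary illustrative threshold),
      condition_number -- sigma_max / sigma_min.
    """
    s = np.linalg.svd(H, compute_uv=False)            # singular values, descending order
    s_nz = int(np.sum(s > 0.0))
    s_near_zero = int(np.sum(s < tol))
    cond = float(s[0] / max(s[-1], np.finfo(H.dtype).tiny))
    return s_nz, s_near_zero, cond

# Example: ReLU activations of a random layer with h_l = 256 units and b_s = 64.
rng = np.random.default_rng(0)
H = np.maximum(rng.standard_normal((256, 64)), 0.0)
print(near_rank_stats(H))
```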
Our main analytical observations for the problem of generalization gap are summarized in Section 4.3.
4.3 Units’ activation near-rank loss and optimization
In this section, the limits of the singular values of random matrices are related to their dimensions. For stating the first proposition, we consider that the hidden layer representations, $H(\mathbf{X})^l \in \mathbb{R}^{h^n_l \times b_s}$, are samples of some unknown distribution, $p(H(\mathbf{X})^l)$, and then characterize the behaviour of the maximum and minimum singular values for $b_s \to \infty$. In the second proposition, we assume that the entries in $H(\mathbf{X})^l \in \mathbb{R}^{h^n_l \times b_s}$ follow a Gaussian distribution. Subsequently, we characterize the limits of the singular values of $H(\mathbf{X})^l$ for $b_s \in \mathbb{R}$.
Proposition 1 (The asymptotic behaviour of the singular values of a matrix with increase in dimension): For a matrix $\mathbf{A} \in \mathbb{R}^{m \times n} : m \geq n$, from the Marchenko-Pastur law, the singular values, $\sigma$, concentrate in the range $[\sigma_{\min}(\mathbf{A}) \approx \sqrt{m} - \sqrt{n},\ \sigma_{\max}(\mathbf{A}) \approx \sqrt{m} + \sqrt{n}]$ as $m, n \to \infty$; where $\sigma_{\max}(\mathbf{A})$ and $\sigma_{\min}(\mathbf{A})$ are the maximum and minimum singular values of $\mathbf{A}$, respectively.
Proof. See Rudelson-Vershynin [30] for proof.
Remark 1 In fact, [30] notes that Proposition 1 holds for general distributions. As such, we can conclude from Proposition 1 that $H(\mathbf{X})^l \in \mathbb{R}^{h^n_l \times b_s} : b_s \to \infty$ results in small and large distribution ranges, which are admissible for $\sigma_{\min}(H(\mathbf{X})^l)$ and $\sigma_{\max}(H(\mathbf{X})^l)$, respectively. Accordingly, as $b_s$ increases, we have the following scenarios: (i) a higher probability for $H(\mathbf{X})^l$ to have a small $\sigma_{\min}(H(\mathbf{X})^l)$; and (ii) a higher probability for $H(\mathbf{X})^l$ to have a larger condition number.
Proposition 2 (The non-asymptotic behaviour of the singular values of a matrix with increase in dimension): For a Gaussian random matrix $\mathbf{A} \in \mathbb{R}^{m \times n} : m \geq n$, the expected minimum and maximum singular values are given as
$$\sqrt{m} - \sqrt{n} \leq E\sigma_{\min}(\mathbf{A}) \leq E\sigma_{\max}(\mathbf{A}) \leq \sqrt{m} + \sqrt{n}. \qquad (4)$$
Proof. See Theorem 2.6 in Rudelson-Vershynin [30].
In Section A1.1 and Section A1.2 of the supplementary material, we empirically study the distributions of singular
values and expected values of the minimum singular values for random matrices, respectively, where the entries are
drawn from popular distributions including the Gaussian, uniform and lognormal. Subsequently, for the aforemen-
tioned distributions, in Section A1.2 of the supplementary material, we show that for a fixed $m$, $E\sigma_{\min}(\mathbf{A}) \to 0$ as $n$ becomes large. This observation is akin to that in Eqn. (4), suggesting that it also applies to other popular distributions.
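The short check below (our own sanity check, not the paper's supplementary code) draws Gaussian matrices with a fixed number of rows $m$ (standing in for $h^n_l$) and a growing number of columns $n$ (standing in for $b_s$), and compares the empirical extreme singular values against the bounds of Eqn. (4); as $n$ grows toward $m$, the average minimum singular value shrinks, consistent with Remark 2 that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 512, 20                       # fixed row count (stand-in for h_l), Monte Carlo trials

for n in (32, 64, 128, 256, 512):         # growing column count (stand-in for b_s), n <= m
    s_min, s_max = [], []
    for _ in range(trials):
        A = rng.standard_normal((m, n))   # Gaussian random matrix, as in Proposition 2
        s = np.linalg.svd(A, compute_uv=False)   # singular values, descending order
        s_min.append(s[-1])
        s_max.append(s[0])
    lower, upper = np.sqrt(m) - np.sqrt(n), np.sqrt(m) + np.sqrt(n)
    print(f"n={n:4d}  mean s_min={np.mean(s_min):6.2f} (lower bound {lower:6.2f})  "
          f"mean s_max={np.mean(s_max):6.2f} (upper bound {upper:6.2f})")
```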
Remark 2 Given $H(\mathbf{X})^l_1 \in \mathbb{R}^{h^n_l \times b_{s1}} : h^n_l > b_{s1}$ and $H(\mathbf{X})^l_2 \in \mathbb{R}^{h^n_l \times b_{s2}} : h^n_l > b_{s2}$ with $b_{s2} > b_{s1}$, then $E\sigma_{\min}(H(\mathbf{X})^l_1) > E\sigma_{\min}(H(\mathbf{X})^l_2)$ using Proposition 2. Importantly, given that $h^n_l$ is fixed as it is typical for