
As shown in the first row of Fig. 1, where Mask R-CNN [17] with a ResNet50 [22] backbone is trained using different batch sizes, we compare the gradient variances of different network modules, including the backbone, FPN, RPN, detection head, and mask head. We see that when the batch size is small (32 in the first figure), the gradient variances of the different modules are similar throughout the training process. When the batch size increases from 256 to 1024 (2nd~4th figures), the gradient variances of different modules become misaligned and their gap enlarges during training; training fails when the batch size reaches 1024. More importantly, the gradient variances in the RPN, FPN, detection head, and mask head are significantly smaller than that in the backbone, and they change sharply in the late stage of training (the two middle figures). We find that such misalignment undesirably burdens large-batch training, leading to severe performance drops and even training failure. More observations on various visual tasks and networks can be found in Appendix A.2.
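To make this measurement concrete, the sketch below shows one way to estimate per-module gradient variance in PyTorch over a set of micro-batches, using Var[g] ≈ E[‖g_b‖²] − ‖E[g_b]‖² as a proxy. The `groups` dictionary (mapping module names such as 'backbone' or 'rpn' to their parameter lists) and the `loss_fn` callable are hypothetical placeholders; this is an illustrative measurement, not the paper's exact protocol.

```python
import torch

def per_module_grad_variance(model, loss_fn, micro_batches, groups):
    """Estimate the gradient variance of each module group.

    `groups` maps a group name (e.g. 'backbone', 'fpn', 'rpn') to a list of
    parameters.  The variance proxy is Var[g] ~ E[||g_b||^2] - ||E[g_b]||^2,
    computed over the per-micro-batch gradients g_b.
    """
    sums = {name: None for name in groups}
    sq_sums = {name: 0.0 for name in groups}
    n = len(micro_batches)

    for batch in micro_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()          # gradient of one micro-batch
        for name, params in groups.items():
            g = torch.cat([p.grad.flatten() for p in params if p.grad is not None])
            sums[name] = g.clone() if sums[name] is None else sums[name] + g
            sq_sums[name] += g.pow(2).sum().item()

    variance = {}
    for name in groups:
        mean_g = sums[name] / n                   # E[g_b]
        variance[name] = sq_sums[name] / n - mean_g.pow(2).sum().item()
    return variance
```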
The above empirical analysis naturally inspires us to design AGVM, a simple yet effective method for training dense visual predictors with multiple modules using a very large batch size. AGVM directly modulates the misaligned gradient variances, making them consistent across different network modules throughout training. As shown in the second row of Fig. 1, AGVM significantly outperforms recent large-batch training approaches on four different visual prediction tasks with batch sizes ranging from 32 to 2048. For example, AGVM enables us to train an object detector with a huge batch size of 1536 (where prior arts may fail), reducing training time by more than 35× compared to the regular training setup.
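As a rough illustration of this idea (not the exact AGVM update rule, which is detailed later), one could rescale each module's gradient so that its variance matches that of an anchor module such as the backbone; the variance estimates here are assumed to come from a routine like the one sketched above, and the scaling rule is our own simplification.

```python
import torch

def align_module_gradients(groups, variances, anchor="backbone", eps=1e-12):
    """Rescale each module group's gradients so that their variance roughly
    matches that of the anchor module.  `variances` maps group names to
    per-group gradient variance estimates."""
    target = variances[anchor]
    for name, params in groups.items():
        if name == anchor:
            continue
        # Scaling a gradient by s scales its variance by s^2, hence the sqrt.
        scale = (target / (variances[name] + eps)) ** 0.5
        for p in params:
            if p.grad is not None:
                p.grad.mul_(scale)
```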
This work makes three main contributions. Firstly, we carefully design AGVM, which is, to our knowledge, the first large-batch optimization method for various dense prediction networks and tasks. We evaluate AGVM on different architectures (e.g., CNNs and Transformers), solvers (e.g., SGD and AdamW), and tasks (e.g., object detection, instance segmentation, semantic segmentation, and panoptic segmentation). Secondly, we provide a convergence analysis of AGVM, showing that it converges to a stable point in a general non-convex optimization setting. We also conduct an empirical analysis that reveals an important insight: the inconsistency of effective batch size between different modules aggravates the gradient variance misalignment when the batch size is large, leading to performance drops and even training failure. We believe this insight may facilitate future research on large-scale training of complicated vision systems. Thirdly, extensive experiments are conducted to evaluate AGVM, which achieves many new state-of-the-art results for large-batch training. For example, AGVM demonstrates more stable generalization performance than prior arts under an extremely large batch size (i.e., 10k). In particular, it enables training the widely-used Faster R-CNN+ResNet50 within 4 minutes without performance drop. More importantly, AGVM can train a detector with one billion parameters within just 3.5 hours, reducing the training time by 20.9×, while achieving a top-ranking 62.2 mAP on the COCO dataset.
2 Preliminary and Notation
Let $S=\{(x_i, y_i)\}_{i=1}^{n}$ denote a dataset with $n$ training samples, where $x_i$ and $y_i$ represent a data point and its label, respectively. We can estimate the value of a loss function $L:\mathbb{R}^d\rightarrow\mathbb{R}$ using a mini-batch of randomly sampled examples, obtaining $l(w_t) = \frac{1}{b}\sum_{j\in S_t} L(w_t,(x_j, y_j))$, where $S_t$ denotes the mini-batch at the $t$-th iteration with batch size $|S_t|=b$ and $w_t$ represents the parameters of a deep neural network. We can apply stochastic gradient descent (SGD), one of the most representative algorithms, to update the parameters $w_t$. The SGD update with learning rate $\eta_t$ is

$$w_{t+1} = w_t - \eta_t \nabla l(w_t), \quad (1)$$

where $\nabla l(w_t)$ represents the gradient of the loss function with respect to $w_t$.
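For reference, Eq. (1) corresponds to the following single update step, written out by hand in PyTorch; the model, loss function, and batch are placeholders.

```python
import torch

def sgd_step(model, loss_fn, batch, lr):
    """One step of Eq. (1): w_{t+1} = w_t - eta_t * grad l(w_t)."""
    xs, ys = batch
    model.zero_grad()
    loss = loss_fn(model(xs), ys)        # l(w_t), averaged over the b mini-batch samples
    loss.backward()                      # computes grad l(w_t)
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr)  # w <- w - eta_t * grad
    return loss.item()
```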
Layerwise Scaling Ratio. In large-batch training, You et al. [8] observe that the ratio between the norm of the layer weights and the norm of the gradients is unstable (i.e., it oscillates a lot), leading to training failure. You et al. [8] present the LARS algorithm, which adopts a layerwise scaling ratio, $\|w_t^{(i)}\| / \|\nabla l(w_t^{(i)})+\lambda w_t^{(i)}\|$, to modify the magnitude of the gradient of the $i$-th layer, $\nabla l(w_t^{(i)})$, where $w_t^{(i)}$ and $\lambda$ denote the parameters of the $i$-th layer and the weight decay coefficient, respectively.
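In code, the layerwise scaling ratio amounts to the following per-layer rescaling (a minimal sketch rather than a full LARS optimizer; momentum and the learning-rate schedule are omitted, and the small `eps` added for numerical stability is our own assumption):

```python
import torch

def lars_scaled_gradient(w, grad, weight_decay, eps=1e-12):
    """Apply the layerwise scaling ratio ||w|| / ||grad + lambda * w|| to one layer."""
    g = grad + weight_decay * w           # grad l(w^(i)) + lambda * w^(i)
    ratio = w.norm() / (g.norm() + eps)   # layerwise scaling ratio
    return ratio * g                      # rescaled update direction for the i-th layer
```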
Furthermore, LAMB [6] improves LARS by combining the AdamW optimizer with the layerwise scaling ratio. It can be formulated as $r_t = m_t/(\sqrt{v_t}+\epsilon)$, where $m_t=\beta_1 m_{t-1} + (1-\beta_1)\nabla l(w_t)$