Large-batch Optimization for Dense Visual
Predictions
Zeyue Xue
The University of Hong Kong
xuezeyue@connect.hku.hk
Jianming Liang
Beihang University
ljmmm1997@gmail.com
Guanglu Song
Sensetime Research
songguanglu@sensetime.com
Zhuofan Zong
Beihang University
zongzhuofan@gmail.com
Liang Chen
Peking University
clandzyy@pku.edu.cn
Yu Liu
Sensetime Research
liuyuisanai@gmail.com
Ping Luo
The University of Hong Kong,
Shanghai AI Laboratory
pluo@cs.hku.hk
Abstract
Training a large-scale deep neural network on a large-scale dataset is challenging and time-consuming. The recent breakthrough of large-batch optimization is a promising way to tackle this challenge. However, although current advanced algorithms such as LARS and LAMB succeed in classification models, the complicated pipelines of dense visual predictions such as object detection and segmentation still suffer from a heavy performance drop in the large-batch training regime. To address this challenge, we propose a simple yet effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train dense visual predictors with very large batch sizes, offering several benefits more appealing than prior arts. Firstly, AGVM can align the gradient variances between different modules in the dense visual predictors, such as the backbone, feature pyramid network (FPN), detection, and segmentation heads. We show that training with a large batch size can fail when the gradient variances are misaligned among these modules, a phenomenon primarily overlooked in previous work. Secondly, AGVM is a plug-and-play module that generalizes well to many different architectures (e.g., CNNs and Transformers) and different tasks (e.g., object detection, instance segmentation, semantic segmentation, and panoptic segmentation). It is also compatible with different optimizers (e.g., SGD and AdamW). Thirdly, a theoretical analysis of AGVM is provided. Extensive experiments on the COCO and ADE20K datasets demonstrate the superiority of AGVM. For example, it can train Faster R-CNN+ResNet50 in 4 minutes without losing performance. AGVM demonstrates more stable generalization performance than prior arts under an extremely large batch size (i.e., 10k). It enables training an object detector with one billion parameters in just 3.5 hours, reducing the training time by 20.9x, whilst achieving 62.2 mAP on COCO. The deliverables are released at https://github.com/Sense-X/AGVM.
*Work done during an internship at Sensetime Research.
Corresponding authors.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.11078v1 [cs.CV] 20 Oct 2022

[Figure 1: plots omitted. First row: gradient-variance curves $\Phi_t^{(i)}$ of each module over training for batch sizes 32, 256, 512, and 1024. Second row: training time in minutes (bars) versus mAP / PQ / mIoU per task, with "x" marking training failures. See the caption below.]
Figure 1: First row: comparisons of the gradient variances (omitting the learning rate in $\Phi_t^{(i)}$; see Eq. (3)) of different network modules in Mask R-CNN, including backbone, FPN, RPN, and heads. From left to right, the models are trained using SGD with a mini-batch size of 32, 256, 512, and 1024, respectively. Note that a smaller batch size (32 in the first figure) produces similar $\Phi_t^{(i)}$ between different modules. When the batch size increases from 256 to 1024 (2nd-4th figures), the gradient variance curves suffer from heavy misalignment between modules. Specifically, the gradient variances are significantly small in the RPN, FPN, detection head, and mask head. We find that the larger the variance gap, the lower the model performance (the best performance is achieved when the batch size equals 32). Second row: from left to right, we compare the performance (right vertical axis) and training time of AGVM (bar diagram, left vertical axis) on different visual tasks, including object detection (1st figure), instance segmentation (2nd), panoptic segmentation (3rd), and semantic segmentation (4th), where the models are trained using different methods with different batch sizes. The "x" indicates training failure when using previous methods. Our method outperforms the recent approaches in all tasks with various batch sizes, significantly reducing training time.
1 Introduction
The recent successes in many tasks of dense visual predictions rely on large-scale datasets [1, 2, 3], the increase of computational power (e.g., GPUs), and the parallel training paradigm with large sample batches. Sufficient computational resources enable large-batch training, greatly reducing the training time [4]. However, although simply scaling the batch size allows fewer iterations to update the parameters of deep neural networks, it often leads to a dramatic drop in generalization performance [5, 6, 7].
To reduce the generalization gap in the large-batch training paradigm, LARS [8] scales the batch size of a plain ResNet50 from 8k to 32k without losing accuracy, enabling an image classification model to be trained on ImageNet in a few minutes. However, different from the plain network architectures in ImageNet classification [9, 10, 11], many tasks of dense visual predictions, such as object detection [12, 13, 14, 15, 16] and segmentation [17, 18, 19, 20], are solved by more complicated pipelines, which consist of multiple different modules, such as a region proposal network (RPN) [12], a feature pyramid network (FPN) [21], a detection head, and a segmentation head. Nevertheless, recent advanced large-batch optimization methods such as LARS [8] and LAMB [6] are typically not sufficient to achieve good generalization performance in dense visual predictions. The long training time of dense predictors greatly limits researchers from making full use of the increasing computational power and large-scale datasets.
To address the above challenge, we present a novel large-batch training algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train different complicated dense predictors with very large batch sizes, significantly reducing their training time while maintaining generalization performance. The design of AGVM is motivated by a training phenomenon overlooked in prior arts. We call it gradient variance misalignment, which arises when a dense visual prediction pipeline containing many different modules is trained with a large mini-batch: the different modules (e.g., backbone, RPN, FPN, and heads) can have gradient variances of different magnitudes, impeding the generalization ability.
As shown in the first row of Fig. 1, where Mask R-CNN [17] with ResNet50 [22] as the backbone is trained using different batch sizes, we compare the gradient variances of different network modules, including backbone, FPN, RPN, detection head, and mask head. We see that when the batch size is small (32 in the first figure), the gradient variances of different network modules are similar throughout the training process. When the batch size increases from 256 to 1024 (2nd-4th figures), the gradient variances misalign across modules, and the variance gap enlarges during training. Training fails when the batch size equals 1024. More importantly, the gradient variances have significantly smaller values in the RPN, FPN, detection head, and mask head compared to the backbone, and they change sharply in the late stage of training (two figures in the middle). We find that such misalignment undesirably burdens large-batch training, leading to severe performance drops and even training failure. More observations on various visual tasks and networks can be found in Appendix A.2.
The above empirical analysis naturally inspires us to design a simple yet effective method, AGVM, for training dense visual predictors with multiple modules using very large batch sizes. AGVM directly modulates the misaligned gradient variances, making them consistent between different network modules throughout training. As shown in the second row of Fig. 1, AGVM significantly outperforms recent approaches to large-batch training on four different visual prediction tasks with various batch sizes from 32 to 2048. For example, AGVM enables us to train an object detector with a huge batch size of 1536 (where prior arts may fail), reducing training time by more than 35x compared to the regular training setup.
This work makes three main contributions. Firstly, we carefully design AGVM, which, to our knowledge, is the first large-batch optimization method for various dense prediction networks and tasks. We evaluate AGVM on different architectures (e.g., CNNs and Transformers), solvers (e.g., SGD and AdamW), and tasks (e.g., object detection, instance segmentation, semantic segmentation, and panoptic segmentation). Secondly, we provide a convergence analysis of AGVM, which converges to a stable point in a general non-convex optimization setting. We also conduct an empirical analysis that reveals an important insight: the inconsistency of effective batch size between different modules aggravates the gradient variance misalignment when the batch size is large, leading to performance drops and even training failure. We believe this insight may facilitate future research on large-scale training of complicated vision systems. Thirdly, extensive experiments are conducted to evaluate AGVM, which achieves many new state-of-the-art results on large-batch training. For example, AGVM demonstrates more stable generalization performance than prior arts under an extremely large batch size (i.e., 10k). In particular, it enables training of the widely-used Faster R-CNN+ResNet50 within 4 minutes without performance drop. More importantly, AGVM can train a detector with one billion parameters within just 3.5 hours, which reduces the training time by 20.9x, while achieving a top-ranking 62.2 mAP on the COCO dataset.
2 Preliminary and Notation
Let $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{n}$ denote a dataset with $n$ training samples, where $x_i$ and $y_i$ represent a data point and its label, respectively. We can estimate the value of a loss function $L: \mathbb{R}^d \to \mathbb{R}$ using a mini-batch of samples that are randomly sampled, and obtain $l(w_t) = \frac{1}{b}\sum_{j \in \mathcal{S}_t} L(w_t, (x_j, y_j))$, where $\mathcal{S}_t$ denotes the mini-batch at the $t$-th iteration with batch size $|\mathcal{S}_t| = b$ and $w_t$ represents the parameters of a deep neural network. We can apply stochastic gradient descent (SGD), one of the most representative algorithms, to update the parameters $w_t$. The SGD update equation with learning rate $\eta_t$ is:
$$w_{t+1} = w_t - \eta_t \nabla l(w_t), \qquad (1)$$
where $\nabla l(w_t)$ represents the gradient of the loss function with respect to $w_t$.
Layerwise Scaling Ratio. In large-batch training, You et al. [8] observe that the ratio between the norm of the layer weights and the norm of the gradients is unstable (i.e., it oscillates a lot), leading to training failure. You et al. [8] present the LARS algorithm, which adopts a layerwise scaling ratio, $\|w_t^{(i)}\| / \|\nabla l(w_t^{(i)}) + \lambda w_t^{(i)}\|$, to modify the magnitude of the gradient of the $i$-th layer $\nabla l(w_t^{(i)})$, where $w_t^{(i)}$ and $\lambda$ indicate the parameters of the $i$-th layer and the weight decay coefficient, respectively. Furthermore, LAMB [6] improves LARS by combining the AdamW optimizer with the layerwise scaling ratio. It can be formulated as $r_t = m_t / (\sqrt{v_t} + \epsilon)$, where $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla l(w_t)$ and $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\nabla l(w_t)^2$. The layerwise scaling ratio of LAMB can be computed by $\|w_t^{(i)}\| / \|r_t^{(i)} + \lambda w_t^{(i)}\|$.
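As a concrete, unofficial illustration of the layerwise scaling ratio described above, the sketch below computes the LARS/LAMB trust ratio for a single layer in PyTorch; the function name, the `eps` guard, and the fallback to 1.0 are our own assumptions rather than part of the original algorithms.

```python
import torch

def layerwise_scaling_ratio(weight: torch.Tensor,
                            update: torch.Tensor,
                            weight_decay: float = 1e-4,
                            eps: float = 1e-8) -> float:
    """Trust ratio ||w|| / ||update + lambda * w|| used by LARS/LAMB.

    For LARS, `update` is the layer's gradient; for LAMB, it is the AdamW
    update r_t = m_t / (sqrt(v_t) + eps).
    """
    w_norm = weight.norm()
    u_norm = (update + weight_decay * weight).norm()
    # Fall back to 1.0 when either norm vanishes (our own safeguard).
    if w_norm == 0 or u_norm == 0:
        return 1.0
    return (w_norm / (u_norm + eps)).item()
```

In LARS the resulting ratio rescales the layer's gradient before the SGD step, while in LAMB it rescales the AdamW update $r_t^{(i)}$.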
Sharpness-aware Minimization. Large-batch training often converges to a sharp local minimum, resulting in undesired generalization performance. The sharpness-aware minimization (SAM) [23] algorithm explicitly penalizes sharp minima and finds parameters whose neighbors (in an $\ell_p$-ball) have low training loss values using the following objective function:
$$l^{SAM}(w_t) = \max_{\|\epsilon\|_p \le \rho} l(w_t + \epsilon). \qquad (2)$$
To solve the above equation, SAM applies one-step gradient ascent to determine $\hat{\epsilon} = \rho \nabla l(w_t) / \|\nabla l(w_t)\|$. Its gradient is then approximated by $\nabla l^{SAM}(w_t) \approx \nabla l(w_t)|_{w_t + \hat{\epsilon}}$. However, SAM involves two sequential gradient computations at each iteration and thus doubles the computational cost.
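The two sequential gradient computations can be made explicit with a small sketch. This is a simplified, single-tensor illustration of the SAM step under the $\ell_2$-ball assumption; real implementations perturb and restore all parameters jointly, and the names here are our own.

```python
import torch

def sam_update(param: torch.Tensor, loss_fn, lr: float, rho: float = 0.05):
    """One SAM iteration on a single parameter tensor (illustrative only).

    loss_fn maps a parameter tensor to a scalar loss. Note the two sequential
    gradient computations, which is why SAM roughly doubles the cost per step.
    """
    # First forward/backward: gradient at the current point w_t.
    grad = torch.autograd.grad(loss_fn(param), param)[0]
    # One-step ascent inside the rho-ball (l2 case of Eq. (2)).
    eps_hat = rho * grad / (grad.norm() + 1e-12)
    # Second forward/backward: gradient evaluated at w_t + eps_hat.
    perturbed = (param + eps_hat).detach().requires_grad_(True)
    grad_sam = torch.autograd.grad(loss_fn(perturbed), perturbed)[0]
    # Descent step using the SAM gradient approximation.
    with torch.no_grad():
        param -= lr * grad_sam
    return param

# Example: w = torch.randn(8, requires_grad=True); sam_update(w, lambda p: (p ** 2).sum(), lr=0.1)
```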
Gradient Variance Estimation. Qin et al. [24] utilize the cosine similarity between two aggregated gradients from the replicas in a distributed training system to estimate the gradient variance between SGD and GD efficiently. Specifically, we can compute the gradient for each sample in the $t$-th mini-batch $\mathcal{S}_t$ of batch size $b$, denoted by $r_{1,t}, ..., r_{j,t}, ..., r_{b,t}$. We have $\nabla l(w_t) = \frac{1}{b}\sum_{j=1}^{b} r_{j,t}$. We split the above gradients into two groups and average each group, obtaining $G_{t,1} = \frac{2}{b}\sum_{j=1}^{b/2} r_{2j-1,t}$ and $G_{t,2} = \frac{2}{b}\sum_{j=1}^{b/2} r_{2j,t}$, respectively. Then the gradient variance can be measured by $\Phi_t = 1 - \cos(G_{t,1}, G_{t,2})$, where $\cos(\cdot,\cdot)$ is the cosine similarity function.
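A minimal sketch of this split-half estimator is shown below. In a distributed run, $G_{t,1}$ and $G_{t,2}$ would be produced by averaging gradients over two halves of the replicas (e.g., via two all-reduce groups); the version here assumes per-sample gradients are available on one device purely for illustration.

```python
import torch
import torch.nn.functional as F

def gradient_variance_estimate(per_sample_grads: torch.Tensor) -> float:
    """Estimate Phi_t = 1 - cos(G_{t,1}, G_{t,2}) from per-sample gradients.

    per_sample_grads: tensor of shape (b, d), one flattened gradient per sample
    (b is assumed even). The b gradients are split into two interleaved halves
    and averaged; the cosine similarity between the two averages reflects how
    noisy the mini-batch gradient is.
    """
    g1 = per_sample_grads[0::2].mean(dim=0)  # corresponds to G_{t,1}
    g2 = per_sample_grads[1::2].mean(dim=0)  # corresponds to G_{t,2}
    cos = F.cosine_similarity(g1, g2, dim=0)
    return (1.0 - cos).item()
```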
3 Our Approach
Our goal is to perform large-batch training for dense visual predictors with many different network modules. As illustrated in Fig. 1, the inconsistency of gradient variances among different modules needs to be modulated.
Gradient Variance across Modules. We derive an updated (considering the learning rate) gradient variance to delve into the differences between network modules in complicated dense visual prediction pipelines. The updated gradient variance of the $i$-th network module at the $t$-th iteration can be formulated as:
$$\mathrm{Var}(\eta_t g_t^{(i)}) = \frac{n-b}{2nb} \underbrace{\eta_t^2 \left(1 - \mathbb{E}\big[\cos(G_{t,1}^{(i)}, G_{t,2}^{(i)})\big]\right)}_{\Phi_t^{(i)}} \mathbb{E}\big[\|g_t^{(i)}\|^2\big], \qquad (3)$$
where $n$ and $b$ are the number of training samples and the mini-batch size, respectively, $\eta_t$ is the learning rate, and $g_t^{(i)}$ indicates the gradient of the $i$-th network module. $G_{t,1}^{(i)}$ and $G_{t,2}^{(i)}$ are the two groups of the gradient estimation discussed above. Since each entry in the vector $g_t^{(i)}$ can be assumed i.i.d. on a massive dataset following [24, 25], $\Phi_t^{(i)}$ is thus proportional to the above updated gradient variance. At each training iteration, we can approximate the updated gradient variance by $\Phi_t^{(i)} = \eta_t^2 (1 - \cos(G_{t,1}^{(i)}, G_{t,2}^{(i)}))$. Note that $\Phi_t^{(i)}$ for the $i$-th module has been normalized by the number of parameters, so the $\Phi_t^{(i)}$ of different modules are comparable. For consistency of presentation, we still call $\Phi_t^{(i)}$ the gradient variance, which enables us to estimate the gradient variance of each network module at each training iteration. More discussions can be found in Appendix A.1.
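To make the per-module statistic concrete, the helper below groups parameter gradients by module and evaluates $\Phi_t^{(i)} = \eta_t^2 (1 - \cos(G_{t,1}^{(i)}, G_{t,2}^{(i)}))$ for each group. The dictionary interface and the name-based module assignment are our own assumptions, not part of the released code.

```python
import torch
import torch.nn.functional as F
from collections import defaultdict

@torch.no_grad()
def per_module_phi(grads_half1, grads_half2, lr, module_of):
    """Compute Phi_t^{(i)} = lr^2 * (1 - cos(G_1^{(i)}, G_2^{(i)})) per module.

    grads_half1 / grads_half2: dicts mapping parameter name -> gradient tensor
    averaged over the first / second half of the replicas (or samples).
    module_of: callable mapping a parameter name to a module key, e.g.
    'backbone', 'fpn', 'rpn', 'det_head', 'mask_head'.
    """
    buckets1, buckets2 = defaultdict(list), defaultdict(list)
    for name, g in grads_half1.items():
        key = module_of(name)
        buckets1[key].append(g.flatten())
        buckets2[key].append(grads_half2[name].flatten())
    phi = {}
    for key in buckets1:
        g1, g2 = torch.cat(buckets1[key]), torch.cat(buckets2[key])
        phi[key] = (lr ** 2) * (1.0 - F.cosine_similarity(g1, g2, dim=0)).item()
    return phi
```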
Adaptive Gradient Variance Modulator (AGVM). Let $\mathcal{M}$ be the set of modules in a complicated dense prediction pipeline, where $\mathcal{M}$ has $h$ different modules. At the $t$-th iteration, we have a set of learning rates, $\{\hat{\eta}_t^{(i)} \mid i \in \{1, 2, ..., h\}\}$, corresponding to the different modules. We treat the Backbone ($i = 1$) as the anchor and modulate the other modules, making their gradient variances consistent with the Backbone. Specifically, we adjust the module learning rates $\hat{\eta}_t^{(i)}$ by using the ratio between $\Phi_t^{(1)}$ and $\Phi_t^{(i)}$. The update rule for each network module can be written as:
$$w_{t+1}^{(i)} = w_t^{(i)} - \hat{\eta}_t^{(i)} g_t^{(i)}, \quad \text{where} \quad \hat{\eta}_t^{(i)} = \eta_t \mu_t^{(i)} \quad \text{and} \quad \mu_t^{(i)} = \sqrt{\frac{\Phi_t^{(1)}}{\Phi_t^{(i)}}}, \qquad (4)$$
Table 1: Comparisons between different methods. "Generalization" indicates the methods' generalization ability for dense visual prediction tasks. The number of "+" in the column "Stable to batch size scaling" indicates the degree of stability when the batch size is increased, and the number in brackets is the maximum applicable batch size without divergence on object detection. We measure the average extra overhead of the Faster R-CNN+ResNet50 detector at each iteration using 128 NVIDIA A100 GPUs (total batch size is 1024). The number in the column "Extra overhead" indicates the ratio of extra overhead (an extra all-reduce call) compared to the original computations. "N/A" means no extra overhead.

Method | Solution | Generalization | Less hyperparam. tuning | Stable to batch size scaling | Extra overhead
MegDet [28] | Accumulate statistics of BN | yes | yes | + (1024) | N/A
SAM [23] | Penalize sharp minima | no | no | + (2048) | 100%
LARS [8] | Rectify layerwise gradient | no | no | + (1024) | N/A
LAMB [6] | Rectify layerwise gradient | no | no | ++ (4096) | N/A
PMD-LAMB [29] | Reduce historical effect | yes | no | ++ (4096) | N/A
AGVM | Balance gradient variance | yes | yes | +++ (10k) | 0.12%
where $\eta_t$ is the global learning rate. However, simply adjusting the learning rates on-the-fly would easily yield training failure due to transitory large variance ratios that impede the optimization. We propose a momentum update to address this problem. Let $\alpha \in [0, 1)$ be a momentum coefficient; we have:
$$\mu_t^{(i)} \leftarrow \alpha \mu_{t-1}^{(i)} + (1 - \alpha) \mu_t^{(i)}, \qquad (5)$$
which reduces the influence of unstable variance. Note that we update $\mu_t^{(i)}$ every $\tau$ iterations.
Discussion on Momentum and Weight Decay. In practice, weight decay is widely used as a regularizer and is tightly coupled with the learning rate and the momentum. For instance, the gradient $g_t^{(i)}$ will be replaced by the momentum, such as $m_t^{(i)} = \beta_1 m_{t-1}^{(i)} + (1 - \beta_1)(g_t^{(i)} + \lambda w_t^{(i)})$ [6, 26], where $\beta_1$ and $\lambda$ indicate the momentum coefficient and the weight decay coefficient, respectively. We observe that it is also important to modulate the learning rate by Eq. (4) when weight decay is present. In addition, since the above $m_t^{(i)}$ is a momentum-based moving average of $(g_t^{(i)} + \lambda w_t^{(i)})$, we can directly apply $\hat{\eta}_t^{(i)}$ onto $m_t^{(i)}$.
Extensions to Different Optimization Algorithms. AGVM can be easily embedded into different optimization algorithms such as SGD and AdamW. We demonstrate the details in Appendix A.6 (Alg. 1 and Alg. 2, respectively). They can be easily implemented in a deep learning framework, e.g., PyTorch [27].
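As a hedged usage example (the module prefixes below are typical of a Mask R-CNN-style model and are our own assumption, not the official interface), per-module parameter groups for AGVM can be built as follows and passed to a standard optimizer:

```python
import torch

def build_agvm_param_groups(model: torch.nn.Module, base_lr: float):
    """Group parameters by module so AGVM can scale each group's lr separately.

    The first group should be the backbone, which AGVM treats as the anchor.
    The module prefixes below are illustrative and depend on the model.
    """
    module_prefixes = ["backbone", "neck", "rpn_head", "roi_head", "mask_head"]
    groups = []
    for prefix in module_prefixes:
        params = [p for n, p in model.named_parameters()
                  if n.startswith(prefix) and p.requires_grad]
        if params:
            groups.append({"params": params, "lr": base_lr})
    return groups

# optimizer = torch.optim.SGD(build_agvm_param_groups(model, 0.02),
#                             lr=0.02, momentum=0.9, weight_decay=1e-4)
# modulator = AGVMModulator(num_modules=len(optimizer.param_groups))
```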
Discussion on Convergence Rate. With AGVM, the SGD and AdamW optimizers still have appealing convergence properties in general non-convex settings. Under some mild assumptions common in stochastic optimization, and in the case without heavy-ball momentum ($\beta_1 = 0$), SGD and AdamW achieve $O(1/\sqrt{T})$ and $O(\ln(T)/\sqrt{T})$ convergence rates, respectively, with an appropriate choice of the learning rate $\eta_t$. We present the analysis in Appendix A.4.
Comparisons with Existing Works. The purpose of exploring large-batch training is to speed up model training with increasing computational power, as well as to enable the exploration of larger datasets. As shown in Table 1, seminal works such as LARS [8], LAMB [6], and SAM [23] have made great contributions to large-batch training for plain vision pipelines, e.g., image-level prediction, though they often require hyper-parameter tuning by experienced engineers. For the complicated pipelines of dense visual predictions, they are typically not sufficient to achieve the desired generalization performance. MegDet [28] and PMD-LAMB [29] contribute preliminary attempts at applying large-batch training to object detection. Different from these approaches, we revisit the design paradigm of complicated dense visual perception pipelines and present a simple yet effective solution, AGVM, which is insensitive to hyperparameter tuning and can be easily plugged into many visual perception pipelines. For example, AGVM can perform stable training with an unprecedented batch size of 10k, which greatly reduces the training time. Moreover, AGVM adds a negligible computational overhead in training, unlike SAM, which involves two sequential (non-parallelizable) gradient computations at each iteration, resulting in a significant increase in training time.