Selecting and Composing Learning Rate Policies for Deep
Neural Networks
YANZHAO WU and LING LIU, Georgia Institute of Technology, USA
The choice of learning rate (LR) functions and policies has evolved from a simple fixed LR to the decaying LR and the cyclic LR, aiming to improve the accuracy and reduce the training time of Deep Neural Networks (DNNs). This paper presents a systematic approach to selecting and composing an LR policy for effective DNN training to meet desired target accuracy and reduce training time within the pre-defined training iterations. It makes three original contributions. First, we develop an LR tuning mechanism for auto-verification of a given LR policy with respect to the desired accuracy goal under the pre-defined training time constraint. Second, we develop an LR policy recommendation system (LRBench) to select and compose good LR policies from the same and/or different LR functions through dynamic tuning, and to avoid bad choices, for a given learning task, DNN model and dataset. Third, we extend LRBench by supporting different DNN optimizers and show the significant mutual impact of different LR policies and different optimizers. Evaluated using popular benchmark datasets and different DNN models (LeNet, CNN3, ResNet), we show that our approach can effectively deliver high DNN test accuracy, outperform the existing recommended default LR policies, and reduce the DNN training time by 1.6∼6.7× to meet a targeted model accuracy.
CCS Concepts: • Computing methodologies → Machine learning; Heuristic function construction; Learning settings.
Additional Key Words and Phrases: Learning Rate, Hyper-parameter Optimization, Deep Neural Network,
Deep Learning, Training, Accuracy
ACM Reference Format:
Yanzhao Wu and Ling Liu. 2022. Selecting and Composing Learning Rate Policies for Deep Neural Networks.
ACM Trans. Intell. Syst. Technol. 37, 4, Article 111 (October 2022), 25 pages.
Authors’ address: Yanzhao Wu, yanzhaowu@gatech.edu; Ling Liu, lingliu@cc.gatech.edu, Georgia Institute of Technology, 266 Ferst Drive, Atlanta, Georgia, USA, 30332-0765.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2022 Association for Computing Machinery.
2157-6904/2022/10-ART111 $15.00
https://doi.org/

1 INTRODUCTION
Hyperparameter tuning is widely recognized as a critical optimization for efficient training of deep neural networks (DNNs). A deep neural network is trained iteratively over the input training data D through forward and backward processes to update a set of trainable model parameters Θ, based on the configuration of its hyper-parameters H and the optimization algorithm of its loss function. The learning rate (η) is one of the most important hyper-parameters for efficiency optimization of DNN training algorithms. The learning rate (LR) function and the configuration policy are known to have direct impacts on both the training efficacy and the test accuracy of the trained model. However, it is challenging to choose a good LR function, to select a good LR policy (e.g., a specific LR parameter configuration) given an LR function, and to avoid bad LR policies. Even for the fixed learning rate function, it is non-trivial to choose a good value and avoid a bad one, since a too small or too large LR value may impair the DNN training progress on both model accuracy and training time, resulting in slow convergence or even model divergence [5, 9].
The typical trial-and-error approach will try different LR values each time for training, which is tedious and time-consuming even for tuning a single LR value. Even with a reduced search space such as [0.0001, 0.1], the possible LR values for trial-and-error can still be innumerable. Given the difficulty of determining a good LR value for the fixed learning rate, a growing number of research efforts has been devoted to more complex LR functions (η(t; P)), which have multiple LR parameters instead of a single fixed value, and which change as a function of the training iteration (t). As a result, finding a good LR function and selecting a good LR policy demand tuning multiple LR parameters for each LR function, making hyperparameter tuning for LR a far-reaching challenge [6, 10, 22, 30, 36, 41]. Moreover, good LR policies for a given LR function tend to vary based on the specific datasets and learning tasks, and the DNN algorithms used for model training [1, 11, 28, 29, 40, 42]. In practice, empirical approaches are typically used to manually select an LR function and configure a good LR policy by choosing the concrete LR parameters through trial and error. For example, most DL frameworks (e.g., TensorFlow, Caffe and PyTorch) recommend different LR policies for different benchmark learning tasks and datasets as their default LR policies in their public releases with accuracy/training time benchmark results. Even for the same learning task and dataset, each of these DL frameworks often has different learning rate policies for different DNN models as the recommended defaults. For example, TensorFlow uses a constant learning rate (fixed LR) with a specific LR value as its recommended LR policy for CIFAR-10 when training using AlexNet, and uses a decaying LR (NSTEP) as its default LR policy when ResNet is used for training a CIFAR-10 classifier. Many popular DNN training optimizers, such as Stochastic Gradient Descent (SGD) [8], SGD with Momentum [33] and Adam [24], utilize the learning rate in their optimization execution, indicating that the learning rate is a critical hyper-parameter for DNN training. For example, for MNIST, TensorFlow chooses the Adam optimizer with the fixed LR of 0.0001, and Caffe, Torch and Theano all choose the SGD optimizer with a fixed LR, but they set their default choice for the fixed LR to 0.01, 0.05 and 0.1 respectively. In comparison, for CIFAR-10, TensorFlow, Torch and Theano choose SGD with fixed LR values of 0.1, 0.001 and 0.01 respectively, while Caffe changes its LR function to a two-step decay LR policy with 0.001 as the LR value for the first 4,000 iterations of training and 0.0001 as the updated LR value for the last 1,000 iterations [42]. However, when a new DNN model is used for training, or an existing DNN model is trained for a new learning task or a new dataset, domain scientists and engineers have found it hard to select and compose a good learning rate policy and avoid worse LR choices for effective training of DNN models. The manual-tuning task of finding a good or acceptable LR policy with respect to the training accuracy objective and training time constraint can be labor-intensive and error-prone, especially given the large search space of LR values for LR functions of either a single parameter or multiple parameters. There is a high demand for designing and developing a systematic approach to selecting and composing a good LR policy for a given learning task and dataset and a given DNN training algorithm. We argue that a good LR policy can notably improve the DNN training performance on both model test accuracy and model training time, significantly alleviate the frustration of manual-tuning difficulty and costs, and, more importantly, help avoid the bad LR policy choices that lead to below-average or poor training performance.
Bearing these objectives in mind, we present a systematic study of 15 representative learning rate functions from four LR algorithm families. This paper makes three original contributions. First, we develop an LR tuning mechanism that enables dynamic tuning and verification of LR policies with respect to the desired accuracy goal and training time constraints, e.g., the pre-set #Iterations or #Epochs (the number of complete passes through the training data D). Second, we develop an LR policy recommendation system (LRBench) to select and compose good LR policies from the same and/or different LR function(s) and avoid bad ones for a given learning task and a given dataset and DNN model. Third, we incorporate the support of different DNN optimizers and the recommendation of adaptive composite LR policies. The adaptive composite LR policy can further improve the quality of LR policy selection by creating a multi-policy LR that combines multiple LR policies from different LR functions at different stages of the training process, boosting the overall performance of DNN model training on accuracy and training time. We evaluate our approach using four benchmark datasets, MNIST [26], CIFAR-10 [25], SVHN [32] and ImageNet [35], and three families of DNN backbone algorithms for model training: LeNet [26], CNN3 [21], and ResNet [17]. The results show that our approach is effective and that the LR policies chosen by LRBench can consistently deliver high DNN model accuracy, outperform the existing recommended default LR policies for a given DNN model, learning task and dataset, and reduce the DNN training time by 1.6∼6.7× to meet a targeted accuracy.
2 PROBLEM STATEMENT
The DNN training with a given set of hyperparameters outputs a trained model F with dataset-specific model parameters (Θ). During the training, an optimizer is used to update the model parameters and improve the model performance iteratively with two important optimizations. (1) A loss function (L) is computed statistically and used to measure the prediction deviation of the DNN model output from the ground truth, which enables the optimizer to reduce and minimize the loss value (error) throughout the iterative model update process. (2) The learning rate policy (η(t)) is leveraged by the optimizer to control and adjust the amount of model parameter updates to be exercised during each training iteration t, which enables the optimizer to tune the rate of the update to the model parameters between slow and fast based on the specific learning rate value given at each training iteration. There are three primary goals for DNN training when adjusting the extent of the update on the model parameters based on a specific LR policy: (i) to control the model learning speed, (ii) to avoid over-fitting to a single mini-batch, and (iii) to ensure that the model converges to a global/local optimum.
Non-convex optimization algorithms are widely adopted as the optimizer for DNN training, such as Stochastic Gradient Descent (SGD) [8], SGD with Momentum [33], Adam [24], Nesterov [38] and so forth. For SGD, the DNN parameter update can be formalized as follows:

$\Theta_{t+1} = \Theta_t - \eta(t)\nabla L$    (1)

where t represents the current iteration, L is the loss function, ∇L is the gradient, and η(t) is the learning rate (LR) at iteration t that controls the extent of the update to the model parameters (i.e., $\Theta_{t+1} - \Theta_t = -\eta(t)\nabla L$).
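To make the role of η(t) concrete, here is a minimal NumPy-style sketch of the SGD update in Formula (1); the names (sgd_step, lr_policy) are illustrative and not part of LRBench or any specific framework.

```python
import numpy as np

def sgd_step(params, grad, lr_policy, t):
    """One SGD update: params_{t+1} = params_t - eta(t) * grad (Formula (1))."""
    eta = lr_policy(t)          # learning rate value at iteration t
    return params - eta * grad  # extent of the update is scaled by eta

# Example: a fixed LR policy eta(t) = 0.01 applied to a toy parameter vector.
params = np.zeros(3)
grad = np.array([0.5, -1.0, 2.0])   # gradient of the loss w.r.t. params
params = sgd_step(params, grad, lambda t: 0.01, t=0)
```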
Other optimizers, such as SGD with Momentum (Momentum) and Adam, adopt a similar method to update the model parameters. For example, Momentum updates the model parameters Θ as Formula (2) shows:

$V_t = \gamma V_{t-1} - \eta(t)\nabla L, \quad \Theta_{t+1} = \Theta_t + V_t$    (2)

where V_t is the accumulated gradient at iteration t to be applied to the model parameters and γ is a coefficient applied to the previous V_{t-1}, typically set to 0.9. Adam is another popular optimizer widely used in DNN training. It updates the model parameters Θ as Formula (3) shows:

$M_t = \beta_1 M_{t-1} + (1-\beta_1)\nabla L, \quad V_t = \beta_2 V_{t-1} + (1-\beta_2)(\nabla L)^2$
$\hat{M}_t = \frac{M_t}{1-\beta_1^t}, \quad \hat{V}_t = \frac{V_t}{1-\beta_2^t}, \quad \Theta_{t+1} = \Theta_t - \frac{\eta(t)}{\sqrt{\hat{V}_t}+\epsilon}\hat{M}_t$    (3)

where β1 and β2 are the coefficients that balance the previous accumulated gradients and the square of the gradients, M̂_t and V̂_t are the bias-corrected estimates, and ε is a small constant. Typically, we set β1 = 0.9, β2 = 0.999 and ε = 10^−8. Formulas (1)∼(3) all contain the learning rate policy η(t), as do other popular optimizers such as Nesterov [38] and AdaDelta [43].
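For illustration, the sketch below applies the same idea to Formulas (2) and (3), showing that the LR policy η(t) enters every optimizer's update; the function names are illustrative, and the default constants simply mirror the typical settings above.

```python
import numpy as np

def momentum_step(theta, grad, v, lr_policy, t, gamma=0.9):
    """Momentum update (Formula (2)): V_t = gamma*V_{t-1} - eta(t)*grad."""
    v = gamma * v - lr_policy(t) * grad
    return theta + v, v

def adam_step(theta, grad, m, v, lr_policy, t,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update (Formula (3)) with bias-corrected moments (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr_policy(t) * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```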
In addition, [4] proposed to search for optimizers for training DNNs by modeling an optimizer as Formula (4):

$\Theta_{t+1} = \Theta_t - \eta(t)\, b(u_1(op_1), u_2(op_2))$    (4)

where op_1 and op_2 are the operands, such as ∇L, M_t and V_t in Formula (3), and u_1(·), u_2(·) and b(·,·) denote the unary and binary functions respectively, such as mapping the input x to −x or log|x| for the unary functions, and addition and multiplication for the binary functions. In particular, the learning rate policy η(t) still plays a critical role in this search and optimization process.
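As a toy, scalar illustration of the form in Formula (4), here is a small sketch; the particular operand values and the unary/binary functions chosen below are hypothetical examples, not taken from [4].

```python
import math

# Hypothetical unary and binary function pools for Formula (4).
unary_ops = {"identity": lambda x: x,
             "negate": lambda x: -x,
             "log_abs": lambda x: math.log(abs(x) + 1e-12)}
binary_ops = {"add": lambda a, b: a + b,
              "mul": lambda a, b: a * b}

def searched_optimizer_step(theta, op1, op2, lr_policy, t,
                            u1="identity", u2="identity", b="mul"):
    """Theta_{t+1} = Theta_t - eta(t) * b(u1(op1), u2(op2))  (Formula (4))."""
    update = binary_ops[b](unary_ops[u1](op1), unary_ops[u2](op2))
    return theta - lr_policy(t) * update

# With u1 = u2 = identity, b = mul, op1 = grad and op2 = 1.0, this step
# reduces to plain SGD: theta - eta(t) * grad.
theta = searched_optimizer_step(1.0, op1=0.5, op2=1.0, lr_policy=lambda t: 0.01, t=0)
```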
Learning rate optimization is the subproblem of hyper-parameter optimization restricted to the learning rate η. For DNN training, given an optimizer O and a deep neural network F_Θ with trainable model parameters Θ, the optimizer O minimizes the loss L(x; F_Θ) over i.i.d. samples x from a natural (ground truth) distribution G_x. In practice, the optimizer O maps a training dataset X_train to data-specific model parameters Θ for a given deep neural network F_Θ, that is, Θ = O(X_train). An important hyper-parameter for the optimizer O is the learning rate policy η = η(t). With the chosen η, we have the optimizer O_η and Θ = O_η(X_train). Learning rate optimization aims at identifying a good learning rate policy η to minimize the generalization error $\mathbb{E}_{x\sim G_x}[L(x; F_{O_\eta(X_{train})})]$. In practice, we use a validation dataset X_val to estimate the generalization error, that is, $L_{x\in X_{val}}(x; F_{O_\eta(X_{train})})$. Let P denote the set containing all possible LR policies; the LR optimization problem is then formalized as Formula (5):

$\hat{\eta} = \arg\min_{\eta \in P} L_{x \in X_{val}}(x; F_{O_\eta(X_{train})})$    (5)
Different from other hyper-parameters, such as the weight decay rate, the number of filters and the kernel size, which are typically constant throughout the entire training process, the learning rate may change over the training iterations t. In practice, we choose a finite set of S LR policies, consisting of different LR functions, denoted as P* ⊆ P with P* = {η_1(t), η_2(t), ..., η_S(t)}, e.g., η_1(t) = k (a fixed LR) and η_2(t) = γ^t (γ < 1). Hence, we can formalize the LR optimization as Formula (6), that is, to select or compose the optimal LR policy from the candidate set P* = {η_1(t), η_2(t), ..., η_S(t)}:

$\hat{\eta} = \arg\min_{\eta \in P^* = \{\eta_1(t),...,\eta_S(t)\}} L_{x \in X_{val}}(x; F_{O_\eta(X_{train})})$    (6)
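A minimal sketch of the selection in Formula (6): train once per candidate policy and keep the one with the lowest validation loss. The train and validation_loss callables are placeholders for the user's own training and evaluation routines, not LRBench APIs.

```python
def select_lr_policy(candidates, train, validation_loss):
    """Pick eta_hat = argmin over the finite candidate set P* (Formula (6)).

    candidates      -- list of LR policies, each a callable eta(t)
    train           -- callable: lr_policy -> trained model parameters Theta
    validation_loss -- callable: Theta -> loss estimated on X_val
    """
    best_policy, best_loss = None, float("inf")
    for eta in candidates:
        theta = train(eta)                 # Theta = O_eta(X_train)
        loss = validation_loss(theta)      # L_{x in X_val}(x; F_Theta)
        if loss < best_loss:
            best_policy, best_loss = eta, loss
    return best_policy

# Example candidate set P*: a fixed LR and an exponentially decaying LR.
candidates = [lambda t: 0.01, lambda t: 0.05 * (0.99994 ** t)]
```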
3 LEARNING RATE SELECTION AND COMPOSITION
The learning rate is a function of the training iteration t, with a set of parameters and a method to determine the learning rate value at each iteration t of the overall training process. A learning rate policy specifies a concrete parameter setting of an LR function. For example, a fixed learning rate of 0.01 is an LR policy of the constant learning rate method with a fixed value of 0.01 throughout all iterations of the model training. Another example is the two-step LR policy of 0.01 in the first half of the training iterations and 0.001 in the second half. In this section we cover a total of 15 functions from three families of LR functions: fixed LRs, decaying LRs and cyclic LRs. We use the term single learning rate policy to refer to an LR policy that corresponds to a single LR function, and refer to an LR policy that is defined by combining multiple LR policies from two or more LR functions as a composite LR or multi-policy LR. We first describe our approach to selecting a single LR policy for a given learning task, dataset and DNN backbone algorithm for model training. Then we introduce our composite LR scheme for selecting and composing an adaptive LR policy to further boost the overall training performance in terms of accuracy or training time given a target accuracy.
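As a concrete illustration, the two-step LR policy example above can be written as a small Python function; the 0.01/0.001 values come from the example, while the function name and iteration handling are illustrative.

```python
def two_step_policy(t, max_iter):
    """Two-step LR policy from the example above: 0.01 for the first half of
    training, 0.001 for the second half."""
    return 0.01 if t < max_iter / 2 else 0.001

# Example: with 10,000 total iterations, eta(2000) = 0.01 and eta(7000) = 0.001.
```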
3.1 Single Policy Learning Rates
Fixed LRs (FIX), also called constant LRs, use a pre-selected fixed LR value throughout the entire training process, represented by η(t) = k with k as the only hyper-parameter to tune. However, a too small k may slow down the training progress significantly, while a too large k may accelerate training at the cost of causing the loss function to fluctuate wildly, making the training fail to converge and resulting in very low accuracy. A conservative approach is popularly used, which chooses a small value to ensure model convergence and avoid an oscillating loss (e.g., 0.01 for MNIST on LeNet and 0.001 for CIFAR-10 on CNN3). However, choosing a small and yet good fixed LR value is challenging. Even for the same learning task and dataset, e.g., CIFAR-10, different DNN models need different constant values to meet the target accuracy goal (e.g., for CIFAR-10, 0.001 on CNN3 and 0.1 on ResNet-32). Another limitation of fixed LR policies is that they cannot adapt to the need for different learning speeds during different stages of the iterative learning process, and thus may fall short of the peak accuracy, either by missing the speed-up opportunity when the training is on a plateau or by failing to converge at the end of training.
Decaying LRs address the limitation of the fixed LRs by using decreasing LR values during training. Similar to simulated annealing, training with a decaying LR starts with a relatively large LR value, which is reduced gradually throughout the training, aiming to accelerate the learning process while ensuring that the training converges with good accuracy or meets the target accuracy. A decaying LR policy is defined by a decay function g(t) and a constant coefficient k, denoted by η(t) = k·g(t). g(t) gradually decreases from the upper bound of 1 as the number of iterations (t) increases, and the constant k serves as the starting learning rate.
Table 1. Decaying Functions g(t) for Decaying LRs

abbr.  | g(t)                                              | Schedule | Param  | #Param
STEP   | γ^floor(t/l)                                      | t, l     | γ, l   | 2
NSTEP  | γ^i, i ∈ N s.t. l_{i−1} ≤ t < l_i (l_0,...,l_{n−1}) | t, l_i   | γ, l_i | n+1
EXP    | γ^t                                               | t        | γ      | 1
INV    | 1/(1 + t·γ)^p                                     | t        | γ, p   | 2
POLY   | (1 − t/max_iter)^p                                | t        | p      | 1
Table 1 lists the 5 most popular decaying LRs supported in LRBench. The STEP function defines the LR policy at iteration t with 2 parameters, a fixed step size l (l > 1) and an exponential factor γ. The LR value is initialized with k and decays every l iterations by γ. NSTEP enriches STEP by introducing n variable step sizes, denoted by l_0, l_1, ..., l_{n−1}, instead of the one fixed step size l. NSTEP is initialized by k (g(t) = 1 when i = 0 and t < l_0) and computed by γ^i (when i > 0 and l_{i−1} ≤ t < l_i). EXP is an LR function defined by an exponential function (γ^t). Although EXP, STEP and NSTEP all use an exponential function to define g(t), their choices of the concrete γ differ. To avoid the learning rate decaying too fast due to exponential explosion, EXP uses a γ that is close to 1, e.g., 0.99994, and reduces the LR value every iteration. In contrast, STEP and NSTEP employ a small γ, e.g., 0.1, and decay the LR value using one fixed step size l or n variable step sizes l_i. The total number of steps is determined, for STEP, by the step size and the pre-defined training #Iterations (or #Epochs), and for NSTEP, n is typically small, e.g., 2∼5 steps. Other decaying LRs are based on the inverse time function (INV) and the polynomial function (POLY) with parameter p, as shown in Table 1.
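To make the decaying LR functions in Table 1 concrete, here is a small Python sketch of η(t) = k·g(t) for each decay function; the default parameter values are illustrative examples rather than LRBench's recommended settings.

```python
import math

def step(t, k, gamma=0.1, l=5000):
    """STEP: eta(t) = k * gamma^floor(t / l)."""
    return k * gamma ** math.floor(t / l)

def nstep(t, k, gamma=0.1, steps=(4000, 6000, 8000)):
    """NSTEP: eta(t) = k * gamma^i with i s.t. l_{i-1} <= t < l_i."""
    i = sum(1 for l_i in steps if t >= l_i)   # number of step boundaries passed
    return k * gamma ** i

def exp_lr(t, k, gamma=0.99994):
    """EXP: eta(t) = k * gamma^t, with gamma close to 1, decayed every iteration."""
    return k * gamma ** t

def inv(t, k, gamma=0.0001, p=0.75):
    """INV: eta(t) = k / (1 + t * gamma)^p."""
    return k / (1 + t * gamma) ** p

def poly(t, k, p=1.0, max_iter=10000):
    """POLY: eta(t) = k * (1 - t / max_iter)^p."""
    return k * (1 - t / max_iter) ** p
```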
3.2 Composite Policy Learning Rates
There are two types of composite learning rate schemes, according to whether the LRs are composed using the same LR function or using two or more different LR functions. The former is coined as homogeneous multi-policy LRs and the latter is called heterogeneous multi-policy LRs.
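A minimal sketch of a heterogeneous multi-policy LR, assuming the composition simply switches between LR functions from different families at a pre-chosen point in training; the switch point and member policies are illustrative assumptions, not LRBench's recommendation.

```python
def heterogeneous_multi_policy(t, max_iter, k=0.01, gamma=0.99994):
    """Illustrative heterogeneous multi-policy LR that combines two different
    LR functions: a fixed LR in the first half of training and an EXP-style
    decaying LR in the second half."""
    switch = max_iter // 2
    if t < switch:
        return k                          # stage 1: FIX policy, eta(t) = k
    return k * gamma ** (t - switch)      # stage 2: EXP policy restarted at the switch
```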