Selecting and Composing Learning Rate Policies for Deep
Neural Networks
YANZHAO WU and LING LIU, Georgia Institute of Technology, USA
The choice of learning rate (LR) functions and policies has evolved from a simple fixed LR to the decaying LR and the cyclic LR, aiming to improve the accuracy and reduce the training time of Deep Neural Networks (DNNs). This paper presents a systematic approach to selecting and composing an LR policy for effective DNN training to meet desired target accuracy and reduce training time within the pre-defined training iterations. It makes three original contributions. First, we develop an LR tuning mechanism for auto-verification of a given LR policy with respect to the desired accuracy goal under the pre-defined training time constraint. Second, we develop an LR policy recommendation system (LRBench) to select and compose good LR policies from the same and/or different LR functions through dynamic tuning, and to avoid bad choices, for a given learning task, DNN model and dataset. Third, we extend LRBench by supporting different DNN optimizers and show the significant mutual impact of different LR policies and different optimizers. Evaluated using popular benchmark datasets and different DNN models (LeNet, CNN3, ResNet), we show that our approach can effectively deliver high DNN test accuracy, outperform the existing recommended default LR policies, and reduce the DNN training time by 1.6∼6.7× to meet a targeted model accuracy.
CCS Concepts: • Computing methodologies → Machine learning; Heuristic function construction; Learning settings.
Additional Key Words and Phrases: Learning Rate, Hyper-parameter Optimization, Deep Neural Network,
Deep Learning, Training, Accuracy
ACM Reference Format:
Yanzhao Wu and Ling Liu. 2022. Selecting and Composing Learning Rate Policies for Deep Neural Networks.
ACM Trans. Intell. Syst. Technol. 37, 4, Article 111 (October 2022), 25 pages.
Authors’ address: Yanzhao Wu, yanzhaowu@gatech.edu; Ling Liu, lingliu@cc.gatech.edu, Georgia Institute of Technology, 266 Ferst Drive, Atlanta, Georgia, USA, 30332-0765.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2022 Association for Computing Machinery.
2157-6904/2022/10-ART111 $15.00
https://doi.org/

1 INTRODUCTION
Hyperparameter tuning is widely recognized as a critical optimization for efficient training of deep neural networks (DNNs). A deep neural network is trained iteratively over the input training data D through forward and backward processes to update a set of trainable model parameters Θ, based on the configuration of its hyper-parameters H and the optimization algorithm of its loss function. The learning rate (η) is one of the most important hyper-parameters for efficiency optimization of DNN training algorithms. The learning rate (LR) function and the configuration policy are known to have direct impacts on both the training efficacy and the test accuracy of the trained model. However, it is challenging to choose a good LR function, to select a good LR policy (e.g., a specific LR parameter configuration) given an LR function, and to avoid bad LR policies. Even for the fixed learning rate function, it is non-trivial to choose a good value and avoid a bad one, since a too small or too large LR value may impair the DNN training progress on both model accuracy and training time, resulting in slow convergence or even model divergence [5, 9].
The typical trial-and-error approach will try different LR values each time for training, which is tedious and time-consuming even for tuning a single LR value. Even with a reduced search space such as [0.0001, 0.1], the possible LR values for trial-and-error can still be innumerable. Given the difficulty of determining a good LR value for the fixed learning rate, a growing number of research efforts has been devoted to more complex LR functions (η(t; P)), which have multiple LR parameters instead of a single fixed value, and which change as a function of the training iteration (t). As a result, finding a good LR function and selecting a good LR policy demand tuning multiple LR parameters for each LR function, making hyperparameter tuning for LR a far-reaching challenge [6, 10, 22, 30, 36, 41]. Moreover, good LR policies for a given LR function tend to vary based on the specific datasets and learning tasks, and the DNN algorithms used for model training [1, 11, 28, 29, 40, 42]. In practice, empirical approaches are typically used to manually select an LR function and configure a good LR policy by choosing the concrete LR parameters through trial and error. For example, most DL frameworks (e.g., TensorFlow, Caffe and PyTorch) recommend different LR policies for different benchmark learning tasks and datasets as their default LR policies in their public releases with accuracy/training time benchmark results. Even for the same learning task and dataset, each of these DL frameworks often has different learning rate policies for different DNN models as the recommended defaults. For example, TensorFlow uses a constant learning rate (fixed LR) with a specific LR value as its recommended LR policy for CIFAR-10 when training using AlexNet, and uses a decaying LR (NSTEP) as its default LR policy when ResNet is used for training a CIFAR-10 classifier. Many popular DNN training optimizers, such as Stochastic Gradient Descent (SGD) [8], SGD with Momentum [33] and Adam [24], utilize the learning rate in their optimization execution, indicating that the learning rate is a critical hyper-parameter for DNN training. For example, for MNIST, TensorFlow chooses the Adam optimizer with the fixed LR of 0.0001, and Caffe, Torch and Theano all choose the SGD optimizer with a fixed LR, but they set their default choice for the fixed LR to 0.01, 0.05 and 0.1 respectively. In comparison, for CIFAR-10, TensorFlow, Torch and Theano choose SGD with fixed LR values of 0.1, 0.001 and 0.01 respectively, while Caffe changes its LR function to a two-step decay LR policy with 0.001 as the LR value for the first 4,000 iterations of training and 0.0001 as the updated LR value for the last 1,000 iterations [42]. However, when a new DNN model is used for training, or an existing DNN model is trained for a new learning task or a new dataset, domain scientists and engineers have found it hard to select and compose a good learning rate policy and avoid worse LR choices for effective training of DNN models. The manual-tuning task of finding a good or acceptable LR policy with respect to the training accuracy objective and training time constraint can be labor-intensive and error-prone, especially given the large search space of LR values for LR functions of either a single parameter or multiple parameters. There is a high demand for designing and developing a systematic approach to selecting and composing a good LR policy for a given learning task and dataset and a given DNN training algorithm. We argue that a good LR policy can notably improve the DNN training performance on both model test accuracy and model training time, significantly alleviate the frustration of manual-tuning difficulty and costs, and, more importantly, help avoid the bad LR policy choices that lead to below-average or poor training performance.
Bearing these objectives in mind, we present a systematic study of 15 representative learning rate functions from four LR algorithm families. This paper makes three original contributions. First, we develop an LR tuning mechanism that enables dynamic tuning and verification of LR policies with respect to the desired accuracy goal and training time constraints, e.g., the pre-set #Iterations or #Epochs (the number of complete passes through the training data D). Second, we develop an LR policy recommendation system (LRBench) to select and compose good LR policies from the same and/or different LR function(s) and avoid bad ones for a given learning task and a given dataset and DNN model. Third, we incorporate the support of different DNN optimizers and the recommendation of adaptive composite LR policies. The adaptive composite LR policy can further improve the quality of LR policy selection by creating a multi-policy LR that combines multiple LR policies from different LR functions at different stages of the training process, boosting the overall performance of DNN model training on accuracy and training time. We evaluate our approach using four benchmark datasets, MNIST [26], CIFAR-10 [25], SVHN [32] and ImageNet [35], and three families of DNN backbone algorithms for model training: LeNet [26], CNN3 [21], and ResNet [17]. The results show that our approach is effective and that the LR policies chosen by LRBench can consistently deliver high DNN model accuracy, outperform the existing recommended default LR policies for a given DNN model, learning task and dataset, and reduce the DNN training time by 1.6∼6.7× to meet a targeted accuracy.
2 PROBLEM STATEMENT
The DNN training with a given set of hyperparameters outputs a trained model F with dataset-specific model parameters (Θ). During the training, an optimizer is used to update the model parameters and improve the model performance iteratively with two important optimizations. (1) A loss function (L) is computed statistically and used to measure the prediction deviation of the DNN model output from the ground truth, which enables the optimizer to reduce and minimize the loss value (error) throughout the iterative model update process. (2) The learning rate policy (η(t)) is leveraged by the optimizer to control and adjust the amount of model parameter updates to be exercised during each training iteration t, which enables the optimizer to tune the rate of the update to the model parameters between slow and fast based on the specific learning rate value given at each training iteration. There are three primary goals for DNN training when adjusting the extent of the update on the model parameters based on a specific LR policy: (i) to control the model learning speed, (ii) to avoid over-fitting to a single mini-batch, and (iii) to ensure that the model converges to a global/local optimum.
Non-convex optimization algorithms are widely adopted as the optimizer for DNN training, such as Stochastic Gradient Descent (SGD) [8], SGD with Momentum [33], Adam [24], Nesterov [38] and so forth. For SGD, the DNN parameter update can be formalized as follows:

$\Theta_{t+1} = \Theta_t - \eta(t)\nabla L$    (1)

where t represents the current iteration, L is the loss function, ∇L is the gradient, and η(t) is the learning rate (LR) at iteration t that controls the extent of the update to the model parameters (i.e., $\Theta_{t+1} - \Theta_t = -\eta(t)\nabla L$).
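To make the role of η(t) concrete, here is a minimal NumPy-style sketch of the SGD update in Formula (1); the names (sgd_step, lr_policy) are illustrative and not part of LRBench or any specific framework.

```python
import numpy as np

def sgd_step(params, grad, lr_policy, t):
    """One SGD update: params_{t+1} = params_t - eta(t) * grad (Formula (1))."""
    eta = lr_policy(t)          # learning rate value at iteration t
    return params - eta * grad  # extent of the update is scaled by eta

# Example: a fixed LR policy eta(t) = 0.01 applied to a toy parameter vector.
params = np.zeros(3)
grad = np.array([0.5, -1.0, 2.0])   # gradient of the loss w.r.t. params
params = sgd_step(params, grad, lambda t: 0.01, t=0)
```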
Other optimizers, such as SGD with Momentum (Momentum) and Adam, adopt a similar method to update the model parameters. For example, Momentum updates the model parameters Θ as Formula (2) shows:

$V_t = \gamma V_{t-1} - \eta(t)\nabla L, \quad \Theta_{t+1} = \Theta_t + V_t$    (2)

where V_t is the accumulated gradient at iteration t to be applied to the model parameters and γ is a coefficient applied to the previous V_{t-1}, typically set to 0.9. Adam is another popular optimizer widely used in DNN training. It updates the model parameters Θ as Formula (3) shows:

$M_t = \beta_1 M_{t-1} + (1-\beta_1)\nabla L, \quad V_t = \beta_2 V_{t-1} + (1-\beta_2)(\nabla L)^2$
$\hat{M}_t = \frac{M_t}{1-\beta_1^t}, \quad \hat{V}_t = \frac{V_t}{1-\beta_2^t}, \quad \Theta_{t+1} = \Theta_t - \frac{\eta(t)}{\sqrt{\hat{V}_t}+\epsilon}\hat{M}_t$    (3)

where β1 and β2 are the coefficients that balance the previous accumulated gradients and the square of the gradients, M̂_t and V̂_t are the bias-corrected estimates, and ε is a small constant. Typically, we set β1 = 0.9, β2 = 0.999 and ε = 10^−8. Formulas (1)∼(3) all contain the learning rate policy η(t), as do other popular optimizers such as Nesterov [38] and AdaDelta [43].
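For illustration, the sketch below applies the same idea to Formulas (2) and (3), showing that the LR policy η(t) enters every optimizer's update; the function names are illustrative, and the default constants simply mirror the typical settings above.

```python
import numpy as np

def momentum_step(theta, grad, v, lr_policy, t, gamma=0.9):
    """Momentum update (Formula (2)): V_t = gamma*V_{t-1} - eta(t)*grad."""
    v = gamma * v - lr_policy(t) * grad
    return theta + v, v

def adam_step(theta, grad, m, v, lr_policy, t,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update (Formula (3)) with bias-corrected moments (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr_policy(t) * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```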
In addition, [4] proposed to search for optimizers for training DNNs by modeling an optimizer as Formula (4):

$\Theta_{t+1} = \Theta_t - \eta(t)\, b(u_1(op_1), u_2(op_2))$    (4)

where op_1 and op_2 are the operands, such as ∇L, M_t and V_t in Formula (3), and u_1(·), u_2(·) and b(·,·) denote the unary and binary functions respectively, such as mapping the input x to −x or log|x| for the unary functions, and addition and multiplication for the binary functions. In particular, the learning rate policy η(t) still plays a critical role in this search and optimization process.
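As a toy, scalar illustration of the form in Formula (4), here is a small sketch; the particular operand values and the unary/binary functions chosen below are hypothetical examples, not taken from [4].

```python
import math

# Hypothetical unary and binary function pools for Formula (4).
unary_ops = {"identity": lambda x: x,
             "negate": lambda x: -x,
             "log_abs": lambda x: math.log(abs(x) + 1e-12)}
binary_ops = {"add": lambda a, b: a + b,
              "mul": lambda a, b: a * b}

def searched_optimizer_step(theta, op1, op2, lr_policy, t,
                            u1="identity", u2="identity", b="mul"):
    """Theta_{t+1} = Theta_t - eta(t) * b(u1(op1), u2(op2))  (Formula (4))."""
    update = binary_ops[b](unary_ops[u1](op1), unary_ops[u2](op2))
    return theta - lr_policy(t) * update

# With u1 = u2 = identity, b = mul, op1 = grad and op2 = 1.0, this step
# reduces to plain SGD: theta - eta(t) * grad.
theta = searched_optimizer_step(1.0, op1=0.5, op2=1.0, lr_policy=lambda t: 0.01, t=0)
```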
Learning rate optimization is the subproblem of hyper-parameter optimization restricted to the learning rate η. For DNN training, given an optimizer O and a deep neural network F_Θ with trainable model parameters Θ, the optimizer O minimizes the loss L(x; F_Θ) over i.i.d. samples x from a natural (ground truth) distribution G_x. In practice, the optimizer O maps a training dataset X_train to data-specific model parameters Θ for a given deep neural network F_Θ, that is, Θ = O(X_train). An important hyper-parameter for the optimizer O is the learning rate policy η = η(t). With the chosen η, we have the optimizer O_η and Θ = O_η(X_train). Learning rate optimization aims at identifying a good learning rate policy η to minimize the generalization error $\mathbb{E}_{x\sim G_x}[L(x; F_{O_\eta(X_{train})})]$. In practice, we use a validation dataset X_val to estimate the generalization error, that is, $L_{x\in X_{val}}(x; F_{O_\eta(X_{train})})$. Let P denote the set containing all possible LR policies; the LR optimization problem is then formalized as Formula (5):

$\hat{\eta} = \arg\min_{\eta \in P} L_{x \in X_{val}}(x; F_{O_\eta(X_{train})})$    (5)
Different from other hyper-parameters, such as the weight decay rate, the number of filters and the kernel size, which are typically constant throughout the entire training process, the learning rate may change over the training iterations t. In practice, we choose a finite set of S LR policies, consisting of different LR functions, denoted as P* ⊆ P with P* = {η_1(t), η_2(t), ..., η_S(t)}, e.g., η_1(t) = k (a fixed LR) and η_2(t) = γ^t (γ < 1). Hence, we can formalize the LR optimization as Formula (6), that is, to select or compose the optimal LR policy from the candidate set P* = {η_1(t), η_2(t), ..., η_S(t)}:

$\hat{\eta} = \arg\min_{\eta \in P^* = \{\eta_1(t),...,\eta_S(t)\}} L_{x \in X_{val}}(x; F_{O_\eta(X_{train})})$    (6)
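A minimal sketch of the selection in Formula (6): train once per candidate policy and keep the one with the lowest validation loss. The train and validation_loss callables are placeholders for the user's own training and evaluation routines, not LRBench APIs.

```python
def select_lr_policy(candidates, train, validation_loss):
    """Pick eta_hat = argmin over the finite candidate set P* (Formula (6)).

    candidates      -- list of LR policies, each a callable eta(t)
    train           -- callable: lr_policy -> trained model parameters Theta
    validation_loss -- callable: Theta -> loss estimated on X_val
    """
    best_policy, best_loss = None, float("inf")
    for eta in candidates:
        theta = train(eta)                 # Theta = O_eta(X_train)
        loss = validation_loss(theta)      # L_{x in X_val}(x; F_Theta)
        if loss < best_loss:
            best_policy, best_loss = eta, loss
    return best_policy

# Example candidate set P*: a fixed LR and an exponentially decaying LR.
candidates = [lambda t: 0.01, lambda t: 0.05 * (0.99994 ** t)]
```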
3 LEARNING RATE SELECTION AND COMPOSITION
The learning rate is a function of the training iteration t, with a set of parameters and a method to determine the learning rate value at each iteration t of the overall training process. A learning rate policy specifies a concrete parameter setting of an LR function. For example, a fixed learning rate of 0.01 is an LR policy of the constant learning rate method with a fixed value of 0.01 throughout all iterations of the model training. Another example is the two-step LR policy of 0.01 in the first half of the training iterations and 0.001 in the second half. In this section we cover a total of 15 functions from three families of LR functions: fixed LRs, decaying LRs and cyclic LRs. We use the term single learning rate policy to refer to an LR policy that corresponds to a single LR function, and refer to an LR policy that is defined by combining multiple LR policies from two or more LR functions as a composite LR or multi-policy LR. We first describe our approach to selecting a single LR policy for a given learning task, dataset and DNN backbone algorithm for model training. Then we introduce our composite LR scheme for selecting and composing an adaptive LR policy to further boost the overall training performance in terms of accuracy or training time given a target accuracy.
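As a concrete illustration, the two-step LR policy example above can be written as a small Python function; the 0.01/0.001 values come from the example, while the function name and iteration handling are illustrative.

```python
def two_step_policy(t, max_iter):
    """Two-step LR policy from the example above: 0.01 for the first half of
    training, 0.001 for the second half."""
    return 0.01 if t < max_iter / 2 else 0.001

# Example: with 10,000 total iterations, eta(2000) = 0.01 and eta(7000) = 0.001.
```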
3.1 Single Policy Learning Rates
Fixed LRs (FIX), also called constant LRs, use a pre-selected fixed LR value throughout the entire training process, represented by η(t) = k with k as the only hyper-parameter to tune. However, a too small k may slow down the training progress significantly, while a too large k may accelerate training at the cost of causing the loss function to fluctuate wildly, making the training fail to converge and resulting in very low accuracy. A conservative approach is popularly used, which chooses a small value to ensure model convergence and avoid an oscillating loss (e.g., 0.01 for MNIST on LeNet and 0.001 for CIFAR-10 on CNN3). However, choosing a small and yet good fixed LR value is challenging. Even for the same learning task and dataset, e.g., CIFAR-10, different DNN models need different constant values to meet the target accuracy goal (e.g., for CIFAR-10, 0.001 on CNN3 and 0.1 on ResNet-32). Another limitation of fixed LR policies is that they cannot adapt to the need for different learning speeds during different stages of the iterative learning process, and thus may fall short of the peak accuracy, either by missing the speed-up opportunity when the training is on a plateau or by failing to converge at the end of training.
Decaying LRs address the limitation of the fixed LRs by using decreasing LR values during training. Similar to simulated annealing, training with a decaying LR starts with a relatively large LR value, which is reduced gradually throughout the training, aiming to accelerate the learning process while ensuring that the training converges with good accuracy or meets the target accuracy. A decaying LR policy is defined by a decay function g(t) and a constant coefficient k, denoted by η(t) = k·g(t). g(t) gradually decreases from the upper bound of 1 as the number of iterations (t) increases, and the constant k serves as the starting learning rate.
Table 1. Decaying Functions g(t) for Decaying LRs

abbr.  | g(t)                                              | Schedule | Param  | #Param
STEP   | γ^floor(t/l)                                      | t, l     | γ, l   | 2
NSTEP  | γ^i, i ∈ N s.t. l_{i−1} ≤ t < l_i (l_0,...,l_{n−1}) | t, l_i   | γ, l_i | n+1
EXP    | γ^t                                               | t        | γ      | 1
INV    | 1/(1 + t·γ)^p                                     | t        | γ, p   | 2
POLY   | (1 − t/max_iter)^p                                | t        | p      | 1
Table 1 lists the 5 most popular decaying LRs supported in LRBench. The STEP function defines the LR policy at iteration t with 2 parameters, a fixed step size l (l > 1) and an exponential factor γ. The LR value is initialized with k and decays every l iterations by γ. NSTEP enriches STEP by introducing n variable step sizes, denoted by l_0, l_1, ..., l_{n−1}, instead of the one fixed step size l. NSTEP is initialized by k (g(t) = 1 when i = 0 and t < l_0) and computed by γ^i (when i > 0 and l_{i−1} ≤ t < l_i). EXP is an LR function defined by an exponential function (γ^t). Although EXP, STEP and NSTEP all use an exponential function to define g(t), their choices of the concrete γ differ. To avoid the learning rate decaying too fast due to exponential explosion, EXP uses a γ that is close to 1, e.g., 0.99994, and reduces the LR value every iteration. In contrast, STEP and NSTEP employ a small γ, e.g., 0.1, and decay the LR value using one fixed step size l or n variable step sizes l_i. The total number of steps is determined, for STEP, by the step size and the pre-defined training #Iterations (or #Epochs), and for NSTEP, n is typically small, e.g., 2∼5 steps. Other decaying LRs are based on the inverse time function (INV) and the polynomial function (POLY) with parameter p, as shown in Table 1.
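To make the decaying LR functions in Table 1 concrete, here is a small Python sketch of η(t) = k·g(t) for each decay function; the default parameter values are illustrative examples rather than LRBench's recommended settings.

```python
import math

def step(t, k, gamma=0.1, l=5000):
    """STEP: eta(t) = k * gamma^floor(t / l)."""
    return k * gamma ** math.floor(t / l)

def nstep(t, k, gamma=0.1, steps=(4000, 6000, 8000)):
    """NSTEP: eta(t) = k * gamma^i with i s.t. l_{i-1} <= t < l_i."""
    i = sum(1 for l_i in steps if t >= l_i)   # number of step boundaries passed
    return k * gamma ** i

def exp_lr(t, k, gamma=0.99994):
    """EXP: eta(t) = k * gamma^t, with gamma close to 1, decayed every iteration."""
    return k * gamma ** t

def inv(t, k, gamma=0.0001, p=0.75):
    """INV: eta(t) = k / (1 + t * gamma)^p."""
    return k / (1 + t * gamma) ** p

def poly(t, k, p=1.0, max_iter=10000):
    """POLY: eta(t) = k * (1 - t / max_iter)^p."""
    return k * (1 - t / max_iter) ** p
```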
3.2 Composite Policy Learning Rates
There are two types of composite learning rate schemes, according to whether the LRs are composed using the same LR function or using two or more different LR functions. The former is coined as homogeneous multi-policy LRs and the latter is called heterogeneous multi-policy LRs.
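A minimal sketch of a heterogeneous multi-policy LR, assuming the composition simply switches between LR functions from different families at a pre-chosen point in training; the switch point and member policies are illustrative assumptions, not LRBench's recommendation.

```python
def heterogeneous_multi_policy(t, max_iter, k=0.01, gamma=0.99994):
    """Illustrative heterogeneous multi-policy LR that combines two different
    LR functions: a fixed LR in the first half of training and an EXP-style
    decaying LR in the second half."""
    switch = max_iter // 2
    if t < switch:
        return k                          # stage 1: FIX policy, eta(t) = k
    return k * gamma ** (t - switch)      # stage 2: EXP policy restarted at the switch
```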