
of the weight generated by the algorithm (there is no need to register the mean and variance parameters in the code), we can achieve substantial savings in GPU memory and runtime cost.
Based on the connection to SGHMC, we extend the EM algorithm for Bayesian variable selection (Ročková and George, 2014; Wang et al., 2016; Ročková, 2018) from linear models to neural networks by replacing the Gaussian prior in the BNN with a spike-and-slab group Gaussian prior (Xu and Ghosh, 2015). Our method is an MCMC-within-EM algorithm that switches the weight decay factor between a small and a large value based on the magnitude of each group during training, as sketched below. Since, by construction, there are no exact zeros, we further introduce a simple pruning criterion to remove weights permanently during training. A sparse model is thus trained in one shot, without additional retraining. Our approach is more computationally efficient than dynamic pruning strategies that allow regrowth (Zhu and Gupta, 2017; Dettmers and Zettlemoyer, 2019; Lin et al., 2020), and we will show that this aggressive approach incurs no performance loss. Our code is available at https://github.com/z5041294/optimization-and-pruning-for-BNN.
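To illustrate the switching behaviour, the following is a minimal PyTorch-style sketch of choosing a per-group weight-decay factor from the group magnitude. The threshold `tau` and the two decay values are hypothetical placeholders chosen for illustration only; in our algorithm the switch is driven by the EM update of the spike-and-slab prior rather than a fixed threshold.

```python
import torch

def assign_group_weight_decay(groups, tau=1e-2, decay_small=1e-4, decay_large=1e-1):
    """Illustrative only: pick a per-group weight-decay factor from the group magnitude.

    `groups` is a list of weight tensors (one tensor per group); `tau`,
    `decay_small`, and `decay_large` are hypothetical values, not the ones
    used in the paper.
    """
    decays = []
    for w in groups:
        # Root-mean-square magnitude of the group acts as the switching statistic.
        group_scale = w.detach().pow(2).mean().sqrt().item()
        # Groups with large magnitude (likely "slab") receive weak shrinkage;
        # groups with small magnitude (likely "spike") receive strong shrinkage.
        decays.append(decay_small if group_scale > tau else decay_large)
    return decays
```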
2 Optimization
2.1 Preliminaries on variational Bayesian neural networks
Given a dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$, a Bayesian neural network (BNN) is defined in terms of a prior $p(w)$ on the $p$-dimensional weights and the neural network likelihood $p(\mathcal{D} \mid w)$. Variational Bayesian methods approximate the true posterior $p(w \mid \mathcal{D})$ by minimizing the KL divergence between an approximate distribution $q_\theta(w)$ and the true posterior. This is equivalent to maximizing the evidence lower bound (ELBO):
$$\mathcal{L}[\theta] = \mathbb{E}_{q_\theta}\big[\log p(\mathcal{D} \mid w)\big] - D_{\mathrm{KL}}\big(q_\theta(w)\,\|\,p(w)\big) \tag{1}$$
where we consider a Bayesian neural network with Gaussian prior $p(w) \sim \mathcal{N}_p(0, \Sigma_0)$ and a Gaussian approximate posterior $q_\theta(w) \sim \mathcal{N}_p(\mu, \Sigma)$, with $\theta = (\mu, \Sigma)$. To scale to large models, we assume the weights are independent under both the prior and the approximate posterior, such that $p(w_j) \sim \mathcal{N}(0, \delta^{-1})$ and $q_{\theta_j}(w_j) \sim \mathcal{N}(\mu_j, \sigma_j^2)$. The gradients of the ELBO with respect to $\mu$ and $\Sigma$ are
$$
\begin{aligned}
\nabla_{\mu}\mathcal{L} &= \mathbb{E}_{\mathcal{N}_p(\mu,\Sigma)}\big[\nabla_{w}\log p(\mathcal{D}\mid w)\big] - \Sigma_0^{-1}\mu \approx -\frac{1}{S}\sum_{i=1}^{S} g_i - \delta\mu \\
\nabla_{\Sigma}\mathcal{L} &= \frac{1}{2}\mathbb{E}_{\mathcal{N}_p(\mu,\Sigma)}\big[\nabla_{w}^{2}\log p(\mathcal{D}\mid w)\big] + \frac{1}{2}\Sigma^{-1} - \frac{1}{2}\Sigma_0^{-1} \approx -\frac{1}{2S}\sum_{i=1}^{S} g_i^{2} + \frac{1}{2}\mathrm{diag}(\sigma_j^{-2}) - \frac{1}{2}\delta
\end{aligned}
\tag{2}
$$
where $g_i = -\nabla_w \log p(\mathcal{D} \mid w_i)$ and $w_i \sim \prod_{j=1}^{p} \mathcal{N}(\mu_j, \sigma_j^2)$ are Monte Carlo samples. In addition, we also have the second-order derivative with respect to $\mu$:
$$\nabla_{\mu}^{2}\mathcal{L} = -\mathbb{E}_{\mathcal{N}_p(\mu,\Sigma)}\big[\nabla_{w}^{2}\log p(\mathcal{D}\mid w)\big] + \Sigma_0^{-1} \approx \frac{1}{S}\sum_{i=1}^{S} g_i^{2} + \delta$$
Using Gaussian back-propagation and the reparameterization trick, an alternative Monte Carlo approximation (Khan et al., 2018) can be used, namely $-\mathbb{E}_{\mathcal{N}_p(\mu,\Sigma)}\big[\nabla_{w}^{2}\log p(\mathcal{D}\mid w)\big] \approx \frac{1}{S}\sum_{i=1}^{S} g_i \frac{\epsilon_i}{\sigma}$, where $\epsilon_i \sim \mathcal{N}(0, 1_p)$. In practice, to reduce the runtime cost, we often use $S = 1$ as long as the batch size in each iteration is not too small.
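As a concrete illustration, the following PyTorch sketch computes the $S = 1$ Monte Carlo estimates of the ELBO gradients in equation (2) for a mean-field Gaussian posterior, using the reparameterization $w = \mu + \sigma\epsilon$. The function name, the callable `negative_log_lik`, and the scalar prior precision `delta` are assumptions for illustration, not the implementation used in our experiments.

```python
import torch

def elbo_gradient_estimates(mu, log_sigma, delta, negative_log_lik):
    """Single-sample (S = 1) estimates of the ELBO gradients in equation (2).

    mu, log_sigma    : mean-field parameters (mu_j, log sigma_j), same shape
    delta            : prior precision, i.e. p(w_j) = N(0, 1/delta)
    negative_log_lik : callable mapping w to -log p(D | w) on one mini-batch
    """
    sigma = log_sigma.exp()
    eps = torch.randn_like(mu)                              # eps ~ N(0, 1_p)
    w = (mu + sigma * eps).detach().requires_grad_(True)    # w ~ N(mu, sigma^2)

    g = torch.autograd.grad(negative_log_lik(w), w)[0]      # g = -grad_w log p(D | w)

    grad_mu = -g - delta * mu                                        # first line of (2)
    grad_diag_sigma2 = -0.5 * g**2 + 0.5 / sigma**2 - 0.5 * delta    # second line (diagonal)
    return grad_mu, grad_diag_sigma2
```

With $S = 1$ these estimates are noisy but cheap; the second estimate enters the adaptive update below through the precision $\sigma^{-2}$, which preconditions the step on $\mu$.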
2.2 Bayesian versions of adaptive algorithms
In Gaussian mean-field variational Bayesian inference, natural-gradient descent (Khan et al., 2018; Zhang et al., 2018; Osawa et al., 2019) optimizes the ELBO by updating
$$
\begin{aligned}
\mu_{t+1} &= \operatorname*{argmin}_{\mu \in \mathbb{R}^p} \; \langle \nabla_{\mu}\mathcal{L}, \mu\rangle + \frac{1}{2 l_t}\,(\mu - \mu_t)^{T}\,\mathrm{diag}(\sigma^{-2})^{\alpha}\,(\mu - \mu_t) \\
&= \mu_t - l_t\,(\sigma_t^{2})^{\alpha}\, \nabla_{\mu_t}\mathcal{L}
\end{aligned}
$$
where $l_t$ is a learning rate, $\alpha \in \{\tfrac{1}{2}, 1\}$, and $\sigma_t^{-2}$ is updated with momentum,
$$\frac{1}{\sigma_t^{2}} = \frac{1-\lambda\gamma}{\sigma_{t-1}^{2}} + \gamma\Big(g_t^{2} + \frac{\delta}{N}\Big),$$
where $0 < \gamma < 1$ and $\lambda$ is another learning rate. When $\alpha = \tfrac{1}{2}$, this is similar to the Adam algorithm (Kingma and Ba, 2014), and when $\alpha = 1$ the algorithm is very close to second-order optimization, with the Hessian matrix given by $\Sigma^{-1}$ when $\nabla_{\Sigma}\mathcal{L} = 0$ in equation (2).
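For concreteness, here is a minimal sketch of one step of this update in plain PyTorch tensor arithmetic, written as a descent step on the negative ELBO. The hyperparameter names `lr`, `lam`, `gamma`, and `alpha` stand in for $l_t$, $\lambda$, $\gamma$, and $\alpha$ above; the code is an illustration under these assumptions rather than a reference implementation.

```python
import torch

def bayesian_adaptive_step(mu, inv_sigma2, g, delta, N,
                           lr=1e-3, lam=1e-3, gamma=0.1, alpha=0.5):
    """One illustrative step of the natural-gradient (Bayesian Adam-like) update.

    mu         : posterior mean mu_t
    inv_sigma2 : posterior precision 1 / sigma_t^2 (elementwise)
    g          : single-sample gradient g_t = -grad_w log p(D | w_t)
    delta, N   : prior precision and dataset size
    """
    # Momentum update of the precision:
    # 1/sigma_t^2 = (1 - lam*gamma) / sigma_{t-1}^2 + gamma * (g_t^2 + delta/N)
    inv_sigma2 = (1.0 - lam * gamma) * inv_sigma2 + gamma * (g**2 + delta / N)

    # Preconditioned step on the mean, mu_{t+1} = mu_t - l_t (sigma_t^2)^alpha * grad,
    # where grad = g_t + delta * mu_t is the single-sample negative-ELBO gradient.
    sigma2 = 1.0 / inv_sigma2
    mu = mu - lr * sigma2**alpha * (g + delta * mu)
    return mu, inv_sigma2
```

With `alpha=0.5` the step resembles Adam, with $\sigma_t^{2}$ playing the role of the inverse second-moment estimate; with `alpha=1.0` it resembles a diagonal second-order step.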
There are two concerns with this version of the Bayesian adaptive algorithm. First, similar to Adam, the variance of the adaptive learning rate is problematically large in the early stages of optimization (Liu et al., 2019), and a warm-up stage is recommended for traditional Adam. As an example, consider a ReLU neural network with a single hidden layer and binary cross-entropy loss: if the weights are initialized from a zero-mean normal distribution, the variance of the adaptive learning rate diverges at the beginning of training (see the detailed discussion in Appendix A). Second, when the number of Monte Carlo samples for the gradient is $S = 1$ and the learning rate is small, the injected Gaussian noise may dominate the gradient. We see that
$$
\begin{aligned}
w_t - w_{t-1} &= \mu_t - \mu_{t-1} + (\sigma_t \epsilon_t - \sigma_{t-1}\epsilon_{t-1}) \\
&= -l_t\,(\sigma_t^{2})^{\alpha}\, m_t + \sqrt{\sigma_t^{2} + \sigma_{t-1}^{2}}\;\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1_p),
\end{aligned}
$$
where $m_t$ is the momentum of the gradient. For $\alpha = \tfrac{1}{2}$, when the learning rate is small, the gradient may be erased by the injected noise. For $\alpha = 1$, this can also happen when $\sigma_t^{2} \ll 1$. Empirically, we observe that both $w \ll 1$ and $\sigma^{2} \ll 1$ in Bayesian deep learning, so a warm-up strategy that starts with a very small learning rate may not work for variational BNN. On the contrary, when the learning rate is