On the optimization and pruning for Bayesian deep learning
Xiongwen Ke and Yanan Fan
School of mathematics and statistics, UNSW
Abstract
The goal of Bayesian deep learning is to provide uncertainty quantification via the posterior distribution. However, exact inference over the weight space is computationally intractable due to the ultra-high dimensionality of the neural network. Variational inference (VI) is a promising approach, but naive application on the weight space does not scale well and often underperforms in predictive accuracy. In this paper, we propose a new adaptive variational Bayesian algorithm to train neural networks on the weight space that achieves high predictive accuracy. By showing that there is an equivalence to Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) with a preconditioning matrix, we then propose an MCMC within EM algorithm, which incorporates the spike-and-slab prior to capture the sparsity of the neural network. The EM-MCMC algorithm allows us to perform optimization and model pruning in one shot. We evaluate our methods on the CIFAR-10, CIFAR-100 and ImageNet datasets, and demonstrate that our dense model can reach state-of-the-art performance and our sparse model performs very well compared to previously proposed pruning schemes.
1 INTRODUCTION
Bayesian inference (Bishop and Nasrabadi, 2006) provides an elegant way to capture uncertainty via the posterior distribution over model parameters. Unfortunately, posterior inference is intractable in any reasonably sized neural network model.
Work focusing on scalable inference for Bayesian deep learning over the last decade can be separated into two streams. One is a deterministic approximation approach such as variational inference (Graves, 2011; Blundell et al.,
2015), dropout (Gal and Ghahramani, 2016), Laplace approximation (Ritter et al., 2018), or expectation propagation (Hernández-Lobato and Adams, 2015). The other stream involves sampling approaches such as MCMC based on stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011; Chen et al., 2014).
Prior to 2019, Bayesian neural networks (BNNs) generally struggled with predictive accuracy and computational efficiency. Recently, a lot of advances have been made in both directions of research. In the deterministic approach, several authors began to consider dimension reduction techniques such as subspace inference (Maddox et al., 2019), rank-1 parameterization (Dusenberry et al., 2020), subnetwork inference (Daxberger et al., 2021) and node-space inference (Trinh et al., 2022). In the sampling approach, Zhang et al. (2019) propose to use cycles of learning rates with a high-to-low step size schedule. A large step size in the early stage of the cycle results in aggressive exploration of the parameter space; as the step size decreases, the algorithm begins to collect samples around the local mode.
Apart from the progress within these two streams, Wilson and Izmailov (2020) show that deep ensembles (Lakshminarayanan et al., 2017) can be interpreted as an approximate approach to the posterior predictive distribution. They combine multiple independently trained SWAG (Gaussian stochastic weight averaging) approximations (Maddox et al., 2019; Izmailov et al., 2018) to create a mixture-of-Gaussians approximation to the posterior. However, performing variational inference directly on the weight space still produces poor predictive accuracy and struggles with computational efficiency. Even simple mean-field variational inference involves double the number of parameters of the neural network (i.e., a mean and a variance for each weight), which incurs an extra GPU memory requirement and 2-5 times the runtime of the baseline neural network (Osawa et al., 2019).
In this paper, we develop an adaptive optimization algorithm for Gaussian mean-field variational Bayesian inference that can achieve state-of-the-art predictive accuracy. We further show that when the learning rate is small and the update of the posterior variance is frozen, the algorithm is equivalent to Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) with a preconditioning matrix. Therefore, if we exploit the closed-form expression of the gradient for the posterior variances and only keep track
of the weights generated by the algorithm (there is no need to register the mean and variance parameters in the code), we can achieve large savings in GPU memory and runtime costs.
Based on the connection to SGHMC, we extend the EM algorithm for Bayesian variable selection (Ročková and George, 2014; Wang et al., 2016; Ročková, 2018) from linear models to neural networks by replacing the Gaussian prior in the BNN with a spike-and-slab group Gaussian prior (Xu and Ghosh, 2015). Our method is an MCMC within EM algorithm, which switches the weight decay factor between small and large values based on the magnitude of each group during training (see the sketch below). Since by construction there are no exact zeros, we further introduce a simple pruning criterion to remove weights permanently during training. A sparse model is thus trained in one shot without additional retraining. Our approach is more computationally efficient than dynamic pruning strategies that allow regrowth (Zhu and Gupta, 2017; Dettmers and Zettlemoyer, 2019; Lin et al., 2020), and we will show that this aggressive approach incurs no performance loss. Our code is available at https://github.com/z5041294/optimization-and-pruning-for-BNN
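To make the switching behaviour concrete, a schematic Python sketch is given below. This is an illustration only: the fixed threshold `tau` and the two decay values are hypothetical placeholders, and the switch derived from the spike-and-slab prior in later sections need not reduce to a simple cutoff.

```python
import torch

def groupwise_weight_decay(groups, tau=1e-2, decay_spike=1e-1, decay_slab=1e-4):
    """Assign each parameter group a weight-decay factor based on its magnitude.

    groups      : list of tensors, one per group (e.g. all weights of one neuron/filter)
    tau         : magnitude threshold (hypothetical placeholder)
    decay_spike : strong shrinkage applied to small-magnitude groups (the "spike" regime)
    decay_slab  : weak shrinkage applied to large-magnitude groups (the "slab" regime)
    """
    decays = []
    for w in groups:
        rms = w.norm() / w.numel() ** 0.5   # root-mean-square magnitude of the group
        decays.append(decay_spike if rms < tau else decay_slab)
    return decays
```

Groups whose magnitude stays in the spike regime are shrunk aggressively and eventually become candidates for permanent removal by the pruning criterion.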
2 OPTIMIZATION
2.1 Preliminaries on variational Bayesian neural networks
Given a dataset $\mathcal{D}=\{x_i,y_i\}_{i=1}^N$, a Bayesian neural network (BNN) is defined in terms of a prior $p(w)$ on the $p$-dimensional weights, as well as the neural network likelihood $p(\mathcal{D}\mid w)$. Variational Bayesian methods approximate the true posterior $p(w\mid\mathcal{D})$ by minimising the KL divergence between the approximate distribution $q_\theta(w)$ and the true posterior. This is shown to be equivalent to maximizing the evidence lower bound (ELBO):
$$\mathcal{L}[\theta] = \mathbb{E}_{q_\theta}\left[\log p(\mathcal{D}\mid w)\right] - D_{KL}\!\left(q_\theta(w)\,\|\,p(w)\right) \tag{1}$$
where we consider a Bayesian neural net with Gaussian prior $p(w)\sim N_p(0,\Sigma_0)$ and a Gaussian approximate posterior $q_\theta(w)\sim N_p(\mu,\Sigma)$, where $\theta=(\mu,\Sigma)$. To make this scale to large models, we assume independence across weights for both the prior and the approximate posterior, such that $p(w_j)\sim N(0,\delta^{-1})$ and $q_{\theta_j}(w_j)\sim N(\mu_j,\sigma_j^2)$.
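Under this independent Gaussian specification, the KL term in (1) has a standard closed form (stated here for completeness, since the $\delta\mu$ and $\delta$ terms in the gradients below come directly from it):
$$D_{KL}\!\left(q_\theta(w)\,\|\,p(w)\right) = \sum_{j=1}^{p}\frac{1}{2}\left(\delta\sigma_j^2 + \delta\mu_j^2 - 1 - \log\!\left(\delta\sigma_j^2\right)\right).$$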
The gradient of the ELBO with respect to $\mu$ and $\Sigma$ is
$$\begin{aligned}
\nabla_{\mu}\mathcal{L} &= \mathbb{E}_{N_p(\mu,\Sigma)}\!\left[\nabla_{w}\log p(\mathcal{D}\mid w)\right] - \Sigma_0^{-1}\mu \;\approx\; -\frac{1}{S}\sum_{i=1}^{S} g_i - \delta\mu \\
\nabla_{\Sigma}\mathcal{L} &= \frac{1}{2}\mathbb{E}_{N_p(\mu,\Sigma)}\!\left[\nabla^2_{w}\log p(\mathcal{D}\mid w)\right] + \frac{1}{2}\Sigma^{-1} - \frac{1}{2}\Sigma_0^{-1} \\
&\approx\; -\frac{1}{2S}\sum_{i=1}^{S} g_i^2 + \frac{1}{2}\mathrm{diag}(\sigma_j^{-2}) - \frac{1}{2}\delta
\end{aligned} \tag{2}$$
where $g_i = -\nabla_{w}\log p(\mathcal{D}\mid w_i)$ and $w_i \sim \prod_{j=1}^{p} N(\mu_j,\sigma_j^2)$ are Monte Carlo samples, with $g_i^2$ serving as a cheap diagonal (squared-gradient) approximation of the negative Hessian. In addition, we also have the second-order derivative with respect to $\mu$:
$$-\nabla^2_{\mu}\mathcal{L} = -\mathbb{E}_{N_p(\mu,\Sigma)}\!\left[\nabla^2_{w}\log p(\mathcal{D}\mid w)\right] + \Sigma_0^{-1} \;\approx\; \frac{1}{S}\sum_{i=1}^{S} g_i^2 + \delta
Using Gaussian back-propagation and the reparameterization trick, an alternative MC approximation (Khan et al., 2018) can be used, such that
$$\mathbb{E}_{N_p(\mu,\Sigma)}\!\left[\nabla^2_{w}\log p(\mathcal{D}\mid w)\right] \approx -\frac{1}{S}\sum_{i=1}^{S} g_i \odot \frac{\epsilon_i}{\sigma},$$
where $\epsilon_i \sim N(0, 1_p)$. In practice, to reduce the runtime complexity, we often use $S=1$, as long as the batch size in one iteration is not too small.
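For concreteness, the following is a minimal PyTorch sketch of the $S=1$ estimators above on a toy Gaussian-likelihood regression problem. The data, shapes and hyperparameter values are illustrative assumptions only, and the sign conventions follow equation (2) as reconstructed here; this is not the paper's implementation.

```python
import torch

torch.manual_seed(0)

# Toy regression problem so that log p(D | w) is concrete (Gaussian likelihood, unit noise).
N, p = 256, 10
X = torch.randn(N, p)
y = X @ torch.randn(p) + 0.1 * torch.randn(N)

# Mean-field variational parameters theta = (mu, sigma) and prior p(w_j) = N(0, 1/delta).
mu = 0.01 * torch.randn(p)
sigma = 0.05 * torch.ones(p)
delta = 1.0

# One-sample (S = 1) reparameterization: w = mu + sigma * eps, eps ~ N(0, I_p).
eps = torch.randn(p)
w = (mu + sigma * eps).requires_grad_(True)

# g = -grad_w log p(D | w): autograd on the negative log-likelihood at the sampled weights.
nll = 0.5 * ((y - X @ w) ** 2).sum()
g, = torch.autograd.grad(nll, w)

# Monte Carlo estimates of the ELBO gradients in (2):
grad_mu = -g - delta * mu                                   # estimate of grad_mu L
grad_sigma2 = -0.5 * g**2 + 0.5 / sigma**2 - 0.5 * delta    # diagonal of grad_Sigma L

# Alternative reparameterization-trick estimate of E[diag Hessian of log p(D|w)]:
hess_diag = -g * eps / sigma
```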
2.2 Bayesian versions of adaptive algorithms
In Gaussian mean-field variational Bayesian inference, natural-gradient descent (Khan et al., 2018; Zhang et al., 2018; Osawa et al., 2019) optimizes the ELBO by updating
$$\begin{aligned}
\mu_{t+1} &= \operatorname*{argmin}_{\mu\in\mathbb{R}^p}\;\langle \nabla_{\mu}\mathcal{L},\mu\rangle + \frac{1}{2l_t}(\mu-\mu_t)^T \mathrm{diag}(\sigma_t^{-2})^{\alpha}(\mu-\mu_t) \\
&= \mu_t - l_t\,(\sigma_t^{2})^{\alpha}\odot\nabla_{\mu_t}\mathcal{L}
\end{aligned}$$
where $l_t$ is a learning rate, $\alpha\in\{\tfrac{1}{2},1\}$, and $\frac{1}{\sigma_t^2}$ is updated with momentum
$$\frac{1}{\sigma_t^2} = \frac{1-\lambda\gamma}{\sigma_{t-1}^2} + \gamma\left(\left[g_t\odot g_t\right] + \frac{\delta}{N}\right)$$
($0<\gamma<1$ and $\lambda$ is another learning rate). When $\alpha=\tfrac{1}{2}$, this is similar to the Adam (Kingma and Ba, 2014) algorithm, and when $\alpha=1$, the algorithm is very close to second-order optimization with the Hessian matrix given by $\Sigma^{-1}$ when $\nabla_{\Sigma}\mathcal{L}=0$ in equation (2).
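A minimal sketch of one such update step is given below, assuming PyTorch tensors. The hyperparameter values and the momentum coefficient `beta` are hypothetical placeholders ($m_t$, the gradient momentum, is the quantity referenced later in this section); this is an illustration of the update rule above rather than the authors' exact implementation.

```python
import torch

def bayes_adaptive_step(mu, sigma2, m, g, lr=1e-3, alpha=0.5,
                        gamma=0.1, lam=1e-3, beta=0.9, delta=1.0, N=50_000):
    """One update of the preconditioned (natural-gradient style) scheme above.

    mu, sigma2 : current posterior mean and variance, one entry per weight
    m          : running momentum of the gradient (the m_t used later in the text)
    g          : current stochastic gradient g_t of the negative ELBO w.r.t. mu
    """
    # Momentum on the gradient (beta is a hypothetical momentum coefficient).
    m = beta * m + (1 - beta) * g

    # Precision update with momentum:
    #   1/sigma_t^2 = (1 - lam*gamma)/sigma_{t-1}^2 + gamma * (g_t*g_t + delta/N)
    precision = (1 - lam * gamma) / sigma2 + gamma * (g * g + delta / N)
    sigma2 = 1.0 / precision

    # Preconditioned mean update: mu_{t+1} = mu_t - l_t * (sigma_t^2)^alpha * gradient
    mu = mu - lr * sigma2.pow(alpha) * m
    return mu, sigma2, m
```

With $\alpha=1$ the preconditioner $\sigma_t^2$ acts as an approximate inverse Hessian, consistent with the second-order reading above; with $\alpha=\tfrac{1}{2}$ the step resembles Adam's.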
There are two concerns with this version of the Bayesian adaptive algorithm. First, similar to Adam, the variance of the adaptive learning rate is problematically large in the early stages of optimization (Liu et al., 2019), and a warm-up stage is recommended for traditional Adam. As an example, consider a ReLU neural network with a single hidden layer and binary cross-entropy loss: if a normal initialization of the weights with mean zero is used, then the variance of the adaptive learning rate will diverge at the beginning of the training period (see the detailed discussion in Appendix A). Second, when the number of Monte Carlo samples for the gradient is $S=1$ and the learning rate is small, the injected Gaussian noise may dominate the gradient. We see that
$$\begin{aligned}
w_t - w_{t-1} &= \mu_t - \mu_{t-1} + \left(\sigma_t\odot\epsilon_t - \sigma_{t-1}\odot\epsilon_{t-1}\right) \\
&= -l_t\,(\sigma_t^{2})^{\alpha}\odot m_t + \sqrt{\sigma_t^{2}+\sigma_{t-1}^{2}}\odot\epsilon
\end{aligned}$$
where $m_t$ is the momentum of the gradient and $\epsilon,\epsilon_t\sim N(0,1_p)$. For $\alpha=\tfrac{1}{2}$, when the learning rate is small, the gradient may be erased by the injected noise. For $\alpha=1$, this can also happen when $\sigma_t^2\ll 1$. Empirically, we observe that both $\|w\|_\infty\ll 1$ and $\|\sigma^2\|_\infty\ll 1$ in Bayesian deep learning. So a warm-up strategy that starts with a very small learning rate may not work for variational BNNs. On the contrary, when the learning rate is