
of the weight generated by the algorithm (there is no need to register the mean and variance parameters in the code), we can achieve substantial savings in GPU memory and runtime cost.
Based on the connection to SGHMC, we extend the EM algorithm for Bayesian variable selection (Ročková and George, 2014; Wang et al., 2016; Ročková, 2018) from linear models to neural networks by replacing the Gaussian prior in the BNN with a spike-and-slab group Gaussian prior (Xu and Ghosh, 2015). Our method is an MCMC-within-EM algorithm that switches the weight decay factor between a small and a large value based on the magnitude of each group during training, as sketched below. Since, by construction, there are no exact zeros, we further introduce a simple pruning criterion to remove weights permanently during training. A sparse model is thus trained in one shot, without additional retraining. Our approach is more computationally efficient than dynamic pruning strategies that allow regrowth (Zhu and Gupta, 2017; Dettmers and Zettlemoyer, 2019; Lin et al., 2020), and we will show that this aggressive approach incurs no performance loss. Our code is available at https://github.com/z5041294/optimization-and-pruning-for-BNN.
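To illustrate the switching behaviour, the following is a minimal PyTorch-style sketch of choosing a per-group weight-decay factor from the group magnitude. The threshold `tau` and the two decay values are hypothetical placeholders chosen for illustration only; in our algorithm the switch is driven by the EM update of the spike-and-slab prior rather than a fixed threshold.

```python
import torch

def assign_group_weight_decay(groups, tau=1e-2, decay_small=1e-4, decay_large=1e-1):
    """Illustrative only: pick a per-group weight-decay factor from the group magnitude.

    `groups` is a list of weight tensors (one tensor per group); `tau`,
    `decay_small`, and `decay_large` are hypothetical values, not the ones
    used in the paper.
    """
    decays = []
    for w in groups:
        # Root-mean-square magnitude of the group acts as the switching statistic.
        group_scale = w.detach().pow(2).mean().sqrt().item()
        # Groups with large magnitude (likely "slab") receive weak shrinkage;
        # groups with small magnitude (likely "spike") receive strong shrinkage.
        decays.append(decay_small if group_scale > tau else decay_large)
    return decays
```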
2 Optimization
2.1 Preliminaries on variational Bayesian neural networks
Given a dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$, a Bayesian neural network (BNN) is defined in terms of a prior $p(w)$ on the $p$-dimensional weights and the neural network likelihood $p(\mathcal{D} \mid w)$. Variational Bayesian methods approximate the true posterior $p(w \mid \mathcal{D})$ by minimizing the KL divergence between an approximate distribution $q_\theta(w)$ and the true posterior. This is equivalent to maximizing the evidence lower bound (ELBO):
$$\mathcal{L}[\theta] = \mathbb{E}_{q_\theta}\big[\log p(\mathcal{D} \mid w)\big] - D_{\mathrm{KL}}\big(q_\theta(w)\,\|\,p(w)\big) \tag{1}$$
where we consider a Bayesian neural network with Gaussian prior $p(w) \sim \mathcal{N}_p(0, \Sigma_0)$ and a Gaussian approximate posterior $q_\theta(w) \sim \mathcal{N}_p(\mu, \Sigma)$, with $\theta = (\mu, \Sigma)$. To scale to large models, we assume the weights are independent under both the prior and the approximate posterior, such that $p(w_j) \sim \mathcal{N}(0, \delta^{-1})$ and $q_{\theta_j}(w_j) \sim \mathcal{N}(\mu_j, \sigma_j^2)$. The gradients of the ELBO with respect to $\mu$ and $\Sigma$ are
$$
\begin{aligned}
\nabla_{\mu}\mathcal{L} &= \mathbb{E}_{\mathcal{N}_p(\mu,\Sigma)}\big[\nabla_{w}\log p(\mathcal{D}\mid w)\big] - \Sigma_0^{-1}\mu \approx -\frac{1}{S}\sum_{i=1}^{S} g_i - \delta\mu \\
\nabla_{\Sigma}\mathcal{L} &= \frac{1}{2}\mathbb{E}_{\mathcal{N}_p(\mu,\Sigma)}\big[\nabla_{w}^{2}\log p(\mathcal{D}\mid w)\big] + \frac{1}{2}\Sigma^{-1} - \frac{1}{2}\Sigma_0^{-1} \approx -\frac{1}{2S}\sum_{i=1}^{S} g_i^{2} + \frac{1}{2}\mathrm{diag}(\sigma_j^{-2}) - \frac{1}{2}\delta
\end{aligned}
\tag{2}
$$
where $g_i = -\nabla_w \log p(\mathcal{D} \mid w_i)$ and $w_i \sim \prod_{j=1}^{p} \mathcal{N}(\mu_j, \sigma_j^2)$ are Monte Carlo samples. In addition, we also have the second-order derivative with respect to $\mu$:
$$\nabla_{\mu}^{2}\mathcal{L} = -\mathbb{E}_{\mathcal{N}_p(\mu,\Sigma)}\big[\nabla_{w}^{2}\log p(\mathcal{D}\mid w)\big] + \Sigma_0^{-1} \approx \frac{1}{S}\sum_{i=1}^{S} g_i^{2} + \delta$$
Using Gaussian back-propagation and the reparameterization trick, an alternative Monte Carlo approximation (Khan et al., 2018) can be used, namely $-\mathbb{E}_{\mathcal{N}_p(\mu,\Sigma)}\big[\nabla_{w}^{2}\log p(\mathcal{D}\mid w)\big] \approx \frac{1}{S}\sum_{i=1}^{S} g_i \frac{\epsilon_i}{\sigma}$, where $\epsilon_i \sim \mathcal{N}(0, 1_p)$. In practice, to reduce the runtime cost, we often use $S = 1$ as long as the batch size in each iteration is not too small.
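As a concrete illustration, the following PyTorch sketch computes the $S = 1$ Monte Carlo estimates of the ELBO gradients in equation (2) for a mean-field Gaussian posterior, using the reparameterization $w = \mu + \sigma\epsilon$. The function name, the callable `negative_log_lik`, and the scalar prior precision `delta` are assumptions for illustration, not the implementation used in our experiments.

```python
import torch

def elbo_gradient_estimates(mu, log_sigma, delta, negative_log_lik):
    """Single-sample (S = 1) estimates of the ELBO gradients in equation (2).

    mu, log_sigma    : mean-field parameters (mu_j, log sigma_j), same shape
    delta            : prior precision, i.e. p(w_j) = N(0, 1/delta)
    negative_log_lik : callable mapping w to -log p(D | w) on one mini-batch
    """
    sigma = log_sigma.exp()
    eps = torch.randn_like(mu)                              # eps ~ N(0, 1_p)
    w = (mu + sigma * eps).detach().requires_grad_(True)    # w ~ N(mu, sigma^2)

    g = torch.autograd.grad(negative_log_lik(w), w)[0]      # g = -grad_w log p(D | w)

    grad_mu = -g - delta * mu                                        # first line of (2)
    grad_diag_sigma2 = -0.5 * g**2 + 0.5 / sigma**2 - 0.5 * delta    # second line (diagonal)
    return grad_mu, grad_diag_sigma2
```

With $S = 1$ these estimates are noisy but cheap; the second estimate enters the adaptive update below through the precision $\sigma^{-2}$, which preconditions the step on $\mu$.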
2.2 Bayesian versions of adaptive algorithms
In Gaussian mean-field variational Bayesian inference, natural-gradient descent (Khan et al., 2018; Zhang et al., 2018; Osawa et al., 2019) optimizes the ELBO by updating
$$
\begin{aligned}
\mu_{t+1} &= \operatorname*{argmin}_{\mu \in \mathbb{R}^p} \; \langle \nabla_{\mu}\mathcal{L}, \mu\rangle + \frac{1}{2 l_t}\,(\mu - \mu_t)^{T}\,\mathrm{diag}(\sigma^{-2})^{\alpha}\,(\mu - \mu_t) \\
&= \mu_t - l_t\,(\sigma_t^{2})^{\alpha}\, \nabla_{\mu_t}\mathcal{L}
\end{aligned}
$$
where $l_t$ is a learning rate, $\alpha \in \{\tfrac{1}{2}, 1\}$, and $\sigma_t^{-2}$ is updated with momentum,
$$\frac{1}{\sigma_t^{2}} = \frac{1-\lambda\gamma}{\sigma_{t-1}^{2}} + \gamma\Big(g_t^{2} + \frac{\delta}{N}\Big),$$
where $0 < \gamma < 1$ and $\lambda$ is another learning rate. When $\alpha = \tfrac{1}{2}$, this is similar to the Adam algorithm (Kingma and Ba, 2014), and when $\alpha = 1$ the algorithm is very close to second-order optimization, with the Hessian matrix given by $\Sigma^{-1}$ when $\nabla_{\Sigma}\mathcal{L} = 0$ in equation (2).
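For concreteness, here is a minimal sketch of one step of this update in plain PyTorch tensor arithmetic, written as a descent step on the negative ELBO. The hyperparameter names `lr`, `lam`, `gamma`, and `alpha` stand in for $l_t$, $\lambda$, $\gamma$, and $\alpha$ above; the code is an illustration under these assumptions rather than a reference implementation.

```python
import torch

def bayesian_adaptive_step(mu, inv_sigma2, g, delta, N,
                           lr=1e-3, lam=1e-3, gamma=0.1, alpha=0.5):
    """One illustrative step of the natural-gradient (Bayesian Adam-like) update.

    mu         : posterior mean mu_t
    inv_sigma2 : posterior precision 1 / sigma_t^2 (elementwise)
    g          : single-sample gradient g_t = -grad_w log p(D | w_t)
    delta, N   : prior precision and dataset size
    """
    # Momentum update of the precision:
    # 1/sigma_t^2 = (1 - lam*gamma) / sigma_{t-1}^2 + gamma * (g_t^2 + delta/N)
    inv_sigma2 = (1.0 - lam * gamma) * inv_sigma2 + gamma * (g**2 + delta / N)

    # Preconditioned step on the mean, mu_{t+1} = mu_t - l_t (sigma_t^2)^alpha * grad,
    # where grad = g_t + delta * mu_t is the single-sample negative-ELBO gradient.
    sigma2 = 1.0 / inv_sigma2
    mu = mu - lr * sigma2**alpha * (g + delta * mu)
    return mu, inv_sigma2
```

With `alpha=0.5` the step resembles Adam, with $\sigma_t^{2}$ playing the role of the inverse second-moment estimate; with `alpha=1.0` it resembles a diagonal second-order step.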
There are two concerns with this version of the Bayesian adaptive algorithm. First, similar to Adam, the variance of the adaptive learning rate is problematically large in the early stages of optimization (Liu et al., 2019), and a warm-up stage is recommended for traditional Adam. As an example, consider a ReLU neural network with a single hidden layer and binary cross-entropy loss: if the weights are initialized from a zero-mean normal distribution, the variance of the adaptive learning rate diverges at the beginning of training (see the detailed discussion in Appendix A). Second, when the number of Monte Carlo samples for the gradient is $S = 1$ and the learning rate is small, the injected Gaussian noise may dominate the gradient. We see that
$$
\begin{aligned}
w_t - w_{t-1} &= \mu_t - \mu_{t-1} + (\sigma_t \epsilon_t - \sigma_{t-1}\epsilon_{t-1}) \\
&= -l_t\,(\sigma_t^{2})^{\alpha}\, m_t + \sqrt{\sigma_t^{2} + \sigma_{t-1}^{2}}\;\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1_p),
\end{aligned}
$$
where $m_t$ is the momentum of the gradient. For $\alpha = \tfrac{1}{2}$, when the learning rate is small, the gradient may be erased by the injected noise. For $\alpha = 1$, this can also happen when $\sigma_t^{2} \ll 1$. Empirically, we observe that both $w \ll 1$ and $\sigma^{2} \ll 1$ in Bayesian deep learning, so a warm-up strategy that starts with a very small learning rate may not work for variational BNN. On the contrary, when the learning rate is