MANIFOLD GAUSSIAN VARIATIONAL BAYES ON THE
PRECISION MATRIX
Martin Magris, Mostafa Shabani & Alexandros Iosifidis
Department of Electrical and Computer Engineering
Aarhus University
Finlandsgade 22, 8200 Aarhus N, Denmark
{magris,mshabani,ai}@ece.au.dk
Keywords: Variational inference, Manifold optimization, Bayesian learning, Black-box optimization
Abstract
We propose an optimization algorithm for Variational Inference (VI) in complex models. Our ap-
proach relies on natural gradient updates where the variational space is a Riemann manifold. We
develop an efficient algorithm for Gaussian Variational Inference whose updates satisfy the pos-
itive definite constraint on the variational covariance matrix. Our Manifold Gaussian Variational
Bayes on the Precision matrix (MGVBP) solution provides simple update rules, is straightforward
to implement, and its precision-matrix parametrization brings a significant computational
advantage. Due to its black-box nature, MGVBP stands as a ready-to-use solution for VI in complex
models. Over five datasets, we empirically validate our approach on different statistical and
econometric models, discussing its performance with respect to baseline methods.
1 Introduction
Although Bayesian principles are not new to Machine Learning (ML) (Mackay, 1992; 1995;
Lampinen and Vehtari, 2001), it is only with the recent methodological developments that we are
witnessing a growing use of Bayesian techniques in the field (Zhang et al., 2018; Trusheim et al.,
2018; Osawa et al., 2019; Khan et al., 2018; Khan and Nielsen, 2018). In typical ML settings, the
applicability of sampling methods for the challenging computation of the posterior is prohibitive;
however, approximate methods such as Variational Inference (VI) have been proved suitable and
successful (Saul et al., 1996; Wainwright and Jordan, 2008; Hoffman et al., 2013; Blei et al., 2017).
VI is generally performed with Stochastic Gradient Descent (SGD) methods (Robbins and Monro,
1951; Hoffman et al., 2013; Salimans and Knowles, 2014), boosted by the use of natural gradients
(Hoffman et al., 2013; Wierstra et al., 2014; Khan et al., 2018), and the updates often take a simple
form (Khan and Nielsen, 2018; Osawa et al., 2019; Magris et al., 2022).
Most VI algorithms rely on the extensive use of models’ gradients, and the form of the variational
posterior implies additional model-specific derivations that are not easy to adapt to a general, plug-
and-play optimizer. Black box methods (Ranganath et al., 2014) are straightforward to implement
and versatile as they avoid model-specific derivations by relying on stochastic sampling (Salimans
and Knowles, 2014; Paisley et al., 2012; Kingma and Welling, 2013). The increased variance of
the gradient estimates, compared with, e.g., methods relying on the reparametrization trick (Blundell
et al., 2015; Xu et al., 2019), can be alleviated with variance-reduction techniques.
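As an illustration only (not part of MGVBP), the following minimal Python sketch shows a score-function (black-box) gradient estimator with a simple mean baseline as a control variate; the handles `log_h`, `sample_q`, and `grad_log_q` are hypothetical placeholders for the model's log joint term, the variational sampler, and the score of the variational density.

```python
import numpy as np

def score_function_gradient(log_h, sample_q, grad_log_q, zeta, n_samples=200, rng=None):
    """Black-box (score-function) estimate of the gradient of E_{q_zeta}[h(theta)]
    with a simple mean baseline as a variance-reduction device."""
    rng = np.random.default_rng(rng)
    thetas = [sample_q(zeta, rng) for _ in range(n_samples)]
    h_vals = np.array([log_h(theta) for theta in thetas])
    # Rows hold grad_zeta log q_zeta(theta) evaluated at each draw.
    scores = np.stack([grad_log_q(zeta, theta) for theta in thetas])
    # A constant baseline leaves the estimator unbiased in expectation
    # (the sample mean used here is a common, slightly biased, approximation).
    baseline = h_vals.mean()
    return (scores * (h_vals - baseline)[:, None]).mean(axis=0)
```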
Furthermore, most existing algorithms do not directly address parameter constraints. Under the
typical Gaussian variational assumption, granting positive-definiteness of the covariance matrix is
an acknowledged problem (Tran et al., 2021a; Khan et al., 2018; Lin et al., 2020). Only a few
algorithms directly tackle the problem (Osawa et al., 2019; Lin et al., 2020); see Section 4. A recent
approximate approach based on manifold optimization is found in (Tran et al., 2021a). For a review
of the various algorithms for performing VI, see (Magris and Iosifidis, 2023a).
Building on the results of Tran et al. (2021a) and on their Manifold Gaussian Variational Bayes (MGVB)
method, we develop a variational inference algorithm that explicitly tackles the positive-definiteness
constraint on the variational covariance matrix, resembles the readily-applicable natural-gradient
black-box framework of Magris et al. (2022), and has computational advantages. We address
a theoretical issue concerning the use of retraction and parallel transport on the manifold of
symmetric and positive-definite matrices for Gaussian VI, leading to our Manifold Gaussian Variational
Bayes on the Precision matrix (MGVBP) algorithm. Our solution, based on the precision-matrix
parametrization of the variational Gaussian distribution, furthermore has a computational advantage
over the usual canonical parameterization on the covariance matrix, as the form of the relevant
gradients in our update rule is greatly simplified. We distinguish and apply two forms of the
stochastic gradient estimator that are applicable in a wider context, and show how to exploit certain
forms of the prior/posterior to further reduce the variance of the stochastic gradient estimators. We
show that MGVBP is straightforward to implement, discuss recommendations and practicalities in this
regard, and demonstrate its feasibility in extensive experiments over five datasets, 14 models, three
competing VI optimizers, and a Markov Chain Monte Carlo baseline.
In Section 2, we review the basics of VI; Section 3 introduces elements of manifold optimization;
in Section 4, we review the Manifold Gaussian Variational Bayes approach and other related work;
Section 5 describes the proposed approach. Section 6 discusses implementation aspects, results are
reported in Section 7, and Section 8 concludes the paper. Appendices expand on the experiments and
provide proofs.
2 Variational inference
Variational Inference (VI) is a convenient and feasible approximate method for Bayesian inference.
Let y denote the data, p(y|θ) the likelihood of the data based on some model whose d-dimensional
parameter is θ. Let p(θ) be the prior distribution on θ. In standard Bayesian inference, the posterior
is retrieved via Bayes' theorem as p(θ|y) = p(θ)p(y|θ)/p(y). As the marginal likelihood p(y)
is generally intractable, Bayesian inference is often difficult for complex models. Though sampling
techniques can tackle the problem, non-parametric and asymptotically exact Monte Carlo methods
may be slow, especially in high-dimensional applications (Salimans et al., 2015).
Fixed-form VI approximates the true unknown posterior with a probability density q chosen within
a tractable class of distributions Q, such as the exponential family. VI turns the Bayesian inference
problem into that of finding the best variational distribution q* ∈ Q minimizing the Kullback-Leibler
(KL) divergence from q to p(θ|y): q* = arg min_{q∈Q} D_KL(q || p(θ|y)). It can be shown that
the KL minimization problem is equivalent to the maximization of the so-called Lower Bound (LB)
on log p(y), e.g., (Tran et al., 2021b). The optimization problem accounts for finding the optimal
variational parameter ζ* parametrizing q ≡ q_ζ that maximizes the Lower Bound (LB) L, that is,
ζ* = arg max_{ζ∈Z} L(ζ), with

$$\mathcal{L}(\zeta) := \int q_\zeta(\theta)\,\log\frac{p(\theta)\,p(y|\theta)}{q_\zeta(\theta)}\,\mathrm{d}\theta = \mathbb{E}_{q_\zeta}\!\left[\log\frac{p(\theta)\,p(y|\theta)}{q_\zeta(\theta)}\right] = \mathbb{E}_{q_\zeta}\big[h_\zeta(\theta)\big],$$

where h_ζ(θ) := log p(θ) + log p(y|θ) − log q_ζ(θ), E_{q_ζ} means that the expectation is taken with
respect to the distribution q_ζ, and Z is the parameter space for ζ.
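For concreteness, the lower bound can be estimated by Monte Carlo by sampling from q_ζ. The sketch below assumes a Gaussian q_ζ = N(µ, Σ) and a user-supplied `log_joint` returning log p(θ) + log p(y|θ); it is an illustrative implementation, not the estimator used by MGVBP.

```python
import numpy as np

def mc_lower_bound(log_joint, mu, Sigma, n_samples=100, rng=None):
    """Monte Carlo estimate of L(zeta) = E_q[ log p(theta) + log p(y|theta) - log q(theta) ]
    for a Gaussian variational posterior q = N(mu, Sigma)."""
    rng = np.random.default_rng(rng)
    d = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    # Draw theta ~ N(mu, Sigma) via the Cholesky factor: theta = mu + L @ eps
    eps = rng.standard_normal((n_samples, d))
    thetas = mu + eps @ L.T
    # log q(theta): the quadratic form reduces to ||eps||^2 under this parameterization
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.sum(eps**2, axis=1)
    log_q = -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    # h_zeta(theta) = log p(theta) + log p(y|theta) - log q_zeta(theta)
    h = np.array([log_joint(theta) for theta in thetas]) - log_q
    return h.mean()
```

Sampling through the Cholesky factor keeps the evaluation of log q_ζ cheap, since the quadratic form reduces to the squared norm of the standard normal draws.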
The maximization of the LB is generally addressed with a gradient-descent method such as SGD
(Robbins and Monro, 1951) or ADAM (Kingma and Ba, 2014). Learning the parameter ζ by
standard gradient descent is, however, problematic, as it ignores the information geometry of the
distribution q_ζ, is not scale invariant, is unstable, and is very susceptible to the initial values (Wierstra
et al., 2014). SGD implicitly relies on the Euclidean norm for capturing the dissimilarity between
two distributions, which can be a poor and misleading measure of discrepancy (Khan and Nielsen,
2018). By using the KL divergence in place of the Euclidean norm, the SGD update results in the
following natural gradient update:

$$\zeta_{t+1} = \zeta_t + \beta_t \left[\tilde{\nabla}_\zeta \mathcal{L}(\zeta)\right]_{\zeta=\zeta_t}, \qquad (1)$$

where β_t is a possibly adaptive learning rate and t denotes the iteration. The above update results in
improved steps towards the maximum of the LB when optimizing it for the variational parameter ζ.
The natural gradient $\tilde{\nabla}_\zeta \mathcal{L}(\zeta)$ is obtained by rescaling the Euclidean gradient $\nabla_\zeta \mathcal{L}(\zeta)$ by the inverse
of the Fisher Information Matrix (FIM), i.e.,

$$\tilde{\nabla}_\zeta \mathcal{L}(\zeta) = \mathcal{I}_\zeta^{-1}\,\nabla_\zeta \mathcal{L}(\zeta),$$

where I_ζ denotes the FIM. A significant issue in following this approach is that ζ is assumed to be
unconstrained. Think of a Gaussian variational posterior: in the above setting, there is no guarantee
that the covariance matrix updates onto a symmetric and positive definite matrix. As discussed in
the introduction, manifold optimization is an attractive possibility.
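As a plain illustration of update (1), the sketch below applies one natural-gradient step given callables for the Euclidean gradient and the FIM. In practice both quantities are estimated stochastically and the FIM is rarely formed or inverted explicitly, so this is a conceptual sketch rather than the MGVBP update.

```python
import numpy as np

def natural_gradient_step(zeta, euclid_grad, fisher, beta=0.01):
    """One step of update (1): zeta_{t+1} = zeta_t + beta * I_zeta^{-1} grad L(zeta_t).

    zeta        : current variational parameter (1-D array)
    euclid_grad : callable returning the Euclidean gradient of L at zeta
    fisher      : callable returning the Fisher information matrix at zeta
    """
    g = euclid_grad(zeta)
    I = fisher(zeta)
    # Natural gradient: solve I_zeta x = grad L instead of forming the inverse explicitly
    nat_grad = np.linalg.solve(I, g)
    return zeta + beta * nat_grad
```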
3 Elements of manifold optimization
We wish to optimize the function L of the variational parameter ζ with an update like (1), where
the variational parameter ζ lies on a manifold. The usual approach for unconstrained optimization
reduces to (i) finding a descent direction and (ii) performing a step in that direction to obtain a
decrease of the function. The notion of gradient is extended to manifolds through the tangent space. At a
point ζ on the manifold, the tangent space T_ζ is the approximating vector space; thus, given a descent
direction ξ_ζ ∈ T_ζ, a step is performed along the smooth curve on the manifold in this direction.
A Riemannian manifold is a real, smooth manifold equipped with a positive-definite inner product
g_ζ(·,·) on the tangent space at each point ζ (see (Absil et al., 2008) for a rigorous definition). A
Riemann manifold, hereafter simply called manifold, is thus a pair (S, g), where S is a set,
e.g., of certain matrices. For Riemannian manifolds, the Riemann gradient, denoted by grad f(ζ), is
defined as a direction on the tangent space such that the inner product of the Riemann gradient with
any direction in the tangent space gives the directional derivative of the function,

$$\langle \operatorname{grad} f(\zeta),\, \eta \rangle_\zeta = \mathrm{D}f(\zeta)[\eta],$$
Figure 1: Manifold illustration. Left: manifold (black), tangent space (light blue), and Riemann
gradient at the point in black. Middle: exponential map (dotted gray) and the corresponding point
on the manifold (green point). Right: parallel transport between vectors on two tangent planes.
where Df(ζ)[η] denotes the directional derivative of f at ζ in the direction η. The gradient has the
property that the direction of grad f(ζ) is the steepest-ascent direction of f at ζ (Absil et al., 2008),
which is important for optimization purposes.
For a descent direction on the tangent space, the map that gives the corresponding point on the
manifold is called the exponential map. The exponential map Exp_ζ(ξ_ζ) thus projects a tangent
vector ξ_ζ ∈ T_ζ back to the manifold, generalizing the usual concept ζ + ξ_ζ in Euclidean spaces. In
fact, Exp_ζ(ξ_ζ) can be thought of as the point on the manifold reached by leaving from ζ and moving
in the direction ξ_ζ while remaining on the manifold. Therefore, in analogy with the usual gradient-based
update ζ ← ζ + β∇f(ζ), with β being the learning rate, on manifolds the update is
performed through retraction, following the steepest direction provided by the Riemann gradient, as
Exp_ζ(β grad f(ζ)).
In practice, exponential maps are cumbersome to compute; retractions are used as first-order approx-
imations. A Riemannian manifold also has a natural way of transporting vectors. Parallel transport
moves tangent vectors from one tangent space to another while preserving the original length and
direction, extending the use of momentum gradients to manifolds. As for the exponential map, a
parallel transport is in practice approximated by the so-called vector transport. Note that the forms
of retraction and vector transport, as much as that of the Riemann gradient, depend on the specific
metric adopted in the tangent space.
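As an example of these maps on the manifold of symmetric positive-definite (SPD) matrices, a commonly used second-order retraction is R_Σ(ξ) = Σ + ξ + ½ ξ Σ⁻¹ ξ, often paired with a vector transport of the form η ↦ E η Eᵀ with E = (Σ_new Σ_old⁻¹)^{1/2}. The sketch below implements these two maps for illustration; they are stated here as common choices on the SPD manifold, not necessarily the exact maps adopted later for MGVBP.

```python
import numpy as np
from scipy.linalg import sqrtm

def spd_retraction(Sigma, xi):
    """Second-order retraction on the SPD manifold: R_Sigma(xi) = Sigma + xi + 0.5 * xi Sigma^{-1} xi."""
    R = Sigma + xi + 0.5 * xi @ np.linalg.solve(Sigma, xi)
    return 0.5 * (R + R.T)  # re-symmetrize against round-off

def spd_vector_transport(Sigma_old, Sigma_new, eta):
    """Transport a tangent vector eta from T_{Sigma_old} to T_{Sigma_new} as E eta E^T,
    with E = (Sigma_new Sigma_old^{-1})^{1/2}."""
    E = np.real(sqrtm(Sigma_new @ np.linalg.inv(Sigma_old)))
    out = E @ eta @ E.T
    return 0.5 * (out + out.T)
```

For symmetric ξ, the retraction above always returns an SPD matrix, which is precisely the property that plain additive updates of a covariance matrix lack.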
Thinking of ζ as the parameter of a Gaussian distribution, ζ involves elements related to µ, unconstrained
over R^d, and elements related to the covariance matrix, constrained to define a valid
covariance matrix: the product space of Riemannian manifolds is itself a Riemannian manifold. The
exponential map, gradient, and parallel transport are defined as the Cartesian product of the individual
ones, while the inner product is defined as the sum of the inner products of the components in
their respective manifolds (Hosseini and Sra, 2015).
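A minimal sketch of one ascent step on the product manifold R^d × SPD(d) then combines a plain Euclidean step for µ with a retraction step for Σ; the Riemann gradients are taken as given, and the retraction form is the same illustrative choice as above.

```python
import numpy as np

def spd_retraction(Sigma, xi):
    """Second-order retraction on the SPD manifold (same illustrative form as above)."""
    R = Sigma + xi + 0.5 * xi @ np.linalg.solve(Sigma, xi)
    return 0.5 * (R + R.T)

def product_manifold_step(mu, Sigma, rgrad_mu, rgrad_Sigma, beta=0.01):
    """One ascent step on the product manifold R^d x SPD(d):
    a plain Euclidean step for mu and a retraction step for Sigma."""
    mu_new = mu + beta * rgrad_mu                          # Euclidean part: the exponential map is addition
    Sigma_new = spd_retraction(Sigma, beta * rgrad_Sigma)  # SPD part: retraction keeps Sigma_new SPD
    return mu_new, Sigma_new
```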