MANIFOLD GAUSSIAN VARIATIONAL BAYES ON THE
PRECISION MATRIX
Martin Magris, Mostafa Shabani & Alexandros Iosifidis
Department of Electrical and Computer Engineering
Aarhus University
Finlandsgade 22, 8200 Aarhus N, Denmark
{magris,mshabani,ai}@ece.au.dk
Keywords: Variational inference, Manifold optimization, Bayesian learning, Black-box optimization
Abstract
We propose an optimization algorithm for Variational Inference (VI) in complex models. Our ap-
proach relies on natural gradient updates where the variational space is a Riemann manifold. We
develop an efficient algorithm for Gaussian Variational Inference whose updates satisfy the pos-
itive definite constraint on the variational covariance matrix. Our Manifold Gaussian Variational
Bayes on the Precision matrix (MGVBP) solution provides simple update rules, is straightforward
to implement, and its precision-matrix parametrization brings a significant computational
advantage. Due to its black-box nature, MGVBP stands as a ready-to-use solution for VI in complex
models. Over five datasets, we empirically validate our approach on different statistical and
econometric models, discussing its performance with respect to baseline methods.
1 Introduction
Although Bayesian principles are not new to Machine Learning (ML) (Mackay, 1992; 1995;
Lampinen and Vehtari, 2001), it is only with the recent methodological developments that we are
witnessing a growing use of Bayesian techniques in the field (Zhang et al., 2018; Trusheim et al.,
2018; Osawa et al., 2019; Khan et al., 2018; Khan and Nielsen, 2018). In typical ML settings, the
applicability of sampling methods for the challenging computation of the posterior is prohibitive;
however, approximate methods such as Variational Inference (VI) have been proved suitable and
successful (Saul et al., 1996; Wainwright and Jordan, 2008; Hoffman et al., 2013; Blei et al., 2017).
VI is generally performed with Stochastic Gradient Descent (SGD) methods (Robbins and Monro,
1951; Hoffman et al., 2013; Salimans and Knowles, 2014), boosted by the use of natural gradients
(Hoffman et al., 2013; Wierstra et al., 2014; Khan et al., 2018), and the updates often take a simple
form (Khan and Nielsen, 2018; Osawa et al., 2019; Magris et al., 2022).
Most VI algorithms rely on the extensive use of models’ gradients, and the form of the variational
posterior implies additional model-specific derivations that are not easy to adapt to a general, plug-
and-play optimizer. Black box methods (Ranganath et al., 2014) are straightforward to implement
and versatile as they avoid model-specific derivations by relying on stochastic sampling (Salimans
and Knowles, 2014; Paisley et al., 2012; Kingma and Welling, 2013). The increased variance of
the gradient estimates, compared with, e.g., methods relying on the reparametrization trick (Blundell
et al., 2015; Xu et al., 2019), can be alleviated with variance-reduction techniques.
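As an illustration only (not part of MGVBP), the following minimal Python sketch shows a score-function (black-box) gradient estimator with a simple mean baseline as a control variate; the handles `log_h`, `sample_q`, and `grad_log_q` are hypothetical placeholders for the model's log joint term, the variational sampler, and the score of the variational density.

```python
import numpy as np

def score_function_gradient(log_h, sample_q, grad_log_q, zeta, n_samples=200, rng=None):
    """Black-box (score-function) estimate of the gradient of E_{q_zeta}[h(theta)]
    with a simple mean baseline as a variance-reduction device."""
    rng = np.random.default_rng(rng)
    thetas = [sample_q(zeta, rng) for _ in range(n_samples)]
    h_vals = np.array([log_h(theta) for theta in thetas])
    # Rows hold grad_zeta log q_zeta(theta) evaluated at each draw.
    scores = np.stack([grad_log_q(zeta, theta) for theta in thetas])
    # A constant baseline leaves the estimator unbiased in expectation
    # (the sample mean used here is a common, slightly biased, approximation).
    baseline = h_vals.mean()
    return (scores * (h_vals - baseline)[:, None]).mean(axis=0)
```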
Furthermore, most existing algorithms do not directly address parameter constraints. Under the
typical Gaussian variational assumption, granting positive-definiteness of the covariance matrix is
an acknowledged problem (Tran et al., 2021a; Khan et al., 2018; Lin et al., 2020). Only a few
algorithms directly tackle the problem (Osawa et al., 2019; Lin et al., 2020); see Section 4. A recent
approximate approach based on manifold optimization is found in (Tran et al., 2021a). For a review
of the various algorithms for performing VI, see (Magris and Iosifidis, 2023a).
Building on the results of Tran et al. (2021a) and on their Manifold Gaussian Variational Bayes (MGVB)
method, we develop a variational inference algorithm that explicitly tackles the positive-definiteness
constraint on the variational covariance matrix, resembles the readily-applicable natural-gradient
black-box framework of Magris et al. (2022), and has computational advantages. We address
a theoretical issue concerning the use of retraction and parallel transport on the manifold of
symmetric and positive-definite matrices for Gaussian VI, leading to our Manifold Gaussian Variational
Bayes on the Precision matrix (MGVBP) algorithm. Our solution, based on the precision-matrix
parametrization of the variational Gaussian distribution, furthermore has a computational advantage
over the usual canonical parameterization on the covariance matrix, as the form of the relevant
gradients in our update rule is greatly simplified. We distinguish and apply two forms of the
stochastic gradient estimator that are applicable in a wider context, and show how to exploit certain
forms of the prior/posterior to further reduce the variance of the stochastic gradient estimators. We
show that MGVBP is straightforward to implement, discuss recommendations and practicalities in this
regard, and demonstrate its feasibility in extensive experiments over five datasets, 14 models, three
competing VI optimizers, and a Markov Chain Monte Carlo baseline.
In Section 2, we review the basics of VI; Section 3 introduces elements of manifold optimization;
in Section 4, we review the Manifold Gaussian Variational Bayes approach and other related work;
Section 5 describes the proposed approach. Section 6 discusses implementation aspects, results are
reported in Section 7, and Section 8 concludes the paper. Appendices expand on the experiments and
provide proofs.
2 Variational inference
Variational Inference (VI) is a convenient and feasible approximate method for Bayesian inference.
Let y denote the data, p(y|θ) the likelihood of the data based on some model whose d-dimensional
parameter is θ. Let p(θ) be the prior distribution on θ. In standard Bayesian inference, the posterior
is retrieved via Bayes' theorem as p(θ|y) = p(θ)p(y|θ)/p(y). As the marginal likelihood p(y)
is generally intractable, Bayesian inference is often difficult for complex models. Though sampling
techniques can tackle the problem, non-parametric and asymptotically exact Monte Carlo methods
may be slow, especially in high-dimensional applications (Salimans et al., 2015).
Fixed-form VI approximates the true unknown posterior with a probability density q chosen within
a tractable class of distributions Q, such as the exponential family. VI turns the Bayesian inference
problem into that of finding the best variational distribution q* ∈ Q minimizing the Kullback-Leibler
(KL) divergence from q to p(θ|y): q* = arg min_{q∈Q} D_KL(q || p(θ|y)). It can be shown that
the KL minimization problem is equivalent to the maximization of the so-called Lower Bound (LB)
on log p(y), e.g., (Tran et al., 2021b). The optimization problem accounts for finding the optimal
variational parameter ζ* parametrizing q ≡ q_ζ that maximizes the Lower Bound (LB) L, that is,
ζ* = arg max_{ζ∈Z} L(ζ), with

$$\mathcal{L}(\zeta) := \int q_\zeta(\theta)\,\log\frac{p(\theta)\,p(y|\theta)}{q_\zeta(\theta)}\,\mathrm{d}\theta = \mathbb{E}_{q_\zeta}\!\left[\log\frac{p(\theta)\,p(y|\theta)}{q_\zeta(\theta)}\right] = \mathbb{E}_{q_\zeta}\big[h_\zeta(\theta)\big],$$

where h_ζ(θ) := log p(θ) + log p(y|θ) − log q_ζ(θ), E_{q_ζ} means that the expectation is taken with
respect to the distribution q_ζ, and Z is the parameter space for ζ.
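For concreteness, the lower bound can be estimated by Monte Carlo by sampling from q_ζ. The sketch below assumes a Gaussian q_ζ = N(µ, Σ) and a user-supplied `log_joint` returning log p(θ) + log p(y|θ); it is an illustrative implementation, not the estimator used by MGVBP.

```python
import numpy as np

def mc_lower_bound(log_joint, mu, Sigma, n_samples=100, rng=None):
    """Monte Carlo estimate of L(zeta) = E_q[ log p(theta) + log p(y|theta) - log q(theta) ]
    for a Gaussian variational posterior q = N(mu, Sigma)."""
    rng = np.random.default_rng(rng)
    d = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    # Draw theta ~ N(mu, Sigma) via the Cholesky factor: theta = mu + L @ eps
    eps = rng.standard_normal((n_samples, d))
    thetas = mu + eps @ L.T
    # log q(theta): the quadratic form reduces to ||eps||^2 under this parameterization
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.sum(eps**2, axis=1)
    log_q = -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    # h_zeta(theta) = log p(theta) + log p(y|theta) - log q_zeta(theta)
    h = np.array([log_joint(theta) for theta in thetas]) - log_q
    return h.mean()
```

Sampling through the Cholesky factor keeps the evaluation of log q_ζ cheap, since the quadratic form reduces to the squared norm of the standard normal draws.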
The maximization of the LB is generally addressed with a gradient-descent method such as SGD
(Robbins and Monro, 1951) or ADAM (Kingma and Ba, 2014). Learning the parameter ζ by
standard gradient descent is, however, problematic, as it ignores the information geometry of the
distribution q_ζ, is not scale invariant, is unstable, and is very susceptible to the initial values (Wierstra
et al., 2014). SGD implicitly relies on the Euclidean norm for capturing the dissimilarity between
two distributions, which can be a poor and misleading measure of discrepancy (Khan and Nielsen,
2018). By using the KL divergence in place of the Euclidean norm, the SGD update results in the
following natural gradient update:

$$\zeta_{t+1} = \zeta_t + \beta_t \left[\tilde{\nabla}_\zeta \mathcal{L}(\zeta)\right]_{\zeta=\zeta_t}, \qquad (1)$$

where β_t is a possibly adaptive learning rate and t denotes the iteration. The above update results in
improved steps towards the maximum of the LB when optimizing it for the variational parameter ζ.
The natural gradient $\tilde{\nabla}_\zeta \mathcal{L}(\zeta)$ is obtained by rescaling the Euclidean gradient $\nabla_\zeta \mathcal{L}(\zeta)$ by the inverse
of the Fisher Information Matrix (FIM), i.e.,

$$\tilde{\nabla}_\zeta \mathcal{L}(\zeta) = \mathcal{I}_\zeta^{-1}\,\nabla_\zeta \mathcal{L}(\zeta),$$

where I_ζ denotes the FIM. A significant issue in following this approach is that ζ is assumed to be
unconstrained. Think of a Gaussian variational posterior: in the above setting, there is no guarantee
that the covariance matrix updates onto a symmetric and positive definite matrix. As discussed in
the introduction, manifold optimization is an attractive possibility.
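As a plain illustration of update (1), the sketch below applies one natural-gradient step given callables for the Euclidean gradient and the FIM. In practice both quantities are estimated stochastically and the FIM is rarely formed or inverted explicitly, so this is a conceptual sketch rather than the MGVBP update.

```python
import numpy as np

def natural_gradient_step(zeta, euclid_grad, fisher, beta=0.01):
    """One step of update (1): zeta_{t+1} = zeta_t + beta * I_zeta^{-1} grad L(zeta_t).

    zeta        : current variational parameter (1-D array)
    euclid_grad : callable returning the Euclidean gradient of L at zeta
    fisher      : callable returning the Fisher information matrix at zeta
    """
    g = euclid_grad(zeta)
    I = fisher(zeta)
    # Natural gradient: solve I_zeta x = grad L instead of forming the inverse explicitly
    nat_grad = np.linalg.solve(I, g)
    return zeta + beta * nat_grad
```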
3 Elements of manifold optimization
We wish to optimize the function L of the variational parameter ζ with an update like (1), where
the variational parameter ζ lies on a manifold. The usual approach for unconstrained optimization
reduces to (i) finding a descent direction and (ii) performing a step in that direction to obtain a
decrease of the function. The notion of gradient is extended to manifolds through the tangent space. At a
point ζ on the manifold, the tangent space T_ζ is the approximating vector space; thus, given a descent
direction ξ_ζ ∈ T_ζ, a step is performed along the smooth curve on the manifold in this direction.
A Riemannian manifold is a real, smooth manifold equipped with a positive-definite inner product
g_ζ(·,·) on the tangent space at each point ζ (see (Absil et al., 2008) for a rigorous definition). A
Riemann manifold, hereafter simply called manifold, is thus a pair (S, g), where S is a set,
e.g., of certain matrices. For Riemannian manifolds, the Riemann gradient, denoted by grad f(ζ), is
defined as a direction on the tangent space such that the inner product of the Riemann gradient with
any direction in the tangent space gives the directional derivative of the function,

$$\langle \operatorname{grad} f(\zeta),\, \eta \rangle_\zeta = \mathrm{D}f(\zeta)[\eta],$$
Figure 1: Manifold illustration. Left: manifold (black), tangent space (light blue), and Riemann
gradient at the point in black. Middle: exponential map (dotted gray) and the corresponding point
on the manifold (green point). Right: parallel transport between vectors on two tangent planes.
where Df(ζ)[η] denotes the directional derivative of f at ζ in the direction η. The gradient has the
property that the direction of grad f(ζ) is the steepest-ascent direction of f at ζ (Absil et al., 2008),
which is important for optimization purposes.
For a descent direction on the tangent space, the map that gives the corresponding point on the
manifold is called the exponential map. The exponential map Exp_ζ(ξ_ζ) thus projects a tangent
vector ξ_ζ ∈ T_ζ back to the manifold, generalizing the usual concept ζ + ξ_ζ in Euclidean spaces. In
fact, Exp_ζ(ξ_ζ) can be thought of as the point on the manifold reached by leaving from ζ and moving
in the direction ξ_ζ while remaining on the manifold. Therefore, in analogy with the usual gradient-based
update ζ ← ζ + β∇f(ζ), with β being the learning rate, on manifolds the update is
performed through retraction, following the steepest direction provided by the Riemann gradient, as
Exp_ζ(β grad f(ζ)).
In practice, exponential maps are cumbersome to compute; retractions are used as first-order approx-
imations. A Riemannian manifold also has a natural way of transporting vectors. Parallel transport
moves tangent vectors from one tangent space to another while preserving the original length and
direction, extending the use of momentum gradients to manifolds. As for the exponential map, a
parallel transport is in practice approximated by the so-called vector transport. Note that the forms
of retraction and vector transport, as much as that of the Riemann gradient, depend on the specific
metric adopted in the tangent space.
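As an example of these maps on the manifold of symmetric positive-definite (SPD) matrices, a commonly used second-order retraction is R_Σ(ξ) = Σ + ξ + ½ ξ Σ⁻¹ ξ, often paired with a vector transport of the form η ↦ E η Eᵀ with E = (Σ_new Σ_old⁻¹)^{1/2}. The sketch below implements these two maps for illustration; they are stated here as common choices on the SPD manifold, not necessarily the exact maps adopted later for MGVBP.

```python
import numpy as np
from scipy.linalg import sqrtm

def spd_retraction(Sigma, xi):
    """Second-order retraction on the SPD manifold: R_Sigma(xi) = Sigma + xi + 0.5 * xi Sigma^{-1} xi."""
    R = Sigma + xi + 0.5 * xi @ np.linalg.solve(Sigma, xi)
    return 0.5 * (R + R.T)  # re-symmetrize against round-off

def spd_vector_transport(Sigma_old, Sigma_new, eta):
    """Transport a tangent vector eta from T_{Sigma_old} to T_{Sigma_new} as E eta E^T,
    with E = (Sigma_new Sigma_old^{-1})^{1/2}."""
    E = np.real(sqrtm(Sigma_new @ np.linalg.inv(Sigma_old)))
    out = E @ eta @ E.T
    return 0.5 * (out + out.T)
```

For symmetric ξ, the retraction above always returns an SPD matrix, which is precisely the property that plain additive updates of a covariance matrix lack.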
Thinking of ζ as the parameter of a Gaussian distribution, ζ involves elements related to µ, unconstrained
over R^d, and elements related to the covariance matrix, constrained to define a valid
covariance matrix: the product space of Riemannian manifolds is itself a Riemannian manifold. The
exponential map, gradient, and parallel transport are defined as the Cartesian product of the individual
ones, while the inner product is defined as the sum of the inner products of the components in
their respective manifolds (Hosseini and Sra, 2015).
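A minimal sketch of one ascent step on the product manifold R^d × SPD(d) then combines a plain Euclidean step for µ with a retraction step for Σ; the Riemann gradients are taken as given, and the retraction form is the same illustrative choice as above.

```python
import numpy as np

def spd_retraction(Sigma, xi):
    """Second-order retraction on the SPD manifold (same illustrative form as above)."""
    R = Sigma + xi + 0.5 * xi @ np.linalg.solve(Sigma, xi)
    return 0.5 * (R + R.T)

def product_manifold_step(mu, Sigma, rgrad_mu, rgrad_Sigma, beta=0.01):
    """One ascent step on the product manifold R^d x SPD(d):
    a plain Euclidean step for mu and a retraction step for Sigma."""
    mu_new = mu + beta * rgrad_mu                          # Euclidean part: the exponential map is addition
    Sigma_new = spd_retraction(Sigma, beta * rgrad_Sigma)  # SPD part: retraction keeps Sigma_new SPD
    return mu_new, Sigma_new
```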