ON REPRESENTATIONS OF MEAN-FIELD VARIATIONAL INFERENCE

SOUMYADIP GHOSH, YINGDONG LU, TOMASZ NOWICKI, AND EDITH ZHANG
Abstract. The mean field variational inference (MFVI) formulation restricts the general Bayesian inference problem to the subspace of product measures. We present a framework to analyse MFVI algorithms, inspired by a similar development for general variational Bayesian formulations. Our approach enables the MFVI problem to be represented in three different ways: as a gradient flow on Wasserstein space, as a system of Fokker-Planck-like equations, and as a diffusion process. Rigorous guarantees are established to show that a time-discretized implementation of the coordinate ascent variational inference algorithm in the product Wasserstein space of measures yields a gradient flow in the limit. A similar result is obtained for the associated densities, with the limit given by a quasi-linear partial differential equation. A popular class of practical algorithms falls within this framework, which provides tools to establish their convergence. We hope this framework can be used to guarantee convergence of algorithms in a variety of approaches, old and new, to solving variational inference problems.
1. Introduction
Bayesian analysis posits a statistical model with observable variables $\boldsymbol{x} \in \mathbb{R}^n$ and unobserved latent variables $\theta \in \mathbb{R}^d$, and seeks to infer a posterior distribution $p(\theta \mid \boldsymbol{x})$ for the latent $\theta$ given a dataset of observations $\boldsymbol{x} = (x_1, \ldots, x_n)$. The answer is provided, in the abstract, by Bayes' theorem: $p(\theta \mid \boldsymbol{x}) = \pi(\theta)\,P(\boldsymbol{x} \mid \theta)/Z$, where $P(\boldsymbol{x} \mid \theta)$ represents the conditional probability of observations $\boldsymbol{x}$ given $\theta$, $\pi(\theta)$ is a pre-specified prior distribution on $\theta$, and the normalizing constant $Z = \int \pi(\zeta)\,P(\boldsymbol{x} \mid \zeta)\,d\zeta$ is the (unconditioned) probability of observing $\boldsymbol{x}$. Computing the denominator $Z$ is often prohibitively expensive (it is a $\#P$-complete problem even in some special cases; see, e.g., [16]), and so an exact computation of the desired posterior distribution $p$ directly from Bayes' rule is intractable. Various algorithms have been proposed to overcome this difficulty in practice. These include sampling algorithms such as Markov chain Monte Carlo (MCMC) methods [12] that aim to estimate the true posterior $p$, but are challenged in practice by the possibility of long initialization periods that are discarded and the hardness of determining effective stopping criteria. Variational Inference (VI) [7] algorithms, on the other hand, can be efficiently implemented to quickly identify approximations of $p$ that are restricted to computationally advantageous forms. Each such VI approach comes with varying degrees of theoretical guarantees for convergence. In this article, we focus on the rigorous analysis of convergence of a subset called Mean Field VI (MFVI), a commonly implemented practical VI approach.
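For illustration (this sketch is ours, not part of the paper): in a toy one-dimensional model the unnormalized posterior $\pi(\theta)P(\boldsymbol{x}\mid\theta)$ is easy to evaluate and $Z$ can still be approximated by quadrature, but the same grid-based approach costs exponentially more with each added latent dimension, which is exactly the computational obstruction described above. The model below (standard normal prior, unit-variance Gaussian likelihood) is an assumption made purely for the example.

import numpy as np

# Toy 1-d model (illustrative assumption, not the paper's setting):
# prior theta ~ N(0, 1), likelihood x_i | theta ~ N(theta, 1).
x = np.array([0.5, 1.2, -0.3])  # observed data

def unnormalized_posterior(theta):
    """pi(theta) * P(x | theta), evaluated on an array of theta values."""
    log_prior = -0.5 * theta**2
    log_lik = -0.5 * np.sum((x[None, :] - theta[:, None]) ** 2, axis=1)
    return np.exp(log_prior + log_lik)

# In one dimension, Z = \int pi(zeta) P(x | zeta) d(zeta) is cheap to
# approximate on a grid ...
grid = np.linspace(-5.0, 5.0, 2001)
Z = np.trapz(unnormalized_posterior(grid), grid)
posterior_density = unnormalized_posterior(grid) / Z
# ... but a comparable grid over R^d needs O(2001^d) points, which is why
# exact computation of Z (and hence of p) is intractable in general.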
The posterior distribution $p$ is trivially re-expressed as the minimizer of the Kullback-Leibler (KL) divergence $D$ to itself, where $D(\xi \,\|\, \eta) := \mathbb{E}_{\xi}[\log(d\xi/d\eta)]$ for measures $\xi$ and $\eta$. Denoting $P(\theta, \boldsymbol{x}) := \pi(\theta)\,P(\boldsymbol{x} \mid \theta)$, we have
$$p = \arg\min_{\nu \in \mathcal{P}(\mathbb{R}^d)} D(\nu \,\|\, p) = \arg\min_{\nu \in \mathcal{P}(\mathbb{R}^d)} \left\{ \mathbb{E}_{\nu}[\log \nu] - \mathbb{E}_{\nu}[\log P(\boldsymbol{x}, \theta)] \right\} + \log Z. \tag{1}$$
Here, the set $\mathcal{P}(\mathbb{R}^d)$ contains absolutely continuous probability measures. The optimization problem (1) over the probability space is known as the Variational Bayes (VB) form of Bayes' rule [7]. Denote by $H(\nu) := -\mathbb{E}_{\nu}[\log \nu]$ the entropy of the measure $\nu$, and by $\Psi(\nu) := -\mathbb{E}_{\nu}[\log P(\boldsymbol{x}, \theta)]$ the expected negative log likelihood of the joint distribution $P(\boldsymbol{x}, \theta)$. Since $\log Z$ is a constant w.r.t. $\nu$, the VB problem (1) minimizes the (negative) evidence lower bound (ELBO) [7] objective $J(\nu) := \Psi(\nu) - H(\nu)$. Equivalently, it maximizes $-J(\nu)$, balancing a high log likelihood $-\Psi(\nu)$ under $\nu$ against a regularization term that favors a high entropy solution $\nu$. [17] provide equivalent functional representations of the objective of (1) that arise from other perspectives.
Existence, uniqueness and convergence results for VB can be obtained from representations of (1) constructed by exploiting intriguing connections between Bayesian inference, differential equations and diffusion processes. [13] provided a seminal result that the gradient flow in Wasserstein space (the metric space $\mathcal{P}(\mathbb{R}^d)$ of probability measures endowed with the 2-Wasserstein distance $W_2$) of an objective function like (1) can be equivalently expressed as the solution to a Fokker-Planck equation (FPE), a parabolic partial differential equation (PDE) on densities as $L^1$ functions. These key connections allow Bayes' rule to be expressed as the minimum of various related functionals on different metric spaces: it can be viewed as the stationary solution of a gradient flow of $J$ in the space $W_2$, as the stationary solution to an FPE in the $L^1$ space of density functions, and it also corresponds to the stationary distribution of a diffusion process. These equivalent relationships are depicted in Fig. 1; see [17] for further details.
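In the notation above, the result of [13] takes the following concrete form. Writing $V(\theta) := -\log P(\boldsymbol{x}, \theta)$, so that $J(\rho) = \int V\rho\,d\theta + \int \rho\log\rho\,d\theta$, the $W_2$ gradient flow of $J$ solves the Fokker-Planck equation
$$\partial_t \rho_t = \nabla\cdot(\rho_t \nabla V) + \Delta\rho_t,$$
whose stationary solution is $\rho_\infty \propto e^{-V} = P(\boldsymbol{x},\theta)$, i.e. exactly the posterior $p(\theta\mid\boldsymbol{x})$ after normalization; the diffusion in the third representation is the corresponding Langevin SDE $d\theta_t = -\nabla V(\theta_t)\,dt + \sqrt{2}\,dB_t$, whose stationary distribution is the same $\rho_\infty$.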
Solution procedures for the several equivalent optimization representations of the posterior $p$ shown in Fig. 1 are hard to implement in practice, since each still requires computationally difficult operations in functional and probability spaces. In practice, the VB problem (1) is approximated by Variational Inference procedures that replace the general set $\mathcal{P}(\mathbb{R}^d)$ with a constrained subset of feasible probability measures $\mathcal{Q} \subset \mathcal{P}$, where measures in $\mathcal{Q}$ possess structural properties that allow for a practical and efficient implementation of the optimization. The solution thus obtained is an approximation of $p$, and the two coincide only if $p \in \mathcal{Q}$. A common choice is mean field VI [7], where $\mathcal{Q}$ is taken to be the mean field family $\mathcal{Q}(\mathbb{R}^d) := \prod_{i=1}^{d} \mathcal{P}(\mathbb{R})$, in which the components of $\theta$ are independent of each other. The MFVI approximation of $p$ is then obtained by solving the optimization problem (1) over the restricted feasible set $\nu \in \mathcal{Q}$.
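The classical coordinate ascent variational inference (CAVI) algorithm optimizes over $\mathcal{Q}$ one marginal at a time, updating $q_i \propto \exp(\mathbb{E}_{q_{-i}}[\log P(\boldsymbol{x}, \theta)])$ while the other marginals are held fixed. A minimal sketch of these updates follows, for a toy two-dimensional Gaussian target where they are available in closed form (the precision matrix below is an arbitrary illustrative choice, not from the paper):

import numpy as np

# Toy target: theta ~ N(0, inv(Lam)) on R^2, with an assumed precision
# matrix Lam. For this target the CAVI update keeps each marginal q_i
# Gaussian with fixed variance 1/Lam[i, i] and updates only its mean.
Lam = np.array([[2.0, 0.9],
                [0.9, 2.0]])
m = np.array([1.0, -1.0])  # initial marginal means

for _ in range(50):  # coordinate-ascent sweeps
    m[0] = -(Lam[0, 1] / Lam[0, 0]) * m[1]  # update q_1 with q_2 fixed
    m[1] = -(Lam[1, 0] / Lam[1, 1]) * m[0]  # update q_2 with q_1 fixed

print(m)  # -> [0, 0], the true posterior mean

The well-known caveat visible here is that while the mean-field means converge to the true posterior mean, the marginal variances $1/\Lambda_{ii}$ understate the true marginal variances $(\Lambda^{-1})_{ii}$ whenever the coordinates are correlated.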
Contributions: Our main focus is to derive multiple representations for the MFVI formulation similar to those displayed in Fig. 1. Our analogous representations are summarized concisely in Fig. 2. Specifically:
• Broadly following the alternative views available for Bayesian inference, we describe three different representations of the MFVI algorithm. The first views the mean-field approximation of the posterior as the gradient flow of a joint set of functionals, the second as the solution to a system of quasilinear partial differential equations, and the last via a diffusion process: the mean-field approximation is the stationary distribution of a system of stochastic differential equations.
• Theorem 1 shows that a discrete process induced by the candidate solutions of a coordinate-wise algorithm (see Sec. 3) converges to an equivalent gradient flow defined on the product Wasserstein space of measures as a certain step size parameter is shrunk to zero. This is, to the best of our knowledge, the first gradient flow representation of the general MFVI algorithm, and it depends on extensions of some basic concepts of gradient flows to product Wasserstein space, which are presented in Sec. 4.1 and Sec. B.
• We also demonstrate, in Corollary 1, that the corresponding density functions converge to the solution of a second order quasilinear evolution (parabolic) equation. Additionally, in Theorem 2, we extend our analysis to present new results of independent interest on the existence and uniqueness of solutions to families of quasilinear evolution equations satisfying similar conditions.
• The quasilinear evolution equation leads to a probabilistic representation of the MFVI by connecting its solution to the density of a stochastic process that solves a corresponding stochastic differential equation (SDE) of McKean-Vlasov type.
The three representations presented in this article open the possibility of multiple new algorithmic approaches to obtaining the approximation to $p$ in the space $\mathcal{Q}$, and also provide tools to study the convergence properties of these algorithms. While a detailed development is out of scope here, we briefly summarize some possibilities. The MFVI formulation (1) can be solved using a system of SGD-like iterations produced by the Euler-discretization formulations (11), each of which can be solved explicitly under further restrictions of the marginal measures to parametric families such as Gaussian, mixed-Gaussian, etc. Alternatively, non-parametric particle-based heuristics can be used to approximate the solution to the general SGD steps (11). The SDE representation, on the other hand, suggests that the posterior be approximated by estimating the stationary process of the SDE, exploiting techniques from the vast literature on SDEs. A particle filter based approach can, for example, be constructed using a system of MCMCs with dynamics arising from the components of the SDE.
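As one concrete instance of the SDE-based route, the sketch below (ours, under an assumed grad_log_joint oracle) simulates the single, non-mean-field Langevin diffusion $d\theta_t = \nabla \log P(\boldsymbol{x}, \theta_t)\,dt + \sqrt{2}\,dB_t$ by Euler-Maruyama; its stationary distribution is the posterior. The McKean-Vlasov system developed in Sec. 5 additionally couples each coordinate's drift to the laws of the other coordinates.

import numpy as np

def langevin_samples(grad_log_joint, theta0, step=1e-2, n_steps=100_000, rng=None):
    """Euler-Maruyama discretization of
        d theta_t = grad log P(x, theta_t) dt + sqrt(2) dB_t.
    For a small step size, the iterates approximately sample the SDE's
    stationary distribution, i.e. the posterior p(theta | x)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    samples = np.empty((n_steps,) + theta.shape)
    for k in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta + step * grad_log_joint(theta) + np.sqrt(2.0 * step) * noise
        samples[k] = theta
    return samples

# Example: standard normal posterior, so grad log P(x, theta) = -theta.
draws = langevin_samples(lambda t: -t, theta0=np.zeros(2), n_steps=20_000)
print(draws[5_000:].mean(axis=0), draws[5_000:].var(axis=0))  # approx. 0 and 1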
Prior Work: Convergence analysis of the MFVI approximation to the VB problem is relatively less well established. [18] provide consistency results for MFVI procedures by establishing that point estimates of the latent variables $\theta$ (such as expectations of functions of $\theta$) constructed using MFVI estimates of the posterior converge to the true value asymptotically as the size $n$ of $\boldsymbol{x}$ grows, under the assumption that the true latent variable takes a definite value. A recent analysis by [14] presents a convergence analysis of VI where the set $\mathcal{Q}$ is further constrained to (mixtures of) Gaussian distributions, thus operating in the sub-manifold of $\mathcal{P}(\mathbb{R}^d)$ known as the Bures-Wasserstein manifold. Their methodology closely follows the standard VB analysis outlined in Fig. 1 restricted to this manifold. In particular, the formulation (1) in this case leads to a simplified FPE and associated diffusion process, unlike our case, which requires the development and analysis of a system of quasi-linear PDEs and associated stochastic processes.
Organization: The rest of the paper is organized as follows. In Sec. 2, we provide precise definitions of the various key representations of Bayesian inference illustrated in Fig. 1. Sec. 3 defines the optimization formulation of the MFVI problem, including an Euler discretization scheme which forms the basis of all the development that follows. In Sec. 4, we present our results on the convergence of the discrete scheme to a gradient flow. In Sec. 5, we define the equivalent quasilinear parabolic equation in the $L^2$ space of densities and discuss its well-posedness, as well as the probabilistic representation in the form of a McKean-Vlasov stochastic differential equation.

Figure 1. Representations of Bayesian inference
2. Representations of the Bayesian Posterior
The variational formulation (1) of Bayesian inference enables a fruitful exploration of the connections between algorithms such as gradient descent and gradient flows, as well as their implications. This section presents a brief overview of three equivalent representations of the VB formulation that yield different characterisations of the posterior distribution, as summarized in Figure 1, each of which leads to potential algorithmic approaches to approximating it. In Sec. 4 & 5, a similar set of relationships will be established for MFVI. For that purpose, we provide precise definitions and descriptions of these characterizations in this section.
In the classic Euclidean space setting, a curve $x(t)$ in $\mathbb{R}^d$ is called the gradient flow of a function $E : \mathbb{R}^d \to \mathbb{R}$ if it solves the following equation:
$$\partial_t x(t) = -\nabla E(x(t)). \tag{2}$$
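An explicit Euler discretization of (2) with step $\tau$ is exactly gradient descent, $x_{k+1} = x_k - \tau \nabla E(x_k)$; as $\tau \to 0$ the iterates trace the continuous-time flow, which is the Euclidean prototype of the discretization-to-flow limits studied later for MFVI. A minimal sketch (our illustration):

import numpy as np

def gradient_flow_euler(grad_E, x0, tau=1e-3, n_steps=10_000):
    """Explicit Euler discretization of (2): x_{k+1} = x_k - tau * grad_E(x_k).
    As tau -> 0 (with n_steps * tau fixed), the iterates trace the
    continuous-time gradient flow x(t)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - tau * grad_E(x)
    return x

# Example: E(x) = |x|^2 / 2 has gradient flow x(t) = x0 * exp(-t).
print(gradient_flow_euler(lambda x: x, [1.0, -2.0]))  # approx. exp(-10) * x0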
To extend this concept to a general metric space (Wasserstein spaces included), consider a general (energy) functional $E : X \to \mathbb{R}$ defined on a metric space $(X, d)$; its gradient flow $x(t) : \mathbb{R}_+ \to X$ solves the following energy dissipation equation, for $t > 0$:
$$E(x_0) = E(x_t) + \frac{1}{2} \int_0^t |\dot{x}(r)|^2 \, dr + \frac{1}{2} \int_0^t |\nabla E(x(r))|^2 \, dr. \tag{3}$$
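To see why (3) generalizes (2), note that for a smooth curve in $\mathbb{R}^d$, the chain rule, the Cauchy-Schwarz inequality and Young's inequality give
$$E(x_0) - E(x_t) = -\int_0^t \langle \nabla E(x(r)), \dot{x}(r) \rangle \, dr \le \frac{1}{2} \int_0^t |\dot{x}(r)|^2 \, dr + \frac{1}{2} \int_0^t |\nabla E(x(r))|^2 \, dr,$$
with equality precisely when $\dot{x}(r) = -\nabla E(x(r))$ for a.e. $r$; thus (3) characterizes the solutions of (2) without differentiating $E$ in coordinates. In a general metric space, $|\dot{x}|$ is interpreted as the metric derivative of the curve and $|\nabla E|$ as the metric slope of $E$.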