measures $\xi$ and $\eta$. Denoting $P(\theta, \boldsymbol{x}) := \pi(\theta) P(\boldsymbol{x} \,|\, \theta)$, we have
\begin{align*}
p &= \operatorname*{arg\,min}_{\nu \in \mathcal{P}(\mathbb{R}^d)} D(\nu \,\|\, p) \tag{1} \\
&= \operatorname*{arg\,min}_{\nu \in \mathcal{P}(\mathbb{R}^d)} \left\{ \mathbb{E}_\nu[\log \nu] - \mathbb{E}_\nu[\log P(\boldsymbol{x}, \theta)] \right\} + \log Z.
\end{align*}
Here, the set $\mathcal{P}(\mathbb{R}^d)$ contains absolutely continuous probability measures. The optimization problem (1) over the probability space is known as the Variational Bayes (VB) form of Bayes' rule [7]. Denote by $H(\nu) := -\mathbb{E}_\nu[\log \nu]$ the entropy of the measure $\nu$, and by $\Psi(\nu) := \mathbb{E}_\nu[-\log P(\boldsymbol{x}, \theta)]$ the expected negative log likelihood of the joint distribution $P(\boldsymbol{x}, \theta)$. Since $\log Z$ is a constant w.r.t. $\nu$, the VB problem (1) minimizes the negative evidence lower bound (ELBO) [7] objective $J(\nu) := \Psi(\nu) - H(\nu)$.
Equivalently, it maximizes the ELBO $-J(\nu)$, balancing a high expected log likelihood under $\nu$ (a small $\Psi(\nu)$) against a regularization term that favors a high-entropy solution $\nu$. [17] provide equivalent functional representations of the objective of (1) that arise from other perspectives.
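The terminology is justified by a one-line computation from the definitions above: since $p = P(\boldsymbol{x}, \theta)/Z$,
\[
D(\nu \,\|\, p) = \mathbb{E}_\nu[\log \nu] - \mathbb{E}_\nu[\log P(\boldsymbol{x}, \theta)] + \log Z = J(\nu) + \log Z \;\ge\; 0,
\]
so $-J(\nu) \le \log Z$ for every $\nu \in \mathcal{P}(\mathbb{R}^d)$: the ELBO is indeed a lower bound on the log evidence, with equality exactly at $\nu = p$.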
Existence, uniqueness and convergence results for VB can be obtained from representations of (1) constructed by exploiting intriguing connections between Bayesian inference, differential equations and diffusion processes. [13] provided a seminal result that the gradient flow in Wasserstein space (the metric space $\mathcal{P}(\mathbb{R}^d)$ of probability measures endowed with the 2-Wasserstein distance $W_2$) of an objective function like (1) can be equivalently expressed as the solution to a Fokker-Planck equation (FPE), a parabolic partial differential equation (PDE) on densities as $L^1$ functions. These key connections allow Bayes' rule to be expressed as the minimum of various related functionals on different metric spaces: it can be viewed as the stationary solution of a gradient flow of $J$ in the space $W_2$, as the stationary solution to an FPE in the $L^1$ space of density functions, and also as the stationary distribution of a diffusion process. These equivalent relationships are depicted in Fig. 1; see [17] for further details.
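To make the correspondence concrete, write $V(\theta) := -\log P(\boldsymbol{x}, \theta)$. Under standard smoothness assumptions, the three objects take the following well-known forms (a sketch in our notation, not a reproduction of Fig. 1):
\[
\partial_t \rho_t = \nabla \cdot (\rho_t \nabla V) + \Delta \rho_t \quad \text{(FPE; the $W_2$ gradient flow of $J$)},
\qquad
d\theta_t = -\nabla V(\theta_t)\, dt + \sqrt{2}\, dB_t \quad \text{(diffusion)},
\]
and both share the stationary density $\rho_\infty(\theta) \propto e^{-V(\theta)} = P(\boldsymbol{x}, \theta)$, which normalizes to the posterior $p$.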
Solution procedures for the several equivalent optimization representations of the posterior $p$ shown in Fig. 1 are hard to implement directly, since each still requires computationally difficult operations in functional and probability spaces. In practice, the VB problem (1) is approximated by Variational Inference (VI) procedures that replace the general set $\mathcal{P}(\mathbb{R}^d)$ with a constrained subset of feasible probability measures $\mathcal{Q} \subset \mathcal{P}$, where measures in $\mathcal{Q}$ possess structural properties that allow for a practical and efficient implementation of the optimization. The solution thus obtained is an approximation of $p$, and coincides with it only if $p \in \mathcal{Q}$.
A common choice is mean field VI (MFVI) [7], where $\mathcal{Q}$ is taken to be the mean field family $\mathcal{Q}(\mathbb{R}^d) := \prod_{i=1}^d \mathcal{P}(\mathbb{R})$, in which the components of $\theta$ are independent of each other. The MFVI approximation of $p$ is then obtained by solving the optimization problem (1) over the restricted feasible set $\nu \in \mathcal{Q}$.
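As a concrete illustration of what the restriction to $\mathcal{Q}$ buys and costs, the following sketch (ours, not from the paper; the toy model and all variable names are illustrative assumptions) runs coordinate-ascent MFVI on a bivariate Gaussian posterior, where the optimal mean field factors are available in closed form:

```python
import numpy as np

# Illustrative sketch (not from the paper): coordinate-ascent MFVI for a
# bivariate Gaussian posterior p = N(mu, inv(Lam)), with Lam the precision.
# With Gaussian mean field factors, the optimum is q_i = N(m_i, 1/Lam[i,i])
# and the coordinate-ascent fixed point satisfies
#   m_i = mu_i - (1/Lam[i,i]) * sum_{j != i} Lam[i,j] * (m_j - mu_j).

mu = np.array([1.0, -1.0])            # posterior mean
Lam = np.array([[2.0, 0.9],
                [0.9, 1.5]])          # posterior precision matrix

m = np.zeros(2)                       # variational means, initialized at 0
for _ in range(50):                   # coordinate-ascent sweeps
    for i in range(2):
        j = 1 - i
        m[i] = mu[i] - Lam[i, j] * (m[j] - mu[j]) / Lam[i, i]

print("MFVI means:", m)                                   # recovers mu
print("MFVI marginal variances:", 1.0 / np.diag(Lam))
print("True marginal variances:", np.diag(np.linalg.inv(Lam)))
```

The sweep converges to the exact posterior mean, while the mean field variances $1/\Lambda_{ii}$ understate the true marginal variances $(\Lambda^{-1})_{ii}$ whenever $\Lambda$ has nonzero off-diagonal entries, the familiar price of $p \notin \mathcal{Q}$.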
Contributions: Our main focus is to derive multiple representations of the MFVI formulation analogous to those displayed in Fig. 1. These representations are summarized concisely in Fig. 2. Specifically:
• Broadly following the alternative views available for Bayesian inference, we describe three different representations of the MFVI algorithm. The first views the mean-field approximation of the posterior as the gradient flow of a joint set of functionals, the second as the solution to a system of quasilinear partial differential equations, and the last as the stationary distribution of a diffusion process defined by a system of stochastic differential equations.