
Later, this method was extended to HMC [19] and to multiple projections [20]. However, the implementation and analysis of these methods often assume that the algorithm starts on the manifold and never leaves it, which requires previously known points on the manifold and expensive projection subroutines such as Newton's method [2, 37, 18–20] or solving a long-time ordinary differential equation (ODE) [39, 31, 14]. In contrast, our method works with distributions supported on the ambient space and thus dispenses with these strong assumptions, leading to a much faster per-iteration update. This makes our method especially suitable for complex ML models such as deep neural networks.
Sampling with a Moment Constraint
Recently, sampling under a general moment constraint, such as $\mathbb{E}_q[g] \le 0$ where $q$ is the approximating distribution, has been studied [25]. However, this type of constraint cannot guarantee that every sample satisfies $g(x) = 0$; for instance, $q = \mathcal{N}(0, 1)$ satisfies $\mathbb{E}_q[x] \le 0$, yet almost no sample drawn from $q$ satisfies $x = 0$. From a technical viewpoint, the target distribution under a moment constraint is usually not singular w.r.t. $\pi$, so the problem is conceptually less challenging than the one considered in this work.
3 Preliminaries
Variational Framework
We review the derivation of Langevin dynamics and SVGD from a unified variational framework. The variational approach frames sampling as a KL-divergence minimization problem, $\min_{q \in \mathcal{P}} \mathrm{KL}(q \,\|\, \pi)$, where $\mathcal{P}$ is the space of probability measures. We start from an initial distribution $q_0$ and an initial point $x_0 \sim q_0$, and update $x_t$ following $\mathrm{d}x_t = v_t(x_t)\,\mathrm{d}t$, where $v_t \colon \mathbb{R}^d \to \mathbb{R}^d$ is a velocity field at time $t$. The density $q_t$ of $x_t$ then follows the Fokker–Planck equation $\mathrm{d}q_t/\mathrm{d}t = -\nabla \cdot (v_t q_t)$, and the KL divergence decreases at the rate [23]:
\[
-\frac{\mathrm{d}}{\mathrm{d}t}\,\mathrm{KL}(q_t \,\|\, \pi) \;=\; \mathbb{E}_{q_t}[\mathcal{A}_\pi v_t] \;=\; \mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v_t\big], \tag{1}
\]
where $\mathcal{A}_\pi v(x) = s_\pi(x)^\top v(x) + \nabla \cdot v(x)$ is the Stein operator and $s_p = \nabla \log p$ is the score function of a distribution $p$; the second equality in (1) follows from integration by parts, $\mathbb{E}_{q_t}[\nabla \cdot v] = -\mathbb{E}_{q_t}[s_{q_t}^\top v]$, under mild decay conditions. The optimal $v_t$ is obtained by solving an optimization problem in a Hilbert space $\mathcal{H}$:
\[
\max_{v \in \mathcal{H}} \; \mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v\big] - \frac{1}{2}\,\|v\|_{\mathcal{H}}^2. \tag{2}
\]
This objective ensures that $v_t$ decreases the KL divergence in (1) as fast as possible.
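Since (2) is a linear functional of $v$ minus a quadratic penalty, its maximizer is the Riesz representer of $v \mapsto \mathbb{E}_{q_t}[(s_\pi - s_{q_t})^\top v]$ in $\mathcal{H}$. As a quick check of the $L^2_{q_t}$ case used below (our sketch, not the paper's derivation), completing the square gives
\[
\mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v\big] - \frac{1}{2}\,\mathbb{E}_{q_t}\big[\|v\|^2\big]
= \frac{1}{2}\,\mathbb{E}_{q_t}\big[\|s_\pi - s_{q_t}\|^2\big] - \frac{1}{2}\,\mathbb{E}_{q_t}\big[\|v - (s_\pi - s_{q_t})\|^2\big],
\]
which is maximized at $v = s_\pi - s_{q_t}$.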
Langevin Dynamics and SVGD Algorithms
Both Langevin dynamics and SVGD can be derived from this variational framework by taking $\mathcal{H}$ to be different spaces. Taking $\mathcal{H}$ to be $L^2_{q_t}$, the velocity field becomes $v_t(\cdot) = s_\pi(\cdot) - s_{q_t}(\cdot)$, which can be simulated by the Langevin dynamics $\mathrm{d}x_t = s_\pi(x_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t$ with $W_t$ a standard Brownian motion. After discretization with a step size $\eta > 0$, the update step of Langevin dynamics is $x_{t+1} = x_t + \mathrm{Langevin}(x_t)$, where
\[
\mathrm{Langevin}(x_t) = \eta \nabla \log \pi(x_t) + \sqrt{2\eta}\,\xi_t, \qquad \xi_t \sim \mathcal{N}(0, I). \tag{3}
\]
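For concreteness, here is a minimal NumPy sketch of the update in Eq. (3); the names `langevin_step` and `grad_log_pi` are our own illustration rather than the paper's code:

```python
import numpy as np

def langevin_step(x, grad_log_pi, eta):
    """One (unadjusted) Langevin update, Eq. (3):
    x <- x + eta * grad_log_pi(x) + sqrt(2 * eta) * xi, with xi ~ N(0, I)."""
    xi = np.random.randn(*x.shape)
    return x + eta * grad_log_pi(x) + np.sqrt(2.0 * eta) * xi
```

For example, with the standard Gaussian target $\pi = \mathcal{N}(0, I)$ one would pass `grad_log_pi = lambda x: -x`.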
Taking $\mathcal{H}$ to be the reproducing kernel Hilbert space (RKHS) of a continuously differentiable kernel $k \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, the velocity field becomes $v_t(\cdot) = \mathbb{E}_{x \sim q_t}\big[k(\cdot, x)\,s_\pi(x) + \nabla_x k(\cdot, x)\big]$. After discretization, the update step of SVGD for particles $\{x_i\}_{i=1}^n$ is $x_{i,t+1} = x_{i,t} + \eta \cdot \mathrm{SVGD}_k(x_{i,t})$ for $i = 1, \ldots, n$, where $\eta$ is a step size and
\[
\mathrm{SVGD}_k(x_{i,t}) = \frac{1}{n} \sum_{j=1}^{n} \Big[ k(x_{i,t}, x_{j,t})\, \nabla_{x_{j,t}} \log \pi(x_{j,t}) + \nabla_{x_{j,t}} k(x_{i,t}, x_{j,t}) \Big]. \tag{4}
\]
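As an illustrative companion to Eq. (4), the following is a sketch of one SVGD iteration with an RBF kernel; the kernel choice and the names `svgd_step`, `rbf_kernel_and_grad`, and `grad_log_pi` are our assumptions (the paper only requires a continuously differentiable $k$):

```python
import numpy as np

def rbf_kernel_and_grad(X, h):
    """RBF kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 h^2)) and its gradient w.r.t. x_j."""
    diffs = X[:, None, :] - X[None, :, :]                      # (n, n, d): diffs[i, j] = x_i - x_j
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / (2.0 * h ** 2))  # (n, n) kernel matrix
    grad_K = K[:, :, None] * diffs / h ** 2                    # (n, n, d): grad_{x_j} k(x_i, x_j)
    return K, grad_K

def svgd_step(X, grad_log_pi, eta, h=1.0):
    """One SVGD update, Eq. (4), applied to all n particles (rows of X) at once."""
    n = X.shape[0]
    K, grad_K = rbf_kernel_and_grad(X, h)
    scores = np.stack([grad_log_pi(x) for x in X])  # (n, d): score at each particle
    phi = (K @ scores + grad_K.sum(axis=1)) / n     # (n, d): driving term + repulsive term
    return X + eta * phi
```

The `grad_K.sum(axis=1)` term is the repulsive force in Eq. (4) that keeps the particles from collapsing onto a single mode of $\pi$.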
4 Main Method
In this section, we formulate the constrained sampling problem as a constrained optimization through the variational lens in Section 4.1, and introduce a new gradient flow to solve it in Section 4.2. We then apply this general framework to Langevin dynamics and SVGD, leading to two practical algorithms in Section 4.3.
4.1 Constrained Variational Optimization
Recall that our goal is to draw samples according to the probability $\pi$, but restricted to a low-dimensional manifold specified by an equality constraint, $\mathcal{G}_0 := \{x \in \mathbb{R}^d : g(x) = 0\}$. Similar to the standard