Sampling in Constrained Domains with
Orthogonal-Space Variational Gradient Descent
Ruqi Zhang
Department of Computer Science
Purdue University
ruqiz@purdue.edu
Qiang Liu
Department of Computer Science
University of Texas at Austin
lqiang@cs.texas.edu
Xin T. Tong
Department of Mathematics
National University of Singapore
mattxin@nus.edu.sg
Abstract
Sampling methods, as important inference and learning techniques, are typically designed for unconstrained domains. However, constraints are ubiquitous in machine learning problems, such as those on safety, fairness, robustness, and many other properties that must be satisfied to apply sampling results in real-life applications. Enforcing these constraints often leads to implicitly-defined manifolds, making efficient sampling with constraints very challenging. In this paper, we propose a new variational framework with a designed orthogonal-space gradient flow (O-Gradient) for sampling on a manifold $\mathcal{G}_0$ defined by general equality constraints. O-Gradient decomposes the gradient into two parts: one decreases the distance to $\mathcal{G}_0$ and the other decreases the KL divergence in the orthogonal space. While most existing manifold sampling methods require initialization on $\mathcal{G}_0$, O-Gradient does not require such prior knowledge. We prove that O-Gradient converges to the target constrained distribution at rate $\widetilde{O}(1/\text{the number of iterations})$ under mild conditions. Our proof relies on a new Stein characterization of conditional measures, which could be of independent interest. We implement O-Gradient through both Langevin dynamics and Stein variational gradient descent and demonstrate its effectiveness in various experiments, including Bayesian deep neural networks.
1 Introduction
Sampling methods, such as Markov chain Monte Carlo (MCMC) [1] and Stein variational gradient descent (SVGD) [23, 22], have been widely used for drawing samples from, or approximating, intractable distributions in machine learning (ML) problems, such as estimating Bayesian neural network posteriors [38], generating new images [33], and training energy-based models [15]. While powerful, most sampling methods can only be used in unconstrained domains or in a few special geometric spaces. This greatly limits the application of sampling to many real-life tasks.
We consider sampling from a distribution $\pi$ under an equality constraint $g(x) = 0$, where $g:\mathbb{R}^d\to\mathbb{R}$ is a general differentiable function. The domain in this case is the level set $\mathcal{G}_0 = \{x\in\mathbb{R}^d : g(x) = 0\}$, which is a submanifold of $\mathbb{R}^d$. We do not require additional information about $\mathcal{G}_0$, such as an explicit parameterization or known in-domain points, which is often demanded by previous methods [3, 35, 2, 19]. The problem defined above includes many ML applications, such as disease diagnosis with logic-rule constraints, policymaking with fairness constraints across demographic subgroups, and autonomous driving with robustness constraints to unseen scenarios.
Figure 1: Visualization of our methods. (a) The O-Gradient $v$ is formed by $v^\parallel$, which follows $\nabla g$, and $v^\perp$, which is perpendicular to $\nabla g$. (b)-(c) Applying O-Gradient to Langevin dynamics and SVGD. Both methods can approach the manifold and sample on it.
In this paper, we propose a new variational framework which transforms the above constrained sampling problem into a constrained functional minimization problem. A special gradient flow, denoted the orthogonal-space gradient flow (O-Gradient), is developed to minimize the objective. As illustrated in Figure 1a, the direction of O-Gradient $v$ can be decomposed into two parts: the first part, $v^\parallel$, drives the sampler towards the manifold $\mathcal{G}_0$ along $\nabla g$ and keeps it on $\mathcal{G}_0$ once it arrives; the second part, $v^\perp$, makes the sampler explore $\mathcal{G}_0$ following the density $\pi(x)$. We prove the convergence of O-Gradient in the continuous-time mean-field limit. O-Gradient can be applied to both Langevin dynamics and SVGD, resulting in O-Langevin and O-SVGD, respectively. As shown in Figures 1b and 1c, both methods converge to the target distribution on the manifold. In particular, O-Langevin converges along a noisy trajectory while O-SVGD converges smoothly, similar to their standard unconstrained counterparts. We empirically demonstrate the sampling performance of O-Langevin and O-SVGD across different constrained ML problems. We summarize our contributions as follows:
• We reformulate the hard-constrained sampling problem as a functional optimization problem and derive a special gradient flow, O-Gradient, to obtain the solution.
• We prove that O-Gradient converges to the target constrained distribution at rate $\widetilde{O}(1/\text{the number of iterations})$ under mild conditions. Our proof technique includes a new Stein characterization of conditional measures, which could be of independent interest.
• We implement O-Gradient through both Langevin dynamics and SVGD and demonstrate its effectiveness in various experiments, including a constrained synthetic distribution, income classification with a fairness constraint, loan classification with logic rules, and image classification with robust Bayesian deep neural networks.
2 Related Work
Sampling on Explicitly Defined Manifolds
Manifolds with special shapes, such as geometric or physical structures, can sometimes be explicitly parameterized in lower-dimensional spaces. For example, a torus embedded in $\mathbb{R}^3$ can be explicitly defined in two dimensions using polar coordinates. Variants of classical methods have been developed to sample on such manifolds, including rejection sampling [6], Langevin dynamics [35], Hamiltonian Monte Carlo (HMC) [3], and Riemannian manifold HMC [16, 28]. However, explicit parameterization is only applicable to a few special cases and cannot be used for general machine learning problems. In contrast to this line of work, our method is able to work with more general manifolds defined in the original domain $\mathbb{R}^d$.
Sampling on Implicitly Defined Manifolds
Many common applications are not endowed with simple manifolds, such as molecular dynamics [17], matrix factorization [27], and free energy calculations [34]. Motivated by these applications, sampling methods on implicitly defined manifolds have been developed. Brubaker et al. [2] proposed a family of constrained MCMC methods by applying Lagrangian mechanics to Hamiltonian dynamics. Zappa et al. [37] introduced a constrained Metropolis-Hastings (MH) algorithm with a reverse projection check to ensure reversibility.
This method was later extended to HMC [19] and to multiple projections [20]. However, the implementation and analysis of these methods often assume the algorithm starts on the manifold and never leaves it, requiring points on the manifold to be known in advance and expensive projection subroutines, such as Newton's method [2, 37, 18-20] or long-time ordinary differential equation (ODE) integration [39, 31, 14]. In contrast, our method works with distributions supported on the ambient space and thus avoids these strong assumptions, leading to a much faster update per iteration. This makes our method especially suitable for complex ML models such as deep neural networks.
Sampling with a Moment Constraint
Recently, sampling with a general moment constraint, i.e., a constraint on the expectation $\mathbb{E}_q[g]$ where $q$ is the approximating distribution, has been studied [25]. However, this type of constraint cannot guarantee that every sample satisfies $g(x) = 0$. From a technical viewpoint, the target distribution under a moment constraint is usually not singular w.r.t. $\pi$, so the problem is conceptually less challenging than the one considered in this work.
3 Preliminaries
Variational Framework
We review the derivation of Langevin dynamics and SVGD from a unified variational framework. The variational approach frames sampling as a KL divergence minimization problem: $\min_{q\in\mathcal{P}} \mathrm{KL}(q\,\|\,\pi)$, where $\mathcal{P}$ is the space of probability measures. We start from an initial distribution $q_0$ and an initial point $x_0\sim q_0$, and update $x_t$ following $dx_t = v_t(x_t)\,dt$, where $v_t:\mathbb{R}^d\to\mathbb{R}^d$ is a velocity field at time $t$. The density $q_t$ of $x_t$ then follows the Fokker-Planck equation $\frac{d}{dt}q_t = -\nabla\cdot(v_t q_t)$, and the KL divergence decreases at the rate [23]
$$\frac{d}{dt}\mathrm{KL}(q_t\,\|\,\pi) = -\mathbb{E}_{q_t}[\mathcal{A}_\pi v_t] = -\mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v_t\big], \qquad (1)$$
where $\mathcal{A}_\pi v(x) = s_\pi(x)^\top v(x) + \nabla\cdot v(x)$ is the Stein operator and $s_p = \nabla\log p$ is the score function of a distribution $p$. The optimal $v_t$ is obtained by solving an optimization problem in a Hilbert space $\mathcal{H}$,
$$\max_{v\in\mathcal{H}}\ \mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v\big] - \frac{1}{2}\|v\|_{\mathcal{H}}^2. \qquad (2)$$
This objective ensures that $v_t$ decreases the KL divergence as fast as possible.
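For completeness, the second equality in Eq. (1) follows from integration by parts; a short sketch of this standard step (assuming boundary terms vanish) is
$$\mathbb{E}_{q_t}[\mathcal{A}_\pi v] = \int q_t\,\big(s_\pi^\top v + \nabla\cdot v\big)\,dx = \int q_t\,s_\pi^\top v\,dx - \int (\nabla q_t)^\top v\,dx = \mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v\big],$$
using $\nabla q_t = q_t\,s_{q_t}$.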
Langevin Dynamics and SVGD Algorithms
Both Langevin dynamics and SVGD can be derived from this variational framework by taking $\mathcal{H}$ to be different spaces. Taking $\mathcal{H}$ to be $L^2_q$, the velocity field becomes $v_t(\cdot) = s_\pi(\cdot) - \nabla\log q_t(\cdot)$, which can be simulated by the Langevin dynamics $dx_t = s_\pi(x_t)\,dt + \sqrt{2}\,dW_t$ with $W_t$ a standard Brownian motion. After discretization with a step size $\eta > 0$, the update step of Langevin dynamics is $x_{t+1} = x_t + \mathrm{Langevin}(x_t)$, where
$$\mathrm{Langevin}(x_t) = \eta\nabla\log\pi(x_t) + \sqrt{2\eta}\,\xi_t, \qquad \xi_t\sim\mathcal{N}(0, I). \qquad (3)$$
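To make the update concrete, here is a minimal NumPy sketch of Eq. (3); the standard-Gaussian target (and hence `grad_log_pi`) is a made-up example, not from the paper:

```python
import numpy as np

def grad_log_pi(x):
    # Hypothetical target: standard Gaussian, so grad log pi(x) = -x.
    return -x

def langevin_step(x, eta, rng):
    # One discretized Langevin update, Eq. (3): eta * grad log pi(x) + sqrt(2 eta) * Gaussian noise.
    return x + eta * grad_log_pi(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(2)
for _ in range(1000):
    x = langevin_step(x, eta=1e-2, rng=rng)
```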
Taking $\mathcal{H}$ to be the reproducing kernel Hilbert space (RKHS) of a continuously differentiable kernel $k:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, the velocity field becomes $v_t(\cdot) = \mathbb{E}_{x\sim q_t}\big[k(\cdot, x)s_\pi(x) + \nabla_x k(\cdot, x)\big]$. After discretization, the update step of SVGD for particles $\{x_i\}_{i=1}^n$ is $x_{i,t+1} = x_{i,t} + \eta\cdot\mathrm{SVGD}_k(x_{i,t})$ for $i = 1, \dots, n$, where $\eta$ is a step size and
$$\mathrm{SVGD}_k(x_{i,t}) = \frac{1}{n}\sum_{j=1}^n \big[k(x_{i,t}, x_{j,t})\nabla_{x_{j,t}}\log\pi(x_{j,t}) + \nabla_{x_{j,t}}k(x_{i,t}, x_{j,t})\big]. \qquad (4)$$
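A minimal sketch of the SVGD update in Eq. (4) with an RBF kernel is given below; the fixed bandwidth `h` is an assumption for illustration, since no particular kernel is prescribed here:

```python
import numpy as np

def svgd_step(X, grad_log_pi, eta=1e-1, h=1.0):
    """One SVGD update (Eq. (4)) for particles X of shape (n, d) with an RBF kernel."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]                     # (n, n, d): x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))   # kernel matrix k(x_i, x_j)
    grads = np.stack([grad_log_pi(x) for x in X])            # (n, d): score at each particle
    grad_K = diff / h ** 2 * K[:, :, None]                   # grad_{x_j} k(x_i, x_j)
    phi = (K @ grads + grad_K.sum(axis=1)) / n               # driving term + repulsive term
    return X + eta * phi
```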
4 Main Method
In this section, we formulate the constrained sampling problem as a constrained optimization through the variational lens in Section 4.1, and introduce a new gradient flow to solve the problem in Section 4.2. We apply this general framework to Langevin dynamics and SVGD, leading to two practical algorithms in Section 4.3.
4.1 Constrained Variational Optimization
Recall that our goal is to draw samples according to the probability of $\pi$, but restricted to a low-dimensional manifold specified by an equality constraint: $\mathcal{G}_0 := \{x\in\mathbb{R}^d : g(x) = 0\}$. Similar to the standard variational framework in Section 3, we can formulate the problem as a constrained optimization in the space of probability measures:
$$\min_{q\in\mathcal{P}} \mathrm{KL}(q\,\|\,\pi), \quad \text{s.t.}\quad q(g(x) = 0) = 1.$$
However, this problem is in general ill-posed: when $q$ satisfies the constraint, $q$ is singular w.r.t. $\pi$, so both the density ratio $dq/d\pi$ and $\mathrm{KL}(q\,\|\,\pi)$ are not well-defined. Although the problem is ill-posed, we can still derive a KL-gradient flow to solve it by considering $q$ supported on $\mathbb{R}^d$. The intuition of the derivation is that, in addition to minimizing the objective as in Eq. (1), the velocity field $v_t$ should also push $q$ towards $\mathcal{G}_0$ to satisfy the constraint. Surprisingly, the distribution $q_t$ following such a gradient flow indeed converges to the target distribution on the manifold. We focus on the derivation of the gradient flow here and leave its rigorous justification to Section 5.
4.2 Orthogonal-Space Gradient Flow (O-Gradient)
As mentioned above, besides maximizing the decay of $\mathrm{KL}(q\,\|\,\pi)$, the velocity field $v_t$ also needs to drive $q$ towards the manifold satisfying $g(x) = 0$. In particular, we add to (2) a requirement that the value of $g(x)$ is driven towards $0$ at a given rate:
$$v_t = \arg\max_{v\in\mathcal{H}}\ \mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v\big] - \frac{1}{2}\|v\|_{\mathcal{H}}^2, \quad\text{s.t.}\quad v_t(x)^\top\nabla g(x) = -\psi(g(x)), \qquad (5)$$
where $\psi$ is an increasing odd function. To see the effect of the constraint, consider three cases:
• When $g(x) > 0$, then $v_t(x)^\top\nabla g(x) = -\psi(g(x)) < 0$, which ensures that $v_t$ strictly decreases $g$.
• When $g(x) < 0$, then $v_t(x)^\top\nabla g(x) = -\psi(g(x)) > 0$, which ensures that $v_t$ strictly increases $g$.
• When $g(x) = 0$, then $v_t(x)^\top\nabla g(x) = -\psi(g(x)) = 0$, which ensures that $x$ stays on the manifold $\mathcal{G}_0$.
In all three cases, along the flow $dx_t = v_t(x_t)\,dt$ we have $\frac{d}{dt}g(x_t) = \nabla g(x_t)^\top v_t(x_t) = -\psi(g(x_t))$, so $|g(x_t)|$ decreases monotonically towards $0$.
We choose $\psi(x) = \alpha\,\mathrm{sign}(x)|x|^{1+\beta}$ with $\alpha > 0$ and $\beta\in(0,1]$ in this paper, because it is one of the simplest functions satisfying these requirements, and we found that it works well in theory and in practice.
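As a quick numerical illustration of this choice (with arbitrarily chosen $\alpha$, $\beta$, and initial value, which are not from the paper), integrating $\frac{d}{dt}g = -\psi(g)$ with Euler steps shows $g$ decaying towards $0$:

```python
import numpy as np

def psi(x, alpha=1.0, beta=0.5):
    # psi(x) = alpha * sign(x) * |x|^(1 + beta): increasing, odd, and psi(0) = 0.
    return alpha * np.sign(x) * np.abs(x) ** (1 + beta)

# Euler-integrate dg/dt = -psi(g), the decay imposed by the constraint in Eq. (5).
g, dt = 2.0, 1e-2
for _ in range(2000):
    g -= dt * psi(g)
print(g)  # close to 0: the flow pushes g(x_t) towards the manifold G_0
```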
In summary, the objective in Eq. (5) is the same as Eq. (2) in the standard variational framework, while the constraint ensures that $v_t$ pushes $q_t$ towards the manifold and keeps it there. It is easy to see that the solution of the above problem can be decomposed as $v_t = v^\parallel + v^\perp$, where
$$v^\parallel(x) = -\frac{\psi(g(x))\,\nabla g(x)}{\|\nabla g(x)\|^2}, \qquad v^\perp \perp \nabla g. \qquad (6)$$
We use $f \perp \nabla g$ to denote (pointwise) orthogonality: $f(x)^\top\nabla g(x) = 0$ for all $x\in\mathbb{R}^d$. Note that $v^\parallel$ is parallel to $\nabla g$, so it remains to determine $v^\perp$. Note that $v^\perp$ can be represented as the projection of an arbitrary function $u$ onto the orthogonal space of $\nabla g$:
$$v^\perp = D(x)u(x), \quad\text{where}\quad D(x) := I - \frac{\nabla g(x)\nabla g(x)^\top}{\|\nabla g(x)\|^2}. \qquad (7)$$
The projection operator $D$ ensures that $v^\perp \perp \nabla g$ holds for any $u$; the optimal $u$ is obtained by maximizing the unconstrained objective in Eq. (2),
$$\max_{v^\perp}\ \mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top(v^\parallel + v^\perp)\big] - \frac{1}{2}\|v^\perp\|_{\mathcal{H}}^2 \;\Longleftrightarrow\; \max_{u}\ \mathbb{E}_{q_t}\big[(D(s_\pi - s_{q_t}))^\top u\big] - \frac{1}{2}\|Du\|_{\mathcal{H}}^2. \qquad (8)$$
The optimal solution for $u$ depends on the choice of the space $\mathcal{H}$, which we discuss in Section 4.3.
Overall, we obtain the velocity field $v_t$ by first formulating a constrained optimization and then transforming it into an unconstrained optimization via orthogonal decomposition. We call $v_t$ the Orthogonal-Space Gradient Flow (O-Gradient); it drives $q_t$ to the target distribution using only the knowledge of $g$, requiring no explicit representation of the manifold $\mathcal{G}_0$.
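To make the decomposition concrete, a minimal sketch of $v^\parallel$ in Eq. (6) and the projection $D(x)$ in Eq. (7) is shown below; the unit-sphere constraint $g(x) = \|x\|^2 - 1$ is a made-up example, and $\psi$ is the function chosen above:

```python
import numpy as np

def psi(x, alpha=1.0, beta=0.5):
    return alpha * np.sign(x) * np.abs(x) ** (1 + beta)

def g(x):
    # Hypothetical constraint: unit sphere, g(x) = ||x||^2 - 1.
    return np.sum(x ** 2) - 1.0

def grad_g(x):
    return 2 * x

def v_parallel(x):
    # Eq. (6): component along grad g that drives g(x) towards 0.
    gx = grad_g(x)
    return -psi(g(x)) * gx / np.dot(gx, gx)

def D(x):
    # Eq. (7): projection onto the orthogonal complement of grad g(x).
    gx = grad_g(x)
    return np.eye(x.shape[0]) - np.outer(gx, gx) / np.dot(gx, gx)
```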
4.3 Practical Algorithms
After deriving O-Gradient for a general Hilbert space $\mathcal{H}$, we explain how to implement it using SVGD and Langevin dynamics. The resulting O-SVGD and O-Langevin are outlined in Algorithm 1. At a high level, our algorithms keep the original SVGD or Langevin dynamics movement in the directions perpendicular to $\nabla g$, while pushing the density towards $\mathcal{G}_0$ along the $\nabla g$ direction.
O-SVGD
We apply O-Gradient to SVGD first since it is fairly straightforward. Recall that $v^\parallel$ can be obtained using Eq. (6). We solve Eq. (8) to get $v^\perp$ via the following lemma.
Lemma 4.1. When $\mathcal{H}$ is an RKHS with kernel $k:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, a solution to Eq. (8) is $v^\perp = Du(x) = \mathbb{E}_{y\sim q_t}\big[k^\perp(x, y)s_\pi(y) + \nabla_y\cdot k^\perp(x, y)\big]$ with the orthogonal-space kernel $k^\perp(x, y) = k(x, y)D(x)D(y)$. Here $k^\perp:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^{d\times d}$ is matrix-valued, and $(\nabla_y\cdot k^\perp)_i = \sum_j \partial_{y_j}k^\perp_{ij}(x, y)$.

The combined velocity is then obtained using the original SVGD with the kernel $k^\perp$,
$$v_t(x) = v^\parallel(x) + \int k^\perp(x, y)\,s_\pi(y)\,q_t(y)\,dy + \int \nabla_y\cdot\big(k^\perp(x, y)\big)\,q_t(y)\,dy.$$
Numerically, we iteratively update a set of $n$ particles $\{x_{i,t}\}_{i=1}^n \subset \mathbb{R}^d$, such that their empirical distribution $\frac{1}{n}\sum_{i=1}^n \delta_{x_{i,t}}$ approximates $q_t$ in a proper sense as the step size $\eta\to 0$ and the number of particles $n\to+\infty$. Similar to the standard SVGD update in Eq. (4), the update of O-SVGD is $x_{i,t+1} = x_{i,t} + \eta\cdot\big(v^\parallel(x_{i,t}) + \mathrm{SVGD}_{k^\perp}(x_{i,t})\big)$, where
$$\mathrm{SVGD}_{k^\perp}(x_{i,t}) = \frac{1}{n}\sum_{j=1}^n \big[k^\perp(x_{i,t}, x_{j,t})\nabla_{x_{j,t}}\log\pi(x_{j,t}) + \nabla_{x_{j,t}}\cdot k^\perp(x_{i,t}, x_{j,t})\big]. \qquad (9)$$
It is worth noting that $\mathrm{SVGD}_{k^\perp}$ is identical to Eq. (4) but with the kernel $k^\perp$ rather than $k$.
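Below is a minimal sketch of the O-SVGD update in Eq. (9), reusing `D` and `v_parallel` from the sketch above (unit-sphere constraint, RBF kernel) and approximating the divergence term $\nabla_{x_j}\cdot k^\perp$ by finite differences for brevity; this is an illustration, not the paper's Algorithm 1:

```python
import numpy as np

def o_svgd_step(X, grad_log_pi, eta=1e-1, h=1.0, eps=1e-4):
    """One O-SVGD update (Eq. (9)): v_parallel plus SVGD with kernel k_perp(x, y) = k(x, y) D(x) D(y)."""
    n, d = X.shape
    Ds = np.stack([D(x) for x in X])                   # projections D(x_j)
    grads = np.stack([grad_log_pi(x) for x in X])
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * h ** 2))
    phi = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            Kp = k(X[i], X[j]) * Ds[i] @ Ds[j]         # k_perp(x_i, x_j), a d x d matrix
            div = np.zeros(d)                          # (div)_a = sum_m d/dx_{j,m} k_perp_{a,m}
            for m in range(d):
                xj = X[j].copy()
                xj[m] += eps
                Kp_m = k(X[i], xj) * Ds[i] @ D(xj)
                div += (Kp_m[:, m] - Kp[:, m]) / eps
            phi[i] += Kp @ grads[j] + div
    phi /= n
    return X + eta * (np.stack([v_parallel(x) for x in X]) + phi)
```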
O-Langevin
The Langevin implementation requires some additional derivation. First, with $\mathcal{H} = L^2_q$, we can show that the optimal velocity field is $v_t(x) = \phi(x) - D(x)s_{q_t}(x)$, where $\phi(x) = v^\parallel(x) + D(x)s_\pi(x)$. This leads to the density flow
$$\frac{d}{dt}q_t(x) = -\nabla\cdot\big(\phi(x)q_t(x)\big) + \nabla\cdot\big(D(x)\nabla q_t(x)\big). \qquad (10)$$
Next, we design a stochastic differential equation (SDE) whose Fokker-Planck equation (FPE) is identical to Eq. (10). The result is given by the following theorem.
Theorem 4.2. Consider the vector field $r(x) = \nabla\cdot D(x)$, with component-wise form $r_i(x) = \sum_{j=1}^d \partial_{x_j} D_{i,j}(x)$, where $x_j$ denotes the $j$th dimension of $x$. Consider the SDE
$$dx_t = \big(\phi(x_t) + r(x_t)\big)\,dt + \sqrt{2D(x_t)}\,dW_t \qquad (11)$$
with $\phi(x) = v^\parallel(x) + D(x)s_\pi(x)$; then its FPE is identical to Eq. (10). Moreover, i) the value $g(x_t)$ has the deterministic decay $\frac{d}{dt}g(x_t) = -\psi(g(x_t))$; ii) for any $f$ with $f\perp\nabla g$, the generator of $x_t$ satisfies the Langevin equation $\frac{d}{dt}\mathbb{E}[f(x_t)\,|\,x_0 = x]\big|_{t=0} := \mathcal{L}f(x) = \nabla f(x)^\top s_\pi(x) + \Delta f(x)$.
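As a brief sanity check on the divergence correction $r$ (a step filled in here; the full argument is in the paper's proofs): since $D(x)$ is a symmetric projection, the diffusion matrix of Eq. (11) is $\frac{1}{2}\sqrt{2D}\,\sqrt{2D}^\top = D$, so its FPE reads
$$\partial_t q_t = -\nabla\cdot\big((\phi + r)\,q_t\big) + \sum_{i,j}\partial_i\partial_j\big(D_{ij}\,q_t\big) = -\nabla\cdot(\phi\,q_t) - \nabla\cdot(r\,q_t) + \nabla\cdot(r\,q_t) + \nabla\cdot(D\nabla q_t),$$
which recovers Eq. (10), using $\sum_j \partial_j(D_{ij}\,q_t) = r_i\,q_t + (D\nabla q_t)_i$.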
It is worth pointing out that if $g(x_0) = 0$, then $g(x_t)\equiv 0$, that is, $x_t$ always stays on $\mathcal{G}_0$. In this case, Eq. (11) degenerates to the manifold Langevin dynamics studied in previous work [9, 35]. However, our SDE does not have this requirement, since it is still well-defined off $\mathcal{G}_0$. This is especially useful for numerical implementation, leading to a fast algorithm without expensive projection steps.
Similar to the standard Langevin dynamics update in Eq. (3), the update rule of O-Langevin is $x_{t+1} = x_t + \eta\cdot v^\parallel(x_t) + \mathrm{Langevin}^\perp(x_t)$, where
$$\mathrm{Langevin}^\perp(x_t) = \eta D(x_t)s_\pi(x_t) + \eta r(x_t) + \sqrt{2\eta D(x_t)}\,\xi_t, \qquad \xi_t\sim\mathcal{N}(0, I). \qquad (12)$$
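A minimal sketch of the O-Langevin update in Eq. (12), again reusing `D`, `v_parallel`, and the unit-sphere example from the sketches above; $r(x) = \nabla\cdot D(x)$ is approximated here by finite differences, whereas in practice it can be computed with automatic differentiation:

```python
import numpy as np

def div_D(x, eps=1e-4):
    # r_i(x) = sum_j d/dx_j D_ij(x), approximated by central finite differences.
    d = x.shape[0]
    out = np.zeros(d)
    for j in range(d):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps
        xm[j] -= eps
        out += (D(xp)[:, j] - D(xm)[:, j]) / (2 * eps)
    return out

def o_langevin_step(x, grad_log_pi, eta, rng):
    # Eq. (12): projected drift + divergence correction + projected noise, plus the v_parallel pull.
    Dx = D(x)
    drift = eta * (Dx @ grad_log_pi(x) + div_D(x))
    noise = np.sqrt(2 * eta) * (Dx @ rng.standard_normal(x.shape))  # ~ N(0, 2 eta D) since D^2 = D
    return x + eta * v_parallel(x) + drift + noise
```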
5 Theoretical Analysis
We theoretically justify the convergence of O-Gradient in this section. To do so, we first describe the target measure as a conditioned measure $\Pi_0$, then derive its associated orthogonal-space Fisher divergence, and finally prove that O-Gradient converges to $\Pi_0$.