Sampling in Constrained Domains with
Orthogonal-Space Variational Gradient Descent
Ruqi Zhang
Department of Computer Science
Purdue University
ruqiz@purdue.edu
Qiang Liu
Department of Computer Science
University of Texas at Austin
lqiang@cs.texas.edu
Xin T. Tong
Department of Mathematics
National University of Singapore
mattxin@nus.edu.sg
Abstract
Sampling methods, as important inference and learning techniques, are typically designed for unconstrained domains. However, constraints are ubiquitous in machine learning problems, such as those on safety, fairness, robustness, and many other properties that must be satisfied to apply sampling results in real-life applications. Enforcing these constraints often leads to implicitly-defined manifolds, making efficient sampling with constraints very challenging. In this paper, we propose a new variational framework with a designed orthogonal-space gradient flow (O-Gradient) for sampling on a manifold $\mathcal{G}_0$ defined by general equality constraints. O-Gradient decomposes the gradient into two parts: one decreases the distance to $\mathcal{G}_0$ and the other decreases the KL divergence in the orthogonal space. While most existing manifold sampling methods require initialization on $\mathcal{G}_0$, O-Gradient does not require such prior knowledge. We prove that O-Gradient converges to the target constrained distribution at rate $\widetilde{O}(1/\text{the number of iterations})$ under mild conditions. Our proof relies on a new Stein characterization of conditional measures, which could be of independent interest. We implement O-Gradient through both Langevin dynamics and Stein variational gradient descent and demonstrate its effectiveness in various experiments, including Bayesian deep neural networks.
1 Introduction
Sampling methods, such as Markov chain Monte Carlo (MCMC) [1] and Stein variational gradient descent (SVGD) [23, 22], have been widely used for drawing samples from, or approximating, intractable distributions in machine learning (ML) problems, such as estimating Bayesian neural network posteriors [38], generating new images [33], and training energy-based models [15]. While powerful, most sampling methods can only be used in unconstrained domains or in a few special geometric spaces. This greatly limits the application of sampling to many real-life tasks.
We consider sampling from a distribution $\pi$ under an equality constraint $g(x) = 0$, where $g:\mathbb{R}^d\to\mathbb{R}$ is a general differentiable function. The domain in this case is the level set $\mathcal{G}_0 = \{x\in\mathbb{R}^d : g(x) = 0\}$, which is a submanifold of $\mathbb{R}^d$. We do not require additional information about $\mathcal{G}_0$, such as an explicit parameterization or known in-domain points, which is often demanded by previous methods [3, 35, 2, 19]. The problem defined above includes many ML applications, such as disease diagnosis with logic-rule constraints, policymaking with fairness constraints across demographic subgroups, and autonomous driving with robustness constraints to unseen scenarios.
Figure 1: Visualization of our methods. (a) The O-Gradient $v$ is formed by $v^\parallel$, which follows $\nabla g$, and $v^\perp$, which is perpendicular to $\nabla g$. (b)-(c) Applying O-Gradient to Langevin dynamics and SVGD. Both methods can approach the manifold and sample on it.
In this paper, we propose a new variational framework which transforms the above constrained sampling problem into a constrained functional minimization problem. A special gradient flow, denoted the orthogonal-space gradient flow (O-Gradient), is developed to minimize the objective. As illustrated in Figure 1a, the direction of O-Gradient $v$ can be decomposed into two parts: the first part, $v^\parallel$, drives the sampler towards the manifold $\mathcal{G}_0$ along $\nabla g$ and keeps it on $\mathcal{G}_0$ once it arrives; the second part, $v^\perp$, makes the sampler explore $\mathcal{G}_0$ following the density $\pi(x)$. We prove the convergence of O-Gradient in the continuous-time mean-field limit. O-Gradient can be applied to both Langevin dynamics and SVGD, resulting in O-Langevin and O-SVGD, respectively. As shown in Figures 1b and 1c, both methods converge to the target distribution on the manifold. In particular, O-Langevin converges along a noisy trajectory while O-SVGD converges smoothly, similar to their standard unconstrained counterparts. We empirically demonstrate the sampling performance of O-Langevin and O-SVGD across different constrained ML problems. We summarize our contributions as follows:
• We reformulate the hard-constrained sampling problem as a functional optimization problem and derive a special gradient flow, O-Gradient, to obtain the solution.
• We prove that O-Gradient converges to the target constrained distribution at rate $\widetilde{O}(1/\text{the number of iterations})$ under mild conditions. Our proof technique includes a new Stein characterization of conditional measures, which could be of independent interest.
• We implement O-Gradient through both Langevin dynamics and SVGD and demonstrate its effectiveness in various experiments, including a constrained synthetic distribution, income classification with a fairness constraint, loan classification with logic rules, and image classification with robust Bayesian deep neural networks.
2 Related Work
Sampling on Explicitly Defined Manifolds
Manifolds with special shapes, such as geometric or physical structures, can sometimes be explicitly parameterized in lower-dimensional spaces. For example, a torus embedded in $\mathbb{R}^3$ can be explicitly defined in two dimensions using polar coordinates. Variants of classical methods have been developed to sample on such manifolds, including rejection sampling [6], Langevin dynamics [35], Hamiltonian Monte Carlo (HMC) [3], and Riemannian manifold HMC [16, 28]. However, explicit parameterization is only applicable to a few special cases and cannot be used for general machine learning problems. In contrast to this line of work, our method is able to work with more general manifolds defined in the original domain $\mathbb{R}^d$.
Sampling on Implicitly Defined Manifolds
Many common applications are not endowed with simple manifolds, such as molecular dynamics [17], matrix factorization [27], and free energy calculations [34]. Motivated by these applications, sampling methods on implicitly defined manifolds have been developed. Brubaker et al. [2] proposed a family of constrained MCMC methods by applying Lagrangian mechanics to Hamiltonian dynamics. Zappa et al. [37] introduced a constrained Metropolis-Hastings (MH) algorithm with a reverse projection check to ensure reversibility.
This method was later extended to HMC [19] and to multiple projections [20]. However, the implementation and analysis of these methods often assume the algorithm starts on the manifold and never leaves it, requiring points on the manifold to be known in advance and expensive projection subroutines, such as Newton's method [2, 37, 18-20] or long-time ordinary differential equation (ODE) integration [39, 31, 14]. In contrast, our method works with distributions supported on the ambient space and thus avoids these strong assumptions, leading to a much faster update per iteration. This makes our method especially suitable for complex ML models such as deep neural networks.
Sampling with a Moment Constraint
Recently, sampling with a general moment constraint, i.e., a constraint on the expectation $\mathbb{E}_q[g]$ where $q$ is the approximating distribution, has been studied [25]. However, this type of constraint cannot guarantee that every sample satisfies $g(x) = 0$. From a technical viewpoint, the target distribution under a moment constraint is usually not singular w.r.t. $\pi$, so the problem is conceptually less challenging than the one considered in this work.
3 Preliminaries
Variational Framework
We review the derivation of Langevin dynamics and SVGD from a unified variational framework. The variational approach frames sampling as a KL divergence minimization problem: $\min_{q\in\mathcal{P}} \mathrm{KL}(q\,\|\,\pi)$, where $\mathcal{P}$ is the space of probability measures. We start from an initial distribution $q_0$ and an initial point $x_0\sim q_0$, and update $x_t$ following $dx_t = v_t(x_t)\,dt$, where $v_t:\mathbb{R}^d\to\mathbb{R}^d$ is a velocity field at time $t$. The density $q_t$ of $x_t$ then follows the Fokker-Planck equation $\frac{d}{dt}q_t = -\nabla\cdot(v_t q_t)$, and the KL divergence decreases at the rate [23]
$$\frac{d}{dt}\mathrm{KL}(q_t\,\|\,\pi) = -\mathbb{E}_{q_t}[\mathcal{A}_\pi v_t] = -\mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v_t\big], \qquad (1)$$
where $\mathcal{A}_\pi v(x) = s_\pi(x)^\top v(x) + \nabla\cdot v(x)$ is the Stein operator and $s_p = \nabla\log p$ is the score function of a distribution $p$. The optimal $v_t$ is obtained by solving an optimization problem in a Hilbert space $\mathcal{H}$,
$$\max_{v\in\mathcal{H}}\ \mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v\big] - \frac{1}{2}\|v\|_{\mathcal{H}}^2. \qquad (2)$$
This objective ensures that $v_t$ decreases the KL divergence as fast as possible.
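For completeness, the second equality in Eq. (1) follows from integration by parts; a short sketch of this standard step (assuming boundary terms vanish) is
$$\mathbb{E}_{q_t}[\mathcal{A}_\pi v] = \int q_t\,\big(s_\pi^\top v + \nabla\cdot v\big)\,dx = \int q_t\,s_\pi^\top v\,dx - \int (\nabla q_t)^\top v\,dx = \mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v\big],$$
using $\nabla q_t = q_t\,s_{q_t}$.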
Langevin Dynamics and SVGD Algorithms
Both Langevin dynamics and SVGD can be derived from this variational framework by taking $\mathcal{H}$ to be different spaces. Taking $\mathcal{H}$ to be $L^2_q$, the velocity field becomes $v_t(\cdot) = s_\pi(\cdot) - \nabla\log q_t(\cdot)$, which can be simulated by the Langevin dynamics $dx_t = s_\pi(x_t)\,dt + \sqrt{2}\,dW_t$ with $W_t$ a standard Brownian motion. After discretization with a step size $\eta > 0$, the update step of Langevin dynamics is $x_{t+1} = x_t + \mathrm{Langevin}(x_t)$, where
$$\mathrm{Langevin}(x_t) = \eta\nabla\log\pi(x_t) + \sqrt{2\eta}\,\xi_t, \qquad \xi_t\sim\mathcal{N}(0, I). \qquad (3)$$
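To make the update concrete, here is a minimal NumPy sketch of Eq. (3); the standard-Gaussian target (and hence `grad_log_pi`) is a made-up example, not from the paper:

```python
import numpy as np

def grad_log_pi(x):
    # Hypothetical target: standard Gaussian, so grad log pi(x) = -x.
    return -x

def langevin_step(x, eta, rng):
    # One discretized Langevin update, Eq. (3): eta * grad log pi(x) + sqrt(2 eta) * Gaussian noise.
    return x + eta * grad_log_pi(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(2)
for _ in range(1000):
    x = langevin_step(x, eta=1e-2, rng=rng)
```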
Taking $\mathcal{H}$ to be the reproducing kernel Hilbert space (RKHS) of a continuously differentiable kernel $k:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, the velocity field becomes $v_t(\cdot) = \mathbb{E}_{x\sim q_t}\big[k(\cdot, x)s_\pi(x) + \nabla_x k(\cdot, x)\big]$. After discretization, the update step of SVGD for particles $\{x_i\}_{i=1}^n$ is $x_{i,t+1} = x_{i,t} + \eta\cdot\mathrm{SVGD}_k(x_{i,t})$ for $i = 1, \dots, n$, where $\eta$ is a step size and
$$\mathrm{SVGD}_k(x_{i,t}) = \frac{1}{n}\sum_{j=1}^n \big[k(x_{i,t}, x_{j,t})\nabla_{x_{j,t}}\log\pi(x_{j,t}) + \nabla_{x_{j,t}}k(x_{i,t}, x_{j,t})\big]. \qquad (4)$$
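A minimal sketch of the SVGD update in Eq. (4) with an RBF kernel is given below; the fixed bandwidth `h` is an assumption for illustration, since no particular kernel is prescribed here:

```python
import numpy as np

def svgd_step(X, grad_log_pi, eta=1e-1, h=1.0):
    """One SVGD update (Eq. (4)) for particles X of shape (n, d) with an RBF kernel."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]                     # (n, n, d): x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))   # kernel matrix k(x_i, x_j)
    grads = np.stack([grad_log_pi(x) for x in X])            # (n, d): score at each particle
    grad_K = diff / h ** 2 * K[:, :, None]                   # grad_{x_j} k(x_i, x_j)
    phi = (K @ grads + grad_K.sum(axis=1)) / n               # driving term + repulsive term
    return X + eta * phi
```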
4 Main Method
In this section, we formulate the constrained sampling problem as a constrained optimization through the variational lens in Section 4.1, and introduce a new gradient flow to solve the problem in Section 4.2. We apply this general framework to Langevin dynamics and SVGD, leading to two practical algorithms in Section 4.3.
4.1 Constrained Variational Optimization
Recall that our goal is to draw samples according to the probability of $\pi$, but restricted to a low-dimensional manifold specified by an equality constraint: $\mathcal{G}_0 := \{x\in\mathbb{R}^d : g(x) = 0\}$. Similar to the standard variational framework in Section 3, we can formulate the problem as a constrained optimization in the space of probability measures:
$$\min_{q\in\mathcal{P}} \mathrm{KL}(q\,\|\,\pi), \quad \text{s.t.}\quad q(g(x) = 0) = 1.$$
However, this problem is in general ill-posed: when $q$ satisfies the constraint, $q$ is singular w.r.t. $\pi$, so both the density ratio $dq/d\pi$ and $\mathrm{KL}(q\,\|\,\pi)$ are not well-defined. Although the problem is ill-posed, we can still derive a KL-gradient flow to solve it by considering $q$ supported on $\mathbb{R}^d$. The intuition of the derivation is that, in addition to minimizing the objective as in Eq. (1), the velocity field $v_t$ should also push $q$ towards $\mathcal{G}_0$ to satisfy the constraint. Surprisingly, the distribution $q_t$ following such a gradient flow indeed converges to the target distribution on the manifold. We focus on the derivation of the gradient flow here and leave its rigorous justification to Section 5.
4.2 Orthogonal-Space Gradient Flow (O-Gradient)
As mentioned above, besides maximizing the decay of $\mathrm{KL}(q\,\|\,\pi)$, the velocity field $v_t$ also needs to drive $q$ towards the manifold satisfying $g(x) = 0$. In particular, we add to (2) a requirement that the value of $g(x)$ is driven towards $0$ at a given rate:
$$v_t = \arg\max_{v\in\mathcal{H}}\ \mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top v\big] - \frac{1}{2}\|v\|_{\mathcal{H}}^2, \quad\text{s.t.}\quad v_t(x)^\top\nabla g(x) = -\psi(g(x)), \qquad (5)$$
where $\psi$ is an increasing odd function. To see the effect of the constraint, consider three cases:
• When $g(x) > 0$, then $v_t(x)^\top\nabla g(x) = -\psi(g(x)) < 0$, which ensures that $v_t$ strictly decreases $g$.
• When $g(x) < 0$, then $v_t(x)^\top\nabla g(x) = -\psi(g(x)) > 0$, which ensures that $v_t$ strictly increases $g$.
• When $g(x) = 0$, then $v_t(x)^\top\nabla g(x) = -\psi(g(x)) = 0$, which ensures that $x$ stays on the manifold $\mathcal{G}_0$.
In all three cases, along the flow $dx_t = v_t(x_t)\,dt$ we have $\frac{d}{dt}g(x_t) = \nabla g(x_t)^\top v_t(x_t) = -\psi(g(x_t))$, so $|g(x_t)|$ decreases monotonically towards $0$.
We choose $\psi(x) = \alpha\,\mathrm{sign}(x)|x|^{1+\beta}$ with $\alpha > 0$ and $\beta\in(0,1]$ in this paper, because it is one of the simplest functions satisfying these requirements, and we found that it works well in theory and in practice.
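As a quick numerical illustration of this choice (with arbitrarily chosen $\alpha$, $\beta$, and initial value, which are not from the paper), integrating $\frac{d}{dt}g = -\psi(g)$ with Euler steps shows $g$ decaying towards $0$:

```python
import numpy as np

def psi(x, alpha=1.0, beta=0.5):
    # psi(x) = alpha * sign(x) * |x|^(1 + beta): increasing, odd, and psi(0) = 0.
    return alpha * np.sign(x) * np.abs(x) ** (1 + beta)

# Euler-integrate dg/dt = -psi(g), the decay imposed by the constraint in Eq. (5).
g, dt = 2.0, 1e-2
for _ in range(2000):
    g -= dt * psi(g)
print(g)  # close to 0: the flow pushes g(x_t) towards the manifold G_0
```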
In summary, the objective in Eq. (5) is the same as Eq. (2) in the standard variational framework, while the constraint ensures that $v_t$ pushes $q_t$ towards the manifold and keeps it there. It is easy to see that the solution of the above problem can be decomposed as $v_t = v^\parallel + v^\perp$, where
$$v^\parallel(x) = -\frac{\psi(g(x))\,\nabla g(x)}{\|\nabla g(x)\|^2}, \qquad v^\perp \perp \nabla g. \qquad (6)$$
We use $f \perp \nabla g$ to denote (pointwise) orthogonality: $f(x)^\top\nabla g(x) = 0$ for all $x\in\mathbb{R}^d$. Note that $v^\parallel$ is parallel to $\nabla g$, so it remains to determine $v^\perp$. Note that $v^\perp$ can be represented as the projection of an arbitrary function $u$ onto the orthogonal space of $\nabla g$:
$$v^\perp = D(x)u(x), \quad\text{where}\quad D(x) := I - \frac{\nabla g(x)\nabla g(x)^\top}{\|\nabla g(x)\|^2}. \qquad (7)$$
The projection operator $D$ ensures that $v^\perp \perp \nabla g$ holds for any $u$; the optimal $u$ is obtained by maximizing the unconstrained objective in Eq. (2),
$$\max_{v^\perp}\ \mathbb{E}_{q_t}\big[(s_\pi - s_{q_t})^\top(v^\parallel + v^\perp)\big] - \frac{1}{2}\|v^\perp\|_{\mathcal{H}}^2 \;\Longleftrightarrow\; \max_{u}\ \mathbb{E}_{q_t}\big[(D(s_\pi - s_{q_t}))^\top u\big] - \frac{1}{2}\|Du\|_{\mathcal{H}}^2. \qquad (8)$$
The optimal solution for $u$ depends on the choice of the space $\mathcal{H}$, which we discuss in Section 4.3.
Overall, we obtain the velocity field $v_t$ by first formulating a constrained optimization and then transforming it into an unconstrained optimization via orthogonal decomposition. We call $v_t$ the Orthogonal-Space Gradient Flow (O-Gradient); it drives $q_t$ to the target distribution using only the knowledge of $g$, requiring no explicit representation of the manifold $\mathcal{G}_0$.
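To make the decomposition concrete, a minimal sketch of $v^\parallel$ in Eq. (6) and the projection $D(x)$ in Eq. (7) is shown below; the unit-sphere constraint $g(x) = \|x\|^2 - 1$ is a made-up example, and $\psi$ is the function chosen above:

```python
import numpy as np

def psi(x, alpha=1.0, beta=0.5):
    return alpha * np.sign(x) * np.abs(x) ** (1 + beta)

def g(x):
    # Hypothetical constraint: unit sphere, g(x) = ||x||^2 - 1.
    return np.sum(x ** 2) - 1.0

def grad_g(x):
    return 2 * x

def v_parallel(x):
    # Eq. (6): component along grad g that drives g(x) towards 0.
    gx = grad_g(x)
    return -psi(g(x)) * gx / np.dot(gx, gx)

def D(x):
    # Eq. (7): projection onto the orthogonal complement of grad g(x).
    gx = grad_g(x)
    return np.eye(x.shape[0]) - np.outer(gx, gx) / np.dot(gx, gx)
```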
4.3 Practical Algorithms
After deriving O-Gradient for a general Hilbert space $\mathcal{H}$, we explain how to implement it using SVGD and Langevin dynamics. The resulting O-SVGD and O-Langevin are outlined in Algorithm 1. At a high level, our algorithms keep the original SVGD or Langevin dynamics movement in the directions perpendicular to $\nabla g$, while pushing the density towards $\mathcal{G}_0$ along the $\nabla g$ direction.
O-SVGD
We apply O-Gradient to SVGD first since it is fairly straightforward. Recall that $v^\parallel$ can be obtained using Eq. (6). We solve Eq. (8) to get $v^\perp$ via the following lemma.
Lemma 4.1. When $\mathcal{H}$ is an RKHS with kernel $k:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, a solution to Eq. (8) is $v^\perp = Du(x) = \mathbb{E}_{y\sim q_t}\big[k^\perp(x, y)s_\pi(y) + \nabla_y\cdot k^\perp(x, y)\big]$ with the orthogonal-space kernel $k^\perp(x, y) = k(x, y)D(x)D(y)$. Here $k^\perp:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^{d\times d}$ is matrix-valued, and $(\nabla_y\cdot k^\perp)_i = \sum_j \partial_{y_j}k^\perp_{ij}(x, y)$.

The combined velocity is then obtained using the original SVGD with the kernel $k^\perp$,
$$v_t(x) = v^\parallel(x) + \int k^\perp(x, y)\,s_\pi(y)\,q_t(y)\,dy + \int \nabla_y\cdot\big(k^\perp(x, y)\big)\,q_t(y)\,dy.$$
Numerically, we iteratively update a set of $n$ particles $\{x_{i,t}\}_{i=1}^n \subset \mathbb{R}^d$, such that their empirical distribution $\frac{1}{n}\sum_{i=1}^n \delta_{x_{i,t}}$ approximates $q_t$ in a proper sense as the step size $\eta\to 0$ and the number of particles $n\to+\infty$. Similar to the standard SVGD update in Eq. (4), the update of O-SVGD is $x_{i,t+1} = x_{i,t} + \eta\cdot\big(v^\parallel(x_{i,t}) + \mathrm{SVGD}_{k^\perp}(x_{i,t})\big)$, where
$$\mathrm{SVGD}_{k^\perp}(x_{i,t}) = \frac{1}{n}\sum_{j=1}^n \big[k^\perp(x_{i,t}, x_{j,t})\nabla_{x_{j,t}}\log\pi(x_{j,t}) + \nabla_{x_{j,t}}\cdot k^\perp(x_{i,t}, x_{j,t})\big]. \qquad (9)$$
It is worth noting that $\mathrm{SVGD}_{k^\perp}$ is identical to Eq. (4) but with the kernel $k^\perp$ rather than $k$.
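Below is a minimal sketch of the O-SVGD update in Eq. (9), reusing `D` and `v_parallel` from the sketch above (unit-sphere constraint, RBF kernel) and approximating the divergence term $\nabla_{x_j}\cdot k^\perp$ by finite differences for brevity; this is an illustration, not the paper's Algorithm 1:

```python
import numpy as np

def o_svgd_step(X, grad_log_pi, eta=1e-1, h=1.0, eps=1e-4):
    """One O-SVGD update (Eq. (9)): v_parallel plus SVGD with kernel k_perp(x, y) = k(x, y) D(x) D(y)."""
    n, d = X.shape
    Ds = np.stack([D(x) for x in X])                   # projections D(x_j)
    grads = np.stack([grad_log_pi(x) for x in X])
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * h ** 2))
    phi = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            Kp = k(X[i], X[j]) * Ds[i] @ Ds[j]         # k_perp(x_i, x_j), a d x d matrix
            div = np.zeros(d)                          # (div)_a = sum_m d/dx_{j,m} k_perp_{a,m}
            for m in range(d):
                xj = X[j].copy()
                xj[m] += eps
                Kp_m = k(X[i], xj) * Ds[i] @ D(xj)
                div += (Kp_m[:, m] - Kp[:, m]) / eps
            phi[i] += Kp @ grads[j] + div
    phi /= n
    return X + eta * (np.stack([v_parallel(x) for x in X]) + phi)
```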
O-Langevin
The Langevin implementation requires some additional derivation. First, with $\mathcal{H} = L^2_q$, we can show that the optimal velocity field is $v_t(x) = \phi(x) - D(x)s_{q_t}(x)$, where $\phi(x) = v^\parallel(x) + D(x)s_\pi(x)$. This leads to the density flow
$$\frac{d}{dt}q_t(x) = -\nabla\cdot\big(\phi(x)q_t(x)\big) + \nabla\cdot\big(D(x)\nabla q_t(x)\big). \qquad (10)$$
Next, we design a stochastic differential equation (SDE) whose Fokker-Planck equation (FPE) is identical to Eq. (10). The result is given by the following theorem.
Theorem 4.2. Consider the vector field $r(x) = \nabla\cdot D(x)$, with component-wise form $r_i(x) = \sum_{j=1}^d \partial_{x_j} D_{i,j}(x)$, where $x_j$ denotes the $j$th dimension of $x$. Consider the SDE
$$dx_t = \big(\phi(x_t) + r(x_t)\big)\,dt + \sqrt{2D(x_t)}\,dW_t \qquad (11)$$
with $\phi(x) = v^\parallel(x) + D(x)s_\pi(x)$; then its FPE is identical to Eq. (10). Moreover, i) the value $g(x_t)$ has the deterministic decay $\frac{d}{dt}g(x_t) = -\psi(g(x_t))$; ii) for any $f$ with $f\perp\nabla g$, the generator of $x_t$ satisfies the Langevin equation $\frac{d}{dt}\mathbb{E}[f(x_t)\,|\,x_0 = x]\big|_{t=0} := \mathcal{L}f(x) = \nabla f(x)^\top s_\pi(x) + \Delta f(x)$.
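As a brief sanity check on the divergence correction $r$ (a step filled in here; the full argument is in the paper's proofs): since $D(x)$ is a symmetric projection, the diffusion matrix of Eq. (11) is $\frac{1}{2}\sqrt{2D}\,\sqrt{2D}^\top = D$, so its FPE reads
$$\partial_t q_t = -\nabla\cdot\big((\phi + r)\,q_t\big) + \sum_{i,j}\partial_i\partial_j\big(D_{ij}\,q_t\big) = -\nabla\cdot(\phi\,q_t) - \nabla\cdot(r\,q_t) + \nabla\cdot(r\,q_t) + \nabla\cdot(D\nabla q_t),$$
which recovers Eq. (10), using $\sum_j \partial_j(D_{ij}\,q_t) = r_i\,q_t + (D\nabla q_t)_i$.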
It is worth pointing out that if $g(x_0) = 0$, then $g(x_t)\equiv 0$, that is, $x_t$ always stays on $\mathcal{G}_0$. In this case, Eq. (11) degenerates to the manifold Langevin dynamics studied in previous work [9, 35]. However, our SDE does not have this requirement, since it is still well-defined off $\mathcal{G}_0$. This is especially useful for numerical implementation, leading to a fast algorithm without expensive projection steps.
Similar to the standard Langevin dynamics update in Eq. (3), the update rule of O-Langevin is $x_{t+1} = x_t + \eta\cdot v^\parallel(x_t) + \mathrm{Langevin}^\perp(x_t)$, where
$$\mathrm{Langevin}^\perp(x_t) = \eta D(x_t)s_\pi(x_t) + \eta r(x_t) + \sqrt{2\eta D(x_t)}\,\xi_t, \qquad \xi_t\sim\mathcal{N}(0, I). \qquad (12)$$
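A minimal sketch of the O-Langevin update in Eq. (12), again reusing `D`, `v_parallel`, and the unit-sphere example from the sketches above; $r(x) = \nabla\cdot D(x)$ is approximated here by finite differences, whereas in practice it can be computed with automatic differentiation:

```python
import numpy as np

def div_D(x, eps=1e-4):
    # r_i(x) = sum_j d/dx_j D_ij(x), approximated by central finite differences.
    d = x.shape[0]
    out = np.zeros(d)
    for j in range(d):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps
        xm[j] -= eps
        out += (D(xp)[:, j] - D(xm)[:, j]) / (2 * eps)
    return out

def o_langevin_step(x, grad_log_pi, eta, rng):
    # Eq. (12): projected drift + divergence correction + projected noise, plus the v_parallel pull.
    Dx = D(x)
    drift = eta * (Dx @ grad_log_pi(x) + div_D(x))
    noise = np.sqrt(2 * eta) * (Dx @ rng.standard_normal(x.shape))  # ~ N(0, 2 eta D) since D^2 = D
    return x + eta * v_parallel(x) + drift + noise
```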
5 Theoretical Analysis
We theoretically justify the convergence of O-Gradient in this section. To do so, we first describe the target measure as a conditioned measure $\Pi_0$, then derive its associated orthogonal-space Fisher divergence, and finally prove that O-Gradient converges to $\Pi_0$.