IMPROVED STEIN VARIATIONAL GRADIENT DESCENT
WITH IMPORTANCE WEIGHTS
Lukang Sun
KAUST
lukang.sun@kaust.edu.sa
Peter Richtárik
KAUST
peter.richtarik@kaust.edu.sa
ABSTRACT
Stein Variational Gradient Descent (SVGD) is a popular sampling algorithm used in various machine learning tasks. It is well known that SVGD arises from a discretization of the kernelized gradient flow of the Kullback-Leibler divergence $D_{\mathrm{KL}}(\cdot\,|\,\pi)$, where $\pi$ is the target distribution. In this work, we propose to enhance SVGD via the introduction of importance weights, which leads to a new method for which we coin the name β-SVGD. In the continuous time and infinite particles regime, the time for this flow to converge to the equilibrium distribution $\pi$, quantified by the Stein Fisher information, depends on $\rho_0$ and $\pi$ very weakly. This is very different from the kernelized gradient flow of the Kullback-Leibler divergence, whose time complexity depends on $D_{\mathrm{KL}}(\rho_0\,|\,\pi)$. Under certain assumptions, we provide a descent lemma for the population limit β-SVGD, which covers the descent lemma for the population limit SVGD when $\beta \to 0$. We also illustrate the advantages of β-SVGD over SVGD by experiments.
1 INTRODUCTION
The main technical task of Bayesian inference is to estimate integrals with respect to the posterior distribution
$$\pi(x) \propto e^{-V(x)},$$
where $V:\mathbb{R}^d \to \mathbb{R}$ is a potential. In practice, this is often reduced to sampling points from the distribution $\pi$. Typical methods that employ this strategy include algorithms based on Markov Chain Monte Carlo (MCMC), such as Hamiltonian Monte Carlo (Neal, 2011), also known as Hybrid Monte Carlo (HMC) (Duane et al., 1987; Betancourt, 2017), and algorithms based on Langevin dynamics (Dalalyan & Karagulyan, 2019; Durmus & Moulines, 2017; Cheng et al., 2018).
On the other hand, Stein Variational Gradient Descent (SVGD), a different strategy suggested by Liu & Wang (2016), is based on an interacting particle system. In the population limit, the interacting particle system can be seen as the kernelized negative gradient flow of the Kullback-Leibler divergence
$$D_{\mathrm{KL}}(\rho\,|\,\pi) := \int \log\frac{\rho}{\pi}(x)\, d\rho(x); \tag{1}$$
see (Liu, 2017; Duncan et al., 2019). SVGD has already been widely used in a variety of machine learning settings, including variational auto-encoders (Pu et al., 2017), reinforcement learning (Liu et al., 2017), sequential decision making (Zhang et al., 2018; 2019), generative adversarial networks (Tao et al., 2019) and federated learning (Kassab & Simeone, 2022). However, the current theoretical understanding of SVGD is limited to its infinite particle version (Liu, 2017; Korba et al., 2020; Salim et al., 2021; Sun et al., 2022), and the theory on finite particle SVGD is far from satisfactory.
Since SVGD is built on a discretization of the kernelized negative gradient flow of (1), we can learn
about its sampling potential by studying this flow. In fact, a simple calculation reveals that
$$\min_{0\le s\le t} I_{\mathrm{Stein}}(\rho_s\,|\,\pi) \le \frac{D_{\mathrm{KL}}(\rho_0\,|\,\pi)}{t}, \tag{2}$$
where $I_{\mathrm{Stein}}(\rho_s\,|\,\pi)$ is the Stein Fisher information (see Definition 2) of $\rho_s$ relative to $\pi$, which is typically used to quantify how close the probability distributions $(\rho_s)_{s=0}^{t}$ generated along this flow are to $\pi$. In particular, if our goal is to guarantee $\min_{0\le s\le t} I_{\mathrm{Stein}}(\rho_s\,|\,\pi) \le \varepsilon$, result (2) says that we need to take
$$t \ge \frac{D_{\mathrm{KL}}(\rho_0\,|\,\pi)}{\varepsilon}.$$
Unfortunately, and this is the key motivation for our work, the initial KL divergence $D_{\mathrm{KL}}(\rho_0\,|\,\pi)$ can be very large. Indeed, it can be proportional to the underlying dimension, which is highly problematic in high dimensional regimes. Salim et al. (2021) and Sun et al. (2022) have recently derived an iteration complexity bound for the infinite particle SVGD method. However, similarly to the time complexity of the continuous flow, their bound depends on $D_{\mathrm{KL}}(\rho_0\,|\,\pi)$.
1.1 SUMMARY OF CONTRIBUTIONS
In this paper, we design a family of continuous time flows, which we call β-SVGD flows, by combining importance weights with the kernelized gradient flow of the KL-divergence. Surprisingly, we prove that the time for this flow to converge to the equilibrium distribution $\pi$, that is, to reach $\min_{0\le s\le t} I_{\mathrm{Stein}}(\rho_s\,|\,\pi) \le \varepsilon$ with $(\rho_s)_{s=0}^{t}$ generated along the β-SVGD flow, can be bounded by $\frac{-1}{\varepsilon\beta(\beta+1)}$ when $\beta\in(-1,0)$. This indicates that the importance weights can potentially accelerate SVGD. We then design the β-SVGD method based on a discretization of the β-SVGD flow and provide a descent lemma for its population limit version. Simple experiments in Appendix D verify our predictions.
We summarize our contributions in the following:
• A new family of flows. We construct a family of continuous time flows for which we coin the name β-SVGD flows. These flows do not arise from a time re-parameterization of the SVGD flow, since their trajectories are different, nor can they be seen as the kernelized gradient flows of the Rényi divergence.
• Convergence rates. When $\beta \to 0$, the β-SVGD flow reduces to the kernelized gradient flow of the KL-divergence (the SVGD flow); when $\beta\in(-1,0)$, the convergence rate of the β-SVGD flows is significantly better than that of the SVGD flow whenever $D_{\mathrm{KL}}(\rho_0\,|\,\pi)$ is large. Under a Stein Poincaré inequality, we derive an exponential convergence rate of the 2-Rényi divergence along the 1-SVGD flow. The Stein Poincaré inequality is proved to be weaker than the Stein log-Sobolev inequality; however, like the Stein log-Sobolev inequality, it is not clear to us when it holds.
• Algorithm. We design the β-SVGD algorithm based on a discretization of the β-SVGD flow and derive a descent lemma for the population limit β-SVGD.
• Experiments. Finally, we run experiments to illustrate the advantages of β-SVGD with negative β. The simulation results on β-SVGD corroborate our theory.
1.2 RELATED WORKS
The SVGD sampling technique was first presented in the fundamental work of Liu & Wang (2016). Since then, a number of SVGD variants have been proposed. The following is a partial list: Newton version of SVGD (Detommaso et al., 2018), stochastic SVGD (Gorham et al., 2020), mirrored SVGD (Shi et al., 2021), random-batch SVGD (Li et al., 2020) and matrix kernel SVGD (Wang et al., 2019). The theoretical understanding of SVGD is still limited to population limit SVGD. The first work to demonstrate the convergence of SVGD in the population limit was by Liu (2017); Korba et al. (2020) then derived a similar descent lemma for the population limit SVGD using a different approach. However, their results relied on path information and thus were not self-contained. To provide a clean analysis, Salim et al. (2021) assumed a Talagrand's $T_1$ inequality for the target distribution $\pi$ and gave the first iteration complexity analysis in terms of the dimension $d$. Following the work of Salim et al. (2021), Sun et al. (2022) derived a descent lemma for the population limit SVGD under a non-smooth potential $V$.
In this paper, we consider a family of generalized divergences, the Rényi divergences, and SVGD with importance weights. For these two themes, we name a few, non-exhaustive, related results. Wang et al. (2018) proposed to use the f-divergence instead of the KL-divergence in the variational inference problem, where $f$ is a convex function; Yu et al. (2020) also considered variational inference with the f-divergence, but in its dual form. Han & Liu (2017) considered combining importance sampling with SVGD; however, the importance weights were only used to adjust the final sampled points, not inside the SVGD iterations as in this paper. Liu & Lee (2017) considered importance sampling and designed a black-box scheme to compute the importance weights (called Stein importance weights in their paper) of any set of points.
2 PRELIMINARIES
We assume the target distribution satisfies $\pi \propto e^{-V}$, and that we have an oracle to compute the value of $e^{-V(x)}$ for all $x\in\mathbb{R}^d$.
2.1 NOTATION
Let $x=(x_1,\dots,x_d)^\top, y=(y_1,\dots,y_d)^\top \in \mathbb{R}^d$; denote $\langle x,y\rangle := \sum_{i=1}^d x_i y_i$ and $\|x\| := \sqrt{\langle x,x\rangle}$. For a square matrix $B\in\mathbb{R}^{d\times d}$, the operator norm and Frobenius norm of $B$ are defined by $\|B\|_{\mathrm{op}} := \sqrt{\varrho(B^\top B)}$ and $\|B\|_F := \sqrt{\sum_{i=1}^d\sum_{j=1}^d B_{i,j}^2}$, respectively, where $\varrho$ denotes the spectral radius. It is easy to verify that $\|B\|_{\mathrm{op}} \le \|B\|_F$. Let $\mathcal{P}_2(\mathbb{R}^d)$ denote the space of probability measures with finite second moment; that is, for any $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ we have $\int \|x\|^2\, d\mu(x) < +\infty$. The Wasserstein 2-distance between $\rho,\mu\in\mathcal{P}_2(\mathbb{R}^d)$ is defined by
$$W_2(\rho,\mu) := \inf_{\eta\in\Gamma(\rho,\mu)} \sqrt{\int \|x-y\|^2\, d\eta(x,y)},$$
where $\Gamma(\rho,\mu)$ is the set of all joint distributions on $\mathbb{R}^d\times\mathbb{R}^d$ having $\rho$ and $\mu$ as marginals. The push-forward distribution of $\rho\in\mathcal{P}_2(\mathbb{R}^d)$ by a map $T:\mathbb{R}^d\to\mathbb{R}^d$, denoted $T_\#\rho$, is defined as follows: for any measurable set $\Omega\subset\mathbb{R}^d$, $T_\#\rho(\Omega) := \rho(T^{-1}(\Omega))$. By definition of the push-forward distribution, it is not hard to verify that the probability densities satisfy $T_\#\rho(T(x))\,|\det DT(x)| = \rho(x)$, where $DT$ is the Jacobian matrix of $T$. The reader can refer to Villani (2009) for more details.
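To make the push-forward identity concrete, the following NumPy sketch (our own illustration, not from the paper) checks $T_\#\rho(T(x))\,|\det DT(x)| = \rho(x)$ for a linear map applied to a standard Gaussian; the particular map $T(x)=Ax+b$ and the Gaussian densities are assumptions chosen only for this example.

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 3
rng = np.random.default_rng(0)

# A linear push-forward map T(x) = A x + b (illustrative choice, invertible with high probability).
A = rng.normal(size=(d, d)) + 3 * np.eye(d)
b = rng.normal(size=d)

# rho = N(0, I); its push-forward under T is the Gaussian N(b, A A^T).
rho = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))
T_rho = multivariate_normal(mean=b, cov=A @ A.T)

x = rng.normal(size=d)
Tx = A @ x + b
jacobian_det = abs(np.linalg.det(A))   # |det DT(x)| = |det A| for a linear map

# Change-of-variables identity: T_#rho(T(x)) |det DT(x)| = rho(x).
lhs = T_rho.pdf(Tx) * jacobian_det
rhs = rho.pdf(x)
print(lhs, rhs)                        # the two numbers agree up to floating-point error
```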
2.2 RÉNYI DIVERGENCE
Next, we define the Rényi divergence, which plays an important role in information theory and many other areas such as hypothesis testing (Morales González et al., 2013) and multiple source adaptation (Mansour et al., 2012).
Definition 1 (Rényi divergence) For two probability distributions $\rho$ and $\mu$ on $\mathbb{R}^d$ with $\rho \ll \mu$, the Rényi divergence of positive order $\alpha$ is defined as
$$D_\alpha(\rho\,|\,\mu) := \begin{cases} \dfrac{1}{\alpha-1}\log \displaystyle\int \left(\dfrac{\rho}{\mu}\right)^{\alpha-1}(x)\, d\rho(x), & 0<\alpha<\infty,\ \alpha\neq 1,\\[6pt] \displaystyle\int \log\dfrac{\rho}{\mu}(x)\, d\rho(x), & \alpha = 1. \end{cases} \tag{3}$$
If $\rho$ is not absolutely continuous with respect to $\mu$, we set $D_\alpha(\rho\,|\,\mu) = \infty$. Further, we denote $D_{\mathrm{KL}}(\rho\,|\,\mu) := D_1(\rho\,|\,\mu)$.
The Rényi divergence is non-negative, continuous and non-decreasing in the parameter $\alpha$; specifically, we have $D_{\mathrm{KL}}(\rho\,|\,\mu) = \lim_{\alpha\to 1} D_\alpha(\rho\,|\,\mu)$. More properties of the Rényi divergence can be found in the comprehensive article by Van Erven & Harremos (2014). Besides the Rényi divergence, there are other generalizations of the KL-divergence, e.g., admissible relative entropies (Arnold et al., 2001).
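As a concrete illustration of (3), the following sketch estimates $D_\alpha(\rho\,|\,\mu)$ by plain Monte Carlo when both densities can be evaluated and samples from $\rho$ are available; the one-dimensional Gaussian choices for $\rho$ and $\mu$ are our own example, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def renyi_divergence_mc(samples, log_rho, log_mu, alpha):
    """Monte Carlo estimate of D_alpha(rho | mu) from samples of rho, following (3)."""
    log_ratio = log_rho(samples) - log_mu(samples)
    if np.isclose(alpha, 1.0):
        # alpha = 1 recovers the KL-divergence.
        return np.mean(log_ratio)
    # D_alpha = 1/(alpha-1) * log E_rho[(rho/mu)^(alpha-1)], computed in log-space for stability.
    return (np.logaddexp.reduce((alpha - 1.0) * log_ratio) - np.log(len(samples))) / (alpha - 1.0)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=100_000)        # samples from rho = N(1, 1)
log_rho = lambda t: norm.logpdf(t, loc=1.0, scale=1.0)
log_mu = lambda t: norm.logpdf(t, loc=0.0, scale=1.0)   # mu = N(0, 1)

for alpha in (0.5, 1.0, 2.0):
    print(alpha, renyi_divergence_mc(x, log_rho, log_mu, alpha))
# For Gaussians with equal variance, D_alpha = alpha * (mean gap)^2 / 2 = alpha / 2 here.
```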
2.3 BACKGROUND ON SVGD
Stein Variational Gradient Descent (SVGD) is defined on a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_0$ with a non-negative definite reproducing kernel $k:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}_+$. The key feature of this space is its reproducing property:
$$f(x) = \langle f(\cdot), k(x,\cdot)\rangle_{\mathcal{H}_0}, \quad \forall f\in\mathcal{H}_0, \tag{4}$$
where $\langle\cdot,\cdot\rangle_{\mathcal{H}_0}$ is the inner product defined on $\mathcal{H}_0$. Let $\mathcal{H}$ be the $d$-fold Cartesian product of $\mathcal{H}_0$. That is, $f\in\mathcal{H}$ if and only if there exist $f_1,\dots,f_d\in\mathcal{H}_0$ such that $f=(f_1,\dots,f_d)^\top$. Naturally, the inner product on $\mathcal{H}$ is given by
$$\langle f,g\rangle_{\mathcal{H}} := \sum_{i=1}^d \langle f_i, g_i\rangle_{\mathcal{H}_0}, \quad f=(f_1,\dots,f_d)^\top\in\mathcal{H},\ g=(g_1,\dots,g_d)^\top\in\mathcal{H}. \tag{5}$$
For more details on RKHS, the reader can refer to Berlinet & Thomas-Agnan (2011).
It is well known (see for example Ambrosio et al. (2005)) that $\nabla\log\frac{\rho}{\pi}$ is the Wasserstein gradient of $D_{\mathrm{KL}}(\cdot\,|\,\pi)$ at $\rho\in\mathcal{P}_2(\mathbb{R}^d)$. Liu & Wang (2016) proposed a kernelized Wasserstein gradient of the KL-divergence, defined by
$$g_\rho(x) := \int k(x,y)\,\nabla\log\frac{\rho}{\pi}(y)\, d\rho(y) \in \mathcal{H}. \tag{6}$$
Integration by parts yields
$$g_\rho(x) = -\int \left[\nabla\log\pi(y)\, k(x,y) + \nabla_y k(x,y)\right] d\rho(y). \tag{7}$$
Comparing the Wasserstein gradient $\nabla\log\frac{\rho}{\pi}$ with (7), we find that the latter can be easily approximated by
$$g_\rho(x) \approx \hat g_{\hat\rho}(x) := -\frac{1}{N}\sum_{i=1}^N \left[\nabla\log\pi(x_i)\, k(x,x_i) + \nabla_{x_i} k(x,x_i)\right], \tag{8}$$
with $\hat\rho = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}$ and $(x_i)_{i=1}^N$ sampled from $\rho$. With the above notation, the SVGD update rule
$$x_i \leftarrow x_i + \frac{\gamma}{N}\sum_{j=1}^N \left[\nabla\log\pi(x_j)\, k(x_i,x_j) + \nabla_{x_j} k(x_i,x_j)\right], \quad i=1,\dots,N, \tag{9}$$
where $\gamma$ is the step-size, can be presented in the compact form $\hat\rho \leftarrow (I - \gamma \hat g_{\hat\rho})_\#\hat\rho$. When we talk about infinite particle SVGD, or population limit SVGD, we mean $\rho \leftarrow (I - \gamma g_\rho)_\#\rho$. The metric used in the study of SVGD is the Stein Fisher information, or the Kernelized Stein Discrepancy (KSD).
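To make the update (9) concrete, here is a short NumPy sketch of SVGD with an RBF kernel; the kernel, bandwidth, step-size and the Gaussian target below are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def rbf_kernel(x, h):
    """RBF kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2h)) and its gradient in the second argument."""
    diff = x[:, None, :] - x[None, :, :]              # (N, N, d), diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)              # (N, N)
    K = np.exp(-sq_dist / (2.0 * h))                  # (N, N)
    grad_K = diff / h * K[:, :, None]                 # grad_{x_j} k(x_i, x_j) = (x_i - x_j)/h * k
    return K, grad_K

def svgd_step(x, grad_log_pi, gamma, h):
    """One SVGD update (9): x_i <- x_i + gamma/N * sum_j [grad log pi(x_j) k(x_i,x_j) + grad_{x_j} k(x_i,x_j)]."""
    N = x.shape[0]
    K, grad_K = rbf_kernel(x, h)
    score = grad_log_pi(x)                            # (N, d), row j is grad log pi(x_j)
    phi = (K @ score + grad_K.sum(axis=1)) / N        # (N, d)
    return x + gamma * phi

# Illustrative target: pi = N(0, I) in d = 2, so grad log pi(x) = -x.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, size=(200, 2))                # particles initialized far from pi
for _ in range(500):
    x = svgd_step(x, lambda z: -z, gamma=0.1, h=1.0)
print(x.mean(axis=0), x.std(axis=0))                  # should be roughly (0, 0) and (1, 1)
```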
Definition 2 (Stein Fisher Information) Let $\rho\in\mathcal{P}_2(\mathbb{R}^d)$. The Stein Fisher Information of $\rho$ relative to $\pi$ is defined by
$$I_{\mathrm{Stein}}(\rho\,|\,\pi) := \int\!\!\int k(x,y)\left\langle \nabla\log\frac{\rho}{\pi}(x),\, \nabla\log\frac{\rho}{\pi}(y)\right\rangle d\rho(x)\, d\rho(y). \tag{10}$$
A sufficient condition under which $\lim_{n\to\infty} I_{\mathrm{Stein}}(\rho_n\,|\,\pi) = 0$ implies that $\rho_n$ converges weakly to $\pi$ can be found in Gorham & Mackey (2017); it requires: i) the kernel $k$ to be of the form $k(x,y) = \left(c^2 + \|x-y\|^2\right)^{\theta}$ for some $c>0$ and $\theta\in(-1,0)$; ii) $\pi\propto e^{-V}$ to be distantly dissipative; roughly speaking, this requires $V$ to be convex outside a compact set, see Gorham & Mackey (2017) for a precise definition.
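In practice, $I_{\mathrm{Stein}}(\rho\,|\,\pi)$ is typically estimated from particles through the kernelized Stein discrepancy form, which, under standard regularity conditions, can be written using only $\nabla\log\pi$ and the kernel. The sketch below is our own illustration of the usual V-statistic estimator with an RBF kernel; the bandwidth and the Gaussian example are assumptions, not the paper's setup.

```python
import numpy as np

def ksd_vstat(x, grad_log_pi, h):
    """V-statistic estimate of I_Stein(rho | pi) from particles x ~ rho, using the RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2h)) and the Stein kernel, which only needs grad log pi."""
    N, d = x.shape
    s = grad_log_pi(x)                                   # (N, d), s[i] = grad log pi(x_i)
    diff = x[:, None, :] - x[None, :, :]                 # (N, N, d)
    sq_dist = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * h))

    term1 = (s @ s.T) * K                                # s(x)^T s(y) k(x, y)
    term2 = np.einsum('id,ijd->ij', s, diff) / h * K     # s(x)^T grad_y k(x, y), grad_y k = (x - y)/h k
    term3 = np.einsum('jd,ijd->ij', s, -diff) / h * K    # s(y)^T grad_x k(x, y), grad_x k = (y - x)/h k
    term4 = (d / h - sq_dist / h ** 2) * K               # trace of grad_x grad_y k(x, y)
    return np.mean(term1 + term2 + term3 + term4)

# Illustrative check: particles drawn from pi = N(0, I) itself should give a small value.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
print(ksd_vstat(x, lambda z: -z, h=1.0))
```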
In the study of the kernelized Wasserstein gradient (7) and its corresponding continuity equation
$$\frac{\partial \rho_t}{\partial t} + \operatorname{div}(\rho_t g_{\rho_t}) = 0,$$
Duncan et al. (2019) introduced the following kernelized log-Sobolev inequality to prove the exponential convergence of $D_{\mathrm{KL}}(\rho_t\,|\,\pi)$ along the direction of (7):
Definition 3 (Stein log-Sobolev inequality) We say $\pi$ satisfies the Stein log-Sobolev inequality with constant $\lambda > 0$ if
$$D_{\mathrm{KL}}(\rho\,|\,\pi) \le \frac{1}{2\lambda}\, I_{\mathrm{Stein}}(\rho\,|\,\pi). \tag{11}$$
While this inequality can guarantee an exponential convergence rate of $\rho_t$ to $\pi$, quantified by the KL-divergence, the condition for $\pi$ to satisfy the Stein log-Sobolev inequality is very restrictive. In fact, little is known about when (11) holds.
3 CONTINUOUS TIME DYNAMICS OF THE β-SVGD FLOW
In this section, we mainly focus on the continuous time dynamics of the β-SVGD flow. Due to space limitations, we defer all proofs to Appendix B.
Figure 1: The performance of β-SVGD with $\beta = -0.5, 0, 0.5$ (from left to right), with the same step-size. The blue dashed line is the target distribution $\pi$: the Gaussian mixture $\frac{2}{5}\mathcal{N}(2,1) + \frac{3}{5}\mathcal{N}(6,1)$. The green solid line is the distribution generated by β-SVGD after 100 iterations. More experiments can be found in Appendix D.
3.1 β-SVGD FLOW
In this paper, a flow refers to a time-dependent vector field $v_t:\mathbb{R}^d\to\mathbb{R}^d$. This time-dependent vector field transports mass on $\mathbb{R}^d$ according to the continuity equation, or equation of conservation of mass,
$$\frac{\partial \rho_t}{\partial t} + \operatorname{div}(\rho_t v_t) = 0; \tag{12}$$
the reader can refer to Ambrosio et al. (2005) for more details.
Definition 4 (β-SVGD flow) Given a weight parameter $\beta\in(-1,+\infty)$, the β-SVGD flow is given by
$$v_t^{\beta}(x) := -\left(\frac{\pi}{\rho_t}\right)^{\beta}(x) \int k(x,y)\,\nabla\log\frac{\rho_t}{\pi}(y)\, d\rho_t(y). \tag{13}$$
Note that when $\beta = 0$, this is the negative of the kernelized Wasserstein gradient (6).
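The paper builds the β-SVGD algorithm from a discretization of (13); the exact way the importance weight $(\pi/\rho_t)^{\beta}$ is estimated is specified in the algorithm section and is not reproduced here. Purely as an illustration, the sketch below approximates $\rho_t$ by a kernel density estimate over the particles and self-normalizes and clips the weights; these choices, together with the Gaussian target, are our own assumptions, not the scheme prescribed by the paper.

```python
import numpy as np

def beta_svgd_step(x, log_pi_unnorm, grad_log_pi, beta, gamma, h, w_clip=10.0):
    """One discretized beta-SVGD step in the spirit of (13): each particle moves along an
    estimate of (pi/rho_t)^beta times the usual SVGD direction (9). Here rho_t is approximated
    by a Gaussian kernel density estimate over the particles (a hypothetical choice)."""
    N, d = x.shape
    diff = x[:, None, :] - x[None, :, :]                 # (N, N, d)
    sq_dist = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * h))
    grad_K = diff / h * K[:, :, None]

    # Standard SVGD direction, as in (9).
    phi = (K @ grad_log_pi(x) + grad_K.sum(axis=1)) / N

    # log (pi/rho_t)^beta up to an additive constant; the constant cancels after
    # self-normalization because pi is only known up to its normalizing constant.
    log_rho_kde = np.log(K.mean(axis=1) + 1e-12) - 0.5 * d * np.log(2.0 * np.pi * h)
    log_w = beta * (log_pi_unnorm(x) - log_rho_kde)
    w = np.exp(log_w - log_w.max())
    w = np.clip(w / w.mean(), 1.0 / w_clip, w_clip)      # practical safeguard (our own choice)
    return x + gamma * w[:, None] * phi

# Illustrative run on a Gaussian target pi = N(0, I) (assumed, not from the paper).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=(200, 2))
for _ in range(500):
    x = beta_svgd_step(x, lambda z: -0.5 * np.sum(z ** 2, axis=-1), lambda z: -z,
                       beta=-0.5, gamma=0.05, h=1.0)
print(x.mean(axis=0))                                    # should drift close to the target mean (0, 0)
```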
Note that we cannot treat the β-SVGD flow as the kernelized Wasserstein gradient flow of the $(\beta+1)$-Rényi divergence. However, the two are closely related, and we can derive the following theorem.
Theorem 1 (Main result) Along the β-SVGD flow (13), we have¹
$$\min_{t\in[0,T]} I_{\mathrm{Stein}}(\rho_t\,|\,\pi) \le \frac{1}{T}\int_0^T I_{\mathrm{Stein}}(\rho_t\,|\,\pi)\, dt \le \begin{cases} \dfrac{e^{\beta D_{\beta+1}(\rho_0|\pi)}}{T\beta(\beta+1)}, & \beta>0,\\[6pt] \dfrac{D_{\mathrm{KL}}(\rho_0\,|\,\pi)}{T}, & \beta=0,\\[6pt] \dfrac{-1}{T\beta(\beta+1)}, & \beta\in(-1,0). \end{cases} \tag{14}$$
Note that the left hand side of (14) is the Stein Fisher information. When $\beta$ decreases from positive to negative values, the right hand side of (14) changes dramatically; for $\beta\in(-1,0)$ it appears to be independent of $\rho_0$ and $\pi$. If we do not know the Rényi divergence between $\rho_0$ and $\pi$, it seems the best convergence rate is obtained by setting $\beta = -\frac{1}{2}$, that is,
$$\min_{t\in[0,T]} I_{\mathrm{Stein}}(\rho_t\,|\,\pi) \le \frac{1}{T}\int_0^T I_{\mathrm{Stein}}(\rho_t\,|\,\pi)\, dt \le \frac{4}{T}.$$
It is somewhat unexpected to observe that the time complexity is independent of $\rho_0$ and $\pi$, or, to be more precise, that it depends only very weakly on $\rho_0$ and $\pi$ when $\beta\in(-1,0)$. We wish to stress that this is not achieved by time re-parameterization. In the proof of Theorem 1, we can see that the term $(\pi/\rho_t)^{\beta}$ in the β-SVGD flow (13) is utilized to cancel the term $(\rho_t/\pi)^{\beta}$ in the Wasserstein gradient of the $(\beta+1)$-Rényi divergence. Actually, when $\beta\in(-1,0)$, this term has an added advantage and can be seen as an acceleration factor in front of the kernelized Wasserstein gradient of the KL-divergence. Specifically, the negative kernelized Wasserstein gradient of the KL-divergence, $v_t^{0}(x) := -\int k(x,y)\,\nabla\log\frac{\rho_t}{\pi}(y)\, d\rho_t(y)$, is the vector field that compels $\rho_t$ to approach $\pi$, while
¹In fact, the proof in Appendix B yields a stronger result. When $\beta\in(-1,0)$, the right hand side of (14) is only weakly dependent on $\rho_0$ and $\pi$ and can be taken to be $\frac{e^{\beta D_{\beta+1}(\rho_T|\pi)} - e^{\beta D_{\beta+1}(\rho_0|\pi)}}{T\,|\beta(\beta+1)|}$, which is less than $\frac{-1}{T\beta(\beta+1)}$.