IMPROVED STEIN VARIATIONAL GRADIENT DESCENT
WITH IMPORTANCE WEIGHTS
Lukang Sun
KAUST
lukang.sun@kaust.edu.sa
Peter Richtárik
KAUST
peter.richtarik@kaust.edu.sa
ABSTRACT
Stein Variational Gradient Descent (SVGD) is a popular sampling algorithm used in various machine learning tasks. It is well known that SVGD arises from a discretization of the kernelized gradient flow of the Kullback-Leibler divergence $D_{\mathrm{KL}}(\cdot\,|\,\pi)$, where $\pi$ is the target distribution. In this work, we propose to enhance SVGD via the introduction of importance weights, which leads to a new method for which we coin the name β-SVGD. In the continuous time and infinite particles regime, the time for this flow to converge to the equilibrium distribution $\pi$, quantified by the Stein Fisher information, depends on $\rho_0$ and $\pi$ very weakly. This is very different from the kernelized gradient flow of the Kullback-Leibler divergence, whose time complexity depends on $D_{\mathrm{KL}}(\rho_0\,|\,\pi)$. Under certain assumptions, we provide a descent lemma for the population limit β-SVGD, which covers the descent lemma for the population limit SVGD when $\beta \to 0$. We also illustrate the advantages of β-SVGD over SVGD by experiments.
1 INTRODUCTION
The main technical task of Bayesian inference is to estimate integrals with respect to the posterior distribution
$$\pi(x) \propto e^{-V(x)},$$
where $V:\mathbb{R}^d \to \mathbb{R}$ is a potential. In practice, this is often reduced to sampling points from the distribution $\pi$. Typical methods that employ this strategy include algorithms based on Markov Chain Monte Carlo (MCMC), such as Hamiltonian Monte Carlo (Neal, 2011), also known as Hybrid Monte Carlo (HMC) (Duane et al., 1987; Betancourt, 2017), and algorithms based on Langevin dynamics (Dalalyan & Karagulyan, 2019; Durmus & Moulines, 2017; Cheng et al., 2018).
On the other hand, Stein Variational Gradient Descent (SVGD), a different strategy suggested by Liu & Wang (2016), is based on an interacting particle system. In the population limit, the interacting particle system can be seen as the kernelized negative gradient flow of the Kullback-Leibler divergence
$$D_{\mathrm{KL}}(\rho\,|\,\pi) := \int \log\frac{\rho}{\pi}(x)\, d\rho(x); \tag{1}$$
see (Liu, 2017; Duncan et al., 2019). SVGD has already been widely used in a variety of machine learning settings, including variational auto-encoders (Pu et al., 2017), reinforcement learning (Liu et al., 2017), sequential decision making (Zhang et al., 2018; 2019), generative adversarial networks (Tao et al., 2019) and federated learning (Kassab & Simeone, 2022). However, the current theoretical understanding of SVGD is limited to its infinite particle version (Liu, 2017; Korba et al., 2020; Salim et al., 2021; Sun et al., 2022), and the theory on finite particle SVGD is far from satisfactory.
Since SVGD is built on a discretization of the kernelized negative gradient flow of (1), we can learn
about its sampling potential by studying this flow. In fact, a simple calculation reveals that
$$\min_{0\le s\le t} I_{\mathrm{Stein}}(\rho_s\,|\,\pi) \le \frac{D_{\mathrm{KL}}(\rho_0\,|\,\pi)}{t}, \tag{2}$$
where $I_{\mathrm{Stein}}(\rho_s\,|\,\pi)$ is the Stein Fisher information (see Definition 2) of $\rho_s$ relative to $\pi$, which is typically used to quantify how close the probability distributions $(\rho_s)_{s=0}^{t}$ generated along this flow are to $\pi$. In particular, if our goal is to guarantee $\min_{0\le s\le t} I_{\mathrm{Stein}}(\rho_s\,|\,\pi) \le \varepsilon$, result (2) says that we need to take
$$t \ge \frac{D_{\mathrm{KL}}(\rho_0\,|\,\pi)}{\varepsilon}.$$
Unfortunately, and this is the key motivation for our work, the initial KL divergence $D_{\mathrm{KL}}(\rho_0\,|\,\pi)$ can be very large. Indeed, it can be proportional to the underlying dimension, which is highly problematic in high dimensional regimes. Salim et al. (2021) and Sun et al. (2022) have recently derived an iteration complexity bound for the infinite particle SVGD method. However, similarly to the time complexity of the continuous flow, their bound depends on $D_{\mathrm{KL}}(\rho_0\,|\,\pi)$.
1.1 SUMMARY OF CONTRIBUTIONS
In this paper, we design a family of continuous time flows, which we call β-SVGD flows, by combining importance weights with the kernelized gradient flow of the KL-divergence. Surprisingly, we prove that the time for this flow to converge to the equilibrium distribution $\pi$, that is, to reach $\min_{0\le s\le t} I_{\mathrm{Stein}}(\rho_s\,|\,\pi) \le \varepsilon$ with $(\rho_s)_{s=0}^{t}$ generated along the β-SVGD flow, can be bounded by $\frac{-1}{\varepsilon\beta(\beta+1)}$ when $\beta\in(-1,0)$. This indicates that the importance weights can potentially accelerate SVGD. We then design the β-SVGD method based on a discretization of the β-SVGD flow and provide a descent lemma for its population limit version. Simple experiments in Appendix D verify our predictions.
We summarize our contributions in the following:
• A new family of flows. We construct a family of continuous time flows for which we coin the name β-SVGD flows. These flows do not arise from a time re-parameterization of the SVGD flow, since their trajectories are different, nor can they be seen as the kernelized gradient flows of the Rényi divergence.
• Convergence rates. When $\beta \to 0$, the β-SVGD flow reduces to the kernelized gradient flow of the KL-divergence (the SVGD flow); when $\beta\in(-1,0)$, the convergence rate of the β-SVGD flows is significantly better than that of the SVGD flow whenever $D_{\mathrm{KL}}(\rho_0\,|\,\pi)$ is large. Under a Stein Poincaré inequality, we derive an exponential convergence rate of the 2-Rényi divergence along the 1-SVGD flow. The Stein Poincaré inequality is proved to be weaker than the Stein log-Sobolev inequality; however, like the Stein log-Sobolev inequality, it is not clear to us when it holds.
• Algorithm. We design the β-SVGD algorithm based on a discretization of the β-SVGD flow and derive a descent lemma for the population limit β-SVGD.
• Experiments. Finally, we run experiments to illustrate the advantages of β-SVGD with negative β. The simulation results on β-SVGD corroborate our theory.
1.2 RELATED WORKS
The SVGD sampling technique was first presented in the fundamental work of Liu & Wang (2016). Since then, a number of SVGD variants have been proposed. The following is a partial list: Newton version of SVGD (Detommaso et al., 2018), stochastic SVGD (Gorham et al., 2020), mirrored SVGD (Shi et al., 2021), random-batch SVGD (Li et al., 2020) and matrix kernel SVGD (Wang et al., 2019). The theoretical understanding of SVGD is still limited to population limit SVGD. The first work to demonstrate the convergence of SVGD in the population limit was by Liu (2017); Korba et al. (2020) then derived a similar descent lemma for the population limit SVGD using a different approach. However, their results relied on path information and thus were not self-contained. To provide a clean analysis, Salim et al. (2021) assumed a Talagrand's $T_1$ inequality for the target distribution $\pi$ and gave the first iteration complexity analysis in terms of the dimension $d$. Following the work of Salim et al. (2021), Sun et al. (2022) derived a descent lemma for the population limit SVGD under a non-smooth potential $V$.
In this paper, we consider a family of generalized divergences, the Rényi divergences, and SVGD with importance weights. For these two themes, we name a few, non-exhaustive, related results. Wang et al. (2018) proposed to use the f-divergence instead of the KL-divergence in the variational inference problem, where $f$ is a convex function; Yu et al. (2020) also considered variational inference with the f-divergence, but in its dual form. Han & Liu (2017) considered combining importance sampling with SVGD; however, the importance weights were only used to adjust the final sampled points, not inside the SVGD iterations as in this paper. Liu & Lee (2017) considered importance sampling and designed a black-box scheme to compute the importance weights (called Stein importance weights in their paper) of any set of points.
2 PRELIMINARIES
We assume the target distribution satisfies $\pi \propto e^{-V}$, and that we have an oracle to compute the value of $e^{-V(x)}$ for all $x\in\mathbb{R}^d$.
2.1 NOTATION
Let $x=(x_1,\dots,x_d)^\top, y=(y_1,\dots,y_d)^\top \in \mathbb{R}^d$; denote $\langle x,y\rangle := \sum_{i=1}^d x_i y_i$ and $\|x\| := \sqrt{\langle x,x\rangle}$. For a square matrix $B\in\mathbb{R}^{d\times d}$, the operator norm and Frobenius norm of $B$ are defined by $\|B\|_{\mathrm{op}} := \sqrt{\varrho(B^\top B)}$ and $\|B\|_F := \sqrt{\sum_{i=1}^d\sum_{j=1}^d B_{i,j}^2}$, respectively, where $\varrho$ denotes the spectral radius. It is easy to verify that $\|B\|_{\mathrm{op}} \le \|B\|_F$. Let $\mathcal{P}_2(\mathbb{R}^d)$ denote the space of probability measures with finite second moment; that is, for any $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ we have $\int \|x\|^2\, d\mu(x) < +\infty$. The Wasserstein 2-distance between $\rho,\mu\in\mathcal{P}_2(\mathbb{R}^d)$ is defined by
$$W_2(\rho,\mu) := \inf_{\eta\in\Gamma(\rho,\mu)} \sqrt{\int \|x-y\|^2\, d\eta(x,y)},$$
where $\Gamma(\rho,\mu)$ is the set of all joint distributions on $\mathbb{R}^d\times\mathbb{R}^d$ having $\rho$ and $\mu$ as marginals. The push-forward distribution of $\rho\in\mathcal{P}_2(\mathbb{R}^d)$ by a map $T:\mathbb{R}^d\to\mathbb{R}^d$, denoted $T_\#\rho$, is defined as follows: for any measurable set $\Omega\subset\mathbb{R}^d$, $T_\#\rho(\Omega) := \rho(T^{-1}(\Omega))$. By definition of the push-forward distribution, it is not hard to verify that the probability densities satisfy $T_\#\rho(T(x))\,|\det DT(x)| = \rho(x)$, where $DT$ is the Jacobian matrix of $T$. The reader can refer to Villani (2009) for more details.
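To make the push-forward identity concrete, the following NumPy sketch (our own illustration, not from the paper) checks $T_\#\rho(T(x))\,|\det DT(x)| = \rho(x)$ for a linear map applied to a standard Gaussian; the particular map $T(x)=Ax+b$ and the Gaussian densities are assumptions chosen only for this example.

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 3
rng = np.random.default_rng(0)

# A linear push-forward map T(x) = A x + b (illustrative choice, invertible with high probability).
A = rng.normal(size=(d, d)) + 3 * np.eye(d)
b = rng.normal(size=d)

# rho = N(0, I); its push-forward under T is the Gaussian N(b, A A^T).
rho = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))
T_rho = multivariate_normal(mean=b, cov=A @ A.T)

x = rng.normal(size=d)
Tx = A @ x + b
jacobian_det = abs(np.linalg.det(A))   # |det DT(x)| = |det A| for a linear map

# Change-of-variables identity: T_#rho(T(x)) |det DT(x)| = rho(x).
lhs = T_rho.pdf(Tx) * jacobian_det
rhs = rho.pdf(x)
print(lhs, rhs)                        # the two numbers agree up to floating-point error
```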
2.2 RÉNYI DIVERGENCE
Next, we define the Rényi divergence, which plays an important role in information theory and many other areas such as hypothesis testing (Morales González et al., 2013) and multiple source adaptation (Mansour et al., 2012).
Definition 1 (Rényi divergence) For two probability distributions $\rho$ and $\mu$ on $\mathbb{R}^d$ with $\rho \ll \mu$, the Rényi divergence of positive order $\alpha$ is defined as
$$D_\alpha(\rho\,|\,\mu) := \begin{cases} \dfrac{1}{\alpha-1}\log \displaystyle\int \left(\dfrac{\rho}{\mu}\right)^{\alpha-1}(x)\, d\rho(x), & 0<\alpha<\infty,\ \alpha\neq 1,\\[6pt] \displaystyle\int \log\dfrac{\rho}{\mu}(x)\, d\rho(x), & \alpha = 1. \end{cases} \tag{3}$$
If $\rho$ is not absolutely continuous with respect to $\mu$, we set $D_\alpha(\rho\,|\,\mu) = \infty$. Further, we denote $D_{\mathrm{KL}}(\rho\,|\,\mu) := D_1(\rho\,|\,\mu)$.
The Rényi divergence is non-negative, continuous and non-decreasing in the parameter $\alpha$; specifically, we have $D_{\mathrm{KL}}(\rho\,|\,\mu) = \lim_{\alpha\to 1} D_\alpha(\rho\,|\,\mu)$. More properties of the Rényi divergence can be found in the comprehensive article by Van Erven & Harremos (2014). Besides the Rényi divergence, there are other generalizations of the KL-divergence, e.g., admissible relative entropies (Arnold et al., 2001).
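As a concrete illustration of (3), the following sketch estimates $D_\alpha(\rho\,|\,\mu)$ by plain Monte Carlo when both densities can be evaluated and samples from $\rho$ are available; the one-dimensional Gaussian choices for $\rho$ and $\mu$ are our own example, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def renyi_divergence_mc(samples, log_rho, log_mu, alpha):
    """Monte Carlo estimate of D_alpha(rho | mu) from samples of rho, following (3)."""
    log_ratio = log_rho(samples) - log_mu(samples)
    if np.isclose(alpha, 1.0):
        # alpha = 1 recovers the KL-divergence.
        return np.mean(log_ratio)
    # D_alpha = 1/(alpha-1) * log E_rho[(rho/mu)^(alpha-1)], computed in log-space for stability.
    return (np.logaddexp.reduce((alpha - 1.0) * log_ratio) - np.log(len(samples))) / (alpha - 1.0)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=100_000)        # samples from rho = N(1, 1)
log_rho = lambda t: norm.logpdf(t, loc=1.0, scale=1.0)
log_mu = lambda t: norm.logpdf(t, loc=0.0, scale=1.0)   # mu = N(0, 1)

for alpha in (0.5, 1.0, 2.0):
    print(alpha, renyi_divergence_mc(x, log_rho, log_mu, alpha))
# For Gaussians with equal variance, D_alpha = alpha * (mean gap)^2 / 2 = alpha / 2 here.
```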
2.3 BACKGROUND ON SVGD
Stein Variational Gradient Descent (SVGD) is defined on a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_0$ with a non-negative definite reproducing kernel $k:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}_+$. The key feature of this space is its reproducing property:
$$f(x) = \langle f(\cdot), k(x,\cdot)\rangle_{\mathcal{H}_0}, \quad \forall f\in\mathcal{H}_0, \tag{4}$$
where $\langle\cdot,\cdot\rangle_{\mathcal{H}_0}$ is the inner product defined on $\mathcal{H}_0$. Let $\mathcal{H}$ be the $d$-fold Cartesian product of $\mathcal{H}_0$. That is, $f\in\mathcal{H}$ if and only if there exist $f_1,\dots,f_d\in\mathcal{H}_0$ such that $f=(f_1,\dots,f_d)^\top$. Naturally, the inner product on $\mathcal{H}$ is given by
$$\langle f,g\rangle_{\mathcal{H}} := \sum_{i=1}^d \langle f_i, g_i\rangle_{\mathcal{H}_0}, \quad f=(f_1,\dots,f_d)^\top\in\mathcal{H},\ g=(g_1,\dots,g_d)^\top\in\mathcal{H}. \tag{5}$$
For more details on RKHS, the reader can refer to Berlinet & Thomas-Agnan (2011).
It is well known (see for example Ambrosio et al. (2005)) that $\nabla\log\frac{\rho}{\pi}$ is the Wasserstein gradient of $D_{\mathrm{KL}}(\cdot\,|\,\pi)$ at $\rho\in\mathcal{P}_2(\mathbb{R}^d)$. Liu & Wang (2016) proposed a kernelized Wasserstein gradient of the KL-divergence, defined by
$$g_\rho(x) := \int k(x,y)\,\nabla\log\frac{\rho}{\pi}(y)\, d\rho(y) \in \mathcal{H}. \tag{6}$$
Integration by parts yields
$$g_\rho(x) = -\int \left[\nabla\log\pi(y)\, k(x,y) + \nabla_y k(x,y)\right] d\rho(y). \tag{7}$$
Comparing the Wasserstein gradient $\nabla\log\frac{\rho}{\pi}$ with (7), we find that the latter can be easily approximated by
$$g_\rho(x) \approx \hat g_{\hat\rho}(x) := -\frac{1}{N}\sum_{i=1}^N \left[\nabla\log\pi(x_i)\, k(x,x_i) + \nabla_{x_i} k(x,x_i)\right], \tag{8}$$
with $\hat\rho = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}$ and $(x_i)_{i=1}^N$ sampled from $\rho$. With the above notation, the SVGD update rule
$$x_i \leftarrow x_i + \frac{\gamma}{N}\sum_{j=1}^N \left[\nabla\log\pi(x_j)\, k(x_i,x_j) + \nabla_{x_j} k(x_i,x_j)\right], \quad i=1,\dots,N, \tag{9}$$
where $\gamma$ is the step-size, can be presented in the compact form $\hat\rho \leftarrow (I - \gamma \hat g_{\hat\rho})_\#\hat\rho$. When we talk about infinite particle SVGD, or population limit SVGD, we mean $\rho \leftarrow (I - \gamma g_\rho)_\#\rho$. The metric used in the study of SVGD is the Stein Fisher information, or the Kernelized Stein Discrepancy (KSD).
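To make the update (9) concrete, here is a short NumPy sketch of SVGD with an RBF kernel; the kernel, bandwidth, step-size and the Gaussian target below are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def rbf_kernel(x, h):
    """RBF kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2h)) and its gradient in the second argument."""
    diff = x[:, None, :] - x[None, :, :]              # (N, N, d), diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)              # (N, N)
    K = np.exp(-sq_dist / (2.0 * h))                  # (N, N)
    grad_K = diff / h * K[:, :, None]                 # grad_{x_j} k(x_i, x_j) = (x_i - x_j)/h * k
    return K, grad_K

def svgd_step(x, grad_log_pi, gamma, h):
    """One SVGD update (9): x_i <- x_i + gamma/N * sum_j [grad log pi(x_j) k(x_i,x_j) + grad_{x_j} k(x_i,x_j)]."""
    N = x.shape[0]
    K, grad_K = rbf_kernel(x, h)
    score = grad_log_pi(x)                            # (N, d), row j is grad log pi(x_j)
    phi = (K @ score + grad_K.sum(axis=1)) / N        # (N, d)
    return x + gamma * phi

# Illustrative target: pi = N(0, I) in d = 2, so grad log pi(x) = -x.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, size=(200, 2))                # particles initialized far from pi
for _ in range(500):
    x = svgd_step(x, lambda z: -z, gamma=0.1, h=1.0)
print(x.mean(axis=0), x.std(axis=0))                  # should be roughly (0, 0) and (1, 1)
```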
Definition 2 (Stein Fisher Information) Let $\rho\in\mathcal{P}_2(\mathbb{R}^d)$. The Stein Fisher Information of $\rho$ relative to $\pi$ is defined by
$$I_{\mathrm{Stein}}(\rho\,|\,\pi) := \int\!\!\int k(x,y)\left\langle \nabla\log\frac{\rho}{\pi}(x),\, \nabla\log\frac{\rho}{\pi}(y)\right\rangle d\rho(x)\, d\rho(y). \tag{10}$$
A sufficient condition under which $\lim_{n\to\infty} I_{\mathrm{Stein}}(\rho_n\,|\,\pi) = 0$ implies that $\rho_n$ converges weakly to $\pi$ can be found in Gorham & Mackey (2017); it requires: i) the kernel $k$ to be of the form $k(x,y) = \left(c^2 + \|x-y\|^2\right)^{\theta}$ for some $c>0$ and $\theta\in(-1,0)$; ii) $\pi\propto e^{-V}$ to be distantly dissipative; roughly speaking, this requires $V$ to be convex outside a compact set, see Gorham & Mackey (2017) for a precise definition.
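In practice, $I_{\mathrm{Stein}}(\rho\,|\,\pi)$ is typically estimated from particles through the kernelized Stein discrepancy form, which, under standard regularity conditions, can be written using only $\nabla\log\pi$ and the kernel. The sketch below is our own illustration of the usual V-statistic estimator with an RBF kernel; the bandwidth and the Gaussian example are assumptions, not the paper's setup.

```python
import numpy as np

def ksd_vstat(x, grad_log_pi, h):
    """V-statistic estimate of I_Stein(rho | pi) from particles x ~ rho, using the RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2h)) and the Stein kernel, which only needs grad log pi."""
    N, d = x.shape
    s = grad_log_pi(x)                                   # (N, d), s[i] = grad log pi(x_i)
    diff = x[:, None, :] - x[None, :, :]                 # (N, N, d)
    sq_dist = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * h))

    term1 = (s @ s.T) * K                                # s(x)^T s(y) k(x, y)
    term2 = np.einsum('id,ijd->ij', s, diff) / h * K     # s(x)^T grad_y k(x, y), grad_y k = (x - y)/h k
    term3 = np.einsum('jd,ijd->ij', s, -diff) / h * K    # s(y)^T grad_x k(x, y), grad_x k = (y - x)/h k
    term4 = (d / h - sq_dist / h ** 2) * K               # trace of grad_x grad_y k(x, y)
    return np.mean(term1 + term2 + term3 + term4)

# Illustrative check: particles drawn from pi = N(0, I) itself should give a small value.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
print(ksd_vstat(x, lambda z: -z, h=1.0))
```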
In the study of the kernelized Wasserstein gradient (7) and its corresponding continuity equation
$$\frac{\partial \rho_t}{\partial t} + \operatorname{div}(\rho_t g_{\rho_t}) = 0,$$
Duncan et al. (2019) introduced the following kernelized log-Sobolev inequality to prove the exponential convergence of $D_{\mathrm{KL}}(\rho_t\,|\,\pi)$ along the direction of (7):
Definition 3 (Stein log-Sobolev inequality) We say $\pi$ satisfies the Stein log-Sobolev inequality with constant $\lambda > 0$ if
$$D_{\mathrm{KL}}(\rho\,|\,\pi) \le \frac{1}{2\lambda}\, I_{\mathrm{Stein}}(\rho\,|\,\pi). \tag{11}$$
While this inequality can guarantee an exponential convergence rate of $\rho_t$ to $\pi$, quantified by the KL-divergence, the condition for $\pi$ to satisfy the Stein log-Sobolev inequality is very restrictive. In fact, little is known about when (11) holds.
3 CONTINUOUS TIME DYNAMICS OF THE β-SVGD FLOW
In this section, we mainly focus on the continuous time dynamics of the β-SVGD flow. Due to space limitations, we defer all proofs to Appendix B.
Figure 1: The performance of β-SVGD with $\beta = -0.5, 0, 0.5$ (from left to right), with the same step-size. The blue dashed line is the target distribution $\pi$: the Gaussian mixture $\frac{2}{5}\mathcal{N}(2,1) + \frac{3}{5}\mathcal{N}(6,1)$. The green solid line is the distribution generated by β-SVGD after 100 iterations. More experiments can be found in Appendix D.
3.1 β-SVGD FLOW
In this paper, a flow refers to a time-dependent vector field $v_t:\mathbb{R}^d\to\mathbb{R}^d$. This time-dependent vector field transports mass on $\mathbb{R}^d$ according to the continuity equation, or equation of conservation of mass,
$$\frac{\partial \rho_t}{\partial t} + \operatorname{div}(\rho_t v_t) = 0; \tag{12}$$
the reader can refer to Ambrosio et al. (2005) for more details.
Definition 4 (β-SVGD flow) Given a weight parameter $\beta\in(-1,+\infty)$, the β-SVGD flow is given by
$$v_t^{\beta}(x) := -\left(\frac{\pi}{\rho_t}\right)^{\beta}(x) \int k(x,y)\,\nabla\log\frac{\rho_t}{\pi}(y)\, d\rho_t(y). \tag{13}$$
Note that when $\beta = 0$, this is the negative of the kernelized Wasserstein gradient (6).
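The paper builds the β-SVGD algorithm from a discretization of (13); the exact way the importance weight $(\pi/\rho_t)^{\beta}$ is estimated is specified in the algorithm section and is not reproduced here. Purely as an illustration, the sketch below approximates $\rho_t$ by a kernel density estimate over the particles and self-normalizes and clips the weights; these choices, together with the Gaussian target, are our own assumptions, not the scheme prescribed by the paper.

```python
import numpy as np

def beta_svgd_step(x, log_pi_unnorm, grad_log_pi, beta, gamma, h, w_clip=10.0):
    """One discretized beta-SVGD step in the spirit of (13): each particle moves along an
    estimate of (pi/rho_t)^beta times the usual SVGD direction (9). Here rho_t is approximated
    by a Gaussian kernel density estimate over the particles (a hypothetical choice)."""
    N, d = x.shape
    diff = x[:, None, :] - x[None, :, :]                 # (N, N, d)
    sq_dist = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * h))
    grad_K = diff / h * K[:, :, None]

    # Standard SVGD direction, as in (9).
    phi = (K @ grad_log_pi(x) + grad_K.sum(axis=1)) / N

    # log (pi/rho_t)^beta up to an additive constant; the constant cancels after
    # self-normalization because pi is only known up to its normalizing constant.
    log_rho_kde = np.log(K.mean(axis=1) + 1e-12) - 0.5 * d * np.log(2.0 * np.pi * h)
    log_w = beta * (log_pi_unnorm(x) - log_rho_kde)
    w = np.exp(log_w - log_w.max())
    w = np.clip(w / w.mean(), 1.0 / w_clip, w_clip)      # practical safeguard (our own choice)
    return x + gamma * w[:, None] * phi

# Illustrative run on a Gaussian target pi = N(0, I) (assumed, not from the paper).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=(200, 2))
for _ in range(500):
    x = beta_svgd_step(x, lambda z: -0.5 * np.sum(z ** 2, axis=-1), lambda z: -z,
                       beta=-0.5, gamma=0.05, h=1.0)
print(x.mean(axis=0))                                    # should drift close to the target mean (0, 0)
```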
Note that we cannot treat the β-SVGD flow as the kernelized Wasserstein gradient flow of the $(\beta+1)$-Rényi divergence. However, the two are closely related, and we can derive the following theorem.
Theorem 1 (Main result) Along the β-SVGD flow (13), we have¹
$$\min_{t\in[0,T]} I_{\mathrm{Stein}}(\rho_t\,|\,\pi) \le \frac{1}{T}\int_0^T I_{\mathrm{Stein}}(\rho_t\,|\,\pi)\, dt \le \begin{cases} \dfrac{e^{\beta D_{\beta+1}(\rho_0|\pi)}}{T\beta(\beta+1)}, & \beta>0,\\[6pt] \dfrac{D_{\mathrm{KL}}(\rho_0\,|\,\pi)}{T}, & \beta=0,\\[6pt] \dfrac{-1}{T\beta(\beta+1)}, & \beta\in(-1,0). \end{cases} \tag{14}$$
Note that the left hand side of (14) is the Stein Fisher information. When $\beta$ decreases from positive to negative values, the right hand side of (14) changes dramatically; for $\beta\in(-1,0)$ it appears to be independent of $\rho_0$ and $\pi$. If we do not know the Rényi divergence between $\rho_0$ and $\pi$, it seems the best convergence rate is obtained by setting $\beta = -\frac{1}{2}$, that is,
$$\min_{t\in[0,T]} I_{\mathrm{Stein}}(\rho_t\,|\,\pi) \le \frac{1}{T}\int_0^T I_{\mathrm{Stein}}(\rho_t\,|\,\pi)\, dt \le \frac{4}{T}.$$
It is somewhat unexpected to observe that the time complexity is independent of $\rho_0$ and $\pi$, or, to be more precise, that it depends only very weakly on $\rho_0$ and $\pi$ when $\beta\in(-1,0)$. We wish to stress that this is not achieved by time re-parameterization. In the proof of Theorem 1, we can see that the term $(\pi/\rho_t)^{\beta}$ in the β-SVGD flow (13) is utilized to cancel the term $(\rho_t/\pi)^{\beta}$ in the Wasserstein gradient of the $(\beta+1)$-Rényi divergence. Actually, when $\beta\in(-1,0)$, this term has an added advantage and can be seen as an acceleration factor in front of the kernelized Wasserstein gradient of the KL-divergence. Specifically, the negative kernelized Wasserstein gradient of the KL-divergence, $v_t^{0}(x) := -\int k(x,y)\,\nabla\log\frac{\rho_t}{\pi}(y)\, d\rho_t(y)$, is the vector field that compels $\rho_t$ to approach $\pi$, while
¹In fact, the proof in Appendix B yields a stronger result. When $\beta\in(-1,0)$, the right hand side of (14) is only weakly dependent on $\rho_0$ and $\pi$ and can be taken to be $\frac{e^{\beta D_{\beta+1}(\rho_T|\pi)} - e^{\beta D_{\beta+1}(\rho_0|\pi)}}{T\,|\beta(\beta+1)|}$, which is less than $\frac{-1}{T\beta(\beta+1)}$.