Alpha-divergence Variational Inference Meets Importance Weighted
Auto-Encoders: Methodology and Asymptotics
Kamélia Daudel Joe Benton* Yuyang Shi* Arnaud Doucet
Department of Statistics, University of Oxford, United Kingdom
*: Equal contribution
ABSTRACT
Several algorithms involving the Variational Rényi (VR) bound have been proposed to
minimize an alpha-divergence between a target posterior distribution and a variational
distribution. Despite promising empirical results, those algorithms resort to biased stochastic
gradient descent procedures and thus lack theoretical guarantees. In this paper, we formalize
and study the VR-IWAE bound, a generalization of the Importance Weighted Auto-Encoder
(IWAE) bound. We show that the VR-IWAE bound enjoys several desirable properties and
notably leads to the same stochastic gradient descent procedure as the VR bound in the
reparameterized case, but this time by relying on unbiased gradient estimators. We then
provide two complementary theoretical analyses of the VR-IWAE bound and thus of the
standard IWAE bound. Those analyses shed light on the benefits or lack thereof of these
bounds. Lastly, we illustrate our theoretical claims over toy and real-data examples.
Keywords Variational Inference · Alpha-Divergence · Importance Weighted Auto-encoder · High dimension · Weight collapse
1 Introduction
Variational inference methods aim at finding the best approximation to a target posterior density within a so-called variational family of probability densities. This best approximation is traditionally obtained by minimizing the exclusive Kullback–Leibler divergence [Wainwright and Jordan, 2008, Blei et al., 2017]; however, this divergence is known to have some drawbacks [for instance variance underestimation, see Minka, 2005].
As a result, alternative divergences have been explored [Minka, 2005, Li and Turner, 2016, Bui et al., 2016, Dieng et al., 2017, Li and Gal, 2017, Wang et al., 2018, Daudel et al., 2021, 2023, Daudel and Douc, 2021, Rodríguez-Santana and Hernández-Lobato, 2022], in particular the class of alpha-divergences. This family of divergences is indexed by a scalar $\alpha$. It provides additional flexibility that can in theory be used to overcome the obstacles associated with the exclusive Kullback–Leibler divergence (which is recovered by letting $\alpha \to 1$).

Among those methods, techniques involving the Variational Rényi (VR) bound introduced in Li and Turner [2016] have led to promising empirical results and have been linked to key algorithms such as the Importance Weighted Auto-encoder (IWAE) algorithm [Burda et al., 2016] in the special case $\alpha = 0$ and the Black-Box Alpha (BB-$\alpha$) algorithm [Hernandez-Lobato et al., 2016].
Yet methods based on the VR bound are seen as lacking theoretical guarantees. This comes from the fact that they are classified as biased in the community: by selecting the VR bound as the objective function, those methods indeed resort to biased gradient estimators [Li and Turner, 2016, Hernandez-Lobato et al., 2016, Bui et al., 2016, Li and Gal, 2017, Geffner and Domke, 2020, 2021, Zhang et al., 2021, Rodríguez-Santana and Hernández-Lobato, 2022].
Geffner and Domke [2020] have recently provided insights from an empirical perspective regarding the magnitude of the bias and its impact on the outcome of the optimization procedure when the (biased) reparameterized gradient estimator of the VR bound is used. They observe that the resulting algorithm appears to require an impractically large amount of computation to actually optimise the VR bound as the dimension increases (and otherwise seems to simply return minimizers of the exclusive Kullback–Leibler divergence). They postulate that this effect might be due to a weight degeneracy behavior [Bengtsson et al., 2008], but this behavior is not quantified precisely from a theoretical point of view.
In this paper, our goal is to (i) develop theoretical guarantees for VR-based variational inference methods and
(ii) construct a theoretical framework elucidating the weight degeneracy behavior that has been empirically
observed for those techniques. The rest of this paper is organized as follows:
In Section 2, we provide some background notation and we review the main concepts behind the VR bound.

In Section 3, we introduce the VR-IWAE bound. We show in Proposition 1 that this bound, previously defined by Li and Turner [2016] as the expectation of the biased Monte Carlo approximation of the VR bound, can actually be interpreted as a variational bound which depends on a hyperparameter $\alpha$ with $\alpha \in [0, 1)$. In addition, we obtain that the VR-IWAE bound leads to the same stochastic gradient descent procedure as the VR bound in the reparameterized case. Unlike the VR bound, the VR-IWAE bound relies on unbiased gradient estimators and coincides with the IWAE bound for $\alpha = 0$, fully bridging the gap between both methodologies.

We then generalize the approach of Rainforth et al. [2018] – which characterizes the Signal-to-Noise Ratio (SNR) of the reparameterized gradient estimators of the IWAE – to the VR-IWAE bound and establish that the VR-IWAE bound with $\alpha \in (0, 1)$ enjoys better theoretical properties than the IWAE bound (Theorem 1). To further tackle potential SNR difficulties, we also extend the doubly-reparameterized gradient estimator of the IWAE [Tucker et al., 2019] to the VR-IWAE bound (Theorem 2).

In Section 4, we provide a thorough theoretical study of the VR-IWAE bound. Following Domke and Sheldon [2018], we start by investigating the case where the dimension of the latent space $d$ is fixed and the number of Monte Carlo samples $N$ in the VR-IWAE bound goes to infinity (Theorem 3). Our analysis shows that the hyperparameter $\alpha$ allows us to balance between an error term depending on both the encoder and the decoder parameters $(\theta, \phi)$ and a term going to zero at a $1/N$ rate. This suggests that tuning $\alpha$ can be beneficial to obtain the best empirical performances.

However, the relevance of such an analysis can be limited for a high-dimensional latent space $d$ (Examples 1 and 2). We then propose a novel analysis where $N$ does not grow as fast as exponentially with $d$ (Theorems 4 and 5) or sub-exponentially with $d^{1/3}$ (Theorem 6), which we use to revisit Examples 1 and 2 in Examples 3 and 4 respectively. This analysis suggests that in these regimes the VR-IWAE bound, and hence in particular the IWAE bound, are of limited interest.

In Section 5, we detail how our work relates to the existing literature.

Lastly, Section 6 provides empirical evidence illustrating our theoretical claims for both toy and real-data examples.
2 Background
Given a model with joint distribution $p_\theta(x, z)$ parameterized by $\theta$, where $x$ denotes an observation and $z$ is a latent variable valued in $\mathbb{R}^d$, one is interested in finding the parameter $\theta$ which best describes the observations $\mathcal{D} = \{x_1, \ldots, x_T\}$. This will be our running example. The corresponding posterior density satisfies:
$$
p_\theta(z \mid \mathcal{D}) \propto \prod_{i=1}^T p_\theta(x_i, z_i) \tag{1}
$$
with $z = (z_1, \ldots, z_T)$, so that the marginal log likelihood reads
$$
\ell(\theta; \mathcal{D}) = \sum_{i=1}^T \ell(\theta; x_i) \quad \text{with} \quad \ell(\theta; x) := \log p_\theta(x) = \log \int p_\theta(x, z)\, \mathrm{d}z. \tag{2}
$$
Unfortunately, as this marginal log likelihood is typically intractable, finding $\theta$ maximizing it is difficult. Variational bounds are then designed to act as surrogate objective functions more amenable to optimization.
Let $q_\phi(z|x)$ be a variational encoder parameterized by $\phi$; common variational bounds are the Evidence Lower BOund (ELBO) and the IWAE bound [Burda et al., 2016]:
$$
\mathrm{ELBO}(\theta, \phi; x) = \int q_\phi(z|x) \log w_{\theta,\phi}(z; x)\, \mathrm{d}z,
$$
$$
\ell_N^{(\mathrm{IWAE})}(\theta, \phi; x) = \int\!\!\int \prod_{i=1}^N q_\phi(z_i|x)\, \log\!\left( \frac{1}{N} \sum_{j=1}^N w_{\theta,\phi}(z_j; x) \right) \mathrm{d}z_{1:N}, \qquad N \in \mathbb{N}^\star,
$$
where for all $z \in \mathbb{R}^d$,
$$
w_{\theta,\phi}(z; x) = \frac{p_\theta(x, z)}{q_\phi(z|x)}.
$$
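For concreteness, here is a minimal, self-contained NumPy sketch of these quantities; the toy Gaussian model, the encoder parameters and all function names below are purely illustrative assumptions and are not taken from the text.

```python
import numpy as np
from scipy.special import logsumexp

def log_normal(y, mean, var):
    """Sum of coordinate-wise log-densities of a diagonal Gaussian N(mean, var)."""
    return -0.5 * np.sum((y - mean) ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)

def log_weight(x, z, mu, sigma):
    """log w_{theta,phi}(z; x) = log p_theta(x, z) - log q_phi(z|x) for a toy model with
    prior p(z) = N(0, I), likelihood p(x|z) = N(x; z, I) and encoder q_phi(z|x) = N(mu, diag(sigma^2))."""
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    log_q = log_normal(z, mu, sigma ** 2)
    return log_joint - log_q

def elbo_estimate(log_w):
    """Monte Carlo estimate of the ELBO: average of the log-weights."""
    return np.mean(log_w)

def iwae_estimate(log_w):
    """Unbiased Monte Carlo estimate of the N-sample IWAE bound:
    log( (1/N) * sum_j w_j ), computed stably in log space."""
    return logsumexp(log_w) - np.log(log_w.shape[0])

rng = np.random.default_rng(0)
d, N = 5, 64                                  # latent dimension and number of samples
x = rng.normal(size=d)                        # one observation
mu, sigma = 0.5 * x, np.full(d, 0.8)          # illustrative encoder parameters phi
eps = rng.normal(size=(N, d))                 # eps ~ N(0, I_d)
z = mu + sigma * eps                          # reparameterized samples z = f(eps, phi; x)
log_w = log_weight(x, z, mu, sigma)           # N log-weights
print(elbo_estimate(log_w), iwae_estimate(log_w))  # IWAE estimate >= ELBO estimate (Jensen)
```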
The IWAE bound generalizes the ELBO (which is recovered for $N = 1$) and acts as a lower bound on $\ell(\theta; x)$ that can be estimated in an unbiased manner. Instead of maximizing $\ell(\theta; \mathcal{D})$ defined in (2), one then considers the surrogate objective
$$
\sum_{i=1}^T \ell_N^{(\mathrm{IWAE})}(\theta, \phi; x_i)
$$
which is optimized by performing stochastic gradient descent steps w.r.t. $(\theta, \phi)$ on it, combined with mini-batching. Optimizing this objective w.r.t. $\phi$ is difficult due to high-variance gradients with low Signal-to-Noise Ratio [Rainforth et al., 2018]. To mitigate this problem, reparameterized [Kingma and Welling, 2014, Burda et al., 2016] and doubly-reparameterized gradient estimators [Tucker et al., 2019] have been proposed.
Crucially, stochastic gradient schemes on the IWAE bound (and hence on the ELBO) only resort to unbiased estimators in both the reparameterized [Kingma and Welling, 2014, Burda et al., 2016] and the doubly-reparameterized [Tucker et al., 2019] cases, providing theoretical justifications behind those approaches. In particular, under the assumption that $z$ can be reparameterized (that is, $z = f(\varepsilon, \phi; x) \sim q_\phi(\cdot|x)$ where $\varepsilon \sim q$) and under common differentiability assumptions, the reparameterized gradient w.r.t. $\phi$ of the IWAE bound is given by
$$
\nabla_\phi\, \ell_N^{(\mathrm{IWAE})}(\theta, \phi; x) = \int\!\!\int \prod_{i=1}^N q(\varepsilon_i) \left[ \sum_{j=1}^N \frac{w_{\theta,\phi}(z_j; x)}{\sum_{k=1}^N w_{\theta,\phi}(z_k; x)}\, \nabla_\phi \log w_{\theta,\phi}(f(\varepsilon_j, \phi; x); x) \right] \mathrm{d}\varepsilon_{1:N}
$$
and the doubly-reparameterized one by
$$
\nabla_\phi\, \ell_N^{(\mathrm{IWAE})}(\theta, \phi; x) = \int\!\!\int \prod_{i=1}^N q(\varepsilon_i) \left[ \sum_{j=1}^N \left( \frac{w_{\theta,\phi}(z_j; x)}{\sum_{k=1}^N w_{\theta,\phi}(z_k; x)} \right)^{\!2} \nabla_\phi \log w_{\theta,\bar{\phi}}(f(\varepsilon_j, \phi; x); x) \Big|_{\bar{\phi} = \phi} \right] \mathrm{d}\varepsilon_{1:N}. \tag{3}
$$
Unbiased Monte Carlo estimators of both gradients are hence respectively given by
$$
\sum_{j=1}^N \frac{w_{\theta,\phi}(z_j; x)}{\sum_{k=1}^N w_{\theta,\phi}(z_k; x)}\, \nabla_\phi \log w_{\theta,\phi}(f(\varepsilon_j, \phi; x); x) \tag{4}
$$
and
$$
\sum_{j=1}^N \left( \frac{w_{\theta,\phi}(z_j; x)}{\sum_{k=1}^N w_{\theta,\phi}(z_k; x)} \right)^{\!2} \nabla_\phi \log w_{\theta,\bar{\phi}}(f(\varepsilon_j, \phi; x); x) \Big|_{\bar{\phi} = \phi},
$$
with $\varepsilon_1, \ldots, \varepsilon_N$ being i.i.d. samples generated from $q$ and $z_j = f(\varepsilon_j, \phi; x)$ for all $j = 1, \ldots, N$. Maddison et al. [2017] and Domke and Sheldon [2018] in particular established that the variational gap – that is, the difference between the IWAE bound and the marginal log-likelihood – goes to zero at a fast $1/N$ rate when the dimension of the latent space $d$ is fixed and the number of samples $N$ goes to infinity.
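For illustration, the sketch below (an assumption-laden toy, not code from the text) assembles the two unbiased Monte Carlo gradient estimators above from the log-weights and from per-sample gradients $\nabla_\phi \log w_{\theta,\phi}(f(\varepsilon_j, \phi; x); x)$, which are assumed to be supplied, e.g. by automatic differentiation.

```python
import numpy as np
from scipy.special import softmax

def iwae_reparam_grad(log_w, grad_log_w):
    """Estimator (4): sum_j [w_j / sum_k w_k] * grad_phi log w_j.
    log_w has shape (N,); grad_log_w has shape (N, P), one row per sample."""
    w_tilde = softmax(log_w)          # self-normalized importance weights
    return w_tilde @ grad_log_w       # shape (P,)

def iwae_doubly_reparam_grad(log_w, grad_log_w_path):
    """Doubly-reparameterized estimator: squared normalized weights; the supplied
    per-sample gradients must flow only through the sample path z_j = f(eps_j, phi; x)
    (in an autodiff framework this is enforced with a stop-gradient on the weights)."""
    w_tilde = softmax(log_w)
    return (w_tilde ** 2) @ grad_log_w_path

# Illustrative usage with synthetic inputs (the same array stands in for both kinds
# of per-sample gradients here, purely for demonstration):
rng = np.random.default_rng(1)
N, P = 16, 3                          # number of samples and of variational parameters
log_w = rng.normal(size=N)
grad_log_w = rng.normal(size=(N, P))
print(iwae_reparam_grad(log_w, grad_log_w))
print(iwae_doubly_reparam_grad(log_w, grad_log_w))
```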
Another example of variational bound is the Variational Rényi (VR) bound introduced by Li and Turner [2016]: it is defined for all $\alpha \in \mathbb{R} \setminus \{1\}$ by
$$
\mathcal{L}^{(\alpha)}(\theta, \phi; x) = \frac{1}{1-\alpha} \log \int q_\phi(z|x)\, w_{\theta,\phi}(z; x)^{1-\alpha}\, \mathrm{d}z \tag{5}
$$
and it generalizes the ELBO [which corresponds to the extension by continuity of the VR bound to the case $\alpha = 1$, see Li and Turner, 2016, Theorem 1]. It is also a lower (resp. upper) bound on the marginal log-likelihood $\ell(\theta; x)$ for all $\alpha > 0$ (resp. $\alpha < 0$).
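For completeness, the continuity claim can be checked directly. Write $g(\alpha) = \log \int q_\phi(z|x)\, w_{\theta,\phi}(z; x)^{1-\alpha}\, \mathrm{d}z$, so that $\mathcal{L}^{(\alpha)}(\theta, \phi; x) = g(\alpha)/(1-\alpha)$ and $g(1) = \log \int q_\phi(z|x)\, \mathrm{d}z = 0$; then, under standard regularity conditions allowing differentiation under the integral sign,
$$
\lim_{\alpha \to 1} \mathcal{L}^{(\alpha)}(\theta, \phi; x) = -g'(1) = \int q_\phi(z|x) \log w_{\theta,\phi}(z; x)\, \mathrm{d}z = \mathrm{ELBO}(\theta, \phi; x).
$$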
In the spirit of the IWAE bound optimisation framework, the VR bound is used for variational inference purposes in [Li and Turner, 2016, Sections 4.1, 4.2 and 5.2] to optimise the marginal log-likelihood $\ell(\theta; \mathcal{D})$ defined in (2) by considering the global objective function
$$
\sum_{i=1}^T \mathcal{L}^{(\alpha)}(\theta, \phi; x_i)
$$
and by performing stochastic gradient descent steps w.r.t. $(\theta, \phi)$ on it, paired up with mini-batching and reparameterization. This VR bound methodology has provided positive empirical results compared to the usual case $\alpha = 1$ and has been widely adopted in the literature [Li and Turner, 2016, Bui et al., 2016, Hernandez-Lobato et al., 2016, Li and Gal, 2017, Zhang et al., 2021, Rodríguez-Santana and Hernández-Lobato, 2022]. As discussed in the remark below, this methodology is obviously not limited to the choice of posterior density defined in (1) and is more broadly applicable.
Remark 1 (Black-box Alpha energy function) Let $p_0(z)$ be a prior on a latent variable $z$ valued in $\mathbb{R}^d$ and denote by $p(x|z)$ the likelihood of the observation $x$ given $z$. We might then consider the posterior density
$$
p(z \mid \mathcal{D}) \propto p_0(z) \prod_{i=1}^T p(x_i|z), \tag{6}
$$
leading to the marginal log-likelihood
$$
\tilde{\ell}(\mathcal{D}) = \log \int p(\mathcal{D}, z)\, \mathrm{d}z = \log\!\left( \int p_0(z) \prod_{i=1}^T p(x_i|z)\, \mathrm{d}z \right).
$$
Here, the latent variable $z$ valued in $\mathbb{R}^d$ is shared across all the observations. Now further assume that the prior density $p_0(z) = \exp(s(z)^T \phi_0 - \log Z(\phi_0))$ has an exponential form, with $\phi_0$ and $s$ being the natural parameters and the sufficient statistics respectively and $Z(\phi_0)$ being the normalizing constant ensuring that $p_0$ is a probability density function.
In order to find the best approximation to the posterior density (6), Hernandez-Lobato et al. [2016] propose to minimize the Black-Box Alpha (BB-$\alpha$) energy function, which is defined, for all $\alpha \in \mathbb{R} \setminus \{1\}$, by
$$
E(\phi) = \log Z(\phi_0) - \log Z(\tilde{\phi}) - \frac{1}{1-\alpha} \sum_{i=1}^T \log\!\left( \int q_\phi(z) \left( \frac{p(x_i|z)}{f_\phi(z)} \right)^{1-\alpha} \mathrm{d}z \right)
$$
where $f_\phi(z) = \exp(s(z)^T \phi)$ is within the same exponential family as the prior and $q_\phi(z) = \exp(s(z)^T \tilde{\phi} - \log Z(\tilde{\phi}))$, with $\tilde{\phi} = T\phi + \phi_0$ denoting the natural parameters of $q_\phi$ and $Z(\tilde{\phi})$ its normalizing constant. Here, the minimisation is carried out via stochastic gradient descent w.r.t. $\phi$ combined with mini-batching and reparameterization.
As observed in Li and Gal [2017], minimizing $E(\phi)$ w.r.t. $\phi$ is equivalent to maximizing the sum of VR bounds
$$
\sum_{i=1}^T \frac{T}{1-\alpha} \log \int q_\phi(z)\, w_{\theta,\phi}(z; x_i)^{\frac{1-\alpha}{T}}\, \mathrm{d}z
$$
w.r.t. $\phi$, where this time $w_{\theta,\phi}(z; x_i) = p(x_i|z)^T p_0(z) / q_\phi(z)$.
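Indeed, under the exponential-family assumptions of Remark 1, one can check that $p_0(z)/q_\phi(z) = \exp(-T s(z)^T \phi)\, Z(\tilde{\phi})/Z(\phi_0) = f_\phi(z)^{-T}\, Z(\tilde{\phi})/Z(\phi_0)$, so that $w_{\theta,\phi}(z; x_i)^{1/T} = \big(p(x_i|z)/f_\phi(z)\big)\big(Z(\tilde{\phi})/Z(\phi_0)\big)^{1/T}$. Plugging this identity into the sum above gives
$$
\sum_{i=1}^T \frac{T}{1-\alpha} \log \int q_\phi(z)\, w_{\theta,\phi}(z; x_i)^{\frac{1-\alpha}{T}}\, \mathrm{d}z = T\big(\log Z(\tilde{\phi}) - \log Z(\phi_0)\big) + \frac{T}{1-\alpha} \sum_{i=1}^T \log \int q_\phi(z) \left( \frac{p(x_i|z)}{f_\phi(z)} \right)^{1-\alpha} \mathrm{d}z = -T\, E(\phi),
$$
so that maximizing this sum w.r.t. $\phi$ is indeed the same as minimizing $E(\phi)$, up to the positive factor $T$.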
However, the stochastic gradient descent scheme originating from having selected the VR bound as the objective function suffers from one important shortcoming: it relies on biased gradient estimators for all $\alpha \notin \{0, 1\}$, meaning that there are no convergence guarantees for the whole scheme. Indeed, Li and Turner [2016] show that the gradient of the VR bound w.r.t. $\phi$ satisfies
$$
\nabla_\phi \mathcal{L}^{(\alpha)}(\theta, \phi; x) = \frac{\int q(\varepsilon)\, w_{\theta,\phi}(z; x)^{1-\alpha}\, \nabla_\phi \log w_{\theta,\phi}(f(\varepsilon, \phi; x); x)\, \mathrm{d}\varepsilon}{\int q(\varepsilon)\, w_{\theta,\phi}(z; x)^{1-\alpha}\, \mathrm{d}\varepsilon},
$$
with $z = f(\varepsilon, \phi; x) \sim q_\phi(\cdot|x)$ where $\varepsilon \sim q$. The gradient above being intractable, they approximate it using
$$
\sum_{j=1}^N \frac{w_{\theta,\phi}(z_j; x)^{1-\alpha}}{\sum_{k=1}^N w_{\theta,\phi}(z_k; x)^{1-\alpha}}\, \nabla_\phi \log w_{\theta,\phi}(f(\varepsilon_j, \phi; x); x), \tag{7}
$$
where $\varepsilon_1, \ldots, \varepsilon_N$ are i.i.d. samples generated from $q$ and $z_j = f(\varepsilon_j, \phi; x)$ for all $j = 1, \ldots, N$. The cases $\alpha = 0$ and $\alpha = 1$ recover the stochastic reparameterized gradients of the IWAE bound (4) and of the ELBO (consider (4) with $N = 1$). As a result, we can trace them back to unbiased stochastic gradient descent schemes for IWAE bound and ELBO optimisation respectively. Yet, this is no longer the case when $\alpha \notin \{0, 1\}$, hence impeding the theoretical guarantees of the scheme.
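In code, (7) amounts to replacing the self-normalized weights of (4) by their tempered counterparts, as in the following illustrative sketch (not taken from the text; the per-sample gradients are again assumed to be supplied and all names are hypothetical):

```python
import numpy as np
from scipy.special import softmax

def vr_reparam_grad(log_w, grad_log_w, alpha):
    """Biased estimator (7) of the VR bound gradient w.r.t. phi:
    sum_j [w_j^{1-alpha} / sum_k w_k^{1-alpha}] * grad_phi log w_j."""
    w_tilde = softmax((1.0 - alpha) * log_w)   # tempered self-normalized weights
    return w_tilde @ grad_log_w

rng = np.random.default_rng(2)
N, P = 16, 3
log_w = rng.normal(size=N)
grad_log_w = rng.normal(size=(N, P))
print(vr_reparam_grad(log_w, grad_log_w, alpha=0.5))
# alpha = 0 recovers the IWAE weights of (4); alpha = 1 gives uniform weights 1/N,
# i.e. an average of single-sample ELBO gradient estimates.
print(vr_reparam_grad(log_w, grad_log_w, alpha=0.0))
print(vr_reparam_grad(log_w, grad_log_w, alpha=1.0))
```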
In addition, due to the log function, the VR bound itself can only be approximated using biased Monte Carlo estimators, with [Li and Turner, 2016, Section 4.1] using
$$
\frac{1}{1-\alpha} \log\!\left( \frac{1}{N} \sum_{j=1}^N w_{\theta,\phi}(Z_j; x)^{1-\alpha} \right) \tag{8}
$$
where $Z_1, \ldots, Z_N$ are i.i.d. samples generated from $q_\phi(\cdot|x)$. Furthermore, while the VR bound and the IWAE bound approaches are linked via the gradient estimator (7), the VR bound does not recover the IWAE bound when $\alpha = 0$.
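A numerically stable sketch of the estimator (8) (illustrative only; names and setup are hypothetical) is given below. Although it is a biased estimator of the VR bound because of the outer log, for $\alpha = 0$ it coincides with the usual unbiased Monte Carlo estimate of the IWAE bound, and its expectation over the samples is precisely the quantity introduced in the next section.

```python
import numpy as np
from scipy.special import logsumexp

def vr_bound_mc_estimate(log_w, alpha):
    """Estimator (8): (1/(1-alpha)) * log( (1/N) * sum_j w_j^{1-alpha} ),
    computed stably in log space; biased for the VR bound because of the outer log."""
    N = log_w.shape[0]
    return (logsumexp((1.0 - alpha) * log_w) - np.log(N)) / (1.0 - alpha)

rng = np.random.default_rng(3)
log_w = rng.normal(size=64)
print(vr_bound_mc_estimate(log_w, alpha=0.5))
print(vr_bound_mc_estimate(log_w, alpha=0.0))   # log-mean of the weights, i.e. the IWAE estimate
```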
The next section aims at overcoming the theoretical difficulties regarding the VR bound mentioned above.
3 The VR-IWAE bound
For all $\alpha \in \mathbb{R} \setminus \{1\}$, let us introduce the quantity
$$
\ell_N^{(\alpha)}(\theta, \phi; x) := \frac{1}{1-\alpha} \int\!\!\int \prod_{i=1}^N q_\phi(z_i|x)\, \log\!\left( \frac{1}{N} \sum_{j=1}^N w_{\theta,\phi}(z_j; x)^{1-\alpha} \right) \mathrm{d}z_{1:N}, \tag{9}
$$