Alpha-divergence Variational Inference Meets Importance Weighted
Auto-Encoders: Methodology and Asymptotics
Kamélia Daudel Joe Benton* Yuyang Shi* Arnaud Doucet
Department of Statistics, University of Oxford, United Kingdom
*: Equal contribution
ABSTRACT
Several algorithms involving the Variational Rényi (VR) bound have been proposed to
minimize an alpha-divergence between a target posterior distribution and a variational
distribution. Despite promising empirical results, those algorithms resort to biased stochastic
gradient descent procedures and thus lack theoretical guarantees. In this paper, we formalize
and study the VR-IWAE bound, a generalization of the Importance Weighted Auto-Encoder
(IWAE) bound. We show that the VR-IWAE bound enjoys several desirable properties and
notably leads to the same stochastic gradient descent procedure as the VR bound in the
reparameterized case, but this time by relying on unbiased gradient estimators. We then
provide two complementary theoretical analyses of the VR-IWAE bound and thus of the
standard IWAE bound. Those analyses shed light on the benefits or lack thereof of these
bounds. Lastly, we illustrate our theoretical claims over toy and real-data examples.
Keywords Variational Inference · Alpha-Divergence · Importance Weighted Auto-encoder · High dimension · Weight collapse
1 Introduction
Variational inference methods aim at finding the best approximation to a target posterior density within a so-called variational family of probability densities. This best approximation is traditionally obtained by minimizing the exclusive Kullback–Leibler divergence [Wainwright and Jordan, 2008, Blei et al., 2017]; however, this divergence is known to have some drawbacks [for instance variance underestimation, see Minka, 2005].
As a result, alternative divergences have been explored [Minka, 2005, Li and Turner, 2016, Bui et al., 2016, Dieng et al., 2017, Li and Gal, 2017, Wang et al., 2018, Daudel et al., 2021, 2023, Daudel and Douc, 2021, Rodríguez-Santana and Hernández-Lobato, 2022], in particular the class of alpha-divergences. This family of divergences is indexed by a scalar $\alpha$. It provides additional flexibility that can in theory be used to overcome the obstacles associated with the exclusive Kullback–Leibler divergence (which is recovered by letting $\alpha \to 1$).

Among those methods, techniques involving the Variational Rényi (VR) bound introduced in Li and Turner [2016] have led to promising empirical results and have been linked to key algorithms such as the Importance Weighted Auto-encoder (IWAE) algorithm [Burda et al., 2016] in the special case $\alpha = 0$ and the Black-Box Alpha (BB-$\alpha$) algorithm [Hernandez-Lobato et al., 2016].
Yet methods based on the VR bound are seen as lacking theoretical guarantees. This comes from the fact that they are classified as biased in the community: by selecting the VR bound as the objective function, those methods indeed resort to biased gradient estimators [Li and Turner, 2016, Hernandez-Lobato et al., 2016, Bui et al., 2016, Li and Gal, 2017, Geffner and Domke, 2020, 2021, Zhang et al., 2021, Rodríguez-Santana and Hernández-Lobato, 2022].
Geffner and Domke [2020] have recently provided insights from an empirical perspective regarding the magnitude of the bias and its impact on the outcome of the optimization procedure when the (biased) reparameterized gradient estimator of the VR bound is used. They observe that the resulting algorithm appears to require an impractically large amount of computation to actually optimise the VR bound as the dimension increases (and otherwise seems to simply return minimizers of the exclusive Kullback–Leibler divergence). They postulate that this effect might be due to a weight degeneracy behavior [Bengtsson et al., 2008], but this behavior is not quantified precisely from a theoretical point of view.
In this paper, our goal is to (i) develop theoretical guarantees for VR-based variational inference methods and
(ii) construct a theoretical framework elucidating the weight degeneracy behavior that has been empirically
observed for those techniques. The rest of this paper is organized as follows:
In Section 2, we provide some background notation and we review the main concepts behind the VR bound.

In Section 3, we introduce the VR-IWAE bound. We show in Proposition 1 that this bound, previously defined by Li and Turner [2016] as the expectation of the biased Monte Carlo approximation of the VR bound, can actually be interpreted as a variational bound which depends on a hyperparameter $\alpha$ with $\alpha \in [0, 1)$. In addition, we obtain that the VR-IWAE bound leads to the same stochastic gradient descent procedure as the VR bound in the reparameterized case. Unlike the VR bound, the VR-IWAE bound relies on unbiased gradient estimators and coincides with the IWAE bound for $\alpha = 0$, fully bridging the gap between both methodologies.

We then generalize the approach of Rainforth et al. [2018] – which characterizes the Signal-to-Noise Ratio (SNR) of the reparameterized gradient estimators of the IWAE – to the VR-IWAE bound and establish that the VR-IWAE bound with $\alpha \in (0, 1)$ enjoys better theoretical properties than the IWAE bound (Theorem 1). To further tackle potential SNR difficulties, we also extend the doubly-reparameterized gradient estimator of the IWAE [Tucker et al., 2019] to the VR-IWAE bound (Theorem 2).

In Section 4, we provide a thorough theoretical study of the VR-IWAE bound. Following Domke and Sheldon [2018], we start by investigating the case where the dimension of the latent space $d$ is fixed and the number of Monte Carlo samples $N$ in the VR-IWAE bound goes to infinity (Theorem 3). Our analysis shows that the hyperparameter $\alpha$ allows us to balance between an error term depending on both the encoder and the decoder parameters $(\theta, \phi)$ and a term going to zero at a $1/N$ rate. This suggests that tuning $\alpha$ can be beneficial to obtain the best empirical performances.

However, the relevance of such an analysis can be limited for a high-dimensional latent space $d$ (Examples 1 and 2). We then propose a novel analysis where $N$ does not grow as fast as exponentially with $d$ (Theorems 4 and 5) or sub-exponentially with $d^{1/3}$ (Theorem 6), which we use to revisit Examples 1 and 2 in Examples 3 and 4 respectively. This analysis suggests that in these regimes the VR-IWAE bound, and hence in particular the IWAE bound, are of limited interest.

In Section 5, we detail how our work relates to the existing literature.

Lastly, Section 6 provides empirical evidence illustrating our theoretical claims for both toy and real-data examples.
2 Background
Given a model with joint distribution $p_\theta(x, z)$ parameterized by $\theta$, where $x$ denotes an observation and $z$ is a latent variable valued in $\mathbb{R}^d$, one is interested in finding the parameter $\theta$ which best describes the observations $\mathcal{D} = \{x_1, \ldots, x_T\}$. This will be our running example. The corresponding posterior density satisfies:
$$
p_\theta(z \mid \mathcal{D}) \propto \prod_{i=1}^T p_\theta(x_i, z_i) \tag{1}
$$
with $z = (z_1, \ldots, z_T)$, so that the marginal log likelihood reads
$$
\ell(\theta; \mathcal{D}) = \sum_{i=1}^T \ell(\theta; x_i) \quad \text{with} \quad \ell(\theta; x) := \log p_\theta(x) = \log \int p_\theta(x, z)\, \mathrm{d}z. \tag{2}
$$
Unfortunately, as this marginal log likelihood is typically intractable, finding $\theta$ maximizing it is difficult. Variational bounds are then designed to act as surrogate objective functions more amenable to optimization.
Let $q_\phi(z|x)$ be a variational encoder parameterized by $\phi$; common variational bounds are the Evidence Lower BOund (ELBO) and the IWAE bound [Burda et al., 2016]:
$$
\mathrm{ELBO}(\theta, \phi; x) = \int q_\phi(z|x) \log w_{\theta,\phi}(z; x)\, \mathrm{d}z,
$$
$$
\ell_N^{(\mathrm{IWAE})}(\theta, \phi; x) = \int\!\!\int \prod_{i=1}^N q_\phi(z_i|x)\, \log\!\left( \frac{1}{N} \sum_{j=1}^N w_{\theta,\phi}(z_j; x) \right) \mathrm{d}z_{1:N}, \qquad N \in \mathbb{N}^\star,
$$
where for all $z \in \mathbb{R}^d$,
$$
w_{\theta,\phi}(z; x) = \frac{p_\theta(x, z)}{q_\phi(z|x)}.
$$
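For concreteness, here is a minimal, self-contained NumPy sketch of these quantities; the toy Gaussian model, the encoder parameters and all function names below are purely illustrative assumptions and are not taken from the text.

```python
import numpy as np
from scipy.special import logsumexp

def log_normal(y, mean, var):
    """Sum of coordinate-wise log-densities of a diagonal Gaussian N(mean, var)."""
    return -0.5 * np.sum((y - mean) ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)

def log_weight(x, z, mu, sigma):
    """log w_{theta,phi}(z; x) = log p_theta(x, z) - log q_phi(z|x) for a toy model with
    prior p(z) = N(0, I), likelihood p(x|z) = N(x; z, I) and encoder q_phi(z|x) = N(mu, diag(sigma^2))."""
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    log_q = log_normal(z, mu, sigma ** 2)
    return log_joint - log_q

def elbo_estimate(log_w):
    """Monte Carlo estimate of the ELBO: average of the log-weights."""
    return np.mean(log_w)

def iwae_estimate(log_w):
    """Unbiased Monte Carlo estimate of the N-sample IWAE bound:
    log( (1/N) * sum_j w_j ), computed stably in log space."""
    return logsumexp(log_w) - np.log(log_w.shape[0])

rng = np.random.default_rng(0)
d, N = 5, 64                                  # latent dimension and number of samples
x = rng.normal(size=d)                        # one observation
mu, sigma = 0.5 * x, np.full(d, 0.8)          # illustrative encoder parameters phi
eps = rng.normal(size=(N, d))                 # eps ~ N(0, I_d)
z = mu + sigma * eps                          # reparameterized samples z = f(eps, phi; x)
log_w = log_weight(x, z, mu, sigma)           # N log-weights
print(elbo_estimate(log_w), iwae_estimate(log_w))  # IWAE estimate >= ELBO estimate (Jensen)
```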
The IWAE bound generalizes the ELBO (which is recovered for $N = 1$) and acts as a lower bound on $\ell(\theta; x)$ that can be estimated in an unbiased manner. Instead of maximizing $\ell(\theta; \mathcal{D})$ defined in (2), one then considers the surrogate objective
$$
\sum_{i=1}^T \ell_N^{(\mathrm{IWAE})}(\theta, \phi; x_i)
$$
which is optimized by performing stochastic gradient descent steps w.r.t. $(\theta, \phi)$ on it, combined with mini-batching. Optimizing this objective w.r.t. $\phi$ is difficult due to high-variance gradients with low Signal-to-Noise Ratio [Rainforth et al., 2018]. To mitigate this problem, reparameterized [Kingma and Welling, 2014, Burda et al., 2016] and doubly-reparameterized gradient estimators [Tucker et al., 2019] have been proposed.
Crucially, stochastic gradient schemes on the IWAE bound (and hence on the ELBO) only resort to unbiased estimators in both the reparameterized [Kingma and Welling, 2014, Burda et al., 2016] and the doubly-reparameterized [Tucker et al., 2019] cases, providing theoretical justifications behind those approaches. In particular, under the assumption that $z$ can be reparameterized (that is, $z = f(\varepsilon, \phi; x) \sim q_\phi(\cdot|x)$ where $\varepsilon \sim q$) and under common differentiability assumptions, the reparameterized gradient w.r.t. $\phi$ of the IWAE bound is given by
$$
\nabla_\phi\, \ell_N^{(\mathrm{IWAE})}(\theta, \phi; x) = \int\!\!\int \prod_{i=1}^N q(\varepsilon_i) \left[ \sum_{j=1}^N \frac{w_{\theta,\phi}(z_j; x)}{\sum_{k=1}^N w_{\theta,\phi}(z_k; x)}\, \nabla_\phi \log w_{\theta,\phi}(f(\varepsilon_j, \phi; x); x) \right] \mathrm{d}\varepsilon_{1:N}
$$
and the doubly-reparameterized one by
$$
\nabla_\phi\, \ell_N^{(\mathrm{IWAE})}(\theta, \phi; x) = \int\!\!\int \prod_{i=1}^N q(\varepsilon_i) \left[ \sum_{j=1}^N \left( \frac{w_{\theta,\phi}(z_j; x)}{\sum_{k=1}^N w_{\theta,\phi}(z_k; x)} \right)^{\!2} \nabla_\phi \log w_{\theta,\bar{\phi}}(f(\varepsilon_j, \phi; x); x) \Big|_{\bar{\phi} = \phi} \right] \mathrm{d}\varepsilon_{1:N}. \tag{3}
$$
Unbiased Monte Carlo estimators of both gradients are hence respectively given by
$$
\sum_{j=1}^N \frac{w_{\theta,\phi}(z_j; x)}{\sum_{k=1}^N w_{\theta,\phi}(z_k; x)}\, \nabla_\phi \log w_{\theta,\phi}(f(\varepsilon_j, \phi; x); x) \tag{4}
$$
and
$$
\sum_{j=1}^N \left( \frac{w_{\theta,\phi}(z_j; x)}{\sum_{k=1}^N w_{\theta,\phi}(z_k; x)} \right)^{\!2} \nabla_\phi \log w_{\theta,\bar{\phi}}(f(\varepsilon_j, \phi; x); x) \Big|_{\bar{\phi} = \phi},
$$
with $\varepsilon_1, \ldots, \varepsilon_N$ being i.i.d. samples generated from $q$ and $z_j = f(\varepsilon_j, \phi; x)$ for all $j = 1, \ldots, N$. Maddison et al. [2017] and Domke and Sheldon [2018] in particular established that the variational gap – that is, the difference between the IWAE bound and the marginal log-likelihood – goes to zero at a fast $1/N$ rate when the dimension of the latent space $d$ is fixed and the number of samples $N$ goes to infinity.
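For illustration, the sketch below (an assumption-laden toy, not code from the text) assembles the two unbiased Monte Carlo gradient estimators above from the log-weights and from per-sample gradients $\nabla_\phi \log w_{\theta,\phi}(f(\varepsilon_j, \phi; x); x)$, which are assumed to be supplied, e.g. by automatic differentiation.

```python
import numpy as np
from scipy.special import softmax

def iwae_reparam_grad(log_w, grad_log_w):
    """Estimator (4): sum_j [w_j / sum_k w_k] * grad_phi log w_j.
    log_w has shape (N,); grad_log_w has shape (N, P), one row per sample."""
    w_tilde = softmax(log_w)          # self-normalized importance weights
    return w_tilde @ grad_log_w       # shape (P,)

def iwae_doubly_reparam_grad(log_w, grad_log_w_path):
    """Doubly-reparameterized estimator: squared normalized weights; the supplied
    per-sample gradients must flow only through the sample path z_j = f(eps_j, phi; x)
    (in an autodiff framework this is enforced with a stop-gradient on the weights)."""
    w_tilde = softmax(log_w)
    return (w_tilde ** 2) @ grad_log_w_path

# Illustrative usage with synthetic inputs (the same array stands in for both kinds
# of per-sample gradients here, purely for demonstration):
rng = np.random.default_rng(1)
N, P = 16, 3                          # number of samples and of variational parameters
log_w = rng.normal(size=N)
grad_log_w = rng.normal(size=(N, P))
print(iwae_reparam_grad(log_w, grad_log_w))
print(iwae_doubly_reparam_grad(log_w, grad_log_w))
```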
Another example of variational bound is the Variational Rényi (VR) bound introduced by Li and Turner [2016]: it is defined for all $\alpha \in \mathbb{R} \setminus \{1\}$ by
$$
\mathcal{L}^{(\alpha)}(\theta, \phi; x) = \frac{1}{1-\alpha} \log \int q_\phi(z|x)\, w_{\theta,\phi}(z; x)^{1-\alpha}\, \mathrm{d}z \tag{5}
$$
and it generalizes the ELBO [which corresponds to the extension by continuity of the VR bound to the case $\alpha = 1$, see Li and Turner, 2016, Theorem 1]. It is also a lower (resp. upper) bound on the marginal log-likelihood $\ell(\theta; x)$ for all $\alpha > 0$ (resp. $\alpha < 0$).
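For completeness, the continuity claim can be checked directly. Write $g(\alpha) = \log \int q_\phi(z|x)\, w_{\theta,\phi}(z; x)^{1-\alpha}\, \mathrm{d}z$, so that $\mathcal{L}^{(\alpha)}(\theta, \phi; x) = g(\alpha)/(1-\alpha)$ and $g(1) = \log \int q_\phi(z|x)\, \mathrm{d}z = 0$; then, under standard regularity conditions allowing differentiation under the integral sign,
$$
\lim_{\alpha \to 1} \mathcal{L}^{(\alpha)}(\theta, \phi; x) = -g'(1) = \int q_\phi(z|x) \log w_{\theta,\phi}(z; x)\, \mathrm{d}z = \mathrm{ELBO}(\theta, \phi; x).
$$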
In the spirit of the IWAE bound optimisation framework, the VR bound is used for variational inference purposes in [Li and Turner, 2016, Sections 4.1, 4.2 and 5.2] to optimise the marginal log-likelihood $\ell(\theta; \mathcal{D})$ defined in (2) by considering the global objective function
$$
\sum_{i=1}^T \mathcal{L}^{(\alpha)}(\theta, \phi; x_i)
$$
and by performing stochastic gradient descent steps w.r.t. $(\theta, \phi)$ on it, paired up with mini-batching and reparameterization. This VR bound methodology has provided positive empirical results compared to the usual case $\alpha = 1$ and has been widely adopted in the literature [Li and Turner, 2016, Bui et al., 2016, Hernandez-Lobato et al., 2016, Li and Gal, 2017, Zhang et al., 2021, Rodríguez-Santana and Hernández-Lobato, 2022]. As discussed in the remark below, this methodology is obviously not limited to the choice of posterior density defined in (1) and is more broadly applicable.
Remark 1 (Black-box Alpha energy function) Let $p_0(z)$ be a prior on a latent variable $z$ valued in $\mathbb{R}^d$ and denote by $p(x|z)$ the likelihood of the observation $x$ given $z$. We might then consider the posterior density
$$
p(z \mid \mathcal{D}) \propto p_0(z) \prod_{i=1}^T p(x_i|z), \tag{6}
$$
leading to the marginal log-likelihood
$$
\tilde{\ell}(\mathcal{D}) = \log \int p(\mathcal{D}, z)\, \mathrm{d}z = \log\!\left( \int p_0(z) \prod_{i=1}^T p(x_i|z)\, \mathrm{d}z \right).
$$
Here, the latent variable $z$ valued in $\mathbb{R}^d$ is shared across all the observations. Now further assume that the prior density $p_0(z) = \exp(s(z)^T \phi_0 - \log Z(\phi_0))$ has an exponential form, with $\phi_0$ and $s$ being the natural parameters and the sufficient statistics respectively and $Z(\phi_0)$ being the normalizing constant ensuring that $p_0$ is a probability density function.
In order to find the best approximation to the posterior density (6), Hernandez-Lobato et al. [2016] propose to minimize the Black-Box Alpha (BB-$\alpha$) energy function, which is defined, for all $\alpha \in \mathbb{R} \setminus \{1\}$, by
$$
E(\phi) = \log Z(\phi_0) - \log Z(\tilde{\phi}) - \frac{1}{1-\alpha} \sum_{i=1}^T \log\!\left( \int q_\phi(z) \left( \frac{p(x_i|z)}{f_\phi(z)} \right)^{1-\alpha} \mathrm{d}z \right)
$$
where $f_\phi(z) = \exp(s(z)^T \phi)$ is within the same exponential family as the prior and $q_\phi(z) = \exp(s(z)^T \tilde{\phi} - \log Z(\tilde{\phi}))$, with $\tilde{\phi} = T\phi + \phi_0$ denoting the natural parameters of $q_\phi$ and $Z(\tilde{\phi})$ its normalizing constant. Here, the minimisation is carried out via stochastic gradient descent w.r.t. $\phi$ combined with mini-batching and reparameterization.
As observed in Li and Gal [2017], minimizing $E(\phi)$ w.r.t. $\phi$ is equivalent to maximizing the sum of VR bounds
$$
\sum_{i=1}^T \frac{T}{1-\alpha} \log \int q_\phi(z)\, w_{\theta,\phi}(z; x_i)^{\frac{1-\alpha}{T}}\, \mathrm{d}z
$$
w.r.t. $\phi$, where this time $w_{\theta,\phi}(z; x_i) = p(x_i|z)^T p_0(z) / q_\phi(z)$.
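Indeed, under the exponential-family assumptions of Remark 1, one can check that $p_0(z)/q_\phi(z) = \exp(-T s(z)^T \phi)\, Z(\tilde{\phi})/Z(\phi_0) = f_\phi(z)^{-T}\, Z(\tilde{\phi})/Z(\phi_0)$, so that $w_{\theta,\phi}(z; x_i)^{1/T} = \big(p(x_i|z)/f_\phi(z)\big)\big(Z(\tilde{\phi})/Z(\phi_0)\big)^{1/T}$. Plugging this identity into the sum above gives
$$
\sum_{i=1}^T \frac{T}{1-\alpha} \log \int q_\phi(z)\, w_{\theta,\phi}(z; x_i)^{\frac{1-\alpha}{T}}\, \mathrm{d}z = T\big(\log Z(\tilde{\phi}) - \log Z(\phi_0)\big) + \frac{T}{1-\alpha} \sum_{i=1}^T \log \int q_\phi(z) \left( \frac{p(x_i|z)}{f_\phi(z)} \right)^{1-\alpha} \mathrm{d}z = -T\, E(\phi),
$$
so that maximizing this sum w.r.t. $\phi$ is indeed the same as minimizing $E(\phi)$, up to the positive factor $T$.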
However, the stochastic gradient descent scheme originating from having selected the VR bound as the objective function suffers from one important shortcoming: it relies on biased gradient estimators for all $\alpha \notin \{0, 1\}$, meaning that there are no convergence guarantees for the whole scheme. Indeed, Li and Turner [2016] show that the gradient of the VR bound w.r.t. $\phi$ satisfies
$$
\nabla_\phi \mathcal{L}^{(\alpha)}(\theta, \phi; x) = \frac{\int q(\varepsilon)\, w_{\theta,\phi}(z; x)^{1-\alpha}\, \nabla_\phi \log w_{\theta,\phi}(f(\varepsilon, \phi; x); x)\, \mathrm{d}\varepsilon}{\int q(\varepsilon)\, w_{\theta,\phi}(z; x)^{1-\alpha}\, \mathrm{d}\varepsilon},
$$
with $z = f(\varepsilon, \phi; x) \sim q_\phi(\cdot|x)$ where $\varepsilon \sim q$. The gradient above being intractable, they approximate it using
$$
\sum_{j=1}^N \frac{w_{\theta,\phi}(z_j; x)^{1-\alpha}}{\sum_{k=1}^N w_{\theta,\phi}(z_k; x)^{1-\alpha}}\, \nabla_\phi \log w_{\theta,\phi}(f(\varepsilon_j, \phi; x); x), \tag{7}
$$
where $\varepsilon_1, \ldots, \varepsilon_N$ are i.i.d. samples generated from $q$ and $z_j = f(\varepsilon_j, \phi; x)$ for all $j = 1, \ldots, N$. The cases $\alpha = 0$ and $\alpha = 1$ recover the stochastic reparameterized gradients of the IWAE bound (4) and of the ELBO (consider (4) with $N = 1$). As a result, we can trace them back to unbiased stochastic gradient descent schemes for IWAE bound and ELBO optimisation respectively. Yet, this is no longer the case when $\alpha \notin \{0, 1\}$, hence impeding the theoretical guarantees of the scheme.
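In code, (7) amounts to replacing the self-normalized weights of (4) by their tempered counterparts, as in the following illustrative sketch (not taken from the text; the per-sample gradients are again assumed to be supplied and all names are hypothetical):

```python
import numpy as np
from scipy.special import softmax

def vr_reparam_grad(log_w, grad_log_w, alpha):
    """Biased estimator (7) of the VR bound gradient w.r.t. phi:
    sum_j [w_j^{1-alpha} / sum_k w_k^{1-alpha}] * grad_phi log w_j."""
    w_tilde = softmax((1.0 - alpha) * log_w)   # tempered self-normalized weights
    return w_tilde @ grad_log_w

rng = np.random.default_rng(2)
N, P = 16, 3
log_w = rng.normal(size=N)
grad_log_w = rng.normal(size=(N, P))
print(vr_reparam_grad(log_w, grad_log_w, alpha=0.5))
# alpha = 0 recovers the IWAE weights of (4); alpha = 1 gives uniform weights 1/N,
# i.e. an average of single-sample ELBO gradient estimates.
print(vr_reparam_grad(log_w, grad_log_w, alpha=0.0))
print(vr_reparam_grad(log_w, grad_log_w, alpha=1.0))
```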
In addition, due to the log function, the VR bound itself can only be approximated using biased Monte Carlo estimators, with [Li and Turner, 2016, Section 4.1] using
$$
\frac{1}{1-\alpha} \log\!\left( \frac{1}{N} \sum_{j=1}^N w_{\theta,\phi}(Z_j; x)^{1-\alpha} \right) \tag{8}
$$
where $Z_1, \ldots, Z_N$ are i.i.d. samples generated from $q_\phi(\cdot|x)$. Furthermore, while the VR bound and the IWAE bound approaches are linked via the gradient estimator (7), the VR bound does not recover the IWAE bound when $\alpha = 0$.
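A numerically stable sketch of the estimator (8) (illustrative only; names and setup are hypothetical) is given below. Although it is a biased estimator of the VR bound because of the outer log, for $\alpha = 0$ it coincides with the usual unbiased Monte Carlo estimate of the IWAE bound, and its expectation over the samples is precisely the quantity introduced in the next section.

```python
import numpy as np
from scipy.special import logsumexp

def vr_bound_mc_estimate(log_w, alpha):
    """Estimator (8): (1/(1-alpha)) * log( (1/N) * sum_j w_j^{1-alpha} ),
    computed stably in log space; biased for the VR bound because of the outer log."""
    N = log_w.shape[0]
    return (logsumexp((1.0 - alpha) * log_w) - np.log(N)) / (1.0 - alpha)

rng = np.random.default_rng(3)
log_w = rng.normal(size=64)
print(vr_bound_mc_estimate(log_w, alpha=0.5))
print(vr_bound_mc_estimate(log_w, alpha=0.0))   # log-mean of the weights, i.e. the IWAE estimate
```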
The next section aims at overcoming the theoretical difficulties regarding the VR bound mentioned above.
3 The VR-IWAE bound
For all $\alpha \in \mathbb{R} \setminus \{1\}$, let us introduce the quantity
$$
\ell_N^{(\alpha)}(\theta, \phi; x) := \frac{1}{1-\alpha} \int\!\!\int \prod_{i=1}^N q_\phi(z_i|x)\, \log\!\left( \frac{1}{N} \sum_{j=1}^N w_{\theta,\phi}(z_j; x)^{1-\alpha} \right) \mathrm{d}z_{1:N}, \tag{9}
$$