et al., 2016, Li and Gal, 2017, Geffner and Domke, 2020, 2021, Zhang et al., 2021, Rodríguez-Santana and Hernández-Lobato, 2022].
Geffner and Domke [2020] have recently provided empirical insights into the magnitude of the bias and its impact on the outcome of the optimization procedure when the (biased) reparameterized gradient estimator of the VR bound is used. They observe that the resulting algorithm appears to require an impractically large amount of computation to actually optimize the VR bound as the dimension increases (and otherwise seems to simply return minimizers of the exclusive Kullback–Leibler divergence). They postulate that this effect might be due to a weight degeneracy behavior [Bengtsson et al., 2008], but this behavior is not quantified precisely from a theoretical point of view.
In this paper, our goal is to (i) develop theoretical guarantees for VR-based variational inference methods and
(ii) construct a theoretical framework elucidating the weight degeneracy behavior that has been empirically
observed for those techniques. The rest of this paper is organized as follows:
• In Section 2, we provide some background notation and review the main concepts behind the VR bound.
• In Section 3, we introduce the VR-IWAE bound (written out explicitly after this list). We show in Proposition 1 that this bound, previously defined by Li and Turner [2016] as the expectation of the biased Monte Carlo approximation of the VR bound, can actually be interpreted as a variational bound depending on a hyperparameter α with α ∈ [0, 1). In addition, we obtain that the VR-IWAE bound leads to the same stochastic gradient descent procedure as the VR bound in the reparameterized case. Unlike the VR bound, the VR-IWAE bound relies on unbiased gradient estimators and coincides with the IWAE bound for α = 0, fully bridging the gap between the two methodologies.
We then generalize the approach of Rainforth et al. [2018] – which characterizes the Signal-to-Noise Ratio (SNR) of the reparameterized gradient estimators of the IWAE – to the VR-IWAE bound and establish that the VR-IWAE bound with α ∈ (0, 1) enjoys better theoretical properties than the IWAE bound (Theorem 1). To further tackle potential SNR difficulties, we also extend the doubly-reparameterized gradient estimator of the IWAE [Tucker et al., 2019] to the VR-IWAE bound (Theorem 2).
• In Section 4, we provide a thorough theoretical study of the VR-IWAE bound. Following Domke and Sheldon [2018], we start by investigating the case where the dimension d of the latent space is fixed and the number of Monte Carlo samples N in the VR-IWAE bound goes to infinity (Theorem 3). Our analysis shows that the hyperparameter α allows us to balance between an error term depending on both the encoder and the decoder parameters (θ, ϕ) and a term going to zero at a 1/N rate. This suggests that tuning α can be beneficial to obtain the best empirical performance.
However, the relevance of such an analysis can be limited for a high-dimensional latent space (Examples 1 and 2). We then propose a novel analysis where N does not grow as fast as exponentially with d (Theorems 4 and 5) or sub-exponentially with d^{1/3} (Theorem 6), which we use to revisit Examples 1 and 2 in Examples 3 and 4 respectively. This analysis suggests that in these regimes the VR-IWAE bound, and hence in particular the IWAE bound, is of limited interest.
• In Section 5, we detail how our work relates to the existing literature.
• Lastly, Section 6 provides empirical evidence illustrating our theoretical claims for both toy and real-data examples.
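For concreteness, the VR-IWAE bound announced in Section 3 can be written as follows (a sketch of the definition; see Proposition 1 for the precise statement). Here z_1, …, z_N denote i.i.d. samples from the variational distribution q_ϕ(·|x):

\[
\ell^{(\alpha)}_N(\theta, \phi; x) \;=\; \frac{1}{1-\alpha} \, \mathbb{E}_{z_1, \dots, z_N \sim q_\phi(\cdot \mid x)} \left[ \log \left( \frac{1}{N} \sum_{i=1}^N \left( \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)} \right)^{1-\alpha} \right) \right], \qquad \alpha \in [0, 1),
\]

so that, as stated above, setting α = 0 recovers the IWAE bound.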
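A minimal numerical sketch of the corresponding Monte Carlo estimate may also help fix ideas. The code below is ours for illustration only: the toy conjugate Gaussian model, the function name vr_iwae_estimate, and the sample sizes are assumptions, not taken from the paper.

import numpy as np
from scipy.special import logsumexp

def vr_iwae_estimate(log_w, alpha):
    # One-sample Monte Carlo estimate of the VR-IWAE bound, computed
    # stably in log space from the N log importance weights
    # log_w[i] = log p_theta(x, z_i) - log q_phi(z_i | x).
    n = len(log_w)
    # log( (1/N) * sum_i w_i^(1 - alpha) )
    log_avg = logsumexp((1.0 - alpha) * log_w) - np.log(n)
    return log_avg / (1.0 - alpha)

# Toy model (assumption, for illustration): q(z) = N(0, 1) and
# p(x, z) = N(z; 0, 1) N(x; z, 1), so that log p(x) = log N(x; 0, 2).
rng = np.random.default_rng(0)
x, n_samples = 1.0, 10_000
z = rng.standard_normal(n_samples)                     # z_i ~ q
log_w = -0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi)  # log p(x, z_i) - log q(z_i)

true_log_px = -0.25 * x ** 2 - 0.5 * np.log(4 * np.pi)
for alpha in (0.0, 0.2, 0.5):                          # alpha = 0 is the IWAE case
    print(alpha, vr_iwae_estimate(log_w, alpha), true_log_px)

Averaging the (1 − α)-powered weights via logsumexp avoids numerical overflow, and α = 0 reproduces the standard IWAE objective.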
2 Background
Given a model with joint distribution pθ(x, z) parameterized by θ, where x denotes an observation and z is a latent variable valued in R^d, one is interested in finding the parameter θ which best describes the observations