
Optimizing Hierarchical Image VAEs for Sample Quality

Eric Luhman∗
ericluhman2@gmail.com

Troy Luhman∗
troyluhman@gmail.com

∗Equal Contribution

arXiv:2210.10205v1 [cs.LG] 18 Oct 2022
Abstract
While hierarchical variational autoencoders (VAEs) have achieved great density estimation on image modeling tasks, samples from their prior tend to look less convincing than models with similar log-likelihood. We attribute this to learned representations that over-emphasize compressing imperceptible details of the image. To address this, we introduce a KL-reweighting strategy to control the amount of information in each latent group, and employ a Gaussian output layer to reduce sharpness in the learning objective. To trade off image diversity for fidelity, we additionally introduce a classifier-free guidance strategy for hierarchical VAEs. We demonstrate the effectiveness of these techniques in our experiments. Code is available at https://github.com/tcl9876/visual-vae.
1 Introduction
Deep likelihood-based models have achieved impressive capabilities on unsupervised image tasks. Models such as autoregressive models, diffusion models (Ho et al., 2020), and variational autoencoders (Kingma and Welling, 2013; Sønderby et al., 2016) all perform excellently on the log-likelihood metric (Child et al., 2019; Kingma et al., 2021; Vahdat and Kautz, 2020), indicating the promise of each approach. Autoregressive and diffusion models have additionally proved capable of generating high-fidelity images (Razavi et al., 2019; Dhariwal and Nichol, 2021), reaching unprecedented levels of performance in complex text-to-image tasks (Ramesh et al., 2021; Rombach et al., 2022; Ramesh et al., 2022; Yu et al., 2022).
While images sampled from autoregressive or diffusion priors usually exhibit a high degree
of realism, the same often cannot be said of VAEs despite their good likelihood. This result
is especially surprising considering the close similarities between VAEs and diffusion models.
Both models optimize a variational inference objective, employ a stack of diagonal Gaussian
latent distributions, and exhibit coarse-to-fine generation behavior (Ho et al., 2020; Child,
2020). The primary difference between them is the use of learned posterior distributions in
VAEs, compared to the manually specified posteriors in diffusion models. Given the gap in
performance, one might wonder if fixed posteriors are indeed better suited for high-fidelity
image synthesis.
We begin our paper by offering insights on why existing VAEs are unable to produce high-quality samples despite their good likelihood. Specifically, the structure of an image needs only a few bits of information to encode, while the majority of the code length is occupied by minuscule details. We argue that hierarchical VAEs are naturally inclined to model these fine details, and can largely ignore global structure since it contributes relatively little to the likelihood.
In this work, we are motivated by the single goal of improving the perceptual quality of samples from the prior of hierarchical VAEs. To this end, we propose two techniques to emphasize modeling global structure. The first offers control over the amount of information in each latent group by reweighting terms of the ELBO, which can be used to allocate more latent groups to the first few bits of information. The second replaces the discretized mixture of logistics output layer with a Gaussian distribution trained with a continuous KL objective, which greatly reduces the KL used while maintaining near-perfect reconstructions. Orthogonal to these techniques, we also introduce a classifier-free guidance strategy for VAEs that trades image diversity for fidelity at sampling time.
We test our method on the CIFAR-10 dataset, showing that our techniques improve the visual quality of generated samples, reducing FID by up to 2× over a controlled baseline. We additionally demonstrate superior compression capabilities at low rates, further justifying our method despite its worse likelihoods. Finally, we verify the effectiveness of classifier-free guidance on class-conditional ImageNet $64^2$.
2 Background
2.1 Hierarchical Variational Autoencoders
We provide a brief review of hierarchical VAEs in this section; a more thorough introduction can be found in Appendix A. A hierarchical variational autoencoder is a generative model that uses a sequence of latent variables $z := \{z_1, \ldots, z_N\}$ to estimate a joint distribution $p(x, z) = p(x|z)\,p(z)$, with $p(z) := \prod_{i=1}^{N} p(z_i \mid z_{<i})$. In general, the true posterior $p(z|x)$ is intractable, so an approximate posterior $q(z|x) := \prod_{i=1}^{N} q(z_i \mid z_{<i}, x)$ is used instead. The generative model is trained to minimize the following variational inference objective:
$$\mathcal{L} := \mathbb{E}_{q(z|x)}[-\log p(x|z)] + \underbrace{\mathrm{KL}\big(q(z_1|x)\,\|\,p(z_1)\big)}_{\mathcal{L}_1} + \sum_{i=2}^{N} \underbrace{\mathbb{E}_{q(z_{<i}|x)}\big[\mathrm{KL}\big(q(z_i|z_{<i},x)\,\|\,p(z_i|z_{<i})\big)\big]}_{\mathcal{L}_i} \qquad (1)$$
The posterior $q(z_i \mid z_{<i}, x)$ is typically a trainable diagonal Gaussian that is learned via the reparameterization trick (Kingma and Welling, 2013; Rezende et al., 2014). This differs from autoregressive and diffusion models, which optimize a similar variational bound but whose posterior is fixed rather than learned and has fixed dimensionality.
The objective in Equation 1 can be interpreted as a lossless codelength of the data $x$: the $-\log p(x|z)$ term corresponds to the distortion measured in nats, while the KL terms $\mathcal{L}_i = \mathbb{E}_{q(z_{<i}|x)}[\mathrm{KL}(q(z_i|z_{<i},x)\,\|\,p(z_i|z_{<i}))]$ make up the rate (Chen et al., 2016). Intuitively, the distortion term incentivizes the model to accurately reconstruct the data, while the KL terms encourage it to do so with as little information as possible.
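As a concrete illustration, here is a minimal PyTorch-style sketch of how the terms of Equation 1 could be computed for a hierarchy of diagonal Gaussian latent groups. The callables `encoder`, `prior_net`, and `decoder` are hypothetical placeholders, not the paper's actual architecture.

```python
import torch.distributions as D

def elbo_terms(x, encoder, prior_net, decoder, num_groups):
    """Sketch of Eq. (1): distortion plus per-group KL (rate) terms.
    x: (batch, D) flattened data. encoder/prior_net return the (mean, std)
    of q(z_i | z_<i, x) and p(z_i | z_<i) for the current group."""
    zs, kls = [], []
    for i in range(num_groups):
        q_i = D.Normal(*encoder(x, zs))            # q(z_i | z_<i, x)
        p_i = D.Normal(*prior_net(zs))             # p(z_i | z_<i)
        zs.append(q_i.rsample())                   # reparameterization trick
        kls.append(D.kl_divergence(q_i, p_i).sum(dim=-1))  # rate term L_i
    distortion = -decoder(zs).log_prob(x).sum(dim=-1)      # -log p(x|z)
    return distortion, kls  # ELBO loss = distortion + sum(kls)
```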
2.2 Considerations When Directly Optimizing the ELBO
When implementing hierarchical image VAEs with neural networks, a leading strategy is to have the latent variables $z_i$ start at low resolution and increase in resolution (Child, 2020; Vahdat and Kautz, 2020). This choice reflects the inductive bias that image generation should be done in a coarse-to-fine manner, starting from low-level structure and progressively adding finer details.
Additionally, since low-level features require much less information to encode than high-level features, we might expect the amount of KL in later latent groups to be significantly higher than in earlier groups. A case otherwise could indicate poor compression of low-level features. This is particularly important when generating visually appealing samples, because small differences between the aggregated posterior and the prior can cause many prior samples to fall outside of the posteriors encountered in training. For example, Sinha et al. (2021) showed how even a single bit of KL between $q(z_i|z_{<i})$ and $p(z_i|z_{<i})$ can create as much as a 50% prior hole. For stochastic layers that encode image structure, this prior hole would lead to many structurally incoherent images being sampled.
Despite these low-quality samples, the log-likelihood would be virtually unchanged, considering most natural images take several thousand bits or more to encode. We hypothesize that during optimization, VAEs are naturally inclined to focus on modeling high-level features that constitute the vast majority of the model's code lengths, weakening their ability to model low-level features. While allocating an equal number of stochastic layers to low-, intermediate-, and high-level features might be best for sample quality, a model optimized for likelihood would direct most of its layers towards encoding high-level features. Such behavior has empirically been observed in diffusion models, where Kingma et al. (2021) found that assigning more stochastic layers towards modeling imperceptible perturbations improved likelihood at the expense of sample quality.
3 Techniques for Improving Sample Quality
3.1 Controlling the Amount of Information in Each Layer
As discussed in Section 2.2, it might be beneficial if VAEs allocated more stochastic layers to the first few bits of information, which encode important aspects of image structure. However, optimizing Equation 1 with gradient descent will generally lead to most stochastic layers being allocated to high-level features. While many prior works have introduced techniques to prevent layers from encoding zero information, i.e. posterior collapse (Sønderby et al., 2016; Chen et al., 2016; Vahdat et al., 2018), none offer fine-grained control over the amount of information in each layer.
We are interested in learning VAE posteriors that follow a certain “information schedule”
which specifies the desired amount of information in each stochastic layer relative to the
total amount. To facilitate a desirable information schedule, we propose to reweight the
KL terms of the ELBO based on their value relative to the target KL determined by the
information schedule. The weighted objective is of the form
$$\mathcal{L} := \mathbb{E}_{q(z|x)}[-\log p(x|z)] + \sum_{i=1}^{N} \lambda(\mathcal{L}_i, a, b)\,\mathcal{L}_i \qquad (2)$$

where we choose $a = \frac{2}{3}\, l_{\mathrm{target},i} \sum_{j=1}^{N} \mathcal{L}_j$, $b = \frac{4}{3}\, l_{\mathrm{target},i} \sum_{j=1}^{N} \mathcal{L}_j$, and $l_{\mathrm{target},1}, l_{\mathrm{target},2}, \ldots, l_{\mathrm{target},N}$ is a pre-specified sequence of positive constants such that $\sum_i l_{\mathrm{target},i} = 1$. Intuitively, $l_{\mathrm{target},i}$ represents how much KL the $i$-th latent group should contain relative to the total KL; we choose it to be relative because some images inherently require more KL than others, and set a range of $[a, b]$ to give the posterior flexibility. The weighting function is defined as:
$$\lambda(\mathcal{L}_i, a, b) = \begin{cases} \max(\mathcal{L}_i/a,\, 0.1) & \mathcal{L}_i < a \\ 1 & a \le \mathcal{L}_i \le b \\ 1 + \min\big((\mathcal{L}_i - b)/a,\, 1\big) & \mathcal{L}_i > b \end{cases} \qquad (3)$$
If the current KL is within the target range, we use the normal weighting $\lambda = 1$. As the KL decreases below the lower target, we downweight it to encourage the model to use more information in this latent group; this downweighting factor becomes stronger the farther the KL is from the target. Similarly, if the KL is above the maximum target, we upweight it to discourage use of this group. When implementing this in practice, we apply a stop gradient to the weighting function to make it non-differentiable with respect to the model parameters, and constrain $\lambda$ to be between 0.1 and 2. As for the choice of $l_{\mathrm{target},i}$, we choose it to be an increasing geometric sequence plus a small constant, where $l_{\mathrm{target},N}$ is about 100 times larger than $l_{\mathrm{target},1}$. More specific details can be found in Appendix B.2.
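The following PyTorch-style sketch implements the weighting function of Equation 3 together with the stop gradient described above; the function and argument names are ours, and details may differ from the released codebase.

```python
import torch

def kl_weight(kl_i, target_frac, total_kl):
    """lambda(L_i, a, b) from Eq. (3). kl_i: observed KL of group i;
    target_frac: l_target_i; total_kl: sum of KL over all groups."""
    a = (2.0 / 3.0) * target_frac * total_kl            # lower target bound
    b = (4.0 / 3.0) * target_frac * total_kl            # upper target bound
    below = torch.clamp(kl_i / a, min=0.1)              # downweight under-used groups
    above = 1.0 + torch.clamp((kl_i - b) / a, max=1.0)  # upweight over-used groups
    weight = torch.where(kl_i < a, below,
                         torch.where(kl_i > b, above, torch.ones_like(kl_i)))
    return weight.detach()  # stop gradient: lambda is constant w.r.t. parameters

# Weighted rate of Eq. (2), given per-group KLs `kls` and target fractions
# `l_target` (e.g. an increasing geometric sequence normalized to sum to 1):
#   total = sum(kls)
#   rate = sum(kl_weight(kl, t, total) * kl for kl, t in zip(kls, l_target))
```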
Figure 1: Information schedules for VAEs with and without KL reweighting, compared to the diffusion model from Ho et al. (2020) on the CIFAR-10 dataset.

Figure 2: Ratio of the observed KL in each stochastic layer compared to the target range determined by the information schedule. Most fall within the range or slightly above.
Figure 1 shows the cumulative percentage of information used at each stochastic layer for VAEs with and without our KL-reweighting strategy, and a diffusion model for comparison. Without reweighting the KL terms, most information is added in the middle layers. With the information schedule, the model follows a much steeper schedule that adds most of the information in the last few stochastic layers. This makes the model allocate less capacity towards modeling high-level features, and more towards global structure. This manner of adding information bears closer similarity to a diffusion model, and we hypothesize such behavior is beneficial to the success of both.
3.2 Improving Learning Signal with Gaussian Decoders
Previous work in hierarchical VAEs has achieved very good image reconstructions, but relatively poor samples from the prior. For instance, NVAE reconstructions on the CIFAR-10 train set achieve an FID (Heusel et al., 2017) of 2.67, but unconditional samples achieve an FID of 51.71, indicating a large prior hole. We hypothesize that the gap between reconstructions and samples stems from the discrete log-likelihood parameterization of the reconstruction loss. Specifically, the 8-bit log-likelihood term requires performing almost perfect reconstructions to achieve low distortion: a model attaining a reconstruction loss of 1.8 bits per dim (the value Vahdat and Kautz (2020) report on the CIFAR-10 training set) must assign a geometric average of 29% probability to the exact pixel out of 256 possible values.
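To make the arithmetic explicit, a reconstruction loss of $r$ bits per dimension corresponds to a geometric-mean per-pixel probability of $2^{-r}$, so

$$2^{-1.8} \approx 0.287 \approx 29\%.$$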
We are interested in whether a squared-error reconstruction loss would lead to a better learning signal. There are several reasons to expect this. Firstly, de-emphasizing the reconstruction loss would in turn cause a decrease in KL and a smaller prior hole. Additionally, a squared-error loss acts in continuous space, which might be more natural for image color values than a discrete log-likelihood loss.
Optimizing a squared-error loss of the form $\gamma \|x - \hat{x}_\theta(z)\|^2$ is equivalent to minimizing the KL divergence between the distributions $q(\tilde{x}|x) := \mathcal{N}(\tilde{x};\, x,\, \sigma_{\mathrm{output}}^2 I)$ and $p(\tilde{x}|z) := \mathcal{N}(\tilde{x};\, \hat{x}_\theta(z),\, \sigma_{\mathrm{output}}^2 I)$, where $\sigma_{\mathrm{output}} = \frac{1}{\sqrt{2\gamma}}$. This form of the reconstruction loss more closely resembles the other KL terms in the loss objective. In our experiments, we opt to let the prior learn the variance of $p(\tilde{x}|z)$ with a diagonal Gaussian distribution $\Sigma_\theta(z) = \mathrm{diag}(\sigma_\theta(z))$.
Our new optimization objective becomes
$$\mathcal{L} := \mathbb{E}_{q(z|x)}\big[\mathrm{KL}(q(\tilde{x}|x) \,\|\, p(\tilde{x}|z))\big] + \sum_{i=1}^{N} \lambda(\mathcal{L}_i, a, b)\,\mathcal{L}_i \qquad (4)$$
In our experiments, we set $\sigma_{\mathrm{output}} = 0.025$ on data that has been scaled to $[-1, 1]$, or about 3.2 pixel values on a $[0, 255]$ scale. We choose this value to encourage sharp reconstructions that appear perceptually the same, while still allowing significantly more room for error when it comes to predicting the exact pixel. Nevertheless, this choice of $\sigma_{\mathrm{output}}$ upweights the reconstruction loss by nearly three orders of magnitude compared to the $L_2$ objective with $\gamma = 1$, leading to reconstructions and samples that are much less blurry.
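Concretely, $\sigma_{\mathrm{output}} = 0.025$ corresponds to $\gamma = \frac{1}{2\sigma_{\mathrm{output}}^2} = 800$, hence the near three-orders-of-magnitude upweighting relative to $\gamma = 1$. Below is a minimal sketch of the distortion term in Equation 4, assuming the decoder predicts a per-pixel mean and log standard deviation (names are ours):

```python
import torch

def gaussian_recon_kl(x, x_hat, log_sigma_p, sigma_output=0.025):
    """KL(q(x~|x) || p(x~|z)) between diagonal Gaussians, summed per image.
    q is centered on the data x with fixed std sigma_output; p is centered
    on the prediction x_hat with learned std exp(log_sigma_p)."""
    var_q = sigma_output ** 2
    var_p = torch.exp(2.0 * log_sigma_p)
    # Closed-form KL between two univariate Gaussians, applied per dimension
    kl = 0.5 * (torch.log(var_p / var_q) + (var_q + (x - x_hat) ** 2) / var_p - 1.0)
    return kl.sum(dim=tuple(range(1, x.dim())))
```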
While hierarchical VAEs commonly parameterize the $p(x|z)$ term with a discretized mixture of logistics (DMoL) layer (Salimans et al., 2017; Kingma et al., 2016), our neural network outputs means and variances of a continuous Gaussian distribution. As such, we parameterize the $p(x|z)$ term using a Gaussian CDF that corresponds to the probability of a sample from $p(\tilde{x}|z)$ landing in the correct bin, as done in Ho et al. (2020):
$$p(x|z) = \prod_{i=1}^{D} \int_{\delta_-(x_i)}^{\delta_+(x_i)} \mathcal{N}\big(x;\, \hat{x}_\theta(z)_i,\, \sigma_\theta(z)_i\big)\, dx$$

$$\delta_-(x) = \begin{cases} -\infty & x = -1 \\ x - \tfrac{1}{255} & x > -1 \end{cases} \qquad \delta_+(x) = \begin{cases} \infty & x = 1 \\ x + \tfrac{1}{255} & x < 1 \end{cases} \qquad (5)$$
where $D$ is the data dimensionality and subscript $i$ denotes the $i$-th dimension. To generate samples, one could randomly sample from $p(\tilde{x}|z)$ and display the result. However, sampling from this distribution essentially adds random noise to the predicted image, which would hurt visual quality. As such, we output only the predicted mean when displaying samples.
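A sketch of this discretized Gaussian likelihood (Equation 5), assuming 8-bit data scaled to $[-1, 1]$ so that the half bin width is $1/255$, matching the integration limits above (function name is ours):

```python
import torch

def discretized_gaussian_log_prob(x, mean, std):
    """log p(x|z) from Eq. (5): probability mass that N(mean, std) assigns
    to each pixel's bin, with open-ended bins at the extremes -1 and 1."""
    normal = torch.distributions.Normal(mean, std)
    cdf_plus = torch.where(x >= 1.0, torch.ones_like(x),
                           normal.cdf(x + 1.0 / 255.0))   # upper limit delta_+
    cdf_minus = torch.where(x <= -1.0, torch.zeros_like(x),
                            normal.cdf(x - 1.0 / 255.0))  # lower limit delta_-
    log_probs = torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))
    return log_probs.sum(dim=tuple(range(1, x.dim())))    # product over D dims
```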
3.3 Classifier-Free Guidance in Conditional VAEs
Because of their inclusive KL divergence objective, imperfect likelihood-based models will assign high probability to low-density regions of the data distribution; samples from these regions result in low-quality images. As such, we are interested in a way to improve fidelity at the expense of diversity. One technique that has recently achieved great success in diffusion models is classifier-free guidance (Ho and Salimans, 2022; Nichol et al., 2021). For a conditional model $p(x|c)$, this sampling technique draws samples from $\frac{1}{Z}\, p(x|c) \left(\frac{p(x|c)}{p(x)}\right)^{w} \propto p(x|c)\, p(c|x)^{w}$, which reweights the data distribution according to how likely a sample can be classified as the correct class. This classification uses the model itself to estimate conditional and unconditional probabilities, avoiding the need for an external classification network.
To facilitate guided sampling in VAEs, we first drop the class label in the prior with 10% probability during training to learn unconditional prior transitions $p(z_i|z_{<i})$. During sampling, we keep two separate running hidden states for the conditional and unconditional generation paths, which output latent distributions $p(z_i|z_{<i}, c) = \mathcal{N}(z_i;\, \mu_c,\, \mathrm{diag}(\sigma_c^2))$ and $p(z_i|z_{<i}) = \mathcal{N}(z_i;\, \mu_u,\, \mathrm{diag}(\sigma_u^2))$ respectively. The unconditional generation path uses a dummy label as its “conditioning”. From the computed conditional and unconditional distribution parameters, we define a guided probability distribution $p_{\mathrm{guided}}(z_i|z_{<i}, c)$, and draw a
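The excerpt cuts off before the definition of $p_{\mathrm{guided}}$. Purely as an illustration of how a guided sampling step could look, the sketch below applies the mean-extrapolation rule familiar from diffusion guidance, $\mu_{\mathrm{guided}} = \mu_c + w(\mu_c - \mu_u)$; this rule is our assumption, not necessarily the paper's actual guided distribution.

```python
import torch

def sample_guided_latent(mu_c, sigma_c, mu_u, sigma_u, w):
    """Hypothetical guided draw for one latent group. mu_c/sigma_c come from
    the conditional path, mu_u/sigma_u from the unconditional (dummy-label)
    path, and w is the guidance weight.
    NOTE: the extrapolated mean is the diffusion-style guidance rule, assumed
    here because the paper's p_guided is truncated in this excerpt; sigma_u
    is unused under this simple rule."""
    mu_guided = mu_c + w * (mu_c - mu_u)   # push towards the conditional mean
    return mu_guided + sigma_c * torch.randn_like(mu_c)  # reparameterized sample
```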