
Optimizing Hierarchical Image VAEs for Sample Quality

Eric Luhman∗
ericluhman2@gmail.com

Troy Luhman∗
troyluhman@gmail.com

∗Equal Contribution

arXiv:2210.10205v1 [cs.LG] 18 Oct 2022
Abstract
While hierarchical variational autoencoders (VAEs) have achieved great density estimation on image modeling tasks, samples from their prior tend to look less convincing than models with similar log-likelihood. We attribute this to learned representations that over-emphasize compressing imperceptible details of the image. To address this, we introduce a KL-reweighting strategy to control the amount of information in each latent group, and employ a Gaussian output layer to reduce sharpness in the learning objective. To trade off image diversity for fidelity, we additionally introduce a classifier-free guidance strategy for hierarchical VAEs. We demonstrate the effectiveness of these techniques in our experiments. Code is available at https://github.com/tcl9876/visual-vae.
1 Introduction
Deep likelihood-based models have achieved impressive capabilities on unsupervised image tasks. Models such as autoregressive models, diffusion models (Ho et al., 2020), and variational autoencoders (Kingma and Welling, 2013; Sønderby et al., 2016) all perform excellently on the log-likelihood metric (Child et al., 2019; Kingma et al., 2021; Vahdat and Kautz, 2020), indicating the promise of each approach. Autoregressive and diffusion models have additionally proved capable of generating high-fidelity images (Razavi et al., 2019; Dhariwal and Nichol, 2021), reaching unprecedented levels of performance in complex text-to-image tasks (Ramesh et al., 2021; Rombach et al., 2022; Ramesh et al., 2022; Yu et al., 2022).
While images sampled from autoregressive or diffusion priors usually exhibit a high degree
of realism, the same often cannot be said of VAEs despite their good likelihood. This result
is especially surprising considering the close similarities between VAEs and diffusion models.
Both models optimize a variational inference objective, employ a stack of diagonal Gaussian
latent distributions, and exhibit coarse-to-fine generation behavior (Ho et al., 2020; Child,
2020). The primary difference between them is the use of learned posterior distributions in
VAEs, compared to the manually specified posteriors in diffusion models. Given the gap in
performance, one might wonder if fixed posteriors are indeed better suited for high-fidelity
image synthesis.
We begin our paper by offering insights on why existing VAEs are unable to produce high-quality samples despite their good likelihood. Specifically, the structure of an image needs only a few bits of information to encode, while the majority of the code length is occupied by minuscule details. We argue that hierarchical VAEs are naturally inclined to model these fine details, and can largely ignore global structure since it contributes relatively little to the likelihood.
In this work, we are motivated by the single goal of improving the perceptual quality of samples from the prior of hierarchical VAEs. To this end, we propose two techniques to emphasize modeling global structure. The first offers control over the amount of information in each latent group by reweighting terms of the ELBO, which can be used to allocate more latent groups to the first few bits of information. The second replaces the discretized mixture of logistics output layer with a Gaussian distribution trained with a continuous KL objective, which greatly reduces the KL used while maintaining near-perfect reconstructions. Orthogonal to these techniques, we also introduce a classifier-free guidance strategy for VAEs that trades image diversity for fidelity at sampling time.
We test our method on the CIFAR-10 dataset, showing that our techniques improve the visual quality of generated samples, reducing FID by up to 2× over a controlled baseline. We additionally demonstrate superior compression capabilities at low rates, further justifying our method despite its worse likelihoods. Finally, we verify the effectiveness of classifier-free guidance on class-conditional ImageNet $64^2$.
2 Background
2.1 Hierarchical Variational Autoencoders
We provide a brief review of hierarchical VAEs in this section; a more thorough introduction can be found in Appendix A. A hierarchical variational autoencoder is a generative model that uses a sequence of latent variables $z := \{z_1, \ldots, z_N\}$ to estimate a joint distribution $p(x, z) = p(x|z)\,p(z)$, with $p(z) := \prod_{i=1}^{N} p(z_i \mid z_{<i})$. In general, the true posterior $p(z|x)$ is intractable, so an approximate posterior $q(z|x) := \prod_{i=1}^{N} q(z_i \mid z_{<i}, x)$ is used instead. The generative model is trained to minimize the following variational inference objective:
$$\mathcal{L} := \mathbb{E}_{q(z|x)}[-\log p(x|z)] + \underbrace{\mathrm{KL}\big(q(z_1|x)\,\|\,p(z_1)\big)}_{\mathcal{L}_1} + \sum_{i=2}^{N} \underbrace{\mathbb{E}_{q(z_{<i}|x)}\big[\mathrm{KL}\big(q(z_i|z_{<i},x)\,\|\,p(z_i|z_{<i})\big)\big]}_{\mathcal{L}_i} \qquad (1)$$
The posterior $q(z_i \mid z_{<i}, x)$ is typically a trainable diagonal Gaussian that is learned via the reparameterization trick (Kingma and Welling, 2013; Rezende et al., 2014). This differs from autoregressive and diffusion models, which optimize a similar variational bound but whose posterior is fixed rather than learned and has fixed dimensionality.
The objective in Equation 1 can be interpreted as a lossless codelength of the data $x$: the $-\log p(x|z)$ term corresponds to the distortion measured in nats, while the KL terms $\mathcal{L}_i = \mathbb{E}_{q(z_{<i}|x)}[\mathrm{KL}(q(z_i|z_{<i},x)\,\|\,p(z_i|z_{<i}))]$ make up the rate (Chen et al., 2016). Intuitively, the distortion term incentivizes the model to accurately reconstruct the data, while the KL terms encourage it to do so with as little information as possible.
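As a concrete illustration, here is a minimal PyTorch-style sketch of how the terms of Equation 1 could be computed for a hierarchy of diagonal Gaussian latent groups. The callables `encoder`, `prior_net`, and `decoder` are hypothetical placeholders, not the paper's actual architecture.

```python
import torch.distributions as D

def elbo_terms(x, encoder, prior_net, decoder, num_groups):
    """Sketch of Eq. (1): distortion plus per-group KL (rate) terms.
    x: (batch, D) flattened data. encoder/prior_net return the (mean, std)
    of q(z_i | z_<i, x) and p(z_i | z_<i) for the current group."""
    zs, kls = [], []
    for i in range(num_groups):
        q_i = D.Normal(*encoder(x, zs))            # q(z_i | z_<i, x)
        p_i = D.Normal(*prior_net(zs))             # p(z_i | z_<i)
        zs.append(q_i.rsample())                   # reparameterization trick
        kls.append(D.kl_divergence(q_i, p_i).sum(dim=-1))  # rate term L_i
    distortion = -decoder(zs).log_prob(x).sum(dim=-1)      # -log p(x|z)
    return distortion, kls  # ELBO loss = distortion + sum(kls)
```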
2.2 Considerations When Directly Optimizing the ELBO
When implementing hierarchical image VAEs with neural networks, a leading strategy is to have the latent variables $z_i$ start at low resolution and increase in resolution (Child, 2020; Vahdat and Kautz, 2020). This choice reflects the inductive bias that image generation should be done in a coarse-to-fine manner, starting from low-level structure and progressively adding finer details.
Additionally, since low-level features require much less information to encode than high-level features, we might expect the amount of KL in later latent groups to be significantly higher than in earlier groups. A case otherwise could indicate poor compression of low-level features. This is particularly important when generating visually appealing samples, because small differences between the aggregated posterior and the prior can cause many prior samples to fall outside of the posteriors encountered in training. For example, Sinha et al. (2021) showed how even a single bit of KL between $q(z_i|z_{<i})$ and $p(z_i|z_{<i})$ can create as much as a 50% prior hole. For stochastic layers that encode image structure, this prior hole would lead to many structurally incoherent images being sampled.
Despite these low-quality samples, the log-likelihood would be virtually unchanged, considering most natural images take several thousand bits or more to encode. We hypothesize that during optimization, VAEs are naturally inclined to focus on modeling high-level features that constitute the vast majority of the model's code lengths, weakening their ability to model low-level features. While allocating an equal number of stochastic layers to low-, intermediate-, and high-level features might be best for sample quality, a model optimized for likelihood would direct most of its layers towards encoding high-level features. Such behavior has empirically been observed in diffusion models, where Kingma et al. (2021) found that assigning more stochastic layers towards modeling imperceptible perturbations improved likelihood at the expense of sample quality.
3 Techniques for Improving Sample Quality
3.1 Controlling the Amount of Information in Each Layer
As discussed in Section 2.2, it might be beneficial if VAEs allocated more stochastic layers to the first few bits of information, which encode important aspects of image structure. However, optimizing Equation 1 with gradient descent will generally lead to most stochastic layers being allocated to high-level features. While many prior works have introduced techniques to prevent layers from encoding zero information, i.e. posterior collapse (Sønderby et al., 2016; Chen et al., 2016; Vahdat et al., 2018), none offer fine-grained control over the amount of information in each layer.
We are interested in learning VAE posteriors that follow a certain “information schedule”
which specifies the desired amount of information in each stochastic layer relative to the
total amount. To facilitate a desirable information schedule, we propose to reweight the
KL terms of the ELBO based on their value relative to the target KL determined by the
information schedule. The weighted objective is of the form
$$\mathcal{L} := \mathbb{E}_{q(z|x)}[-\log p(x|z)] + \sum_{i=1}^{N} \lambda(\mathcal{L}_i, a, b)\,\mathcal{L}_i \qquad (2)$$

where we choose $a = \frac{2}{3}\, l_{\mathrm{target},i} \sum_{j=1}^{N} \mathcal{L}_j$, $b = \frac{4}{3}\, l_{\mathrm{target},i} \sum_{j=1}^{N} \mathcal{L}_j$, and $l_{\mathrm{target},1}, l_{\mathrm{target},2}, \ldots, l_{\mathrm{target},N}$ is a pre-specified sequence of positive constants such that $\sum_i l_{\mathrm{target},i} = 1$. Intuitively, $l_{\mathrm{target},i}$ represents how much KL the $i$-th latent group should contain relative to the total KL; we choose it to be relative because some images inherently require more KL than others, and set a range of $[a, b]$ to give the posterior flexibility. The weighting function is defined as:
$$\lambda(\mathcal{L}_i, a, b) = \begin{cases} \max(\mathcal{L}_i/a,\, 0.1) & \mathcal{L}_i < a \\ 1 & a \le \mathcal{L}_i \le b \\ 1 + \min\big((\mathcal{L}_i - b)/a,\, 1\big) & \mathcal{L}_i > b \end{cases} \qquad (3)$$
If the current KL is within the target range, we use the normal weighting $\lambda = 1$. As the KL decreases below the lower target, we downweight it to encourage the model to use more information in this latent group; this downweighting factor becomes stronger the farther the KL is from the target. Similarly, if the KL is above the maximum target, we upweight it to discourage use of this group. When implementing this in practice, we apply a stop gradient to the weighting function to make it non-differentiable with respect to the model parameters, and constrain $\lambda$ to be between 0.1 and 2. As for the choice of $l_{\mathrm{target},i}$, we choose it to be an increasing geometric sequence plus a small constant, where $l_{\mathrm{target},N}$ is about 100 times larger than $l_{\mathrm{target},1}$. More specific details can be found in Appendix B.2.
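The following PyTorch-style sketch implements the weighting function of Equation 3 together with the stop gradient described above; the function and argument names are ours, and details may differ from the released codebase.

```python
import torch

def kl_weight(kl_i, target_frac, total_kl):
    """lambda(L_i, a, b) from Eq. (3). kl_i: observed KL of group i;
    target_frac: l_target_i; total_kl: sum of KL over all groups."""
    a = (2.0 / 3.0) * target_frac * total_kl            # lower target bound
    b = (4.0 / 3.0) * target_frac * total_kl            # upper target bound
    below = torch.clamp(kl_i / a, min=0.1)              # downweight under-used groups
    above = 1.0 + torch.clamp((kl_i - b) / a, max=1.0)  # upweight over-used groups
    weight = torch.where(kl_i < a, below,
                         torch.where(kl_i > b, above, torch.ones_like(kl_i)))
    return weight.detach()  # stop gradient: lambda is constant w.r.t. parameters

# Weighted rate of Eq. (2), given per-group KLs `kls` and target fractions
# `l_target` (e.g. an increasing geometric sequence normalized to sum to 1):
#   total = sum(kls)
#   rate = sum(kl_weight(kl, t, total) * kl for kl, t in zip(kls, l_target))
```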
Figure 1: Information schedules for VAEs with and without KL reweighting, compared to the diffusion model from Ho et al. (2020) on the CIFAR-10 dataset.

Figure 2: Ratio of the observed KL in each stochastic layer compared to the target range determined by the information schedule. Most fall within the range or slightly above.
Figure 1 shows the cumulative percentage of information used at each stochastic layer for VAEs with and without our KL-reweighting strategy, and a diffusion model for comparison. Without reweighting the KL terms, most information is added in the middle layers. With the information schedule, the model follows a much steeper schedule that adds most of the information in the last few stochastic layers. This makes the model allocate less capacity towards modeling high-level features, and more towards global structure. This manner of adding information bears closer similarity to a diffusion model, and we hypothesize such behavior is beneficial to the success of both.
3.2 Improving Learning Signal with Gaussian Decoders
Previous work in hierarchical VAEs has achieved very good image reconstructions, but relatively poor samples from the prior. For instance, NVAE reconstructions on the CIFAR-10 train set achieve an FID (Heusel et al., 2017) of 2.67, but unconditional samples achieve an FID of 51.71, indicating a large prior hole. We hypothesize that the gap between reconstructions and samples stems from the discrete log-likelihood parameterization of the reconstruction loss. Specifically, the 8-bit log-likelihood term requires performing almost perfect reconstructions to achieve low distortion: a model attaining a reconstruction loss of 1.8 bits per dim (the value Vahdat and Kautz (2020) report on the CIFAR-10 training set) must assign a geometric average of 29% probability to the exact pixel out of 256 possible values.
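To make the arithmetic explicit, a reconstruction loss of $r$ bits per dimension corresponds to a geometric-mean per-pixel probability of $2^{-r}$, so

$$2^{-1.8} \approx 0.287 \approx 29\%.$$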
We are interested in whether a squared-error reconstruction loss would lead to a better learning signal. There are several reasons to expect this. Firstly, de-emphasizing the reconstruction loss would in turn cause a decrease in KL and a smaller prior hole. Additionally, a squared-error loss acts in continuous space, which might be more natural for image color values than a discrete log-likelihood loss.
Optimizing a squared-error loss of the form $\gamma \|x - \hat{x}_\theta(z)\|^2$ is equivalent to minimizing the KL divergence between the distributions $q(\tilde{x}|x) := \mathcal{N}(\tilde{x};\, x,\, \sigma_{\mathrm{output}}^2 I)$ and $p(\tilde{x}|z) := \mathcal{N}(\tilde{x};\, \hat{x}_\theta(z),\, \sigma_{\mathrm{output}}^2 I)$, where $\sigma_{\mathrm{output}} = \frac{1}{\sqrt{2\gamma}}$. This form of the reconstruction loss more closely resembles the other KL terms in the loss objective. In our experiments, we opt to let the prior learn the variance of $p(\tilde{x}|z)$ with a diagonal Gaussian distribution $\Sigma_\theta(z) = \mathrm{diag}(\sigma_\theta(z))$.
Our new optimization objective becomes
$$\mathcal{L} := \mathbb{E}_{q(z|x)}\big[\mathrm{KL}(q(\tilde{x}|x) \,\|\, p(\tilde{x}|z))\big] + \sum_{i=1}^{N} \lambda(\mathcal{L}_i, a, b)\,\mathcal{L}_i \qquad (4)$$
In our experiments, we set $\sigma_{\mathrm{output}} = 0.025$ on data that has been scaled to $[-1, 1]$, or about 3.2 pixel values on a $[0, 255]$ scale. We choose this value to encourage sharp reconstructions that appear perceptually the same, while still allowing significantly more room for error when it comes to predicting the exact pixel. Nevertheless, this choice of $\sigma_{\mathrm{output}}$ upweights the reconstruction loss by nearly three orders of magnitude compared to the $L_2$ objective with $\gamma = 1$, leading to reconstructions and samples that are much less blurry.
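Concretely, $\sigma_{\mathrm{output}} = 0.025$ corresponds to $\gamma = \frac{1}{2\sigma_{\mathrm{output}}^2} = 800$, hence the near three-orders-of-magnitude upweighting relative to $\gamma = 1$. Below is a minimal sketch of the distortion term in Equation 4, assuming the decoder predicts a per-pixel mean and log standard deviation (names are ours):

```python
import torch

def gaussian_recon_kl(x, x_hat, log_sigma_p, sigma_output=0.025):
    """KL(q(x~|x) || p(x~|z)) between diagonal Gaussians, summed per image.
    q is centered on the data x with fixed std sigma_output; p is centered
    on the prediction x_hat with learned std exp(log_sigma_p)."""
    var_q = sigma_output ** 2
    var_p = torch.exp(2.0 * log_sigma_p)
    # Closed-form KL between two univariate Gaussians, applied per dimension
    kl = 0.5 * (torch.log(var_p / var_q) + (var_q + (x - x_hat) ** 2) / var_p - 1.0)
    return kl.sum(dim=tuple(range(1, x.dim())))
```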
While hierarchical VAEs commonly parameterize the $p(x|z)$ term with a discretized mixture of logistics (DMoL) layer (Salimans et al., 2017; Kingma et al., 2016), our neural network outputs means and variances of a continuous Gaussian distribution. As such, we parameterize the $p(x|z)$ term using a Gaussian CDF that corresponds to the probability of a sample from $p(\tilde{x}|z)$ landing in the correct bin, as done in Ho et al. (2020):
$$p(x|z) = \prod_{i=1}^{D} \int_{\delta_-(x_i)}^{\delta_+(x_i)} \mathcal{N}\big(x;\, \hat{x}_\theta(z)_i,\, \sigma_\theta(z)_i\big)\, dx$$

$$\delta_-(x) = \begin{cases} -\infty & x = -1 \\ x - \tfrac{1}{255} & x > -1 \end{cases} \qquad \delta_+(x) = \begin{cases} \infty & x = 1 \\ x + \tfrac{1}{255} & x < 1 \end{cases} \qquad (5)$$
where $D$ is the data dimensionality and subscript $i$ denotes the $i$-th dimension. To generate samples, one could randomly sample from $p(\tilde{x}|z)$ and display the result. However, sampling from this distribution essentially adds random noise to the predicted image, which would hurt visual quality. As such, we output only the predicted mean when displaying samples.
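A sketch of this discretized Gaussian likelihood (Equation 5), assuming 8-bit data scaled to $[-1, 1]$ so that the half bin width is $1/255$, matching the integration limits above (function name is ours):

```python
import torch

def discretized_gaussian_log_prob(x, mean, std):
    """log p(x|z) from Eq. (5): probability mass that N(mean, std) assigns
    to each pixel's bin, with open-ended bins at the extremes -1 and 1."""
    normal = torch.distributions.Normal(mean, std)
    cdf_plus = torch.where(x >= 1.0, torch.ones_like(x),
                           normal.cdf(x + 1.0 / 255.0))   # upper limit delta_+
    cdf_minus = torch.where(x <= -1.0, torch.zeros_like(x),
                            normal.cdf(x - 1.0 / 255.0))  # lower limit delta_-
    log_probs = torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))
    return log_probs.sum(dim=tuple(range(1, x.dim())))    # product over D dims
```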
3.3 Classifier-Free Guidance in Conditional VAEs
Because of their inclusive KL divergence objective, imperfect likelihood-based models will assign high probability to low-density regions of the data distribution; samples from these regions result in low-quality images. As such, we are interested in a way to improve fidelity at the expense of diversity. One technique that has recently achieved great success in diffusion models is classifier-free guidance (Ho and Salimans, 2022; Nichol et al., 2021). For a conditional model $p(x|c)$, this sampling technique draws samples from $\frac{1}{Z}\, p(x|c) \left(\frac{p(x|c)}{p(x)}\right)^{w} \propto p(x|c)\, p(c|x)^{w}$, which reweights the data distribution according to how likely a sample can be classified as the correct class. This classification uses the model itself to estimate conditional and unconditional probabilities, avoiding the need for an external classification network.
To facilitate guided sampling in VAEs, we first drop the class label in the prior with 10% probability during training to learn unconditional prior transitions $p(z_i|z_{<i})$. During sampling, we keep two separate running hidden states for the conditional and unconditional generation paths, which output latent distributions $p(z_i|z_{<i}, c) = \mathcal{N}(z_i;\, \mu_c,\, \mathrm{diag}(\sigma_c^2))$ and $p(z_i|z_{<i}) = \mathcal{N}(z_i;\, \mu_u,\, \mathrm{diag}(\sigma_u^2))$ respectively. The unconditional generation path uses a dummy label as its “conditioning”. From the computed conditional and unconditional distribution parameters, we define a guided probability distribution $p_{\mathrm{guided}}(z_i|z_{<i}, c)$, and draw a
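The excerpt cuts off before the definition of $p_{\mathrm{guided}}$. Purely as an illustration of how a guided sampling step could look, the sketch below applies the mean-extrapolation rule familiar from diffusion guidance, $\mu_{\mathrm{guided}} = \mu_c + w(\mu_c - \mu_u)$; this rule is our assumption, not necessarily the paper's actual guided distribution.

```python
import torch

def sample_guided_latent(mu_c, sigma_c, mu_u, sigma_u, w):
    """Hypothetical guided draw for one latent group. mu_c/sigma_c come from
    the conditional path, mu_u/sigma_u from the unconditional (dummy-label)
    path, and w is the guidance weight.
    NOTE: the extrapolated mean is the diffusion-style guidance rule, assumed
    here because the paper's p_guided is truncated in this excerpt; sigma_u
    is unused under this simple rule."""
    mu_guided = mu_c + w * (mu_c - mu_u)   # push towards the conditional mean
    return mu_guided + sigma_c * torch.randn_like(mu_c)  # reparameterized sample
```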