
1 Introduction
The success of modern generative models relies on neural network architectures for building powerful representations of the data, typically featuring an encoder (responsible for feature learning) and a decoder (responsible for data generation).1 Most such encoder-decoder architectures feature a bottleneck, with the latent dimension of the encoder often being much smaller than the dimension of the original data. Extensive experimental evidence suggests that a lower-dimensional latent space improves the quality of generative models by allowing them to generate data based on several key features of the latent representation. For example, this is the case for the variational autoencoder (VAE; Kingma and Welling (2013); Makhzani, Shlens, Jaitly, Goodfellow, and Frey (2015)), generative adversarial networks (GANs; Radford, Metz, and Chintala (2015); Che, Li, Jacob, Bengio, and Li (2016); Peng, Kanazawa, Toyer, Abbeel, and Levine (2018); Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio (2020); Donahue, Krähenbühl, and Darrell (2016); Dumoulin, Belghazi, Poole, Mastropietro, Lamb, Arjovsky, and Courville (2016); Pathak, Krahenbuhl, Donahue, Darrell, and Efros (2016)), and stable diffusion (Sohl-Dickstein, Weiss, Maheswaranathan, and Ganguli (2015); Ho, Jain, and Abbeel (2020)). The same idea of encoding data into a low-dimensional manifold and then decoding it for discriminative purposes underlies recent successful attempts to build powerful, general perception models, such as those of Jaegle, Gimeno, Brock, Vinyals, Zisserman, and Carreira (2021) and Girdhar, El-Nouby, Liu, Singh, Alwala, Joulin, and Misra (2023).2
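To make the bottleneck structure concrete, the sketch below shows a minimal encoder-decoder pair whose latent dimension is much smaller than the data dimension. It is an illustration only, not the benign autoencoder studied in this paper; it assumes a PyTorch environment, and the input dimension (784, e.g. flattened 28x28 images), latent dimension (16), and reconstruction objective are hypothetical choices.

```python
# Minimal bottleneck autoencoder sketch (illustrative only; not the BAE of this paper).
# Assumes PyTorch; input_dim and latent_dim are hypothetical choices.
import torch
import torch.nn as nn


class BottleneckAutoencoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 16):
        super().__init__()
        # Encoder: maps the data to a much lower-dimensional latent representation.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstructs (generates) data from the latent code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)      # feature learning through the bottleneck
        return self.decoder(z)   # data generation from the latent features


# Usage: reconstruct a batch of 32 inputs through the 16-dimensional bottleneck.
model = BottleneckAutoencoder()
x = torch.randn(32, 784)
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # a standard reconstruction objective
```

The 16-dimensional latent code here plays the role of the low-dimensional latent space discussed above: the decoder must generate the data from a handful of learned features rather than from the raw input.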
The impressive empirical achievements of the models cited above have further widened the gap between their performance and our theoretical understanding thereof. In particular, little is known about the role of bottlenecks and the geometry of the respective latent spaces.3 In this paper, we try to bridge this gap. To this end, we formally define the generative problem of finding the best encoder-decoder architecture. Using novel mathematical techniques combining ideas from optimal transport theory (Villani, 2009) and metric geometry (Burago, Burago, and Ivanov, 2022), we characterize the solution to the optimal encoder-decoder problem, which we name the benign autoencoder (BAE). We show that BAE optimally reg-
1 While the original text generation and translation models used encoder-decoder architectures, the recent progress in large language models (LLMs) relies on decoder-only architectures. Understanding the role of encoders for LLMs is an important direction for future research.
2 The Perceiver of Jaegle et al. (2021) is designed to handle arbitrary configurations of different modalities (images, audio, and video data) using a single Transformer-based architecture. It introduces a small set of latent units that forms a bottleneck, eliminating the quadratic scaling problem of classical Transformers and decoupling the network depth from the input size. The authors use a bottleneck of dimension 512 for the image encoding, a huge dimensionality reduction compared to the input dimension of 224 × 224 = 50,176 pixels.
3 For some recent progress in the theoretical understanding of GANs, see Arjovsky and Bottou (2017).