Benign Autoencoders
Semyon Malamud, Teng Andrea Xu, and Antoine Didisheim
École Polytechnique Fédérale de Lausanne (EPFL)
University of Geneva
August 29, 2023
Abstract
Recent progress in Generative Artificial Intelligence (AI) relies on efficient data representations, often featuring encoder-decoder architectures. We formalize the mathematical problem of finding the optimal encoder-decoder pair and characterize its solution, which we name the “benign autoencoder” (BAE). We prove that BAE projects data onto a manifold whose dimension is the optimal compressibility dimension of the generative problem. We highlight surprising connections between BAE and several recent developments in AI, such as conditional GANs, context encoders, stable diffusion, stacked autoencoders, and the learning capabilities of generative models. As an illustration, we show how BAE can find optimal, low-dimensional latent representations that improve the performance of a discriminator under a distribution shift. By compressing “malignant” data dimensions, BAE leads to smoother and more stable gradients.
Semyon Malamud is at the Swiss Finance Institute, EPFL, and CEPR. Teng Andrea Xu is at EPFL.
Antoine Didisheim is at the University of Geneva. Email: semyon.malamud@epfl.ch. We thank Emanuel
Abbe and Philipp Schneider for their helpful comments and suggestions. We also acknowledge the financial
support of the Swiss National Science Foundation, Grant 100018 192692, and the Swiss Finance Institute.
All errors are our own. This work was supported by a grant from the Swiss National Supercomputing Centre
(CSCS) under project ID sm81.
arXiv:2210.00637v4 [cs.LG] 28 Aug 2023
1 Introduction
The success of modern generative models relies on neural network architectures for building powerful representations of the data, typically featuring an encoder (responsible for feature learning) and a decoder (responsible for data generation).1 Most such encoder-decoder architectures feature a bottleneck, with the latent dimension of the encoder often being much smaller than the dimension of the original data. Extensive experimental evidence suggests that a lower-dimensional latent space improves the quality of generative models by allowing them to generate data based on several key features of the latent representation. For example, this is the case for the variational autoencoder (VAE; Kingma and Welling (2013), Makhzani, Shlens, Jaitly, Goodfellow, and Frey (2015)), generative adversarial networks (GANs; Radford, Metz, and Chintala (2015), Che, Li, Jacob, Bengio, and Li (2016), Peng, Kanazawa, Toyer, Abbeel, and Levine (2018), Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio (2020), Donahue, Krähenbühl, and Darrell (2016), Dumoulin, Belghazi, Poole, Mastropietro, Lamb, Arjovsky, and Courville (2016), Pathak, Krahenbuhl, Donahue, Darrell, and Efros (2016)), and stable diffusion (Sohl-Dickstein, Weiss, Maheswaranathan, and Ganguli (2015); Ho, Jain, and Abbeel (2020)). The same idea of encoding data into a low-dimensional manifold and then decoding it for discriminative purposes underlies recent successful attempts to build powerful, general perception models, such as those of Jaegle, Gimeno, Brock, Vinyals, Zisserman, and Carreira (2021) and Girdhar, El-Nouby, Liu, Singh, Alwala, Joulin, and Misra (2023).2
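To make the bottleneck structure concrete, the following is a minimal sketch of such an encoder-decoder pair, written in PyTorch purely for illustration; the layer widths, the 784-dimensional input, and the 32-dimensional latent code are our own arbitrary choices and are not taken from any of the architectures cited above.

import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    """Toy encoder-decoder: the 32-dimensional latent code is the bottleneck."""
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # Encoder E: X -> Z (feature learning)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder D: Z -> X-hat (data generation)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))   # A(x) = D(E(x))

model = BottleneckAutoencoder()
x = torch.randn(16, 784)                        # a batch of toy inputs
loss = nn.functional.mse_loss(model(x), x)      # reconstruction objective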
The impressive empirical achievements of the models cited above have further widened the gap between their performance and our theoretical understanding thereof. In particular, little is known about the role of bottlenecks and the geometry of the respective latent spaces.3 In this paper, we try to bridge this gap. To this end, we formally define the generative problem of finding the best encoder-decoder architecture. Using novel mathematical techniques combining ideas from optimal transport theory Villani (2009) and metric geometry Burago, Burago, and Ivanov (2022), we characterize the solution to the optimal encoder-decoder problem, which we name the benign autoencoder (BAE).
1 While the original text generation and translation models used encoder-decoder architectures, the recent progress in large language models (LLMs) relies on decoder-only architectures. Understanding the role of encoders for LLMs is an important direction for future research.
2 The Perceiver of Jaegle et al. (2021) is designed to handle arbitrary configurations of different modalities (images, audio, and video data) using a single Transformer-based architecture. It introduces a small set of latent units that forms a bottleneck, eliminating the quadratic scaling problem of classical Transformers and decoupling the network depth from the input's size. The authors use a bottleneck of dimension 512 for the image encoding, which is a huge dimensionality reduction compared to the input dimension of 224 × 224 = 50,176 pixels.
3 For some recent progress in the theoretical understanding of GANs, see Arjovsky and Bottou (2017).
We show that BAE optimally regularizes the generative problem by compressing the “malignant” dimensions of the data, thus convexifying the problem through dimensionality reduction.4 We also characterize the latent dimension of the optimal BAE, which we refer to as the compressibility dimension of the learning problem.
In addition to providing a theoretical foundation for optimal latent representations in several important generative problems (see, e.g., Che et al. (2016), Peng et al. (2018), Goodfellow et al. (2020), Pathak et al. (2016)), we test our theory on the distance-regularized GAN and context-encoder settings with the CelebA-HQ dataset Karras, Aila, Laine, and Lehtinen (2017). In the Appendix, we also show how to use our results to study optimal, supervised, denoising autoencoders with the MNIST LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel (1989) and FMNIST Xiao, Rasul, and Vollgraf (2017) datasets. In all experiments, we find evidence of the existence of an optimal latent dimension (much lower than the dimension of the data). In particular, we show that using an encoder with a latent dimension larger than the compressibility dimension either deteriorates the generative model's performance or is pointless: it wastes computational resources without yielding any performance increase.
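As a stylized illustration of how such an optimal latent dimension reveals itself empirically, consider the following toy experiment (ours, not from the paper): a linear autoencoder, i.e., PCA, applied to synthetic data whose intrinsic dimension is known. This is only a caricature of the actual BAE and GAN experiments described above.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: intrinsic dimension 8, embedded in R^64, plus small noise.
n, intrinsic, ambient = 2000, 8, 64
data = rng.normal(size=(n, intrinsic)) @ rng.normal(size=(intrinsic, ambient))
data += 0.1 * rng.normal(size=(n, ambient))

def linear_ae_error(x: np.ndarray, latent_dim: int) -> float:
    """Reconstruction error of a linear autoencoder (PCA) with a given bottleneck."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    basis = vt[:latent_dim]              # encoder: project onto top directions
    recon = xc @ basis.T @ basis         # decoder: map the code back to R^64
    return float(np.mean((xc - recon) ** 2))

for d in [2, 4, 8, 16, 32, 64]:
    print(f"latent dim {d:3d}: error {linear_ae_error(data, d):.4f}")
# The error drops sharply up to the intrinsic dimension (8) and then plateaus:
# enlarging the bottleneck further only spends compute without real gains.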
In an effort to understand the benefits of encoder-decoder architectures, previous papers have used heuristics and intuition to suggest that penalizing the reconstruction error in generative models leads to smoother and more stable gradients (see, e.g., Che et al. (2016)). This paper vindicates this intuition and provides a theoretical formalization of it. Our main theorem implies that BAE convexifies the objective function's dependence on the data: the objective becomes convex when restricted to the optimal feature manifold (the low-dimensional manifold on which the auto-encoded data lives). The gradient of a convex function is always regular because it is a monotone map; this monotonicity removes “spikes” and makes the gradient more stable.
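For completeness, the standard fact invoked here (a textbook property, not a result of this paper) is that the gradient of a differentiable convex function $f$ is a monotone map,
$$\bigl(\nabla f(x) - \nabla f(y)\bigr)^{\top}(x - y) \;\ge\; 0 \qquad \text{for all } x, y,$$
so the gradient cannot abruptly reverse direction along the segment between $x$ and $y$; this is what rules out the “spikes” mentioned above.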
2 Background
Since the advent of GANs, lower-dimensional representations have played a key role in generative AI.
4 It is known that convex problems are well-behaved because they have unique global minima and gradient descent algorithms are guaranteed to converge to these minima. However, BAE exploits a different form of convexity: it makes the average model accuracy depend on the input (training) data in a convex fashion. The dependence on training data is an important ingredient of the theory of adversarial attacks. See, e.g., Goodfellow, Shlens, and Szegedy (2014a) and Ilyas, Santurkar, Tsipras, Engstrom, Tran, and Madry (2019). BAE regularizes the dependence on training data by removing “spikes in the gradient” and making the gradient map monotone.
For example, in image generation, modern GAN architectures Karras et al. (2017); Karras, Laine, and Aila (2019); Karras, Laine, Aittala, Hellsten, Lehtinen, and Aila (2020); Karras, Aittala, Laine, Härkönen, Hellsten, Lehtinen, and Aila (2021) use a lower-dimensional latent space of 4 × 4 × 512 to generate high-resolution images and videos (1024 × 1024 × 3). Similar behavior is observed in the latest Diffusion Probabilistic Models Sohl-Dickstein et al. (2015); Ho et al. (2020); Li, Prabhudesai, Duggal, Brown, and Pathak (2023).
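For a sense of scale, the following is our own arithmetic based on the dimensions quoted above (no other numbers are assumed):

latent = 4 * 4 * 512          # size of the latent space:        8,192 numbers
output = 1024 * 1024 * 3      # high-resolution RGB output:  3,145,728 numbers
print(output / latent)        # 384.0, i.e., roughly a 384x dimensionality reduction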
Although GANs achieve state-of-the-art results on various tasks, they are often highly unstable. As Che et al. (2016) show, this behavior is driven by a special form of the curse of dimensionality that can be remedied by training an autoencoder with a small latent dimension. In a similar vein, Peng et al. (2018) show that introducing an auto-encoder trained with a VAE-type reconstruction loss and a low-dimensional bottleneck significantly improves the performance of GANs, as well as models of imitation learning and inverse reinforcement learning.
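Schematically (in our own notation, which does not reproduce the exact objectives of Che et al. (2016) or Peng et al. (2018)), such approaches augment the adversarial loss with a reconstruction penalty that forces the data through a low-dimensional bottleneck $E_\phi$:
$$\min_{G,\,\phi,\,\theta}\ \max_{C}\ \mathcal L_{\mathrm{GAN}}(G, C) \;+\; \lambda\, \mathbb E\bigl[\lVert x - D_\theta(E_\phi(x))\rVert^2\bigr],$$
where $C$ is the discriminator, $G$ the generator, and $\lambda > 0$ trades off adversarial realism against faithfulness of the low-dimensional representation.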
Our paper also relates to the tight connection between generative and discriminative
problems, which has been discussed in many papers, starting with the influential work of
Hinton (2007): “To Recognize Shapes, First Learn to Generate Images.” See also Ng and
Jordan (2001). Recent evidence suggests that conditional generative models are also good
classifiers. See, e.g., Li et al. (2023); Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal,
Neelakantan, Shyam, Sastry, Askell, et al. (2020). Our results provide additional intuition
for this phenomenon and its link to efficient latent representations.
For discriminative (classification and regression) problems, Tishby and Zaslavsky (2015) argue that the success of deep neural networks might be related to their ability to extract efficient representations of the relevant features of the input layer for predicting the output label. Tishby and Zaslavsky (2015) refer to this phenomenon as the optimal information bottleneck.5 Several subsequent papers have introduced methodologies targeted at creating optimal bottlenecks with a minimal loss of mutual information. See, e.g., Alemi, Fischer, Dillon, and Murphy (2016), Oord, Li, and Vinyals (2018), Hjelm, Fedorov, Lavoie-Marchildon, Grewal, Bachman, Trischler, and Bengio (2018), Achille and Soatto (2018a), Alemi (2020). In particular, Alemi et al. (2016) provide evidence that efficiently trained bottlenecks improve classification accuracy and adversarial robustness; Achille and Soatto (2018a) link information bottlenecks to invariance to nuisances, i.e., irrelevant features that provide no useful information. The mechanism behind the BAE algorithm proposed in this paper is different: the bottleneck created by BAE does not remove noise or useless features.
5 Recent research shows that the ability of NNs to learn efficient low-dimensional representations is key to their performance. See Ghorbani, Mei, Misiakiewicz, and Montanari (2020).
Instead, we prove that some dimensions of the data are useful (they contain important information) but are malignant for the specific learning algorithm. BAE identifies those dimensions and erases them.
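For reference, the information bottleneck of Tishby and Zaslavsky (2015) is usually written (in a standard formulation that we add here for context) as a trade-off between compressing the input $X$ and retaining information about the label $Y$,
$$\min_{p(z \mid x)}\; I(X;Z) \;-\; \beta\, I(Z;Y), \qquad \beta > 0,$$
where $I(\cdot\,;\cdot)$ denotes mutual information. BAE, by contrast, erases dimensions that do carry useful information but are malignant for the learning algorithm at hand.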
3 Preliminaries on Autoencoders
We start our analysis by introducing a mathematical formalism behind encoder-decoder
architectures.
Let $\tilde{\mathcal X}$ be a set of messages, and $\mathcal Z$ the space of encoded messages (henceforth, code space). Data pre-processing is a map $F:\tilde{\mathcal X}\to\mathcal X$, where $\mathcal X\subset\mathbb R^L$ is the space of pre-processed messages. E.g., $F$ could be a form of data normalization, image resizing, data whitening, or masking (for context encoders). An encoder is a map $E:\mathcal X\to\mathcal Z$, and a decoder is a map $D:\mathcal Z\to\hat{\mathcal X}$. An autoencoder (AE) is the composition of the two: $A:\mathcal X\to\hat{\mathcal X}$, $A(x)=D(E(x))$. Given a parametric family $\{E_\phi\}_{\phi\in\Phi}$ of encoders and a parametric family $\{D_\theta\}_{\theta\in\Theta}$ of decoders, the classic optimal encoding problem is to solve $\min_{\theta,\phi}\mathbb E[\ell(D_\theta(E_\phi(F(\tilde x))),\, g(\tilde x))]$ for some loss function $\ell$, where $g:\tilde{\mathcal X}\to\hat{\mathcal X}$ is a target data transformation. For example: (i) For a standard autoencoder,6 $\tilde{\mathcal X}=\mathcal X=\hat{\mathcal X}$, and both $F$ and $g$ are identity maps, so that the objective becomes to reconstruct the original data $x=\tilde x$ based on its latent representation: $\min_{\theta,\phi}\mathbb E[\ell(D_\theta(E_\phi(x)),\, x)]$. (ii) In the context encoding problem of images (Pathak et al. (2016)), a part of the data is masked using a mask indicator $\hat M$, so that $F(\tilde x)=(1-\hat M)\odot\tilde x$ is the partially masked image. At the same time, the optimal encoding-decoding problem is to reconstruct the masked part of the image, $g(\tilde x)=\hat M\odot\tilde x$, based entirely on the partially masked image: the goal is to solve $\min_{\theta,\phi}\mathbb E[\ell(D_\theta(E_\phi((1-\hat M)\odot\tilde x)),\, \hat M\odot\tilde x)]$. (iii) For image-to-image translation (Isola, Zhu, Zhou, and Efros (2017)),7 $\tilde x=(y,x)$ is a pair, and the objective is to morph $x$ into $y$, so that $F(\tilde x)=x$ and $g(\tilde x)=y$, and we minimize $\mathbb E[\ell(D_\theta(E_\phi(x)),\, y)]$ over $\theta,\phi$.

Given a prior probability distribution $p(dx)$ of $x=F(\tilde x)$ on $\mathcal X$, a probabilistic encoder is a joint probability distribution $p(dx,dz)$ on $\mathcal X\times\mathcal Z$ satisfying
$$p_x(dx) \;=\; \int_{\mathcal Z} p(dx, dz) \;=\; p(dx). \qquad (1)$$
6 One of the most popular algorithms for unsupervised data representation is based on training an autoencoder (Rumelhart and McClelland, 1986): an artificial neural network that learns how to efficiently encode data in a lower-dimensional space with a minimal reconstruction loss. These models play a key role in unsupervised data representation and feature engineering as powerful non-linear dimensionality reduction techniques; see Hinton, Osindero, and Teh (2006), Hinton and Salakhutdinov (2006), Bengio, LeCun, et al. (2007), Erhan, Courville, Bengio, and Vincent (2010), Baldi (2012), Zemel, Wu, Swersky, Pitassi, and Dwork (2013), Makhzani and Frey (2013), Makhzani and Frey (2015), Achille and Soatto (2018b), Makhzani (2018), Kenfack, Khan, Hussain, and Kazmi (2021), and Gu, Kelly, and Xiu (2021).
7 See also Choi, Uh, Yoo, and Ha (2020) for a related problem of image synthesis.
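To connect the formalism to practice, the sketch below (our own illustration, with a generic squared loss and small placeholder networks) spells out the objective $\min_{\theta,\phi}\mathbb E[\ell(D_\theta(E_\phi(F(\tilde x))),\, g(\tilde x))]$ for the three examples above.

import torch
import torch.nn as nn

input_dim, latent_dim = 784, 32
encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim))
loss_fn = nn.MSELoss()                         # the loss function l (squared error here)

def objective(x_tilde, F, g):
    """Monte-Carlo estimate of E[l(D_theta(E_phi(F(x))), g(x))] on a batch."""
    return loss_fn(decoder(encoder(F(x_tilde))), g(x_tilde))

x_tilde = torch.rand(16, input_dim)            # a batch of raw messages
mask = (torch.rand(input_dim) < 0.25).float()  # mask indicator M-hat (25% masked)

# (i) standard autoencoder: F and g are identities, reconstruct x from its code
loss_ae = objective(x_tilde, F=lambda x: x, g=lambda x: x)

# (ii) context encoder: observe the unmasked part, predict the masked part
loss_ce = objective(x_tilde, F=lambda x: (1 - mask) * x, g=lambda x: mask * x)

# (iii) image-to-image translation: x_tilde = (y, x), morph x into y
y, x = torch.rand(16, input_dim), x_tilde
loss_i2i = objective((y, x), F=lambda p: p[1], g=lambda p: p[0])

In an actual application, the placeholder networks and the squared loss would of course be replaced by the architectures and losses used in the respective papers.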