
1 Introduction
The success of modern generative models relies on neural network architectures for building powerful representations of the data, typically featuring an encoder (responsible for feature learning) and a decoder (responsible for data generation).1 Most such encoder-decoder architectures feature a bottleneck, with the latent dimension of the encoder often being much smaller than the dimension of the original data. Extensive experimental evidence suggests that a lower-dimensional latent space improves the quality of generative models by allowing them to generate data based on several key features of the latent representation. For example, this is the case for the variational autoencoder (VAE; Kingma and Welling (2013); Makhzani, Shlens, Jaitly, Goodfellow, and Frey (2015)), generative adversarial networks (GANs; Radford, Metz, and Chintala (2015); Che, Li, Jacob, Bengio, and Li (2016); Peng, Kanazawa, Toyer, Abbeel, and Levine (2018); Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio (2020); Donahue, Krähenbühl, and Darrell (2016); Dumoulin, Belghazi, Poole, Mastropietro, Lamb, Arjovsky, and Courville (2016); Pathak, Krahenbuhl, Donahue, Darrell, and Efros (2016)), and stable diffusion (Sohl-Dickstein, Weiss, Maheswaranathan, and Ganguli (2015); Ho, Jain, and Abbeel (2020)). The same idea of encoding data into a low-dimensional manifold and then decoding it for discriminative purposes underlies recent successful attempts to build powerful, general perception models, such as those of Jaegle, Gimeno, Brock, Vinyals, Zisserman, and Carreira (2021) and Girdhar, El-Nouby, Liu, Singh, Alwala, Joulin, and Misra (2023).2
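To make the bottleneck structure concrete, the sketch below shows a minimal encoder-decoder pair whose latent dimension is much smaller than the data dimension. It is an illustration only, not the benign autoencoder studied in this paper; it assumes a PyTorch environment, and the input dimension (784, e.g. flattened 28x28 images), latent dimension (16), and reconstruction objective are hypothetical choices.

```python
# Minimal bottleneck autoencoder sketch (illustrative only; not the BAE of this paper).
# Assumes PyTorch; input_dim and latent_dim are hypothetical choices.
import torch
import torch.nn as nn


class BottleneckAutoencoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 16):
        super().__init__()
        # Encoder: maps the data to a much lower-dimensional latent representation.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstructs (generates) data from the latent code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)      # feature learning through the bottleneck
        return self.decoder(z)   # data generation from the latent features


# Usage: reconstruct a batch of 32 inputs through the 16-dimensional bottleneck.
model = BottleneckAutoencoder()
x = torch.randn(32, 784)
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # a standard reconstruction objective
```

The 16-dimensional latent code here plays the role of the low-dimensional latent space discussed above: the decoder must generate the data from a handful of learned features rather than from the raw input.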
The impressive empirical achievements of the models cited above have further widened the gap between their performance and our theoretical understanding thereof. In particular, little is known about the role of bottlenecks and the geometry of the respective latent spaces.3 In this paper, we try to bridge this gap. To this end, we formally define the generative problem of finding the best encoder-decoder architecture. Using novel mathematical techniques combining ideas from optimal transport theory (Villani, 2009) and metric geometry (Burago, Burago, and Ivanov, 2022), we characterize the solution to the optimal encoder-decoder problem, which we name the benign autoencoder (BAE). We show that BAE optimally reg-
1 While the original text generation and translation models used encoder-decoder architectures, the recent progress in large language models (LLMs) relies on decoder-only architectures. Understanding the role of encoders for LLMs is an important direction for future research.
2 The Perceiver of Jaegle et al. (2021) is designed to handle arbitrary configurations of different modalities (images, audio, and video data) using a single Transformer-based architecture. It introduces a small set of latent units that forms a bottleneck, eliminating the quadratic scaling problem of classical Transformers and decoupling the network depth from the input size. The authors use a bottleneck of dimension 512 for the image encoding, a huge dimensionality reduction compared to the input dimension of 224 × 224 = 50,176 pixels.
3 For some recent progress in the theoretical understanding of GANs, see Arjovsky and Bottou (2017).