or distribution over latent variables which is then decoded back to the input image by the decoder
network. However, the unstructured nature of the latent variables means that they do not separate into
any specific and interpretable sources of variation. To achieve disentanglement, methods have been
proposed that encourage independence between latent variables [4, 5]. However, these methods make
no prior or structural assumptions about what these latent variables should encode, even when some
sources of variation, e.g., an object's location or pose, are prevalent and well-known. Recently, several
methods have proposed structured models that explicitly model rotation or translation within their
generative networks [6, 7, 8] by formulating the image generative model as a function mapping
coordinates in space to pixel values. Although promising, only the generative portion of these methods is
equivariant to rotation and translation, and the inference networks have lackluster performance due to
poor inductive bias for these structured latents.
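To make the coordinate-based formulation concrete, below is a minimal sketch (in PyTorch; the class name, layer sizes, and transformation handling are illustrative placeholders, not the architecture of any cited method) of a generator that maps a semantic latent vector together with a grid of 2D coordinates to pixel values. Because rotation and translation are applied to the coordinate grid rather than to rendered pixels, the generated image transforms accordingly by construction.

```python
import torch
import torch.nn as nn

class SpatialGenerator(nn.Module):
    """Sketch of a coordinate-based generator: pixel = g(R x + t, z)."""

    def __init__(self, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # one grayscale value per coordinate
        )

    def forward(self, coords, z, theta=None, dx=None):
        # coords: (B, N, 2) pixel coordinates; z: (B, z_dim) semantic latent;
        # theta: (B,) rotation angles; dx: (B, 2) translations.
        if theta is not None:  # rotate the coordinate grid, not the image
            c, s = torch.cos(theta), torch.sin(theta)
            R = torch.stack([torch.stack([c, -s], -1),
                             torch.stack([s, c], -1)], -2)  # (B, 2, 2)
            coords = coords @ R.transpose(1, 2)
        if dx is not None:  # translate the coordinate grid
            coords = coords + dx.unsqueeze(1)
        z = z.unsqueeze(1).expand(-1, coords.size(1), -1)
        return self.net(torch.cat([coords, z], dim=-1))  # (B, N, 1) pixels
```

Note that this construction makes only the generator equivariant; the encoder that must infer z, theta, and dx from an image receives no such structure for free.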
To address this, we propose TARGET-VAE, a Translation and Rotation Group Equivariant Variational
Auto-Encoder. TARGET-VAE is able to learn semantic object representations that are invariant to
pose and location from images corrupted by these transformations, by structurally decomposing
the image generative factors into semantic, rotation, and translation components. We perform
approximate inference with a group equivariant convolutional neural network [9] and a specially
formulated approximate posterior distribution that allows us to disentangle the latent variables into a
rotationally equivariant latent rotation, a translationally equivariant latent translation, and rotation- and
translation-invariant semantic latent variables. By combining this with a spatial generator network, our
framework is completely invariant to rotation and translation, unsupervised, and fully differentiable;
the model is trained end-to-end using only observed images. In comprehensive experiments, we
show that TARGET-VAE accurately infers the rotation and translation of objects without supervision
and learns high quality object representations even when training images are heavily confounded by
rotation and translation. We then show that this framework can be used to map continuous variation
in 2D images of proteins collected with cryo-EM.
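As a schematic of this decomposition (the notation here is introduced only for illustration and is not necessarily the exact parameterization used by the model), the approximate posterior over a rotation $\theta$, a translation $\Delta x$, and semantic content $z$ given an image $y$ factorizes as
\[
q(\theta, \Delta x, z \mid y) = q(\theta, \Delta x \mid y)\, q(z \mid y, \theta, \Delta x),
\]
where $q(\theta, \Delta x \mid y)$ is produced by a translation and rotation equivariant encoder and $q(z \mid y, \theta, \Delta x)$ is constrained to be invariant to both transformations.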
2 Related Work
In recent years, there has been significant progress in machine learning methods for unsupervised
semantic analysis of images using VAEs [6, 10, 11, 12], flow-based methods [13], Generative
Adversarial Networks (GANs) [14, 15, 16], and capsule networks [17]. These methods generally seek
to learn a low-dimensional representation of each image in a dataset, or a distribution over this latent,
by learning to reconstruct the image from the latent variable. The latent variable, then, must capture
variation between images in the dataset in order for them to be accurately reconstructed. These
latents can then be used as features for downstream analysis, as they capture image content. However,
these representations must capture all information needed to reconstruct an image, including common
transformations that are not semantically meaningful. This often results in latent representations that
group objects primarily by location, pose, or other nuisance variables, rather than type. One simple
approach to address this is data augmentation. In this approach, additional images are created by
applying a known set of nuisance transformations to the original training dataset. A network is then
trained to be invariant to these transformations by penalizing the difference between low-dimensional
representations of the same image with different nuisance transforms applied [18, 19]. However, this
approach is data inefficient and, more importantly, does not guarantee invariance to the underlying
nuisance transformations. This general approach has improved with recent contrastive learning
methods [20, 21, 22], but these ultimately suffer from the same underlying problems.
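As an illustration of this augmentation-based strategy (a minimal sketch, not the training procedure of any specific cited work; the encoder, transform set, and loss weight are placeholders), two randomly transformed views of the same images are encoded and the distance between their representations is penalized alongside the usual training objective:

```python
import torch.nn.functional as F
from torchvision import transforms

# Nuisance transforms used for augmentation (placeholder choices).
augment = transforms.Compose([
    transforms.RandomRotation(degrees=180),
    transforms.RandomAffine(degrees=0, translate=(0.2, 0.2)),
])

def invariance_penalty(encoder, images, weight=1.0):
    # images: (B, C, H, W) tensor; each call applies one random transform
    # to the whole batch, so the two views below generally differ.
    z1 = encoder(augment(images))
    z2 = encoder(augment(images))
    return weight * F.mse_loss(z1, z2)
```

Such a penalty only encourages similar representations for transformations actually sampled during training, which is why it cannot guarantee invariance to the underlying nuisance transformations.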
Other methods approach this problem by applying constraints on the unstructured latent space
[4, 5, 15]. In this general class of methods, the latent space is not explicitly structured to disentangle
semantic latent variables from latents representing other transformations. Instead, the models need to
be inspected post-hoc to determine whether meaningfully separated concepts have been encoded in each
latent variable and what those concepts are. A few works have addressed explicitly disentangling
the latent space into structured and unstructured variables. Sun et al. [17] propose a 3D point cloud
representation of objects decomposed into capsules. Using a self-supervised learning scheme, their
model learns transformation-invariant descriptors along with transformation-equivariant poses for
each capsule, improving reconstruction and clustering. Bepler et al. [6] propose spatial-VAE, a VAE
framework that divides the latent variables into unstructured, rotation, and translation components.
However, only the generative part of spatial-VAE is equivariant to rotation and translation, and the
inference network struggles to perform meaningful inference on these transformations.