Unsupervised Object Representation Learning using
Translation and Rotation Group Equivariant VAE
Alireza Nasiri
Simons Machine Learning Center
New York Structural Biology Center
anasiri@nysbc.org
Tristan Bepler
Simons Machine Learning Center
New York Structural Biology Center
tbepler@nysbc.org
Abstract
In many imaging modalities, objects of interest can occur in a variety of locations
and poses (i.e. are subject to translations and rotations in 2d or 3d), but the location
and pose of an object does not change its semantics (i.e. the object’s essence). That
is, the specific location and rotation of an airplane in satellite imagery, or the 3d
rotation of a chair in a natural image, or the rotation of a particle in a cryo-electron
micrograph, do not change the intrinsic nature of those objects. Here, we consider
the problem of learning semantic representations of objects that are invariant to
pose and location in a fully unsupervised manner. We address shortcomings in
previous approaches to this problem by introducing TARGET-VAE, a translation
and rotation group-equivariant variational autoencoder framework. TARGET-VAE
combines three core innovations: 1) a rotation and translation group-equivariant
encoder architecture, 2) a structurally disentangled distribution over latent rota-
tion, translation, and a rotation-translation-invariant semantic object representation,
which are jointly inferred by the approximate inference network, and 3) a spa-
tially equivariant generator network. In comprehensive experiments, we show that
TARGET-VAE learns disentangled representations without supervision that signifi-
cantly improve upon, and avoid the pathologies of, previous methods. When trained
on images highly corrupted by rotation and translation, the semantic representations
learned by TARGET-VAE are similar to those learned on consistently posed ob-
jects, dramatically improving clustering in the semantic latent space. Furthermore,
TARGET-VAE is able to perform remarkably accurate unsupervised pose and
location inference. We expect methods like TARGET-VAE will underpin future ap-
proaches for unsupervised object generation, pose prediction, and object detection.
Our code is available at https://github.com/SMLC-NYSBC/TARGET-VAE.
1 Introduction
In many imaging modalities, objects of interest are arbitrarily located and oriented within the image
frame. Examples include airplanes in satellite images, galaxies in astronomy images [1], and particles in single-particle cryo-electron microscopy (cryo-EM) micrographs [2]. However, neither the location
nor the rotation of an object within an image frame changes the nature (i.e. the semantics) of the
object itself. An airplane is an airplane regardless of where it is in the image, and different rotations
of a particle in a cryo-EM micrograph are still projections of the same protein. Hence, there is great
interest in learning semantic representations of objects that are invariant to their locations and poses.
In general, unsupervised representation learning methods do not recover representations that disen-
tangle the semantics of an object from its location or its pose. Popular unsupervised deep learning
methods for images, such as variational autoencoders (VAEs) [3], usually use encoder-decoder frameworks in which an encoder network is trained to map an input image to an unstructured latent variable
or distribution over latent variables which is then decoded back to the input image by the decoder
network. However, the unstructured nature of the latent variables means that they do not separate into
any specific and interpretable sources of variation. To achieve disentanglement, methods have been
proposed that encourage independence between latent variables [4, 5]. However, these methods make
no prior or structural assumptions about what these latent variables should encode, even when some
sources of variation, e.g. an object's location or pose, are prevalent and well-known. Recently, several
methods have proposed structured models that explicitly model rotation or translation within their
generative networks [6, 7, 8] by formulating the image generative model as a function mapping coordinates in space to pixel values. Although promising, only the generative portion of these methods is
equivariant to rotation and translation, and the inference networks have lackluster performance due to
poor inductive bias for these structured latents.
To address this, we propose TARGET-VAE, a Translation and Rotation Group Equivariant Variational
Auto-Encoder. TARGET-VAE is able to learn semantic object representations that are invariant to
pose and location from images corrupted by these transformations, by structurally decomposing
the image generative factors into semantic, rotation, and translation components. We perform
approximate inference with a group-equivariant convolutional neural network [9] and a specially formulated approximate posterior distribution that allows us to disentangle the latent variables into a rotationally equivariant latent rotation, a translationally equivariant latent translation, and rotation- and translation-invariant semantic latent variables. By combining this with a spatial generator network, our
framework is completely invariant to rotation and translation, unsupervised, and fully differentiable;
the model is trained end-to-end using only observed images. In comprehensive experiments, we
show that TARGET-VAE accurately infers the rotation and translation of objects without supervision
and learns high quality object representations even when training images are heavily confounded by
rotation and translation. We then show that this framework can be used to map continuous variation
in 2D images of proteins collected with cryo-EM.
2 Related Work
In recent years, there has been significant progress in machine learning methods for unsupervised
semantic analysis of images using VAEs [6, 10, 11, 12], flow-based methods [13], Generative Adversarial Networks (GANs) [14, 15, 16], and capsule networks [17]. These methods generally seek
to learn a low-dimensional representation of each image in a dataset, or a distribution over this latent,
by learning to reconstruct the image from the latent variable. The latent variable, then, must capture
variation between images in the dataset in order for them to be accurately reconstructed. These
latents can then be used as features for downstream analysis, as they capture image content. However,
these representations must capture all information needed to reconstruct an image including common
transformations that are not semantically meaningful. This often results in latent representations that
group objects primarily by location, pose, or other nuisance variables, rather than type. One simple
approach to address this is data augmentation. In this approach, additional images are created by
applying a known set of nuisance transformations to the original training dataset. A network is then
trained to be invariant to these transformations by penalizing the difference between low-dimensional
representations of the same image with different nuisance transforms applied [18, 19]. However, this
approach is data inefficient and, more importantly, does not guarantee invariance to the underlying
nuisance transformations. This general approach has improved with recent contrastive learning
methods [20, 21, 22], but these ultimately suffer from the same underlying problems.
Other methods approach this problem by applying constraints on the unstructured latent space
[4, 5, 15]. In this general class of methods, the latent space is not explicitly structured to disentangle
semantic latent variables from latents representing other transformations. Instead, the models need to
be inspected post-hoc to determine whether meaningfully separated concepts have been encoded in each
latent variable and what those concepts are. A few works have addressed explicitly disentangling
the latent space into structured and unstructured variables. Sun et al. [17] propose a 3D point cloud
representation of objects decomposed into capsules. Using a self-supervised learning scheme, their
model learns transformation-invariant descriptors along with transformation-equivariant poses for
each capsule, improving reconstruction and clustering. Bepler et al. [6] propose spatial-VAE, a VAE
framework that divides the latent variables into unstructured, rotation, and translation components.
However, only the generative part of spatial-VAE is equivariant to rotation and translation, and the
inference network struggles to perform meaningful inference on these transformations.
[Figure 1 diagram: group-convolution feature maps feed inference modules for the attention q(t, r | y), the rotation q(θ | t, r, y), and the content q(z | t, r, y); the decoder applies a random Fourier feature expansion and fully connected layers to the rotated and translated pixel coordinates, combined with the sampled z, to produce pixel values.]
Figure 1: The TARGET-VAE framework. The encoder uses group-equivariant convolutional layers to output mixture distributions over semantic representations, rotation, and translation. The transformation-equivariant generator reconstructs the image based on the representation value $z$ and the transformed coordinates of the pixels.
To disentangle object semantic representations from spatial transformations, the inference network
should also be equivariant to the spatial transformations. Translation-equivariance can be achieved by
using convolutional layers in the inference model. However, convolutional layers are not rotation-
equivariant. There have been many studies on designing transformation-equivariant and, specifically, rotation-equivariant neural networks [9, 23, 24, 25]. In Group-equivariant Convolutional Neural Networks (G-CNNs) [9], the 2d rotation space is discretized and filters are applied over this rotation
dimension in addition to the spatial dimensions of the image. This creates a network that is structurally
equivariant to the discrete rotation group at the cost of additional compute. G-CNNs allow us to
create a rotation-equivariant inference network.
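To make this concrete, the following is a minimal sketch of a lifting convolution for the four-fold (C4) rotation group, which applies rotated copies of a shared set of filters and stacks the responses along an explicit rotation axis. It is written in PyTorch for illustration only; the class name, layer sizes, and the restriction to 90° rotations are assumptions for the example, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class C4LiftingConv(torch.nn.Module):
    """Lifting convolution for the C4 rotation group: the same filters are
    applied at 0/90/180/270 degrees, adding an explicit rotation axis to the
    feature maps. A sketch only; full G-CNNs also define group convolutions
    for later layers that mix the rotation axis."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size // 2
        self.weight = torch.nn.Parameter(
            0.1 * torch.randn(out_channels, in_channels, kernel_size, kernel_size))

    def forward(self, x):                       # x: (B, C_in, H, W)
        responses = []
        for k in range(4):                      # rotate the filters, not the image
            w_rot = torch.rot90(self.weight, k, dims=(2, 3))
            responses.append(F.conv2d(x, w_rot, padding=self.pad))
        return torch.stack(responses, dim=2)    # (B, C_out, 4, H, W)

# Rotating the input by 90 degrees rotates each feature map and cyclically
# permutes the rotation axis -- the equivariance property the encoder relies on.
layer = C4LiftingConv(1, 8)
out = layer(torch.randn(2, 1, 28, 28))
print(out.shape)                                # torch.Size([2, 8, 4, 28, 28])
```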
A number of studies have proposed spatial transformation equivariant generative models, by modeling
an image as a function that maps 2D or 3D spatial coordinates to the values of the pixels at those
coordinates [7, 8, 26, 27, 6]. Due to this mapping, any transformation of the spatial coordinates produces exactly the same transformation of the generated image. However, other than Bepler et al. [6], none of
these studies perform inference on spatial transformations. To the best of our knowledge, TARGET-
VAE is the first method to have rotation and translation equivariance in both the inference and the
generative networks to achieve disentanglement.
3 Method
TARGET-VAE disentangles image latent variables into an unstructured semantic vector and rotation and translation latents (Figure 1). The approximate posterior distribution is factorized into translation, rotation, and content components, where the joint distribution over translations and discrete rotation groups is defined by an attention map output by the equivariant encoder. The fine-grained rotation and content vectors are drawn from a mixture model in which each joint translation and discrete rotation combination is a component of the mixture and selects the mean and standard deviation of the continuous latents output by the encoder for that rotation and translation.
Sampled translation, rotation, and content vectors are fed through the spatial decoder to generate
images conditioned on the latent variables.
3.1 Image generation process
Images can be considered as a collection of discretely identifiable pixels, and the spatial transformation of an image is equivalent to transforming the spatial coordinates of its pixels. To have an image generation process consistent with the transformations identified in the latent space, we define our generator as a function that maps the spatial coordinates of the pixels to their values. The generator
outputs a probability distribution as $p(\hat{y}_i \mid z, x_i)$, where $z$, $x_i$, and $\hat{y}_i$ are the latent content vector, the spatial coordinate of the $i$-th pixel, and the value of that pixel, respectively. Similar to [3], for an image with $n$ pixels, we can define the log-probability of the image generated in this manner as the sum over the log-probabilities of its pixel values:
$$\log p(\hat{y} \mid z) = \sum_{i=1}^{n} \log p(\hat{y}_i \mid z, x_i) \quad (1)$$
To define spatial coordinates over the pixels of the image, we use Cartesian coordinates with the origin at the center of the image. Translation and rotation of the image are achieved by shifting the origin and rotating coordinates around it, respectively. Depending on the value of the content vector, $z$, the generator acts as a function over the pixel coordinates to produce an image. In this setup, the generation process is equivariant to transformations of the coordinate system. Assuming $R(\theta)$ is the rotation matrix for angle $\theta$ and $t$ is the translation, the log-probability of the generated image from Equation 1 becomes
$$\log p(\hat{y} \mid z, \theta, t) = \sum_{i=1}^{n} \log p(\hat{y}_i \mid z, R(\theta)\,x_i + t) \quad (2)$$
In this study, we focus specifically on rotation and translation of the spatial coordinates, but the image
generative process extends to any transformation of the coordinate space.
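As an illustration of this coordinate-based generation process, here is a minimal sketch of a spatial generator in the spirit of Figure 1: pixel coordinates are expanded with random Fourier features, combined with the content vector $z$ through fully connected layers, and decoded to pixel values; rotating and translating the coordinate grid with $R(\theta)$ and $t$ before decoding transforms the rendered image identically. The module and parameter names, layer sizes, and fixed Fourier frequencies are illustrative assumptions, not the published architecture.

```python
import math
import torch
import torch.nn as nn

class SpatialGenerator(nn.Module):
    """Coordinate-based generator: maps the content vector z and a pixel
    coordinate to that pixel's value, as in Figure 1. Layer sizes and the
    fixed random Fourier frequencies are illustrative choices."""
    def __init__(self, z_dim=2, n_freqs=64, hidden=128):
        super().__init__()
        self.register_buffer('freqs', torch.randn(2, n_freqs))   # random Fourier features
        self.coord_fc = nn.Linear(2 * n_freqs, hidden)           # layer on coordinates
        self.z_fc = nn.Linear(z_dim, hidden)                     # layer on representation
        self.head = nn.Sequential(nn.LeakyReLU(), nn.Linear(hidden, hidden),
                                  nn.LeakyReLU(), nn.Linear(hidden, 1))

    def forward(self, coords, z):               # coords: (B, N, 2), z: (B, z_dim)
        proj = 2 * math.pi * coords @ self.freqs
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        h = self.coord_fc(feats) + self.z_fc(z).unsqueeze(1)     # broadcast z over pixels
        return self.head(h).squeeze(-1)                          # (B, N) pixel values

# Equivariance by construction: rotate and translate the coordinate grid with
# R(theta) and t before decoding, and the rendered image transforms identically.
B, H, W = 4, 28, 28
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing='ij')
grid = torch.stack([xs, ys], dim=-1).reshape(1, -1, 2).expand(B, -1, -1)
theta = 2 * math.pi * torch.rand(B)
R = torch.stack([torch.stack([torch.cos(theta), -torch.sin(theta)], dim=-1),
                 torch.stack([torch.sin(theta),  torch.cos(theta)], dim=-1)], dim=-2)
t = 0.1 * torch.randn(B, 1, 2)
transformed = grid @ R.transpose(1, 2) + t                        # R(theta) x_i + t
image = SpatialGenerator()(transformed, torch.randn(B, 2)).reshape(B, H, W)
```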
3.2 Approximate inference on latent variables
We implement a VAE framework to perform approximate inference on the content, rotation, and translation latent variables. Since rotation optimization is non-convex, we implement a mixture distribution over the 2D rotation space that allows the model to choose from $r$ discrete components. These components are used to approximate the posterior distribution over the rotation angle $\theta$. We represent the overall approximate posterior as $q(z, \theta, t, r \mid y)$, where $z$ is the latent content vector, $\theta$ is the rotation angle, $t$ and $r$ refer to the translation and discretized rotation components, and $y$ is the input image. By making the simplifying assumption that $q(z \mid t, r, y)$ and $q(\theta \mid t, r, y)$ are independent, the approximate posterior distribution factorizes as
$$q(z, \theta, t, r \mid y) = q(z \mid t, r, y)\, q(\theta \mid t, r, y)\, q(t, r \mid y), \quad (3)$$
where $q(t, r \mid y)$ is the joint distribution over discrete translations and rotations, and $q(\theta \mid t, r, y)$ and $q(z \mid t, r, y)$ are the distributions over the real-valued rotation and the latent content vector conditioned on $t$, $r$, and the input $y$.
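To illustrate how one might draw a sample from this factorized posterior, the sketch below assumes the encoder outputs an attention map of logits over the discrete $(t, r)$ grid together with Gaussian parameters for $q(z \mid t, r, y)$ and $q(\theta \mid t, r, y)$ at every position; it picks one discrete component and then samples the continuous latents from that component. All tensor names and the choice to sample a single component (rather than marginalize over the mixture) are assumptions made for the example.

```python
import torch
import torch.distributions as D

def sample_posterior(attn_logits, z_mu, z_logstd, theta_mu, theta_logstd):
    """Draw (z, theta, t, r) from the factorized posterior of Equation 3.
    attn_logits:            (B, R, H, W)        unnormalized log q(t, r | y)
    z_mu, z_logstd:         (B, R, H, W, z_dim) parameters of q(z | t, r, y)
    theta_mu, theta_logstd: (B, R, H, W)        parameters of q(theta | t, r, y)
    A sketch; a real implementation may marginalize over (t, r) instead of
    sampling a single component."""
    B, R, H, W = attn_logits.shape
    q_tr = torch.softmax(attn_logits.reshape(B, -1), dim=-1)   # joint q(t, r | y)
    idx = D.Categorical(probs=q_tr).sample()                   # one (r, t) per image
    r = idx // (H * W)                                         # discrete rotation component
    t = torch.stack([(idx % (H * W)) // W, idx % W], dim=-1)   # pixel offset (row, col)

    b = torch.arange(B)
    z_dim = z_mu.shape[-1]
    mu_z = z_mu.reshape(B, -1, z_dim)[b, idx]                  # params of the chosen component
    std_z = z_logstd.reshape(B, -1, z_dim)[b, idx].exp()
    z = mu_z + std_z * torch.randn_like(mu_z)                  # reparameterized sample of z

    mu_th = theta_mu.reshape(B, -1)[b, idx]
    std_th = theta_logstd.reshape(B, -1)[b, idx].exp()
    theta = r * (2 * torch.pi / R) + mu_th + std_th * torch.randn(B)  # coarse + fine rotation
    return z, theta, t, r
```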
The Kullback-Leibler (KL) divergence between the approximate posterior and the prior over the latent variables is then
$$KL(q(z, \theta, t, r \mid y)\,\|\,p(z, \theta, t, r)) = \sum_{z, \theta, t, r} q(z, \theta, t, r \mid y)\,\log \frac{q(z, \theta, t, r \mid y)}{p(z, \theta, t, r)}. \quad (4)$$
To simplify this equation, we use the factorization from Equation 3 to reduce the KL-divergence to the following (see Appendix A1 for the full derivation), with the joint prior factorized into independent priors over the latent variables,
$$KL(q(z, \theta, t, r \mid y)\,\|\,p(z, \theta, t, r)) = KL_{t,r} + \sum_{t,r} q(t, r \mid y)\,\big(KL_\theta + KL_z\big), \quad \text{where} \quad (5)$$
$$KL_{t,r} = \sum_{t,r} q(t, r \mid y)\,\log \frac{q(t, r \mid y)}{p(t, r)},$$
$$KL_\theta = KL\big(q(\theta \mid t, r, y)\,\|\,p(\theta \mid r)\big), \ \text{and}$$
$$KL_z = KL\big(q(z \mid t, r, y)\,\|\,p(z)\big).$$
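As a worked sketch of Equation 5, the function below computes $KL_{t,r}$ against a uniform prior over the discrete translations and rotations, and the $q(t, r \mid y)$-weighted Gaussian $KL_z$ against a standard normal prior; the $KL_\theta$ term is analogous and omitted. The tensor shapes, the choice of uniform and standard normal priors, and the function name are assumptions for illustration.

```python
import math
import torch

def kl_terms(attn_logits, z_mu, z_logstd):
    """Compute KL_{t,r} plus the q(t, r | y)-weighted KL_z from Equation 5,
    assuming a uniform prior p(t, r) and a standard normal prior p(z).
    The KL_theta term is analogous and omitted for brevity.
    attn_logits: (B, R, H, W); z_mu, z_logstd: (B, R, H, W, z_dim)."""
    B = attn_logits.shape[0]
    log_q_tr = torch.log_softmax(attn_logits.reshape(B, -1), dim=-1)
    q_tr = log_q_tr.exp()
    n_components = q_tr.shape[-1]                               # R * H * W

    # KL between the discrete posterior and the uniform prior 1 / (R*H*W)
    kl_tr = (q_tr * (log_q_tr + math.log(n_components))).sum(dim=-1)

    # Gaussian KL(q(z | t, r, y) || N(0, I)) at every (t, r) component ...
    var = (2 * z_logstd).exp()
    kl_z = 0.5 * (z_mu ** 2 + var - 2 * z_logstd - 1).sum(dim=-1)   # (B, R, H, W)
    # ... weighted by q(t, r | y), as in Equation 5
    kl_z_weighted = (q_tr * kl_z.reshape(B, -1)).sum(dim=-1)

    return kl_tr + kl_z_weighted                                # one value per image
```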