or distribution over latent variables which is then decoded back to the input image by the decoder
network. However, the unstructured nature of the latent variables means that they do not separate into
any specific and interpretable sources of variation. To achieve disentanglement, methods have been
proposed that encourage independence between latent variables [4, 5]. However, these methods make
no prior or structural assumptions about what these latent variables should encode, even when some
sources of variation, e.g., an object's location or pose, are prevalent and well-known. Recently, several
methods have proposed structured models that explicitly model rotation or translation within their
generative networks [6, 7, 8] by formulating the image generative model as a function mapping
coordinates in space to pixel values. Although promising, only the generative portion of these methods is
equivariant to rotation and translation, and the inference networks have lackluster performance due to
poor inductive bias for these structured latents.
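To make the coordinate-based formulation concrete, below is a minimal sketch (in PyTorch; the class name, layer sizes, and transformation handling are illustrative placeholders, not the architecture of any cited method) of a generator that maps a semantic latent vector together with a grid of 2D coordinates to pixel values. Because rotation and translation are applied to the coordinate grid rather than to rendered pixels, the generated image transforms accordingly by construction.

```python
import torch
import torch.nn as nn

class SpatialGenerator(nn.Module):
    """Sketch of a coordinate-based generator: pixel = g(R x + t, z)."""

    def __init__(self, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # one grayscale value per coordinate
        )

    def forward(self, coords, z, theta=None, dx=None):
        # coords: (B, N, 2) pixel coordinates; z: (B, z_dim) semantic latent;
        # theta: (B,) rotation angles; dx: (B, 2) translations.
        if theta is not None:  # rotate the coordinate grid, not the image
            c, s = torch.cos(theta), torch.sin(theta)
            R = torch.stack([torch.stack([c, -s], -1),
                             torch.stack([s, c], -1)], -2)  # (B, 2, 2)
            coords = coords @ R.transpose(1, 2)
        if dx is not None:  # translate the coordinate grid
            coords = coords + dx.unsqueeze(1)
        z = z.unsqueeze(1).expand(-1, coords.size(1), -1)
        return self.net(torch.cat([coords, z], dim=-1))  # (B, N, 1) pixels
```

Note that this construction makes only the generator equivariant; the encoder that must infer z, theta, and dx from an image receives no such structure for free.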
To address this, we propose TARGET-VAE, a Translation and Rotation Group Equivariant Variational
Auto-Encoder. TARGET-VAE is able to learn semantic object representations that are invariant to
pose and location from images corrupted by these transformations, by structurally decomposing
the image generative factors into semantic, rotation, and translation components. We perform
approximate inference with a group equivariant convolutional neural network [9] and a specially
formulated approximate posterior distribution that allows us to disentangle the latent variables into a
rotationally equivariant latent rotation, a translationally equivariant latent translation, and rotation- and
translation-invariant semantic latent variables. By combining this with a spatial generator network, our
framework is completely invariant to rotation and translation, unsupervised, and fully differentiable;
the model is trained end-to-end using only observed images. In comprehensive experiments, we
show that TARGET-VAE accurately infers the rotation and translation of objects without supervision
and learns high quality object representations even when training images are heavily confounded by
rotation and translation. We then show that this framework can be used to map continuous variation
in 2D images of proteins collected with cryo-EM.
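As a schematic of this decomposition (the notation here is introduced only for illustration and is not necessarily the exact parameterization used by the model), the approximate posterior over a rotation $\theta$, a translation $\Delta x$, and semantic content $z$ given an image $y$ factorizes as
\[
q(\theta, \Delta x, z \mid y) = q(\theta, \Delta x \mid y)\, q(z \mid y, \theta, \Delta x),
\]
where $q(\theta, \Delta x \mid y)$ is produced by a translation and rotation equivariant encoder and $q(z \mid y, \theta, \Delta x)$ is constrained to be invariant to both transformations.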
2 Related Work
In recent years, there has been significant progress in machine learning methods for unsupervised
semantic analysis of images using VAEs [6, 10, 11, 12], flow-based methods [13], Generative
Adversarial Networks (GANs) [14, 15, 16], and capsule networks [17]. These methods generally seek
to learn a low-dimensional representation of each image in a dataset, or a distribution over this latent,
by learning to reconstruct the image from the latent variable. The latent variable, then, must capture
variation between images in the dataset in order for them to be accurately reconstructed. These
latents can then be used as features for downstream analysis, as they capture image content. However,
these representations must capture all information needed to reconstruct an image, including common
transformations that are not semantically meaningful. This often results in latent representations that
group objects primarily by location, pose, or other nuisance variables, rather than type. One simple
approach to address this is data augmentation. In this approach, additional images are created by
applying a known set of nuisance transformations to the original training dataset. A network is then
trained to be invariant to these transformations by penalizing the difference between low-dimensional
representations of the same image with different nuisance transforms applied [18, 19]. However, this
approach is data inefficient and, more importantly, does not guarantee invariance to the underlying
nuisance transformations. This general approach has improved with recent contrastive learning
methods [20, 21, 22], but these ultimately suffer from the same underlying problems.
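As an illustration of this augmentation-based strategy (a minimal sketch, not the training procedure of any specific cited work; the encoder, transform set, and loss weight are placeholders), two randomly transformed views of the same images are encoded and the distance between their representations is penalized alongside the usual training objective:

```python
import torch.nn.functional as F
from torchvision import transforms

# Nuisance transforms used for augmentation (placeholder choices).
augment = transforms.Compose([
    transforms.RandomRotation(degrees=180),
    transforms.RandomAffine(degrees=0, translate=(0.2, 0.2)),
])

def invariance_penalty(encoder, images, weight=1.0):
    # images: (B, C, H, W) tensor; each call applies one random transform
    # to the whole batch, so the two views below generally differ.
    z1 = encoder(augment(images))
    z2 = encoder(augment(images))
    return weight * F.mse_loss(z1, z2)
```

Such a penalty only encourages similar representations for transformations actually sampled during training, which is why it cannot guarantee invariance to the underlying nuisance transformations.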
Other methods approach this problem by applying constraints on the unstructured latent space
[4, 5, 15]. In this general class of methods, the latent space is not explicitly structured to disentangle
semantic latent variables from latents representing other transformations. Instead, the models need to
be inspected post-hoc to determine whether meaningfully separated concepts have been encoded in each
latent variable and what those concepts are. A few works have addressed explicitly disentangling
the latent space into structured and unstructured variables. Sun et al. [17] propose a 3D point cloud
representation of objects decomposed into capsules. Using a self-supervised learning scheme, their
model learns transformation-invariant descriptors along with transformation-equivariant poses for
each capsule, improving reconstruction and clustering. Bepler et al. [6] propose spatial-VAE, a VAE
framework that divides the latent variables into unstructured, rotation, and translation components.
However, only the generative part of spatial-VAE is equivariant to rotation and translation, and the
inference network struggles to perform meaningful inference on these transformations.