Robust Self-Supervised Learning with Lie Groups

Mark Ibrahim*, Diane Bouchacourt*, Ari Morcos
Fundamental AI Research (FAIR), Meta AI

*Equal contribution.
Abstract
Deep learning has led to remarkable advances in computer vision. Even so, today's best models are brittle when presented with variations that differ even slightly from those seen during training. Minor shifts in the pose, color, or illumination of an object can lead to catastrophic misclassifications. State-of-the-art models struggle to understand how a set of variations can affect different objects. We propose a framework for instilling a notion of how objects vary in more realistic settings. Our approach applies the formalism of Lie groups to capture continuous transformations, improving models' robustness to distributional shifts. We apply our framework on top of state-of-the-art self-supervised learning (SSL) models, finding that explicitly modeling transformations with Lie groups leads to substantial performance gains of greater than 10% for MAE, both on known instances seen in typical poses now presented in new poses, and on unknown instances in any pose. We also apply our approach to ImageNet, finding that the Lie operator improves performance by almost 4%. These results demonstrate the promise of learning transformations to improve model robustness.¹

¹Code to reproduce all experiments will be made available.
1 Introduction
State-of-the-art models have proven adept at modeling a number of complex tasks, but they struggle when presented with inputs different from those seen during training. For example, while classification models are very good at recognizing buses in the upright position, they fail catastrophically when presented with an upside-down bus, since such images are generally not included in standard training sets (Alcorn et al., 2019). This can be problematic for deployed systems, as models are required to generalize to settings not seen during training ("out-of-distribution (OOD) generalization"). One potential explanation for this failure of OOD generalization is that models exploit any and all correlations between inputs and targets. Consequently, models rely on heuristics that, while effective during training, may fail to generalize, leading to a form of "supervision collapse" (Jo & Bengio, 2017; Ilyas et al., 2019; Doersch et al., 2020; Geirhos et al., 2020a). However, a number of models trained without supervision (self-supervised) have recently been proposed, many of which exhibit improved, but still limited, OOD robustness (Chen et al., 2020; Hendrycks et al., 2019; Geirhos et al., 2020b).
The most common approach to this problem is to reduce the distribution shift by augmenting training data. Augmentations are also key for a number of contrastive self-supervised approaches, such as SimCLR (Chen et al., 2020). While this approach can be effective, it has a number of disadvantages. First, for image data, augmentations can only be applied to the pixels themselves, making it easy to, for example, rotate the entire image, but very difficult to rotate a single object within the image. Since many of the variations seen in real data cannot be approximated by pixel-level augmentations, this can be quite limiting in practice. Second, similar to adversarial training (Madry et al., 2017; Kurakin et al., 2016), while augmentation can improve performance on known objects, it often fails to generalize to novel objects (Alcorn et al., 2019). Third, augmenting to enable generalization for one form of variation can often harm the performance on other forms of variation (Geirhos et al., 2018; Engstrom et al., 2019), and is not guaranteed to provide the expected invariance to variations (Bouchacourt et al., 2021b). Finally, enforcing invariance is not guaranteed to provide the correct robustness that generalizes to new instances (as discussed in Section 2).
For these reasons, we choose to explicitly model the transformations of the data as transformations in the latent representation, rather than trying to be invariant to them. To do so, we use the formalism of Lie groups. Informally, Lie groups are continuous groups described by a set of real parameters (Hall, 2003). While many continuous transformations form matrix Lie groups (e.g., rotations), they lack the typical structure of a vector space. However, Lie groups have a corresponding vector space, their Lie algebra, that can be described using basis matrices, allowing us to describe the infinitely many elements of the group with a finite number of basis matrices. Our goal will be to learn such matrices to directly model the data variations.
To summarize, our approach structures the representation space to enable self-supervised models to generalize variation across objects. Since many naturally occurring transformations (e.g., pose, color, size, etc.) are continuous, we develop a theoretically motivated operator, the Lie operator, that acts in representation space (see Fig. 1). Specifically, the Lie operator learns the continuous transformations observed in data as a vector space, using a set of basis matrices. With this approach, we make the following contributions:
1. We generate a novel dataset containing 3D objects in many different poses, allowing us to explicitly evaluate the ability of models to generalize to both known objects in unknown poses and to unknown objects in both known and unknown poses (Section 3).
2. Using this dataset, we evaluate the generalization capabilities of a number of standard models, including ResNet-50, ViT, MLP-Mixer, SimCLR, CLIP, VICReg, and MAE, finding that all state-of-the-art models perform relatively poorly in this setting (Section 3.2).
3. We incorporate our proposed Lie operator in two recent SSL approaches, masked autoencoders (MAE; He et al., 2021) and Variance-Invariance-Covariance Regularization (VICReg; Bardes et al., 2021), to directly model transformations in data (Section 2), resulting in substantial OOD performance gains of greater than 10% for MAE and of up to 8% for VICReg (Section 4.1). We also incorporate our Lie model in SimCLR (Chen et al., 2020) (Appendix E).
4. We run systematic ablations of each term of our learning objective in the MAE Lie model, showing the relevance of every component for best performance (Section 4.3).
5. We challenge our learned Lie operator by running evaluations on ImageNet as well as a rotated version of ImageNet, improving MAE performance by 3.94% and 2.62%, respectively. We also improve over standard MAE with MAE Lie when evaluated with finetuning on the iLab dataset (Borji et al., 2016). These experiments show the applicability of our proposed Lie operator to realistic, challenging datasets (Section 5).

Figure 1: Summary of approach and gains. We generate a novel dataset containing rendered images of objects in typical and atypical poses, with some instances only seen in typical, but not atypical, poses (left). Using these data, we augment SSL models such as MAE with a learned Lie operator which approximates the transformations in the latent space induced by changes in pose (middle). Using this operator, we improve performance by >10% for MAE for both known instances in new poses and unknown instances in both typical and atypical poses (right).
2 Methods
Humans recognize even new objects in all sorts of poses by observing continuous changes in the world. To mimic humans' ability to generalize variation, we propose a method for learning continuous transformations from data. We assume we have access to pairs of data, e.g., images, $(x, x')$, where $x'$ is a transformed version of $x$ corresponding to the same instance, e.g., a specific object, undergoing variations. This is a reasonable form of supervision which one can easily acquire from video data, for example, where pairs of frames would correspond to pairs of transformed data; it has been employed in previous works such as Ick & Lostanlen (2020); Locatello et al. (2020); Connor et al. (2021b); Bouchacourt et al. (2021a).
Despite widespread use of such pairs in data augmentation, models still struggle to generalize across distribution shifts (Alcorn et al., 2019; Geirhos et al., 2018; Engstrom et al., 2019). To address this brittleness, we take inspiration from previous work illustrating the possibility of representing variation via latent operators (Connor et al., 2021b; Bouchacourt et al., 2021a; Giannone et al., 2019). Acting in latent space avoids costly modifications of the full input dimensions and allows us to express complex operators using simpler latent operators. We learn to represent data variation explicitly as group actions on the model's representation space $\mathcal{Z}$. We use operators theoretically motivated by group theory, as groups enforce properties such as composition and inverses, allowing us to structure representations beyond that of an unconstrained network (such as a multi-layer perceptron, MLP) (see Appendix A). Fortunately, many common transformations can be described as groups.
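To make the appeal of group structure concrete, the short sketch below (our own illustrative example, not code from the paper) checks numerically that elements generated by a single skew-symmetric matrix compose additively and invert cleanly, properties an unconstrained MLP would not provide by construction:

```python
import torch

# Group structure sanity check: with a single generator L, the elements
# exp(t1*L) and exp(t2*L) compose additively, and exp(-t*L) inverts exp(t*L).
L = torch.randn(4, 4)
L = L - L.T  # skew-symmetric generator, so exp(t*L) lies in SO(4)

g = lambda t: torch.linalg.matrix_exp(t * L)

t1, t2 = 0.3, 0.7
print(torch.allclose(g(t1) @ g(t2), g(t1 + t2), atol=1e-5))    # composition
print(torch.allclose(g(t1) @ g(-t1), torch.eye(4), atol=1e-5))  # inverses
```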
We choose matrix Lie groups to model continuous transformations, as they can be represented as matrices and described by real parameters. However, they do not possess the basic properties of a vector space (e.g., adding two rotation matrices does not produce a rotation matrix), without which back-propagation is not possible. For each matrix Lie group $G$, however, there is a corresponding Lie algebra $\mathfrak{g}$, which is the tangent space of $G$ at the identity. A Lie algebra forms a vector space where addition is closed and where the space can be described using basis matrices. Under mild assumptions on $G$, any element of $G$ can be obtained by taking the matrix exponential map $\exp(tM)$, where $M$ is a matrix in $\mathfrak{g}$ and $t$ is a real number. Benefiting from the vector space structure of Lie algebras, we aim to learn a basis of the Lie algebra in order to represent the group acting on the representation space, using the $\exp$ map. We refer the reader to Appendix A and Hall (2003); Stillwell (2008) for a more detailed presentation of matrix Lie groups and their algebras.
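As a minimal sketch of this vector-space-to-group mapping (ours, not the paper's code), consider the standard basis of $\mathfrak{so}(3)$, the Lie algebra of 3D rotations: any linear combination of the basis matrices is an algebra element, and $\exp$ maps it to a group element:

```python
import torch

# The Lie algebra is a vector space: any linear combination of basis
# matrices is a valid algebra element, and exp maps it into the group.
# Standard generators of so(3), the algebra of 3D rotations:
L1 = torch.tensor([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
L2 = torch.tensor([[0., 0., 1.], [0., 0., 0.], [-1., 0., 0.]])
L3 = torch.tensor([[0., -1., 0.], [1., 0., 0.], [0., 0., 0.]])

t = torch.tensor([0.1, -0.4, 0.25])    # coordinates in the algebra
M = t[0] * L1 + t[1] * L2 + t[2] * L3  # element of so(3)
g = torch.linalg.matrix_exp(M)         # element of SO(3)

# g is orthogonal with determinant 1, i.e., a rotation matrix.
print(torch.allclose(g @ g.T, torch.eye(3), atol=1e-6), torch.linalg.det(g))
```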
2.1 Learning transformations from one sample to another via Lie groups
We assume that there exists a ground-truth group element, $g$, which describes the transformation in latent space between the embedding of $x$ and that of $x'$. Our goal is to approximate $g$ using its Lie algebra. More formally, given a pair of samples (e.g., a pair of video frames), $x$ and $x'$, and an encoder, $f(\cdot)$, we compute the latent representations $z$ and $z_g$ as $z = f(x)$ and $z_g = f(x')$. We therefore assume $z_g = f(x') = g(f(x))$. We aim to learn an operator $\hat{g}$ maximizing the similarity between $z_{\hat{g}} = \hat{g}(f(x))$ and $z_g$.
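For clarity, this setup can be sketched as follows; the toy encoder and the stand-in transformation are our own placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

# Toy setup: an encoder f maps a sample and its transformed version to
# embeddings z and z_g; the "transformation" here is just a placeholder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))

x = torch.randn(8, 3, 32, 32)               # batch of images
x_prime = torch.roll(x, shifts=2, dims=-1)  # stand-in transformed pair

z, z_g = encoder(x), encoder(x_prime)       # z = f(x), z_g = f(x') = g(f(x))
```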
We construct our operator by using the vector space structure of the group's Lie algebra $\mathfrak{g}$: we learn a basis of $\mathfrak{g}$ as a set of matrices $L_k \in \mathbb{R}^{m \times m}$, $k \in \{1, \dots, d\}$, where $d$ is the dimension of the Lie algebra and $m$ is the dimension of the embedding space $\mathcal{Z}$. We use this basis as a way to represent the group elements, and thus to model the variations that data samples undergo. A matrix $M$ in the Lie algebra $\mathfrak{g}$ can thus be written as $M = \sum_{k=1}^{d} t_k L_k$ for some coordinate vector $t = (t_1, \dots, t_d) \in \mathbb{R}^d$.
Let us consider a pair $(z, z_g)$ with $z_g = g(z)$. We assume the exponential map is surjective², thus there exists $\sum_{k=1}^{d} t_k L_k \in \mathfrak{g}$ such that

$$g = \exp\Big(\sum_{k=1}^{d} t_k L_k\Big) = \exp(t^\top L) \tag{1}$$

where $t = (t_1, \dots, t_d)$ are the coordinates in the Lie algebra of the group element $g$, and $L$ is the 3D matrix in $\mathbb{R}^{d \times m \times m}$ concatenating $\{L_k, k \in \{1, \dots, d\}\}$.
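A minimal PyTorch sketch of this construction is shown below; the module name, initialization scale, and use of a batched matrix exponential are our assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

class LieOperator(nn.Module):
    """Sketch of Eq. (1): g = exp(sum_k t_k L_k) acting on m-dim embeddings."""

    def __init__(self, d: int, m: int):
        super().__init__()
        # Learned Lie algebra basis: d matrices of size m x m. Small init
        # keeps exp(M) near the identity early in training (an assumption).
        self.L = nn.Parameter(0.01 * torch.randn(d, m, m))

    def forward(self, z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, d) coordinates; build M = sum_k t_k L_k per sample.
        M = torch.einsum('bd,dij->bij', t, self.L)
        g = torch.linalg.matrix_exp(M)           # batched matrix exponential
        return torch.einsum('bij,bj->bi', g, z)  # apply g to the embedding z
```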
Our goal is to infer $\hat{g}$ such that $\hat{g}(z)$ (the application of the inferred $\hat{g}$ to the representation $z$) equals $z_g$. We denote $z_{\hat{g}} = \hat{g}(z)$. Specifically, our transformation inference procedure consists of Steps 1-4 in Algorithm 1. For Step 1 in Algorithm 1, similar to Connor et al. (2021b); Bouchacourt et al. (2021a), we use a pair of latent codes to infer the latent transformation. Additionally, we consider that we can access a scalar value, denoted $\delta_{x,x'}$, measuring "how much" the two samples differ. For example, considering frames from a given video, this would be the number of frames that separate the two original video frames $x$ and $x'$, and hence comes for free³. The coordinate vector corresponding to the pair $(z, z_g)$ is then inferred using a multi-layer perceptron $h$⁴:

$$\hat{t} = h(z, z_g, \delta_{x,x'}) \tag{2}$$
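A possible form for $h$ is sketched below; the hidden width and the exact detaching behavior are assumptions on our part (footnote 4 specifies only that gradients of $z$ are detached from the encoder):

```python
import torch
import torch.nn as nn

class CoordinateInference(nn.Module):
    """Sketch of h in Eq. (2): infer coordinates t-hat from an embedding
    pair and the scalar transformation magnitude delta."""

    def __init__(self, m: int, d: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * m + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, z, z_g, delta):
        # Detach z so that learning h does not backpropagate into the
        # encoder f through this path (per footnote 4).
        inp = torch.cat([z.detach(), z_g, delta.unsqueeze(-1)], dim=-1)
        return self.net(inp)
```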
When $\delta_{x,x'}$ is small, we assume the corresponding images and embeddings differ by a transformation close to the identity, which corresponds to an infinitesimal $\hat{t}$. To encourage this, we use a squared $L_2$-norm constraint on $\hat{t}$. For smaller $\delta_{x,x'}$ (e.g., close-by frames in a video), larger similarity between $z$ and $z_g$ is expected, in which case we want to encourage a smaller norm of $\hat{t}$:

$$s(z, z_g, \delta_{x,x'}) \, \|\hat{t}\|^2 \tag{3}$$
where $s : \mathcal{Z} \times \mathcal{Z} \times \mathbb{R} \to [0, 1]$ measures the similarity between $z$ and $z_g$. In our experiments, we simply use $s(z, z_g, \delta_{x,x'})$ as a function of $\delta_{x,x'}$ and compute

$$s(\delta_{x,x'}) = \frac{1}{1 + \exp(|\delta_{x,x'}|)}. \tag{4}$$

When $|\delta_{x,x'}|$ decreases, $s$ increases, which strengthens the constraint in Eq. 3: the inferred vector of coordinates $\hat{t}$ is therefore enforced to have a small norm.
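In code, the weight of Eq. (4) and the penalty of Eq. (3) amount to a few lines; the mean reduction over the batch is our assumption:

```python
import torch

def similarity_weight(delta: torch.Tensor) -> torch.Tensor:
    # Eq. (4): s = 1 / (1 + exp(|delta|)); larger for more similar pairs.
    return 1.0 / (1.0 + torch.exp(delta.abs()))

def t_norm_penalty(t_hat: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    # Eq. (3): similarity-weighted squared L2 norm of the inferred
    # coordinates, pushing t-hat toward zero for small |delta|.
    return (similarity_weight(delta) * t_hat.pow(2).sum(dim=-1)).mean()
```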
²This is the case if $G$ is compact and connected (Hall, 2003, Corollary 11.10).
³Assuming that nearby video frames will be restricted to small transformations.
⁴During the learning of the parameters of $h$, we detach the gradients of the latent code $z$ with respect to the parameters of the encoder $f$.