Robust Self-Supervised Learning with Lie Groups

Mark Ibrahim*, Diane Bouchacourt*, Ari Morcos
Fundamental AI Research (FAIR), Meta AI

*Equal contribution.
Abstract
Deep learning has led to remarkable advances in computer vision. Even so, today's best models are brittle when presented with variations that differ even slightly from those seen during training. Minor shifts in the pose, color, or illumination of an object can lead to catastrophic misclassifications. State-of-the-art models struggle to understand how a set of variations can affect different objects. We propose a framework for instilling a notion of how objects vary in more realistic settings. Our approach applies the formalism of Lie groups to capture continuous transformations, improving models' robustness to distributional shifts. We apply our framework on top of state-of-the-art self-supervised learning (SSL) models, finding that explicitly modeling transformations with Lie groups leads to substantial performance gains of greater than 10% for MAE, both on known instances seen in typical poses now presented in new poses, and on unknown instances in any pose. We also apply our approach to ImageNet, finding that the Lie operator improves performance by almost 4%. These results demonstrate the promise of learning transformations to improve model robustness.¹

¹Code to reproduce all experiments will be made available.
1 Introduction
State-of-the-art models have proven adept at modeling a number of complex tasks, but they struggle when presented with inputs different from those seen during training. For example, while classification models are very good at recognizing buses in the upright position, they fail catastrophically when presented with an upside-down bus, since such images are generally not included in standard training sets (Alcorn et al., 2019). This can be problematic for deployed systems, as models are required to generalize to settings not seen during training ("out-of-distribution (OOD) generalization"). One potential explanation for this failure of OOD generalization is that models exploit any and all correlations between inputs and targets. Consequently, models rely on heuristics that, while effective during training, may fail to generalize, leading to a form of "supervision collapse" (Jo & Bengio, 2017; Ilyas et al., 2019; Doersch et al., 2020; Geirhos et al., 2020a). However, a number of models trained without supervision (self-supervised) have recently been proposed, many of which exhibit improved, but still limited, OOD robustness (Chen et al., 2020; Hendrycks et al., 2019; Geirhos et al., 2020b).
The most common approach to this problem is to reduce the distribution shift by augmenting training data. Augmentations are also key for a number of contrastive self-supervised approaches, such as SimCLR (Chen et al., 2020). While this approach can be effective, it has a number of disadvantages. First, for image data, augmentations can only be applied to the pixels themselves, making it easy to, for example, rotate the entire image, but very difficult to rotate a single object within the image. Since many of the variations seen in real data cannot be approximated by pixel-level augmentations, this can be quite limiting in practice. Second, similar to adversarial training (Madry et al., 2017; Kurakin et al., 2016), while augmentation can improve performance on known objects, it often fails to generalize to novel objects (Alcorn et al., 2019). Third, augmenting to enable generalization for one form of variation can often harm the performance on other forms of variation (Geirhos et al., 2018; Engstrom et al., 2019), and is not guaranteed to provide the expected invariance to variations (Bouchacourt et al., 2021b). Finally, enforcing invariance is not guaranteed to provide the correct robustness that generalizes to new instances (as discussed in Section 2).
For these reasons, we choose to explicitly model the transformations of the data as transformations in the latent representation, rather than trying to be invariant to them. To do so, we use the formalism of Lie groups. Informally, Lie groups are continuous groups described by a set of real parameters (Hall, 2003). While many continuous transformations form matrix Lie groups (e.g., rotations), they lack the typical structure of a vector space. However, Lie groups have a corresponding vector space, their Lie algebra, that can be described using basis matrices, allowing us to describe the infinitely many elements of the group with a finite number of basis matrices. Our goal will be to learn such matrices to directly model the data variations.
To summarize, our approach structures the representation space to enable self-supervised models to generalize variation across objects. Since many naturally occurring transformations (e.g., pose, color, size, etc.) are continuous, we develop a theoretically motivated operator, the Lie operator, that acts in representation space (see Fig. 1). Specifically, the Lie operator learns the continuous transformations observed in data as a vector space, using a set of basis matrices. With this approach, we make the following contributions:
1. We generate a novel dataset containing 3D objects in many different poses, allowing us to explicitly evaluate the ability of models to generalize to both known objects in unknown poses and to unknown objects in both known and unknown poses (Section 3).
2. Using this dataset, we evaluate the generalization capabilities of a number of standard models, including ResNet-50, ViT, MLP-Mixer, SimCLR, CLIP, VICReg, and MAE, finding that all state-of-the-art models perform relatively poorly in this setting (Section 3.2).
3. We incorporate our proposed Lie operator in two recent SSL approaches, masked autoencoders (MAE; He et al., 2021) and Variance-Invariance-Covariance Regularization (VICReg; Bardes et al., 2021), to directly model transformations in data (Section 2), resulting in substantial OOD performance gains of greater than 10% for MAE and of up to 8% for VICReg (Section 4.1). We also incorporate our Lie model in SimCLR (Chen et al., 2020) (Appendix E).
4. We run systematic ablations of each term of our learning objective in the MAE Lie model, showing the relevance of every component for best performance (Section 4.3).
5. We challenge our learned Lie operator by running evaluations on ImageNet as well as a rotated version of ImageNet, improving MAE performance by 3.94% and 2.62%, respectively. We also improve over standard MAE with MAE Lie when evaluated with finetuning on the iLab dataset (Borji et al., 2016). These experiments show the applicability of our proposed Lie operator to realistic, challenging datasets (Section 5).

Figure 1: Summary of approach and gains. We generate a novel dataset containing rendered images of objects in typical and atypical poses, with some instances only seen in typical, but not atypical, poses (left). Using these data, we augment SSL models such as MAE with a learned Lie operator which approximates the transformations in the latent space induced by changes in pose (middle). Using this operator, we improve performance by >10% for MAE for both known instances in new poses and unknown instances in both typical and atypical poses (right).
2 Methods
Humans recognize even new objects in all sorts of poses by observing continuous changes in the world. To mimic humans' ability to generalize variation, we propose a method for learning continuous transformations from data. We assume we have access to pairs of data, e.g., images, $(x, x')$, where $x'$ is a transformed version of $x$ corresponding to the same instance, e.g., a specific object, undergoing variations. This is a reasonable form of supervision which one can easily acquire from video data, for example, where pairs of frames would correspond to pairs of transformed data; it has been employed in previous works such as Ick & Lostanlen (2020); Locatello et al. (2020); Connor et al. (2021b); Bouchacourt et al. (2021a).
Despite widespread use of such pairs in data augmentation, models still struggle to generalize across distribution shifts (Alcorn et al., 2019; Geirhos et al., 2018; Engstrom et al., 2019). To address this brittleness, we take inspiration from previous work illustrating the possibility of representing variation via latent operators (Connor et al., 2021b; Bouchacourt et al., 2021a; Giannone et al., 2019). Acting in latent space avoids costly modifications of the full input dimensions and allows us to express complex operators using simpler latent operators. We learn to represent data variation explicitly as group actions on the model's representation space $\mathcal{Z}$. We use operators theoretically motivated by group theory, as groups enforce properties such as composition and inverses, allowing us to structure representations beyond that of an unconstrained network (such as a multi-layer perceptron, MLP) (see Appendix A). Fortunately, many common transformations can be described as groups.
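To make the appeal of group structure concrete, the short sketch below (our own illustrative example, not code from the paper) checks numerically that elements generated by a single skew-symmetric matrix compose additively and invert cleanly, properties an unconstrained MLP would not provide by construction:

```python
import torch

# Group structure sanity check: with a single generator L, the elements
# exp(t1*L) and exp(t2*L) compose additively, and exp(-t*L) inverts exp(t*L).
L = torch.randn(4, 4)
L = L - L.T  # skew-symmetric generator, so exp(t*L) lies in SO(4)

g = lambda t: torch.linalg.matrix_exp(t * L)

t1, t2 = 0.3, 0.7
print(torch.allclose(g(t1) @ g(t2), g(t1 + t2), atol=1e-5))    # composition
print(torch.allclose(g(t1) @ g(-t1), torch.eye(4), atol=1e-5))  # inverses
```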
We choose matrix Lie groups to model continuous transformations, as they can be represented as matrices and described by real parameters. However, they do not possess the basic properties of a vector space (e.g., adding two rotation matrices does not produce a rotation matrix), without which back-propagation is not possible. For each matrix Lie group $G$, however, there is a corresponding Lie algebra $\mathfrak{g}$, which is the tangent space of $G$ at the identity. A Lie algebra forms a vector space where addition is closed and where the space can be described using basis matrices. Under mild assumptions on $G$, any element of $G$ can be obtained by taking the matrix exponential map $\exp(tM)$, where $M$ is a matrix in $\mathfrak{g}$ and $t$ is a real number. Benefiting from the vector space structure of Lie algebras, we aim to learn a basis of the Lie algebra in order to represent the group acting on the representation space, using the $\exp$ map. We refer the reader to Appendix A and Hall (2003); Stillwell (2008) for a more detailed presentation of matrix Lie groups and their algebras.
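As a minimal sketch of this vector-space-to-group mapping (ours, not the paper's code), consider the standard basis of $\mathfrak{so}(3)$, the Lie algebra of 3D rotations: any linear combination of the basis matrices is an algebra element, and $\exp$ maps it to a group element:

```python
import torch

# The Lie algebra is a vector space: any linear combination of basis
# matrices is a valid algebra element, and exp maps it into the group.
# Standard generators of so(3), the algebra of 3D rotations:
L1 = torch.tensor([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
L2 = torch.tensor([[0., 0., 1.], [0., 0., 0.], [-1., 0., 0.]])
L3 = torch.tensor([[0., -1., 0.], [1., 0., 0.], [0., 0., 0.]])

t = torch.tensor([0.1, -0.4, 0.25])    # coordinates in the algebra
M = t[0] * L1 + t[1] * L2 + t[2] * L3  # element of so(3)
g = torch.linalg.matrix_exp(M)         # element of SO(3)

# g is orthogonal with determinant 1, i.e., a rotation matrix.
print(torch.allclose(g @ g.T, torch.eye(3), atol=1e-6), torch.linalg.det(g))
```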
2.1 Learning transformations from one sample to another via Lie groups
We assume that there exists a ground-truth group element, $g$, which describes the transformation in latent space between the embedding of $x$ and that of $x'$. Our goal is to approximate $g$ using its Lie algebra. More formally, given a pair of samples (e.g., a pair of video frames), $x$ and $x'$, and an encoder, $f(\cdot)$, we compute the latent representations $z$ and $z_g$ as $z = f(x)$ and $z_g = f(x')$. We therefore assume $z_g = f(x') = g(f(x))$. We aim to learn an operator $\hat{g}$ maximizing the similarity between $z_{\hat{g}} = \hat{g}(f(x))$ and $z_g$.
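For clarity, this setup can be sketched as follows; the toy encoder and the stand-in transformation are our own placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

# Toy setup: an encoder f maps a sample and its transformed version to
# embeddings z and z_g; the "transformation" here is just a placeholder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))

x = torch.randn(8, 3, 32, 32)               # batch of images
x_prime = torch.roll(x, shifts=2, dims=-1)  # stand-in transformed pair

z, z_g = encoder(x), encoder(x_prime)       # z = f(x), z_g = f(x') = g(f(x))
```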
We construct our operator by using the vector space structure of the group's Lie algebra $\mathfrak{g}$: we learn a basis of $\mathfrak{g}$ as a set of matrices $L_k \in \mathbb{R}^{m \times m}$, $k \in \{1, \dots, d\}$, where $d$ is the dimension of the Lie algebra and $m$ is the dimension of the embedding space $\mathcal{Z}$. We use this basis as a way to represent the group elements, and thus to model the variations that data samples undergo. A matrix $M$ in the Lie algebra $\mathfrak{g}$ can thus be written as $M = \sum_{k=1}^{d} t_k L_k$ for some coordinate vector $t = (t_1, \dots, t_d) \in \mathbb{R}^d$.
Let us consider a pair $(z, z_g)$ with $z_g = g(z)$. We assume the exponential map is surjective², thus there exists $\sum_{k=1}^{d} t_k L_k \in \mathfrak{g}$ such that

$$g = \exp\Big(\sum_{k=1}^{d} t_k L_k\Big) = \exp(t^\top L) \tag{1}$$

where $t = (t_1, \dots, t_d)$ are the coordinates in the Lie algebra of the group element $g$, and $L$ is the 3D matrix in $\mathbb{R}^{d \times m \times m}$ concatenating $\{L_k, k \in \{1, \dots, d\}\}$.
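A minimal PyTorch sketch of this construction is shown below; the module name, initialization scale, and use of a batched matrix exponential are our assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

class LieOperator(nn.Module):
    """Sketch of Eq. (1): g = exp(sum_k t_k L_k) acting on m-dim embeddings."""

    def __init__(self, d: int, m: int):
        super().__init__()
        # Learned Lie algebra basis: d matrices of size m x m. Small init
        # keeps exp(M) near the identity early in training (an assumption).
        self.L = nn.Parameter(0.01 * torch.randn(d, m, m))

    def forward(self, z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, d) coordinates; build M = sum_k t_k L_k per sample.
        M = torch.einsum('bd,dij->bij', t, self.L)
        g = torch.linalg.matrix_exp(M)           # batched matrix exponential
        return torch.einsum('bij,bj->bi', g, z)  # apply g to the embedding z
```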
Our goal is to infer $\hat{g}$ such that $\hat{g}(z)$ (the application of the inferred $\hat{g}$ to the representation $z$) equals $z_g$. We denote $z_{\hat{g}} = \hat{g}(z)$. Specifically, our transformation inference procedure consists of Steps 1-4 in Algorithm 1. For Step 1 in Algorithm 1, similar to Connor et al. (2021b); Bouchacourt et al. (2021a), we use a pair of latent codes to infer the latent transformation. Additionally, we consider that we can access a scalar value, denoted $\delta_{x,x'}$, measuring "how much" the two samples differ. For example, considering frames from a given video, this would be the number of frames that separate the two original video frames $x$ and $x'$, and hence comes for free³. The coordinate vector corresponding to the pair $(z, z_g)$ is then inferred using a multi-layer perceptron $h$⁴:

$$\hat{t} = h(z, z_g, \delta_{x,x'}) \tag{2}$$
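A possible form for $h$ is sketched below; the hidden width and the exact detaching behavior are assumptions on our part (footnote 4 specifies only that gradients of $z$ are detached from the encoder):

```python
import torch
import torch.nn as nn

class CoordinateInference(nn.Module):
    """Sketch of h in Eq. (2): infer coordinates t-hat from an embedding
    pair and the scalar transformation magnitude delta."""

    def __init__(self, m: int, d: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * m + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, z, z_g, delta):
        # Detach z so that learning h does not backpropagate into the
        # encoder f through this path (per footnote 4).
        inp = torch.cat([z.detach(), z_g, delta.unsqueeze(-1)], dim=-1)
        return self.net(inp)
```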
When $\delta_{x,x'}$ is small, we assume the corresponding images and embeddings differ by a transformation close to the identity, which corresponds to an infinitesimal $\hat{t}$. To encourage this, we use a squared $L_2$-norm constraint on $\hat{t}$. For smaller $\delta_{x,x'}$ (e.g., close-by frames in a video), larger similarity between $z$ and $z_g$ is expected, in which case we want to encourage a smaller norm of $\hat{t}$:

$$s(z, z_g, \delta_{x,x'}) \, \|\hat{t}\|^2 \tag{3}$$
where $s : \mathcal{Z} \times \mathcal{Z} \times \mathbb{R} \to [0, 1]$ measures the similarity between $z$ and $z_g$. In our experiments, we simply use $s(z, z_g, \delta_{x,x'})$ as a function of $\delta_{x,x'}$ and compute

$$s(\delta_{x,x'}) = \frac{1}{1 + \exp(|\delta_{x,x'}|)}. \tag{4}$$

When $|\delta_{x,x'}|$ decreases, $s$ increases, which strengthens the constraint in Eq. 3: the inferred vector of coordinates $\hat{t}$ is therefore enforced to have a small norm.
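In code, the weight of Eq. (4) and the penalty of Eq. (3) amount to a few lines; the mean reduction over the batch is our assumption:

```python
import torch

def similarity_weight(delta: torch.Tensor) -> torch.Tensor:
    # Eq. (4): s = 1 / (1 + exp(|delta|)); larger for more similar pairs.
    return 1.0 / (1.0 + torch.exp(delta.abs()))

def t_norm_penalty(t_hat: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    # Eq. (3): similarity-weighted squared L2 norm of the inferred
    # coordinates, pushing t-hat toward zero for small |delta|.
    return (similarity_weight(delta) * t_hat.pow(2).sum(dim=-1)).mean()
```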
²This is the case if $G$ is compact and connected (Hall, 2003, Corollary 11.10).
³Assuming that nearby video frames will be restricted to small transformations.
⁴During the learning of the parameters of $h$, we detach the gradients of the latent code $z$ with respect to the parameters of the encoder $f$.