2020a). However, a number of models trained without supervision (self-supervised)
have recently been proposed, many of which exhibit improved, but still limited OOD
robustness (Chen et al., 2020; Hendrycks et al., 2019; Geirhos et al., 2020b).
The most common approach to this problem is to reduce the distribution shift by
augmenting training data. Augmentations are also key for a number of contrastive
self-supervised approaches, such as SimCLR (Chen et al., 2020). While this approach
can be effective, it has a number of disadvantages. First, for image data, augmentations
can only be applied to the pixels themselves, making it easy to, for example, rotate the
entire image, but very difficult to rotate a single object within the image. Since many of
the variations seen in real data cannot be approximated by pixel-level augmentations,
this can be quite limiting in practice. Second, similar to adversarial training (Madry
et al., 2017; Kurakin et al., 2016), while augmentation can improve performance on
known objects, it often fails to generalize to novel objects (Alcorn et al., 2019). Third,
augmenting to enable generalization for one form of variation can often harm the
performance on other forms of variation (Geirhos et al., 2018; Engstrom et al., 2019),
and is not guaranteed to provide the expected invariance to variations (Bouchacourt
et al., 2021b). Finally, enforcing invariance is not guaranteed to provide the correct
robustness that generalizes to new instances (as discussed in Section 2).
For these reasons, we choose to explicitly model the transformations of the data
as transformations in the latent representation rather than trying to be invariant to them.
To do so, we use the formalism of Lie groups. Informally, Lie groups are continuous
groups described by a set of real parameters (Hall, 2003). While many continuous
transformations form matrix Lie groups (e.g., rotations), these groups lack the typical
structure of a vector space. However, every Lie group has a corresponding vector space,
its Lie algebra, which can be described using basis matrices, allowing the infinitely
many elements of the group to be described by a finite number of basis matrices. Our
goal will be to learn such matrices to directly model the data variations.
To summarize, our approach structures the representation space to enable self-
supervised models to generalize variation across objects. Since many naturally occurring
transformations (e.g., pose, color, size, etc.) are continuous, we develop a theoretically-
motivated operator, the Lie operator, that acts in representation space (see Fig. 1).
Specifically, the Lie operator learns the continuous transformations observed in data as
a vector space, using a set of basis matrices. With this approach, we make the following
contributions:
1. We generate a novel dataset containing 3D objects in many different poses,
allowing us to explicitly evaluate the ability of models to generalize to both
known objects in unknown poses and to unknown objects in both known and
unknown poses (Section 3).
2. Using this dataset, we evaluate the generalization capabilities of a number of
standard models, including ResNet-50, ViT, MLP-Mixer, SimCLR, CLIP, VICReg,
and MAE, finding that all state-of-the-art models perform relatively poorly in this
setting (Section 3.2).
3. We incorporate our proposed Lie operator in two recent SSL approaches: masked
autoencoders (MAE; He et al., 2021), and Variance-Invariance-Covariance