
Are All Vision Models Created Equal? A Study of the Open-Loop to Closed-Loop Causality Gap
MLP-Mixer (Tolstikhin et al., 2021) adapts the idea of vision transformers to map an image to a sequence of
patches. This sequence is then processed by alternating plain multi-layer perceptrons (MLPs) over the feature and
the sequence dimensions, i.e., mixing features and mixing spatial information.
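As an illustration, one Mixer block can be sketched in NumPy as follows. This is a simplified sketch, not the reference implementation: layer normalization and the GELU activation of the original model are omitted (ReLU is used instead), and all weights are random placeholders.

```python
import numpy as np

def mlp(x, w1, w2):
    # two-layer perceptron; ReLU stands in for the GELU of the original model
    return np.maximum(x @ w1, 0) @ w2

def mixer_block(tokens, w_tok1, w_tok2, w_ch1, w_ch2):
    """One Mixer block on tokens of shape (num_patches, channels).
    LayerNorm is omitted for brevity."""
    # token mixing: the MLP acts across the patch (sequence) dimension
    tokens = tokens + mlp(tokens.T, w_tok1, w_tok2).T
    # channel mixing: the MLP acts across the feature dimension
    tokens = tokens + mlp(tokens, w_ch1, w_ch2)
    return tokens

# toy example: 16 patches with 8 channels each
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
out = mixer_block(
    x,
    rng.normal(size=(16, 32)), rng.normal(size=(32, 16)),  # token-mixing MLP
    rng.normal(size=(8, 24)), rng.normal(size=(24, 8)),    # channel-mixing MLP
)
```

The transposes around the token-mixing MLP are what alternate the mixing between the sequence and the feature dimension.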
gMLP (Liu et al., 2021a) is another MLP-only vision architecture that differs from the MLP-Mixer by introducing
multiplicative spatial gating units between the alternating spatial and feature MLPs. Empirical results (Liu et al.,
2021a) show that the gMLP has a better accuracy-parameter ratio than the MLP-Mixer.
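The spatial gating unit can be sketched as follows, under simplifying assumptions: the layer normalization applied to the gated half in the original model is omitted, and the weights are random placeholders (with the near-zero weight and unit bias initialization described in the paper).

```python
import numpy as np

def spatial_gating_unit(x, w_spatial, b_spatial):
    """gMLP's spatial gating unit (sketch): split the channels in half,
    spatially project one half, and use it to gate the other half
    multiplicatively. x: (seq_len, channels)."""
    u, v = np.split(x, 2, axis=-1)    # split along the feature dimension
    v = w_spatial @ v + b_spatial     # linear projection over the sequence dimension
    return u * v                      # element-wise (multiplicative) gating

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                       # 16 tokens, 8 channels
w = rng.normal(size=(16, 16)) * 0.01               # near-zero initialization
out = spatial_gating_unit(x, w, np.ones((16, 1)))  # bias of 1 keeps v close to 1
```

With this initialization the unit starts out close to the identity on `u`, and the spatial interactions are learned gradually.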
FNet (Lee-Thorp et al., 2021) replaces the learnable spatial mixing MLP of the MLP-Mixer architecture by a
fixed mixing step. In particular, a parameter-free 2-dimensional Fourier transform is applied over the sequence
and feature dimensions of the input. Although the authors (Lee-Thorp et al., 2021) did not evaluate the model
for vision tasks, FNet’s similarity to patch-based MLP architectures makes it a natural candidate for vision tasks.
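Because the mixing step is parameter-free, it reduces to a single transform call; a minimal sketch (keeping the real part of the 2-D FFT, as in the original formulation):

```python
import numpy as np

def fnet_mixing(x):
    """FNet's mixing step (sketch): a parameter-free 2-D Fourier transform
    over the sequence and feature dimensions; only the real part is kept.
    x: (seq_len, channels)."""
    return np.fft.fft2(x).real

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
out = fnet_mixing(x)  # same shape as the input, no learnable parameters
```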
ConvMixer (Trockman and Kolter, 2022) replaces the MLPs of the MLP-Mixer architecture by alternating depth-
wise and point-wise convolutions. While an MLP mixes all entries of the spatial and feature dimensions, the
convolutions of the ConvMixer mix only local information, e.g., the kernel size was set to 9 in (Trockman and Kolter,
2022). The authors claim that a large part of the performance of MLP and vision-transformer models can be attributed
to the patch-based processing rather than the type of mixing operation (Trockman and Kolter, 2022).
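A ConvMixer block can be sketched as below. This is a simplified sketch: the activations and batch normalization of the original block are omitted, the weights are random placeholders, and the depthwise convolution is written as an explicit loop for clarity rather than speed.

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Per-channel ('depthwise') 2-D convolution with 'same' zero padding.
    x: (H, W, C), kernels: (k, k, C)."""
    k = kernels.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    H, W, _ = x.shape
    for i in range(H):
        for j in range(W):
            # each channel is convolved only with its own k x k filter
            out[i, j] = np.einsum('abc,abc->c', xp[i:i + k, j:j + k], kernels)
    return out

def pointwise_conv(x, w):
    """1x1 convolution = a linear map over the channel dimension. w: (C_in, C_out)."""
    return x @ w

def convmixer_block(x, dw_kernels, pw_weights):
    # depthwise convolution mixes local spatial information (with a residual),
    # then the pointwise convolution mixes channel information
    x = x + depthwise_conv(x, dw_kernels)
    return pointwise_conv(x, pw_weights)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))  # an 8x8 grid of patches with 4 channels
out = convmixer_block(x, rng.normal(size=(9, 9, 4)), rng.normal(size=(4, 4)))
```

The kernel size of 9 matches the setting reported in (Trockman and Kolter, 2022); unlike an MLP over the full sequence, each output location here only sees a 9x9 neighborhood.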
Advanced convolutional architectures. Here, we briefly discuss modern variants of CNN architectures.
ResNet (He et al., 2016) adds skip connections that bypass the convolutional layers. This simple modification
allows training much deeper networks than a pure sequential composition of layers. Consequently, skip
connections can be found in virtually every modern neural network architecture, including patch-based and
advanced convolutional models.
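The mechanism itself is a one-liner; a minimal sketch with a placeholder inner function and random weights:

```python
import numpy as np

def residual_block(x, f):
    """A skip connection: compute f(x) and add the input back, so the
    layers only need to learn a residual correction to the identity."""
    return x + f(x)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)) * 0.01
x = rng.normal(size=(8,))
# with near-zero weights the block is close to the identity map, which is
# part of why very deep stacks of such blocks remain trainable
out = residual_block(x, lambda v: np.maximum(w @ v, 0))
```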
MobileNetV2 (Sandler et al., 2018) replaces the standard convolution operations by depth-wise separable con-
volutions that process the spatial and channel dimensions separately. The resulting network requires fewer
floating-point operations, which is beneficial for mobile and embedded applications.
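The saving can be made concrete with a back-of-the-envelope operation count (the layer sizes below are illustrative, not taken from the MobileNetV2 architecture):

```python
# multiply-accumulate cost of a standard vs. a depthwise separable convolution
# on an H x W feature map with c_in input / c_out output channels, kernel k
def standard_conv_ops(h, w, c_in, c_out, k):
    return h * w * c_in * c_out * k * k

def separable_conv_ops(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixes the channels
    return depthwise + pointwise

std = standard_conv_ops(56, 56, 64, 128, 3)
sep = separable_conv_ops(56, 56, 64, 128, 3)
# the factorization costs a fraction 1/k^2 + 1/c_out of the standard convolution
```

For a 3x3 kernel the separable variant needs roughly an order of magnitude fewer operations, which is the source of the efficiency gain on mobile hardware.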
EfficientNet (Tan and Le, 2019) is an efficient convolutional neural network architecture derived from an
automated neural architecture search. The objective of the search is to find a network topology that achieves
high performance while simultaneously running efficiently on CPU devices.
EfficientNet-v2 (Tan and Le, 2021) addresses a shortcoming of EfficientNets: despite their efficient CPU inference,
they can be slower than existing architecture types on GPUs during training and inference.
RegNet (Radosavovic et al., 2020) is a neural network family that systematically explores the design space of
previously proposed advances in neural network design. The RegNet-Y subfamily specifically scales the width
of the network linearly with depth and comprises squeeze-and-excitation blocks.
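A squeeze-and-excitation block, as used in the RegNet-Y subfamily, can be sketched as follows (a simplified sketch with random placeholder weights; biases are omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, w1, w2):
    """Squeeze-and-excitation (sketch): global-average-pool the spatial
    dimensions ('squeeze'), pass through a small bottleneck MLP ('excite'),
    and rescale each channel by the resulting gate. x: (H, W, C)."""
    s = x.mean(axis=(0, 1))                      # squeeze: (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ s, 0))   # per-channel gate in (0, 1)
    return x * gate                              # broadcast over channels

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
w1 = rng.normal(size=(4, 16))   # bottleneck reduction to 4 hidden units
w2 = rng.normal(size=(16, 4))
out = squeeze_excite(x, w1, w2)
```

Since each gate lies in (0, 1), the block can only attenuate channels, letting the network reweight feature maps based on global context.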
ConvNext (Liu et al., 2022) is a network that incorporates many recent advances in the design of vision architectures
into standard ResNets, including better activation functions, replacing batch normalization by layer normalization,
and larger kernel sizes.
Imitation learning (IL).
IL describes learning an agent from expert demonstrations consisting of observation-action
pairs (Schaal, 1999), either directly via behavior cloning (Ho and Ermon, 2016) or indirectly via inverse reinforcement
learning (Ng and Russell, 2000). When IL agents are deployed online, they most often deviate from the expert
demonstrations, leading to compounding errors and incorrect inference. Numerous works have tried to address
this problem by adding augmentation techniques that collect data from the cloned model in closed-loop settings.
This includes methods such as DAgger (Ross and Bagnell, 2010; Ross et al., 2011), state-aware imitation (Desai
et al., 2020; Le et al., 2018; Schroecker and Isbell, 2017), pre-trained policies through meta-learning (Duan et al.,
2017; Yu et al., 2018), min-max optimization schemes (Baram et al., 2017; Ho and Ermon, 2016; Sun and Ma,
2019; Wu et al., 2019), using insights from causal inference (Janner et al., 2021; Ortega et al., 2021), and using a