Are All Vision Models Created Equal? A Study of the Open-Loop
to Closed-Loop Causality Gap
Mathias Lechner *1, Ramin Hasani 1, Alexander Amini 1, Tsun-Hsuan Wang 1,
Thomas Henzinger 2, and Daniela Rus 1
1Massachusetts Institute of Technology (MIT)
2Institute of Science and Technology Austria (IST Austria)
There is an ever-growing zoo of modern neural network models that can efficiently learn end-to-end control
from visual observations. These advanced deep models, ranging from convolutional to patch-based networks,
have been extensively tested on offline image classification and regression tasks. In this paper, we study these
vision architectures with respect to the open-loop to closed-loop causality gap, i.e., offline training followed
by an online closed-loop deployment. This causality gap typically emerges in robotics applications such as
autonomous driving, where a network is trained to imitate the control commands of a human. In this setting, two
situations arise: 1) closed-loop testing in-distribution, where the test environment shares properties with those of the offline training data, and 2) closed-loop testing under distribution shifts and out-of-distribution. Contrary to recently reported results, we show that under proper training guidelines, all vision models perform indistinguishably well on in-distribution deployment, resolving the causality gap. In situation 2, we observe that the causality gap disrupts performance regardless of the choice of model architecture. Our results imply that the causality gap can be closed in situation 1 by any modern network architecture trained with our proposed guidelines, whereas achieving out-of-distribution generalization (situation 2) requires further investigation, for instance into data diversity, rather than changes to the model architecture.
1. Introduction
Video demonstration of vision models in self-driving: https://youtu.be/0GxKzv5Ej88
The deployment of end-to-end learning systems in robotics applications is increasing due to their ability to efficiently and automatically learn representations from high-dimensional observations such as visual inputs, mapping perception directly to control without the need for hand-crafted features.
In this space, a tremendous number of advanced deep learning models have been proposed that perform competitively in end-to-end perception-to-control tasks. For example, patch-based vision architectures such as the Vision Transformer (ViT) (Dosovitskiy et al., 2020) have been shown to be competitive with models based on convolutional neural networks (CNNs) (Fukushima and Miyake, 1982; Lechner et al., 2022; LeCun et al., 1989) in computer vision applications for which CNNs were previously the predominant choice. A recent line of research, namely MLP-Mixer (Tolstikhin et al., 2021) and ConvMixer (Trockman and Kolter, 2022), suggests that the strong generalization performance of ViT might be rooted in the patch structure of the inputs rather than the choice of architecture.
There are also works suggesting that self-attention is not crucial in vision Transformers, and that simply a gating projection in multi-layer perceptrons (MLPs) (Liu et al., 2021a) or an unparameterized Fourier transform replacing the self-attention sublayer (Lee-Thorp et al., 2021) can outperform ViT.
*Correspondence E-mail: mlechner@mit.edu
Fig. 1: Online deployment vs. offline training causality gap in perspective. Offline validation loss (lower is better) is plotted against online crash likelihood (lower is better) for all evaluated models (CNN baseline, MobileNetV2, ResNet18/34, EfficientNet, EfficientNet-v2, RegNet-y004/y016, ConvNext, ConvMixer, ViT, Swin, MLP-Mixer, gMLP, and FNet variants). Marker size is linearly proportional to the number of trainable parameters.
These proposals are largely tested in offline settings, where the output decisions of the network do not change the next incoming inputs. In other words, patch-based and mixer models trained offline have not yet been evaluated in a closed loop with an environment in which the network's actions affect the next observations, such as in imitation learning (IL) tasks. IL agents typically suffer from a causality gap arising from the transfer of models from open-loop training to closed-loop testing. In this paper, we investigate this gap systematically.
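To make this distinction concrete, the following sketch (our own illustration with a hypothetical policy and environment interface, not the evaluation code used in this paper) contrasts open-loop evaluation, where observations are replayed from a fixed dataset, with closed-loop evaluation, where each action influences the next observation:

```python
def open_loop_loss(policy, dataset, loss_fn):
    """Open loop: observations are replayed from a fixed dataset, so the
    policy's outputs never influence what it sees next."""
    total = 0.0
    for obs, expert_action in dataset:
        total += loss_fn(policy(obs), expert_action).item()
    return total / len(dataset)

def closed_loop_rollout(policy, env, steps=100):
    """Closed loop: each action is fed back into the (hypothetical) environment,
    so errors compound and can drive the policy out of the training distribution."""
    obs = env.reset()
    for _ in range(steps):
        action = policy(obs)
        obs, done = env.step(action)   # next observation depends on the action
        if done:                       # e.g., the vehicle crashed or left the lane
            break
    return env.summary()               # e.g., crash likelihood or distance driven
```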
In closed-loop testing, we need to be cognizant of two modes: 1) closed-loop testing in-distribution, in which we test networks in environments that share similar properties with the training environment, and 2) closed-loop testing under distribution shifts and out-of-distribution.
Testing models under both settings requires us to ensure a fair and proper evaluation of how effectively different model architectures learn robust perception-to-control policies. To this end, we must validate that all baseline models are trained to their best capability, given the same hyperparameter-optimization budget, under a controlled training pipeline.
To enable a unified evaluation of advanced deep learning models on robot imitation learning tasks, we set out in this paper to design a series of imitation learning pipelines that train models in a controlled and fair setting and test their generalization capability in and out of distribution.
In particular, we design an end-to-end autonomous driving (AD) IL pipeline based on VISTA (Amini et al., 2020), a photorealistic AD simulation platform that synthesizes novel views to test agents in a closed-loop AD environment and assess their closed-loop generalization capabilities (Xiao et al., 2021, 2022).
Counterintuitively, and in contrast to recently reported results (Bai et al., 2021; Naseer et al., 2021; Paul and Chen, 2022), we show that no new architecture is needed to bridge the causality gap between offline training and online in-distribution testing: our controlled training pipeline enables all models to perform remarkably well on the given tasks. For out-of-distribution generalization, however, we observe that the causality gap degrades the performance of models, again almost regardless of the choice of architecture.
These findings suggest rethinking the emphasis placed on the choice of popular models such as Transformers over CNNs, as other factors such as a proper training setup, augmentation strategies, and data diversity play a more important role in generalization both in and out of distribution.
Fig. 2: Visualization of sample observations used in our end-to-end AD experiment, spanning various seasons and times of day: Summer and Winter (in-distribution); Fall, Spring, and Night (out-of-distribution).
To validate our results and ensure our conclusions are not specific to the AD domain, we further extended our experiments (offline training followed by online testing) to more standard visual behavior cloning benchmarks, namely perception-to-control tasks from the Arcade Learning Environment (ALE) (Bellemare et al., 2013). We draw similar conclusions in this setting.
Our contributions are summarized below:
1. The design of a unified end-to-end training infrastructure for a fair comparison of advanced deep learning architectures in robot imitation learning applications.
2. Studying patch-based architectures and advanced CNNs as agents in closed loop with their environments via offline training (imitation) and online testing.
3. Discovering new insights about where patch-based architectures generalize better than CNNs and vice versa in test cases in and out of distribution.
2. Background and Related Works
In this section, we first discuss the image processing architectures studied in this work. We then recapitulate related work on how patch-based computer vision models process information differently from convolutional architectures. Finally, we discuss existing work on bridging the gap between offline training and online generalization.
Patch-based vision architectures.
Motivated by the success of Transformers (Vaswani et al., 2017) on natural language processing (NLP) datasets, Dosovitskiy et al. (2020) introduced the Vision Transformer (ViT) by adapting the architecture to computer vision tasks. As Transformers operate on a 1-dimensional sequence of vectors, Dosovitskiy et al. (2020) proposed converting an image into a sequence by tiling it into patches; each patch is then flattened into a vector by concatenating all its pixel values. Researchers have analyzed the differences in how CNNs and ViTs process images (Raghu et al., 2021). Moreover, it has been claimed that vision transformers are much more robust to image perturbations and occlusions (Naseer et al., 2021), and better able to handle distribution shifts (Bai et al., 2021; Lechner et al., 2021) than CNNs. However, more recent work has refuted the robustness claims of vision transformers (Fu et al., 2021) by showing that ViTs can be less robust than convolutional networks under carefully crafted adversarial attacks.
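As a rough illustration of the patch tokenization described above, the following sketch (a simplified example with illustrative patch size and embedding dimension, not the configuration used in this paper) converts an image tensor into a sequence of patch tokens:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Tile an image into non-overlapping patches and embed each patch as a token.
    Simplified ViT-style tokenization; patch_size and embed_dim are illustrative."""
    def __init__(self, in_channels=3, patch_size=16, embed_dim=192):
        super().__init__()
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D) sequence of patch tokens
        return x

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 192])
```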
Swin Transformer (Liu et al., 2021b) modifies the vision transformer by adding a hierarchical structure to the sequence of patch features. The Swin Transformer applies its attention mechanism not to the full sequence but to a window that is shifted over the entire sequence. With increasing network depth, neighboring windows are merged and pooled into larger, less fine-grained windows. This hierarchical processing allows the model to use smaller patches without exploding its compute and memory footprint.
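A minimal window-partition helper (an illustrative sketch, not the Swin reference implementation) shows how a patch-token feature map can be split into the non-overlapping windows on which attention is then computed:

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into (num_windows*B, window_size**2, C)
    windows; attention is applied within each window rather than over all tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
    return windows

feat = torch.randn(2, 56, 56, 96)          # e.g., 56x56 tokens with 96 channels
print(window_partition(feat, 7).shape)     # torch.Size([128, 49, 96])
```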
MLP-Mixer (Tolstikhin et al., 2021) adapts the idea of vision transformers to map an image to a sequence of patches. This sequence is then processed by alternating plain multi-layer perceptrons (MLPs) over the feature and sequence dimensions, i.e., mixing features and mixing spatial information.
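A Mixer block along these lines might be sketched as follows (illustrative dimensions, not the paper's configuration): one MLP operates across the token (spatial) dimension and another across the channel (feature) dimension, each with a residual connection:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer-style block: token-mixing MLP followed by channel-mixing MLP."""
    def __init__(self, num_tokens, dim, token_hidden=64, channel_hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(          # mixes spatial information
            nn.Linear(num_tokens, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(        # mixes per-token features
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x):                        # x: (B, N, D)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

out = MixerBlock(num_tokens=196, dim=192)(torch.randn(1, 196, 192))
```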
gMLP (Liu et al., 2021a) is another MLP-only vision architecture that differs from the MLP-Mixer by introducing
multiplicative spatial gating units between the alternating spatial and feature MLPs. Empirical results (Liu et al.,
2021a) show that the gMLP has a better accuracy-parameter ratio than the MLP-Mixer.
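The spatial gating unit can be sketched as below (our simplified reading of the mechanism, with illustrative sizes): the channels are split in half, one half is mixed across the token dimension by a learned linear map, and the result multiplicatively gates the other half. In the full gMLP block, surrounding projections restore the channel count.

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Multiplicative spatial gating: one half of the channels gates the other
    after being mixed across the token dimension."""
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.spatial_proj = nn.Linear(num_tokens, num_tokens)  # token mixing

    def forward(self, x):                 # x: (B, N, D)
        u, v = x.chunk(2, dim=-1)         # split channels into two halves
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v                      # gate one half with the mixed half

out = SpatialGatingUnit(dim=256, num_tokens=196)(torch.randn(1, 196, 256))
print(out.shape)  # torch.Size([1, 196, 128])
```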
FNet (Lee-Thorp et al., 2021) replaces the learnable spatial-mixing MLP of the MLP-Mixer architecture with a fixed mixing step. In particular, a parameter-free 2-dimensional Fourier transform is applied over the sequence and feature dimensions of the input. Although the authors (Lee-Thorp et al., 2021) did not evaluate the model on vision tasks, FNet's similarity to patch-based MLP architectures makes it a natural candidate for them.
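Such a parameter-free mixing step can be sketched as follows (illustrative only): a 2D Fourier transform over the sequence and feature dimensions, keeping the real part, in place of the learned spatial-mixing MLP:

```python
import torch

def fourier_mix(x):
    """Parameter-free token/feature mixing: 2D FFT over the sequence and
    feature dimensions, keeping only the real part (as in FNet)."""
    return torch.fft.fft2(x, dim=(-2, -1)).real   # x: (B, N, D) -> (B, N, D)

mixed = fourier_mix(torch.randn(1, 196, 192))
```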
ConvMixer (Trockman and Kolter, 2022) replaces the MLPs of the MLP-Mixer architecture with alternating depth-wise and point-wise convolutions. While an MLP mixes all entries of the spatial or feature dimension, the convolutions of the ConvMixer mix only local information, e.g., the kernel size was set to 9 in (Trockman and Kolter, 2022). The authors claim that a large part of the performance of MLP and vision transformer models can be attributed to the patch-based processing rather than the type of mixing operation (Trockman and Kolter, 2022).
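A ConvMixer-style block can be sketched as below (an illustrative example using the kernel size of 9 cited above, but not necessarily the exact configuration evaluated in this paper): a depth-wise convolution mixes nearby spatial locations on the patch grid and a point-wise (1x1) convolution mixes channels:

```python
import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    """One ConvMixer-style block on a patch-grid feature map (B, D, h, w):
    a depth-wise convolution mixes nearby spatial locations, a point-wise
    (1x1) convolution mixes channels."""
    def __init__(self, dim=256, kernel_size=9):
        super().__init__()
        self.depthwise = nn.Conv2d(dim, dim, kernel_size,
                                   groups=dim, padding="same")
        self.pointwise = nn.Conv2d(dim, dim, kernel_size=1)
        self.act = nn.GELU()
        self.bn1 = nn.BatchNorm2d(dim)
        self.bn2 = nn.BatchNorm2d(dim)

    def forward(self, x):
        x = x + self.bn1(self.act(self.depthwise(x)))   # local spatial mixing (residual)
        x = self.bn2(self.act(self.pointwise(x)))       # channel mixing
        return x

out = ConvMixerBlock()(torch.randn(1, 256, 14, 14))
```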
Advanced convolutional architectures. Here, we briefly discuss modern variants of CNN architectures.
ResNet (He et al., 2016) adds skip connections that bypass the convolutional layers. This simple modification allows training much deeper networks than a purely sequential composition of layers. Consequently, skip connections can be found in virtually every modern neural network architecture, including patch-based and advanced convolutional models.
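A basic residual block illustrating the skip connection (a generic sketch, not a specific variant evaluated in this paper) can be written as:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection that bypasses them."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection: identity added to the residual branch

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))
```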
MobileNetV2 (Sandler et al., 2018) replaces standard convolution operations with depth-wise separable convolutions that process the spatial and channel dimensions separately. The resulting network requires fewer floating-point operations, which is beneficial for mobile and embedded applications.
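The factorization into depth-wise and point-wise convolutions can be sketched as below (illustrative channel counts; this is only the separable convolution, not MobileNetV2's full inverted-residual block):

```python
import torch
import torch.nn as nn

# Depth-wise separable convolution: a per-channel spatial convolution followed
# by a 1x1 convolution that mixes channels. Illustrative sketch only.
depthwise_separable = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32, bias=False),  # spatial, per channel
    nn.BatchNorm2d(32),
    nn.ReLU6(inplace=True),
    nn.Conv2d(32, 64, kernel_size=1, bias=False),                        # channel mixing
    nn.BatchNorm2d(64),
)

y = depthwise_separable(torch.randn(1, 32, 56, 56))   # -> (1, 64, 56, 56)
```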
EfficientNet (Tan and Le, 2019) is an efficient convolutional neural network architecture derived from an
automated neural architecture search. The objective of the search is to find a network topology that achieves
high performance while simultaneously running efficiently on CPU devices.
EfficientNet-v2 addresses the issue that EfficientNets, despite their efficient CPU inference, can be slower than other architecture types on GPUs during training and inference.
RegNet (Radosavovic et al., 2020) is a neural network family that systematically explores the design space of
previously proposed advances in neural network design. The RegNet-Y subfamily specifically scales the width
of the network linearly with depth and comprises squeeze-and-excitation blocks.
ConvNext (Liu et al., 2022) is a network that folds many recent advances in the design of vision architectures into standard ResNets, including better activation functions, replacing batch normalization with layer normalization, and larger kernel sizes.
Imitation learning (IL).
IL describes learning an agent from expert demonstrations consisting of observation-action pairs (Schaal, 1999), either directly via behavior cloning (Ho and Ermon, 2016) or indirectly via inverse reinforcement learning (Ng and Russell, 2000). When IL agents are deployed online, they most often deviate from the expert demonstrations, leading to compounding errors and incorrect inference. Numerous works have tried to address this problem by adding augmentation techniques that collect data from the cloned model in closed-loop settings.
This includes methods such as DAgger (Ross and Bagnell, 2010; Ross et al., 2011), state-aware imitation (Desai
et al., 2020; Le et al., 2018; Schroecker and Isbell, 2017), pre-trained policies through meta-learning (Duan et al.,
2017; Yu et al., 2018), min-max optimization schemes (Baram et al., 2017; Ho and Ermon, 2016; Sun and Ma,
2019; Wu et al., 2019), using insights from causal inference (Janner et al., 2021; Ortega et al., 2021), and using a