
Are All Vision Models Created Equal? A Study of the Open-Loop to Closed-Loop Causality Gap
MLP-Mixer (Tolstikhin et al., 2021) adapts the idea of vision transformers to map an image to a sequence of
patches. This sequence is then processed by alternating plain multi-layer perceptrons (MLPs) over the feature and
the sequence dimensions, i.e., mixing features and mixing spatial information.
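As an illustration, one Mixer block can be sketched in NumPy as follows. This is a simplified sketch, not the reference implementation: layer normalization and the GELU activation of the original model are omitted (ReLU is used instead), and all weights are random placeholders.

```python
import numpy as np

def mlp(x, w1, w2):
    # two-layer perceptron; ReLU stands in for the GELU of the original model
    return np.maximum(x @ w1, 0) @ w2

def mixer_block(tokens, w_tok1, w_tok2, w_ch1, w_ch2):
    """One Mixer block on tokens of shape (num_patches, channels).
    LayerNorm is omitted for brevity."""
    # token mixing: the MLP acts across the patch (sequence) dimension
    tokens = tokens + mlp(tokens.T, w_tok1, w_tok2).T
    # channel mixing: the MLP acts across the feature dimension
    tokens = tokens + mlp(tokens, w_ch1, w_ch2)
    return tokens

# toy example: 16 patches with 8 channels each
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
out = mixer_block(
    x,
    rng.normal(size=(16, 32)), rng.normal(size=(32, 16)),  # token-mixing MLP
    rng.normal(size=(8, 24)), rng.normal(size=(24, 8)),    # channel-mixing MLP
)
```

The transposes around the token-mixing MLP are what alternate the mixing between the sequence and the feature dimension.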
gMLP (Liu et al., 2021a) is another MLP-only vision architecture that differs from the MLP-Mixer by introducing
multiplicative spatial gating units between the alternating spatial and feature MLPs. Empirical results (Liu et al.,
2021a) show that the gMLP has a better accuracy-parameter ratio than the MLP-Mixer.
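The spatial gating unit can be sketched as follows, under simplifying assumptions: the layer normalization applied to the gated half in the original model is omitted, and the weights are random placeholders (with the near-zero weight and unit bias initialization described in the paper).

```python
import numpy as np

def spatial_gating_unit(x, w_spatial, b_spatial):
    """gMLP's spatial gating unit (sketch): split the channels in half,
    spatially project one half, and use it to gate the other half
    multiplicatively. x: (seq_len, channels)."""
    u, v = np.split(x, 2, axis=-1)    # split along the feature dimension
    v = w_spatial @ v + b_spatial     # linear projection over the sequence dimension
    return u * v                      # element-wise (multiplicative) gating

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                       # 16 tokens, 8 channels
w = rng.normal(size=(16, 16)) * 0.01               # near-zero initialization
out = spatial_gating_unit(x, w, np.ones((16, 1)))  # bias of 1 keeps v close to 1
```

With this initialization the unit starts out close to the identity on `u`, and the spatial interactions are learned gradually.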
FNet (Lee-Thorp et al., 2021) replaces the learnable spatial mixing MLP of the MLP-Mixer architecture by a
fixed mixing step. In particular, a parameter-free 2-dimensional Fourier transform is applied over the sequence
and feature dimensions of the input. Although the authors (Lee-Thorp et al., 2021) did not evaluate the model
for vision tasks, FNet’s similarity to patch-based MLP architectures makes it a natural candidate for vision tasks.
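Because the mixing step is parameter-free, it reduces to a single transform call; a minimal sketch (keeping the real part of the 2-D FFT, as in the original formulation):

```python
import numpy as np

def fnet_mixing(x):
    """FNet's mixing step (sketch): a parameter-free 2-D Fourier transform
    over the sequence and feature dimensions; only the real part is kept.
    x: (seq_len, channels)."""
    return np.fft.fft2(x).real

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
out = fnet_mixing(x)  # same shape as the input, no learnable parameters
```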
ConvMixer (Trockman and Kolter, 2022) replaces the MLPs of the MLP-Mixer architecture by alternating depth-
wise and point-wise convolutions. While an MLP mixes all entries of the spatial and feature dimensions, the
convolutions of the ConvMixer mix only local information, e.g., the kernel size was set to 9 in (Trockman and Kolter,
2022). The authors claim that a large part of the performance of MLP and vision-transformer models can be attributed
to the patch-based processing rather than the type of mixing operation (Trockman and Kolter, 2022).
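A ConvMixer block can be sketched as below. This is a simplified sketch: the activations and batch normalization of the original block are omitted, the weights are random placeholders, and the depthwise convolution is written as an explicit loop for clarity rather than speed.

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Per-channel ('depthwise') 2-D convolution with 'same' zero padding.
    x: (H, W, C), kernels: (k, k, C)."""
    k = kernels.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    H, W, _ = x.shape
    for i in range(H):
        for j in range(W):
            # each channel is convolved only with its own k x k filter
            out[i, j] = np.einsum('abc,abc->c', xp[i:i + k, j:j + k], kernels)
    return out

def pointwise_conv(x, w):
    """1x1 convolution = a linear map over the channel dimension. w: (C_in, C_out)."""
    return x @ w

def convmixer_block(x, dw_kernels, pw_weights):
    # depthwise convolution mixes local spatial information (with a residual),
    # then the pointwise convolution mixes channel information
    x = x + depthwise_conv(x, dw_kernels)
    return pointwise_conv(x, pw_weights)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))  # an 8x8 grid of patches with 4 channels
out = convmixer_block(x, rng.normal(size=(9, 9, 4)), rng.normal(size=(4, 4)))
```

The kernel size of 9 matches the setting reported in (Trockman and Kolter, 2022); unlike an MLP over the full sequence, each output location here only sees a 9x9 neighborhood.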
Advanced convolutional architectures. Here, we briefly discuss modern variants of CNN architectures.
ResNet (He et al., 2016) adds skip connections that bypass the convolutional layers. This simple modification
allows training much deeper networks than a pure sequential composition of layers. Consequently, skip
connections can be found in virtually every modern neural network architecture, including patch-based and
advanced convolutional models.
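The mechanism itself is a one-liner; a minimal sketch with a placeholder inner function and random weights:

```python
import numpy as np

def residual_block(x, f):
    """A skip connection: compute f(x) and add the input back, so the
    layers only need to learn a residual correction to the identity."""
    return x + f(x)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)) * 0.01
x = rng.normal(size=(8,))
# with near-zero weights the block is close to the identity map, which is
# part of why very deep stacks of such blocks remain trainable
out = residual_block(x, lambda v: np.maximum(w @ v, 0))
```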
MobileNetV2 (Sandler et al., 2018) replaces the standard convolution operations by depth-wise separable con-
volutions that process the spatial and channel dimensions separately. The resulting network requires fewer
floating-point operations, which is beneficial for mobile and embedded applications.
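The saving can be made concrete with a back-of-the-envelope operation count (the layer sizes below are illustrative, not taken from the MobileNetV2 architecture):

```python
# multiply-accumulate cost of a standard vs. a depthwise separable convolution
# on an H x W feature map with c_in input / c_out output channels, kernel k
def standard_conv_ops(h, w, c_in, c_out, k):
    return h * w * c_in * c_out * k * k

def separable_conv_ops(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixes the channels
    return depthwise + pointwise

std = standard_conv_ops(56, 56, 64, 128, 3)
sep = separable_conv_ops(56, 56, 64, 128, 3)
# the factorization costs a fraction 1/k^2 + 1/c_out of the standard convolution
```

For a 3x3 kernel the separable variant needs roughly an order of magnitude fewer operations, which is the source of the efficiency gain on mobile hardware.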
EfficientNet (Tan and Le, 2019) is an efficient convolutional neural network architecture derived from an
automated neural architecture search. The objective of the search is to find a network topology that achieves
high performance while simultaneously running efficiently on CPU devices.
EfficientNet-v2 (Tan and Le, 2021) addresses a shortcoming of EfficientNets: despite their efficient CPU inference,
they can be slower than existing architecture types on GPUs during training and inference.
RegNet (Radosavovic et al., 2020) is a neural network family that systematically explores the design space of
previously proposed advances in neural network design. The RegNet-Y subfamily specifically scales the width
of the network linearly with depth and comprises squeeze-and-excitation blocks.
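A squeeze-and-excitation block, as used in the RegNet-Y subfamily, can be sketched as follows (a simplified sketch with random placeholder weights; biases are omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, w1, w2):
    """Squeeze-and-excitation (sketch): global-average-pool the spatial
    dimensions ('squeeze'), pass through a small bottleneck MLP ('excite'),
    and rescale each channel by the resulting gate. x: (H, W, C)."""
    s = x.mean(axis=(0, 1))                      # squeeze: (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ s, 0))   # per-channel gate in (0, 1)
    return x * gate                              # broadcast over channels

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
w1 = rng.normal(size=(4, 16))   # bottleneck reduction to 4 hidden units
w2 = rng.normal(size=(16, 4))
out = squeeze_excite(x, w1, w2)
```

Since each gate lies in (0, 1), the block can only attenuate channels, letting the network reweight feature maps based on global context.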
ConvNext (Liu et al., 2022) is a network that incorporates many recent advances in the design of vision architectures
into standard ResNets, including better activation functions, replacing batch normalization by layer normalization,
and larger kernel sizes.
Imitation learning (IL).
IL describes learning an agent from expert demonstrations consisting of observation-action
pairs (Schaal, 1999), either directly via behavior cloning (Ho and Ermon, 2016) or indirectly via inverse reinforcement
learning (Ng and Russell, 2000). When IL agents are deployed online, they most often deviate from the expert
demonstrations, leading to compounding errors and incorrect inference. Numerous works have tried to address
this problem by adding augmentation techniques that collect data from the cloned model in closed-loop settings.
This includes methods such as DAgger (Ross and Bagnell, 2010; Ross et al., 2011), state-aware imitation (Desai
et al., 2020; Le et al., 2018; Schroecker and Isbell, 2017), pre-trained policies through meta-learning (Duan et al.,
2017; Yu et al., 2018), min-max optimization schemes (Baram et al., 2017; Ho and Ermon, 2016; Sun and Ma,
2019; Wu et al., 2019), using insights from causal inference (Janner et al., 2021; Ortega et al., 2021), and using a