Context-Enhanced Stereo Transformer

Weiyu Guo1,2, Zhaoshuo Li3, Yongkui Yang1, Zheng Wang1, Russell H. Taylor3, Mathias Unberath3, Alan Yuille3, and Yingwei Li3

1Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

2University of Chinese Academy of Sciences, Beijing, China

3Johns Hopkins University, Baltimore, USA

{wy.guo,yk.yang,zheng.wang}@siat.ac.cn, {zli122, yingwei.li}@jhu.edu

Abstract. Stereo depth estimation is of great interest for computer vision research. However, existing methods struggle to generalize and predict reliably in hazardous regions, such as large uniform regions. To overcome these limitations, we propose Context Enhanced Path (CEP). CEP improves generalization and robustness against common failure cases in existing solutions by capturing long-range global information. We construct our stereo depth estimation model, Context Enhanced Stereo Transformer (CSTR), by plugging CEP into the state-of-the-art stereo depth estimation method Stereo Transformer. CSTR is examined on distinct public datasets, such as Scene Flow, Middlebury-2014, KITTI-2015, and MPI-Sintel. We find that CSTR outperforms prior approaches by a large margin. For example, in the zero-shot synthetic-to-real setting, CSTR outperforms the best competing approaches on the Middlebury-2014 dataset by 11%. Our extensive experiments demonstrate that long-range information is critical for the stereo matching task and that CEP successfully captures such information†.

Keywords: Stereo depth estimation, transformer, context extraction

1 Introduction

Stereo depth estimation is a critical task in computer vision that has been widely used in various fields, such as robotics [27], autonomous driving [24], and 3D scene reconstruction [29]. Recent learning-based stereo disparity estimation algorithms generally use techniques restricted to local information for matching feature patterns between the left and right images. For example, prior works [2,8,36] construct a cost volume with a pre-defined disparity range and use 3D convolutions to process the cost volume, limiting themselves to the receptive field of the convolution kernel. Xu et al. [32] proposed to instead process the cost volume using 2D convolutions, but this approach faces the same challenge. Recently, approaches that attempt to capture more global information have been proposed. For example, STTR [17] and RAFT-Stereo [19] compute attention or correlation between all pixels of the left and right images on the same epipolar lines. However, they all fail to take advantage of cross-epipolar line information, which is a critical component of global information processing. Thus, as shown in Figure 1, these methods cannot address hazardous regions such as textureless or large uniform regions, specularity, and transparency [15,37], where it is particularly challenging for stereo algorithms to produce reliable estimates. The features of the left and right frames in these regions are often similar or misleading, which makes feature matching ambiguous [37]. If the disparities of these regions cannot be reliably predicted, downstream applications, such as 3D object detection [28], may be severely impacted by missing or wrong predictions. Therefore, in this paper, we seek to answer this critical question: how can we guide stereo models to properly handle these hazardous regions?

†Code available at: github.com/guoweiyu/Context-Enhanced-Stereo-Transformer

arXiv:2210.11719v1 [cs.CV] 21 Oct 2022

To address this question, we hypothesize that long-range contextual information helps to improve predictions in hazardous regions. For example, as shown in Figure 1 (a), previous work performs unreliably on a large white wall. However, if we could use the global information of the house (e.g., orientation and edge information), the prediction could be improved. Such global context information should, in theory, inform the model about the geometry on a global scale and guide the model to resolve ambiguity in prediction. To this end, we propose a plug-in module, called Context Enhanced Path (CEP), which helps stereo matching models to better understand the global structure of hazardous regions. Compared to existing methods, CEP offers the following three unique advantages: (1) strong generalization ability: compared with previous methods [2,17], CEP shows strong results on unseen real-world data even when trained only on synthetic data; (2) robustness against hazardous regions, thanks to modeling long-range contextual information; (3) generality: unlike [9,14,34], our method serves as a plug-in that can potentially be applied to most stereo matching methods. We construct our stereo depth estimation model based on CEP, namely Context Enhanced Stereo Transformer (CSTR). We have examined CSTR on several popular and diverse datasets, such as Middlebury-2014 [26], KITTI-2015 [24], and MPI-Sintel [1]. Our extensive experiments demonstrate that (1) long-range information is critical for stereo depth estimation, (2) CSTR attains strong generalization ability, and (3) more importantly, CSTR can better handle hazardous regions, such as texturelessness and disparity jumps (shown in Figure 1 and Table 3). This result is attributed to our simple yet powerful observation: using long-range contextual information to better understand the global structure of the image can significantly help stereo depth estimation, especially in hazardous areas. This result suggests that modeling long-range context information is critical for building a robust and generalizable stereo depth estimation algorithm.
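The gap between matching along epipolar lines and using cross-epipolar-line context, which motivates CEP, can be illustrated with a toy example. The following is a minimal NumPy sketch and not the STTR or CSTR implementation: the function name `epipolar_attention`, the feature shapes, and the random features are all assumptions made for illustration. It computes attention only between pixels on the same image row of a rectified pair, then checks that changing one row leaves the attention weights of another row untouched.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def epipolar_attention(left_feat, right_feat):
    """Row-wise cross-attention between rectified left/right feature maps.

    left_feat, right_feat: arrays of shape (H, W, C) (hypothetical features).
    Returns weights of shape (H, W, W): for each row h and each left pixel x,
    a distribution over the right pixels y on the same row h.
    """
    c = left_feat.shape[-1]
    # Scores are computed independently for every row h (one epipolar line each).
    scores = np.einsum("hxc,hyc->hxy", left_feat, right_feat) / np.sqrt(c)
    return softmax(scores, axis=-1)


rng = np.random.default_rng(0)
left = rng.standard_normal((4, 6, 8))
right = rng.standard_normal((4, 6, 8))
attn = epipolar_attention(left, right)

# Perturb only row 3 of both feature maps and recompute.
left_perturbed, right_perturbed = left.copy(), right.copy()
left_perturbed[3] += 10.0
right_perturbed[3] -= 10.0
attn_perturbed = epipolar_attention(left_perturbed, right_perturbed)

# Row 0 never sees row 3, so its attention weights are unchanged.
rows_independent = np.allclose(attn[0], attn_perturbed[0])
```

In this sketch each row is matched in isolation, so evidence from other rows (for example, the edges bounding a textureless wall) cannot influence the result; that is the kind of information a context path would need to supply.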

To summarize, our contributions are three-fold: (1) we find that global contextual information is critical for stereo depth estimation; (2) we design a plug-in module, Context Enhanced Path (CEP), for generic stereo depth estimation models; (3) we integrate our plug-in module and build a stereo matching model named Context Enhanced Stereo Transformer (CSTR), which achieves state-of-the-art generalization results on several popular datasets, including Middlebury-2014 [26], KITTI-2015 [24], and MPI-Sintel [1].

[Figure 1: panels (a) to (e); rows labeled STTR and CSTR]

Fig. 1. Sample visualizations of hazardous regions taken from the KITTI-2015 and Middlebury-2014 datasets. The first row shows the input left images. The second row shows the disparity predicted by Stereo Transformer (STTR) [17]. The third row shows the disparity predicted by our proposed Context Enhanced Stereo Transformer (CSTR). The color map shown on the right is based on the disparity value relative to the image width.

Fig. 2. Examples of hazardous regions including: (a) Texturelessness: the wall and the ceiling in the room; (b) Specularity: the screen of a TV; (c) Transparency: the sliding door; (d) Disparity jumps: objects such as bamboos, fences and plants give frequent disparity discontinuities. Images are from Zhang et al. [37].

2 Related Work

Rectified stereo depth estimation obtains per-pixel depth from the left and right frames provided by a binocular camera. It has a wide range of applications in robotics, autonomous driving, scene understanding, 3D modeling, etc. In contrast to the success of deep learning in many high-level vision problems, deep learning algorithms for low-level vision tasks are still in their early stages [15]. In the field of stereo depth estimation, many works aim to improve a single step of the classical pipeline by replacing it with a deep learning module [15,38], where the quality of the cost volume directly determines the accuracy of the disparity map. Chen et al. proposed Deep Embed to learn a cost function from different windows by processing multiple patches at different resolutions [4]. After cost volume computation, cost aggregation is essential for gathering large context information from the huge cost volume. One of the most popular cost aggregation techniques is Semiglobal Matching (SGM) [12]. It defines a global energy function over the disparity map and minimizes this function to solve for the optimal disparity of each pixel. The raw disparity map should then be refined by a post-processing algorithm.
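For reference, the energy minimized by SGM is commonly written as below. The notation follows the standard formulation in the SGM literature rather than this paper, so every symbol here is an assumption relative to the text: $D$ is the disparity map, $C(p, D_p)$ the matching cost of pixel $p$ at disparity $D_p$, $N_p$ the neighborhood of $p$, $T[\cdot]$ an indicator function, and $P_1 \le P_2$ the penalties for small and large disparity changes between neighbors.

```latex
E(D) = \sum_{p} \Big( C(p, D_p)
       + \sum_{q \in N_p} P_1 \, T\big[\, |D_p - D_q| = 1 \,\big]
       + \sum_{q \in N_p} P_2 \, T\big[\, |D_p - D_q| > 1 \,\big] \Big)
```

The first term rewards good per-pixel matches, while the two penalty terms encode the smoothness assumption: small disparity changes are penalized mildly and larger jumps more strongly.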

Although several challenges remain, end-to-end deep learning has recently begun to be used in binocular stereo depth estimation and dominates dense disparity estimation on several well-known benchmarks. In order to keep memory feasible and inference speed manageable, many researchers adopt 2D convolution-based methods. These architectures always contain a custom-designed layer, namely a correlation layer, in charge of computing correlation scores between left and right features. Mayer et al. proposed an encoder-decoder architecture based on U-Net named DispNet [23]. Some researchers adopt 3D convolutions in stereo matching, which take a 4D tensor (disparity range, height, width, feature) as the input and directly process a matching volume-like representation. Chang et al. proposed the Pyramidal Stereo Matching network (PSMNet) to integrate Spatial Pyramidal Pooling (SPP) layers in the feature extractor [2]. However, these methods lead to large computational costs, such as huge memory consumption and low inference speed. Besides, the disparity range of conventional methods is limited, preventing them from being used in many cases where the scene is close to the camera. Recently, Li et al. used a sequence-to-sequence perspective to replace cost volume construction with dense pixel matching [17]. Lipson et al. unified stereo and optical flow approaches and utilized a GRU to iteratively generate the final disparity map [19]. Others [16,18] exploit auxiliary information for depth estimation. However, stereo depth estimation is still limited by difficulties such as textureless surfaces, disparity jumps, and occlusions.
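To make the correlation layer and the fixed disparity range more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions and not the code of DispNet, PSMNet, or any other cited method: the function name `correlation_cost_volume`, the dot-product score, the unit-normalized random features, and the winner-take-all readout are all chosen here for simplicity.

```python
import numpy as np


def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Correlation scores for the pre-defined disparities 0 .. max_disp - 1.

    left_feat, right_feat: (H, W, C) feature maps of a rectified pair.
    Returns an array of shape (max_disp, H, W); entry [d, y, x] is the dot
    product of the left feature at (y, x) and the right feature at (y, x - d).
    Positions without a valid match (x < d) are filled with -inf.
    """
    h, w, _ = left_feat.shape
    volume = np.full((max_disp, h, w), -np.inf)
    for d in range(max_disp):
        # Left pixel x corresponds to right pixel x - d, so only x >= d is valid.
        volume[d, :, d:] = np.sum(left_feat[:, d:] * right_feat[:, : w - d], axis=-1)
    return volume


# Toy pair: the left features are the right features shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.standard_normal((5, 32, 16))
right /= np.linalg.norm(right, axis=-1, keepdims=True)  # unit-length features
true_disp = 3
left = np.roll(right, true_disp, axis=1)

# Winner-take-all readout: pick the disparity with the highest correlation.
disparity = correlation_cost_volume(left, right, max_disp=8).argmax(axis=0)

# With a range that excludes the true disparity, it can never be predicted.
disparity_small_range = correlation_cost_volume(left, right, max_disp=2).argmax(axis=0)
```

With `max_disp=8` the true shift of 3 pixels is recovered wherever a valid match exists (columns `x >= 3`), while `max_disp=2` can only ever return 0 or 1, which is the range limitation described above. The volume also grows linearly with `max_disp`, which hints at the memory cost of processing it with 3D convolutions.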

Hazardous Regions Most stereo algorithms rely on the following basic assumptions [37]: (1) well-textured local surfaces for feature extraction, without large homogeneous regions; (2) a single image layer with only Lambertian surfaces; (3) disparity that varies slowly and smoothly in space, without sudden jumps. However, as shown in Figure 2, these assumptions can easily be broken in many real-world scenarios. For example, textureless regions such as large walls are commonly seen, and specular surfaces create multiple image layers. Furthermore, disparity jumps break the local smoothness assumption. The aforementioned regions are called hazardous regions [35]. In this work, we specifically study these commonly seen yet challenging scenarios for more robust stereo depth estimation.

Efficient Attention Attention is well suited to capturing correspondence between two sequences and addresses the problem that RNNs cannot be computed in parallel [30]. There are many successful applications that adopt attention to encode long-range sequences [3]. Recently, attention has been applied to extract non-local features in computer vision and has led to SOTA performance on many vision tasks [7]. However, it is computationally expensive when the input of attention
