Context-Enhanced Stereo Transformer
Weiyu Guo1,2, Zhaoshuo Li3, Yongkui Yang1, Zheng Wang1,
Russell H. Taylor3, Mathias Unberath3, Alan Yuille3, and Yingwei Li3
1Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences,
Shenzhen, China
2University of Chinese Academy of Sciences, Beijing, China
3Johns Hopkins University, Baltimore, USA
{wy.guo,yk.yang,zheng.wang}@siat.ac.cn, {zli122, yingwei.li}@jhu.edu
Abstract. Stereo depth estimation is of great interest for computer vision
research. However, existing methods struggle to generalize and predict
reliably in hazardous regions, such as large uniform regions. To overcome
these limitations, we propose Context Enhanced Path (CEP). CEP improves
generalization and robustness against common failure cases in existing
solutions by capturing long-range global information. We construct our
stereo depth estimation model, Context Enhanced Stereo Transformer (CSTR),
by plugging CEP into the state-of-the-art stereo depth estimation method
Stereo Transformer. CSTR is evaluated on distinct public datasets, such as
Scene Flow, Middlebury-2014, KITTI-2015, and MPI-Sintel. We find that CSTR
outperforms prior approaches by a large margin. For example, in the
zero-shot synthetic-to-real setting, CSTR outperforms the best competing
approaches on the Middlebury-2014 dataset by 11%. Our extensive experiments
demonstrate that long-range information is critical for the stereo matching
task and that CEP successfully captures such information.
Keywords: Stereo depth estimation, transformer, context extraction
1 Introduction
Stereo depth estimation is a critical task in computer vision that has been widely
used in various fields, such as robotics [27], autonomous driving [24], and 3D
scene reconstruction [29]. Recent learning-based stereo disparity estimation
algorithms generally use techniques restricted to local information for matching
the feature patterns between the left and right images. For example, prior works
[2,8,36] construct a cost volume with a pre-defined disparity range and use 3D
convolutions to process the cost volume, limiting themselves to the receptive
field of the convolution kernel. Xu et al. [32] proposed to instead process the
cost volume using 2D convolutions, but this approach faces the same challenge.
Recently, approaches that attempt to capture more global information have been
proposed. For example, STTR [17] and RAFT-Stereo [19] compute attention or
correlation between all pixels of the left and right images on the same epipolar
lines. However, they all fail to take advantage of cross-epipolar line
information, which is a critical component of global information processing.
Thus, as shown in Figure 1, these methods cannot address hazardous regions such
as textureless or large uniform regions, specularity, and transparency [15,37],
for which it is particularly challenging for stereo algorithms to produce
reliable estimates. The features of the left and right frames in these regions
are often similar or misleading, which makes the feature matching ambiguous [37].
If the disparities of these regions cannot be reliably predicted, downstream
applications, such as 3D object detection [28], may be severely impacted by
missing or wrong predictions. Therefore, in this paper, we seek to answer this
critical question: how can we guide stereo models to properly handle these
hazardous regions?
Code available at: github.com/guoweiyu/Context-Enhanced-Stereo-Transformer
arXiv:2210.11719v1 [cs.CV] 21 Oct 2022
To address this question, we hypothesize that long-range contextual information
helps to improve the predictions in hazardous regions. For example, as shown in
Figure 1 (a), previous work performs unreliably on a large white wall. However,
if we could use the global information (e.g., orientation, edge information) of
the house, the prediction could be improved. Such global context information
should, in theory, inform the model about the geometry on a global scale and
guide the model to resolve the ambiguity in prediction. To this end, we propose
a plug-in module, called Context Enhanced Path (CEP), which helps stereo
matching models to better understand the global structure of the hazardous
regions. Compared to existing methods, CEP offers the following three unique
advantages: (1) strong generalization ability: compared with previous methods
[2,17], CEP shows strong results on unseen real-world data even when trained
only on synthetic data; (2) robustness against hazardous regions, thanks to
modeling long-range contextual information; (3) generality: unlike [9,14,34],
our method serves as a plug-in that can potentially be applied to most stereo
matching methods. We construct our stereo depth estimation model based on
CEP, namely Context Enhanced Stereo Transformer (CSTR). We have examined
CSTR on several popular and diverse datasets, such as Middlebury-2014 [26],
KITTI-2015 [24], and MPI-Sintel [1]. Our extensive experiments demonstrate
that (1) long-range information is critical for stereo depth estimation, (2)
CSTR attains strong generalization ability, and (3) more importantly, CSTR can
better handle hazardous regions, such as texturelessness and disparity jumps
(shown in Figure 1 and Table 3). We attribute this result to a simple yet
powerful observation: using long-range contextual information to better
understand the global structure of the image can significantly help stereo depth
estimation, especially in hazardous areas. This result suggests that modeling
long-range context information is critical for building a robust and
generalizable stereo depth estimation algorithm.
To summarize, our contributions are three-fold: (1) we find that global contextual
information is critical for stereo depth estimation; (2) we design a plug-in module,
Context Enhanced Path (CEP), for generic stereo depth estimation models;
(3) we integrate our plug-in module and build a stereo matching model named
Context Enhanced Stereo Transformer (CSTR), which achieves state-of-the-art
generalization results on several popular datasets, including Middlebury-2014
[26], KITTI-2015 [24], and MPI-Sintel [1].
Fig. 1. Sample visualizations of hazardous regions, panels (a) to (e), taken from
the KITTI-2015 and Middlebury-2014 datasets. The first row shows the input left
images. The second row shows the disparity predicted by Stereo Transformer
(STTR) [17]. The third row shows the disparity predicted by our proposed Context
Enhanced Stereo Transformer (CSTR). The color map shown on the right is based on
the disparity value relative to the image width.
Fig. 2. Examples of hazardous regions, including: (a) Texturelessness: the wall
and the ceiling in the room; (b) Specularity: the screen of a TV;
(c) Transparency: the sliding door; (d) Disparity jumps: objects such as bamboos,
fences, and plants give frequent disparity discontinuities. Images are from
Zhang et al. [37].
2 Related Work
Rectified stereo depth estimation obtains per-pixel depth from the left and
right frames provided by a binocular camera. It has a wide range of applications
in robotics, autonomous driving, scene understanding, 3D modeling, etc.
In contrast to the success of deep learning in many high-level vision problems,
deep learning algorithms for low-level vision tasks are still in their early
stages [15]. In the field of stereo depth estimation, many works aim to improve a
single step of the classical pipeline by replacing it with a deep learning module
[15,38], where the quality of the cost volume directly determines the accuracy of
the disparity map. Chen et al. proposed Deep Embed to learn a cost function from
different windows by processing multiple patches at different resolutions [4].
After cost volume computation, cost aggregation is essential for gathering large
context information from the huge cost volume. One of the most popular cost
aggregation techniques is Semiglobal Matching (SGM) [12]. It defines a global
energy function over the disparity map and minimizes this function to obtain the
optimal disparity of each pixel. The raw disparity map should then be refined by
a post-processing algorithm.
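For reference, the global energy minimized by SGM [12] is commonly written in the following form. The notation is introduced here for illustration and is not taken from this paper: D is the disparity map, C(p, d) is the matching cost of pixel p at disparity d, N_p is the neighborhood of p, T[.] is 1 when its argument is true and 0 otherwise, and P_1 < P_2 are penalties for small and large disparity changes.

```latex
\[
E(D) = \sum_{p} \Big( C(p, D_p)
     + \sum_{q \in N_p} P_1 \, T\big[\,|D_p - D_q| = 1\,\big]
     + \sum_{q \in N_p} P_2 \, T\big[\,|D_p - D_q| > 1\,\big] \Big)
\]
```

The first term rewards photometric agreement, while the two penalty terms encode the local smoothness assumption: slanted surfaces incur the small penalty P_1, and disparity discontinuities incur the larger penalty P_2.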
Although several challenges remain, end-to-end deep learning has recently begun
to be used in binocular stereo depth estimation and now dominates dense
disparity estimation in several well-known benchmarks. In order to keep memory
feasible and inference speed manageable, many researchers adopt 2D
convolution-based methods. These architectures always contain a custom layer,
namely a correlation layer, in charge of computing correlation scores between
left and right features. Mayer et al. proposed an encoder-decoder architecture
based on U-Net named DispNet [23]. Some researchers adopt 3D convolutions
in stereo matching, which take a 4D tensor (disparity range, height, width,
feature) as the input and directly process a matching volume-like representation.
Chang et al. proposed the Pyramidal Stereo Matching network (PSMNet) to
integrate Spatial Pyramidal Pooling (SPP) layers in the feature extractor [2].
However, these methods lead to large computational costs, such as huge memory
cost and low inference speed. Besides, the disparity range of conventional
methods is limited, preventing them from being used in many cases where the
scenes are close to the camera. Recently, Li et al. used a sequence-to-sequence
perspective to replace cost volume construction with dense pixel matching [17].
Lipson et al. unify stereo and optical flow approaches and utilize a GRU to
iteratively generate the final disparity map [19]. Others [18,16] exploit
auxiliary information for depth estimation. However, stereo depth estimation is
still limited by difficulties like textureless surfaces, disparity jumps, and
occlusions.
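As a rough illustration of what a correlation layer over a pre-defined disparity range computes, the following pure-Python sketch builds the correlation scores for a single scanline (one epipolar line of a rectified pair) and picks the best disparity per pixel. All names (`correlation_row`, `winner_takes_all`, `max_disp`) are hypothetical and are not taken from DispNet, PSMNet, or STTR; real implementations operate on batched feature tensors and add learned aggregation on top.

```python
from typing import List

Row = List[List[float]]  # one scanline: W pixels, each with a C-dimensional feature


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def correlation_row(left: Row, right: Row, max_disp: int) -> List[List[float]]:
    """Return cost[d][x] = <left[x], right[x - d]> for d in [0, max_disp).

    Entries where x - d falls outside the image are left at 0.0.
    Disparities >= max_disp are never considered, which is the
    "pre-defined disparity range" limitation discussed in the text.
    """
    width = len(left)
    cost = [[0.0] * width for _ in range(max_disp)]
    for d in range(max_disp):
        for x in range(d, width):
            cost[d][x] = dot(left[x], right[x - d])
    return cost


def winner_takes_all(cost: List[List[float]]) -> List[int]:
    """Per pixel, pick the disparity with the highest correlation score."""
    max_disp, width = len(cost), len(cost[0])
    return [max(range(max_disp), key=lambda d: cost[d][x]) for x in range(width)]


# Toy data (an assumption for illustration): one-hot features that are unique
# per pixel, with the left row equal to the right row shifted by 2 pixels.
width = 6
right = [[1.0 if i == j else 0.0 for j in range(width)] for i in range(width)]
left = [[0.0] * width for _ in range(2)] + right[: width - 2]

print(winner_takes_all(correlation_row(left, right, max_disp=4)))
# prints [0, 0, 2, 2, 2, 2]
```

The first two pixels have no valid match inside the image, so every score is zero and the arg-max falls back to disparity 0; the remaining pixels recover the true shift of 2. Stacking such rows over the image height, and keeping features instead of scalar scores, gives the 4D volume that 3D-convolution methods process.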
Hazardous Regions Most stereo algorithms rely on the following basic
assumptions [37]: (1) well-textured local surfaces for feature extraction,
without large homogeneous regions; (2) a single image layer with only Lambertian
surfaces; (3) disparity that varies slowly and smoothly in space, without
sudden jumps. However, as shown in Figure 2, these assumptions can easily be
broken in many real-world scenarios. For example, textureless regions like large
walls are commonly seen, and specular surfaces create multiple image layers.
Furthermore, disparity jumps can break the local smoothness assumption.
The aforementioned regions are called hazardous regions [35]. In this work, we
specifically study these commonly seen yet challenging scenarios for more robust
stereo depth estimation.
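To make the first assumption concrete, the sketch below compares local matching costs on a textured scanline and on a uniform one. It uses a simple sum of absolute differences (SAD) over a small window, which is a stand-in chosen for illustration and not the matching function of any method discussed here; the function names and toy intensity values are likewise hypothetical.

```python
from typing import List


def sad_costs(left: List[float], right: List[float],
              x: int, half: int, max_disp: int) -> List[float]:
    """SAD between the left window centred at x and the right window
    centred at x - d, for every candidate disparity d in [0, max_disp)."""
    costs = []
    for d in range(max_disp):
        total = 0.0
        for k in range(-half, half + 1):
            total += abs(left[x + k] - right[x - d + k])
        costs.append(total)
    return costs


def is_ambiguous(costs: List[float], tol: float = 1e-6) -> bool:
    """True if more than one disparity attains the minimum cost (within tol)."""
    best = min(costs)
    return sum(1 for c in costs if c - best <= tol) > 1


# Textured scanline: the left row is the right row shifted by 2 pixels.
right_tex = [3, 8, 1, 9, 4, 7, 2, 6, 5, 0, 3, 8]
left_tex = [3, 3] + right_tex[:-2]

# Textureless scanline: a uniform wall looks identical at every shift.
right_flat = [5] * 12
left_flat = [5] * 12

tex = sad_costs(left_tex, right_tex, x=6, half=1, max_disp=4)
flat = sad_costs(left_flat, right_flat, x=6, half=1, max_disp=4)

print(tex, is_ambiguous(tex))    # prints [5.0, 13.0, 0.0, 16.0] False
print(flat, is_ambiguous(flat))  # prints [0.0, 0.0, 0.0, 0.0] True
```

On the textured row the cost has a single minimum at the true disparity of 2. On the uniform row every candidate disparity is equally good, so local evidence alone cannot resolve the match.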
Efficient Attention Attention is well suited to capturing correspondences
between two sequences and avoids the limitation that RNNs cannot be computed
in parallel [30]. There are many successful applications that adopt attention to
encode long-range sequences [3]. Recently, attention has been applied to extract
non-local features in computer vision and has led to SOTA performance for many
vision tasks [7]. However, it is computationally expensive when the input of attention