DCANet: Differential Convolution Attention Network for RGB-D Semantic Segmentation
Lizhi Bai†, Jun Yang†, Chunqi Tian, Yaoru Sun, Maoyu Mao, Yanjun Xu, Weirong Xu
Department of Computer Science and Technology, Tongji University, Shanghai, 201804, China
{bailizhi, junyang, yaoru, maomy, jesse, 2132983}@tongji.edu.cn, tianchunqi@163.com
Abstract
Combining RGB images with the corresponding depth maps has proven effective in semantic segmentation over the past few years. Existing RGB-D modal fusion methods either lack the non-linear feature fusion ability or treat both modal images equally, regardless of the intrinsic distribution gap or information loss. Here we find that depth maps are suitable for providing intrinsic fine-grained patterns of objects due to their local depth continuity, while RGB images effectively provide a global view. Based on this, we propose a pixel differential convolution attention (DCA) module to consider geometric information and local-range correlations for depth data. Furthermore, we extend DCA to ensemble differential convolution attention (EDCA), which propagates long-range contextual dependencies and seamlessly incorporates spatial distribution for RGB data. DCA and EDCA dynamically adjust convolutional weights by pixel difference to enable self-adaptation in the local range and long range, respectively. A two-branch network built with DCA and EDCA, called the Differential Convolution Attention Network (DCANet), is proposed to fuse the local and global information of the two-modal data. Consequently, the individual advantages of RGB and depth data are emphasized. Our DCANet sets a new state-of-the-art performance for RGB-D semantic segmentation on two challenging benchmark datasets, i.e., NYUDv2 and SUN-RGBD.
1. Introduction
Semantic segmentation is an essential task in computer vision, which infers the semantic label of every pixel in a scene. With the widespread use of 3D sensors such as Kinect and Xtion, the 3D geometric information of objects can be easily obtained to boost the advancement of RGB-D semantic segmentation. By encoding real-world geometric information, RGB-D images can be applied to overcome the limitation that 2D data only displays photometric appearance properties in the projected image space, and to enrich the representation of RGB images. The information of RGB and depth images is presented in entirely different forms. In particular, RGB images capture the photometric appearance properties in the projected image space, while depth maps provide plentiful complementary information for the appearance cues of the local geometry. As a result, it is vital to enhance and fuse the strengths of RGB and depth data in the semantic segmentation task.

Corresponding author.
†Equal contribution.

Figure 1. The intrinsic differences between RGB and depth data and an illustration of our DCANet. While the chair and the table are inseparable according to their 2D appearance in the RGB image, they can be easily distinguished in the depth map based on geometric information. In DCANet, we exploit DCA to capture local-range geometric consistency in the depth map and EDCA to focus on long-range dependencies for RGB.
In real scenarios, there are many challenging images with complex appearances. Take Fig. 1 as an example: while the chair and the table are inseparable in the RGB image, they can be easily distinguished in depth. Obviously, it is not feasible to separate the table from the chair using only 2D information such as shapes and colors. In the view of depth maps, by contrast, there is local consistency information, which is not limited by the similar, confusing appearance. In fact, depth data provides more fine-grained local geometric difference information, theoretically leading to better segmentation performance compared to using RGB images alone. In contrast, as verified by classic self-attention mechanisms [54, 60, 63], RGB data focuses on more global information.
Existing methods [3, 9–11, 20, 24, 26, 29, 36, 39] try to fuse RGB-D data by introducing new convolution and pooling layers, attention mechanisms, noise-cancelling modules, etc., to obtain better semantic segmentation results. These methods ignore the intrinsic differences between RGB and depth features, using homogeneous operators instead. The weights of both types of data are treated equally, so that they make the same contribution to the segmentation, which is obviously not appropriate. Besides, the information of RGB images and depth maps is mainly obtained from the combined final channel, where the specific semantic information in different channels is not considered.
To address the aforementioned problems, we propose two attention mechanisms, namely differential convolution attention (DCA) and ensemble differential convolution attention (EDCA), to improve the cross-modal ability between RGB and depth data in semantic segmentation. DCA dynamically augments the standard convolution with a pixel difference term and forces pixels that are similar to the center of the kernel to contribute more to the output than other pixels. DCA incorporates local geometric information and improves local-range adaptability for depth data. EDCA absorbs the advantage of the dynamic convolution of DCA to propagate long-range contextual dependencies and seamlessly incorporate spatial distribution for RGB data. Meanwhile, both DCA and EDCA avoid common drawbacks such as ignoring adaptability in the channel dimension. Our main contributions are summarized as follows.
• We propose a DCA module, which incorporates local-range intricate geometric patterns and enables self-adaptivity by considering the subtle discrepancies of pixels in local regions for depth data.
• We extend DCA to EDCA to achieve long-range correlations and seamlessly incorporate spatial distribution for RGB data.
• Based on DCA and EDCA, we propose DCANet, which achieves a new state-of-the-art performance on the NYUDv2 [47] and SUN-RGBD [48] datasets. We also provide a detailed analysis of design choices and model variants.
2. Related Work
2.1. RGB-D Semantic Segmentation
With the help of additional depth information, the combination of these two complementary modalities achieves great performance in semantic segmentation [3, 9, 17, 27, 28, 45, 47]. Many works simply concatenate the features of RGB and depth images to enhance the semantic information of each pixel [45, 47]. The fusion methods can be classified into three types: early fusion, middle fusion and late fusion. Cao et al. [3] concatenate the RGB and depth data at an early stage, with the depth feature decomposed into a shape component and a base component. However, due to the differences between these two complex modalities, a single model cannot fit both well. Jiao et al. [27] design two encoder-decoder modules to fully consider the RGB and depth information, where both modalities are fused at a late stage. In this method, the interaction between the different features of RGB and depth data is insufficient, since the rich information of the modalities is gradually compressed and even lost. Overcoming the drawbacks of the early-stage and late-stage fusion strategies, middle-stage fusion performs better by fusing the intermediate information of the two different modalities. Gupta et al. [18] concatenate the geocentric embedding of depth images with the depth images to contribute to the final semantic information in the middle stage. Notably, the distribution gap is reduced in the middle-stage fusion strategy, and multi-modal features are combined with ample interaction. As a result, recent studies mainly focus on middle-stage fusion. Chen et al. [9] propose a spatial information-guided convolution, which generates convolution kernels with different sampling distributions to enhance the spatial adaptability and receptive-field regulation of the network. Chen et al. [10] unify the most informative cross-modality features from both modalities into an efficient representation. Lin et al. [29] split the image into multiple branches based on geometric information, where each branch of the network semantically segments relevant similar features.
Our method applies two branches, each focusing on extracting modality-specific features, such as color and texture from RGB images and geometric, illumination-independent features from depth images. To be specific, similar to middle-stage fusion, attentive depth features generated by the DCA are fused into the attentive RGB features from the EDCA at each resolution stage of the encoders, as sketched below. The depth and RGB branches focus on local-range and long-range information, respectively.
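A minimal sketch of this two-branch, per-stage fusion follows (ours, assuming PyTorch; the FusionStage wrapper and the additive fusion are illustrative assumptions, not the paper's exact operator):

```python
# Minimal sketch of the two-branch, middle-stage fusion layout described
# above (PyTorch assumed). The additive fusion is an assumption; the paper's
# exact fusion operator may differ.
import torch
import torch.nn as nn

class FusionStage(nn.Module):
    """One encoder stage: modality-specific attention, then depth-to-RGB fusion."""

    def __init__(self, dca: nn.Module, edca: nn.Module):
        super().__init__()
        self.dca = dca    # local-range differential convolution attention (depth)
        self.edca = edca  # long-range ensemble differential convolution attention (RGB)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        att_depth = self.dca(depth)   # attentive depth features
        att_rgb = self.edca(rgb)      # attentive RGB features
        # Attentive depth features are fused into the attentive RGB stream at
        # each resolution stage; simple addition stands in for the real fusion.
        return att_rgb + att_depth, att_depth

# Example with identity stand-ins for the two attention modules:
stage = FusionStage(nn.Identity(), nn.Identity())
rgb, depth = torch.randn(1, 64, 60, 80), torch.randn(1, 64, 60, 80)
fused, depth_out = stage(rgb, depth)  # fused: (1, 64, 60, 80)
```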
2.2. Attention Modules
Attention modules owe their popularity largely to the fact that they can be applied to model the global dependencies of features at almost any stage of a network. Woo et al. [56] adaptively refine the information in the spatial and channel dimensions through a convolutional block attention module. Inspired by the self-attention network in natural language processing [51], self-attention-related modules have attracted widespread attention in computer vision [44, 50, 61]. Many researchers focus on global and local dependencies. In [54], Wang et al. propose a non-local model to extend self-attention to a more general type of non-local filtering method for capturing long-range dependencies. Fu et al. [15] propose two attention modules to capture spatial and channel interdependencies, respectively. Cao et al. [4] propose a lightweight non-local network based on a query-independent formulation for global context modeling. Zhu et al. [63] integrate the features of different levels while considering long-range dependencies and reducing redundant parameters.

Figure 2. Instances of DCA and EDCA, taking a 3×3 local grid as an example.

Our method integrates DCA and EDCA to build relationships between different points for depth and RGB data, respectively. The DCA module exploits the fact that the same objects have more substantial depth similarity in a local range of depth data, and we make use of the pixel-wise difference to force pixels with more consistent geometry to make more contributions to the corresponding output. The EDCA module enables long-range dependencies for RGB data.
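As a concrete illustration of this pixel-difference weighting (our own minimal sketch, assuming PyTorch; the actual DCA formulation is developed in Section 3.1), neighbors whose depth values are close to the window center's receive larger weights:

```python
# Minimal sketch of difference-based weighting (PyTorch assumed); not the
# paper's exact DCA: neighbors geometrically consistent with the window
# center get larger attention weights.
import torch
import torch.nn.functional as F

def difference_weights(depth: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Per-pixel k*k weights favoring neighbors similar to the center pixel."""
    # depth: (N, 1, h, w) single-channel depth features
    patches = F.unfold(depth, kernel_size=k, padding=k // 2)  # (N, k*k, h*w)
    center = depth.flatten(2)                                 # (N, 1, h*w)
    diff = (patches - center).abs()                           # pixel differences
    return torch.softmax(-diff, dim=1)  # smaller difference -> larger weight

w = difference_weights(torch.randn(1, 1, 8, 8))  # (1, 9, 64) local weights
```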
3. Method
RGB-D semantic segmentation requires fusing features from RGB and depth modalities, which are inherently different. Specifically, RGB data exhibits long-range contextual dependencies and global spatial consistency, while depth data contains local geometric consistency. The intrinsic characteristics of the two modalities should be considered separately to exploit the strengths of each, while enhancing the two feature representations. To this end, we put forward two attention modules, called DCA and EDCA, to capture the intrinsic features of depth and RGB data, respectively. In this section, we elaborate the details of the proposed DCA and EDCA, followed by a description of the proposed differential convolution attention network (DCANet).

Figure 3. Illustration of the convolution strategies in EDC; a 5×5 convolution is used for convenience. (a) 5×5 convolution, Conv5×5. (b) 5×5 convolution with dilation 3, DConv5×5. (c) The combination of (a) and (b), DConv5×5(Conv5×5(·)). Compared with (a), (b) has a larger receptive field but causes information loss. In (c) (left), the red dashed box is Conv5×5, which fills in the dilation gaps of DConv5×5 and yields an approximate receptive field of 17×17, shown as the blue dashed box in (c) (right).
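To make the 17×17 claim in the caption concrete, this short check (ours, assuming PyTorch) measures the effective receptive field of the composed operator by back-propagating from one output pixel; the dilated 5×5 kernel spans 1 + (5−1)·3 = 13 pixels, and composing it with the plain 5×5 convolution gives 5 + 13 − 1 = 17.

```python
# Minimal sketch (PyTorch assumed) verifying the 17x17 effective receptive
# field of DConv5x5(Conv5x5(.)) from Figure 3(c) via input gradients.
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, 5, padding=2, bias=False)               # Conv5x5
dconv = nn.Conv2d(1, 1, 5, padding=6, dilation=3, bias=False)  # DConv5x5
nn.init.constant_(conv.weight, 1.0)   # all-ones weights: no cancellation,
nn.init.constant_(dconv.weight, 1.0)  # so every contributing pixel shows up

x = torch.zeros(1, 1, 33, 33, requires_grad=True)
dconv(conv(x))[0, 0, 16, 16].backward()  # gradient of the center output pixel

rows = (x.grad[0, 0].abs().sum(dim=1) > 0).sum().item()
cols = (x.grad[0, 0].abs().sum(dim=0) > 0).sum().item()
print(rows, cols)  # 17 17: RF = 5 + (1 + (5 - 1) * 3) - 1 = 17
```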
3.1. Differential Convolution Attention
The attention mechanism can be considered as an adap-
tive selection process that selects discriminative features
based on input features and automatically ignores noisy re-
sponses [16]. The key point of the attention mechanism is
to learn the relationship between different points and gen-
erate an attention map that indicates the importance of dif-
ferent points. The well-known method for establishing re-
lationship between different points is self-attention mecha-
nism [15,54,57,61], which is used to capture long-range
dependence. However, due to its intrinsic properties, the
depth data is only relevant in a local region and long-range
dependencies may introduce more interference terms. For
this, we explore convolution to build relevance and produce
attention map by considering a local region in depth data.
Given a feature map $F \in \mathbb{R}^{h \times w \times c}$, where $h$, $w$, and $c$ are the height, width and channel dimension of the input feature map, respectively. For simplicity, we denote the input feature map as $X \in \mathbb{R}^{h \times w \times 1}$. For each point $p \in \mathbb{R}^2$ on $X$, the vanilla convolution is calculated as:

$$Y(p) = \sum_{i=1}^{k \times k} K_i \cdot X(p + p_i), \tag{1}$$

where $p_i$ enumerates the local locations around $p$, and $K$ denotes the learnable weights of the convolution kernel of size $k \times k$ (the bias term is ignored for simplicity).
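The following minimal sketch (ours, assuming PyTorch; the 8×8 single-channel map is an arbitrary choice) implements Eq. (1) with an explicit gather over the $k \times k$ neighborhood, which makes the fixed, input-independent role of $K$ easy to see:

```python
# Minimal sketch of Eq. (1): Y(p) = sum_i K_i * X(p + p_i) with a fixed kernel.
import torch
import torch.nn.functional as F

k, h, w = 3, 8, 8
X = torch.randn(1, 1, h, w)    # input feature map X in R^{h x w x 1}
K = torch.randn(k * k)         # learnable kernel weights K_i (bias omitted)

# Gather the k x k neighborhood X(p + p_i) around every position p.
patches = F.unfold(X, kernel_size=k, padding=k // 2)          # (1, k*k, h*w)
Y = (K.view(1, -1, 1) * patches).sum(dim=1).view(1, 1, h, w)

# Sanity check: identical to a standard convolution with the same weights.
assert torch.allclose(Y, F.conv2d(X, K.view(1, 1, k, k), padding=k // 2), atol=1e-5)
```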
In Eq. (1), the convolution kernel $K$ of the vanilla convolution is fixed for any input, and therefore cannot perceive the content-dependent differences between pixels.