ance. In fact, depth data provide more fine-grained local geometric information and should theoretically lead to better segmentation performance than using RGB images alone. In contrast, as verified in classic self-attention mechanisms [54,60,63], RGB data focus more on global information.
The existing methods [3,9–11,20,24,26,29,36,39] try to fuse RGB-D data by introducing new convolution and pooling layers, attention mechanisms, noise-cancelling modules, etc., to obtain better semantic segmentation results. However, these methods ignore the intrinsic differences between RGB and depth features and apply homogeneous operators to both. The two types of data are weighted equally and thus make the same contribution to the segmentation, which is clearly inappropriate. Besides, the information of RGB images and depth maps is mainly obtained from the final combined channels, so the specific semantic information carried by individual channels is not considered.
To address the aforementioned problems, we propose two attention mechanisms, namely differential convolution attention (DCA) and ensemble differential convolution attention (EDCA), to improve the cross-modal ability between RGB and depth data in semantic segmentation. DCA dynamically augments the standard convolution with a pixel difference term and forces pixels that are similar to the center of the kernel to contribute more to the output than other pixels (a minimal sketch is given after the contribution list). DCA thus incorporates local geometric information and improves local-range adaptability for depth data. EDCA inherits the dynamic convolution of DCA to propagate long-range contextual dependencies and seamlessly incorporates the spatial distribution of RGB data. Meanwhile, both DCA and EDCA avoid the common drawback of ignoring adaptability in the channel dimension. Our main contributions are summarized as follows.
•We propose a DCA module that incorporates intricate local-range geometric patterns and enables self-adaptivity by considering subtle discrepancies between pixels in local regions of depth data.
•We extend DCA to EDCA, which captures long-range correlations and seamlessly incorporates the spatial distribution of RGB data.
•Based on DCA and EDCA, we propose DCANet, which achieves new state-of-the-art performance on the NYUDv2 [47] and SUN-RGBD [48] datasets. We also provide a detailed analysis of design choices and model variants.
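For concreteness, the following is a minimal sketch of the pixel-difference idea behind DCA described above. It is an illustration under our own assumptions rather than the exact formulation: the module name DiffConvSketch, the depthwise kernel, and the Gaussian-style similarity weighting are hypothetical choices made only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffConvSketch(nn.Module):
    """Depthwise convolution augmented with a pixel-difference weighting term (illustrative)."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.pad = kernel_size // 2
        # static depthwise convolution weights: one k x k kernel per channel
        self.weight = nn.Parameter(0.1 * torch.randn(channels, kernel_size * kernel_size))

    def forward(self, x):
        b, c, h, w = x.shape
        # unfold into k*k local neighborhoods: (B, C, k*k, H*W)
        patches = F.unfold(x, self.k, padding=self.pad).view(b, c, self.k * self.k, h * w)
        center = patches[:, :, self.k * self.k // 2, :].unsqueeze(2)
        # pixels similar to the kernel center receive larger weights
        # (Gaussian-style similarity followed by a softmax -- an assumed choice)
        similarity = torch.softmax(-(patches - center) ** 2, dim=2)
        # difference-augmented convolution: static weights modulated per position
        out = (self.weight.view(1, c, -1, 1) * similarity * patches).sum(dim=2)
        return out.view(b, c, h, w)


if __name__ == "__main__":
    depth_feat = torch.randn(1, 8, 32, 32)        # toy depth feature map
    print(DiffConvSketch(8)(depth_feat).shape)    # torch.Size([1, 8, 32, 32])
```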
2. Related Work
2.1. RGB-D Semantic Segmentation
With the help of additional depth information, the combination of these two complementary modalities achieves strong performance in semantic segmentation [3,9,17,27,28,45,47]. Many works simply concatenate the features of RGB and depth images to enhance the semantic information of each pixel [45,47]. Fusion methods can be classified into three types: early fusion, middle fusion and late fusion. Cao et al. [3] concatenate the RGB data with the depth feature, decomposed into a shape component and a base component, at the early stage. However, due to the complexity and differences of the two modalities, a single model cannot fit both types of data well. Jiao et al. [27] design two encoder-decoder modules to fully consider the RGB and depth information, fusing the two modalities at the late stage. In this method, the interaction between RGB and depth features is insufficient, since the rich information of each modality is gradually compressed and even lost. Middle stage fusion overcomes the drawbacks of the early and late stage strategies and performs better by fusing the intermediate information of the two modalities. Gupta et al. [18] concatenate the geocentric embedding of depth images with the depth images to contribute to the final semantic information at the middle stage.
Notably, the distribution gap is reduced in the middle stage
fusion strategy, and multi-modal features are combined with
ample interaction. As a result, recent studies mainly focus
on middle stage fusion. Chen et al. [9] propose a spatial information-guided convolution, which generates convolution kernels with different sampling distributions to enhance the spatial adaptability and receptive field regulation of the network. Chen et al. [10] unify the most informative cross-modality features of both modalities into an efficient representation. Lin et al. [29] split the image into multiple branches based on geometry information, where each branch of the network semantically segments relevant similar features.
Our method applies two branches, each focusing on extracting modality-specific features: color and texture from RGB images, and geometric, illumination-independent features from depth images. To be specific, similar to middle stage fusion, attentive depth features generated by the DCA are fused into the attentive RGB features from the EDCA at each resolution stage of the encoders, as sketched below. The depth and RGB branches thus focus on local and long-range information, respectively.
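The following is a minimal sketch of this stagewise fusion, written under our own assumptions: the class name TwoBranchEncoderSketch, the use of element-wise addition as the fusion operator, and the toy convolutional stages are illustrative placeholders rather than the actual DCA/EDCA blocks.

```python
import torch
import torch.nn as nn


class TwoBranchEncoderSketch(nn.Module):
    """Two-branch encoder that fuses depth features into the RGB branch at every stage (illustrative)."""

    def __init__(self, rgb_stages, depth_stages):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)      # stand-ins for EDCA-based blocks
        self.depth_stages = nn.ModuleList(depth_stages)  # stand-ins for DCA-based blocks

    def forward(self, rgb, depth):
        fused = []
        for rgb_stage, depth_stage in zip(self.rgb_stages, self.depth_stages):
            rgb = rgb_stage(rgb)        # long-range attentive RGB features
            depth = depth_stage(depth)  # local attentive depth features
            rgb = rgb + depth           # fuse depth into the RGB branch at this stage
            fused.append(rgb)
        return fused                    # multi-scale features for a decoder


if __name__ == "__main__":
    make_stages = lambda: [nn.Conv2d(16, 16, 3, padding=1) for _ in range(4)]
    encoder = TwoBranchEncoderSketch(make_stages(), make_stages())
    feats = encoder(torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64))
    print(len(feats), feats[0].shape)   # 4 torch.Size([1, 16, 64, 64])
```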
2.2. Attention Modules
Attention modules owe much of their popularity to the fact that they can be applied to model the global dependencies of features at almost any stage of a network. Woo et al. [56] adaptively refine the information in the spatial and channel dimensions through the convolutional block attention module. Inspired by the self-attention network in Natural Language Processing [51], self-attention-related modules have attracted widespread attention in computer vision [44,50,61]. Many researchers focus on global and local dependencies. In [54], Wang et al. pro-