ance. In fact, depth data provide more fine-grained local geometric information and should theoretically lead to better segmentation performance than using RGB images alone. In contrast, as verified in classic self-attention mechanisms [54,60,63], RGB data focus more on global information.
The existing methods [3,9–11,20,24,26,29,36,39] try to fuse RGB-D data by introducing new convolution and pooling layers, attention mechanisms, noise-cancelling modules, etc., to obtain better semantic segmentation results. However, these methods ignore the intrinsic differences between RGB and depth features and apply homogeneous operators to both. The two types of data are weighted equally and thus make the same contribution to the segmentation, which is clearly inappropriate. Besides, the information of RGB images and depth maps is mainly obtained from the final combined channels, so the specific semantic information carried by individual channels is not considered.
To address the aforementioned problems, we propose two attention mechanisms, namely differential convolution attention (DCA) and ensemble differential convolution attention (EDCA), to improve the cross-modal ability between RGB and depth data in semantic segmentation. DCA dynamically augments the standard convolution with a pixel difference term and forces pixels that are similar to the center of the kernel to contribute more to the output than other pixels (a minimal sketch is given after the contribution list). DCA thus incorporates local geometric information and improves local-range adaptability for depth data. EDCA inherits the dynamic convolution of DCA to propagate long-range contextual dependencies and seamlessly incorporates the spatial distribution of RGB data. Meanwhile, both DCA and EDCA avoid the common drawback of ignoring adaptability in the channel dimension. Our main contributions are summarized as follows.
•We propose a DCA module that incorporates intricate local-range geometric patterns and enables self-adaptivity by considering subtle discrepancies between pixels in local regions of depth data.
•We extend DCA to EDCA, which captures long-range correlations and seamlessly incorporates the spatial distribution of RGB data.
•Based on DCA and EDCA, we propose DCANet, which achieves new state-of-the-art performance on the NYUDv2 [47] and SUN-RGBD [48] datasets. We also provide a detailed analysis of design choices and model variants.
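For concreteness, the following is a minimal sketch of the pixel-difference idea behind DCA described above. It is an illustration under our own assumptions rather than the exact formulation: the module name DiffConvSketch, the depthwise kernel, and the Gaussian-style similarity weighting are hypothetical choices made only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffConvSketch(nn.Module):
    """Depthwise convolution augmented with a pixel-difference weighting term (illustrative)."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.pad = kernel_size // 2
        # static depthwise convolution weights: one k x k kernel per channel
        self.weight = nn.Parameter(0.1 * torch.randn(channels, kernel_size * kernel_size))

    def forward(self, x):
        b, c, h, w = x.shape
        # unfold into k*k local neighborhoods: (B, C, k*k, H*W)
        patches = F.unfold(x, self.k, padding=self.pad).view(b, c, self.k * self.k, h * w)
        center = patches[:, :, self.k * self.k // 2, :].unsqueeze(2)
        # pixels similar to the kernel center receive larger weights
        # (Gaussian-style similarity followed by a softmax -- an assumed choice)
        similarity = torch.softmax(-(patches - center) ** 2, dim=2)
        # difference-augmented convolution: static weights modulated per position
        out = (self.weight.view(1, c, -1, 1) * similarity * patches).sum(dim=2)
        return out.view(b, c, h, w)


if __name__ == "__main__":
    depth_feat = torch.randn(1, 8, 32, 32)        # toy depth feature map
    print(DiffConvSketch(8)(depth_feat).shape)    # torch.Size([1, 8, 32, 32])
```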
2. Related Work
2.1. RGB-D Semantic Segmentation
With the help of additional depth information, the combination of these two complementary modalities achieves strong performance in semantic segmentation [3,9,17,27,28,45,47]. Many works simply concatenate the features of RGB and depth images to enhance the semantic information of each pixel [45,47]. Fusion methods can be classified into three types: early fusion, middle fusion and late fusion. Cao et al. [3] concatenate the RGB data with the depth feature, decomposed into a shape component and a base component, at the early stage. However, due to the complexity and differences of the two modalities, a single model cannot fit both types of data well. Jiao et al. [27] design two encoder-decoder modules to fully consider the RGB and depth information, fusing the two modalities at the late stage. In this method, the interaction between RGB and depth features is insufficient, since the rich information of each modality is gradually compressed and even lost. Middle stage fusion overcomes the drawbacks of the early and late stage strategies and performs better by fusing the intermediate information of the two modalities. Gupta et al. [18] concatenate the geocentric embedding of depth images with the depth images to contribute to the final semantic information at the middle stage.
Notably, the distribution gap is reduced in the middle stage
fusion strategy, and multi-modal features are combined with
ample interaction. As a result, recent studies mainly focus
on middle stage fusion. Chen et al. [9] propose a spatial information-guided convolution, which generates convolution kernels with different sampling distributions to enhance the spatial adaptability and receptive field regulation of the network. Chen et al. [10] unify the most informative cross-modality features of both modalities into an efficient representation. Lin et al. [29] split the image into multiple branches based on geometry information, where each branch of the network semantically segments relevant similar features.
Our method applies two branches, each focusing on extracting modality-specific features: color and texture from RGB images, and geometric, illumination-independent features from depth images. To be specific, similar to middle stage fusion, attentive depth features generated by the DCA are fused into the attentive RGB features from the EDCA at each resolution stage of the encoders, as sketched below. The depth and RGB branches thus focus on local and long-range information, respectively.
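The following is a minimal sketch of this stagewise fusion, written under our own assumptions: the class name TwoBranchEncoderSketch, the use of element-wise addition as the fusion operator, and the toy convolutional stages are illustrative placeholders rather than the actual DCA/EDCA blocks.

```python
import torch
import torch.nn as nn


class TwoBranchEncoderSketch(nn.Module):
    """Two-branch encoder that fuses depth features into the RGB branch at every stage (illustrative)."""

    def __init__(self, rgb_stages, depth_stages):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)      # stand-ins for EDCA-based blocks
        self.depth_stages = nn.ModuleList(depth_stages)  # stand-ins for DCA-based blocks

    def forward(self, rgb, depth):
        fused = []
        for rgb_stage, depth_stage in zip(self.rgb_stages, self.depth_stages):
            rgb = rgb_stage(rgb)        # long-range attentive RGB features
            depth = depth_stage(depth)  # local attentive depth features
            rgb = rgb + depth           # fuse depth into the RGB branch at this stage
            fused.append(rgb)
        return fused                    # multi-scale features for a decoder


if __name__ == "__main__":
    make_stages = lambda: [nn.Conv2d(16, 16, 3, padding=1) for _ in range(4)]
    encoder = TwoBranchEncoderSketch(make_stages(), make_stages())
    feats = encoder(torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64))
    print(len(feats), feats[0].shape)   # 4 torch.Size([1, 16, 64, 64])
```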
2.2. Attention Modules
Attention modules owe much of their popularity to the fact that they can be applied to model the global dependencies of features at almost any stage of a network. Woo et al. [56] adaptively refine the information in the spatial and channel dimensions through the convolutional block attention module. Inspired by the self-attention network in Natural Language Processing [51], self-attention-related modules have attracted widespread attention in computer vision [44,50,61]. Many researchers focus on global and local dependencies. In [54], Wang et al. pro-