CIR-Net: Cross-modality Interaction and
Refinement for RGB-D Salient Object Detection
Runmin Cong, Member, IEEE, Qinwei Lin, Chen Zhang, Chongyi Li, Xiaochun Cao, Senior Member, IEEE,
Qingming Huang, Fellow, IEEE, and Yao Zhao, Senior Member, IEEE
Abstract—Focusing on the issue of how to effectively capture
and utilize cross-modality information in the RGB-D salient object detection (SOD) task, we present a convolutional neural network
(CNN) model, named CIR-Net, based on the novel cross-modality
interaction and refinement. For the cross-modality interaction,
1) a progressive attention guided integration unit is proposed
to sufficiently integrate RGB-D feature representations in the
encoder stage, and 2) a convergence aggregation structure is
proposed, which flows the RGB and depth decoding features into
the corresponding RGB-D decoding streams via an importance
gated fusion unit in the decoder stage. For the cross-modality
refinement, we insert a refinement middleware structure between
the encoder and the decoder, in which the RGB, depth, and
RGB-D encoder features are further refined by successively using
a self-modality attention refinement unit and a cross-modality
weighting refinement unit. At last, with the gradually refined
features, we predict the saliency map in the decoder stage.
Extensive experiments on six popular RGB-D SOD benchmarks
demonstrate that our network outperforms the state-of-the-art
saliency detectors both qualitatively and quantitatively. The code
and results can be found at https://rmcong.github.io/proj_CIRNet.html.
Index Terms—Salient object detection, RGB-D images, Cross-
modality attention, Cross-modality interaction.
I. INTRODUCTION
When viewing an image, humans are involuntarily
attracted by some objects or regions in the image
(e.g., the Smurfs in the second image of Fig. 1), which is
mainly caused by the human visual attention mechanism, and
these objects are called salient objects [1]–[3]. Simulating
this scheme, in the field of computer vision, salient object
detection (SOD) is the task of automatically locating the most
visually attractive objects or regions in a scene, which has been
successfully applied to numerous tasks, such as segmentation [4]–[9], retrieval [10], enhancement [11]–[15], and quality assessment [16], [17].

Fig. 1. Visual examples of different methods. (a) RGB images. (b) Depth maps. (c) Ground truths. (d) Our results. (e)-(f) Saliency maps produced by FRDT [18] and GCPANet [19], respectively.

Runmin Cong, Qinwei Lin, Chen Zhang, and Yao Zhao are with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China, and also with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China (e-mail: rmcong@bjtu.edu.cn, lqw22@mails.tsinghua.edu.cn, chen.zhang@bjtu.edu.cn, yzhao@bjtu.edu.cn).
Chongyi Li is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: lichongyi25@gmail.com).
Xiaochun Cao is with the School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, 518107, China (e-mail: caoxiaochun@mail.sysu.edu.cn).
Qingming Huang is with the School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China, also with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: qmhuang@ucas.ac.cn).
With the development of SOD research, many subtasks, such as co-salient object detection (CoSOD) [20]–[22], remote sensing SOD [23]–[27], video SOD [28]–[30], and light field SOD [31], have also been developed. In fact, the binocular structure of the human visual system can also perceive the depth of field of a scene and thereby produce stereo perception. Expressed in the form of an image, this depth relationship is a depth/disparity map. In recent years, the development
and popularization of depth sensors, especially the rise of
affordable and portable consumer depth cameras, has further
promoted the applications of RGB-D data, such as depth map
super-resolution [32]–[34], depth estimation [35], superpixel
segmentation [36], and saliency detection [37]–[43]. For the
RGB-D images, the RGB image contains abundant details
and appearance information (e.g., color, texture, structure, etc)
while the depth map provides some valuable supplementary
information (e.g., shape, surface normals, internal consistency,
etc). Recently, more and more studies focus on introducing the depth cue into the SOD task to effectively suppress background interference in complex scenes and more completely highlight the foreground salient regions. For example,
in Fig. 1, the first two images have complex and cluttered
backgrounds, and the color contrast between the salient object
and the background in the fourth image is low. Thus, for the
RGB SOD method (i.e., the GCPANet [19]) shown in the last
row of Fig. 1, it is difficult to accurately locate the salient
regions with a clean background and a complete structure.
In comparison, the RGB-D SOD methods (e.g., the fourth
and fifth rows of Fig. 1) can alleviate this problem with
the introduction of depth information. Notably, our method
has better object positioning ability, completeness preserving
ability, and background suppression ability.
The effectiveness of the depth map for the SOD task has been validated in previous works [44]–[47]; however, how to effectively utilize and integrate the RGB information and the depth cue remains an open issue, because the RGB image and the depth map belong to different modalities with different attributes. To address this, we design a three-stream network to fully capture and utilize cross-modality
information. Considering the strengths and complementarities of different modalities, the three-stream structure with independent RGB and depth streams sufficiently preserves the rich information of each modality and explores their complementary relations, which allows cross-modality information to be jointly integrated in both the encoder and decoder stages in a more comprehensive and in-depth manner than a two-stream structure. This is manifested in the following two aspects:
1) Cross-Modality Interaction. For cross-modality information, the primary problem we face is how to make the two modalities interact. Specifically, the purpose is to learn the strengths
and complementarities of different modalities, then obtain
more comprehensive and discriminative feature representa-
tions. Different from the existing cross-modality interaction
methods that operated only in the encoder stage [48], [49] or
decoder stage [37], [47], [50], we are dedicated to integrating cross-
modality information into both encoder and decoder stages
jointly in a more comprehensive and in-depth manner, which
sufficiently explores the complementary relations of different
modalities. Concretely, in the feature encoder stage, we design
a progressive attention guided integration (PAI) unit to fuse
cross-modality and cross-level features, thereby attaining the
RGB-D encoder representations. In the feature decoder stage,
we design an aggregation structure to allow RGB and depth
decoder features to flow into the RGB-D mainstream branch
and generate more comprehensive saliency-related features.
In this structure, the decoder features of the previous layer and the RGB and depth decoder features of the corresponding layer are integrated into confluence decoder features through an importance gated fusion (IGF) unit in a dynamic weighting
manner. The gradually refined decoder features of the last layer
are then used to predict the final saliency map.
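As a concrete illustration of this dynamic weighting idea, the PyTorch-style sketch below gates the RGB and depth decoder features before fusing them with the previous RGB-D decoder features. It is only a minimal sketch under our own assumptions (module name, layer choices, gating form); it is not the exact IGF design.

import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    # Illustrative dynamic-weighting fusion of RGB, depth, and previous RGB-D
    # decoder features (a sketch, not the paper's IGF unit).
    def __init__(self, channels):
        super().__init__()
        # predict per-pixel importance gates for the RGB and depth branches
        self.gate = nn.Sequential(
            nn.Conv2d(3 * channels, 2, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_rgb, f_depth, f_rgbd_prev):
        x = torch.cat([f_rgb, f_depth, f_rgbd_prev], dim=1)
        g = self.gate(x)                     # B x 2 x H x W importance maps
        f_rgb_w = f_rgb * g[:, 0:1]          # re-weighted RGB decoder features
        f_depth_w = f_depth * g[:, 1:2]      # re-weighted depth decoder features
        return self.fuse(torch.cat([f_rgb_w, f_depth_w, f_rgbd_prev], dim=1))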
2) Cross-Modality Refinement. In addition to cross-
modality interaction, refining the most valuable information
from different modalities is also crucial for RGB-D SOD
task. To this end, we insert a refinement middleware
between the encoder and the decoder, including the self-
modality refinement and cross-modality refinement. For
the self-modality refinement, in order to reduce the feature
redundancy of the channel dimension and emphasize the
important location of the spatial dimension, we propose
a simple but effective self-modality attention refinement
(smAR) unit, which replaces the commonly used progressive
interaction [51] or feature fusion [47] method with our
proposed channel-spatial attention generation. We directly
integrate spatial attention and channel attention in the feature
map space to generate a 3D attention tensor that is used to
refine the single modality features, which not only reduces the
computational cost, but also better highlights the important
features. Further, we design a cross-modality weighting
refinement (cmWR) unit to refine the multi-modality features
by considering cross-modality complementary information
and cross-modality global contextual dependencies. Inspired
by the non-local model [52], the RGB features, depth features,
and RGB-D features are integrated to capture the long-range
dependencies among different modalities. Then, we use the
integrated features to weight and refine different modality
features, thereby obtaining the refined features embedded
with cross-modality global context cue, which is important
for the perception of global information.
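To make the channel-spatial attention generation of the smAR unit more concrete, the sketch below builds a 3D attention tensor by broadcasting a channel attention vector against a spatial attention map and uses it to re-weight the features. The layer configuration (pooling, reduction ratio, residual connection) is assumed for illustration and is not claimed to match the published unit.

import torch
import torch.nn as nn

class ChannelSpatialAttentionSketch(nn.Module):
    # Illustrative self-modality refinement: a 3D attention tensor built from
    # channel attention (C x 1 x 1) and spatial attention (1 x H x W).
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, f):
        att_3d = self.channel_att(f) * self.spatial_att(f)  # broadcasts to C x H x W
        return f * att_3d + f                               # refine with a residual path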
In summary, our method is unique in that the cross-modality
interaction and refinement are closely coupled in a compre-
hensive and in-depth manner. In terms of the cross-modality
interaction, for learning the strengths and complementarities of
different modalities, we propose the PAI unit in the encoder
stage and the IGF unit in the decoder stage to jointly explore
the complementary relations of different modalities. In terms
of cross-modality refinement, considering the information
redundancy of the encoder features and the significance of
global context cues for the SOD, we design the pluggable
refinement middleware structure to refine the encoder features
from the self-modality and cross-modality perspectives. The
main contributions are summarized as follows:
• We propose an end-to-end cross-modality interaction and refinement network (CIR-Net) for RGB-D SOD by fully capturing and utilizing the cross-modality information in an interaction and refinement manner.
• The progressive attention guided integration unit and the importance gated fusion unit are proposed to achieve comprehensive cross-modality interaction in the encoder and decoder stages, respectively.
• The refinement middleware structure, including the self-modality attention refinement unit and the cross-modality weighting refinement unit, is designed to refine the multi-modality encoder features by encoding the self-modality 3D attention tensor and the cross-modality contextual dependencies.
• Without any pre-processing (e.g., HHA [53]) or post-processing (e.g., CRF [54]) techniques, our network achieves competitive performance against the state-of-the-art methods on six RGB-D SOD datasets.
The rest of this paper is organized as follows. In Section II,
we briefly review the related works of RGB-D SOD. In Section
III, we introduce the technical details of the proposed CIR-
Net. Then, the experiments including the comparisons with
state-of-the-art methods and ablation studies are conducted in
Section IV. Finally, the conclusion is drawn in Section V.
II. RELATED WORK
Different from RGB SOD models [55]–[59], depth modality
together with RGB appearance are introduced into RGB-D
SOD models. In the past ten years, a large number of methods have been proposed, which can be roughly divided into traditional
methods [60]–[68] and deep learning-based methods [18],
[37], [39], [44]–[50], [69]–[74]. Especially in recent years,
the deep learning-based methods have achieved great break-
throughs in the performance of RGB-D SOD. For the RGB-
D SOD task, how to make full use of the cross-modality
information and generate a more discriminative saliency-related representation is a challenging issue to be addressed [75].
In terms of the model structure, the existing works can be
roughly divided into single-stream, two-stream and three-
stream structures, as shown in Fig. 2(a)-(c).
For the single-stream models [72], [76]–[78], the early
feature fusion strategy is commonly adopted, where RGB
image and depth map are concatenated into four channels
as the input of a network. For example, Zhao et al. [72] adopted a single-stream encoder to make full use of the representation ability of the pre-trained network, and proposed a real-time and robust saliency detection model. Zhang et al.
[76], [77] proposed the first uncertainty-inspired RGB-D SOD
model based on conditional variational auto-encoder. Ji et al.
[78] proposed a novel collaborative learning framework that
integrated the edge, depth, and saliency collaborators, yielding a more lightweight and versatile network since it is free of depth inputs during testing. However, such models ignore the
difference between RGB and depth modalities and lack the
comprehensive cross-modality interaction.
The two-stream models [39], [46], [47], [50], [51], [79]–
[82] are currently the most widely used structure in RGB-
D SOD task, mainly including two independent branches to
respectively process RGB and depth modality information and
generate cross-modality features in the encoder or decoder
stage. For example, Li et al. [46] proposed an attention
steered interweave fusion network, which progressively and
interactively captures cross-modality complementarity via the
interweave fusion and weighs the saliency regions by the
steering of the deeply supervised attention mechanism. Li
et al. [47] adopted the late feature fusion strategy to gen-
erate cross-modality representation which combines high-
level RGB and depth features of two independent branches
in the decoder stage. Zhai et al. [51] leveraged the multi-
modal and multi-level features to devise a novel cascaded
refinement network, and the RGB and depth modalities can
be fused in a complementary way. Zhang et al. [83] focused
on the roles of RGB and depth modalities in the cross-
modality interaction, and presented a discrepant interaction
mode, i.e., the RGB modality and the depth modality guide
each other interactively. Some studies address the negative impact of low-quality depth maps by controlling, updating, or abandoning the depth information in the two-stream structure [79], [84]–[87]. Chen et al. [79] introduced
depth quality perception to control the impact of low-quality
depth maps while performing cross-modality interaction in the
two-stream structure. Chen et al. [84] estimated an additional
high-quality depth map as a complement to the original depth
map, and all these depth maps are fed into a selective fusion
network to achieve RGB-D SOD. Chen et al. [85] introduced a
depth-quality-aware subnet into the two-stream RGB-D SOD
structure to locate the most valuable depth regions.
Fig. 2. Comparisons among different network structures for RGB-D SOD. (a)-(c) denote the single-stream, two-stream, and three-stream structures, respectively. (d) is the proposed structure in this paper.

In addition, some studies [70], [88], [89] adopted the three-stream network structure for comprehensive cross-modality feature interaction, where RGB, depth, and RGB-D are embedded in three sub-networks for learning and interaction, respectively. For example, Fan et al. [88] designed a gate mechanism to filter out the low-quality depth maps using the decoder results of RGB, depth, and RGB-D branches.
Compared with the existing works, our work differs concep-
tually from theirs in that: Our proposed network architecture
(as shown in Fig. 2(d)) lies between the two-stream and three-stream forms, and the RGB-D stream is formed through the interaction of the high-level features learned by the single-modality branches. In this way, the parameters of the network can be reduced, and the RGB-D features can be better
established by our designed PAI unit. On balance, we classify
our network as a three-stream network architecture. This is
also the first point that makes our network different from other
networks. Second, in addition to the cross-modality feature
integration through the PAI unit in the encoder stage, we also
perform cross-modality information interaction in the decoder
stage to obtain the discriminative saliency prediction features.
Considering that the decoder features of RGB and depth
streams can further provide effective guidance information
(e.g., sharp edge, internal consistency) for RGB-D stream,
we design a convergence aggregation structure in the entire
decoder stage. In this way, we are dedicated to jointly
integrating cross-modality information into the encoder and
decoder stages in a more comprehensive manner. Third, to
better establish the relationship between encoder features
and decoder features, we introduce a refinement middleware
structure to further highlight the effective information before
decoding from the perspective of self-modality and cross-
modality. It is worth mentioning that such a middleware
structure is pluggable for three-stream networks.
III. PROPOSED METHOD
A. Overview
Fig. 3 shows the overview of the proposed CIR-Net, which is an encoder-decoder three-stream architecture equipped with a
refinement middleware between the encoder and the decoder.
In what follows, we detail the proposed method.
Fig. 3. The overview of the proposed CIR-Net. The extracted RGB and depth features from the backbone are denoted as $f_r^i$ and $f_d^i$, respectively, where $r$ and $d$ represent the RGB and depth streams, and $i \in \{1, 2, \ldots, 5\}$ indexes the feature level. In the feature encoder, we also use the PAI unit to generate the cross-modality RGB-D encoder features $f_{rgbd}^i$ ($i \in \{3, 4, 5\}$). Then, the top-layer RGB, depth, and RGB-D features are embedded into the refinement middleware consisting of a smAR unit and a cmWR unit to progressively refine the multi-modality encoder features in a self- and cross-modality manner. Finally, the decoder features of the RGB branch and depth branch flow into the corresponding RGB-D stream to learn more comprehensive interaction features through an IGF unit in the feature decoder stage. Note that all three branches output a corresponding saliency prediction map, and we use the output of the RGB-D branch as the final result.

The feature encoder aims to learn the multi-level three-stream features, i.e., RGB, depth, and RGB-D encoder features. First, the backbone network (e.g., ResNet50) is used to extract the multi-level features from the input RGB image and depth map, denoted as $f_r^i$ and $f_d^i$, respectively, where $i \in \{1, 2, 3, 4, 5\}$ indexes the feature level. Then, the RGB and depth features at high levels are fed into the proposed progressive attention-guided integration (PAI) unit to generate the cross-modality RGB-D encoder features $f_{rgbd}^i$ ($i \in \{3, 4, 5\}$). At this point, the three-stream encoder structure is formed, as shown on the left side of Fig. 3.
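The data flow of this three-stream encoder can be summarized in the following pseudo-PyTorch skeleton; the callable names (rgb_backbone, depth_backbone, pai_units) and the exact way the previous PAI output is threaded through levels 3-5 are assumptions made only to visualize the wiring, not the released implementation.

# Illustrative wiring of the three-stream encoder (a sketch, not the released code).
def encode(rgb, depth, rgb_backbone, depth_backbone, pai_units):
    f_r = rgb_backbone(rgb)      # [f_r^1, ..., f_r^5]: multi-level RGB features
    f_d = depth_backbone(depth)  # [f_d^1, ..., f_d^5]: multi-level depth features
    f_rgbd = {}
    prev = None
    for i in (3, 4, 5):          # only the high-level features are fused
        # the PAI unit performs progressive cross-modality and cross-level fusion
        prev = pai_units[i](f_r[i - 1], f_d[i - 1], prev)
        f_rgbd[i] = prev         # cross-modality RGB-D encoder feature f_rgbd^i
    return f_r, f_d, f_rgbd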
Considering the information redundancy in the self-modality
and the content complementarity in the cross-modality, we in-
troduce a refinement middleware structure to further highlight
the effective information before decoding. Specifically, a two-stage refinement mechanism composed of a self-modality attention refinement (smAR) unit and a cross-modality weighting refinement (cmWR) unit is designed to progressively refine the
multi-modality top-level encoder features in a self- and cross-
modality manner.
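For the cross-modality stage, a compact non-local-style sketch is given below: the three modalities are combined, long-range dependencies are computed on the combined features, and the resulting global context is used to re-weight each modality. The projection sizes and the simple additive combination are our assumptions, not the exact cmWR formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityWeightingSketch(nn.Module):
    # Illustrative cross-modality refinement: long-range dependencies computed
    # on the combined RGB/depth/RGB-D features re-weight each modality.
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, f_r, f_d, f_rgbd):
        fused = f_r + f_d + f_rgbd                        # combine the three modalities
        b, c, h, w = fused.shape
        q = self.query(fused).flatten(2).transpose(1, 2)  # B x HW x C'
        k = self.key(fused).flatten(2)                    # B x C' x HW
        v = self.value(fused).flatten(2).transpose(1, 2)  # B x HW x C
        context = torch.bmm(F.softmax(torch.bmm(q, k), dim=-1), v)
        context = context.transpose(1, 2).reshape(b, c, h, w)
        # the global context map re-weights each modality-specific feature
        w_map = torch.sigmoid(context)
        return f_r * w_map, f_d * w_map, f_rgbd * w_map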
In the decoder stage, we devise a novel convergence aggre-
gation structure, in which the decoder features of the RGB and depth streams flow into the RGB-D stream at the corresponding level to achieve cross-modality interaction. During
aggregation, an importance gated fusion (IGF) unit is proposed
to integrate the corresponding decoder features of RGB and
depth streams and the previous IGF outputs in a dynamic
weighting manner. Finally, the output features of the last IGF
unit are used to infer the final saliency map.
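Continuing the same pseudo-PyTorch convention, the sketch below shows one way such a convergence aggregation could chain IGF-style units over the decoder levels; the deconvolution placement and the single-channel prediction head are illustrative assumptions rather than the exact design.

import torch

# Illustrative convergence aggregation over decoder levels (a sketch only).
def decode(rgb_dec_feats, depth_dec_feats, f_rgbd_top, igf_units, deconvs, pred_head):
    f = f_rgbd_top                                   # refined top-level RGB-D features
    for i, (d_r, d_d) in enumerate(zip(rgb_dec_feats, depth_dec_feats)):
        # RGB and depth decoder features of this level flow into the RGB-D stream
        f = igf_units[i](d_r, d_d, f)
        f = deconvs[i](f)                            # up-sample for the next level
    return torch.sigmoid(pred_head(f))               # final saliency map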
B. Progressive Attention Guided Integration Unit
Taking the complementarity and diversity of different
modalities into account, effective cross-modality information
interaction plays a critical role in the RGB-D SOD task.
For an encoder-decoder network architecture, the existing
interaction strategies are mainly designed separately in
the encoder stage [48], [49] or decoder stage [37], [47],
[50]. In comparison, we design specialized modules in
both encoder and decoder stages according to the different
interaction purposes. To achieve that, two key issues need to
be addressed: (1) how to effectively integrate and generate
the RGB-D representations based on the multi-level RGB
and depth features in the encoder stage, and (2) how the
single-modality stream can better collaborate with the RGB-D
stream to learn more discriminative saliency-related features and predict a more accurate saliency map in the decoder stage.
To this end, a PAI unit in the encoder stage and an IGF unit
in the decoder stage are proposed in our method. The IGF
unit will be introduced in Section III-D.
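Before detailing the unit, a rough sketch of what an attention-guided integration at one encoder level could look like is given below; the spatial-attention guidance, the handling of the previous level's output, and all layer choices are our own assumptions made only to visualize the idea, not the published PAI unit.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PAISketch(nn.Module):
    # Illustrative attention-guided integration for one encoder level.
    # All inputs are assumed to share the same channel count.
    def __init__(self, channels):
        super().__init__()
        self.att = nn.Sequential(nn.Conv2d(2 * channels, 1, 7, padding=3), nn.Sigmoid())
        self.fuse = nn.Conv2d(3 * channels, channels, 3, padding=1)

    def forward(self, f_r, f_d, prev_rgbd=None):
        # spatial attention generated from the current RGB and depth features
        a = self.att(torch.cat([f_r, f_d], dim=1))
        if prev_rgbd is None:
            prev_rgbd = torch.zeros_like(f_r)
        else:
            # bring the previous level's fused result to the current resolution
            prev_rgbd = F.interpolate(prev_rgbd, size=f_r.shape[2:],
                                      mode='bilinear', align_corners=False)
        # attention-guided cross-modality and cross-level fusion
        return self.fuse(torch.cat([f_r * a, f_d * a, prev_rgbd], dim=1))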
Specifically, to effectively integrate the RGB-D represen-
tations in the encoder stage, we consider two aspects when
designing the PAI unit: (1) sufficient multi-level information
fusion, and (2) effective feature selection and highlighting. For
the former, considering that the features of different levels in the encoder contain different information with varying scales, receptive fields, and contents, a progressive cross-level fusion strategy is designed to obtain more comprehensive RGB-D representations in a coarse-to-fine
manner. For the latter, although the encoder features contain
rich multi-level information, the commonly used fusion strat-
egy (e.g., concat-conv) may introduce information redundancy
and easily confuse the feature representations. Therefore, for
feature selection and enhancement, we introduce the spatial
attention scheme to guide the cross-level and cross-modality