CIR-Net: Cross-modality Interaction and
Refinement for RGB-D Salient Object Detection
Runmin Cong, Member, IEEE, Qinwei Lin, Chen Zhang, Chongyi Li, Xiaochun Cao, Senior Member, IEEE,
Qingming Huang, Fellow, IEEE, and Yao Zhao, Senior Member, IEEE
Abstract—Focusing on the issue of how to effectively capture
and utilize cross-modality information in the RGB-D salient object detection (SOD) task, we present a convolutional neural network
(CNN) model, named CIR-Net, based on the novel cross-modality
interaction and refinement. For the cross-modality interaction,
1) a progressive attention guided integration unit is proposed
to sufficiently integrate RGB-D feature representations in the
encoder stage, and 2) a convergence aggregation structure is
proposed, which flows the RGB and depth decoding features into
the corresponding RGB-D decoding streams via an importance
gated fusion unit in the decoder stage. For the cross-modality
refinement, we insert a refinement middleware structure between
the encoder and the decoder, in which the RGB, depth, and
RGB-D encoder features are further refined by successively using
a self-modality attention refinement unit and a cross-modality
weighting refinement unit. At last, with the gradually refined
features, we predict the saliency map in the decoder stage.
Extensive experiments on six popular RGB-D SOD benchmarks
demonstrate that our network outperforms the state-of-the-art
saliency detectors both qualitatively and quantitatively. The code
and results can be found at https://rmcong.github.io/proj_CIRNet.html.
Index Terms—Salient object detection, RGB-D images, Cross-
modality attention, Cross-modality interaction.
I. INTRODUCTION
When viewing an image, humans are involuntarily
attracted by some objects or regions in the image
(e.g., the Smurfs in the second image of Fig. 1), which is
mainly caused by the human visual attention mechanism, and
these objects are called salient objects [1]–[3]. Simulating
this scheme, in the field of computer vision, salient object
detection (SOD) is the task of automatically locating the most
visually attractive objects or regions in a scene, which has been
successfully applied to numerous tasks, such as segmentation [4]–[9], retrieval [10], enhancement [11]–[15], and quality assessment [16], [17].

Fig. 1. Visual examples of different methods. (a) RGB images. (b) Depth maps. (c) Ground truths. (d) Our results. (e)-(f) Saliency maps produced by FRDT [18] and GCPANet [19], respectively.

Runmin Cong, Qinwei Lin, Chen Zhang, and Yao Zhao are with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China, and also with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China (e-mail: rmcong@bjtu.edu.cn, lqw22@mails.tsinghua.edu.cn, chen.zhang@bjtu.edu.cn, yzhao@bjtu.edu.cn).
Chongyi Li is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: lichongyi25@gmail.com).
Xiaochun Cao is with the School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, 518107, China (e-mail: caoxiaochun@mail.sysu.edu.cn).
Qingming Huang is with the School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China, also with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: qmhuang@ucas.ac.cn).
With the development of SOD research, many subtasks, such as co-salient object detection (CoSOD) [20]–[22], remote sensing SOD [23]–[27], video SOD [28]–[30], and light field SOD [31], have also been developed. In fact, the binocular structure of the human visual system can also perceive the depth of field of a scene and thereby produce stereo perception. Expressed in the form of an image, this depth relationship is a depth/disparity map. In recent years, the development
and popularization of depth sensors, especially the rise of
affordable and portable consumer depth cameras, has further
promoted the applications of RGB-D data, such as depth map
super-resolution [32]–[34], depth estimation [35], superpixel
segmentation [36], and saliency detection [37]–[43]. For the
RGB-D images, the RGB image contains abundant details
and appearance information (e.g., color, texture, structure, etc)
while the depth map provides some valuable supplementary
information (e.g., shape, surface normals, internal consistency,
etc). Recently, more and more studies focus on introducing the depth cue into the SOD task to effectively suppress background interference in complex scenes and more completely highlight the foreground salient regions. For example,
in Fig. 1, the first two images have complex and cluttered
backgrounds, and the color contrast between the salient object
and the background in the fourth image is low. Thus, for the
RGB SOD method (i.e., the GCPANet [19]) shown in the last
row of Fig. 1, it is difficult to accurately locate the salient
regions with a clean background and a complete structure.
In comparison, the RGB-D SOD methods (e.g., the fourth
and fifth rows of Fig. 1) can alleviate this problem with
the introduction of depth information. Notably, our method
has better object positioning ability, completeness preserving
ability, and background suppression ability.
The effectiveness of the depth map for the SOD task has been validated in previous works [44]–[47]; however, how to effectively utilize and integrate the RGB information and the depth cue remains an open issue, because the RGB image and the depth map belong to different modalities with different attributes. To address this, we design a three-stream network to fully capture and utilize cross-modality
information. Considering the strengths and complementarities of different modalities, the three-stream structure with independent RGB and depth streams sufficiently preserves the rich information of each modality and explores their complementary relations, which allows cross-modality information to be jointly integrated in both the encoder and decoder stages in a more comprehensive and in-depth manner than a two-stream structure. This is manifested in the following two aspects:
1) Cross-Modality Interaction. For cross-modality information, the primary problem we face is how to make the two modalities interact. Specifically, the purpose is to learn the strengths
and complementarities of different modalities, then obtain
more comprehensive and discriminative feature representa-
tions. Different from the existing cross-modality interaction
methods that operated only in the encoder stage [48], [49] or
decoder stage [37], [47], [50], we are dedicated to integrating cross-
modality information into both encoder and decoder stages
jointly in a more comprehensive and in-depth manner, which
sufficiently explores the complementary relations of different
modalities. Concretely, in the feature encoder stage, we design
a progressive attention guided integration (PAI) unit to fuse
cross-modality and cross-level features, thereby attaining the
RGB-D encoder representations. In the feature decoder stage,
we design an aggregation structure to allow RGB and depth
decoder features to flow into the RGB-D mainstream branch
and generate more comprehensive saliency-related features.
In this structure, the decoder features of the previous layer and the RGB and depth decoder features of the corresponding layer are integrated into confluence decoder features through an importance gated fusion (IGF) unit in a dynamic weighting
manner. The gradually refined decoder features of the last layer
are then used to predict the final saliency map.
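As a concrete illustration of this dynamic weighting idea, the PyTorch-style sketch below gates the RGB and depth decoder features before fusing them with the previous RGB-D decoder features. It is only a minimal sketch under our own assumptions (module name, layer choices, gating form); it is not the exact IGF design.

import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    # Illustrative dynamic-weighting fusion of RGB, depth, and previous RGB-D
    # decoder features (a sketch, not the paper's IGF unit).
    def __init__(self, channels):
        super().__init__()
        # predict per-pixel importance gates for the RGB and depth branches
        self.gate = nn.Sequential(
            nn.Conv2d(3 * channels, 2, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_rgb, f_depth, f_rgbd_prev):
        x = torch.cat([f_rgb, f_depth, f_rgbd_prev], dim=1)
        g = self.gate(x)                     # B x 2 x H x W importance maps
        f_rgb_w = f_rgb * g[:, 0:1]          # re-weighted RGB decoder features
        f_depth_w = f_depth * g[:, 1:2]      # re-weighted depth decoder features
        return self.fuse(torch.cat([f_rgb_w, f_depth_w, f_rgbd_prev], dim=1))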
2) Cross-Modality Refinement. In addition to cross-
modality interaction, refining the most valuable information
from different modalities is also crucial for RGB-D SOD
task. To this end, we insert a refinement middleware
between the encoder and the decoder, including the self-
modality refinement and cross-modality refinement. For
the self-modality refinement, in order to reduce the feature
redundancy of the channel dimension and emphasize the
important location of the spatial dimension, we propose
a simple but effective self-modality attention refinement
(smAR) unit, which replaces the commonly used progressive
interaction [51] or feature fusion [47] method with our
proposed channel-spatial attention generation. We directly
integrate spatial attention and channel attention in the feature
map space to generate a 3D attention tensor that is used to
refine the single modality features, which not only reduces the
computational cost, but also better highlights the important
features. Further, we design a cross-modality weighting
refinement (cmWR) unit to refine the multi-modality features
by considering cross-modality complementary information
and cross-modality global contextual dependencies. Inspired
by the non-local model [52], the RGB features, depth features,
and RGB-D features are integrated to capture the long-range
dependencies among different modalities. Then, we use the
integrated features to weight and refine different modality
features, thereby obtaining the refined features embedded
with cross-modality global context cue, which is important
for the perception of global information.
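To make the channel-spatial attention generation of the smAR unit more concrete, the sketch below builds a 3D attention tensor by broadcasting a channel attention vector against a spatial attention map and uses it to re-weight the features. The layer configuration (pooling, reduction ratio, residual connection) is assumed for illustration and is not claimed to match the published unit.

import torch
import torch.nn as nn

class ChannelSpatialAttentionSketch(nn.Module):
    # Illustrative self-modality refinement: a 3D attention tensor built from
    # channel attention (C x 1 x 1) and spatial attention (1 x H x W).
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, f):
        att_3d = self.channel_att(f) * self.spatial_att(f)  # broadcasts to C x H x W
        return f * att_3d + f                               # refine with a residual path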
In summary, our method is unique in that the cross-modality
interaction and refinement are closely coupled in a compre-
hensive and in-depth manner. In terms of the cross-modality
interaction, for learning the strengths and complementarities of
different modalities, we propose the PAI unit in the encoder
stage and the IGF unit in the decoder stage to jointly explore
the complementary relations of different modalities. In terms
of cross-modality refinement, considering the information
redundancy of the encoder features and the significance of
global context cues for the SOD, we design the pluggable
refinement middleware structure to refine the encoder features
from the self-modality and cross-modality perspectives. The
main contributions are summarized as follows:
• We propose an end-to-end cross-modality interaction and refinement network (CIR-Net) for RGB-D SOD by fully capturing and utilizing the cross-modality information in an interaction and refinement manner.
• The progressive attention guided integration unit and the importance gated fusion unit are proposed to achieve comprehensive cross-modality interaction in the encoder and decoder stages, respectively.
• The refinement middleware structure, including the self-modality attention refinement unit and the cross-modality weighting refinement unit, is designed to refine the multi-modality encoder features by encoding the self-modality 3D attention tensor and the cross-modality contextual dependencies.
• Without any pre-processing (e.g., HHA [53]) or post-processing (e.g., CRF [54]) techniques, our network achieves competitive performance against the state-of-the-art methods on six RGB-D SOD datasets.
The rest of this paper is organized as follows. In Section II,
we briefly review the related works of RGB-D SOD. In Section
III, we introduce the technical details of the proposed CIR-
Net. Then, the experiments including the comparisons with
state-of-the-art methods and ablation studies are conducted in
Section IV. Finally, the conclusion is drawn in Section V.
II. RELATED WORK
Different from RGB SOD models [55]–[59], depth modality
together with RGB appearance are introduced into RGB-D
SOD models. In the past ten years, a large number of methods have been proposed, which can be roughly divided into traditional
methods [60]–[68] and deep learning-based methods [18],
[37], [39], [44]–[50], [69]–[74]. Especially in recent years,
the deep learning-based methods have achieved great break-
throughs in the performance of RGB-D SOD. For the RGB-
D SOD task, how to make full use of the cross-modality
information and generate a more discriminative saliency-related representation is a challenging issue to be addressed [75].
In terms of the model structure, the existing works can be
roughly divided into single-stream, two-stream and three-
stream structures, as shown in Fig. 2(a)-(c).
For the single-stream models [72], [76]–[78], the early
feature fusion strategy is commonly adopted, where RGB
image and depth map are concatenated into four channels
as the input of a network. For example, Zhao et al. [72] adopted a single-stream encoder to make full use of the representation ability of the pre-trained network, and proposed a real-time and robust saliency detection model. Zhang et al.
[76], [77] proposed the first uncertainty-inspired RGB-D SOD
model based on conditional variational auto-encoder. Ji et al.
[78] proposed a novel collaborative learning framework that
integrated the edge, depth, and saliency collaborators, yielding a more lightweight and versatile network since it is free of depth inputs during testing. However, such models ignore the
difference between RGB and depth modalities and lack the
comprehensive cross-modality interaction.
The two-stream models [39], [46], [47], [50], [51], [79]–
[82] are currently the most widely used structure in RGB-
D SOD task, mainly including two independent branches to
respectively process RGB and depth modality information and
generate cross-modality features in the encoder or decoder
stage. For example, Li et al. [46] proposed an attention
steered interweave fusion network, which progressively and
interactively captures cross-modality complementarity via the
interweave fusion and weighs the saliency regions by the
steering of the deeply supervised attention mechanism. Li
et al. [47] adopted the late feature fusion strategy to gen-
erate cross-modality representation which combines high-
level RGB and depth features of two independent branches
in the decoder stage. Zhai et al. [51] leveraged the multi-
modal and multi-level features to devise a novel cascaded
refinement network, and the RGB and depth modalities can
be fused in a complementary way. Zhang et al. [83] focused
on the roles of RGB and depth modalities in the cross-
modality interaction, and presented a discrepant interaction
mode, i.e., the RGB modality and the depth modality guide
each other interactively. Some studies address the negative impact of low-quality depth maps by controlling, updating, or abandoning the depth information in the two-stream structure [79], [84]–[87]. Chen et al. [79] introduced
depth quality perception to control the impact of low-quality
depth maps while performing cross-modality interaction in the
two-stream structure. Chen et al. [84] estimated an additional
high-quality depth map as a complement to the original depth
map, and all these depth maps are fed into a selective fusion
network to achieve RGB-D SOD. Chen et al. [85] introduced a
depth-quality-aware subnet into the two-stream RGB-D SOD
structure to locate the most valuable depth regions.
Fig. 2. Comparisons among different network structures for RGB-D SOD. (a)-(c) denote the single-stream, two-stream, and three-stream structures, respectively. (d) is the proposed structure in this paper.

In addition, some studies [70], [88], [89] adopted the three-stream network structure for comprehensive cross-modality feature interaction, where RGB, depth, and RGB-D are embedded in three sub-networks for learning and interaction, respectively. For example, Fan et al. [88] designed a gate mechanism to filter out the low-quality depth maps using the decoder results of RGB, depth, and RGB-D branches.
Compared with the existing works, our work differs concep-
tually from theirs in that: Our proposed network architecture
(as shown in Fig. 2(d)) lies between the two-stream and three-stream forms, and the RGB-D stream is formed through the interaction of the high-level features learned by the single-modality branches. In this way, the parameters of the network can be reduced, and the RGB-D features can be better
established by our designed PAI unit. On balance, we classify
our network as a three-stream network architecture. This is
also the first point that makes our network different from other
networks. Second, in addition to the cross-modality feature
integration through the PAI unit in the encoder stage, we also
perform cross-modality information interaction in the decoder
stage to obtain the discriminative saliency prediction features.
Considering that the decoder features of RGB and depth
streams can further provide effective guidance information
(e.g., sharp edge, internal consistency) for RGB-D stream,
we design a convergence aggregation structure in the entire
decoder stage. In this way, we are dedicated to jointly
integrating cross-modality information into the encoder and
decoder stages in a more comprehensive manner. Third, to
better establish the relationship between encoder features
and decoder features, we introduce a refinement middleware
structure to further highlight the effective information before
decoding from the perspective of self-modality and cross-
modality. It is worth mentioning that such a middleware
structure is pluggable for three-stream networks.
III. PROPOSED METHOD
A. Overview
Fig. 3 shows the overview of the proposed CIR-Net, which is an encoder-decoder three-stream architecture equipped with a
refinement middleware between the encoder and the decoder.
In what follows, we detail the proposed method.
Fig. 3. The overview of the proposed CIR-Net. The extracted RGB and depth features from the backbone are denoted as $f_r^i$ and $f_d^i$, respectively, where $r$ and $d$ represent the RGB and depth streams, and $i \in \{1, 2, \ldots, 5\}$ indexes the feature level. In the feature encoder, we also use the PAI unit to generate the cross-modality RGB-D encoder features $f_{rgbd}^i$ ($i \in \{3, 4, 5\}$). Then, the top-layer RGB, depth, and RGB-D features are embedded into the refinement middleware consisting of a smAR unit and a cmWR unit to progressively refine the multi-modality encoder features in a self- and cross-modality manner. Finally, the decoder features of the RGB branch and depth branch flow into the corresponding RGB-D stream to learn more comprehensive interaction features through an IGF unit in the feature decoder stage. Note that all three branches output a corresponding saliency prediction map, and we use the output of the RGB-D branch as the final result.

The feature encoder aims to learn the multi-level three-stream features, i.e., RGB, depth, and RGB-D encoder features. First, the backbone network (e.g., ResNet50) is used to extract the multi-level features from the input RGB image and depth map, denoted as $f_r^i$ and $f_d^i$, respectively, where $i \in \{1, 2, 3, 4, 5\}$ indexes the feature level. Then, the RGB and depth features at high levels are fed into the proposed progressive attention-guided integration (PAI) unit to generate the cross-modality RGB-D encoder features $f_{rgbd}^i$ ($i \in \{3, 4, 5\}$). At this point, the three-stream encoder structure is formed, as shown on the left side of Fig. 3.
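The data flow of this three-stream encoder can be summarized in the following pseudo-PyTorch skeleton; the callable names (rgb_backbone, depth_backbone, pai_units) and the exact way the previous PAI output is threaded through levels 3-5 are assumptions made only to visualize the wiring, not the released implementation.

# Illustrative wiring of the three-stream encoder (a sketch, not the released code).
def encode(rgb, depth, rgb_backbone, depth_backbone, pai_units):
    f_r = rgb_backbone(rgb)      # [f_r^1, ..., f_r^5]: multi-level RGB features
    f_d = depth_backbone(depth)  # [f_d^1, ..., f_d^5]: multi-level depth features
    f_rgbd = {}
    prev = None
    for i in (3, 4, 5):          # only the high-level features are fused
        # the PAI unit performs progressive cross-modality and cross-level fusion
        prev = pai_units[i](f_r[i - 1], f_d[i - 1], prev)
        f_rgbd[i] = prev         # cross-modality RGB-D encoder feature f_rgbd^i
    return f_r, f_d, f_rgbd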
Considering the information redundancy in the self-modality
and the content complementarity in the cross-modality, we in-
troduce a refinement middleware structure to further highlight
the effective information before decoding. Specifically, a two-stage refinement mechanism composed of a self-modality attention refinement (smAR) unit and a cross-modality weighting refinement (cmWR) unit is designed to progressively refine the
multi-modality top-level encoder features in a self- and cross-
modality manner.
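For the cross-modality stage, a compact non-local-style sketch is given below: the three modalities are combined, long-range dependencies are computed on the combined features, and the resulting global context is used to re-weight each modality. The projection sizes and the simple additive combination are our assumptions, not the exact cmWR formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityWeightingSketch(nn.Module):
    # Illustrative cross-modality refinement: long-range dependencies computed
    # on the combined RGB/depth/RGB-D features re-weight each modality.
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, f_r, f_d, f_rgbd):
        fused = f_r + f_d + f_rgbd                        # combine the three modalities
        b, c, h, w = fused.shape
        q = self.query(fused).flatten(2).transpose(1, 2)  # B x HW x C'
        k = self.key(fused).flatten(2)                    # B x C' x HW
        v = self.value(fused).flatten(2).transpose(1, 2)  # B x HW x C
        context = torch.bmm(F.softmax(torch.bmm(q, k), dim=-1), v)
        context = context.transpose(1, 2).reshape(b, c, h, w)
        # the global context map re-weights each modality-specific feature
        w_map = torch.sigmoid(context)
        return f_r * w_map, f_d * w_map, f_rgbd * w_map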
In the decoder stage, we devise a novel convergence aggre-
gation structure, in which the decoder features of the RGB and depth streams flow into the RGB-D stream at the corresponding level to achieve cross-modality interaction. During
aggregation, an importance gated fusion (IGF) unit is proposed
to integrate the corresponding decoder features of RGB and
depth streams and the previous IGF outputs in a dynamic
weighting manner. Finally, the output features of the last IGF
unit are used to infer the final saliency map.
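Continuing the same pseudo-PyTorch convention, the sketch below shows one way such a convergence aggregation could chain IGF-style units over the decoder levels; the deconvolution placement and the single-channel prediction head are illustrative assumptions rather than the exact design.

import torch

# Illustrative convergence aggregation over decoder levels (a sketch only).
def decode(rgb_dec_feats, depth_dec_feats, f_rgbd_top, igf_units, deconvs, pred_head):
    f = f_rgbd_top                                   # refined top-level RGB-D features
    for i, (d_r, d_d) in enumerate(zip(rgb_dec_feats, depth_dec_feats)):
        # RGB and depth decoder features of this level flow into the RGB-D stream
        f = igf_units[i](d_r, d_d, f)
        f = deconvs[i](f)                            # up-sample for the next level
    return torch.sigmoid(pred_head(f))               # final saliency map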
B. Progressive Attention Guided Integration Unit
Taking the complementarity and diversity of different
modalities into account, effective cross-modality information
interaction plays a critical role in the RGB-D SOD task.
For an encoder-decoder network architecture, the existing
interaction strategies are mainly designed separately in
the encoder stage [48], [49] or decoder stage [37], [47],
[50]. In comparison, we design specialized modules in
both encoder and decoder stages according to the different
interaction purposes. To achieve that, two key issues need to
be addressed: (1) how to effectively integrate and generate
the RGB-D representations based on the multi-level RGB
and depth features in the encoder stage, and (2) how the
single-modality stream can better collaborate with the RGB-D
stream to learn more discriminative saliency-related features and predict a more accurate saliency map in the decoder stage.
To this end, a PAI unit in the encoder stage and an IGF unit
in the decoder stage are proposed in our method. The IGF
unit will be introduced in Section III-D.
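Before detailing the unit, a rough sketch of what an attention-guided integration at one encoder level could look like is given below; the spatial-attention guidance, the handling of the previous level's output, and all layer choices are our own assumptions made only to visualize the idea, not the published PAI unit.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PAISketch(nn.Module):
    # Illustrative attention-guided integration for one encoder level.
    # All inputs are assumed to share the same channel count.
    def __init__(self, channels):
        super().__init__()
        self.att = nn.Sequential(nn.Conv2d(2 * channels, 1, 7, padding=3), nn.Sigmoid())
        self.fuse = nn.Conv2d(3 * channels, channels, 3, padding=1)

    def forward(self, f_r, f_d, prev_rgbd=None):
        # spatial attention generated from the current RGB and depth features
        a = self.att(torch.cat([f_r, f_d], dim=1))
        if prev_rgbd is None:
            prev_rgbd = torch.zeros_like(f_r)
        else:
            # bring the previous level's fused result to the current resolution
            prev_rgbd = F.interpolate(prev_rgbd, size=f_r.shape[2:],
                                      mode='bilinear', align_corners=False)
        # attention-guided cross-modality and cross-level fusion
        return self.fuse(torch.cat([f_r * a, f_d * a, prev_rgbd], dim=1))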
Specifically, to effectively integrate the RGB-D represen-
tations in the encoder stage, we consider two aspects when
designing the PAI unit: (1) sufficient multi-level information
fusion, and (2) effective feature selection and highlighting. For
the former, considering that the features of different levels in the encoder contain different information with varying scales, receptive fields, and contents, a progressive cross-level fusion strategy is designed to obtain more comprehensive RGB-D representations in a coarse-to-fine
manner. For the latter, although the encoder features contain
rich multi-level information, the commonly used fusion strat-
egy (e.g., concat-conv) may introduce information redundancy
and easily confuse the feature representations. Therefore, for
feature selection and enhancement, we introduce the spatial
attention scheme to guide the cross-level and cross-modality