and fifth rows of Fig. 1) can alleviate this problem with
the introduction of depth information. Notably, our method exhibits better object positioning, completeness preservation, and background suppression.
The effectiveness of the depth map for the SOD task has been validated in previous works [44]–[47]; however, how to effectively utilize and integrate RGB information and the depth cue remains an open issue, because RGB images and depth maps belong to different modalities that have
different attributes. To address this issue, we design a three-stream network to fully capture and utilize cross-modality information. Considering the strengths and complementarities of different modalities, the three-stream structure with independent RGB and depth streams sufficiently preserves the rich information of each modality and explores their complementary relations, which is beneficial to jointly integrating cross-modality information in the encoder and decoder stages in a more comprehensive and in-depth manner than the two-stream structure. This is manifested in the following two aspects:
1) Cross-Modality Interaction. For cross-modality information, the primary problem we face is how to make the two modalities interact. Specifically, the purpose is to learn the strengths and complementarities of different modalities and thereby obtain more comprehensive and discriminative feature representations. Different from existing cross-modality interaction methods that operate only in the encoder stage [48], [49] or the decoder stage [37], [47], [50], we are dedicated to integrating cross-modality information into both the encoder and decoder stages jointly in a more comprehensive and in-depth manner, which
sufficiently explores the complementary relations of different
modalities. Concretely, in the feature encoder stage, we design
a progressive attention guided integration (PAI) unit to fuse
cross-modality and cross-level features, thereby attaining the
RGB-D encoder representations. In the feature decoder stage,
we design an aggregation structure to allow RGB and depth
decoder features to flow into the RGB-D mainstream branch
and generate more comprehensive saliency-related features.
In this structure, the decoder features of the previous layer and the RGB and depth decoder features of the corresponding layer are integrated into confluence decoder features through an importance gated fusion (IGF) unit in a dynamic weighting manner. The gradually refined decoder features of the last layer are then used to predict the final saliency map.
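To make the dynamic weighting idea concrete, the following is a minimal PyTorch-style sketch of such a gated fusion step; the module name, the sigmoid gating, and the tensor shapes are illustrative assumptions rather than the exact IGF formulation.

```python
import torch
import torch.nn as nn

class GatedDecoderFusion(nn.Module):
    """Hypothetical sketch of importance-gated fusion in the decoder:
    the previous-layer decoder features and the RGB/depth decoder
    features of the current layer are weighted by learned gates and
    merged into confluence decoder features."""
    def __init__(self, channels: int):
        super().__init__()
        # Predict importance gates for the three concatenated inputs.
        self.gate = nn.Sequential(
            nn.Conv2d(3 * channels, 3 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, prev_dec, rgb_dec, depth_dec):
        x = torch.cat([prev_dec, rgb_dec, depth_dec], dim=1)  # (B, 3C, H, W)
        g = self.gate(x)          # dynamic weights in [0, 1]
        return self.fuse(x * g)   # confluence decoder features, (B, C, H, W)

# Example: fuse = GatedDecoderFusion(64); out = fuse(prev, rgb, depth)
# where prev, rgb, and depth are all (B, 64, H, W) feature maps.
```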
2) Cross-Modality Refinement. In addition to cross-modality interaction, refining the most valuable information from different modalities is also crucial for the RGB-D SOD task. To this end, we insert a refinement middleware between the encoder and the decoder, consisting of self-modality refinement and cross-modality refinement. For the self-modality refinement, in order to reduce feature redundancy in the channel dimension and emphasize important locations in the spatial dimension, we propose a simple but effective self-modality attention refinement (smAR) unit, which replaces the commonly used progressive interaction [51] or feature fusion [47] with our proposed channel-spatial attention generation. We directly integrate spatial attention and channel attention in the feature map space to generate a 3D attention tensor that is used to refine the single-modality features, which not only reduces the computational cost but also better highlights the important
features. Further, we design a cross-modality weighting
refinement (cmWR) unit to refine the multi-modality features
by considering cross-modality complementary information
and cross-modality global contextual dependencies. Inspired
by the non-local model [52], the RGB features, depth features,
and RGB-D features are integrated to capture the long-range
dependencies among different modalities. Then, we use the integrated features to weight and refine the features of each modality, thereby obtaining refined features embedded with cross-modality global context cues, which are important for the perception of global information.
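A minimal PyTorch-style sketch of the two refinement steps is given below for illustration: the first module combines channel and spatial attention into a 3D attention tensor that re-weights a single modality, and the second fuses the RGB, depth, and RGB-D features, captures long-range dependencies with a non-local style operation, and uses the result to re-weight each modality. All module names, layer choices, and shapes are assumptions and do not reproduce the exact smAR and cmWR formulations.

```python
import torch
import torch.nn as nn

class SelfModalityAttention(nn.Module):
    """Hypothetical sketch of self-modality refinement: a channel vector and
    a spatial map are combined into a 3D attention tensor (B, C, H, W) that
    re-weights the features of one modality."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                         # (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial_att = nn.Conv2d(channels, 1, kernel_size=7, padding=3)  # (B, 1, H, W)

    def forward(self, feat):
        # Broadcasting the channel and spatial terms yields a full 3D attention tensor.
        att = torch.sigmoid(self.channel_att(feat) * self.spatial_att(feat))
        return feat * att  # refined single-modality features


class CrossModalityWeighting(nn.Module):
    """Hypothetical sketch of cross-modality refinement: fuse the three
    modalities, compute non-local style global context, and use it to
    weight the RGB, depth, and RGB-D features."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, 1)
        self.query = nn.Conv2d(channels, channels // 2, 1)
        self.key = nn.Conv2d(channels, channels // 2, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, rgb, depth, rgbd):
        b, c, h, w = rgb.shape
        fused = self.fuse(torch.cat([rgb, depth, rgbd], dim=1))
        q = self.query(fused).flatten(2).transpose(1, 2)       # (B, HW, C/2)
        k = self.key(fused).flatten(2)                         # (B, C/2, HW)
        v = self.value(fused).flatten(2).transpose(1, 2)       # (B, HW, C)
        attn = torch.softmax(q @ k / (c // 2) ** 0.5, dim=-1)  # long-range dependencies
        context = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        weight = torch.sigmoid(context)                        # cross-modality global context
        return rgb * weight, depth * weight, rgbd * weight
```

In a full network, such refinement would sit between the encoder and decoder, with the refined features of each stream then passed to the corresponding decoder.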
In summary, our method is unique in that the cross-modality
interaction and refinement are closely coupled in a compre-
hensive and in-depth manner. In terms of the cross-modality interaction, to learn the strengths and complementarities of different modalities, we propose the PAI unit in the encoder stage and the IGF unit in the decoder stage to jointly explore the complementary relations of different modalities. In terms of cross-modality refinement, considering the information redundancy of the encoder features and the significance of global context cues for SOD, we design a pluggable refinement middleware structure to refine the encoder features from the self-modality and cross-modality perspectives. The
main contributions are summarized as follows:
• We propose an end-to-end cross-modality interaction and
refinement network (CIR-Net) for RGB-D SOD by fully
capturing and utilizing the cross-modality information in
an interaction and refinement manner.
• The progressive attention guided integration unit and the
importance gated fusion unit are proposed to achieve
comprehensive cross-modality interaction in the encoder
and decoder stages respectively.
• The refinement middleware structure, including the self-modality attention refinement unit and the cross-modality weighting refinement unit, is designed to refine the multi-
modality encoder features by encoding the self-modality
3D attention tensor and the cross-modality contextual
dependencies.
• Without any pre-processing (e.g., HHA [53]) or post-
processing (e.g., CRF [54]) techniques, our network
achieves competitive performance against the state-of-
the-art methods on six RGB-D SOD datasets.
The rest of this paper is organized as follows. In Section II,
we briefly review the related work on RGB-D SOD. In Section III, we introduce the technical details of the proposed CIR-Net. Then, the experiments, including comparisons with state-of-the-art methods and ablation studies, are presented in
Section IV. Finally, the conclusion is drawn in Section V.
II. RELATED WORK
Different from RGB SOD models [55]–[59], RGB-D SOD models introduce the depth modality together with the RGB appearance. In the past ten years, a large number of methods have been proposed, which can be roughly divided into traditional