objects with salient appearance may be incorrectly preserved (see the 1st row of Fig. 1). To alleviate the above problems, the undifferentiated and bidirectional fusion mechanism has been proposed as another typical interaction mode, as shown in Fig. 1(b), which no longer distinguishes the primary and secondary roles of the two modalities. Fusing the two modality features by addition or concatenation is the simplest solution, but it often fails to achieve the desired results, especially in complex scenes (see the 2nd row of Fig. 1). In addition, some works [18] learn weights that determine the contributions of spatial and temporal features, thereby achieving adaptive fusion of the two modalities. Although these methods appear quite intelligent and achieve relatively competitive performance, such a black-box adaptive fusion strategy sometimes merely trades off performance rather than maximizing gains across different scenarios. The 3rd and 4th rows of Fig. 1 show frames taken at different moments of the same video. Although the scenes are similar, the contributions of the two modalities to the final saliency detection differ. Appearance cues are more important than motion cues in the 3rd row, where dramatic object movement and camera motion lead to unclear and blurred motion cues. In the 4th row, by contrast, motion cues provide more effective guidance than appearance cues, which contain some unavoidable noise.
According to these observations, when salient objects and backgrounds share similar appearances or background interference is severe, confusing and erroneous appearance cues can greatly contaminate the final detection results; in this case, accurate motion cues may help segment the salient objects correctly. Conversely, object motion that is too slow or too fast blurs the estimated optical flow map, failing to provide discriminative motion cues and degrading the final detection; here, satisfactory results can be obtained by exploiting the semantic information carried by distinctive appearance cues. In other words, the roles of the two modalities cannot be generalized across different scenes, or even across similar ones, and this scene uncertainty makes it very difficult to model the interaction in a fully adaptive way. Instead of learning the importance of the two modalities indiscriminately and fully adaptively, we propose a more secure modeling strategy, in which the importance of appearance cues and motion cues is comprehensively and explicitly taken into account to generate the saliency maps, as shown in Fig. 1(c). Specifically, we design a top-bottom parallel symmetric structure, which sacrifices some fully automatic adaptivity so that features can be fused more comprehensively while preserving the adaptability of the network to different scenarios. Since it is difficult for the network to distinguish which modality is more important in a particular scenario, we design two branches with opposite importance tendencies for VSOD: each branch takes one modality as the dominant role and supplements it with the other modality.
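To make the two-branch idea concrete, below is a minimal PyTorch-style sketch of the parallel symmetric arrangement. All names (DominantBranch, ParallelSymmetricNet) and the simple gated fusion are illustrative assumptions; in our actual network, the branch-internal processing is realized by the GDR and CRC modules described next.

import torch
import torch.nn as nn

class DominantBranch(nn.Module):
    """One branch of the parallel symmetric structure: one modality is
    dominant, the other only supplements it (hypothetical placeholder
    fusion; the real branch uses the GDR and CRC modules)."""
    def __init__(self, channels: int):
        super().__init__()
        # gate that controls how much the auxiliary modality contributes
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, dominant, auxiliary):
        g = self.gate(torch.cat([dominant, auxiliary], dim=1))
        fused = dominant + g * auxiliary   # auxiliary only supplements
        return self.head(fused)            # branch-level saliency logits

class ParallelSymmetricNet(nn.Module):
    """Upper branch: appearance-dominant; lower branch: motion-dominant."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.appearance_branch = DominantBranch(channels)
        self.motion_branch = DominantBranch(channels)

    def forward(self, app_feat, mot_feat):
        upper = self.appearance_branch(app_feat, mot_feat)
        lower = self.motion_branch(mot_feat, app_feat)
        return upper, lower   # later fused by an importance-aware step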
Under the parallel symmetric structure, two issues must be addressed: how to exploit the two modalities within each branch more explicitly, and how to integrate the information of the upper and lower branches to generate the final result. For the first issue, we design the Gather Diffusion Reinforcement (GDR) module and the Cross-modality Refinement and Complement (CRC) module to achieve dominant-modality feature reinforcement and cross-modality feature interaction, respectively.
Considering that high-level semantic information can reduce the interference of non-salient content within a single modality, and that multi-scale information contributes to more comprehensive features, the GDR module enhances the effectiveness of the dominant features in each branch and improves their multi-scale correlation. Its outputs are then fed to the CRC module in a top-down manner.
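As a hedged illustration of this gather-then-diffuse idea (the channel sizes, pooling choice, and convolution layout below are our assumptions, not the module's exact operators):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GDRModule(nn.Module):
    """Gather Diffusion Reinforcement, sketched from the description:
    multi-level dominant features are gathered into one semantic-rich
    representation, then diffused back to reinforce every level."""
    def __init__(self, channels: int, num_levels: int = 3):
        super().__init__()
        self.gather = nn.Conv2d(channels * num_levels, channels, kernel_size=1)
        self.diffuse = nn.ModuleList(
            nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)
            for _ in range(num_levels))

    def forward(self, feats):
        # feats: list of dominant-modality features, high to low resolution
        base = feats[-1].shape[2:]   # gather at the coarsest, most semantic scale
        pooled = [F.adaptive_avg_pool2d(f, base) for f in feats]
        gathered = self.gather(torch.cat(pooled, dim=1))
        out = []
        for f, conv in zip(feats, self.diffuse):
            g = F.interpolate(gathered, size=f.shape[2:],
                              mode="bilinear", align_corners=False)
            out.append(conv(torch.cat([f, g], dim=1)))   # reinforce each level
        return out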
The key ideas behind the design of the CRC module are as follows. Even when one modality plays the dominant role, the other modality still carries some useful information. We divide its role into two types: a refinement role, which mainly suppresses irrelevant redundancies in the dominant features, and a complementary role, which mainly compensates for information potentially missing from the dominant features. The CRC module therefore achieves comprehensive cross-modality interaction under an explicit primary-secondary relation, which makes it the most significant component of our parallel symmetric framework.
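A minimal sketch of these two roles, assuming an attention-style refinement and a residual complement; the concrete operators below are placeholders rather than the module's exact design:

import torch
import torch.nn as nn

class CRCModule(nn.Module):
    """Cross-modality Refinement and Complement, sketched from the text:
    the auxiliary modality first refines the dominant features (suppressing
    irrelevant redundancies), then complements them (compensating for
    information the dominant modality misses)."""
    def __init__(self, channels: int):
        super().__init__()
        self.refine_att = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid())
        self.complement = nn.Conv2d(channels * 2, channels,
                                    kernel_size=3, padding=1)

    def forward(self, dominant, auxiliary):
        refined = dominant * self.refine_att(auxiliary)    # refinement role
        residual = self.complement(
            torch.cat([refined, auxiliary], dim=1))        # complement role
        return refined + residual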
Although both the upper and lower branches carry out the complete VSOD task, they set different dominant modalities. To obtain more robust and generalizable final results, the two branches must be integrated, which is the second issue to be solved. Considering that the upper and lower branches matter differently in different scenarios, we introduce an Importance Perception Fusion (IPF) module for adaptive fusion.
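As an illustration, the IPF idea can be sketched as predicting one scene-dependent weight per branch and fusing the branch features accordingly; the global-pooling weight head below is an assumption for exposition, not the module's exact design:

import torch
import torch.nn as nn

class IPFModule(nn.Module):
    """Importance Perception Fusion, sketched from the description: predict
    scene-dependent importance weights for the appearance-dominant and
    motion-dominant branch features and fuse them adaptively."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight_head = nn.Sequential(
            nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1))       # one global weight per branch

    def forward(self, upper_feat, lower_feat):
        w = torch.softmax(
            self.weight_head(torch.cat([upper_feat, lower_feat], dim=1)),
            dim=1)                         # shape [B, 2, 1, 1]
        return w[:, 0:1] * upper_feat + w[:, 1:2] * lower_feat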
All designed modules cooperate closely and are integrated under our parallel symmetric structure to achieve better detection performance. As shown in the 5th column of Fig. 1, our model accurately locates salient objects in different types of scenes, with obvious advantages in detail representation and background suppression. The contributions of this paper can be summarized as follows:
• Considering the adaptability of the network to different scenarios and the uncertainty of the roles of different modalities, we propose a parallel symmetric network (PSNet) for VSOD that simultaneously models the importance of the two modality features in an explicit way.
• We propose a GDR module in each branch to perform multi-scale content enhancement for the dominant features, and design a CRC module to achieve cross-modality interaction, where the auxiliary features are applied to refine and supplement the dominant features.
• Experimental results on four mainstream datasets demonstrate that our PSNet outperforms 25 state-of-the-art methods both quantitatively and qualitatively.
II. RELATED WORK
A. Salient Object Detection in Single Image and Image Group
For decades, the single image-based SOD task has achieved extensive development [19]–[33] and has been widely applied in many related fields [2], such as object segmentation [34],