PSNet: Parallel Symmetric Network for Video
Salient Object Detection
Runmin Cong, Member, IEEE, Weiyu Song, Jianjun Lei, Senior Member, IEEE, Guanghui Yue,
Yao Zhao, Senior Member, IEEE, and Sam Kwong, Fellow, IEEE
Abstract—For the video salient object detection (VSOD) task,
how to excavate the information from the appearance modality
and the motion modality has always been a topic of great concern.
The two-stream structure, including an RGB appearance stream
and an optical flow motion stream, has been widely used as a
typical pipeline for VSOD tasks, but the existing methods usually
only use motion features to unidirectionally guide appearance
features, or adaptively but blindly fuse the two modality features.
Consequently, these methods underperform in diverse scenarios because
their learning schemes are neither comprehensive nor scene-specific. In this
paper, following a more secure modeling philosophy, we deeply
investigate the importance of the appearance modality and the motion
modality in a more comprehensive way, and propose a VSOD
network with an up-down parallel symmetric structure, named PSNet.
Two parallel branches with different dominant modalities are set
to achieve complete video saliency decoding with the cooperation
of the Gather Diffusion Reinforcement (GDR) module and Cross-
modality Refinement and Complement (CRC) module. Finally, we
use the Importance Perception Fusion (IPF) module to fuse the
features from two parallel branches according to their different
importance in different scenarios. Experiments on four benchmark
datasets demonstrate that our method achieves desirable and
competitive performance. The code and results are available at
https://rmcong.github.io/proj_PSNet.html.
Index Terms—Salient object detection, Video sequence, Parallel
symmetric structure, Importance perception.
I. INTRODUCTION
VIDEO salient object detection (VSOD) focuses on extracting
the most attractive and motion-related objects
in a video sequence [1], [2], which has been used as a pre-
processing step for a wide range of tasks, such as video
understanding [3]–[6], video compression [7], video tracking
[8], and video captioning [9]. Due to the characteristics of video,
Runmin Cong is with the Institute of Information Science, Beijing Jiaotong
University, Beijing 100044, China, also with the Beijing Key Laboratory of
Advanced Information Science and Network Technology, Beijing 100044,
China, and also with the Department of Computer Science, City University
of Hong Kong, Hong Kong SAR, China (e-mail: rmcong@bjtu.edu.cn).
Weiyu Song and Yao Zhao are with the Institute of Information Science,
Beijing Jiaotong University, Beijing 100044, China, and also with the Beijing
Key Laboratory of Advanced Information Science and Network Technology,
Beijing 100044, China (e-mail: wysong125@bjtu.edu.cn, yzhao@bjtu.edu.cn).
Jianjun Lei is with the School of Electrical and Information Engineering,
Tianjin University, Tianjin 300072, China (e-mail: jjlei@tju.edu.cn).
Guanghui Yue is with the National-Regional Key Technology Engineering
Laboratory for Medical Ultrasound, Guangdong Key Laboratory for
Biomedical Measurements and Ultrasound Imaging, School of Biomedical
Engineering, Health Science Center, Shenzhen University, Shenzhen 518060,
China (email: yueguanghui@szu.edu.cn).
Sam Kwong is with the Department of Computer Science, City University
of Hong Kong, Hong Kong SAR, China, and also with the City University
of Hong Kong Shenzhen Research Institute, Shenzhen 51800, China (e-mail:
cssamk@cityu.edu.hk).
Fig. 1. Top: Comparison of VSOD model structures between our method (c) and
other optical flow-based two-stream VSOD methods (a)(b). Bottom: Saliency
results of different models in different scenes. (d) RGB images; (e) Optical
flow images; (f) GT; (g) Saliency maps produced by other methods, where the
first row is generated by the MGA method [16], the second row by our baseline
model with addition fusion, and the last two rows by the CAG method [18];
(h) Our model.
in addition to the appearance cue, the motion attribute plays
an important role, which is different from the SOD task for
static images. Entering the deep learning era, a variety of
VSOD methods have been explored, which can be roughly
divided into two categories, i.e., single-stream methods using
temporal convolution or long short-term memory [10]–[14],
and two-stream methods using optical flow [15]–[18]. Even
so, it is still very challenging for current VSOD methods to
fully excavate and integrate the information from motion and
appearance cues. For the optical flow-based two-stream VSOD
model, how to achieve the information interaction according to
the role of the two modalities is very important. In this paper,
we first rethink and review the interaction mode in the optical
flow-based two-stream VSOD structure, and find that the existing
methods can be further divided into two categories. One is the
unidirectional guidance model, as shown in Fig. 1(a), in which
the motion information mainly plays a supplementary role. For
example, Li et al. [16] encouraged motion features to guide
the appearance features in the designed VSOD model. As a
result, the model pays too much attention to the spatial branch,
while the advantage of the motion branch is weakened when
dealing with some challenging scenes. For example, stationary
objects with a salient appearance may be incorrectly preserved
(see the 1st row of Fig. 1). To alleviate the problems mentioned
above, the undifferentiated and bidirectional fusion mechanism
is proposed as another typical interaction mode, as shown in
Fig. 1(b), which no longer distinguishes the primary and secondary
roles of the two modalities. Fusing the two modality features by addition or
concatenation is the simplest solution, but this way often fails
to achieve the desired results, especially for some complex
scenes (see the 2nd row of Fig. 1). In addition, some works
[18] learn the weights to determine the contributions of spatial
and temporal features, and then achieve adaptive fusion of two
modality features. Although these methods appear to be quite
intelligent and achieve relatively competitive performance, this
black-box adaptive fusion strategy sometimes only trades off
performance rather than maximizing gains when faced with
different scenarios. The 3rd and 4th rows of Fig. 1 show frames
taken at different moments of the same video. Although the scenes
are similar, the contributions of the two modalities to the final
saliency detection are different. In the 3rd row, the appearance
cues are more important than the motion cues, because the dramatic
movement of objects and the change of camera position lead to
unclear and blurred motion cues. In the 4th row, by contrast, the
motion cues provide more effective guidance than the appearance
cues, which contain some hard-to-suppress noise. According to
these observations, when salient objects
and backgrounds share similar appearances, or when background
interference is severe, misleading appearance cues can greatly
contaminate the final detection results; in such cases, accurate
motion cues may help segment the salient objects correctly.
Conversely, object motion that is too slow or too fast blurs the
estimated optical flow map, failing to provide discriminative
motion cues and degrading the final detection; in this case,
satisfactory results can be obtained by exploiting the semantic
information carried by distinctive appearance cues. In other words,
the roles of the two modalities in different, or even similar, scenes
cannot be generalized, and the uncertainty of the scene makes it very
difficult to model the interaction in a fully adaptive way. Instead of
learning the importance of the two modalities blindly and fully
adaptively, we propose a more secure modeling strategy, in which the
importance of appearance cues and motion cues is comprehensively and
explicitly taken into account when generating the saliency maps, as
shown in Fig. 1(c). Specifically, we design a top-bottom parallel
symmetric structure that sacrifices some fully automatic intelligence
in exchange for more comprehensive feature fusion and better
adaptability to different scenarios. Since it is difficult for the
network to judge which modality is more important in a particular
scenario, we design two branches with different importance tendencies,
each taking one modality as the dominant role and supplementing it
with the other modality.
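To make the parallel symmetric design concrete, the following is a minimal PyTorch-style sketch of the overall pipeline. The encoder, branch, and fusion classes here (TinyEncoder, DominantBranch, the naive per-level fusion, and the single-layer fusion head) are illustrative placeholders of our own choosing, not the modules used in PSNet; only the overall data flow (two encoders, two dominant-modality branches, one final fusion step) follows the description above.

```python
# Minimal PyTorch sketch of the parallel symmetric idea (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Placeholder 5-level encoder; a real model would use a deep backbone."""
    def __init__(self, in_ch=3, width=16):
        super().__init__()
        chans = [width * 2 ** i for i in range(5)]
        self.stages, prev = nn.ModuleList(), in_ch
        for c in chans:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True)))
            prev = c

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # five levels, from fine (f1) to coarse (f5)


class DominantBranch(nn.Module):
    """One decoding branch: `dom` features lead, `aux` features assist."""
    def __init__(self, chans):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Conv2d(c * 2, 1, 3, padding=1) for c in chans])

    def forward(self, dom_feats, aux_feats):
        # Naive stand-in for the GDR + CRC decoding: per-level fusion of the
        # dominant and auxiliary features, averaged at the finest resolution.
        size = dom_feats[0].shape[-2:]
        preds = [F.interpolate(h(torch.cat([d, a], dim=1)), size=size,
                               mode='bilinear', align_corners=False)
                 for h, d, a in zip(self.heads, dom_feats, aux_feats)]
        return torch.stack(preds).mean(dim=0)


class PSNetSketch(nn.Module):
    def __init__(self, width=16):
        super().__init__()
        chans = [width * 2 ** i for i in range(5)]
        self.spatial_enc = TinyEncoder(3, width)     # RGB frame
        self.temporal_enc = TinyEncoder(3, width)    # optical-flow image
        self.app_branch = DominantBranch(chans)      # appearance-dominated
        self.mot_branch = DominantBranch(chans)      # motion-dominated
        self.fusion = nn.Conv2d(2, 1, 3, padding=1)  # naive stand-in for IPF

    def forward(self, rgb, flow):
        fa = self.spatial_enc(rgb)       # f_i^a, i = 1..5
        fm = self.temporal_enc(flow)     # f_i^m, i = 1..5
        sal_a = self.app_branch(fa, fm)  # appearance leads, motion assists
        sal_m = self.mot_branch(fm, fa)  # motion leads, appearance assists
        return torch.sigmoid(self.fusion(torch.cat([sal_a, sal_m], dim=1)))


if __name__ == "__main__":
    net = PSNetSketch()
    rgb = torch.randn(1, 3, 256, 256)
    flow = torch.randn(1, 3, 256, 256)   # flow rendered as a 3-channel image
    print(net(rgb, flow).shape)          # torch.Size([1, 1, 128, 128])
```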
Under the parallel symmetric structure, two issues need to be
addressed: one is how to clearly exploit the information of the two
modalities within each branch, and the other is how to integrate the
information of the upper and lower branches to generate the final
result. For the first issue, we design the Gather Diffusion
Reinforcement (GDR) module and the Cross-modality Refinement and
Complement (CRC) module to achieve dominant-modality feature
reinforcement and cross-modality feature interaction, respectively.
Considering that high-level semantic information can reduce the
interference of non-salient information in a single modality and that
multi-scale information contributes to more comprehensive features,
the GDR module is designed to enhance the effectiveness of the
dominant features in each branch and to improve their multi-scale
correlation. The outputs of the GDR module are then fed to the CRC
modules in a top-down manner.
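The exact design of the GDR module is given later in the paper; the sketch below only illustrates the "gather then diffuse" idea described above under our own assumptions: multi-level dominant features are projected to a common scale, aggregated into a shared multi-scale context, and that context is diffused back to reinforce each level. The channel widths, aggregation scheme, and residual fusion are illustrative choices, not the published module.

```python
# Illustrative "gather-diffuse" multi-scale reinforcement (assumption-based
# sketch, not the GDR module as published).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatherDiffuseSketch(nn.Module):
    def __init__(self, chans=(64, 128, 256), mid=64):
        super().__init__()
        # Project every level to a shared channel width before gathering.
        self.squeeze = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in chans])
        # Aggregate the gathered levels into one cross-scale descriptor.
        self.gather = nn.Sequential(
            nn.Conv2d(mid * len(chans), mid, 3, padding=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # Per-level fusion of the diffused descriptor with the original feature.
        self.diffuse = nn.ModuleList(
            [nn.Conv2d(mid + c, c, 3, padding=1) for c in chans])

    def forward(self, feats):
        # feats: list of dominant-modality features, highest resolution first.
        base = feats[0].shape[-2:]
        gathered = [F.interpolate(s(f), size=base, mode='bilinear',
                                  align_corners=False)
                    for s, f in zip(self.squeeze, feats)]
        ctx = self.gather(torch.cat(gathered, dim=1))  # shared multi-scale context
        out = []
        for d, f in zip(self.diffuse, feats):
            c = F.interpolate(ctx, size=f.shape[-2:], mode='bilinear',
                              align_corners=False)
            out.append(d(torch.cat([c, f], dim=1)) + f)  # residual reinforcement
        return out


if __name__ == "__main__":
    feats = [torch.randn(1, c, s, s) for c, s in ((64, 64), (128, 32), (256, 16))]
    enhanced = GatherDiffuseSketch()(feats)
    print([tuple(t.shape) for t in enhanced])  # same shapes as the inputs
```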
The key ideas behind the design of the CRC module are as follows.
Even if one modality plays the dominant role, the other modality
still carries useful information. We divide this auxiliary role into
two types: a refinement role, which mainly suppresses the irrelevant
redundancies in the dominant features, and a complementary role,
which mainly compensates for information potentially missing from
the dominant features. Therefore, the CRC module is designed to
achieve comprehensive information interaction under an explicit
primary-secondary relationship, and it plays the most significant
role in our proposed parallel symmetric framework.
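As a rough, assumption-based illustration of the refinement and complement roles (not the published CRC module), the sketch below lets the auxiliary feature (i) gate the dominant feature with a learned spatial mask (refinement) and (ii) inject auxiliary content where the dominant response is weak (complement), before fusing both back into the dominant feature.

```python
# Illustrative refinement-and-complement interaction (assumptions only; the
# published CRC module may differ).
import torch
import torch.nn as nn


class RefineComplementSketch(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # Refinement: a spatial mask predicted from the auxiliary feature
        # suppresses irrelevant responses in the dominant feature.
        self.refine_gate = nn.Sequential(
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())
        # Complement: auxiliary content is injected where the dominant
        # feature is weak, then fused back.
        self.complement = nn.Conv2d(ch, ch, 3, padding=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(ch * 2, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, dominant, auxiliary):
        refined = dominant * self.refine_gate(auxiliary)        # refinement role
        weak = 1 - torch.sigmoid(dominant.mean(dim=1, keepdim=True))
        missing = self.complement(auxiliary) * weak             # complementary role
        return self.fuse(torch.cat([refined, missing], dim=1)) + dominant


if __name__ == "__main__":
    dom = torch.randn(1, 64, 32, 32)   # dominant-modality feature
    aux = torch.randn(1, 64, 32, 32)   # auxiliary-modality feature
    print(RefineComplementSketch(64)(dom, aux).shape)  # torch.Size([1, 64, 32, 32])
```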
Although both the upper and lower branches perform complete VSOD
decoding, they are dominated by different modalities. To obtain more
robust and generalized final results, the two branches need to be
integrated, which is the second issue mentioned above. Considering
the different importance of the upper and lower branches in different
scenarios, we introduce an Importance Perception Fusion (IPF) module
for adaptive fusion (a rough sketch is given after this paragraph).
All designed modules cooperate closely and are integrated under our
parallel symmetric structure to achieve better detection performance.
As shown in the last column (h) of Fig. 1, our model can accurately
locate salient objects in different types of scenes, with obvious
advantages in detail representation and background suppression.
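As an assumption-based illustration of importance-aware fusion (not the published IPF module), the sketch below pools the concatenated branch features, predicts a softmax-normalized weight for each branch, and blends the two branches accordingly before the final saliency prediction.

```python
# Illustrative importance-weighted fusion of the two branches (an assumed
# design: global context predicts per-branch weights; not the published IPF).
import torch
import torch.nn as nn


class ImportanceFusionSketch(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # Predict one scalar importance per branch from the pooled joint feature.
        self.importance = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch * 2, ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2, 1))
        self.head = nn.Conv2d(ch, 1, 3, padding=1)  # final saliency prediction

    def forward(self, feat_app, feat_mot):
        # feat_app / feat_mot: features from the appearance- and
        # motion-dominated branches at the same resolution.
        joint = torch.cat([feat_app, feat_mot], dim=1)
        w = torch.softmax(self.importance(joint), dim=1)   # [B, 2, 1, 1]
        fused = w[:, 0:1] * feat_app + w[:, 1:2] * feat_mot
        return torch.sigmoid(self.head(fused))


if __name__ == "__main__":
    fa = torch.randn(2, 64, 64, 64)
    fm = torch.randn(2, 64, 64, 64)
    print(ImportanceFusionSketch(64)(fa, fm).shape)  # torch.Size([2, 1, 64, 64])
```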
The contributions of this paper can be summarized as follows:
• Considering the adaptability of the network to different scenarios and the uncertainty of the roles of different modalities, we propose a parallel symmetric network (PSNet) for VSOD that simultaneously models the importance of the two modality features in an explicit way.
• We propose a GDR module in each branch to perform multi-scale content enhancement for the dominant features, and design a CRC module to achieve cross-modality interaction, where the auxiliary features are applied to refine and supplement the dominant features.
• Experimental results on four mainstream datasets demonstrate that our PSNet outperforms 25 state-of-the-art methods both quantitatively and qualitatively.
II. RELATED WORK
A. Salient Object Detection in Single Image and Image Group
For decades, the single image-based SOD task has achieved
extensive development [19]–[33], and has been widely used
in many related fields [2], such as object segmentation [34],
Fig. 2. The flowchart of the proposed Parallel Symmetric Network (PSNet) for video salient object detection. We first extract multi-level features from
the RGB image and the optical flow map via the spatial encoder and the temporal encoder, respectively, denoted as f_i^a and f_i^m (i = 1, 2, ..., 5). Then, the
appearance-dominated branch (top) and the motion-dominated branch (bottom) are used for feature decoding. In each branch, the Gather
Diffusion Reinforcement (GDR) module performs cross-scale feature enhancement, and the Cross-modality Refinement and Complement (CRC)
modules achieve cross-modality interaction with an explicit primary and secondary modality relationship. Finally, the Importance Perception Fusion (IPF)
module integrates the upper and lower branches by considering their different importance in different scenarios.
content enhancement [35]–[46], and quality assessment [47],
[48]. Chen et al. [21] developed a method to make full use
of global context. Liu et al. [22] introduced a network to
selectively attend to informative context locations for each
pixel. In addition, the salient boundaries have been introduced
into the model to improve the representation and highlight
the desirable boundaries [23]–[25]. Some methods integrated
features in multiple layers of CNN to exploit the context
information at different semantic levels [25], [26]. In some
challenging and complex single image scenarios, some works
seek help from other modality data (e.g., depth map [49]–
[55] and thermal map [56]). In addition, co-salient object
detection (CoSOD) aims to detect salient objects from an
image group containing several relevant images [57]–[66]. The
difference between CoSOD and VSOD is that CoSOD does not
involve temporal consistency, and the co-salient objects are
generally consistent only in semantic category rather than being
the same object.
B. Salient Object Detection in Video
The last decade has witnessed the considerable development
of salient object detection in video sequences. Earlier VSOD
methods mostly located salient objects through hand-crafted
features [67]–[70]. Tu et al. [67] detected the salient object in
the video through two distinctive object detectors and refined
the final spatiotemporal saliency result by measuring the
foreground connectivity between two maps from two detectors.
Chen et al. [68] divided the long-term video sequence into
some short batches and proposed to detect saliency in a batch-
wise way, where the low-rank coherency is introduced to
guarantee temporal smoothness. However, the performance
of these methods is not satisfactory due to the limited fea-
ture representation capabilities. Recently, deep learning has
demonstrated its power in VSOD tasks. Among them, some
VSOD models adopt a single-stream structure that directly
feeds the video sequences recursively into the network. For
instance, Wang et al. [10] proposed the first work applying
deep learning to the VSOD task. Li et al. [71] proposed a two-
stage FCN-based model, where the first stage is responsible
for detecting static saliency, and the second stage is utilized to
detect spatiotemporal saliency with two consecutive frames. In
general, this method models saliency in a relatively primitive
way. As models developed, more elaborate module
designs were proposed. For example, Song et al. [13] used the
designed Pyramid Dilated Bidirectional ConvLSTM to achieve
deeper spatiotemporal feature extraction. Fan et al. [14] introduced
a ConvLSTM-based VSOD model that captures spatiotemporal features
over a fixed-length window of video frames, and also proposed a new
VSOD dataset with human visual fixations to model saliency shifting.
Chen et al. [72] used the results of previous SOTA models as pseudo
labels, taking motion quality into account, to fine-tune a new model. Chen et
al. [12] presented a novel spatiotemporal modeling unit based
on 3D convolution.
In addition, another typical VSOD pipeline is the two-
stream structure, where the optical flow image generated by
FlowNet2 [73] or other methods is directly fed into the
network as another stream input. Current two-stream models
can be divided into two categories. One is the uni-direction