PSNet: Parallel Symmetric Network for Video
Salient Object Detection
Runmin Cong, Member, IEEE, Weiyu Song, Jianjun Lei, Senior Member, IEEE, Guanghui Yue,
Yao Zhao, Senior Member, IEEE, and Sam Kwong, Fellow, IEEE
Abstract—For the video salient object detection (VSOD) task,
how to excavate the information from the appearance modality
and the motion modality has always been a topic of great concern.
The two-stream structure, including an RGB appearance stream
and an optical flow motion stream, has been widely used as a
typical pipeline for VSOD tasks, but the existing methods usually
only use motion features to unidirectionally guide appearance
features, or adaptively but blindly fuse the two modality features.
Consequently, these methods underperform in diverse scenarios because
their learning schemes are neither comprehensive nor scene-specific. In this
paper, following a more secure modeling philosophy, we deeply
investigate the importance of the appearance modality and the motion
modality in a more comprehensive way, and propose a VSOD
network with an up-down parallel symmetric structure, named PSNet.
Two parallel branches with different dominant modalities are set
to achieve complete video saliency decoding with the cooperation
of the Gather Diffusion Reinforcement (GDR) module and Cross-
modality Refinement and Complement (CRC) module. Finally, we
use the Importance Perception Fusion (IPF) module to fuse the
features from two parallel branches according to their different
importance in different scenarios. Experiments on four benchmark
datasets demonstrate that our method achieves desirable and
competitive performance. The code and results are available at
https://rmcong.github.io/proj_PSNet.html.
Index Terms—Salient object detection, Video sequence, Parallel
symmetric structure, Importance perception.
I. INTRODUCTION
VIDEO salient object detection (VSOD) focuses on extracting
the most attractive and motion-related objects
in a video sequence [1], [2], which has been used as a pre-
processing step for a wide range of tasks, such as video
understanding [3]–[6], video compression [7], video tracking
[8], and video captioning [9]. Due to the characteristics of video,
Runmin Cong is with the Institute of Information Science, Beijing Jiaotong
University, Beijing 100044, China, also with the Beijing Key Laboratory of
Advanced Information Science and Network Technology, Beijing 100044,
China, and also with the Department of Computer Science, City University
of Hong Kong, Hong Kong SAR, China (e-mail: rmcong@bjtu.edu.cn).
Weiyu Song and Yao Zhao are with the Institute of Information Science,
Beijing Jiaotong University, Beijing 100044, China, and also with the Beijing
Key Laboratory of Advanced Information Science and Network Technology,
Beijing 100044, China (e-mail: wysong125@bjtu.edu.cn, yzhao@bjtu.edu.cn).
Jianjun Lei is with the School of Electrical and Information Engineering,
Tianjin University, Tianjin 300072, China (e-mail: jjlei@tju.edu.cn).
Guanghui Yue is with the National-Regional Key Technology Engineering
Laboratory for Medical Ultrasound, Guangdong Key Laboratory for
Biomedical Measurements and Ultrasound Imaging, School of Biomedical
Engineering, Health Science Center, Shenzhen University, Shenzhen 518060,
China (email: yueguanghui@szu.edu.cn).
Sam Kwong is with the Department of Computer Science, City University
of Hong Kong, Hong Kong SAR, China, and also with the City University
of Hong Kong Shenzhen Research Institute, Shenzhen 51800, China (e-mail:
cssamk@cityu.edu.hk).
Fig. 1. Top: Comparison of VSOD model structures between our method (c) and
other optical flow-based two-stream VSOD methods (a)(b). Bottom: Saliency
results of different models in different scenes. (d) RGB images; (e) Optical
flow images; (f) GT; (g) Saliency maps produced by other methods, where the
first row is generated by the MGA method [16], the second row by our baseline
model with addition fusion, and the last two rows by the CAG method [18];
(h) Our model.
in addition to the appearance cue, the motion attribute plays
an important role, which is different from the SOD task for
static images. Entering the deep learning era, a variety of
VSOD methods have been explored, which can be roughly
divided into two categories, i.e., single-stream methods using
temporal convolution or long short-term memory [10]–[14],
and two-stream methods using optical flow [15]–[18]. Even
so, it is still very challenging for current VSOD methods to
fully excavate and integrate the information from motion and
appearance cues. For the optical flow-based two-stream VSOD
model, how to achieve the information interaction according to
the role of the two modalities is very important. In this paper,
we first rethink and review the interaction mode in the optical
flow-based two-stream VSOD structure, and find that the existing
methods can be further divided into two categories. One is the
unidirectional guidance model, as shown in Fig. 1(a), in which
the motion information mainly plays a supplementary role. For
example, Li et al. [16] encouraged motion features to guide
the appearance features in the designed VSOD model. As a
result, the model pays too much attention to the spatial branch,
while the advantage of the motion branch is weakened when
dealing with some challenging scenes. For example, stationary
objects with a salient appearance may be incorrectly preserved
(see the 1st row of Fig. 1). To alleviate the problems mentioned
above, the undifferentiated and bidirectional fusion mechanism
is proposed as another typical interaction mode, as shown in
Fig. 1(b), which no longer distinguishes the primary and secondary
roles of the two modalities. Fusing the two modality features by addition or
concatenation is the simplest solution, but this way often fails
to achieve the desired results, especially for some complex
scenes (see the 2nd row of Fig. 1). In addition, some works
[18] learn the weights to determine the contributions of spatial
and temporal features, and then achieve adaptive fusion of two
modality features. Although these methods appear to be quite
intelligent and achieve relatively competitive performance, this
black-box adaptive fusion strategy sometimes only trades off
performance rather than maximizing gains when faced with
different scenarios. The 3rd and 4th rows of Fig. 1 show frames
taken at different moments of the same video. Although the scenes
are similar, the contributions of the two modalities to the final
saliency detection are different. In the 3rd row, the appearance
cues are more important than the motion cues, because the dramatic
movement of objects and the change of camera position lead to
unclear and blurred motion cues. In the 4th row, by contrast, the
motion cues provide more effective guidance than the appearance
cues, which contain some hard-to-suppress noise. According to
these observations, when salient objects
and backgrounds share similar appearances, or when background
interference is severe, misleading appearance cues can greatly
contaminate the final detection results; in such cases, accurate
motion cues may help segment the salient objects correctly.
Conversely, object motion that is too slow or too fast blurs the
estimated optical flow map, failing to provide discriminative
motion cues and degrading the final detection; in this case,
satisfactory results can be obtained by exploiting the semantic
information carried by distinctive appearance cues. In other words,
the roles of the two modalities in different, or even similar, scenes
cannot be generalized, and the uncertainty of the scene makes it very
difficult to model the interaction in a fully adaptive way. Instead of
learning the importance of the two modalities blindly and fully
adaptively, we propose a more secure modeling strategy, in which the
importance of appearance cues and motion cues is comprehensively and
explicitly taken into account when generating the saliency maps, as
shown in Fig. 1(c). Specifically, we design a top-bottom parallel
symmetric structure that sacrifices some fully automatic intelligence
in exchange for more comprehensive feature fusion and better
adaptability to different scenarios. Since it is difficult for the
network to judge which modality is more important in a particular
scenario, we design two branches with different importance tendencies,
each taking one modality as the dominant role and supplementing it
with the other modality.
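To make the parallel symmetric design concrete, the following is a minimal PyTorch-style sketch of the overall pipeline. The encoder, branch, and fusion classes here (TinyEncoder, DominantBranch, the naive per-level fusion, and the single-layer fusion head) are illustrative placeholders of our own choosing, not the modules used in PSNet; only the overall data flow (two encoders, two dominant-modality branches, one final fusion step) follows the description above.

```python
# Minimal PyTorch sketch of the parallel symmetric idea (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Placeholder 5-level encoder; a real model would use a deep backbone."""
    def __init__(self, in_ch=3, width=16):
        super().__init__()
        chans = [width * 2 ** i for i in range(5)]
        self.stages, prev = nn.ModuleList(), in_ch
        for c in chans:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True)))
            prev = c

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # five levels, from fine (f1) to coarse (f5)


class DominantBranch(nn.Module):
    """One decoding branch: `dom` features lead, `aux` features assist."""
    def __init__(self, chans):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Conv2d(c * 2, 1, 3, padding=1) for c in chans])

    def forward(self, dom_feats, aux_feats):
        # Naive stand-in for the GDR + CRC decoding: per-level fusion of the
        # dominant and auxiliary features, averaged at the finest resolution.
        size = dom_feats[0].shape[-2:]
        preds = [F.interpolate(h(torch.cat([d, a], dim=1)), size=size,
                               mode='bilinear', align_corners=False)
                 for h, d, a in zip(self.heads, dom_feats, aux_feats)]
        return torch.stack(preds).mean(dim=0)


class PSNetSketch(nn.Module):
    def __init__(self, width=16):
        super().__init__()
        chans = [width * 2 ** i for i in range(5)]
        self.spatial_enc = TinyEncoder(3, width)     # RGB frame
        self.temporal_enc = TinyEncoder(3, width)    # optical-flow image
        self.app_branch = DominantBranch(chans)      # appearance-dominated
        self.mot_branch = DominantBranch(chans)      # motion-dominated
        self.fusion = nn.Conv2d(2, 1, 3, padding=1)  # naive stand-in for IPF

    def forward(self, rgb, flow):
        fa = self.spatial_enc(rgb)       # f_i^a, i = 1..5
        fm = self.temporal_enc(flow)     # f_i^m, i = 1..5
        sal_a = self.app_branch(fa, fm)  # appearance leads, motion assists
        sal_m = self.mot_branch(fm, fa)  # motion leads, appearance assists
        return torch.sigmoid(self.fusion(torch.cat([sal_a, sal_m], dim=1)))


if __name__ == "__main__":
    net = PSNetSketch()
    rgb = torch.randn(1, 3, 256, 256)
    flow = torch.randn(1, 3, 256, 256)   # flow rendered as a 3-channel image
    print(net(rgb, flow).shape)          # torch.Size([1, 1, 128, 128])
```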
Under the parallel symmetric structure, two issues need to be
addressed: one is how to clearly exploit the information of the two
modalities within each branch, and the other is how to integrate the
information of the upper and lower branches to generate the final
result. For the first issue, we design the Gather Diffusion
Reinforcement (GDR) module and the Cross-modality Refinement and
Complement (CRC) module to achieve dominant-modality feature
reinforcement and cross-modality feature interaction, respectively.
Considering that high-level semantic information can reduce the
interference of non-salient information in a single modality and that
multi-scale information contributes to more comprehensive features,
the GDR module is designed to enhance the effectiveness of the
dominant features in each branch and to improve their multi-scale
correlation. The outputs of the GDR module are then fed to the CRC
modules in a top-down manner.
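The exact design of the GDR module is given later in the paper; the sketch below only illustrates the "gather then diffuse" idea described above under our own assumptions: multi-level dominant features are projected to a common scale, aggregated into a shared multi-scale context, and that context is diffused back to reinforce each level. The channel widths, aggregation scheme, and residual fusion are illustrative choices, not the published module.

```python
# Illustrative "gather-diffuse" multi-scale reinforcement (assumption-based
# sketch, not the GDR module as published).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatherDiffuseSketch(nn.Module):
    def __init__(self, chans=(64, 128, 256), mid=64):
        super().__init__()
        # Project every level to a shared channel width before gathering.
        self.squeeze = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in chans])
        # Aggregate the gathered levels into one cross-scale descriptor.
        self.gather = nn.Sequential(
            nn.Conv2d(mid * len(chans), mid, 3, padding=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # Per-level fusion of the diffused descriptor with the original feature.
        self.diffuse = nn.ModuleList(
            [nn.Conv2d(mid + c, c, 3, padding=1) for c in chans])

    def forward(self, feats):
        # feats: list of dominant-modality features, highest resolution first.
        base = feats[0].shape[-2:]
        gathered = [F.interpolate(s(f), size=base, mode='bilinear',
                                  align_corners=False)
                    for s, f in zip(self.squeeze, feats)]
        ctx = self.gather(torch.cat(gathered, dim=1))  # shared multi-scale context
        out = []
        for d, f in zip(self.diffuse, feats):
            c = F.interpolate(ctx, size=f.shape[-2:], mode='bilinear',
                              align_corners=False)
            out.append(d(torch.cat([c, f], dim=1)) + f)  # residual reinforcement
        return out


if __name__ == "__main__":
    feats = [torch.randn(1, c, s, s) for c, s in ((64, 64), (128, 32), (256, 16))]
    enhanced = GatherDiffuseSketch()(feats)
    print([tuple(t.shape) for t in enhanced])  # same shapes as the inputs
```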
The key ideas behind the design of the CRC module are as follows.
Even if one modality plays the dominant role, the other modality
still carries useful information. We divide this auxiliary role into
two types: a refinement role, which mainly suppresses the irrelevant
redundancies in the dominant features, and a complementary role,
which mainly compensates for information potentially missing from
the dominant features. Therefore, the CRC module is designed to
achieve comprehensive information interaction under an explicit
primary-secondary relationship, and it plays the most significant
role in our proposed parallel symmetric framework.
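As a rough, assumption-based illustration of the refinement and complement roles (not the published CRC module), the sketch below lets the auxiliary feature (i) gate the dominant feature with a learned spatial mask (refinement) and (ii) inject auxiliary content where the dominant response is weak (complement), before fusing both back into the dominant feature.

```python
# Illustrative refinement-and-complement interaction (assumptions only; the
# published CRC module may differ).
import torch
import torch.nn as nn


class RefineComplementSketch(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # Refinement: a spatial mask predicted from the auxiliary feature
        # suppresses irrelevant responses in the dominant feature.
        self.refine_gate = nn.Sequential(
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())
        # Complement: auxiliary content is injected where the dominant
        # feature is weak, then fused back.
        self.complement = nn.Conv2d(ch, ch, 3, padding=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(ch * 2, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, dominant, auxiliary):
        refined = dominant * self.refine_gate(auxiliary)        # refinement role
        weak = 1 - torch.sigmoid(dominant.mean(dim=1, keepdim=True))
        missing = self.complement(auxiliary) * weak             # complementary role
        return self.fuse(torch.cat([refined, missing], dim=1)) + dominant


if __name__ == "__main__":
    dom = torch.randn(1, 64, 32, 32)   # dominant-modality feature
    aux = torch.randn(1, 64, 32, 32)   # auxiliary-modality feature
    print(RefineComplementSketch(64)(dom, aux).shape)  # torch.Size([1, 64, 32, 32])
```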
Although both the upper and lower branches perform complete VSOD
decoding, they are dominated by different modalities. To obtain more
robust and generalized final results, the two branches need to be
integrated, which is the second issue mentioned above. Considering
the different importance of the upper and lower branches in different
scenarios, we introduce an Importance Perception Fusion (IPF) module
for adaptive fusion (a rough sketch is given after this paragraph).
All designed modules cooperate closely and are integrated under our
parallel symmetric structure to achieve better detection performance.
As shown in the last column (h) of Fig. 1, our model can accurately
locate salient objects in different types of scenes, with obvious
advantages in detail representation and background suppression.
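As an assumption-based illustration of importance-aware fusion (not the published IPF module), the sketch below pools the concatenated branch features, predicts a softmax-normalized weight for each branch, and blends the two branches accordingly before the final saliency prediction.

```python
# Illustrative importance-weighted fusion of the two branches (an assumed
# design: global context predicts per-branch weights; not the published IPF).
import torch
import torch.nn as nn


class ImportanceFusionSketch(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # Predict one scalar importance per branch from the pooled joint feature.
        self.importance = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch * 2, ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2, 1))
        self.head = nn.Conv2d(ch, 1, 3, padding=1)  # final saliency prediction

    def forward(self, feat_app, feat_mot):
        # feat_app / feat_mot: features from the appearance- and
        # motion-dominated branches at the same resolution.
        joint = torch.cat([feat_app, feat_mot], dim=1)
        w = torch.softmax(self.importance(joint), dim=1)   # [B, 2, 1, 1]
        fused = w[:, 0:1] * feat_app + w[:, 1:2] * feat_mot
        return torch.sigmoid(self.head(fused))


if __name__ == "__main__":
    fa = torch.randn(2, 64, 64, 64)
    fm = torch.randn(2, 64, 64, 64)
    print(ImportanceFusionSketch(64)(fa, fm).shape)  # torch.Size([2, 1, 64, 64])
```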
The contributions of this paper can be summarized as follows:
• Considering the adaptability of the network to different scenarios and the uncertainty of the roles of different modalities, we propose a parallel symmetric network (PSNet) for VSOD that simultaneously models the importance of the two modality features in an explicit way.
• We propose a GDR module in each branch to perform multi-scale content enhancement for the dominant features, and design a CRC module to achieve cross-modality interaction, where the auxiliary features are applied to refine and supplement the dominant features.
• Experimental results on four mainstream datasets demonstrate that our PSNet outperforms 25 state-of-the-art methods both quantitatively and qualitatively.
II. RELATED WORK
A. Salient Object Detection in Single Image and Image Group
For decades, the single image-based SOD task has achieved
extensive development [19]–[33], and has been widely used
in many related fields [2], such as object segmentation [34],
Fig. 2. The flowchart of the proposed Parallel Symmetric Network (PSNet) for video salient object detection. We first extract multi-level features from
the RGB image and the optical flow map via the spatial encoder and the temporal encoder, respectively, denoted as f_i^a and f_i^m (i = 1, 2, ..., 5). Then, the
appearance-dominated branch (top) and the motion-dominated branch (bottom) are used for feature decoding. In each branch, the Gather
Diffusion Reinforcement (GDR) module performs cross-scale feature enhancement, and the Cross-modality Refinement and Complement (CRC)
modules achieve cross-modality interaction with an explicit primary and secondary modality relationship. Finally, the Importance Perception Fusion (IPF)
module integrates the upper and lower branches by considering their different importance in different scenarios.
content enhancement [35]–[46], and quality assessment [47],
[48]. Chen et al. [21] developed a method to make full use
of global context. Liu et al. [22] introduced a network to
selectively attend to informative context locations for each
pixel. In addition, the salient boundaries have been introduced
into the model to improve the representation and highlight
the desirable boundaries [23]–[25]. Some methods integrated
features in multiple layers of CNN to exploit the context
information at different semantic levels [25], [26]. In some
challenging and complex single image scenarios, some works
seek help from other modality data (e.g., depth map [49]–
[55] and thermal map [56]). In addition, co-salient object
detection (CoSOD) aims to detect salient objects from an
image group containing several relevant images [57]–[66]. The
difference between CoSOD and VSOD is that CoSOD does not
involve temporal consistency, and the co-salient objects are
generally consistent only in semantic category rather than being
the same object.
B. Salient Object Detection in Video
The last decade has witnessed the considerable development
of salient object detection in video sequences. Earlier VSOD
methods mostly located salient objects through hand-crafted
features [67]–[70]. Tu et al. [67] detected the salient object in
the video through two distinctive object detectors and refined
the final spatiotemporal saliency result by measuring the
foreground connectivity between two maps from two detectors.
Chen et al. [68] divided the long-term video sequence into
some short batches and proposed to detect saliency in a batch-
wise way, where the low-rank coherency is introduced to
guarantee temporal smoothness. However, the performance
of these methods is not satisfactory due to the limited fea-
ture representation capabilities. Recently, deep learning has
demonstrated its power in VSOD tasks. Among them, some
VSOD models adopt a single-stream structure that directly
feeds the video sequences recursively into the network. For
instance, Wang et al. [10] proposed the first work applying
deep learning to the VSOD task. Li et al. [71] proposed a two-
stage FCN-based model, where the first stage is responsible
for detecting static saliency, and the second stage is utilized to
detect spatiotemporal saliency with two consecutive frames. In
general, this method models saliency in a relatively primitive
way. As models developed, more elaborate module
designs were proposed. For example, Song et al. [13] used the
designed Pyramid Dilated Bidirectional ConvLSTM to achieve
deeper spatiotemporal feature extraction. Fan et al. [14] introduced
a ConvLSTM-based VSOD model that captures spatiotemporal features
over a fixed-length window of video frames, and also proposed a new
VSOD dataset with human visual fixations to model saliency shifting.
Chen et al. [72] used the results of previous SOTA models as pseudo
labels, taking motion quality into account, to fine-tune a new model. Chen et
al. [12] presented a novel spatiotemporal modeling unit based
on 3D convolution.
In addition, another typical VSOD pipeline is the two-
stream structure, where the optical flow image generated by
FlowNet2 [73] or other methods is directly fed into the
network as another stream input. Current two-stream models
can be divided into two categories. One is the uni-direction