Global Spectral Filter Memory Network for Video Object Segmentation
tral domain representation contains rich global information. Inspired by this, we
introduce a Global Spectral Filter Memory network (GSFM), which fuses global
dependency from the spectral domain and distinguishes high-frequency from
low-frequency components for targeted enhancement. In GSFM, we propose a
Low-Frequency Module (LFM) and a High-Frequency Module (HFM) to enhance
different representations according to the characteristics of the encoder-decoder
network structure.
The role of the encoder is to extract deep features for subsequent modules, so
the encoded features need to contain rich semantic information. Intuitively, low-
frequency components correspond to high-level semantic information while ignor-
ing details. Theoretical studies of CNNs from the spectral-domain perspective
[64,52,70] report similar observations. Motivated by this analysis, we propose a
Low-Frequency Module (LFM) for the encoding process that updates the features
in the spectral domain and emphasizes their low-frequency components. Fig. 1
(c) illustrates that with LFM enhancing global semantic information, the dis-
tinguishability of similar pixels is greatly improved. Extensive experiments also
demonstrate the rationality of emphasizing low-frequency components in the encoder.
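As a minimal illustration of low-frequency emphasis in the spectral domain (a sketch of the general idea, not the paper's actual LFM implementation), the following pure-Python code transforms a 1-D signal with the discrete Fourier transform, scales the low-frequency bins by a gain factor, and transforms back. The function names, the 1-D setting, and the simple gain-based filter are our own assumptions for clarity; the actual module operates on 2-D feature maps with learned filters.

```python
import cmath

def dft(x):
    """Discrete Fourier transform of a real-valued sequence."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    """Inverse DFT; returns the real part of the reconstruction."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def lowfreq_emphasize(x, cutoff, gain=2.0):
    """Amplify spectral bins whose frequency index is <= cutoff.

    Bin k and bin N-k represent the same frequency (conjugate pair),
    so both are scaled together via min(k, N - k).
    """
    X = dft(x)
    N = len(X)
    Y = [X[k] * (gain if min(k, N - k) <= cutoff else 1.0)
         for k in range(N)]
    return idft(Y)

# A constant signal is pure DC (frequency 0), so emphasizing
# only the lowest bin doubles the whole signal.
y = lowfreq_emphasize([1.0] * 8, cutoff=0, gain=2.0)
```

Here the constant input `[1.0] * 8` maps entirely onto the DC bin, so with `gain=2.0` the output is a constant signal of value 2.0, illustrating that low-frequency emphasis acts on the globally smooth part of the signal.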
Different from encoding, features in the decoding process need to contain
more fine-grained information for accurate prediction, and high-frequency com-
ponents correspond to the image parts that change drastically, e.g., object bound-
aries and texture details. Based on this analysis, we believe that focusing on
high-frequency components helps enrich the fine-grained representation of fea-
tures and yields more accurate predictions for boundaries and ambiguous regions.
Therefore, we introduce a High-Frequency Module (HFM) in the decoding pro-
cess, which enhances the high-frequency components of features to better capture
detailed information. In addition, to take full advantage of HFM, we combine it
with an additional boundary prediction branch to provide better localization and
shape guidance.
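The complementary operation, high-frequency emphasis, can be sketched the same way (again a simplified 1-D illustration under our own naming, not the paper's HFM): bins above a cutoff frequency are amplified, which boosts exactly the rapidly changing parts of the signal, such as edges.

```python
import cmath

def dft(x):
    """Discrete Fourier transform of a real-valued sequence."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    """Inverse DFT; returns the real part of the reconstruction."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def highfreq_emphasize(x, cutoff, gain=2.0):
    """Amplify spectral bins whose frequency index is > cutoff."""
    X = dft(x)
    N = len(X)
    Y = [X[k] * (gain if min(k, N - k) > cutoff else 1.0)
         for k in range(N)]
    return idft(Y)

# x[n] = 1 + (-1)^n mixes a DC term with the fastest-alternating
# (Nyquist) frequency; emphasizing high frequencies with gain 2
# yields 1 + 2*(-1)^n, i.e. [3, -1, 3, -1, ...].
y = highfreq_emphasize([2.0, 0.0] * 4, cutoff=0, gain=2.0)
```

The alternating component, which models a sharp repeated transition, is doubled while the smooth DC component is untouched, which is the behavior one wants when sharpening boundaries in the decoder.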
Experiments show that the proposed model noticeably outperforms the base-
line method and achieves state-of-the-art performance on the DAVIS [37,38] and
YouTube-VOS [61] datasets. The contributions of this paper can be summarized
as follows. Firstly, we propose to leverage the spectral domain to enhance the
global spatial dependency of features for semi-supervised VOS. Secondly, con-
sidering the differences between the encoding and decoding processes, we pro-
pose LFM and HFM to perform targeted enhancement in each. Thirdly, we
combine object boundaries and high-frequency components to provide better
localization and shape information while keeping the decoded features fine-grained.
2 Related Work
Semi-supervised video object segmentation. Since the mask for the first
frame is given, early methods [3,49,30,50,57] fine-tune the network online on
the object mask of the first frame, which suffers from slow inference speed.
Propagation-based methods [47,8,7,63,16,1,18,21,13] propagate the segmenta-
tion masks forward as a reference for the next frame,