Global Spectral Filter Memory Network for
Video Object Segmentation
Yong Liu1, Ran Yu1, Jiahao Wang1, Xinyuan Zhao3, Yitong Wang2, Yansong
Tang1, and Yujiu Yang1
1Tsinghua Shenzhen International Graduate School, Tsinghua University
2ByteDance Inc.
3Northwestern University
{liu-yong20, yu-r19}@mails.tsinghua.edu.cn, {tang.yansong,
yang.yujiu}@sz.tsinghua.edu.cn
Abstract. This paper studies semi-supervised video object segmenta-
tion through boosting intra-frame interaction. Recent memory network-
based methods focus on exploiting inter-frame temporal reference while
paying little attention to intra-frame spatial dependency. Specifically,
these segmentation models tend to be susceptible to interference from
unrelated non-target objects within a single frame. To this end, we propose
Global Spectral Filter Memory network (GSFM), which improves intra-
frame interaction through learning long-term spatial dependencies in the
spectral domain. The key component of GSFM is the 2D (inverse) discrete
Fourier transform for spatial information mixing. Besides, we empirically
find that low-frequency features should be enhanced in the encoder (backbone)
while high-frequency features should be enhanced in the decoder (segmentation
head). We attribute this to the encoder's role of extracting semantic information
and the decoder's role of highlighting fine-grained details. Thus, a Low (High) Frequency Module
is proposed to fit this circumstance. Extensive experiments on the pop-
ular DAVIS and YouTube-VOS benchmarks demonstrate that GSFM
noticeably outperforms the baseline method and achieves state-of-the-
art performance. Besides, extensive analysis shows that the proposed
modules are reasonable and of great generalization ability. Our source
code is available at https://github.com/workforai/GSFM.
Keywords: video object segmentation, spectral domain
1 Introduction
Video Object Segmentation (VOS) [37,38,61,66] aims at identifying and seg-
menting objects in videos. It is one of the most challenging tasks in computer
vision with many potential applications, including interactive video editing, aug-
mented reality [33], and autonomous driving [74]. In this paper, we focus on the
semi-supervised setting where target objects are defined by the given masks of
∗ This work was done during an internship at ByteDance Inc.
† Corresponding author
arXiv:2210.05567v2 [cs.CV] 12 Oct 2022
Fig. 1: Illustration of the disadvantages of lacking global semantic information.
Panels: (a) query pixel in the current frame; (b) matched pixels (STCN); (c)
matched pixels (ours). The highlighted red pixels in the first column are target
pixels. The second column shows that the previous method [6] incorrectly matches
similar pixels. In the third column, our model relieves this confusion by enhancing
low-frequency components and updating features in the spectral domain.
the first frame. It is crucial for semi-supervised VOS to fully utilize the available
reference information to distinguish targets from background objects.
Since the critical problem of this task lies in how to make full use of the
spatial-temporal dependency to recognize the targets, matching-based methods,
which perform pixel-level matching with historical reference frames, have re-
ceived tremendous attention. The Space-Time Memory Network [35] memorizes
intermediate frames with segmentation masks as references and performs pixel-
level matching between them with the current frame to segment target objects
in a bottom-up manner, which has been proved effective and has served as the
current mainstream framework. Some works [40,23,5,15,59,41,51,6,62,46,25,27]
further develop STM and have achieved excellent performance.
Although these methods have made great progress in the field of VOS, they
pay little attention to excavating intra-frame dependency and only utilize local
representation for matching and prediction due to the inductive bias of con-
volution. Lacking global dependency would cause low efficacy in distinguishing
similar pixels, e.g., pixels of similar color or objects of the same category. We take
the typical method STCN [6] for illustration. In Fig. 1(b), some pixels belonging
to background objects are mismatched with the target pixel due to their simi-
lar local features. Ignoring long-range dependency for matching would lead to a
high risk of interference from other objects. Since the matching-based approaches
rely on the matching process to identify the targets, incorrectly matched pixels
would negatively affect the final segmentation and even lead to error accumula-
tion. Therefore, it is necessary to excavate the intra-frame spatial dependency
to enhance the representation of features.
According to Fourier theory [19], each output of the Fourier transform
depends on pixels from all spatial locations of the input feature. Thus, the
spectral domain representation contains rich global information. Inspired by this, we
introduce a Global Spectral Filter Memory network (GSFM), which fuses global
dependency from spectral domain and distinguishes the high-frequency and low-
frequency components for targeted enhancement. In GSFM, we propose the Low
Frequency Module (LFM) and High Frequency Module (HFM) to enhance dif-
ferent representation according to the characteristics of the encoder-decoder net-
work structure.
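As a rough illustration of the underlying operation (a minimal sketch, not the paper's exact module; the filter here is a hand-built stand-in for learned weights), a spectral update amounts to a 2D FFT, per-frequency reweighting, and an inverse FFT:

```python
import numpy as np

def spectral_filter(feature, filt):
    """Update a 2D feature map in the spectral domain.

    feature: (H, W) real-valued feature map.
    filt:    (H, W//2 + 1) filter applied elementwise to the rfft2
             coefficients (a stand-in for learned weights).
    """
    spec = np.fft.rfft2(feature)   # every coefficient mixes all pixels
    spec = spec * filt             # per-frequency reweighting
    return np.fft.irfft2(spec, s=feature.shape)

x = np.random.rand(8, 8)

# An identity filter recovers the input, so the transform itself is lossless.
y = spectral_filter(x, np.ones((8, 8 // 2 + 1)))
assert np.allclose(x, y)

# Changing a single pixel perturbs every spectral coefficient, which is why
# one spectral filtering step has a global receptive field.
x2 = x.copy()
x2[0, 0] += 1.0
assert np.all(np.abs(np.fft.rfft2(x2) - np.fft.rfft2(x)) > 0)
```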
The role of encoder is to extract deep features for subsequent modules, and
the encoded features need to contain rich semantic information. Intuitively, low-
frequency components correspond to high-level semantic information while ignor-
ing details. Some theoretical researches on CNN from spectral domain [64,52,70]
also point out similar observations. Inspired by the above analysis, we propose a
Low-Frequency Module (LFM) for the encoding process to update the features
in the spectral domain and emphasize their low-frequency components. Fig. 1
(c) illustrates that with LFM enhancing global semantic information, the dis-
tinguishability of similar pixels is greatly improved. Extensive experiments also
demonstrate the rationality of emphasizing low-frequency in the encoder.
Different from encoding, features in the decoding process need to contain
more fine-grained information for accurate prediction. High-frequency
components correspond to the image regions that change drastically, e.g., object
boundaries and texture details. Combined with the above analysis, we believe that
focusing on high-frequency components helps enrich the fine-grained
representation of features and yields more accurate predictions of boundaries or
ambiguous regions. Therefore, we introduce a High-Frequency Module (HFM)
in the decoding process, which enhances the high-frequency components of fea-
tures to better capture detailed information. Besides, to take full advantage of
HFM, we combine it with an additional boundary prediction branch to provide
better localization and shape guidance.
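The kind of band emphasis LFM and HFM perform can be sketched as follows (the radial binary mask and fixed gain are illustrative assumptions, not the paper's learned filters):

```python
import numpy as np

def frequency_mask(h, w, cutoff, low=True):
    """Radial binary mask over the 2D spectrum.

    cutoff is a fraction of the maximum radial frequency; low=True selects
    frequencies below the cutoff, low=False selects those above it.
    """
    fy = np.fft.fftfreq(h).reshape(-1, 1)
    fx = np.fft.fftfreq(w).reshape(1, -1)
    radius = np.sqrt(fy ** 2 + fx ** 2)   # 0 at DC, largest at the corners
    keep = radius <= cutoff * radius.max()
    return keep if low else ~keep

def emphasize(feature, cutoff, low=True, gain=2.0):
    """Boost the selected band by `gain` (LFM-style if low, HFM-style if not)."""
    spec = np.fft.fft2(feature)
    mask = frequency_mask(*feature.shape, cutoff, low)
    spec = np.where(mask, spec * gain, spec)
    return np.fft.ifft2(spec).real

x = np.random.rand(16, 16)
low_boosted = emphasize(x, cutoff=0.25, low=True)    # encoder-style emphasis
high_boosted = emphasize(x, cutoff=0.25, low=False)  # decoder-style emphasis

# Boosting the low band doubles the DC term, hence the mean; boosting the
# high band leaves the DC term, and therefore the mean, untouched.
assert np.isclose(low_boosted.mean(), 2 * x.mean())
assert np.isclose(high_boosted.mean(), x.mean())
```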
Experiments show that the proposed model noticeably outperforms the base-
line method and achieves state-of-the-art performance on DAVIS [37,38] and
YouTube-VOS [61] datasets. The contribution of this paper can be summarized
as follows. Firstly, we propose to leverage the spectral domain to enhance the
global spatial dependency of features for semi-supervised VOS. Secondly, con-
sidering the differences between the process of encoding and decoding, we pro-
pose LFM and HFM to perform targeted enhancement, respectively. Thirdly,
we combine object boundaries and high-frequency components to provide better
localization and shape information while keeping the decoding features fine-grained.
2 Related Work
Semi-supervised video object segmentation. Since the masks for the first
frame are given, early methods [3,49,30,50,57] fine-tune the network online
according to the object mask of the first frame, which suffers
from slow inference speed. Propagation-based methods [47,8,7,63,16,1,18,21,13]
propagate the segmentation masks forward as a reference to the next frame,
but they struggle to handle complicated scenarios. Some other researchers
have decoupled VOS into three independent subtasks of detection, tracking, and
segmentation [29,22,17,45]. Although this approach balances running time and
accuracy, it depends heavily on detector performance and
makes the entire pipeline complex.
In recent years matching-based methods have received great attention for ex-
cellent performance and robustness. FEELVOS [48], CFBI [67] and CFBI+ [69]
perform global and local matching with the first frame and the previous adjacent
frame, respectively. AOT [68] associates multiple target objects into the same em-
bedding space by employing an identification mechanism. STM [35] leverages the
memory network to memorize intermediate frames as references, which has been
proved effective and has served as the current mainstream framework. Based
on STM, KMN [40] and RMNet [59] perform local-to-local matching by using
the Gaussian kernel and hard crop strategy. SwiftNet [51] and AFB-URR [25]
reduce memory redundancy by calculating the similarity between query and
memory. LCM [15] and SCM [73] propose spatial constraints to enhance spatial
location information. EGMN [28] employs an episodic memory network to memorize
frames as nodes and captures cross-frame correlations via edges. MiVOS [5]
further develops KMN [40] by utilizing the top-k strategy to reduce noisy
information in the memory read block. STCN [6] improves the feature extraction
and performs more reasonable matching by decoupling the image and masks.
Despite the great performance achieved by these methods, they ignore the
importance of fully excavating the intra-frame global information, which may
lead to a high risk of interference by pixels with similar local features.
Spectral domain learning. Recent years have witnessed increasing research
enthusiasm for combining the spectral domain with deep learning [42,10,39,52,70,58].
Among them, some studies [64,52,70] attempt to explain the behavior of
convolutional neural networks from the spectral perspective. They point
out that the features of different frequency bands represent different types of
information and observe some properties of deep neural networks related to it.
Guided by these works and rethinking the characteristics of
the encoder-decoder structure, we propose separating the high-frequency and
low-frequency components so that each can be utilized appropriately. In this paper,
we introduce a low-frequency module (LFM) and a high-frequency module (HFM). LFM
enhances the low-frequency components during encoding to fuse global semantic
features, while HFM enhances the high-frequency components in the decoder to
make features contain more fine-grained details.
Some previous methods [25,76,65] that apply spatial prior filters or introduce
boundary information into features can also be explained from the perspective of
the spectral domain. Applying filter kernels or highlighting boundaries in the spatial domain
is essentially a special way to distinguish between high and low frequencies. While
this approach can also serve the purpose of targeted enhancement, it loses the
advantage of global perception in the spectral domain. Therefore, our approach
that updates features in the spectral domain is more generalized and effective.
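This claim rests on the convolution theorem: a fixed spatial kernel only reweights frequencies through its own fixed spectrum. A small numpy check (with an assumed 3x3 box-blur kernel and circular boundary handling for exact equivalence):

```python
import numpy as np

h = w = 8
x = np.random.rand(h, w)
kernel = np.zeros((h, w))
kernel[:3, :3] = 1.0 / 9.0            # 3x3 box blur, zero-padded to (h, w)

# Circular convolution computed via pointwise multiplication of spectra...
via_fft = np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(kernel)).real

# ...matches the direct definition of circular convolution, so a spatial
# filter is just one particular (fixed) frequency reweighting.
direct = np.zeros_like(x)
for i in range(h):
    for j in range(w):
        for a in range(3):
            for b in range(3):
                direct[i, j] += x[(i - a) % h, (j - b) % w] * kernel[a, b]

assert np.allclose(via_fft, direct)
```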