Global Spectral Filter Memory Network for Video Object Segmentation
tral domain representation contains rich global information. Inspired by this, we
introduce a Global Spectral Filter Memory network (GSFM), which fuses global
dependency from the spectral domain and distinguishes high-frequency from
low-frequency components for targeted enhancement. In GSFM, we propose a
Low-Frequency Module (LFM) and a High-Frequency Module (HFM) to enhance
different representations according to the characteristics of the encoder-decoder
network structure.
The role of the encoder is to extract deep features for subsequent modules, so
the encoded features need to contain rich semantic information. Intuitively, low-
frequency components correspond to high-level semantic information while ignor-
ing details. Theoretical studies of CNNs from the spectral-domain perspective
[64,52,70] report similar observations. Motivated by this analysis, we propose a
Low-Frequency Module (LFM) for the encoding process that updates the features
in the spectral domain and emphasizes their low-frequency components. Fig. 1
(c) illustrates that with LFM enhancing global semantic information, the dis-
tinguishability of similar pixels is greatly improved. Extensive experiments also
demonstrate the rationality of emphasizing low-frequency components in the encoder.
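As a minimal illustration of low-frequency emphasis in the spectral domain (a sketch of the general idea, not the paper's actual LFM implementation), the following pure-Python code transforms a 1-D signal with the discrete Fourier transform, scales the low-frequency bins by a gain factor, and transforms back. The function names, the 1-D setting, and the simple gain-based filter are our own assumptions for clarity; the actual module operates on 2-D feature maps with learned filters.

```python
import cmath

def dft(x):
    """Discrete Fourier transform of a real-valued sequence."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    """Inverse DFT; returns the real part of the reconstruction."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def lowfreq_emphasize(x, cutoff, gain=2.0):
    """Amplify spectral bins whose frequency index is <= cutoff.

    Bin k and bin N-k represent the same frequency (conjugate pair),
    so both are scaled together via min(k, N - k).
    """
    X = dft(x)
    N = len(X)
    Y = [X[k] * (gain if min(k, N - k) <= cutoff else 1.0)
         for k in range(N)]
    return idft(Y)

# A constant signal is pure DC (frequency 0), so emphasizing
# only the lowest bin doubles the whole signal.
y = lowfreq_emphasize([1.0] * 8, cutoff=0, gain=2.0)
```

Here the constant input `[1.0] * 8` maps entirely onto the DC bin, so with `gain=2.0` the output is a constant signal of value 2.0, illustrating that low-frequency emphasis acts on the globally smooth part of the signal.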
Different from encoding, features in the decoding process need to contain
more fine-grained information for accurate prediction, and high-frequency com-
ponents correspond to the image parts that change drastically, e.g., object bound-
aries and texture details. Based on this analysis, we believe that focusing on
high-frequency components helps enrich the fine-grained representation of fea-
tures and yields more accurate predictions for boundaries and ambiguous regions.
Therefore, we introduce a High-Frequency Module (HFM) in the decoding pro-
cess, which enhances the high-frequency components of features to better capture
detailed information. In addition, to take full advantage of HFM, we combine it
with an additional boundary prediction branch to provide better localization and
shape guidance.
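The complementary operation, high-frequency emphasis, can be sketched the same way (again a simplified 1-D illustration under our own naming, not the paper's HFM): bins above a cutoff frequency are amplified, which boosts exactly the rapidly changing parts of the signal, such as edges.

```python
import cmath

def dft(x):
    """Discrete Fourier transform of a real-valued sequence."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    """Inverse DFT; returns the real part of the reconstruction."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def highfreq_emphasize(x, cutoff, gain=2.0):
    """Amplify spectral bins whose frequency index is > cutoff."""
    X = dft(x)
    N = len(X)
    Y = [X[k] * (gain if min(k, N - k) > cutoff else 1.0)
         for k in range(N)]
    return idft(Y)

# x[n] = 1 + (-1)^n mixes a DC term with the fastest-alternating
# (Nyquist) frequency; emphasizing high frequencies with gain 2
# yields 1 + 2*(-1)^n, i.e. [3, -1, 3, -1, ...].
y = highfreq_emphasize([2.0, 0.0] * 4, cutoff=0, gain=2.0)
```

The alternating component, which models a sharp repeated transition, is doubled while the smooth DC component is untouched, which is the behavior one wants when sharpening boundaries in the decoder.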
Experiments show that the proposed model noticeably outperforms the base-
line method and achieves state-of-the-art performance on the DAVIS [37,38] and
YouTube-VOS [61] datasets. The contributions of this paper can be summarized
as follows. Firstly, we propose to leverage the spectral domain to enhance the
global spatial dependency of features for semi-supervised VOS. Secondly, con-
sidering the differences between the encoding and decoding processes, we pro-
pose LFM and HFM to perform targeted enhancement in each. Thirdly, we
combine object boundaries and high-frequency components to provide better
localization and shape information while keeping the decoded features fine-grained.
2 Related Work
Semi-supervised video object segmentation. Since the mask for the first
frame is given, early methods [3,49,30,50,57] fine-tune the network online on
the object mask of the first frame, which suffers from slow inference speed.
Propagation-based methods [47,8,7,63,16,1,18,21,13] propagate the segmenta-
tion masks forward as a reference for the next frame,