2 Related Work
Semi-supervised Video Object Segmentation.
Given a video with one or several annotated frames
(typically the first frame), semi-supervised VOS [52] requires algorithms to propagate the mask
annotations to the entire video. Traditional methods often solve an optimization problem with an
energy defined over a graph structure [2,4,49]. In recent years, methods based on deep neural
networks (DNNs) have achieved significant progress and come to dominate the field.
Finetuning-based Methods. Early DNN-based methods rely on fine-tuning pre-trained segmentation
networks at test time so that the networks focus on the given object. Among them, OSVOS [7] and
MoNet [56] propose to fine-tune pre-trained networks on the first-frame annotation. OnAVOS [51]
extends first-frame fine-tuning with an online adaptation mechanism. Following these approaches,
MaskTrack [37] and PReMVOS [32] further utilize optical flow to help propagate the segmentation
mask from one frame to the next.
Template-based Methods. To avoid test-time fine-tuning, many researchers regard the annotated
frames as templates and investigate how to match the current frame against them. For example,
OSMN [60] employs one network to extract an object embedding and another to predict segmentation
based on that embedding. PML [10] learns pixel-wise embeddings with a nearest-neighbor classifier,
and VideoMatch [22] uses a matching layer that maps pixels of the current frame to the annotated
frame in a learned embedding space. Following these methods, FEELVOS [50] and CFBI(+) [62,64]
extend the pixel-level matching mechanism by additionally performing local matching with the
previous frame, and RPCM [58] proposes a correction module to improve the reliability of pixel-level
matching. Instead of using matching mechanisms, LWL [6] proposes an online few-shot learner that
learns to decode object segmentation.
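To make the matching mechanism concrete, the sketch below shows nearest-neighbor pixel matching in a learned embedding space, in the spirit of PML and VideoMatch. It is a minimal illustration rather than any cited paper's implementation; the function name and tensor layout are our own assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_matching(ref_emb, ref_mask, cur_emb):
    """Nearest-neighbor pixel matching in embedding space (illustrative sketch).

    ref_emb:  (C, H, W) embeddings of the annotated template frame
    ref_mask: (H, W)    binary foreground mask of the template frame
    cur_emb:  (C, H, W) embeddings of the current frame
    Returns an (H, W) soft foreground score for the current frame.
    Assumes the template contains both foreground and background pixels.
    """
    ref = F.normalize(ref_emb.flatten(1), dim=0)  # (C, HW), unit-norm pixel embeddings
    cur = F.normalize(cur_emb.flatten(1), dim=0)  # (C, HW)
    sim = cur.t() @ ref                           # (HW_cur, HW_ref) cosine similarities
    fg = ref_mask.flatten().bool()
    # Score each current pixel by its nearest template neighbor of each class.
    fg_score = sim[:, fg].max(dim=1).values
    bg_score = sim[:, ~fg].max(dim=1).values
    score = torch.softmax(torch.stack([bg_score, fg_score], dim=1), dim=1)[:, 1]
    return score.view(cur_emb.shape[1:])
```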
Attention-based Methods. Building on advances in attention mechanisms [5,48,53], STM [34] and
subsequent works (e.g., KMN [43] and STCN [11]) leverage a memory network that embeds past-frame
predictions into memory and applies a non-local attention mechanism over the memory to propagate
mask information to the current frame. In contrast, SST [17] proposes to calculate pixel-level
matching maps based on the attention maps of transformer blocks [48]. Recently, AOT [61,63,65]
introduces hierarchical propagation into VOS and can associate multiple objects collaboratively with
its identification (ID) mechanism.
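The core memory-read operation of these methods can be summarized as non-local attention over memorized frames. The following is a minimal sketch in the spirit of STM; the scaled dot-product affinity and tensor layout are simplifying assumptions (STCN, for instance, uses an L2 affinity instead), not the exact formulation of any cited method.

```python
import torch

def memory_read(mem_key, mem_val, query_key):
    """Non-local memory read (illustrative sketch): current-frame queries attend
    over keys of T memorized frames and aggregate the stored value features,
    which encode past mask predictions.

    mem_key:   (Ck, T*H*W) keys of memorized frames
    mem_val:   (Cv, T*H*W) values of memorized frames
    query_key: (Ck, H*W)   keys of the current frame
    Returns (Cv, H*W) mask features propagated to the current frame.
    """
    Ck = mem_key.shape[0]
    attn = query_key.t() @ mem_key / (Ck ** 0.5)  # (HW, THW) affinity
    attn = torch.softmax(attn, dim=1)             # normalize over memory locations
    return (attn @ mem_val.t()).t()               # (Cv, HW) aggregated readout
```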
Visual Transformers.
The transformer [48] was initially proposed to build hierarchical attention-based
networks for natural language processing (NLP). Compared to RNNs, transformer networks model
global correlation or attention in parallel, leading to better memory efficiency, and have thus been
widely used in NLP tasks [15,40,46]. Similar to Non-local Neural Networks [53], transformer
blocks compute the correlation of each element with all input elements and aggregate their
information using attention mechanisms [5]. Recently, transformer blocks have been introduced to
computer vision and have shown promising performance in many tasks, such as image classification
[16,30,47], object detection [8], segmentation [25,35,54,66], image generation [36], and video
understanding [1,26,31].
Based on transformers, AOT [63] proposes a Long Short-Term Transformer (LSTT) structure for
constructing hierarchical propagation. By hierarchically propagating object information, AOT
variants [63] have shown promising performance with remarkable scalability. Unlike AOT, which
shares the embedding space for object-agnostic and object-specific embeddings, we propose to
decouple them into two branches with individual propagation processes. Such a dual-branch
paradigm avoids the loss of object-agnostic information and achieves significant improvements. In
addition, we propose a more efficient structure, the Gated Propagation Module (GPM), for
hierarchical propagation.
3 Rethinking Hierarchical Propagation for VOS
Attention-based VOS methods [11,34,43,63] dominate the field. Among them, STM [34] and its
successors [11,43] use a single attention layer to propagate mask information from memorized
frames to the current frame. Using only a single attention layer restricts the scalability of these
algorithms. Hence, AOT [63] introduces hierarchical propagation into VOS through the proposed
Long Short-Term Transformer (LSTT) structure, which propagates mask information in a
hierarchical, coarse-to-fine manner. By adjusting the number of LSTT layers, AOT variants can
range from state-of-the-art accuracy to real-time speed.
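Conceptually, hierarchical propagation stacks several attention-based propagation layers so that mask information is transferred repeatedly, coarse-to-fine, rather than in a single pass. The sketch below illustrates this idea with standard transformer components; it omits LSTT's short-term (local) attention and self-attention sub-blocks, and all module names and hyperparameters are our own assumptions.

```python
import torch.nn as nn

class PropagationLayer(nn.Module):
    """One propagation level (illustrative): cross-attend from the current
    frame to the memory, then refine with a feed-forward block."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, cur, mem):
        # cur: (B, HW, C) current-frame tokens; mem: (B, T*HW, C) memory tokens.
        cur = cur + self.attn(self.norm1(cur), mem, mem)[0]
        return cur + self.ffn(self.norm2(cur))

class HierarchicalPropagation(nn.Module):
    """A stack of L propagation layers; the depth L trades accuracy for speed."""

    def __init__(self, dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(PropagationLayer(dim) for _ in range(num_layers))

    def forward(self, cur, mem):
        for layer in self.layers:
            cur = layer(cur, mem)  # each level re-attends to the memory
        return cur
```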