Decoupling Features in Hierarchical Propagation
for Video Object Segmentation
Zongxin Yang1,2, Yi Yang1†
1CCAI, College of Computer Science and Technology, Zhejiang University 2Baidu Research
{yangzongxin, yangyics}@zju.edu.cn
Abstract
This paper focuses on developing a more effective method of hierarchical propa-
gation for semi-supervised Video Object Segmentation (VOS). Based on vision
transformers, the recently-developed Associating Objects with Transformers (AOT)
approach introduces hierarchical propagation into VOS and has shown promis-
ing results. The hierarchical propagation can gradually propagate information
from past frames to the current frame and transfer the current frame feature from
object-agnostic to object-specific. However, the increase of object-specific infor-
mation will inevitably lead to the loss of object-agnostic visual information in deep
propagation layers. To solve such a problem and further facilitate the learning
of visual embeddings, this paper proposes a Decoupling Features in Hierarchical
Propagation (DeAOT) approach. Firstly, DeAOT decouples the hierarchical propa-
gation of object-agnostic and object-specific embeddings by handling them in two
independent branches. Secondly, to compensate for the additional computation
from dual-branch propagation, we propose an efficient module for constructing hi-
erarchical propagation, i.e., Gated Propagation Module, which is carefully designed
with single-head attention. Extensive experiments show that DeAOT significantly
outperforms AOT in both accuracy and efficiency. On YouTube-VOS, DeAOT can
achieve 86.0% at 22.4fps and 82.0% at 53.4fps. Without test-time augmentations,
we achieve new state-of-the-art performance on four benchmarks, i.e., YouTube-
VOS (86.2%), DAVIS 2017 (86.2%), DAVIS 2016 (92.9%), and VOT 2020 (0.622).
Project page: https://github.com/z-x-yang/AOT.
1 Introduction
Video Object Segmentation (VOS), which aims at recognizing and segmenting one or multiple objects
of interest in a given video, has attracted much attention as a fundamental task of video understanding.
This paper focuses on semi-supervised VOS, which requires algorithms to track and segment objects
throughout a video sequence given objects’ annotated masks at one or several frames.
Early VOS methods are mainly based on finetuning segmentation networks on the annotated frames [7,
32,51] or constructing pixel-wise matching maps [10,50]. Based on the advance of attention
mechanisms [5,48,53], many attention-based VOS algorithms have been proposed in recent years and
achieved significant improvement. STM [34] and the following works [11,43,44] leverage a memory
network to store and read the target features of predicted past frames and apply a non-local attention
mechanism to match the target in the current frame. Furthermore, AOT [61,63,65] introduces
hierarchical propagation into VOS based on transformers [8,48] and can associate multiple objects
†: the corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.09782v3 [cs.CV] 28 Nov 2022
[Figure 1: three panels showing (a) AOT-like hierarchical propagation, (b) decoupling features (ours), and (c) a speed-accuracy comparison.]
Figure 1: (a) AOT [63] hierarchically propagates (Prop) object-specific information (i.e., specific to
the given object(s)) into the object-agnostic visual embedding. (b) By contrast, DeAOT decouples
the propagation of visual and ID embeddings in two branches. (c) Speed-accuracy comparison. All
the results were fairly recorded on the same device, a single Tesla V100 GPU.
collaboratively by utilizing the IDentification (ID) mechanism [63]. The hierarchical propagation can
gradually propagate ID information from past frames to the current frame and has shown promising
VOS performance with remarkable scalability.
Fig. 1a shows that AOT’s hierarchical propagation can transfer the current frame feature from an
object-agnostic visual embedding to an object-specific ID embedding by hierarchically propagating
the reference information into the current frame. The hierarchical structure enables AOT to be
structurally scalable between state-of-the-art performance and real-time efficiency. Intuitively, the
accumulation of ID information will inevitably crowd out the initial visual information, since the
feature dimensionality is limited. However, matching objects' visual features, the only cues provided
by the current frame, is crucial for attention-based VOS solutions. To avoid the loss of visual
information in deeper propagation layers and facilitate the learning of visual embeddings, a desirable
manner (Fig. 1b) is to decouple object-agnostic and object-specific embeddings in the propagation.
Based on the above motivation, this paper proposes a novel hierarchical propagation approach for
VOS, i.e., Decoupling Features in Hierarchical Propagation (DeAOT). Unlike AOT, which shares the
embedding space for visual (object-agnostic) and ID (object-specific) embeddings, DeAOT decouples
them into different branches using individual propagation processes while sharing their attention
maps. To compensate for the additional computation from the dual-branch propagation, we propose
a more efficient module for constructing hierarchical propagation, i.e., Gated Propagation Module
(GPM). By carefully designing GPM for VOS, we are able to use single-head attention to match
objects and propagate information instead of the stronger multi-head attention [48], which we found
to be an efficiency bottleneck of AOT [63].
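To make the shared-attention idea concrete, the following is a minimal sketch (with hypothetical names, not the released DeAOT code): a single-head attention map is computed once from visual queries and keys, then reused to propagate both the object-agnostic visual values and the object-specific ID values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention_propagate(q_vis, k_vis, v_vis, v_id):
    """Single-head attention whose map is shared by two branches.

    q_vis: (n_cur, d) visual queries from the current frame
    k_vis: (n_mem, d) visual keys from memorized frames
    v_vis: (n_mem, d) object-agnostic (visual) values
    v_id:  (n_mem, d) object-specific (ID) values
    """
    attn = softmax(q_vis @ k_vis.T / np.sqrt(q_vis.shape[-1]))
    out_vis = attn @ v_vis  # visual branch: matching stays object-agnostic
    out_id = attn @ v_id    # ID branch: reuses the same matching result
    return out_vis, out_id

rng = np.random.default_rng(0)
q, k = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
v_vis, v_id = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out_vis, out_id = shared_attention_propagate(q, k, v_vis, v_id)
```

Because the map is computed only once, the second branch adds one matrix product per layer rather than a second full attention, which is what keeps the dual-branch design affordable.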
To evaluate the proposed DeAOT approach, a series of experiments are conducted on three VOS
benchmarks (YouTube-VOS [57], DAVIS 2017 [39], and DAVIS 2016 [38]) and one Visual Object
Tracking (VOT) benchmark (VOT 2020 [24]). On the large-scale VOS benchmark, YouTube-VOS,
the DeAOT variant networks remarkably outperform AOT counterparts in both accuracy and run-time
speed as shown in Fig. 1c. Particularly, our R50-DeAOT-L can achieve 86.0% at a nearly real-time
speed, 22.4fps, and our DeAOT-T can achieve 82.0% at 53.4fps, which is superior to AOT-T [63]
(80.2%, 41.0fps). Without any test-time augmentations, our SwinB-DeAOT-L achieves top-ranked
performance on four VOS/VOT benchmarks, i.e., YouTube-VOS 2018/2019 (86.2%/86.1%),
DAVIS 2017 Val/Test (86.2%/82.8%), DAVIS 2016 (92.9%), and VOT 2020 (0.622 EAO).
Overall, our contributions are summarized below:
• We propose a highly-effective VOS framework, DeAOT, by decoupling object-agnostic and object-specific features in hierarchical propagation. DeAOT achieves top-ranked performance and efficiency on four VOS/VOT benchmarks [24,38,39,57].
• We design an efficient module, GPM, for constructing hierarchical matching and propagation. With GPM, DeAOT variants are consistently faster than their AOT counterparts, even though DeAOT runs twice as many propagation processes as AOT.
2 Related Work
Semi-supervised Video Object Segmentation.
Given a video with one or several annotated frames
(the first frame in general), semi-supervised VOS [52] requires algorithms to propagate the mask
annotations to the entire video. Traditional methods often solve an optimization problem with an
energy defined over a graph structure [2,4,49]. In recent years, deep-learning-based VOS methods
have achieved significant progress and come to dominate the field.
Finetuning-based Methods. Early DNN-based methods rely on fine-tuning pre-trained segmentation
networks at test time to make the networks focus on the given object. Among them, OSVOS [7] and
MoNet [56] propose to fine-tune pre-trained networks on the first-frame annotation. OnAVOS [51]
extends the first-frame fine-tuning by introducing an online adaptation mechanism. Following
these approaches, MaskTrack [37] and PReM [32] further utilize optical flow to help propagate the
segmentation mask from one frame to the next.
Template-based Methods. To avoid test-time fine-tuning, many researchers regard the
annotated frames as templates and investigate how to match with them. For example, OSMN [60]
employs a network to extract object embedding and another one to predict segmentation based on
the embedding. PML [10] learns pixel-wise embedding with the nearest neighbor classifier, and
VideoMatch [22] uses a matching layer to map the pixels of the current frame to the annotated frame
in a learned embedding space. Following these methods, FEELVOS [50] and CFBI(+) [62,64] extend
the pixel-level matching mechanism by additionally doing local matching with the previous frame,
and RPCM [58] proposes a correction module to improve the reliability of pixel-level matching.
Instead of using matching mechanisms, LWL [6] proposes to use an online few-shot learner to learn
to decode object segmentation.
Attention-based Methods. Based on the advance of attention mechanisms [5,48,53], STM [34] and the
following works (e.g., KMN [43] and STCN [11]) leverage a memory network to embed past-frame
predictions into memory and apply a non-local attention mechanism on the memory to propagate
mask information to the current frame. In contrast, SST [17] proposes to calculate pixel-level
matching maps based on the attention maps of transformer blocks [48]. Recently, AOT [61,63,65]
introduces hierarchical propagation into VOS and can associate multiple objects collaboratively with
the proposed ID mechanism.
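The memory read shared by these attention-based methods can be sketched as a non-local attention over memorized features. This is a simplified illustration with hypothetical names, not STM's actual architecture (which also involves dedicated encoders and a decoder):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(query, mem_keys, mem_values):
    # Non-local read: every current-frame location attends to every
    # memorized location, then aggregates the stored mask features.
    attn = softmax(query @ mem_keys.T / np.sqrt(query.shape[-1]))
    return attn @ mem_values

rng = np.random.default_rng(1)
query = rng.normal(size=(5, 16))       # current-frame features
mem_keys = rng.normal(size=(20, 16))   # keys from predicted past frames
mem_values = np.ones((20, 16))         # stored mask features, here constant
read_out = memory_read(query, mem_keys, mem_values)
```

Since each attention row is a convex combination, a constant memory value propagates unchanged, which is a handy sanity check for such a read operator.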
Visual Transformers.
The transformer [48] was initially proposed to build hierarchical attention-based
networks for natural language processing (NLP). Compared to RNNs, transformer networks model
global correlation or attention in parallel, leading to better memory efficiency, and thus have been
widely used in NLP tasks [15,40,46]. Similar to Non-local Neural Networks [53], transformer
blocks compute correlation with all the input elements and aggregate their information by using
attention mechanisms [5]. Recently, transformer blocks were introduced to computer vision and
have shown promising performance in many tasks, such as image classification [16,30,47], object
detection [8]/segmentation [25,35,54,66], image generation [36], and video understanding [1,26,31].
Based on transformers, AOT [63] proposes a Long Short-Term Transformer (LSTT) structure for
constructing hierarchical propagation. By hierarchically propagating object information, AOT
variants [63] have shown promising performance with remarkable scalability. Unlike AOT, which
shares the embedding space for object-agnostic and object-specific embeddings, we propose to
decouple them into different branches using individual propagation processes. Such a dual-branch
paradigm avoids the loss of object-agnostic information and achieves significant improvement.
Besides, a more efficient structure, GPM, is proposed for hierarchical propagation.
3 Rethinking Hierarchical Propagation for VOS
Attention-based VOS methods [11,34,43,63] dominate the field of VOS. Among these methods,
STM [34] and the following algorithms [11,43] use a single attention layer to propagate mask
information from memorized frames to the current frame. Relying on only a single attention layer
restricts the scalability of these algorithms. Hence, AOT [63] introduces hierarchical propagation into VOS
by proposing the Long Short-term Transformer (LSTT) structure, which propagates the mask
information in a hierarchical, coarse-to-fine manner. By adjusting the number of LSTT layers, AOT
variants can range from state-of-the-art accuracy to real-time run-time speed.
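The depth/speed trade-off can be illustrated with a toy stack of attention-based propagation layers. This is a hedged sketch under simplifying assumptions (no feed-forward sub-layers, no separate long/short-term memories), not the actual LSTT block:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def propagation_layer(x_cur, x_mem, w_q, w_k, w_v):
    # One attention step with a residual connection: the current frame
    # absorbs information from memorized frames.
    q, k, v = x_cur @ w_q, x_mem @ w_k, x_mem @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return x_cur + attn @ v

def hierarchical_propagate(x_cur, x_mem, layer_weights):
    # More layers -> deeper coarse-to-fine propagation (accuracy);
    # fewer layers -> faster inference (speed).
    for w_q, w_k, w_v in layer_weights:
        x_cur = propagation_layer(x_cur, x_mem, w_q, w_k, w_v)
    return x_cur

rng = np.random.default_rng(2)
d = 8
x_cur = rng.normal(size=(4, d))
x_mem = rng.normal(size=(10, d))
weights = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(3)]
deep_out = hierarchical_propagate(x_cur, x_mem, weights)
```

With an empty layer list the function is the identity, mirroring how the layer count is the knob that scales the model between the two regimes.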