2 Related Work
Semi-supervised Video Object Segmentation.
Given a video with one or several annotated frames
(typically the first frame), semi-supervised VOS [52] requires algorithms to propagate the mask
annotations to the entire video. Traditional methods often solve an optimization problem with an
energy defined over a graph structure [2,4,49]. In recent years, methods based on deep neural
networks (DNNs) have achieved significant progress and come to dominate the field.
Finetuning-based Methods. Early DNN-based methods rely on fine-tuning pre-trained segmentation
networks at test time so that the networks focus on the given object. Among them, OSVOS [7] and
MoNet [56] propose to fine-tune pre-trained networks on the first-frame annotation. OnAVOS [51]
extends first-frame fine-tuning with an online adaptation mechanism. Following these approaches,
MaskTrack [37] and PReMVOS [32] further utilize optical flow to help propagate the segmentation
mask from one frame to the next.
Template-based Methods. To avoid test-time fine-tuning, many researchers regard the annotated
frames as templates and investigate how to match the current frame against them. For example,
OSMN [60] employs one network to extract an object embedding and another to predict segmentation
based on that embedding. PML [10] learns pixel-wise embeddings with a nearest-neighbor classifier,
and VideoMatch [22] uses a matching layer that maps pixels of the current frame to the annotated
frame in a learned embedding space. Following these methods, FEELVOS [50] and CFBI(+) [62,64]
extend the pixel-level matching mechanism by additionally performing local matching with the
previous frame, and RPCM [58] proposes a correction module to improve the reliability of pixel-level
matching. Instead of using matching mechanisms, LWL [6] proposes an online few-shot learner that
learns to decode object segmentation.
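To make the matching mechanism concrete, the sketch below shows nearest-neighbor pixel matching in a learned embedding space, in the spirit of PML and VideoMatch. It is a minimal illustration rather than any cited paper's implementation; the function name and tensor layout are our own assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_matching(ref_emb, ref_mask, cur_emb):
    """Nearest-neighbor pixel matching in embedding space (illustrative sketch).

    ref_emb:  (C, H, W) embeddings of the annotated template frame
    ref_mask: (H, W)    binary foreground mask of the template frame
    cur_emb:  (C, H, W) embeddings of the current frame
    Returns an (H, W) soft foreground score for the current frame.
    Assumes the template contains both foreground and background pixels.
    """
    ref = F.normalize(ref_emb.flatten(1), dim=0)  # (C, HW), unit-norm pixel embeddings
    cur = F.normalize(cur_emb.flatten(1), dim=0)  # (C, HW)
    sim = cur.t() @ ref                           # (HW_cur, HW_ref) cosine similarities
    fg = ref_mask.flatten().bool()
    # Score each current pixel by its nearest template neighbor of each class.
    fg_score = sim[:, fg].max(dim=1).values
    bg_score = sim[:, ~fg].max(dim=1).values
    score = torch.softmax(torch.stack([bg_score, fg_score], dim=1), dim=1)[:, 1]
    return score.view(cur_emb.shape[1:])
```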
Attention-based Methods. Building on advances in attention mechanisms [5,48,53], STM [34] and
subsequent works (e.g., KMN [43] and STCN [11]) leverage a memory network that embeds past-frame
predictions into memory and applies a non-local attention mechanism over the memory to propagate
mask information to the current frame. In contrast, SST [17] proposes to calculate pixel-level
matching maps based on the attention maps of transformer blocks [48]. Recently, AOT [61,63,65]
introduces hierarchical propagation into VOS and can associate multiple objects collaboratively with
its identification (ID) mechanism.
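The core memory-read operation of these methods can be summarized as non-local attention over memorized frames. The following is a minimal sketch in the spirit of STM; the scaled dot-product affinity and tensor layout are simplifying assumptions (STCN, for instance, uses an L2 affinity instead), not the exact formulation of any cited method.

```python
import torch

def memory_read(mem_key, mem_val, query_key):
    """Non-local memory read (illustrative sketch): current-frame queries attend
    over keys of T memorized frames and aggregate the stored value features,
    which encode past mask predictions.

    mem_key:   (Ck, T*H*W) keys of memorized frames
    mem_val:   (Cv, T*H*W) values of memorized frames
    query_key: (Ck, H*W)   keys of the current frame
    Returns (Cv, H*W) mask features propagated to the current frame.
    """
    Ck = mem_key.shape[0]
    attn = query_key.t() @ mem_key / (Ck ** 0.5)  # (HW, THW) affinity
    attn = torch.softmax(attn, dim=1)             # normalize over memory locations
    return (attn @ mem_val.t()).t()               # (Cv, HW) aggregated readout
```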
Visual Transformers.
The transformer [48] was initially proposed to build hierarchical attention-based
networks for natural language processing (NLP). Compared to RNNs, transformer networks model
global correlation or attention in parallel, leading to better memory efficiency, and have thus been
widely used in NLP tasks [15,40,46]. Similar to Non-local Neural Networks [53], transformer
blocks compute the correlation of each element with all input elements and aggregate their
information using attention mechanisms [5]. Recently, transformer blocks have been introduced to
computer vision and have shown promising performance in many tasks, such as image classification
[16,30,47], object detection [8], segmentation [25,35,54,66], image generation [36], and video
understanding [1,26,31].
Based on transformers, AOT [63] proposes a Long Short-Term Transformer (LSTT) structure for
constructing hierarchical propagation. By hierarchically propagating object information, AOT
variants [63] have shown promising performance with remarkable scalability. Unlike AOT, which
shares the embedding space for object-agnostic and object-specific embeddings, we propose to
decouple them into two branches with individual propagation processes. Such a dual-branch
paradigm avoids the loss of object-agnostic information and achieves significant improvements. In
addition, we propose a more efficient structure, the Gated Propagation Module (GPM), for
hierarchical propagation.
3 Rethinking Hierarchical Propagation for VOS
Attention-based VOS methods [11,34,43,63] dominate the field. Among them, STM [34] and its
successors [11,43] use a single attention layer to propagate mask information from memorized
frames to the current frame. Using only a single attention layer restricts the scalability of these
algorithms. Hence, AOT [63] introduces hierarchical propagation into VOS through the proposed
Long Short-Term Transformer (LSTT) structure, which propagates mask information in a
hierarchical, coarse-to-fine manner. By adjusting the number of LSTT layers, AOT variants can
range from state-of-the-art accuracy to real-time speed.
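Conceptually, hierarchical propagation stacks several attention-based propagation layers so that mask information is transferred repeatedly, coarse-to-fine, rather than in a single pass. The sketch below illustrates this idea with standard transformer components; it omits LSTT's short-term (local) attention and self-attention sub-blocks, and all module names and hyperparameters are our own assumptions.

```python
import torch.nn as nn

class PropagationLayer(nn.Module):
    """One propagation level (illustrative): cross-attend from the current
    frame to the memory, then refine with a feed-forward block."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, cur, mem):
        # cur: (B, HW, C) current-frame tokens; mem: (B, T*HW, C) memory tokens.
        cur = cur + self.attn(self.norm1(cur), mem, mem)[0]
        return cur + self.ffn(self.norm2(cur))

class HierarchicalPropagation(nn.Module):
    """A stack of L propagation layers; the depth L trades accuracy for speed."""

    def __init__(self, dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(PropagationLayer(dim) for _ in range(num_layers))

    def forward(self, cur, mem):
        for layer in self.layers:
            cur = layer(cur, mem)  # each level re-attends to the memory
        return cur
```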