[23, 9, 19], video panoptic segmentation [22] is a newly introduced task that has received comparatively little attention. Existing methods, such as VPSNet [22] and ViP-DeepLab [34], focus on improving panoptic segmentation quality, but at a high computational cost. To increase consistency among frames, VPSNet [22] designs a temporal fusion module based on optical flow that aggregates warped features from five neighboring past and future frames. In contrast, our network is more accurate and efficient, even though it operates in an online fashion: only the current frame is processed, while the Transformer block reads past features from memory and attends to relevant positions in order to provide an enhanced representation. ViP-DeepLab [34] models video panoptic segmentation as concatenated image panoptic segmentation and achieves state-of-the-art results for this task. Compared to
this work, we explicitly encode spatio-temporal correlations
for a boost in performance and obtain competitive results
with a more lightweight network.
As shown in Figure 1, we propose a novel video panop-
tic segmentation approach by extending the single-frame panoptic segmentation network Panoptic-DeepLab [9] with a Transformer video module and a motion estimation decoder. Our module, inspired by the pure Transformer block
[38], refines the current backbone output features by pro-
cessing the sequence of spatio-temporal features from cur-
rent and past frames. As we strive to develop an efficient
network, we design a lightweight variant of the pure Trans-
former block that is faster than the original implementation
[38]. We also present several ways to factorize the atten-
tion operation of the Transformer over space and time and
compare their accuracy and efficiency in extensive ablation
studies. Given the enhanced feature representations from
the Transformer module, three convolutional decoders re-
cover the spatial resolution of the input image and multi-
ple heads perform semantic segmentation, instance center
prediction, instance offset regression and optical flow es-
timation. To ensure consistent instance identifiers for the same instance across frames, we implement a simple tracking module [27, 43]: masks are propagated with optical flow, and instance IDs are associated between warped and predicted instance masks based on the class label and the intersection over union (see the sketch below). Our video panoptic segmentation network is designed to achieve a good trade-off between speed
and accuracy. Each newly introduced module is carefully
designed to preserve the efficiency of the system. The pro-
posed network can be trained in a weakly supervised regime
with a sparsely annotated dataset, since it does not require
labels for previous frames. We perform extensive experi-
ments on the Cityscapes-VPS [22] dataset and demonstrate
that the proposed methods improve both image- and video-based panoptic segmentation.
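To make the ID association step concrete, the following is a minimal sketch of matching flow-warped instance masks from the previous frame against the current predictions by class label and intersection over union. The function names, data layout and IoU threshold are illustrative assumptions, not our exact implementation.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def associate_ids(warped, predicted, iou_thresh=0.5):
    """Propagate instance IDs from frame t-1 to frame t (illustrative).

    warped:    list of (track_id, class_id, mask) from the previous frame,
               with each mask warped to frame t by the predicted optical flow
    predicted: list of (class_id, mask) segmented at frame t
    Returns a list of (track_id, class_id, mask); predictions without a
    match above iou_thresh start a new track.
    """
    next_id = 1 + max((tid for tid, _, _ in warped), default=0)
    results, used = [], set()
    for cls_p, mask_p in predicted:
        # Candidates: same-class warped instances not yet matched.
        cands = [(mask_iou(m, mask_p), tid)
                 for tid, cls_w, m in warped
                 if cls_w == cls_p and tid not in used]
        best_iou, best_tid = max(cands, default=(0.0, None))
        if best_iou >= iou_thresh:
            tid = best_tid
            used.add(tid)
        else:
            tid, next_id = next_id, next_id + 1  # unmatched: new instance ID
        results.append((tid, cls_p, mask_p))
    return results
```

Greedy matching of this form runs once per frame over a handful of instances, so the tracking overhead is negligible compared to the network forward pass.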
To summarize, our main contributions are the following:
1. We propose a novel video panoptic segmentation net-
work with a pure Transformer-based video module
that applies spatial self-attention and temporal self-
attention on sequences of current and past image fea-
tures for more accurate panoptic prediction.
2. We extend our panoptic segmentation network with
an optical flow decoder that is used for instance mask
warping in the tracking process. Instance ID association between the warped and predicted panoptic segmentations ensures temporally consistent instance identifiers across frames in a video sequence.
3. We propose a lightweight Transformer module inspired by the pure Transformer architecture with three different designs of the attention mechanism: space self-attention, global time-space attention and local time-space attention (a minimal sketch follows this list). We compare the variants in extensive experiments.
4. We perform extensive experiments on the Cityscapes-
VPS dataset and demonstrate that the proposed mod-
ules increase accuracy and temporal consistency, with-
out introducing significant extra computational cost.
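To make the attention designs of contribution 3 concrete, the sketch below illustrates how the space self-attention and global time-space attention variants can operate on a stack of current and past frame features. The tensor layout, the use of nn.MultiheadAttention and all names are illustrative assumptions rather than our exact implementation; the local time-space variant would additionally restrict each query to a spatial neighborhood.

```python
import torch
import torch.nn as nn

class AttentionVariants(nn.Module):
    """Illustrative sketch of two attention designs over video features;
    tensor layout and module choices are assumptions, not the exact code."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def space_self_attention(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, C) -- T frames, N = H*W spatial positions, C channels.
        # Each frame attends only over its own spatial positions
        # (the batch dimension here is the T frames).
        return x + self.attn(x, x, x)[0]

    def global_time_space_attention(self, x: torch.Tensor) -> torch.Tensor:
        # All T*N spatio-temporal tokens attend to each other jointly.
        t, n, c = x.shape
        seq = x.reshape(1, t * n, c)             # one sequence of all tokens
        out = seq + self.attn(seq, seq, seq)[0]  # joint time-space attention
        return out.reshape(t, n, c)
```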
2. Related Work
Panoptic Segmentation Proposal-based approaches
employ an object detector for generating bounding box pro-
posals, which are further segmented into instances. Seman-
tic segments are usually merged with instance masks with a
post-processing step that solves overlaps and semantic class
conflicts. There are many works [23, 32, 25] that build
their networks on top of the two-stage instance segmenta-
tion framework Mask-RCNN [16], which is extended with
a semantic segmentation head. UPSNet [44] proposes a
parameter-free panoptic segmentation head supervised with
an explicit panoptic segmentation loss. Seamless Segmen-
tation [33] introduces a lightweight DeepLab-inspired seg-
mentation head [8] and achieves high panoptic quality. Mo-
tivated by the recent success of one-stage detectors, FPSNet [11] adopts RetinaNet [26] for proposal generation and achieves fast inference speed. [42] extends RetinaNet with a semantic segmentation head and a head that predicts per-pixel offsets to instance centers, which are used for clustering pixels into instances.
DenseBox [19] proposes a parameter-free mask construction method that reuses bounding box proposals discarded in the Non-Maximum Suppression step. Panoptic-DeepLab
[9] achieves state-of-the-art results on multiple benchmarks
with a bottom-up network that performs semantic segmen-
tation and instance center regression. Axial-DeepLab [39]
builds a stand-alone attention model with factorized self-attention along the height and width dimensions.
Video Panoptic Segmentation This task was recently introduced in [22], which also proposes the baseline
VPSNet network. VPSNet is built on top of the two-stage