Time-Space Transformers for Video Panoptic Segmentation
Andra Petrovai
Technical University of Cluj-Napoca
Cluj-Napoca, Romania
andra.petrovai@cs.utcluj.ro
Sergiu Nedevschi
Technical University of Cluj-Napoca
Cluj-Napoca, Romania
sergiu.nedevschi@cs.utcluj.ro
Abstract
We propose a novel solution for the task of video panoptic segmentation that simultaneously predicts pixel-level semantic and instance segmentation and generates clip-level instance tracks. Our network, named VPS-Transformer, is a hybrid architecture that combines a convolutional network for single-frame panoptic segmentation, based on the state-of-the-art Panoptic-DeepLab, with a novel video module built on an instantiation of the pure Transformer block. The Transformer, equipped with attention mechanisms, models spatio-temporal relations between backbone output features of current and past frames to produce more accurate and consistent panoptic estimates. As the pure Transformer block introduces a large computational overhead when processing high-resolution images, we propose a few design changes for more efficient computation. We study how to aggregate information more effectively over the space-time volume and compare several variants of the Transformer block with different attention schemes. Extensive experiments on the Cityscapes-VPS dataset demonstrate that our best model improves the temporal consistency and video panoptic quality by a margin of 2.2%, with little extra computation.
1. Introduction
Video panoptic segmentation [22] extends panoptic segmentation [24] to video and provides a holistic scene understanding across space and time by performing pixel-level segmentation and instance-level classification and tracking. The task has broad applicability in many real-world systems, such as robotics and automated driving, which naturally process video streams rather than single frames. Video panoptic segmentation is more challenging than its image-level counterpart, because it requires temporally consistent inter-frame predictions.
Figure 1. High-level overview of our VPS-Transformer network, which processes video frames and outputs panoptic segmentation and consistent instance identifiers. We propose a Transformer-based video module to model temporal and spatial correlations among pixels from current-frame and past-frame features. Instance tracking is performed by warping the previous panoptic prediction with optical flow and associating the instance IDs with the current instance segmentation.

Video sequences provide rich information, such as temporal cues and motion patterns, which could be exploited
for more accurate and consistent panoptic segmentation. Although we can clearly benefit from modeling the temporal correlations between frames, new challenges arise when processing video data. We can generally assume a high level of temporal consistency across consecutive frames; however, this assumption can be broken by occlusions and by new objects appearing as the scene evolves rapidly. In this context, temporal information should be used carefully, so that we avoid introducing outdated information into our predictions [29, 21]. On the other hand, extending a single-frame solution to process multiple video frames often incurs a high computational cost [22], adding overhead to both training and inference. Striking a balance between efficiency and accuracy is therefore important from a practical point of view.
Unlike panoptic image segmentation [24], which has received increased attention from the research community
[23, 9, 19], video panoptic segmentation [22] is a newly introduced task that has been less studied. Existing methods, such as VPSNet [22] and ViP-DeepLab [34], focus on improving panoptic segmentation quality, but at a high computational cost. To increase consistency among frames, VPSNet [22] designs a temporal fusion module based on optical flow, which aggregates warped features from five neighboring past and future frames. In contrast, our network is both more accurate and more efficient, even though it operates in an online fashion: only the current frame is processed, while the Transformer block reads past features from memory and attends to relevant positions in order to provide an enhanced representation. ViP-DeepLab [34] models video panoptic segmentation as concatenated image panoptic segmentation and achieves the state of the art for this task. Compared to this work, we explicitly encode spatio-temporal correlations for a boost in performance and obtain competitive results with a more lightweight network.
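To make the online setting concrete, below is a minimal sketch of the kind of feature memory implied above: only the current frame passes through the backbone, while features of a few past frames are kept in a FIFO buffer that the Transformer block can attend to. The class name, buffer size, and read/write API are our illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an online feature memory (illustrative assumptions:
# class name, a buffer of two past frames, and the read/write API).
from collections import deque

import torch


class FeatureMemory:
    """FIFO buffer holding backbone features of the most recent past frames."""

    def __init__(self, max_frames: int = 2):
        # A deque with maxlen drops the oldest entry automatically.
        self.buffer = deque(maxlen=max_frames)

    def read(self, current: torch.Tensor) -> torch.Tensor:
        """Stack past and current (C, H, W) features into a (T, C, H, W) clip."""
        return torch.stack(list(self.buffer) + [current], dim=0)

    def write(self, current: torch.Tensor) -> None:
        # Detach so stored features act as read-only temporal context.
        self.buffer.append(current.detach())
```

At each step, the network would call read to assemble the clip fed to the video module and write once the frame has been processed, so no past frame is ever re-run through the backbone.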
As shown in Figure 1, we propose a novel video panoptic segmentation approach by extending the single-frame panoptic segmentation network Panoptic-DeepLab [9] with a Transformer video module and a motion estimation decoder. Our module, inspired by the pure Transformer block [38], refines the current backbone output features by processing the sequence of spatio-temporal features from current and past frames. As we strive to develop an efficient network, we design a lightweight variant of the pure Transformer block that is faster than the original implementation [38]. We also present several ways to factorize the attention operation of the Transformer over space and time, and compare their accuracy and efficiency in extensive ablation studies. Given the enhanced feature representations from the Transformer module, three convolutional decoders recover the spatial resolution of the input image, and multiple heads perform semantic segmentation, instance center prediction, instance offset regression, and optical flow estimation. To ensure consistent instance identifiers for the same instance across frames, we implement a simple tracking module [27, 43] based on mask propagation with optical flow, followed by instance ID association between warped and predicted instance masks based on the class label and the intersection over union; a sketch of this step is given below. Our video panoptic segmentation network is designed to achieve a good trade-off between speed and accuracy: each newly introduced module is carefully designed to preserve the efficiency of the system. The proposed network can be trained in a weakly supervised regime with a sparsely annotated dataset, since it does not require labels for previous frames. We perform extensive experiments on the Cityscapes-VPS [22] dataset and demonstrate that the proposed methods improve both image-based and video-based panoptic segmentation.
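As a concrete illustration of the tracking step, the sketch below warps the previous frame's instance masks with the predicted optical flow and then matches instance IDs by class label and IoU. All function names, the instance data layout, and the 0.5 IoU threshold are assumptions made for this example; this is not the authors' implementation.

```python
# Illustrative sketch of mask propagation and ID association (not the
# authors' code): names, data layout, and the IoU threshold are assumed.
import torch
import torch.nn.functional as F


def warp_with_flow(mask: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a boolean (H, W) mask from frame k-1 to frame k.

    flow: (2, H, W) backward optical flow (x and y displacements in pixels).
    """
    h, w = mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs + flow[0], ys + flow[1]), dim=-1)
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample.
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    warped = F.grid_sample(mask[None, None].float(), grid[None],
                           mode="nearest", align_corners=True)
    return warped[0, 0] > 0.5


def associate_ids(prev_instances, curr_instances, flow, iou_thresh=0.5):
    """Assign track IDs to current instances by matching warped previous masks.

    Instances are dicts with keys 'id', 'class' and a boolean (H, W) 'mask'.
    """
    warped = [dict(p, mask=warp_with_flow(p["mask"], flow)) for p in prev_instances]
    next_id = max((p["id"] for p in prev_instances), default=0) + 1
    used = set()  # each previous ID may be claimed by at most one instance
    for inst in curr_instances:
        best_iou, best_id = 0.0, None
        for wp in warped:
            if wp["class"] != inst["class"] or wp["id"] in used:
                continue  # match only within the same semantic class
            inter = (inst["mask"] & wp["mask"]).sum().item()
            union = (inst["mask"] | wp["mask"]).sum().item()
            iou = inter / union if union > 0 else 0.0
            if iou > best_iou:
                best_iou, best_id = iou, wp["id"]
        if best_iou >= iou_thresh:
            inst["id"] = best_id
            used.add(best_id)
        else:
            inst["id"] = next_id  # unmatched instance starts a new track
            next_id += 1
    return curr_instances
```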
To summarize, our main contributions are the following:

1. We propose a novel video panoptic segmentation network with a pure Transformer-based video module that applies spatial self-attention and temporal self-attention on sequences of current and past image features for more accurate panoptic prediction.

2. We extend our panoptic segmentation network with an optical flow decoder that is used for instance mask warping in the tracking process. Instance ID association between the warped and predicted panoptic segmentations ensures temporally consistent instance identifiers across frames in a video sequence.

3. We propose a lightweight Transformer module inspired by the pure Transformer architecture, with three different designs of the attention mechanism: space self-attention, global time-space attention, and local time-space attention. We compare these variants in extensive experiments (a sketch contrasting the schemes follows this list).

4. We perform extensive experiments on the Cityscapes-VPS dataset and demonstrate that the proposed modules increase accuracy and temporal consistency without introducing significant extra computational cost.
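To make the attention schemes concrete, the following is a minimal PyTorch sketch contrasting space self-attention with global time-space attention over a short clip of backbone features. The module name, tensor shapes, and the use of torch.nn.MultiheadAttention are assumptions for illustration; the local time-space variant, which restricts keys and values to a spatial window around each query, is omitted for brevity, and the authors' lightweight block may differ in its internals.

```python
# Sketch of two of the attention schemes named in contribution 3; this is
# an illustration, not the authors' lightweight Transformer block.
import torch
import torch.nn as nn


class TimeSpaceAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4, mode: str = "global"):
        super().__init__()
        assert mode in ("space", "global")
        self.mode = mode
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, T, C, H, W) features; the last frame is the current one."""
        b, t, c, h, w = feats.shape
        tokens = feats.flatten(3).permute(0, 1, 3, 2)  # (B, T, H*W, C)
        query = tokens[:, -1]                          # current frame: (B, H*W, C)
        if self.mode == "space":
            # Space self-attention: the current frame attends only to itself.
            kv = query
        else:
            # Global time-space attention: the current frame attends to every
            # position of every frame in the clip.
            kv = tokens.reshape(b, t * h * w, c)
        out, _ = self.attn(self.norm(query), self.norm(kv), self.norm(kv))
        refined = query + out                          # residual connection
        return refined.permute(0, 2, 1).reshape(b, c, h, w)
```

Note how the global mode multiplies the number of keys per query by T, which is the kind of overhead that motivates the factorized and local designs compared in the paper.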
2. Related Work
Panoptic Segmentation Proposal-based approaches employ an object detector to generate bounding-box proposals, which are further segmented into instances. Semantic segments are usually merged with instance masks in a post-processing step that resolves overlaps and semantic class conflicts. Many works [23, 32, 25] build their networks on top of the two-stage instance segmentation framework Mask R-CNN [16], which is extended with a semantic segmentation head. UPSNet [44] proposes a parameter-free panoptic segmentation head supervised with an explicit panoptic segmentation loss. Seamless Segmentation [33] introduces a lightweight DeepLab-inspired segmentation head [8] and achieves high panoptic quality. Motivated by the recent success of one-shot detectors, FPSNet [11] adopts RetinaNet [26] for proposal generation and achieves fast inference speed. [42] extends RetinaNet with a semantic segmentation head and a pixel offset center prediction head, which is used for clustering pixels into instances. DenseBox [19] proposes a parameter-free mask construction method that reuses bounding-box proposals discarded in the non-maximum suppression step. Panoptic-DeepLab [9] achieves state-of-the-art results on multiple benchmarks with a bottom-up network that performs semantic segmentation and instance center regression. Axial-DeepLab [39] builds a stand-alone attention model with factorized self-attention along the height and width dimensions.
Video Panoptic Segmentation This task has been recently introduced in [22], which also proposes the baseline VPSNet network. VPSNet is built on top of the two-stage