[23, 9, 19], video panoptic segmentation [22] is a newly introduced task that has received comparatively little attention. Existing methods, such as VPSNet [22] and ViP-DeepLab [34], focus on improving panoptic segmentation quality, but at a high computational cost. To increase consistency among frames, VPSNet [22] designs a temporal fusion module based on optical flow that aggregates warped features from five neighboring past and future frames. In contrast, our network is more accurate and efficient, even though it operates in an online fashion: only the current frame is processed, while the Transformer block reads past features from memory and attends to relevant positions in order to provide an enhanced representation. ViP-DeepLab [34] models video panoptic segmentation as concatenated image panoptic segmentation and achieves state-of-the-art results for this task. Compared to
this work, we explicitly encode spatio-temporal correlations
for a boost in performance and obtain competitive results
with a more lightweight network.
As shown in Figure 1, we propose a novel video panop-
tic segmentation approach by extending the single-frame panoptic segmentation network Panoptic-DeepLab [9] with a Transformer video module and a motion estimation decoder. Our module, inspired by the pure Transformer block
[38], refines the current backbone output features by pro-
cessing the sequence of spatio-temporal features from cur-
rent and past frames. As we strive to develop an efficient
network, we design a lightweight variant of the pure Trans-
former block that is faster than the original implementation
[38]. We also present several ways to factorize the atten-
tion operation of the Transformer over space and time and
compare their accuracy and efficiency in extensive ablation
studies. Given the enhanced feature representations from
the Transformer module, three convolutional decoders re-
cover the spatial resolution of the input image and multi-
ple heads perform semantic segmentation, instance center
prediction, instance offset regression and optical flow es-
timation. To ensure consistent instance identifiers for the same instance across frames, we implement a simple tracking module [27, 43]: masks are propagated with optical flow, and instance IDs are associated between warped and predicted instance masks based on the class label and the intersection over union (see the sketch below). Our video panoptic segmentation network is designed to achieve a good trade-off between speed
and accuracy. Each newly introduced module is carefully
designed to preserve the efficiency of the system. The pro-
posed network can be trained in a weakly supervised regime
with a sparsely annotated dataset, since it does not require
labels for previous frames. We perform extensive experi-
ments on the Cityscapes-VPS [22] dataset and demonstrate
that the proposed methods improve both image- and video-based panoptic segmentation.
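To make the ID association step concrete, the following is a minimal sketch of matching flow-warped instance masks from the previous frame against the current predictions by class label and intersection over union. The function names, data layout and IoU threshold are illustrative assumptions, not our exact implementation.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def associate_ids(warped, predicted, iou_thresh=0.5):
    """Propagate instance IDs from frame t-1 to frame t (illustrative).

    warped:    list of (track_id, class_id, mask) from the previous frame,
               with each mask warped to frame t by the predicted optical flow
    predicted: list of (class_id, mask) segmented at frame t
    Returns a list of (track_id, class_id, mask); predictions without a
    match above iou_thresh start a new track.
    """
    next_id = 1 + max((tid for tid, _, _ in warped), default=0)
    results, used = [], set()
    for cls_p, mask_p in predicted:
        # Candidates: same-class warped instances not yet matched.
        cands = [(mask_iou(m, mask_p), tid)
                 for tid, cls_w, m in warped
                 if cls_w == cls_p and tid not in used]
        best_iou, best_tid = max(cands, default=(0.0, None))
        if best_iou >= iou_thresh:
            tid = best_tid
            used.add(tid)
        else:
            tid, next_id = next_id, next_id + 1  # unmatched: new instance ID
        results.append((tid, cls_p, mask_p))
    return results
```

Greedy matching of this form runs once per frame over a handful of instances, so the tracking overhead is negligible compared to the network forward pass.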
To summarize, our main contributions are the following:
1. We propose a novel video panoptic segmentation net-
work with a pure Transformer-based video module
that applies spatial self-attention and temporal self-
attention on sequences of current and past image fea-
tures for more accurate panoptic prediction.
2. We extend our panoptic segmentation network with
an optical flow decoder that is used for instance mask
warping in the tracking process. Instance ID association between the warped and predicted panoptic segmentations ensures temporally consistent instance identifiers across frames in a video sequence.
3. We propose a lightweight Transformer module inspired by the pure Transformer architecture with three different designs of the attention mechanism: space self-attention, global time-space attention and local time-space attention (a minimal sketch follows this list). We compare the variants in extensive experiments.
4. We perform extensive experiments on the Cityscapes-
VPS dataset and demonstrate that the proposed mod-
ules increase accuracy and temporal consistency, with-
out introducing significant extra computational cost.
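To make the attention designs of contribution 3 concrete, the sketch below illustrates how the space self-attention and global time-space attention variants can operate on a stack of current and past frame features. The tensor layout, the use of nn.MultiheadAttention and all names are illustrative assumptions rather than our exact implementation; the local time-space variant would additionally restrict each query to a spatial neighborhood.

```python
import torch
import torch.nn as nn

class AttentionVariants(nn.Module):
    """Illustrative sketch of two attention designs over video features;
    tensor layout and module choices are assumptions, not the exact code."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def space_self_attention(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, C) -- T frames, N = H*W spatial positions, C channels.
        # Each frame attends only over its own spatial positions
        # (the batch dimension here is the T frames).
        return x + self.attn(x, x, x)[0]

    def global_time_space_attention(self, x: torch.Tensor) -> torch.Tensor:
        # All T*N spatio-temporal tokens attend to each other jointly.
        t, n, c = x.shape
        seq = x.reshape(1, t * n, c)             # one sequence of all tokens
        out = seq + self.attn(seq, seq, seq)[0]  # joint time-space attention
        return out.reshape(t, n, c)
```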
2. Related Work
Panoptic Segmentation Proposal-based approaches
employ an object detector for generating bounding box pro-
posals, which are further segmented into instances. Seman-
tic segments are usually merged with instance masks with a
post-processing step that solves overlaps and semantic class
conflicts. There are many works [23, 32, 25] that build
their networks on top of the two-stage instance segmenta-
tion framework Mask-RCNN [16], which is extended with
a semantic segmentation head. UPSNet [44] proposes a
parameter-free panoptic segmentation head supervised with
an explicit panoptic segmentation loss. Seamless Segmen-
tation [33] introduces a lightweight DeepLab-inspired seg-
mentation head [8] and achieves high panoptic quality. Mo-
tivated by the recent success of one-stage detectors, FPSNet [11] adopts RetinaNet [26] for proposal generation and achieves fast inference speed. [42] extends RetinaNet with a semantic segmentation head and a head that predicts per-pixel offsets to instance centers, which are used for clustering pixels into instances.
DenseBox [19] proposes a parameter-free mask construction method that reuses bounding box proposals discarded in the Non-Maximum Suppression step. Panoptic-DeepLab
[9] achieves state-of-the-art results on multiple benchmarks
with a bottom-up network that performs semantic segmen-
tation and instance center regression. Axial-DeepLab [39]
builds a stand-alone attention model with factorized self-attention along the height and width dimensions.
Video Panoptic Segmentation This task was recently introduced in [22], which also proposes the baseline
VPSNet network. VPSNet is built on top of the two-stage