scheme is not based on instance temporal length, but on the
length of videos; lastly, the relation between spatial segmentation and temporal association is not investigated. As the two abilities most required by the task, spatial segmentation and temporal association are expected to promote each other, so that improving one should also benefit the other. Unfortunately, few works discuss the relation between them.
There are some existing error analyzing toolboxes that may partially address the above-mentioned problems. Several image-level error analyzing toolboxes [18, 16, 17] try to diagnose errors and observe how much they contribute to the performance decline, but they fail to distinguish errors distributed in the spatial and temporal dimensions. Some video-level tools [19, 20, 21] focus on diagnosing video action and relation errors rather than video objects. They pay attention to the subjective impacts brought by annotators and to task-specific object attributes (e.g., context size for actions, relation distribution, etc.), which are not applicable to VIS, let alone revealing the relation between spatial segmentation and temporal association.
Thus we introduce TIVE, a novel Toolbox for Identifying various Video instance segmentation Errors. By decomposing the general localization error into the spatial and temporal dimensions, TIVE clearly subdivides seven isolated error types and also explores model performance on different instance temporal lengths. By weighting each error's contribution to the mAP damage through individually applied fixing oracles, we can understand how these error sources relate to the overall metric, which is crucial for algorithm development and model selection in deployment. The variation of the spatial segmentation and temporal association error weights can indirectly reflect changes in model ability. Evaluating performance over instance temporal length can help the community assess models for real scenarios. Figure 1 shows the comparison between TIVE and other error analyzing toolboxes.
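As a rough illustration of this idea (a minimal sketch, not TIVE's actual implementation; the error-type names and the helper functions evaluate_map and apply_oracle are hypothetical placeholders), one can weight each error type by applying its fixing oracle in isolation and measuring how much mAP recovers:

# Sketch of oracle-based error weighting (hypothetical API, illustration only).
ERROR_TYPES = [
    "classification", "duplicate", "spatial_segmentation",
    "temporal_association", "both", "background", "missed",
]  # placeholder names, not necessarily TIVE's seven error types

def error_weights(predictions, ground_truth, evaluate_map, apply_oracle):
    """evaluate_map(preds, gts) -> mAP; apply_oracle(preds, gts, err) -> fixed copy."""
    base_map = evaluate_map(predictions, ground_truth)
    weights = {}
    for err in ERROR_TYPES:
        fixed = apply_oracle(predictions, ground_truth, err)  # fix ONLY this error type
        weights[err] = evaluate_map(fixed, ground_truth) - base_map
    return weights  # larger weight => this error costs the model more mAP

Because each oracle is applied to the raw predictions alone, the measured weight of one error type does not depend on whether the other errors have already been fixed.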
Through comprehensive analysis of several typical algorithms, the error weights reveal clear discrepancies between methods; we find that short video instances lasting fewer than 16 frames are harder to recognize for all methods. Only one of the investigated algorithms enables spatial segmentation and temporal association to benefit from each other, while the others generally excel in at most one aspect; this phenomenon may demand further exploration by the community. Thanks to its modular functional design, TIVE can easily be extended to other video object recognition tasks, e.g., MOT [12, 22], VOS [13, 14, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33] and VIP [15], whose metric calculations are highly similar to that of video instance segmentation, and the principle of TIVE can also serve as a reference for identifying errors in Video Semantic Segmentation (VSS) [34, 35] and Video Panoptic Segmentation (VPS) [36].
2. Related Work
2.1. Video instance segmentation
As an extension of the image-level instance segmentation task [37, 38, 39, 40, 41, 42, 43, 44, 45], current video instance segmentation methods can be roughly divided into online and offline paradigms, which derive from MOT, VOS and the recently emerged vision transformer techniques.
Online methods select one frame as reference and one or several other frames as query, where the ground truth labels and masks of the query frames serve as learning targets [1, 46, 47, 48, 49]. At the inference stage, they first perform frame-level instance segmentation with an object detector [50, 51, 52, 53, 54] or instance segmentor [37, 45, 55], then conduct temporal association with tracking modules [56, 57, 58], usually under manually designed rules and representation comparison. Apart from the pioneering MaskTrack R-CNN [1], later works tend to leverage more frame-level predictions, e.g., classification scores and predicted masks, to refine the results of each query frame [49], which provides rich temporal references.
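To make the online paradigm concrete, the following schematic sketch (hypothetical embeddings and threshold, not the procedure of any specific method) associates per-frame instance predictions across time by comparing instance embeddings:

import numpy as np

def associate_frames(prev_embeds, prev_ids, cur_embeds, sim_thresh=0.5):
    """Greedily match current-frame instances to existing tracks by cosine similarity.

    prev_embeds: (M, D) embeddings of already-tracked instances
    prev_ids:    list of M track ids
    cur_embeds:  (N, D) embeddings of instances detected in the current frame
    Returns a list of N track ids; unmatched instances start new tracks.
    """
    if len(prev_ids) == 0:
        return list(range(cur_embeds.shape[0]))  # every instance starts a new track

    def _norm(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    sim = _norm(cur_embeds) @ _norm(prev_embeds).T     # (N, M) cosine similarities
    next_id, assigned, used = max(prev_ids) + 1, [], set()
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))
        if sim[i, j] > sim_thresh and j not in used:
            assigned.append(prev_ids[j])               # continue an existing track
            used.add(j)
        else:
            assigned.append(next_id)                   # start a new track
            next_id += 1
    return assigned

In practice, as noted above, online methods combine such representation comparison with hand-designed rules (e.g., box overlap and class consistency) when linking instances across frames.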
Offline methods take several randomly sampled frames from a video clip as input in both training and inference, and directly predict mask sequences; labels and masks from all sampled frames serve as supervision signals. MaskProp [59] and Propose-Reduce [60] combine the mask propagation technique from VOS tasks with frame-level instance segmentation models to segment video instances in the spatial-temporal dimensions. Specifically, they use Mask R-CNN [37] to obtain frame-level instance categories and masks, then propagate them to the entire video clip. Compared to the propagation-based methods, which require a complicated processing pipeline to generate sequence results for multiple video instances, transformer-based methods [61, 62, 63, 64, 65] have recently come to dominate state-of-the-art performance. Thanks to their strong ability to capture global context, these models directly learn to segment mask sequences during training and produce sequence-level predictions in a single inference pass.
2.2. Error analyzing tools
Although previous literature provides qualitative evidence to demonstrate the superiority of one model over others, such limited visual comparisons are incomplete and subjective. Existing toolboxes that identify related vision recognition errors at the frame and video level may provide useful guidance.
Image-level toolboxes. UAP [16] tries to explain the effects of object detection errors based on cocoapi; subjective error types and fixing oracles are defined to explore the metric upper bounds. However, with its progressive weighting scheme, it fails to isolate the contribution of each error. TIDE [17] is the most recent image-level object recognition error analyzing toolbox; it clearly defines isolated errors and weights the contribution of each by individually fixing oracles, providing meaningful observations and suggestions for mainstream methods and algorithm design.
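The difference between the two weighting schemes can be illustrated with a short sketch (hypothetical helper functions, not the API of either toolbox): under a progressive scheme, each oracle is applied on top of the previous fixes, so the mAP gain attributed to an error depends on which errors were already fixed before it, and the contributions are therefore entangled rather than isolated.

# Progressive (stacked) weighting, illustration only: deltas depend on the fixing order.
def progressive_weights(preds, gts, evaluate_map, apply_oracle, error_types):
    weights, current = {}, preds
    prev_map = evaluate_map(current, gts)
    for err in error_types:
        current = apply_oracle(current, gts, err)   # keeps all earlier fixes in place
        cur_map = evaluate_map(current, gts)
        weights[err] = cur_map - prev_map           # gain attributed to this error
        prev_map = cur_map
    return weights

Reordering error_types changes the individual weights even though their sum stays the same, whereas an isolated scheme measures each error against the raw predictions and is order-independent.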
Video-level toolboxes. Few related works aim to identify video recognition errors; they focus more on 1) exploring challenging factors for object tracking based on self-established datasets whose instances exhibit no more than one challenge factor each [19], which keeps models from handling complicated data distributions; and 2) diagnosing detection errors for human actions and video relations, rather than focusing on video objects. Chen et al. [20] analyzed the subjective factors of annotations and drew some conclusions about their effects,