scheme is not based on instance temporal length, but on the
length of videos; lastly, the relation between spatial segmentation and temporal association is not investigated. As the two abilities most required by the task, spatial segmentation and temporal association are expected to promote each other, so that improving one should also benefit the other. Unfortunately, few works discuss the relation between them.
There are some existing error analyzing toolboxes that may partially address the above-mentioned problems. Several image-level error analyzing toolboxes [18, 16, 17] try to diagnose errors and observe how much they contribute to the performance decline, but they fail to distinguish errors distributed in the spatial and temporal dimensions. Some video-level tools [19, 20, 21] focus on diagnosing video action and relation errors rather than video objects. They pay attention to the subjective impacts brought by annotators and to task-specific object attributes (e.g., context size for actions, relation distribution, etc.), which are not applicable to VIS, let alone revealing the relation between spatial segmentation and temporal association.
Thus we introduce TIVE, a novel Toolbox for Identifying various Video instance segmentation Errors. By decomposing the general localization error into the spatial and temporal dimensions, TIVE clearly subdivides seven isolated error types and also explores model performance on different instance temporal lengths. By weighting each error's contribution to the mAP damage through individually applied fixing oracles, we can understand how these error sources relate to the overall metric, which is crucial for algorithm development and model selection in deployment. The variation of the spatial segmentation and temporal association error weights can indirectly reflect changes in model ability. Evaluating performance over instance temporal length can help the community assess models for real scenarios. Figure 1 shows the comparison between TIVE and other error analyzing toolboxes.
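As a rough illustration of this idea (a minimal sketch, not TIVE's actual implementation; the error-type names and the helper functions evaluate_map and apply_oracle are hypothetical placeholders), one can weight each error type by applying its fixing oracle in isolation and measuring how much mAP recovers:

# Sketch of oracle-based error weighting (hypothetical API, illustration only).
ERROR_TYPES = [
    "classification", "duplicate", "spatial_segmentation",
    "temporal_association", "both", "background", "missed",
]  # placeholder names, not necessarily TIVE's seven error types

def error_weights(predictions, ground_truth, evaluate_map, apply_oracle):
    """evaluate_map(preds, gts) -> mAP; apply_oracle(preds, gts, err) -> fixed copy."""
    base_map = evaluate_map(predictions, ground_truth)
    weights = {}
    for err in ERROR_TYPES:
        fixed = apply_oracle(predictions, ground_truth, err)  # fix ONLY this error type
        weights[err] = evaluate_map(fixed, ground_truth) - base_map
    return weights  # larger weight => this error costs the model more mAP

Because each oracle is applied to the raw predictions alone, the measured weight of one error type does not depend on whether the other errors have already been fixed.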
Through comprehensive analysis of several typical algorithms, the error weights reveal clear discrepancies between methods; we find that short video instances lasting fewer than 16 frames are harder to recognize for all methods. Only one of the investigated algorithms enables spatial segmentation and temporal association to benefit from each other, while the others generally excel in at most one aspect; this phenomenon may demand further exploration by the community. Thanks to its modular functional design, TIVE can easily be extended to other video object recognition tasks, e.g., MOT [12, 22], VOS [13, 14, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33] and VIP [15], whose metric calculations are highly similar to that of video instance segmentation, and the principle of TIVE can also serve as a reference for identifying errors in Video Semantic Segmentation (VSS) [34, 35] and Video Panoptic Segmentation (VPS) [36].
2. Related Work
2.1. Video instance segmentation
As an extension of the image-level instance segmentation task [37, 38, 39, 40, 41, 42, 43, 44, 45], current video instance segmentation methods can be roughly divided into online and offline paradigms, which derive from MOT, VOS and the recently emerged vision transformer techniques.
Online methods select one frame as reference and one or several other frames as query, where the ground truth labels and masks of the query frames serve as learning targets [1, 46, 47, 48, 49]. At the inference stage, they first perform frame-level instance segmentation with an object detector [50, 51, 52, 53, 54] or instance segmentor [37, 45, 55], then conduct temporal association with tracking modules [56, 57, 58], usually under manually designed rules and representation comparison. Apart from the pioneering MaskTrack R-CNN [1], later works tend to leverage more frame-level predictions, e.g., classification scores and predicted masks, to refine the results of each query frame [49], which provides rich temporal references.
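To make the online paradigm concrete, the following schematic sketch (hypothetical embeddings and threshold, not the procedure of any specific method) associates per-frame instance predictions across time by comparing instance embeddings:

import numpy as np

def associate_frames(prev_embeds, prev_ids, cur_embeds, sim_thresh=0.5):
    """Greedily match current-frame instances to existing tracks by cosine similarity.

    prev_embeds: (M, D) embeddings of already-tracked instances
    prev_ids:    list of M track ids
    cur_embeds:  (N, D) embeddings of instances detected in the current frame
    Returns a list of N track ids; unmatched instances start new tracks.
    """
    if len(prev_ids) == 0:
        return list(range(cur_embeds.shape[0]))  # every instance starts a new track

    def _norm(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    sim = _norm(cur_embeds) @ _norm(prev_embeds).T     # (N, M) cosine similarities
    next_id, assigned, used = max(prev_ids) + 1, [], set()
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))
        if sim[i, j] > sim_thresh and j not in used:
            assigned.append(prev_ids[j])               # continue an existing track
            used.add(j)
        else:
            assigned.append(next_id)                   # start a new track
            next_id += 1
    return assigned

In practice, as noted above, online methods combine such representation comparison with hand-designed rules (e.g., box overlap and class consistency) when linking instances across frames.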
Offline methods take several randomly sampled frames from a video clip as input in both training and inference, and directly predict mask sequences; labels and masks from all sampled frames serve as supervision signals. MaskProp [59] and Propose-Reduce [60] combine the mask propagation technique from VOS tasks with frame-level instance segmentation models to segment video instances in the spatial-temporal dimensions. Specifically, they use Mask R-CNN [37] to obtain frame-level instance categories and masks, then propagate them to the entire video clip. Compared to the propagation-based methods, which require a complicated processing pipeline to generate sequence results for multiple video instances, transformer-based methods [61, 62, 63, 64, 65] have recently come to dominate state-of-the-art performance. Thanks to their strong ability to capture global context, these models directly learn to segment mask sequences during training and produce sequence-level predictions in a single inference pass.
2.2. Error analyzing tools
Although previous literature provides qualitative evidence to demonstrate the superiority of one model over others, such limited visual comparisons are incomplete and subjective. Existing toolboxes that identify related vision recognition errors at the frame and video level may provide useful guidance.
Image-level toolboxes. UAP [16] tries to explain the effects of object detection errors based on cocoapi; subjective error types and fixing oracles are defined to explore the metric upper bounds. However, with its progressive weighting scheme, it fails to isolate the contribution of each error. TIDE [17] is the most recent image-level object recognition error analyzing toolbox; it clearly defines isolated errors and weights the contribution of each by individually fixing oracles, providing meaningful observations and suggestions for mainstream methods and algorithm design.
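The difference between the two weighting schemes can be illustrated with a short sketch (hypothetical helper functions, not the API of either toolbox): under a progressive scheme, each oracle is applied on top of the previous fixes, so the mAP gain attributed to an error depends on which errors were already fixed before it, and the contributions are therefore entangled rather than isolated.

# Progressive (stacked) weighting, illustration only: deltas depend on the fixing order.
def progressive_weights(preds, gts, evaluate_map, apply_oracle, error_types):
    weights, current = {}, preds
    prev_map = evaluate_map(current, gts)
    for err in error_types:
        current = apply_oracle(current, gts, err)   # keeps all earlier fixes in place
        cur_map = evaluate_map(current, gts)
        weights[err] = cur_map - prev_map           # gain attributed to this error
        prev_map = cur_map
    return weights

Reordering error_types changes the individual weights even though their sum stays the same, whereas an isolated scheme measures each error against the raw predictions and is order-independent.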
Video-level toolboxes. Few related works aim to identify video recognition errors; they focus more on 1) exploring challenging factors for object tracking based on self-established datasets whose instances exhibit no more than one challenge factor each [19], which keeps models from handling complicated data distributions; and 2) diagnosing detection errors for human actions and video relations, rather than focusing on video objects. Chen et al. [20] analyzed the subjective factors of annotations and drew some conclusions about their effects,