TIVE: A Toolbox for Identifying Video Instance Segmentation Errors
Wenhe Jia, Lu Yang, Zilong Jia, Wenyi Zhao, Yilin Zhou, Qing Song
Beijing University of Posts and Telecommunications, Artificial Intelligence Academy, 10th Xitucheng Road, Haidian District, Beijing 100086, China
Abstract
Since it was first proposed, the Video Instance Segmentation (VIS) task has attracted vast research focus on architecture modeling to boost performance. Though great advances have been achieved in both the online and offline paradigms, there are still insufficient means to identify model errors and distinguish discrepancies between methods, and approaches that correctly reflect models' performance in recognizing object instances of various temporal lengths remain barely available. More importantly, as the fundamental model abilities demanded by the task, spatial segmentation and temporal association are still understudied in terms of both evaluation and interaction mechanisms.
In this paper, we introduce TIVE, a Toolbox for Identifying Video instance segmentation Errors. By directly operating on output prediction files, TIVE defines isolated error types and weights each type's damage to mAP, for the purpose of distinguishing model characteristics. By decomposing localization quality over the spatial and temporal dimensions, a model's potential drawbacks in spatial segmentation and temporal association can be revealed. TIVE can also report mAP over instance temporal length for real applications. We conduct extensive experiments with the toolbox to further illustrate how spatial segmentation and temporal association affect each other. We expect the analysis provided by TIVE to give researchers more insights and to guide the community toward more meaningful explorations of video instance segmentation. The proposed toolbox is available at https://github.com/wenhe-jia/TIVE.
Keywords: Video instance segmentation, Error analysis toolbox, Fine-grained metrics
1. Introduction
As an indispensable technique in numerous real applications, e.g., video surveillance and editing and autonomous driving, Video Instance Segmentation (VIS)[1,2,3,4,5] has been emerging among various vision tasks[6,7,8,9,10,11] in recent years. Compared to image-level instance segmentation, video instance segmentors are additionally required to assign unique identities to video instances. The demand for both spatial segmentation and temporal association places VIS at the intersection of mask-level object recognition and sequence modeling, making it one of the most fundamental roles among video object recognition tasks (e.g., Multi-Object Tracking (MOT)[12], Video Object Segmentation (VOS)[13,14] and Video Instance Parsing (VIP)[15]). However, discussion of error sources, model abilities, and their interaction mechanisms with respect to the above aspects is unavailable, as is a proper evaluation scheme for recognizing video instances with different attributes.
Though new works are being proposed by leaps and bounds, the following problems still puzzle the community. Firstly, we do not know how false positive and false negative predictions relate to the overall metric. When optimizing mAP alone, we may
inevitably neglect the relative importance of different types of errors, which varies among applications, leaving discrepancies between algorithms unclear. For example, temporal association is crucial for recognizing objects that disappear or are temporarily occluded in video surveillance, while accurate spatial segmentation is required by autonomous driving systems to precisely formulate obstacle-avoidance operations. Secondly, there is no appropriate scheme to evaluate model performance over instance temporal length. For attribute analysis, performance at different temporal lengths is notably important, but the official evaluation scheme is based on the length of videos rather than on instance temporal length. Lastly, the relation between spatial segmentation and temporal association has not been investigated. As the two most required abilities, spatial segmentation and temporal association are expected to promote each other, so that improving one should leave a positive impact on the other. Unfortunately, few works discuss the relation between them.

Figure 1: Comparison between TIVE and other error-analyzing toolboxes (UAP, TIDE and TIVE). UAP[16] weights error contributions in a progressively fixing mechanism, while TIDE[17] gives objective and isolated error analysis, but neither of them can distinguish spatially from temporally mislocalized false positives.

Corresponding author. Email addresses: jiawh@bupt.edu.cn (Wenhe Jia), soeaver@bupt.edu.cn (Lu Yang), jzl@bupt.edu.cn (Zilong Jia), 2013211876@bupt.edu.cn (Wenyi Zhao), ylzhou@bupt.edu.cn (Yilin Zhou), priv@bupt.edu.cn (Qing Song)
There are some existing error-analyzing toolboxes that may partially solve the above-mentioned problems. Several image-level error-analyzing toolboxes[18,16,17] try to diagnose errors and observe how much they contribute to the performance decline, but they fail to distinguish errors distributed over the spatial and temporal dimensions. Some video-level tools[19,20,21] focus on diagnosing video action and relation errors, not video objects. They pay attention to the subjective impacts brought by annotators and to task-specific object attributes (e.g., context size for actions, relation distribution, etc.), which are not applicable to VIS, not to mention demonstrating the relation between spatial segmentation and temporal association.
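To make the spatial-temporal distinction concrete, the following minimal Python sketch shows one way a toolbox could decompose the localization quality of a video-instance prediction into a spatial term and a temporal term. The mask representation and this particular decomposition are illustrative assumptions, not necessarily what TIVE implements.

import numpy as np

def decompose_st_iou(pred_masks, gt_masks):
    """Decompose video-instance localization quality into spatial and
    temporal components.

    pred_masks, gt_masks: dict mapping frame index -> binary mask
    (H x W numpy array); a frame is absent when the instance is not
    predicted / not annotated there. Returns (spatial_iou, temporal_iou).
    """
    pred_frames, gt_frames = set(pred_masks), set(gt_masks)
    shared = pred_frames & gt_frames

    # Temporal component: how well the predicted lifespan matches the
    # ground-truth lifespan, ignoring mask shape entirely.
    temporal_iou = len(shared) / max(len(pred_frames | gt_frames), 1)

    # Spatial component: average per-frame mask IoU on the frames where
    # both the prediction and the ground truth exist.
    per_frame = []
    for f in shared:
        inter = np.logical_and(pred_masks[f], gt_masks[f]).sum()
        union = np.logical_or(pred_masks[f], gt_masks[f]).sum()
        per_frame.append(inter / union if union > 0 else 0.0)
    spatial_iou = float(np.mean(per_frame)) if per_frame else 0.0

    return spatial_iou, temporal_iou

Under such a decomposition, a prediction with high spatial IoU but low temporal IoU points at weak association, while the reverse points at weak segmentation.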
Thus we introduce TIVE, a novel Toolbox for Identifying various Video instance segmentation Errors. By decomposing the general localization error over the spatial and temporal dimensions, TIVE clearly subdivides 7 isolated error types, and it also explores model performance on different instance temporal lengths. By weighting each error type's contribution to mAP damage through individually fixing oracles, we can understand how these error sources relate to the overall metric, which is crucial for algorithm development and model selection in deployment. The variation of the spatial segmentation and temporal association error weights can laterally reflect changes in model ability. Evaluating performance over instance temporal length can help the community evaluate models for real scenarios. Figure 1 shows the comparison between TIVE and other error-analyzing toolboxes.
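As a rough illustration of the "individually fixing oracles" scheme, the sketch below repairs one error type at a time and records the resulting mAP gain as that type's weight. The error-type names and the evaluate_map / apply_fixing_oracle helpers are hypothetical placeholders for TIVE's internal routines.

ERROR_TYPES = [
    "classification", "duplicate", "spatial_localization",
    "temporal_localization", "both", "background", "missed",
]  # seven isolated error types, named here only for illustration

def weight_errors(predictions, ground_truths, evaluate_map,
                  apply_fixing_oracle):
    """Weight each error type by the mAP recovered when it alone is fixed."""
    base_map = evaluate_map(predictions, ground_truths)
    weights = {}
    for err in ERROR_TYPES:
        # Fix ONLY this error type, leaving every other prediction
        # untouched, so the measured gain is isolated from the rest.
        fixed = apply_fixing_oracle(predictions, ground_truths, err)
        weights[err] = evaluate_map(fixed, ground_truths) - base_map
    return base_map, weights

Because each oracle is applied to the original predictions rather than to an already-fixed copy, the reported weights are isolated from one another, unlike a progressive fixing scheme.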
Providing a comprehensive analysis of several typical algorithms, clear discrepancies between methods are revealed by the error weights. We find that short video instances that live for fewer than 16 frames are harder to recognize for all methods. Only one of the investigated algorithms enables spatial segmentation and temporal association to benefit from each other, while the others generally improve at most one aspect; this phenomenon may demand further exploration by the community. Due to its modularized functional design, TIVE can easily be extended to other video object recognition tasks, e.g., the MOT[12,22], VOS[13,14,23,24,25,26,27,28,29,30,31,32,33] and VIP[15] tasks, whose metric calculations have strong similarity with video instance segmentation, and the principle of TIVE is also a useful reference for identifying errors in the Video Semantic Segmentation (VSS)[34,35] and Video Panoptic Segmentation (VPS)[36] tasks.
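To make the per-length evaluation concrete, here is a minimal sketch of how ground-truth instances might be bucketed by temporal length before computing mAP. The bucket boundaries and the evaluate_map helper are illustrative assumptions, not the toolbox's exact interface.

def map_by_temporal_length(gt_instances, predictions, evaluate_map,
                           buckets=((1, 16), (16, 32), (32, 10**9))):
    """Evaluate mAP separately for instances grouped by how many frames
    they live. `evaluate_map` is a hypothetical helper that scores
    `predictions` against a subset of ground-truth instances; the bucket
    edges are illustrative (instances shorter than 16 frames are the
    hardest for all investigated methods)."""
    results = {}
    for lo, hi in buckets:
        subset = [g for g in gt_instances if lo <= g["num_frames"] < hi]
        results[(lo, hi)] = evaluate_map(predictions, subset)
    return results

Bucketing by instance lifespan, rather than by video length as the official scheme does, is what allows the per-length comparison above.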
2. Related Work
2.1. Video instance segmentation
As an extension of the image-level instance segmentation task[37,38,39,40,41,42,43,44,45], current video instance segmentation methods can be roughly divided into online and offline paradigms, which derive from MOT, VOS and newly raised vision transformer techniques.
Online methods select one frame as the reference and one or several other frames as queries, where the ground-truth labels and masks of the query frames are taken as learning targets[1,46,47,48,49]. At the inference stage, they first perform frame-level instance segmentation with an object detector[50,51,52,53,54] or instance segmentor[37,45,55], then conduct temporal association with tracking modules[56,57,58], which usually operate under manually designed rules and representation comparison. Beyond the pioneering MaskTrack R-CNN[1], later works tend to leverage more frame-level predictions, e.g., classification scores and predicted masks, to refine the results of each query frame[49], which provides rich temporal references.
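The rule-based association step can be pictured with a toy greedy matcher: each new detection is linked to the most similar existing track by embedding similarity, or starts a new identity. This is a simplified stand-in under assumed inputs, not any specific method's tracking head.

import numpy as np

def greedy_associate(track_embs, det_embs, sim_thresh=0.5):
    """Greedily match frame-level detections to existing tracks by
    cosine similarity of instance embeddings.

    track_embs: (T, D) array, one embedding per existing track
    det_embs:   (N, D) array, one embedding per current-frame detection
    Returns a list `assign` where assign[i] is the matched track index
    for detection i, or -1 to start a new track.
    """
    if len(track_embs) == 0:
        return [-1] * len(det_embs)
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    sim = d @ t.T                      # (N, T) cosine similarities
    assign, used = [], set()
    for i in range(len(det_embs)):
        j = int(np.argmax(sim[i]))
        if sim[i, j] >= sim_thresh and j not in used:
            assign.append(j); used.add(j)
        else:
            assign.append(-1)          # unmatched -> new identity
    return assign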
Offline methods take several randomly sampled frames from a video clip as input in both training and inference and directly predict mask sequences; the labels and masks from all sampled frames serve as supervision signals. MaskProp[59] and Propose-Reduce[60] combine the mask propagation technique from VOS tasks with frame-level instance segmentation models to segment video instances over the spatial-temporal dimensions. Specifically, they use Mask R-CNN[37] to get frame-level instance categories and masks, then propagate them to the entire video clip. Compared to the propagation-based methods, which have a complicated processing pipeline for generating sequence results for multiple video instances, transformer-based methods have recently dominated state-of-the-art performance[61,62,63,64,65]. Thanks to their strong ability to capture global context, these models directly learn to segment mask sequences during training and produce sequence-level predictions in a single inference pass.
2.2. Error analyzing tools
Although previous literature provides qualitative evidence to demonstrate the superiority of one model over others, such limited visual comparisons are incomplete and not objective. Existing toolboxes that identify related vision recognition errors at the frame and video levels may provide useful guidance.
Image-level toolboxes. UAP[16] tried to explain the effects of object detection errors based on the cocoapi; subjective error types and fixing oracles are defined to explore metric upper bounds. But with its progressive weighting scheme, it fails to isolate the contributions of individual errors. TIDE[17] is the most recent image-level object recognition error-analyzing toolbox; it clearly defines isolated errors and weights the contribution of each through individually fixing oracles, providing meaningful observations and suggestions for mainstream methods and algorithm design.
Video-level toolboxes. Few related works aim at identifying video recognition errors; they focus more on 1) exploring challenging factors for object tracking based on self-established datasets whose instances exhibit no more than one challenge factor each[19], thus keeping models from handling complicated data distributions; and 2) diagnosing detection errors for human actions and video relations, rather than focusing on video objects. Chen et al.[20] analyzed the subjective factors of annotations and gave some conclusions about their effects,