1st Place Solutions for the UVO Challenge 2022
Jiajun Zhang1†, Boyu Chen2†, Zhilong Ji2*, Jinfeng Bai2, Zonghai Hu1
1Beijing University of Posts and Telecommunications
2Tomorrow Advancing Life
{jiajun.zhang, zhhu}@bupt.edu.cn, {chenboyu, jizhilong, baijinfeng1}@tal.com
Abstract. This paper describes the approach we took in the challenge. We adopt the same two-stage scheme as the last champion: detection first, followed by segmentation. We train a more powerful detector and segmentor separately. In addition, we perform pseudo-label training on the test set, based on a student-teacher framework and an end-to-end Transformer-based object detector. Our method ranks first in the 2nd Unidentified Video Objects (UVO) challenge, achieving AR@100 of 46.8, 64.7 and 32.2 on the limited-data frame track, unlimited-data frame track and video track, respectively.
1 Introduction
Common instance segmentation algorithms are trained on specific datasets to
learn a fixed number of categories for specific scenarios. UVO [20] addresses a
more realistic scenario: Open-World instance segmentation. In this setting, the
algorithm is expected to detect or segment novel objects and to be capable of
incremental learning.
Compared with end-to-end instance segmentation frameworks, the two-stage
framework often achieves better performance. We tried both kinds of methods
for comparison in our initial experiments, and even the SOTA end-to-end method,
Cascade Mask-RCNN [8] with a ViTDet [12] backbone, still shows a significant
gap to the two-stage method. Therefore, we adopted the same two-stage scheme
as last year's winner. As mentioned in [7], the two-stage architecture enables us
to train the detection network and segmentation network separately on different
datasets and to use more complex models. Obviously, there is a trade-off between
accuracy and complexity.
To further improve detection performance on the test set and to exhaustively
detect unseen objects and classes, we frame this problem as Semi-Supervised
Object Detection (SSOD). Unlike previous work, Soft Teacher [22] proposed an
end-to-end training framework for semi-supervised object detection: a
student-teacher framework simultaneously improves the detector and the pseudo
labels by leveraging a student model for detection training, and a teacher model,
continuously updated from the student through an exponential moving average
(EMA) strategy, for online pseudo-labeling.
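The EMA update that keeps the teacher a smoothed copy of the student can be sketched in a few lines. This is a minimal, framework-free illustration operating on plain lists of weights; the momentum value of 0.999 is a common choice, not necessarily the one used in our training.

```python
def ema_update(teacher_weights, student_weights, momentum=0.999):
    """Blend student weights into the teacher via exponential moving average.

    teacher <- momentum * teacher + (1 - momentum) * student
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]
```

In a real training loop this update is applied to every parameter tensor of the teacher network after each student optimization step, so pseudo-labels improve gradually and stably as the student learns.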
We detail the detection model, the segmentation model and the semi-supervised
object detection method we used in the next section.
† These authors contributed equally.
arXiv:2210.09629v1 [cs.CV] 18 Oct 2022
2 Method
2.1 Detection
DETR [1] proposed a Transformer-based end-to-end object detector without
hand-designed components such as anchors and NMS, and achieves performance
comparable to Faster-RCNN [16]. Many follow-up works continued to improve the
DETR-like model, eventually making it the new SOTA for object detection. For
example, Deformable DETR [25] predicts 2D anchor points and designs a
deformable attention module that only attends to certain sampling points around
a reference point; DAB-DETR [14] further extends 2D anchor points to 4D anchor
box coordinates to represent queries and dynamically updates boxes in each
decoder layer; DN-DETR [11] introduces a denoising training method to speed up
DETR training: it feeds noise-added ground-truth labels and boxes into the
decoder and trains the model to reconstruct the original ones.
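The box-noising step of denoising training can be illustrated with a small sketch. This is a simplified approximation of the DN-DETR recipe (the function name and the noise scale of 0.4 are illustrative, not taken from the paper): box centers are jittered proportionally to the box size and widths/heights are randomly rescaled.

```python
import random

def noise_boxes(gt_boxes, box_noise_scale=0.4):
    """Jitter ground-truth (cx, cy, w, h) boxes for denoising training.

    Centers are shifted by up to half the box size times the noise scale,
    and sizes are rescaled within [1 - scale, 1 + scale].
    """
    noised = []
    for cx, cy, w, h in gt_boxes:
        dx = (random.random() * 2 - 1) * 0.5 * w * box_noise_scale
        dy = (random.random() * 2 - 1) * 0.5 * h * box_noise_scale
        sw = 1.0 + (random.random() * 2 - 1) * box_noise_scale
        sh = 1.0 + (random.random() * 2 - 1) * box_noise_scale
        noised.append((cx + dx, cy + dy, w * sw, h * sh))
    return noised
```

The decoder receives these noised boxes as extra queries and is trained to output the original boxes, which shortens the otherwise slow bipartite-matching learning process.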
DINO [23] combines the above improvements and introduces a contrastive way of
denoising training, a mixed query selection method for anchor initialization,
and a look-forward-twice scheme for box prediction. DINO achieves the best
result of 63.2 AP on COCO val2017, the current SOTA, demonstrating its powerful
performance. Hence, we adopt DINO as our detector, with a Swin-L [15] backbone.
2.2 Segmentation
ViT [6] also performs well on segmentation, e.g., in SETR [24] and
Segmenter [18]. Masked-attention Mask Transformer (Mask2Former) [3] is a new
architecture capable of addressing any image segmentation task, and it
outperforms SOTA specialized architectures on all considered tasks and datasets.
Its key component is masked attention, which extracts localized features by
constraining cross-attention within predicted mask regions. It also uses
multi-scale high-resolution features that help the model segment small objects.
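The masked-attention idea can be sketched for a single query: attention logits at positions outside the predicted mask are set to negative infinity before the softmax, so the query attends only within its foreground region. This is a heavily simplified, dependency-free illustration (real Mask2Former operates on batched multi-head attention over feature maps), and it assumes at least one position is visible.

```python
import math

def masked_attention(query, keys, values, mask):
    """Single-query masked cross-attention, simplified.

    `mask[i]` is True where position i lies inside the predicted mask;
    logits elsewhere become -inf and receive zero attention weight.
    """
    dim = len(query)
    logits = []
    for k, visible in zip(keys, mask):
        score = sum(q * x for q, x in zip(query, k)) / math.sqrt(dim)
        logits.append(score if visible else float("-inf"))
    # numerically stable softmax over the visible positions
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted sum of value vectors
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, x in enumerate(v):
            out[i] += w * x
    return out
```

Restricting attention this way localizes the extracted features, which is what lets the decoder refine each object's mask without interference from the rest of the image.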
ViT-Adapter [2] designs a powerful adapter for the plain ViT. It consists of:
1) a spatial prior module to capture spatial features from the input image; 2)
a spatial feature injector to inject spatial priors into the ViT; 3) a
multi-scale feature extractor to extract hierarchical features from the ViT.
Therefore, we adopt Mask2Former [3] as our segmentor, with a ViT-Adapter-L [2]
backbone.
2.3 Semi-Supervised Object Detection
The prevalent SSOD paradigm uses a multi-stage self-training pipeline: 1) train
a model on labeled data; 2) generate pseudo labels on unlabeled data; 3) retrain
the model on both labeled and pseudo-labeled data; 4) repeat this process if
needed. Soft Teacher [22] proposed an end-to-end semi-supervised approach, in
contrast to this multi-stage pipeline.
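The multi-stage self-training pipeline can be sketched as a short loop. Here `train` and `predict` are placeholders standing in for a real detector's fit and inference routines, and the confidence threshold of 0.5 is an illustrative choice, not a value from our experiments.

```python
def self_training(train, predict, labeled, unlabeled, rounds=2, conf_thresh=0.5):
    """Multi-stage self-training for semi-supervised detection (sketch).

    `train(data)` fits a model on (image, boxes) pairs; `predict(model, img)`
    returns scored box dicts. Both are hypothetical stand-ins.
    """
    # 1) train on labeled data only
    model = train(labeled)
    for _ in range(rounds):
        # 2) pseudo-label unlabeled images, keeping only confident boxes
        pseudo = []
        for img in unlabeled:
            boxes = [b for b in predict(model, img) if b["score"] >= conf_thresh]
            if boxes:
                pseudo.append((img, boxes))
        # 3) retrain on labeled + pseudo-labeled data; 4) repeat as needed
        model = train(labeled + pseudo)
    return model
```

An end-to-end approach such as Soft Teacher collapses these discrete rounds into a single training run, with the teacher's pseudo-labels refreshed online at every step instead of once per stage.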