1st Place Solutions for the UVO Challenge 2022
Jiajun Zhang1†, Boyu Chen2†, Zhilong Ji2*, Jinfeng Bai2, Zonghai Hu1
1Beijing University of Posts and Telecommunications
2Tomorrow Advancing Life
{jiajun.zhang, zhhu}@bupt.edu.cn, {chenboyu, jizhilong, baijinfeng1}@tal.com
Abstract. This paper describes the approach we took in the challenge. We adopt the same two-stage scheme as the last champion: detection first, followed by segmentation. We train a more powerful detector and segmentor separately. In addition, we perform pseudo-label training on the test set, based on a student-teacher framework and an end-to-end Transformer-based object detector. Our method ranks first in the 2nd Unidentified Video Objects (UVO) challenge, achieving AR@100 of 46.8, 64.7 and 32.2 on the limited-data frame track, unlimited-data frame track and video track, respectively.
1 Introduction
Common instance segmentation algorithms are trained on specific datasets to
learn a fixed number of categories for specific scenarios. UVO [20] addresses a
more realistic scenario: Open-World instance segmentation. In this setting, the
algorithm is expected to detect or segment novel objects and to be capable of
incremental learning.
Compared with end-to-end instance segmentation frameworks, the two-stage
framework often achieves better performance. We tried both kinds of methods
for comparison in our initial experiments, and even the SOTA end-to-end method,
Cascade Mask-RCNN [8] with a ViTDet [12] backbone, still shows a significant
gap to the two-stage method. Therefore, we adopted the same two-stage scheme
as last year's winner. As mentioned in [7], the two-stage architecture enables us
to train the detection network and segmentation network separately on different
datasets and to use more complex models. Obviously, there is a trade-off between
accuracy and complexity.
To further improve detection performance on the test set and to exhaustively
detect unseen objects and classes, we frame this problem as Semi-Supervised
Object Detection (SSOD). Unlike previous work, Soft Teacher [22] proposed an
end-to-end training framework for semi-supervised object detection: a
student-teacher framework simultaneously improves the detector and the pseudo
labels by leveraging a student model for detection training, and a teacher model,
continuously updated from the student through an exponential moving average
(EMA) strategy, for online pseudo-labeling.
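The EMA update that keeps the teacher a smoothed copy of the student can be sketched in a few lines. This is a minimal, framework-free illustration operating on plain lists of weights; the momentum value of 0.999 is a common choice, not necessarily the one used in our training.

```python
def ema_update(teacher_weights, student_weights, momentum=0.999):
    """Blend student weights into the teacher via exponential moving average.

    teacher <- momentum * teacher + (1 - momentum) * student
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]
```

In a real training loop this update is applied to every parameter tensor of the teacher network after each student optimization step, so pseudo-labels improve gradually and stably as the student learns.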
We detail the detection model, the segmentation model and the semi-supervised
object detection method we used in the next section.
† These authors contributed equally.
arXiv:2210.09629v1 [cs.CV] 18 Oct 2022
2 Method
2.1 Detection
DETR [1] proposed a Transformer-based end-to-end object detector without
hand-designed components such as anchors and NMS, and achieves performance
comparable to Faster-RCNN [16]. Many follow-up works continued to improve the
DETR-like model, eventually making it the new SOTA for object detection. For
example, Deformable DETR [25] predicts 2D anchor points and designs a
deformable attention module that only attends to certain sampling points around
a reference point; DAB-DETR [14] further extends 2D anchor points to 4D anchor
box coordinates to represent queries and dynamically updates boxes in each
decoder layer; DN-DETR [11] introduces a denoising training method to speed up
DETR training: it feeds noise-added ground-truth labels and boxes into the
decoder and trains the model to reconstruct the original ones.
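The box-noising step of denoising training can be illustrated with a small sketch. This is a simplified approximation of the DN-DETR recipe (the function name and the noise scale of 0.4 are illustrative, not taken from the paper): box centers are jittered proportionally to the box size and widths/heights are randomly rescaled.

```python
import random

def noise_boxes(gt_boxes, box_noise_scale=0.4):
    """Jitter ground-truth (cx, cy, w, h) boxes for denoising training.

    Centers are shifted by up to half the box size times the noise scale,
    and sizes are rescaled within [1 - scale, 1 + scale].
    """
    noised = []
    for cx, cy, w, h in gt_boxes:
        dx = (random.random() * 2 - 1) * 0.5 * w * box_noise_scale
        dy = (random.random() * 2 - 1) * 0.5 * h * box_noise_scale
        sw = 1.0 + (random.random() * 2 - 1) * box_noise_scale
        sh = 1.0 + (random.random() * 2 - 1) * box_noise_scale
        noised.append((cx + dx, cy + dy, w * sw, h * sh))
    return noised
```

The decoder receives these noised boxes as extra queries and is trained to output the original boxes, which shortens the otherwise slow bipartite-matching learning process.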
DINO [23] combines the above improvements and introduces a contrastive way of
denoising training, a mixed query selection method for anchor initialization,
and a look-forward-twice scheme for box prediction. DINO achieves the best
result of 63.2 AP on COCO val2017, the current SOTA, demonstrating its powerful
performance. Hence, we adopt DINO as our detector, with a Swin-L [15] backbone.
2.2 Segmentation
ViT [6] also performs well on segmentation, e.g., in SETR [24] and
Segmenter [18]. Masked-attention Mask Transformer (Mask2Former) [3] is a new
architecture capable of addressing any image segmentation task, and it
outperforms SOTA specialized architectures on all considered tasks and datasets.
Its key component is masked attention, which extracts localized features by
constraining cross-attention within predicted mask regions. It also uses
multi-scale high-resolution features that help the model segment small objects.
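The masked-attention idea can be sketched for a single query: attention logits at positions outside the predicted mask are set to negative infinity before the softmax, so the query attends only within its foreground region. This is a heavily simplified, dependency-free illustration (real Mask2Former operates on batched multi-head attention over feature maps), and it assumes at least one position is visible.

```python
import math

def masked_attention(query, keys, values, mask):
    """Single-query masked cross-attention, simplified.

    `mask[i]` is True where position i lies inside the predicted mask;
    logits elsewhere become -inf and receive zero attention weight.
    """
    dim = len(query)
    logits = []
    for k, visible in zip(keys, mask):
        score = sum(q * x for q, x in zip(query, k)) / math.sqrt(dim)
        logits.append(score if visible else float("-inf"))
    # numerically stable softmax over the visible positions
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted sum of value vectors
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, x in enumerate(v):
            out[i] += w * x
    return out
```

Restricting attention this way localizes the extracted features, which is what lets the decoder refine each object's mask without interference from the rest of the image.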
ViT-Adapter [2] designs a powerful adapter for the plain ViT. It consists of:
1) a spatial prior module to capture spatial features from the input image; 2)
a spatial feature injector to inject spatial priors into the ViT; 3) a
multi-scale feature extractor to extract hierarchical features from the ViT.
Therefore, we adopt Mask2Former [3] as our segmentor, with a ViT-Adapter-L [2]
backbone.
2.3 Semi-Supervised Object Detection
The prevalent SSOD paradigm uses a multi-stage self-training pipeline: 1) train
a model on labeled data; 2) generate pseudo labels on unlabeled data; 3) retrain
the model on both labeled and pseudo-labeled data; 4) repeat this process if
needed. Soft Teacher [22] proposed an end-to-end semi-supervised approach, in
contrast to this multi-stage pipeline.
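The multi-stage self-training pipeline can be sketched as a short loop. Here `train` and `predict` are placeholders standing in for a real detector's fit and inference routines, and the confidence threshold of 0.5 is an illustrative choice, not a value from our experiments.

```python
def self_training(train, predict, labeled, unlabeled, rounds=2, conf_thresh=0.5):
    """Multi-stage self-training for semi-supervised detection (sketch).

    `train(data)` fits a model on (image, boxes) pairs; `predict(model, img)`
    returns scored box dicts. Both are hypothetical stand-ins.
    """
    # 1) train on labeled data only
    model = train(labeled)
    for _ in range(rounds):
        # 2) pseudo-label unlabeled images, keeping only confident boxes
        pseudo = []
        for img in unlabeled:
            boxes = [b for b in predict(model, img) if b["score"] >= conf_thresh]
            if boxes:
                pseudo.append((img, boxes))
        # 3) retrain on labeled + pseudo-labeled data; 4) repeat as needed
        model = train(labeled + pseudo)
    return model
```

An end-to-end approach such as Soft Teacher collapses these discrete rounds into a single training run, with the teacher's pseudo-labels refreshed online at every step instead of once per stage.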