ConfMix: Unsupervised Domain Adaptation for Object Detection via Confidence-based Mixing

Giulio Mattolin1, Luca Zanella2, Elisa Ricci1,2, Yiming Wang2

1University of Trento, Trento, Italy 2Fondazione Bruno Kessler, Trento, Italy

lzanella@fbk.eu

Abstract

Unsupervised Domain Adaptation (UDA) for object detection aims to adapt a model trained on a source domain to detect instances from a new target domain for which annotations are not available. Different from traditional approaches, we propose ConfMix, the first method that introduces a sample mixing strategy based on region-level detection confidence for adaptive object detector learning. We mix the local region of the target sample that corresponds to the most confident pseudo detections with a source image, and apply an additional consistency loss term to gradually adapt towards the target data distribution. In order to robustly define a confidence score for a region, we exploit the confidence score per pseudo detection, which accounts for both the detector-dependent confidence and the bounding box uncertainty. Moreover, we propose a novel pseudo labelling scheme that progressively filters the pseudo target detections using a confidence metric that tightens from loose to strict over the course of training. We perform extensive experiments with three datasets, achieving state-of-the-art performance in two of them and approaching the supervised target model performance in the other. Code is available at https://github.com/giuliomattolin/ConfMix.

1. Introduction

Object detection is a fundamental task in computer vision which involves the classification and localisation, e.g. by bounding boxes, of objects of interest belonging to certain predefined categories. Due to its importance in many applications such as autonomous driving, video surveillance and robotic perception, object detection has received significant attention, leading to the development of several different models [15, 34, 41, 35]. However, as detectors mostly rely on deep learning, they are known to suffer from severe performance degradation when tested on images that are visually different from the ones encountered during training, due to the domain shift [4].

Figure 1. ConfMix is based on a novel sample mixing strategy which combines the source image and the target region (orange box) with the highest pseudo detection confidence.

This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 957337, and the European Commission Internal Security Fund for Police under grant agreement No. ISFP-2020-AG-PROTECT-101034216-PROTECTOR.

To address this problem, recent research has focused on devising Unsupervised Domain Adaptation (UDA) techniques for building deep models that can adapt from an annotated source dataset to a target one without tedious manual annotations [40, 11, 1, 16, 9]. The vast majority of UDA methods for detection resort to adversarial training and exploit the Gradient Reversal Layer (GRL) [11] to perform adaptation at both image level and instance level [4, 37, 55, 39, 43]. Other approaches mostly focus on robustly producing pseudo detections in order to effectively finetune the model on the target data [49, 52, 44]. In general, while several solutions have been proposed over the last few years for adapting two-stage object detectors, we argue that devising UDA approaches which can also be applied to one-stage detectors would be desirable. Indeed, one-stage detectors are more appropriate in applications such as autonomous driving that require real-time processing and high computational efficiency.

arXiv:2210.11539v1 [cs.CV] 20 Oct 2022

Concurrently, recent works in computer vision have shown the benefit of adopting sophisticated data augmentation techniques by synthesising mixed samples with target and source images in order to improve the generalisation ability of deep architectures [51, 50, 17]. These methods have been considered in the context of UDA for classification [45, 30] and semantic segmentation [10, 32, 20, 5], demonstrating some empirical advantage. However, extending these approaches to UDA for detection is far from trivial.

Inspired by these previous works, in this paper we propose ConfMix, the first mixing-based UDA approach for object detection based on the regional confidence of pseudo detections. The main idea behind ConfMix is illustrated in Figure 1. Specifically, we propose to artificially generate samples by combining the region of target images where the model is most confident with source images. We also introduce during training an associated consistency loss to enforce coherent predictions among generated images. Our intuition is that, by combining source and target images to form new mixed samples, we train our model on novel, synthetically generated images with reliable pseudo detections and with a visual appearance close to the samples of the target domain, thus improving the generalisation capabilities of the detector. Moreover, the quality of pseudo detections plays an essential role during adaptation, and is tightly related to the confidence metric. By exploiting a stricter confidence metric, e.g. enriching the detector-dependent confidence with bounding box uncertainty [6], one can obtain more reliable pseudo detections, albeit fewer of them. To mitigate this, we propose to progressively restrict the confidence metric for pseudo labelling. With a less strict confidence metric in the initial adaptation phase, we allow more pseudo detections in order to learn the representation of the target domain, while with a gradually stricter confidence metric, we aim to improve the detection accuracy with more trustworthy pseudo detections. We conduct extensive experiments on different datasets (Cityscapes [7] → FoggyCityscapes [38], Sim10K [19] → Cityscapes and KITTI [13] → Cityscapes) and we show that our approach outperforms existing algorithms in most setups.

We summarise our main contributions below:

• We introduce the first sample-mixing UDA method for object detection. Our approach, named ConfMix, mixes samples from source and target domains based on the regional confidence of target pseudo detections.

• We propose a novel Progressive Pseudo-labelling scheme that gradually restricts the confidence metric during adaptive learning, which allows for a smooth transition when learning the target representation, thus improving detection accuracy.

• ConfMix sets a new state of the art in adaptation performance, improving mean Average Precision (mAP) by +1.7% on Sim10K → Cityscapes and by +3.7% on KITTI → Cityscapes.

2. Related work

Object Detection. Current object detection models can be grouped into two main categories: one-stage and two-stage approaches. One-stage object detectors, such as YOLO [34] and FCOS [41], adopt a unified framework to obtain final results directly from the feature maps generated by a CNN backbone. These frameworks are very computationally efficient and are able to achieve near real-time speed during inference. On the other hand, two-stage object detectors, such as RCNN [15], generate predictions by first extracting region proposals and then, leveraging this information, produce classification labels and bounding box coordinates. Such models are widely adopted for their high performance but, although research has been conducted to improve detection speed [14, 35, 8], they are considerably slower than one-stage detectors.

Unsupervised Domain Adaptation. Given a labelled source domain and an unlabelled target domain, UDA aims to use the available data to produce a model that is able to generalise and perform well on the target domain. A conventional approach is to reduce the domain gap by directly minimising the distance between feature distributions using discrepancy loss functions [29, 40]. On the other hand, adversarial-based methods [11, 12, 42] employ a domain discriminator and a feature extractor that learns to produce domain-invariant feature representations by fooling the discriminator. Many works demonstrated the benefit of using pseudo labels to maximally leverage information from the target domain [28, 23, 24], eventually considering a gradual scheme for incorporating them [48]. Other works have focused on adopting sample mixing techniques, such as mixup [51] or CutMix [50], to improve generalisation. For instance, in [45, 47] domain-level mixup regularisation is applied to ensure domain invariance in the learned feature representations, while in [3, 33] the model's attention is used to re-assign the confidence of saliency-guided samples and labels. Similar ideas are implemented in previous works considering the segmentation task [10, 32, 20, 5, 31]. However, to the best of our knowledge, no previous work has exploited mixing techniques for UDA in the context of object detection.

UDA for Object Detection. In the context of object detection, UDA was recently introduced by [4], which proposed image- and instance-level alignment using two GRLs [11] on Faster R-CNN. Subsequently, several methods started to address this problem, mainly using two-stage detectors. Focusing on image level, [37] showed that strong-local alignment and weak-global alignment of the features extracted from the backbone improve adaptation, while [55], focusing on instance level, exploited RPN proposals to perform region-level alignment. To adapt the source-biased decision boundary to the target data, [2] combined adversarial training with image-to-image translation by generating interpolated samples using Cycle-GAN [54]. Other recent works have proposed applying self-training with pseudo detections to perform the adaptation. To address the risk of performance degradation caused by overfitting to noisy pseudo detections, [49] introduces an uncertainty-based fusion of pseudo detection sets generated via stochastic inference, [27] proposes self-entropy descent (SED) as a metric to search for an appropriate confidence threshold for reliable pseudo detections, while [44] uses a student-teacher framework and gradually updates the source-trained model.

Few works have addressed UDA for one-stage detectors, e.g. FCOS [25, 26, 18] or SSD [21]. In particular, the self-training procedure of [21] reduces the negative effects of inaccurate pseudo detections by performing hard negative pseudo detection mining followed by a weak negative mining strategy, where instance-level scores are computed for each detection considering all neighbouring boxes. In addition, adversarial learning is employed using GRL [11] and a discriminator with the aim of extracting discriminative background features and reducing the domain shift. Our approach is radically different, as it does not require additional architectural components in the network, but instead proposes a mixing-based data augmentation strategy to promote regularisation of the model.

3. Method

The proposed ConfMix, as illustrated in Figure 2, synthesises an image $x_M \in \mathbb{R}^{W \times H \times C}$ by mixing a source image $x_S \in \mathbb{R}^{W \times H \times C}$ and the local region of a target image $x_T \in \mathbb{R}^{W \times H \times C}$ with the most reliable pseudo detections. We first predict a set of $N_T$ pseudo detections $\tilde{y}_T = \{\tilde{y}_T^i \mid i \in [1, N_T]\}$ on the target image and compute the confidence per pseudo detection using the detector network $F(\Theta)$, which is parameterised by $\Theta$ and is originally trained only on the source data. We opt to follow a Gaussian modelling of the bounding box predictions, instead of the deterministic one, in order to improve the reliability of the detector confidence with the uncertainty of the bounding box prediction. Next, we divide the target image $x_T$ into regions of equal size and select the region with the highest average confidence of pseudo detections to mix with the source sample $x_S$, forming the mixed sample $x_M$.
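The region selection and mixing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the grid size (`grid=2`, i.e. four quadrants), the assignment of each detection to a region by its box centre, the score of 0 for empty regions, and the function names `most_confident_region` and `mix` are all hypothetical. Images are stored as NumPy arrays of shape (H, W, C), which differs from the $W \times H \times C$ ordering used in the text.

```python
import numpy as np

def most_confident_region(centres, conf, width, height, grid=2):
    """Return (row, col) of the grid cell with the highest mean confidence.

    centres: (N, 2) box centres (x, y) in pixels; conf: (N,) confidences.
    Assumption: a detection belongs to the cell containing its centre,
    and cells without detections score 0.
    """
    centres = np.asarray(centres, dtype=float).reshape(-1, 2)
    conf = np.asarray(conf, dtype=float)
    cols = np.clip((centres[:, 0] * grid / width).astype(int), 0, grid - 1)
    rows = np.clip((centres[:, 1] * grid / height).astype(int), 0, grid - 1)
    score = np.zeros((grid, grid))
    for r in range(grid):
        for c in range(grid):
            inside = (rows == r) & (cols == c)
            if inside.any():
                score[r, c] = conf[inside].mean()
    r, c = np.unravel_index(np.argmax(score), score.shape)
    return int(r), int(c)

def mix(x_s, x_t, region, grid=2):
    """Paste the selected cell of the target image onto a copy of the source."""
    h, w = x_s.shape[:2]
    r, c = region
    y0, y1 = r * h // grid, (r + 1) * h // grid
    x0, x1 = c * w // grid, (c + 1) * w // grid
    x_m = x_s.copy()
    x_m[y0:y1, x0:x1] = x_t[y0:y1, x0:x1]
    return x_m
```

For example, with a 6×4 target image and detections centred at (5, 3), (1, 1) and (5.5, 3.5), the bottom-right cell collects two detections and is selected if their mean confidence exceeds that of the other cells.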

We pass $x_T$, $x_S$ and $x_M$ to the detector $F(\Theta)$ and obtain their corresponding detections $\tilde{y}_T$, $\tilde{y}_S$ and $\tilde{y}_M$, respectively. The detector then learns to adapt to the target domain by imposing a consistency loss $\mathcal{L}_{cons}$, which promotes the similarity between $\tilde{y}_M$ and the combined detections $\tilde{y}_{S,T}$, obtained by merging the source detections $\tilde{y}_S$ and the target detections $\tilde{y}_T$ according to how the two sample images are mixed. Supervision from the source ground-truth detections $y_S$ is provided through the detector-related loss $\mathcal{L}_{det}$ in order to maintain the detector capability during adaptation.
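One plausible way to build the combined detections "according to how the two sample images are mixed" is sketched below. The text does not specify the merging rule at this point, so the rule used here is an assumption: source detections whose centre falls outside the pasted region are kept, and target detections whose centre falls inside it are added. The function name `merge_detections` and the array layout are hypothetical, and boxes that straddle the region boundary are not treated specially.

```python
import numpy as np

def merge_detections(det_s, det_t, region_box):
    """Combine source and target detections for a mixed image.

    det_s, det_t: (N, K) arrays whose first two columns are box centres (x, y).
    region_box: (x0, y0, x1, y1) pixel bounds of the pasted target region.
    Assumption: membership is decided by the box centre only.
    """
    det_s = np.asarray(det_s, dtype=float)
    det_t = np.asarray(det_t, dtype=float)
    x0, y0, x1, y1 = region_box

    def inside(det):
        return ((det[:, 0] >= x0) & (det[:, 0] < x1)
                & (det[:, 1] >= y0) & (det[:, 1] < y1))

    # Source boxes outside the region, then target boxes inside it.
    return np.concatenate([det_s[~inside(det_s)], det_t[inside(det_t)]], axis=0)
```

A consistency loss would then compare the detector's output on the mixed image with this merged set; the form of that comparison is left to the loss definitions in Sec. 3.4.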

In the following sections, we describe our proposed ConfMix in detail. We first introduce the estimation of the Gaussian-based detection confidence in Sec. 3.1, followed by the confidence-based region mixing strategy for synthesising training samples in Sec. 3.2 and the progressive pseudo labelling in Sec. 3.3. Finally, we present the training objectives with losses in Sec. 3.4.

3.1. Gaussian-based detection confidence

Conventional object detectors, such as YOLO [34], Faster R-CNN [35] and FCOS [41], compute and assign to each detection a confidence score $C_{det} \in [0, 1]$ that is often detector-dependent and is used to filter out unreliable predictions via non-maximum suppression. However, such a confidence score does not account for the reliability of the predicted bounding box $b = [b_x, b_y, b_h, b_w]$, where $[b_x, b_y]$ is the position of the bounding box on the image and $b_h$ and $b_w$ represent its height and width, respectively. As suggested in [6], by taking into consideration both the detector-dependent confidence and the confidence that is derived from the uncertainty of the bounding box prediction, one can improve the reliability of pseudo detections and reduce the number of false positives.

In order to compute the bounding box uncertainty, $b$ requires a Gaussian-based modelling. Specifically, for each element in $b$, the detector model predicts both a mean $\mu$ and a variance $\Sigma$, where the variance represents the localisation uncertainty. Thus, we can express the Gaussian-based bounding box $\hat{b}$ as:

$$\hat{b} = [\mu_{b_x}, \mu_{b_y}, \mu_{b_h}, \mu_{b_w}, \Sigma_{b_x}, \Sigma_{b_y}, \Sigma_{b_h}, \Sigma_{b_w}], \tag{1}$$

where both the means $\hat{b}_\mu = [\mu_{b_x}, \mu_{b_y}, \mu_{b_h}, \mu_{b_w}]$ and the variances $\hat{b}_\Sigma = [\Sigma_{b_x}, \Sigma_{b_y}, \Sigma_{b_h}, \Sigma_{b_w}]$ are predicted by the detector with an updated regression loss (see details in Sec. 3.4). Note that a sigmoid function $\sigma(\cdot)$ is applied to the predicted variance value to ensure its range is between 0 and 1.

As a larger variance value implies a higher uncertainty, the confidence of a bounding box is computed as:

$$C_{bbx} = 1 - \mathrm{mean}(\hat{b}_\Sigma), \tag{2}$$

where $\mathrm{mean}(\cdot)$ computes the average variance of $\hat{b}_\Sigma$. The combined confidence can thus be computed as:

$$C_{comb} = C_{det} \cdot C_{bbx}. \tag{3}$$
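As a small numerical illustration of Eqs. (2) and (3), the sketch below computes the combined confidence for a batch of detections. It assumes the detector emits four raw (pre-sigmoid) variance outputs per box, ordered as in $\hat{b}_\Sigma$; the function name `combined_confidence` and the input layout are hypothetical and not taken from the released code.

```python
import numpy as np

def combined_confidence(c_det, var_logits):
    """Combine detector confidence with bounding box confidence.

    c_det: (N,) detector-dependent confidences in [0, 1].
    var_logits: (N, 4) raw variance outputs for (b_x, b_y, b_h, b_w).
    """
    c_det = np.asarray(c_det, dtype=float)
    var_logits = np.asarray(var_logits, dtype=float)
    # The sigmoid maps each predicted variance into (0, 1).
    variances = 1.0 / (1.0 + np.exp(-var_logits))
    # Eq. (2): box confidence is one minus the mean variance.
    c_bbx = 1.0 - variances.mean(axis=1)
    # Eq. (3): product of detector and box confidence.
    return c_det * c_bbx
```

With equal detector confidence, a box whose variance outputs are all 0 (sigmoid value 0.5) is halved in combined confidence, while a box with strongly negative variance outputs keeps almost all of its detector confidence. This shows how Eq. (3) penalises poorly localised detections.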

3.2. Confidence-based Region Mixing

With the estimated confidence for each pseudo detection on the target image, we design a novel mixing strategy to synthesise new training samples with highly reliable pseudo
