ConfMix: Unsupervised Domain Adaptation for Object Detection
via Confidence-based Mixing
Giulio Mattolin1, Luca Zanella2, Elisa Ricci1,2, Yiming Wang2
1University of Trento, Trento, Italy 2Fondazione Bruno Kessler, Trento, Italy
lzanella@fbk.eu
Abstract
Unsupervised Domain Adaptation (UDA) for object de-
tection aims to adapt a model trained on a source domain
to detect instances from a new target domain for which an-
notations are not available. Different from traditional ap-
proaches, we propose ConfMix, the first method that intro-
duces a sample mixing strategy based on region-level detec-
tion confidence for adaptive object detector learning. We
mix the local region of the target sample that corresponds
to the most confident pseudo detections with a source im-
age, and apply an additional consistency loss term to grad-
ually adapt towards the target data distribution. In order to
robustly define a confidence score for a region, we exploit
the confidence score per pseudo detection that accounts for
both the detector-dependent confidence and the bounding
box uncertainty. Moreover, we propose a novel pseudo la-
belling scheme that progressively filters the pseudo target
detections using the confidence metric that varies from a
loose to strict manner along the training. We perform ex-
tensive experiments with three datasets, achieving state-of-
the-art performance in two of them and approaching the
supervised target model performance in the other. Code is
available at https://github.com/giuliomattolin/ConfMix.
1. Introduction
Object detection is a fundamental task in computer vision which
involves the classification and localisation, e.g. by bounding boxes,
of objects of interest belonging to certain predefined categories.
Due to its importance in many applications such as autonomous driving,
video surveillance and robotic perception, object detection has
received significant attention, leading to the development of several
different models [15, 34, 41, 35]. However, as detectors mostly rely
on deep learning, it is well known that they suffer from severe
performance degradation when tested on images that are visually
different from the ones encountered during training, due to the
domain shift [4].

Figure 1. ConfMix is based on a novel sample mixing strategy which
combines the source image and the target region (orange box) with the
highest pseudo detection confidence.

This work has been supported by the European Union’s Horizon 2020
research and innovation programme under grant agreement No. 957337,
and the European Commission Internal Security Fund for Police under
grant agreement No. ISFP-2020-AG-PROTECT-101034216-PROTECTOR.

To address this problem, recent research efforts have focused on
devising Unsupervised Domain Adaptation (UDA) techniques for building
deep models that can adapt from an annotated source dataset to a
target one without tedious manual annotations [40, 11, 1, 16, 9]. The
vast majority of UDA methods for detection rely on adversarial
training and on exploiting the Gradient Reversal Layer (GRL) [11] to
perform adaptation both at image-level and instance-level
[4, 37, 55, 39, 43]. Other approaches mostly focus on robustly
producing pseudo detections in order to effectively finetune the
model on the target data [49, 52, 44]. In general, while over the
last few years several solutions have been proposed in the literature
for adapting two-stage object detectors, we argue that devising UDA
approaches which can also be applied to one-stage detectors would be
desirable. Indeed, one-stage detectors are more appropriate in
applications such as autonomous driving that require real-time
processing and high computational efficiency.
arXiv:2210.11539v1 [cs.CV] 20 Oct 2022
Concurrently, recent works in computer vision have
shown the benefit of adopting sophisticated data augmenta-
tion techniques by synthesising mixed samples with target
and source images in order to improve generalisation ability
of deep architectures [51, 50, 17]. These methods have been
considered in the context of UDA for classification [45, 30]
and semantic segmentation [10, 32, 20, 5], demonstrating
some empirical advantage. However, extending these ap-
proaches to UDA for detection is far from trivial.
Inspired by these previous works, in this paper we pro-
pose ConfMix, the first mixing-based UDA approach for
object detection based on the regional confidence of pseudo
detections. The main idea behind ConfMix is illustrated in
Figure 1. Specifically, we propose to artificially generate
samples by combining the region of target images where
the model is most confident with source images. We also
introduce during training an associated consistency loss to
enforce coherent predictions among generated images. Our
intuition is that, by combining source and target images and
forming new mixed samples, we are training our model on
novel, synthetically generated sample images with reliable
pseudo detections and with visual appearance close to the
samples of target domain, thus improving the generalisa-
tion capabilities of the detector. Moreover, the quality of
pseudo detections plays an essential role during adaptation,
and is tightly related to the confidence metric. By exploit-
ing a stricter confidence metric, e.g. enriching the detector-
dependent confidence with bounding box uncertainty [6],
one can obtain more reliable pseudo detections, however
with a reduced number. To mitigate this, we propose to
progressively restrict the confidence metric for pseudo la-
belling. With a less strict confidence metric at the initial
adaptation phase, we allow more pseudo detections in or-
der to learn the representation of the target domain, while
with a gradually stricter confidence metric, we aim to im-
prove the detection accuracy with more trustworthy pseudo
detections. We conduct extensive experiments on different
dataset pairs (Cityscapes [7] → FoggyCityscapes [38], Sim10K
[19] → Cityscapes and KITTI [13] → Cityscapes) and we
show that our approach outperforms existing algorithms in
most setups.
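The loose-to-strict filtering described above can be illustrated with a minimal sketch. It is not the authors' schedule: the linear interpolation between the detector confidence and a stricter box-aware confidence, the function names, the `progress` variable and the fixed threshold of 0.5 are all assumptions made for illustration.

```python
import numpy as np

def progressive_confidence(c_det, c_bbx, progress):
    """Blend a loose metric (c_det) with a stricter one (c_det * c_bbx).

    progress is a hypothetical training-progress value in [0, 1]:
    0 = start of adaptation (loose), 1 = end of adaptation (strict).
    The linear blend is an assumption, not the schedule from the paper.
    """
    delta = float(np.clip(progress, 0.0, 1.0))
    c_strict = c_det * c_bbx  # stricter, since c_bbx lies in [0, 1]
    return (1.0 - delta) * c_det + delta * c_strict

def keep_pseudo_detections(c_det, c_bbx, progress, threshold=0.5):
    """Boolean mask of the pseudo detections kept at this training stage."""
    return progressive_confidence(c_det, c_bbx, progress) >= threshold

# Three hypothetical pseudo detections: detector and box confidences.
c_det = np.array([0.90, 0.60, 0.55])
c_bbx = np.array([0.95, 0.70, 0.95])

early = keep_pseudo_detections(c_det, c_bbx, progress=0.0)
late = keep_pseudo_detections(c_det, c_bbx, progress=1.0)
```

In this toy run all three detections pass the filter at the start of adaptation, while the detection with an uncertain box is dropped once the metric has become strict.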
We summarise our main contributions as follows:
• We introduce the first sample-mixing UDA method for object
  detection. Our approach, named ConfMix, mixes samples from source
  and target domains based on the regional confidence of target
  pseudo detections.
• We propose a novel Progressive Pseudo-labelling scheme by gradually
  restricting the confidence metric along the adaptive learning,
  which allows for a smooth transition when learning target
  representation, thus improving detection accuracy.
• ConfMix sets a new state of the art in adaptation performance,
  achieving +1.7% on Sim10K → Cityscapes and +3.7% on
  KITTI → Cityscapes in terms of mean Average Precision (mAP).
2. Related work
Object Detection. Current object detection models can be
grouped into two main categories: one-stage and two-stage
approaches. One-stage object detectors, such as YOLO [34]
and FCOS [41], adopt a unified framework to obtain final
results directly from the feature maps generated by a CNN
backbone. These frameworks are very computationally ef-
ficient and are able to achieve near real-time speed during
inference. On the other hand, two-stage object detectors,
such as RCNN [15], generate predictions by first extract-
ing region proposals and then, leveraging this information,
produce classification labels and bounding box coordinates.
Such models are widely adopted for their high performance
but, although research has been conducted to improve de-
tection speed [14, 35, 8], they are considerably slower com-
pared to one-stage detectors.
Unsupervised Domain Adaptation. Given a labelled source domain and an
unlabelled target domain, UDA aims to use the available data to
produce a model that is able to generalise and perform well on the
target domain. A conventional approach is to reduce the domain gap by
directly minimising the distance between feature distributions using
discrepancy loss functions [29, 40]. On the other hand,
adversarial-based methods [11, 12, 42] employ a domain discriminator
and a feature extractor that learns to produce domain-invariant
feature representations by fooling the discriminator. Many works
demonstrated the benefit of using pseudo labels to maximally leverage
information from the target domain [28, 23, 24], in some cases
adopting a gradual scheme for incorporating them [48]. Other works
have focused on adopting sample mixing techniques, such
as mixup [51] or CutMix [50], to improve generalisation.
For instance, in [45, 47] domain-level mixup regularisation
is applied to ensure domain invariance in the learned fea-
ture representations, while in [3, 33] the model’s attention
is used to re-assign the confidence of saliency-guided sam-
ples and labels. Similar ideas are implemented in previous
works considering the segmentation task [10, 32, 20, 5, 31].
However, to the best of our knowledge, no previous works
have been proposed to exploit mixing techniques for UDA
in the context of object detection.
UDA for Object Detection. In the context of object detec-
tion, UDA was recently introduced by [4], which proposed
image- and instance-level alignment using two GRLs [11]
on Faster R-CNN. Subsequently, several methods started to
address this problem mainly using two-stage detectors. Fo-
cusing on image-level, [37] showed that strong-local align-
ment and weak-global alignment of the features extracted
from the backbone improve adaptation, while [55], focus-
ing on instance-level, exploited RPN proposals to perform
region-level alignment. To adapt the source-biased decision
boundary to the target data, [2] combined adversarial train-
ing with image-to-image translation by generating interpo-
lated samples using Cycle-GAN [54]. Other recent works
have proposed applying self-training with pseudo detec-
tions to perform the adaptation. To address the risk of per-
formance degradation caused by overfitting noisy pseudo
detections, [49] introduces an uncertainty-based fusion of
pseudo detections sets generated via stochastic inference,
[27] proposes self-entropy descent (SED) as a metric to
search for an appropriate confidence threshold for reliable
pseudo detections, while [44] uses a student-teacher frame-
work and gradually updates the source-trained model.
Few works have addressed UDA for one-stage detectors, e.g. FCOS
[25, 26, 18] or SSD [21]. In particular, [21] adopts a self-training
procedure that reduces the negative effects of inaccurate pseudo
detections by performing hard negative pseudo detection mining
followed by a weak negative mining strategy, where instance-level
scores are computed for each detection considering all neighbouring
boxes. In addition, adversarial learning is employed using GRL [11]
and a discriminator with the aim of extracting discriminative
background features and reducing the domain shift. In contrast, our
approach is radically different, as it does not add architectural
components to the network, but proposes a mixing-based data
augmentation strategy to promote regularisation of the model.
3. Method
The proposed ConfMix, as illustrated in Figure 2, synthesises an
image $x_M \in \mathbb{R}^{W \times H \times C}$ by mixing a source
image $x_S \in \mathbb{R}^{W \times H \times C}$ and the local region
of a target image $x_T \in \mathbb{R}^{W \times H \times C}$ with the
most reliable pseudo detections. We first predict a set of $N_T$
pseudo detections
$\tilde{y}_T = \{\tilde{y}_T^i \mid i \in [1, N_T]\}$ on the target
image and compute the confidence per pseudo detection using the
detector network $F(\Theta)$, which is parameterised by $\Theta$ and
is originally trained only on the source data. We opt to follow a
Gaussian modelling of the bounding box predictions, instead of the
deterministic one, in order to improve the reliability of the
detector confidence with the uncertainty of the bounding box
prediction. Next, we divide the target image $x_T$ into regions of
equal size and select the region with the highest average confidence
of pseudo detections to mix with the source sample $x_S$, forming the
mixed sample $x_M$.
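The region selection and mixing step can be sketched as follows. This is a hedged illustration, not the released implementation: the 2×2 grid, the assignment of a detection to a region by its box centre, the score of zero for regions without detections, and the function names `most_confident_region` and `mix` are all assumptions.

```python
import numpy as np

def most_confident_region(centers, conf, height, width, grid=2):
    """Return (row, col) of the grid cell with the highest mean confidence.

    centers: (N, 2) array of (x, y) box centres in pixels.
    conf:    (N,) array of per-detection confidences.
    Cells without detections score 0 (an assumption of this sketch).
    """
    cell_h, cell_w = height / grid, width / grid
    rows = np.clip((centers[:, 1] // cell_h).astype(int), 0, grid - 1)
    cols = np.clip((centers[:, 0] // cell_w).astype(int), 0, grid - 1)
    mean_conf = np.zeros((grid, grid))
    for r in range(grid):
        for c in range(grid):
            in_cell = (rows == r) & (cols == c)
            if in_cell.any():
                mean_conf[r, c] = conf[in_cell].mean()
    r, c = np.unravel_index(np.argmax(mean_conf), mean_conf.shape)
    return int(r), int(c)

def mix(x_s, x_t, region, grid=2):
    """Paste the selected target region onto the source image.

    x_s, x_t: arrays of shape (H, W, C) with identical shapes.
    Returns the mixed image and the binary mask of the pasted region.
    """
    height, width = x_s.shape[:2]
    r, c = region
    y0, y1 = r * height // grid, (r + 1) * height // grid
    x0, x1 = c * width // grid, (c + 1) * width // grid
    mask = np.zeros((height, width, 1), dtype=x_s.dtype)
    mask[y0:y1, x0:x1] = 1
    return mask * x_t + (1 - mask) * x_s, mask

# Toy example: a black source image and a white target image.
x_s = np.zeros((4, 4, 3))
x_t = np.ones((4, 4, 3))
centers = np.array([[0.5, 0.5], [3.0, 3.0], [3.5, 2.5]])
conf = np.array([0.6, 0.9, 0.8])

region = most_confident_region(centers, conf, height=4, width=4)
x_m, mask = mix(x_s, x_t, region)
```

Here the bottom-right cell wins because its two detections have a higher mean confidence than the single detection in the top-left cell, so only that quarter of the mixed image comes from the target.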
We pass $x_T$, $x_S$ and $x_M$ to the detector $F(\Theta)$ and obtain
their corresponding detections $\tilde{y}_T$, $\tilde{y}_S$ and
$\tilde{y}_M$, respectively. The detector then learns to adapt to the
target domain by imposing a consistency loss $L_{cons}$, which
promotes the similarity between $\tilde{y}_M$ and the combined
detections $\tilde{y}_{S,T}$ obtained by merging the source
$\tilde{y}_S$ and target $\tilde{y}_T$ detections according to how
the two sample images are mixed. The supervision of source
ground-truth detections $y_S$ is achieved with the detector-related
loss $L_{det}$ in order to maintain the detector capability during
adaptation.
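One plausible way to merge the source and target detections "according to how the two sample images are mixed" is sketched below. The rule used here (keep target detections whose centre falls inside the pasted region and source detections whose centre falls outside it) is an assumption of this sketch; an actual implementation might, for instance, clip boxes at the region boundary instead. The function names and the `[bx, by, bh, bw]` row layout are likewise illustrative.

```python
import numpy as np

def centre_inside(det, region_box):
    """Boolean mask of detections whose centre lies in region_box.

    det:        (N, 4) array with rows [bx, by, bh, bw] (centre, height, width).
    region_box: (x0, y0, x1, y1) of the pasted target region, in pixels.
    """
    x0, y0, x1, y1 = region_box
    return (
        (det[:, 0] >= x0) & (det[:, 0] < x1)
        & (det[:, 1] >= y0) & (det[:, 1] < y1)
    )

def merge_detections(det_s, det_t, region_box):
    """Combine detections to match the layout of the mixed image."""
    keep_s = det_s[~centre_inside(det_s, region_box)]  # source part of the image
    keep_t = det_t[centre_inside(det_t, region_box)]   # pasted target region
    return np.concatenate([keep_s, keep_t], axis=0)

# Toy example on a 4x4 image whose bottom-right quarter comes from the target.
det_s = np.array([[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 1.0, 1.0]])
det_t = np.array([[0.5, 0.5, 1.0, 1.0], [3.2, 2.8, 1.0, 1.0]])
merged = merge_detections(det_s, det_t, region_box=(2, 2, 4, 4))
```

The merged set contains the first source detection and the second target detection, which are the ones still visible in the mixed image under the centre-based rule.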
In the following sections, we describe our proposed ConfMix in
detail: we first introduce the estimation of the Gaussian-based
detection confidence in Sec. 3.1, followed by the confidence-based
region mixing strategy for synthesising training samples in Sec. 3.2
and the progressive pseudo labelling in Sec. 3.3. Finally, we present
the training objectives with losses in Sec. 3.4.
3.1. Gaussian-based detection confidence
Conventional object detectors, such as YOLO [34], Faster R-CNN [35]
and FCOS [41], compute and assign to each detection a confidence
score $C_{det} \in [0, 1]$ that is often detector-dependent and is
used to filter out unreliable predictions via non-maximum
suppression. However, such a confidence score does not account for
the reliability of the predicted bounding box
$b = [b_x, b_y, b_h, b_w]$, where $[b_x, b_y]$ is the position of the
bounding box in the image and $b_h$ and $b_w$ represent the height
and width, respectively. As suggested in [6], by taking into
consideration both the detector-dependent confidence and the
confidence that is derived from the uncertainty of bounding box
prediction, one can improve the reliability of pseudo detections and
reduce the number of false positives.
In order to compute the bounding box uncertainty, $b$ requires a
Gaussian-based modelling. Specifically, for each element in $b$, the
detector model predicts both a mean $\mu$ and a variance $\Sigma$,
where the variance represents the localisation uncertainty. Thus, we
can express the Gaussian-based bounding box $\hat{b}$ as:
$$\hat{b} = [\mu_{b_x}, \mu_{b_y}, \mu_{b_h}, \mu_{b_w}, \Sigma_{b_x}, \Sigma_{b_y}, \Sigma_{b_h}, \Sigma_{b_w}], \tag{1}$$
where both the means
$\hat{b}_\mu = [\mu_{b_x}, \mu_{b_y}, \mu_{b_h}, \mu_{b_w}]$ and the
variances
$\hat{b}_\Sigma = [\Sigma_{b_x}, \Sigma_{b_y}, \Sigma_{b_h}, \Sigma_{b_w}]$
are predicted by the detector with an updated regression loss (see
details in Sec. 3.4). Note that a sigmoid function $\sigma(\cdot)$ is
applied to the predicted variance value to ensure its range is
between 0 and 1.
As a larger variance value implies a higher uncertainty, the
confidence of a bounding box is computed as:
$$C_{bbx} = 1 - \mathrm{mean}(\hat{b}_\Sigma), \tag{2}$$
where $\mathrm{mean}(\cdot)$ computes the average variance of
$\hat{b}_\Sigma$. The combined confidence can thus be computed as:
$$C_{comb} = C_{det} \cdot C_{bbx}. \tag{3}$$
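As a small worked illustration of Eqs. (2) and (3), the sketch below computes the box confidence and the combined confidence with NumPy. The array layout (one row of four raw variance outputs per detection) and the function names are assumptions made for this example; only the sigmoid, the mean over the four variances and the product follow the text.

```python
import numpy as np

def sigmoid(x):
    """Squash raw network outputs into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def combined_confidence(c_det, raw_var):
    """Combine detector confidence with bounding box confidence.

    c_det:   (N,) detector-dependent confidences in [0, 1].
    raw_var: (N, 4) unbounded variance outputs for (bx, by, bh, bw);
             this layout is an assumption of the sketch.
    """
    var = sigmoid(raw_var)           # variances constrained to (0, 1)
    c_bbx = 1.0 - var.mean(axis=1)   # Eq. (2): high variance -> low confidence
    return c_det * c_bbx             # Eq. (3)

# Two detections with the same detector confidence but different box uncertainty.
c_det = np.array([0.9, 0.9])
raw_var = np.array([[-4.0, -4.0, -4.0, -4.0],   # well-localised box
                    [1.0, 1.0, 1.0, 1.0]])      # uncertain box
c_comb = combined_confidence(c_det, raw_var)
```

Although both detections have the same detector confidence, the one with the larger predicted variance receives a lower combined confidence, which is the behaviour the combined metric is meant to provide.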
3.2. Confidence-based Region Mixing
With the estimated confidence for each pseudo detection
on the target image, we design a novel mixing strategy to
synthesise new training samples with highly reliable pseudo