Concurrently, recent works in computer vision have
shown the benefit of adopting sophisticated data augmenta-
tion techniques by synthesising mixed samples with target
and source images in order to improve the generalisation ability
of deep architectures [51, 50, 17]. These methods have been
considered in the context of UDA for classification [45, 30]
and semantic segmentation [10, 32, 20, 5], demonstrating
some empirical advantage. However, extending these ap-
proaches to UDA for detection is far from trivial.
Inspired by these previous works, in this paper we pro-
pose ConfMix, the first mixing-based UDA approach for
object detection based on the regional confidence of pseudo
detections. The main idea behind ConfMix is illustrated in
Figure 1. Specifically, we propose to artificially generate
samples by combining the region of a target image where
the model is most confident with a source image. We also
introduce an associated consistency loss during training to
enforce coherent predictions among generated images. Our
intuition is that, by combining source and target images and
forming new mixed samples, we are training our model on
novel, synthetically generated sample images with reliable
pseudo detections and with a visual appearance close to the
samples of the target domain, thus improving the generalisation
capabilities of the detector. Moreover, the quality of
pseudo detections plays an essential role during adaptation,
and is tightly related to the confidence metric. By exploiting
a stricter confidence metric, e.g. enriching the detector-dependent
confidence with bounding box uncertainty [6],
one can obtain more reliable pseudo detections, albeit
fewer of them. To mitigate this trade-off, we propose to
progressively restrict the confidence metric for pseudo la-
belling. With a less strict confidence metric at the initial
adaptation phase, we allow more pseudo detections in or-
der to learn the representation of the target domain, while
with a gradually stricter confidence metric, we aim to im-
prove the detection accuracy with more trustworthy pseudo
detections. We conduct extensive experiments on different
datasets (Cityscapes [7] → FoggyCityscapes [38], Sim10K
[19] → Cityscapes and KITTI [13] → Cityscapes) and we
show that our approach outperforms existing algorithms in
most setups.
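The region-based mixing described above can be illustrated with a minimal sketch. It is not the reference implementation: the 2×2 grid, the assignment of each pseudo detection to the cell containing its box centre, the use of the mean confidence per cell, and the function names `most_confident_region` and `mix_images` are all assumptions made for illustration. Handling of the labels of the mixed sample and the consistency loss are omitted.

```python
import numpy as np


def most_confident_region(boxes, scores, height, width, grid=(2, 2)):
    """Return (row, col) of the grid cell with the highest mean confidence.

    boxes:  (N, 4) array of [x1, y1, x2, y2] pseudo detections in pixels.
    scores: (N,) array of confidences in [0, 1].
    Assumption: a detection belongs to the cell containing its box centre.
    """
    rows, cols = grid
    sums = np.zeros(grid)
    counts = np.zeros(grid)
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        # Clamp indices so centres on or beyond the border stay in the grid
        r = min(max(int(cy * rows / height), 0), rows - 1)
        c = min(max(int(cx * cols / width), 0), cols - 1)
        sums[r, c] += score
        counts[r, c] += 1
    # Cells without detections keep a confidence of 0
    mean_conf = np.divide(sums, counts, out=np.zeros(grid), where=counts > 0)
    r, c = np.unravel_index(np.argmax(mean_conf), grid)
    return int(r), int(c)


def mix_images(source_img, target_img, cell, grid=(2, 2)):
    """Paste the chosen cell of the target image onto a copy of the source image.

    Both images are assumed to share the same (H, W, C) shape.
    Returns the mixed image and the boolean mask of pasted pixels.
    """
    height, width = source_img.shape[:2]
    rows, cols = grid
    r, c = cell
    y0, y1 = r * height // rows, (r + 1) * height // rows
    x0, x1 = c * width // cols, (c + 1) * width // cols
    mask = np.zeros((height, width), dtype=bool)
    mask[y0:y1, x0:x1] = True
    mixed = source_img.copy()
    mixed[mask] = target_img[mask]
    return mixed, mask
```

In this sketch the returned mask could also be used to decide which pseudo detections of the target image fall inside the pasted region, although how the labels of the mixed sample are assembled is left open here.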
We summarise our main contributions as follows:
• We introduce the first sample-mixing UDA method
for object detection. Our approach, named ConfMix,
mixes samples from source and target domains based
on the regional confidence of target pseudo detections.
• We propose a novel Progressive Pseudo-labelling
scheme that gradually restricts the confidence metric
over the course of adaptation, which allows for a smooth
transition when learning the target representation, thus improving
detection accuracy.
• ConfMix achieves new state-of-the-art adaptation
performance, with gains of +1.7% on Sim10K →
Cityscapes and +3.7% on KITTI → Cityscapes in
terms of mean Average Precision (mAP).
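The idea of progressively restricting the confidence metric can be sketched as follows. This is a hedged illustration only: the linear ramp in `blend_weight`, the product of detector confidence and box certainty as the stricter metric, the `box_conf` input, the dictionary layout of a detection and the 0.5 threshold are all assumptions chosen for clarity, not details taken from the text above.

```python
def blend_weight(step, total_steps):
    """Linear ramp from 0 (start of adaptation) to 1 (end); an assumed schedule."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def progressive_confidence(det_conf, box_conf, step, total_steps):
    """Blend a lenient confidence with a stricter one as training progresses.

    det_conf: detector confidence in [0, 1].
    box_conf: bounding-box certainty in [0, 1] (hypothetical input).
    """
    strict = det_conf * box_conf  # never larger than det_conf
    w = blend_weight(step, total_steps)
    return (1.0 - w) * det_conf + w * strict


def select_pseudo_detections(detections, step, total_steps, threshold=0.5):
    """Keep detections whose blended confidence reaches the threshold."""
    return [
        d for d in detections
        if progressive_confidence(d["det_conf"], d["box_conf"],
                                  step, total_steps) >= threshold
    ]
```

With a fixed threshold, early steps rely on the detector confidence alone and admit more pseudo detections, while later steps weight the stricter metric more heavily and admit fewer.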
2. Related work
Object Detection. Current object detection models can be
grouped into two main categories: one-stage and two-stage
approaches. One-stage object detectors, such as YOLO [34]
and FCOS [41], adopt a unified framework to obtain final
results directly from the feature maps generated by a CNN
backbone. These frameworks are very computationally ef-
ficient and are able to achieve near real-time speed during
inference. On the other hand, two-stage object detectors,
such as R-CNN [15], generate predictions by first extracting
region proposals and then, leveraging this information,
producing classification labels and bounding box coordinates.
Such models are widely adopted for their high performance
but, although research has been conducted to improve detection
speed [14, 35, 8], they remain considerably slower
than one-stage detectors.
Unsupervised Domain Adaptation. Given a labelled
source domain and an unlabelled target domain, UDA aims
to use the available data to produce a model that is able to
generalise and perform well on the target domain. A con-
ventional approach is to reduce the domain gap by directly
minimising the distance between feature distributions us-
ing discrepancy loss functions [29, 40]. On the other hand,
adversarial-based methods [11, 12, 42] employ a domain
discriminator and a feature extractor that learns to produce
domain-invariant feature representations by fooling the discriminator.
Many works have demonstrated the benefit of using
pseudo labels to make the most of the information in
the target domain [28, 23, 24], in some cases adopting a
gradual scheme for incorporating them [48]. Other works
have focused on adopting sample mixing techniques, such
as mixup [51] or CutMix [50], to improve generalisation.
For instance, in [45, 47] domain-level mixup regularisation
is applied to ensure domain invariance in the learned fea-
ture representations, while in [3, 33] the model’s attention
is used to re-assign the confidence of saliency-guided sam-
ples and labels. Similar ideas are implemented in previous
works considering the segmentation task [10, 32, 20, 5, 31].
However, to the best of our knowledge, no previous work
has exploited mixing techniques for UDA
in the context of object detection.
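For reference, the two sample-mixing operations mentioned above, mixup [51] and CutMix [50], can be sketched for the classification setting as follows. This is a simplified illustration: the mixing weight `lam` and the rectangle `box` are passed in explicitly so that the example is deterministic, whereas in practice they are sampled at random, and the function names and argument layout are assumptions.

```python
import numpy as np


def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two images and their one-hot labels with weight lam in [0, 1]."""
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y


def cutmix(x_a, y_a, x_b, y_b, box):
    """Paste the rectangle box = (y0, y1, x0, x1) of x_b into a copy of x_a.

    Labels are mixed in proportion to the area that is kept from each image.
    """
    y0, y1, x0, x1 = box
    height, width = x_a.shape[:2]
    x = x_a.copy()
    x[y0:y1, x0:x1] = x_b[y0:y1, x0:x1]
    # Fraction of the image that still comes from x_a
    lam = 1.0 - ((y1 - y0) * (x1 - x0)) / float(height * width)
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y
```

Both operations mix a single image-level label, which is one reason why carrying them over to detection, where each image has a set of boxes, is not immediate.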
UDA for Object Detection. In the context of object detec-
tion, UDA was recently introduced by [4], which proposed
image- and instance-level alignment using two gradient reversal layers (GRLs) [11]
on Faster R-CNN. Subsequently, several methods started to
address this problem, mainly using two-stage detectors. Focusing
on the image level, [37] showed that strong local alignment
and weak global alignment of the features extracted
from the backbone improve adaptation, while [55], focusing
on the instance level, exploited RPN proposals to perform