schematic diagram of WUDA is shown in Figure 1. For this newly defined task, a framework is needed that tackles transfer learning and weakly supervised segmentation at the same time. Solving this task would reduce the requirement for fine source domain labels in future UDA work.
In summary, this paper makes the following contributions:
• We define a novel task: unsupervised domain adapta-
tion based on weak source domain labels (WUDA). For
this task, we propose two intuitive frameworks: Weakly
Supervised Semantic Segmentation + Unsupervised Do-
main Adaptation (WSSS-UDA) and Target Domain Ob-
ject Detection + Weakly Supervised Semantic Segmen-
tation (TDOD-WSSS).
• We benchmark typical weakly supervised semantic segmentation, unsupervised domain adaptation, and object detection techniques under our two proposed frameworks, and find that framework WSSS-UDA reaches 83% of the performance of a UDA method trained with fine source domain labels.
• We construct a series of datasets with different domain shifts. To the best of our knowledge, we are the first to use representation shift to measure domain shift in urban landscape datasets. The constructed datasets will be made openly available for research on WUDA/UDA under multiple domain shifts.
• To further analyze the impact of different degrees of do-
main shift on our proposed frameworks, we conduct ex-
tended experiments using our constructed datasets and
find that framework TDOD-WSSS is more sensitive to
changes in domain shift than framework WSSS-UDA.
Related Work
WUDA involves weakly supervised semantic segmentation, unsupervised domain adaptation, object detection, and domain shift measurement techniques. In this section, we review these related previous works.
Weakly Supervised Semantic Segmentation
In computer vision tasks, pixel-wise mask annotation takes far more time than weak annotation (Lin et al. 2014), and the need for time savings motivates weakly supervised semantic segmentation. Labels for weakly supervised segmentation can be bounding boxes, points, scribbles, and image-level tags. Methods using bounding boxes as supervision (Dai, He, and Sun 2015; Khoreva et al. 2017; Li, Arnab, and Torr 2018; Song et al. 2019; Kulharia et al. 2020) usually employ GrabCut (Rother, Kolmogorov, and Blake 2004) or segment-proposal techniques to obtain more accurate semantic labels, and can achieve results close to those of fully supervised methods (95% of their performance, or even higher). Point-supervised
and scribble-supervised methods (Bearman et al. 2016; Qian et al. 2019; Lin et al. 2016; Vernaza and Chandraker 2017; Tang et al. 2018a,b) exploit the location and category information in the annotations and achieve excellent segmentation results. Tag-supervised methods (Jiang et al. 2019; Wang et al. 2020b; Lee et al. 2021b; Li et al. 2021b) often use the class activation mapping (CAM) algorithm (Zhou et al. 2016) to obtain localization maps of the main objects in the images.
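The CAM algorithm computes a localization map as a weighted sum of the last convolutional feature maps, using the classifier weights of the target class. A minimal NumPy sketch (array shapes, ReLU, and the final normalization are illustrative assumptions, not any cited paper's exact implementation):

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Compute a CAM localization map in the spirit of Zhou et al. (2016).

    features:   (C, H, W) feature maps from the last convolutional layer
    fc_weights: (num_classes, C) weights of the final linear classifier
    class_idx:  index of the class to localize
    """
    # Weighted sum of the feature maps, using the class's classifier weights.
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)          # keep only positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1]
    return cam

# Toy example: 4 feature channels, an 8x8 spatial map, 3 classes.
rng = np.random.default_rng(0)
feats = rng.random((4, 8, 8))
weights = rng.random((3, 4))
cam = class_activation_map(feats, weights, class_idx=1)
print(cam.shape)  # (8, 8)
```

Thresholding such a map yields the coarse object regions that tag-supervised methods refine into segmentation masks.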
Unsupervised Domain Adaptation for Semantic
Segmentation
Unsupervised Domain Adaptation (UDA) addresses the poor model generalization caused by inconsistent data distributions between the source and target domains. Self-training (ST) and adversarial training (AT) are the key UDA schemes. Self-training methods (Zou et al. 2018, 2019; Lian et al. 2019; Li et al. 2020; Lv et al. 2020; Melas-Kyriazi and Manrai 2021; Tranheden et al. 2021; Guo et al. 2021; Araslanov and Roth 2021) typically set a confidence threshold to filter high-confidence predictions on the target domain into pseudo-labels, which then supervise target-domain training. Adversarial training methods (Tsai et al. 2018; Luo et al. 2019b,a; Du et al. 2019; Vu et al. 2019; Tsai et al. 2019; Yang et al. 2020b; Wang et al. 2020a; Li et al. 2021a) usually add a domain discriminator to the model; the adversarial game between the segmenter and the discriminator drives the segmentation outputs of the source and target domains toward consistency. Other works (Zhang et al. 2019; Pan et al. 2020; Wang et al. 2020c; Yu et al. 2021; Wang, Peng, and Zhang 2021; Mei et al. 2020) combine self-training and adversarial training to achieve strong segmentation results on the target domain.
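To make the self-training scheme concrete, the pseudo-label filtering step can be sketched as follows (the threshold value and ignore index are illustrative choices, not taken from any cited method):

```python
import numpy as np

def make_pseudo_labels(probs, threshold=0.9, ignore_index=255):
    """Turn target-domain softmax outputs into filtered pseudo-labels.

    probs: (num_classes, H, W) per-pixel class probabilities.
    Pixels whose top-class confidence falls below `threshold` are set to
    `ignore_index` so they are excluded from the training loss.
    """
    conf = probs.max(axis=0)       # per-pixel top-class confidence
    labels = probs.argmax(axis=0)  # per-pixel predicted class
    labels[conf < threshold] = ignore_index
    return labels

# Toy 2-class, 2x2 example: only the confident top-left pixel survives.
probs = np.array([[[0.95, 0.60],
                   [0.20, 0.50]],
                  [[0.05, 0.40],
                   [0.80, 0.50]]])
print(make_pseudo_labels(probs))  # [[  0 255]
                                  #  [255 255]]
```

The surviving pseudo-labels then replace ground-truth annotations in the target-domain segmentation loss.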
Object Detection
Autonomous driving technology has greatly promoted the development of object detection. Many pioneering works are widely used across object detection tasks, such as two-stage methods (Girshick et al. 2014; Girshick 2015; Ren et al. 2015) that first extract object proposals and then classify them. The YOLO series of algorithms (Redmon et al. 2016; Redmon and Farhadi 2017, 2018; Bochkovskiy, Wang, and Liao 2020; Jocher et al. 2020) performs object extraction and classification simultaneously in a single network. The currently popular detector YOLOv5 (Jocher et al. 2020) achieves 72.7% mean average precision (mAP) on the COCO 2017 validation set. Object detection techniques also help extract bounding boxes for weakly supervised segmentation methods (Lan et al. 2021; Lee et al. 2021a).
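Detection metrics such as mAP are built on the intersection-over-union (IoU) between predicted and ground-truth boxes. A minimal sketch (the `(x1, y1, x2, y2)` corner format is an assumption for illustration):

```python
def box_iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.14285714285714285 (= 1/7)
```

mAP averages precision over recall levels, classes, and (for COCO) a range of IoU thresholds applied to this quantity.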
Domain Shift Assessment
Domain shift arises from differences between the source and target domain data. Various factors (e.g., image content, view angle, and image texture) contribute to domain shift; for Convolutional Neural Networks (CNNs), texture is the most critical one. Many studies (Geirhos et al. 2018; Nam et al. 2019) suggest that CNNs and the human eye focus on different aspects when processing images: the human eye is sensitive to the content information of an image (e.g., shapes), while CNNs are more sensitive to its style (e.g., texture). In practice, most methods that compute image texture differences are based on the output features of intermediate layers of a neural network. For example,