paper, we focus on the object detection problem in foggy
weather. Huang et al. [22] propose DSNet (Dual-Subnet Network), which consists of a detection subnet and a restoration subnet. The network can be trained with multi-task learning that combines the visibility enhancement and object detection tasks, and thus outperforms pure object detectors. Hahner et al. [17] develop a fog simulation approach to augment existing real lidar datasets, and show that it can be leveraged to improve current object detection methods in foggy weather. Qian et al. [35] propose MVDNet (Multi-modal Vehicle Detection Network), which obtains proposals from lidar and radar signals and then fuses the region-wise features from the two sensors to produce the final detection results. Bijelic et al. [2] develop a network that takes data from four sensors as input: lidar, RGB camera, gated camera, and radar. This architecture uses entropy-steered adaptive deep fusion to obtain fused feature maps for prediction. These methods typically rely on input from sensors beyond the RGB camera itself, which many autonomous vehicles are not equipped with. In this work, we therefore aim to develop an object detection architecture that takes only RGB camera data as input.
2.3. Domain Adaptation for Object Detection
Domain adaptation reduces the discrepancy between different domains, thus allowing a model trained on a labeled source domain to be applied to an unlabeled target domain. Previous domain adaptation works mainly focus on image classification [46–48, 56], while more and more methods have been proposed in recent years to solve domain adaptation for object detection [5, 15, 24, 39, 49, 50, 55, 58, 60].
Domain adaptive detectors can be obtained by aligning the features from different domains [5, 15, 18, 39, 49, 52]. From this perspective, Chen et al. [5] introduce the Domain Adaptive Faster R-CNN framework, which reduces the domain gap at both the image level and the instance level, and further employs image-and-instance consistency to improve cross-domain robustness. He et al. [18] propose MAF (Multi-Adversarial Faster R-CNN), which minimizes the domain distribution disparity by hierarchically aligning domain features and proposal features. On the other hand, some works address domain adaptation through image style transfer [21, 24, 41]. Shan et al. [41] first convert images from the source domain to the target domain with an image translation module, and then train the object detector on the target domain with adversarial training. Hsu et al. [21] translate images progressively and add a weighted task loss during the adversarial training stage to tackle the problem of image quality difference. Many previous methods [4, 27, 38, 62] design complex architectures: [62] uses a multi-scale Feature Pyramid Network backbone and considers both pixel-level and category-level adaptation; [27] uses graph convolutional networks and graph matching algorithms; [38] uses similarity-based clustering and grouping; and [4] uses an uncertainty-guided self-training mechanism (Probabilistic Teacher with Focal Loss) that captures the uncertainty of unlabeled target data from a gradually evolving teacher to guide the learning of the student. In contrast, our method adds no extra learnable parameters to the original Faster R-CNN model, because our AdvGRL is based on adversarial training via gradient reversal (a minimal sketch of the underlying gradient reversal operation is given below) and our domain-level metric regularization is based on a triplet loss. Previous domain adaptation methods usually treat all training samples as equally challenging, whereas we employ AdvGRL for adversarial hard example mining to improve transfer learning. Moreover, we generate an auxiliary domain and apply the domain-level metric regularization to avoid overfitting.
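For concreteness, the following is a minimal PyTorch sketch of the standard gradient reversal operation that AdvGRL builds on; it serves only as background for Sec. 3, and the adversarial per-example scaling that distinguishes AdvGRL is detailed later, not shown here.

```python
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; scales gradients by -lambda in the
    backward pass, so the shared features are trained to fool the
    domain classifier."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed (negated and scaled) gradient w.r.t. x; None for lambd.
        return grad_output.neg() * ctx.lambd, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```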
3. Proposed Method
In this section, we first introduce the overall network architecture, then describe the image-level and object-level adaptation modules, and finally present the details of AdvGRL and the domain-level metric regularization.
3.1. Network Architecture
As illustrated in Fig. 2, our proposed model adopts the Faster R-CNN pipeline for object detection. The Convolutional Neural Network (CNN) backbone extracts image-level features from the RGB images and sends them to the Region Proposal Network (RPN) to generate object proposals. Afterwards, ROI pooling takes both the image-level features and the object proposals as input to retrieve the object-level features. Finally, a detection head is applied to the object-level features to produce the final predictions. On top of this Faster R-CNN framework, we integrate two additional components: an image-level domain adaptation module and an object-level domain adaptation module. For both modules, we deploy a new Adversarial Gradient Reversal Layer (AdvGRL) together with a domain classifier to extract domain-invariant features and perform adversarial hard example mining. Moreover, we introduce an auxiliary domain to impose a new domain-level metric regularization that constrains the feature metric distances between domains. All three domains, i.e., the source, target, and auxiliary domains, are employed simultaneously during training. A minimal sketch of this wiring is given below.
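To make the data flow concrete, here is an illustrative PyTorch-style sketch of how the two domain branches attach to the Faster R-CNN pipeline; all module names (backbone, rpn, roi_pool, det_head, and the two domain classifiers) are placeholders whose definitions are assumed, not taken from our released code.

```python
import torch.nn as nn

class DAFasterRCNN(nn.Module):
    """Illustrative wiring: Faster R-CNN with image-level and object-level
    domain classifiers attached through gradient reversal
    (grad_reverse, from the earlier sketch)."""

    def __init__(self, backbone, rpn, roi_pool, det_head,
                 img_domain_clf, obj_domain_clf):
        super().__init__()
        self.backbone = backbone              # CNN feature extractor
        self.rpn = rpn                        # Region Proposal Network
        self.roi_pool = roi_pool              # ROI pooling
        self.det_head = det_head              # classification + box regression
        self.img_domain_clf = img_domain_clf  # image-level domain classifier
        self.obj_domain_clf = obj_domain_clf  # object-level domain classifier

    def forward(self, images, lambd=1.0):
        feats = self.backbone(images)                # image-level features
        proposals = self.rpn(feats)                  # object proposals
        obj_feats = self.roi_pool(feats, proposals)  # object-level features
        detections = self.det_head(obj_feats)        # final predictions

        # Domain branches: reversed gradients push the shared features
        # toward domain invariance while the classifiers stay discriminative.
        img_dom = self.img_domain_clf(grad_reverse(feats, lambd))
        obj_dom = self.obj_domain_clf(grad_reverse(obj_feats, lambd))
        return detections, img_dom, obj_dom
```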
3.2. Image-level Adaptation
The image-level domain representation is obtained from the backbone feature extraction and contains rich global information such as style, scale, and illumination, which can have a significant impact on the detection task [5]. Therefore, a domain classifier is introduced to predict the domain of the incoming image-level features, enhancing the image-level global alignment. The domain classifier is a simple CNN with two convolutional layers that outputs a prediction identifying the feature domain.
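As a concrete illustration, a minimal sketch of such a two-layer classifier follows; the channel widths and 1×1 kernels are our assumptions, since only the two-convolution structure is specified above.

```python
import torch.nn as nn

class ImageDomainClassifier(nn.Module):
    """Sketch of the two-convolution image-level domain classifier.
    Channel widths and kernel sizes are illustrative assumptions."""

    def __init__(self, in_channels=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),  # per-location domain logit
        )

    def forward(self, feats):
        # feats: backbone features, already passed through AdvGRL
        return self.net(feats)
```

We use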