This shared backbone promotes generalization across domains, prevents feature over-alignment, and relaxes the thermal dataset size requirement. For feature alignment, we train the target-specific attention with adversarial learning to attend to, and thus transfer, the more domain-invariant and transferable features among all shared features, alleviating negative transfer. The main contributions of our work are as follows:
• We establish an unsupervised RGB-to-thermal domain adaptation method using a multi-domain attention network and adversarial attention learning.
• We evaluate our method on thermal image classification tasks and outperform the state-of-the-art RGB-to-thermal adaptation approach on two benchmarks.
• We demonstrate the versatility of our approach by leveraging it to perform thermal river scene segmentation and, to the best of our knowledge, are the first to utilize synthetic RGB data for thermal semantic segmentation.
II. RELATED WORK
Unsupervised Domain Adaptation: UDA has been successfully applied to a variety of vision tasks, including image classification [19], [20], [21], [22], [23], semantic segmentation [15], [14], [24], and 2D/3D object detection [25], [26].
Domain alignment is the fundamental principle of UDA and can be achieved by two main methodologies: domain mapping and domain-invariant feature learning [13]. Domain mapping can be viewed as pixel-level alignment, which maps images from one domain to another via image translation. For instance, PixelDA [27] and CyCADA [15] map source training data into the target domain using conditional GANs and train the downstream model on the fake target data. Pixel-level alignment can remove the domain differences in the input space to some extent, but such differences are primarily low-level [13]. Other works achieve domain adaptation by domain-invariant feature learning, i.e., feature-level alignment. By mapping source and target inputs to the same feature distribution, a downstream predictor trained on such domain-invariant source features can also work well on the target domain. This is typically done by minimizing a distance defined on feature distributions [21], or by adversarial training with a domain discriminator that attempts to distinguish between source and target features [19], [20], [22], [14], [23]. Our method is similar to these works and can be viewed as an instance of the general pipeline of [20], leveraging a multi-domain network and attention mechanisms.
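To make the adversarial feature-level alignment concrete, below is a minimal sketch of the two alternating losses used in discriminator-based UDA such as [19], [20]; the network sizes and module names are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn

feat_dim = 256  # assumed feature dimension
# Domain discriminator: predicts whether a feature came from the source domain.
discriminator = nn.Sequential(
    nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(f_src, f_tgt):
    # Train the discriminator to separate source (label 1) from target (label 0).
    logits = torch.cat([discriminator(f_src), discriminator(f_tgt)])
    labels = torch.cat([torch.ones(len(f_src), 1), torch.zeros(len(f_tgt), 1)])
    return bce(logits, labels)

def alignment_loss(f_tgt):
    # Train the target feature extractor to fool the discriminator, i.e.,
    # make target features look source-like (inverted labels, as in [20]).
    return bce(discriminator(f_tgt), torch.ones(len(f_tgt), 1))
```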
RGB-to-Thermal UDA: Despite the success of UDA on visible images, adapting models from the visible to the thermal modality remains challenging due to the larger domain gap. Existing RGB-to-thermal adaptation works such as MS-UDA [9] and HeatNet [10] distill knowledge from a semantic segmentation network pretrained on RGB datasets into their two-stream networks by pseudo-labeling RGB-thermal image pairs. However, since the pseudo-labels are generated for the RGB image of each pair, the main domain gap here is intra-modal, between the pretraining dataset and the RGB images of the paired dataset, rather than inter-modal.
Our work is most closely related to SGADA [23] and Marnissi et al. [26], which aim to transfer knowledge from RGB to thermal without requiring thermal annotations or RGB-thermal pairs. For pedestrian detection, Marnissi et al. [26] incorporate alignment at different levels into Faster R-CNN [28] using adversarial training. SGADA [23] is built upon ADDA [20] with an additional self-training procedure. For pseudo-labeling, it considers not only the predictions and confidences of the model but also those of the domain discriminator. It achieves the best results on the MS-COCO [2] to FLIR ADAS [3] adaptation benchmark; however, its performance largely depends on the quality of the pseudo-labels generated by ADDA.
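To illustrate this selection rule, here is a hedged sketch of SGADA-style pseudo-label filtering, where a target sample is kept only if both the classifier and the domain discriminator are confident; the thresholds and tensor shapes are assumptions for illustration, not values from [23].

```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(cls_logits, disc_logits,
                         cls_thresh=0.9, disc_thresh=0.8):
    """Keep target samples for self-training only when both the classifier
    and the domain discriminator are confident (thresholds assumed).

    cls_logits:  (N, num_classes) classifier outputs on target images.
    disc_logits: (N, 1) discriminator logits ('source-likeness').
    """
    cls_conf, pseudo_labels = F.softmax(cls_logits, dim=1).max(dim=1)
    disc_conf = torch.sigmoid(disc_logits).squeeze(1)
    # A target feature the discriminator judges source-like with high
    # confidence is considered well aligned, so its pseudo-label is trusted.
    keep = (cls_conf > cls_thresh) & (disc_conf > disc_thresh)
    return pseudo_labels[keep], keep
```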
Attention Networks: Attention mechanisms allow models to dynamically attend to the parts of the input that are most useful for a task, and have become an important component of neural networks. Attention can be grouped into different types, including sequence attention, channel attention [29], and spatial attention [30]. For domain adaptation, Wang et al. [17] and Zhang et al. [18] propose transferable attention networks that use self-attention mechanisms to highlight transferable features. The spatial attention they employ attends to different regions of a feature map. Instead, we use channel-wise attention [29] to attend to different feature maps and residual adapters [31] to align them, with the intuition that certain types of features are more transferable than others. For cross-modal domains, differences in transferability across feature types (i.e., channels) deserve more focus than differences across feature regions (i.e., spatial locations).
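For concreteness, below is a minimal sketch of the Squeeze-and-Excitation (SE) channel attention [29] that we build on; the reduction ratio of 16 is the common default, assumed here rather than taken from our configuration.

```python
import torch.nn as nn

class SEAttention(nn.Module):
    """Channel-wise (Squeeze-and-Excitation) attention [29]: re-weights
    whole feature maps (channels) rather than spatial locations."""
    def __init__(self, channels, reduction=16):  # reduction ratio assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                  # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))             # squeeze: global average pooling
        w = self.fc(w)                     # excite: per-channel weights in (0, 1)
        return x * w.view(*w.shape, 1, 1)  # scale each feature map
```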
III. PROPOSED METHOD
A. Multi-Domain Attention Network
Our multi-domain attention network design draws ideas from multi-domain learning [31] and the task attention mechanisms of multi-task learning [32]. Both works use a shared backbone network and domain/task-specific parameters to separate a shared representation, learned from all domains/tasks, from domain/task-specific modeling capabilities. It has been shown that sharing weights across domains/tasks promotes generalization. In contrast to [31], [32], which encourage disentanglement in a supervised setup, we use domain-specific attention with adversarial learning to facilitate domain-invariant feature extraction and alignment for domain adaptation.
Our multi-domain attention network consists of an encoder-decoder backbone, shared by both the source and target domains, with domain-specific attention modules attached at various stages of the encoder. For UDA classification (Fig. 2), the architecture consists of the shared backbone and classifier (blue), source-specific (green), and target-specific (red) attention modules. Hypothesizing that different sensor modalities favor different types of features, we use channel-wise attention, i.e., Squeeze-and-Excitation (SE) [29], to highlight the more domain-invariant and easily transferable feature maps among all shared features, and residual adapters [31] to align them across domains.
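A hedged sketch of one encoder stage with this design is given below, reusing the SEAttention module sketched earlier; the exact placement of the per-domain attention and the 1x1-convolution residual adapter [31] within the backbone is an assumption for illustration.

```python
import torch.nn as nn

class DomainAttentionStage(nn.Module):
    """One encoder stage: a shared convolutional block plus per-domain
    channel attention and a residual adapter [31] (illustrative sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.shared = nn.Sequential(      # weights shared by both domains
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU())
        domains = ("source", "target")
        self.attn = nn.ModuleDict(        # domain-specific SE attention
            {d: SEAttention(out_ch) for d in domains})
        self.adapter = nn.ModuleDict(     # domain-specific 1x1 adapters
            {d: nn.Conv2d(out_ch, out_ch, 1) for d in domains})

    def forward(self, x, domain):
        f = self.shared(x)
        # Highlight the more transferable feature maps for this domain,
        # then add a residual adaptation of the shared features.
        return self.attn[domain](f) + self.adapter[domain](f)
```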