Unsupervised RGB-to-Thermal Domain Adaptation via Multi-Domain
Attention Network
Lu Gan, Connor Lee, and Soon-Jo Chung
Abstract—This work presents a new method for unsuper-
vised thermal image classification and semantic segmentation
by transferring knowledge from the RGB domain using a
multi-domain attention network. Our method does not re-
quire any thermal annotations or co-registered RGB-thermal
pairs, enabling robots to perform visual tasks at night and
in adverse weather conditions without incurring additional
costs of data labeling and registration. Current unsupervised
domain adaptation methods look to align global images or
features across domains. However, the domain shift is
significantly larger for cross-modal data, and not all features
can be transferred. We solve this problem by using a shared
backbone network that promotes generalization, and domain-
specific attention that reduces negative transfer by attending
to domain-invariant and easily-transferable features. Our ap-
proach outperforms the state-of-the-art RGB-to-thermal adap-
tation method in classification benchmarks, and is successfully
applied to thermal river scene segmentation using only syn-
thetic RGB images. Our code is made publicly available at
https://github.com/ganlumomo/thermal-uda-attention.
I. INTRODUCTION
Cameras are critical for robot perception as they provide
dense measurements and rich environmental information.
However, most existing vision models are developed for cam-
eras operating in the visible spectrum due to their ubiquity
and the accessibility of large-scale RGB datasets [1], [2].
Although these models allow robotic systems such as au-
tonomous vehicles (AV) to work well in ideal conditions with
sufficient illumination, their performance is largely degraded
at night and in adverse conditions. Thermal cameras, on the
other hand, detect electromagnetic waves beyond the visible
spectrum that penetrate through dust, smoke, and light fog,
enabling around-the-clock robotic operations.
One popular approach towards robust vision is to lever-
age thermal images in conjunction with RGB via multi-
spectral sensor fusion. These methods have largely benefited
from recent interests in AV technology, resulting in curated
datasets [3], [4] being made publicly available. Notable
examples are GAFF [5] and CFT [6], two multi-spectral
object detection networks trained on paired RGB-thermal
image datasets for feature extraction and fusion. In par-
ticular, the fusion network in [6] sees a 25% performance
improvement over a single RGB branch on the FLIR-aligned
dataset [3]. Urban semantic segmentation has also been
improved for nighttime and adverse weather after integrating
thermal capabilities [7], [8], [9], [10]. However, these models
*This work is funded by Ford Motor Company and in part by the Office
of Naval Research. The authors are with the Division of Engineering and
Applied Science, California Institute of Technology, Pasadena, CA 91125,
USA {ganlu, clee, sjchung}@caltech.edu
[Fig. 1 diagram: synthetic annotated RGB → UDA → unannotated thermal]
Fig. 1: Our RGB-to-thermal unsupervised domain adapta-
tion (UDA) leverages knowledge learned from a synthetic
annotated RGB dataset to perform semantic segmentation on
thermal river scenes without requiring thermal annotations.
are fully-supervised, using annotated images or co-registered
RGB-thermal pairs which are expensive to acquire and small
in scale [11]. In non-AV applications, the lack of thermal
data and cost of labeling hinder the development of thermal
vision models, especially when current vision models, like
Transformers, have been trending larger [12].
To overcome this issue, we look to leverage existing large-
scale RGB datasets to learn thermal models via unsupervised
domain adaptation (UDA) techniques. UDA aims to transfer
the knowledge learned in a labeled source domain to an
unlabeled target domain [13]. Although most UDA methods
focus on domains from different environments but within
the same modality (mainly RGB images), such as GTAV-
to-Cityscapes [14], [15], the underlying assumption that a
domain-invariant feature representation exists also applies to
cross-modal data, especially for semantic-related tasks.
In this work, we aim to transfer knowledge learned from
labeled RGB images to unlabeled thermal images. This is
challenging for two reasons: First, cross-modal domains have
larger domain shifts and more dissimilar features compared
to domains within the same modality. UDA methods that match
global images or feature distributions of both domains can
hurt generalization and lead to negative transfer in which
untransferable features are forcefully aligned [16], [17], [18].
Second, UDA methods based on generative adversarial net-
works (GANs) need a large amount of unlabeled target data
to be well-trained [13], which may be unavailable in the
thermal domain.
We surmount these challenges by designing a multi-
domain attention network with a shared backbone and
domain-specific attention for RGB-to-thermal adaptation.
arXiv:2210.04367v1 [cs.CV] 9 Oct 2022
This shared backbone promotes generalization across do-
mains, prevents feature over-alignment, and relaxes the ther-
mal dataset size requirement. For feature alignment, we
train the target-specific attention using adversarial learning to
attend to and transfer more domain-invariant and transferable
features among all shared features to alleviate negative
transfer. The main contributions of our work are as follows:
• We establish an unsupervised RGB-to-thermal domain adaptation method using a multi-domain attention network and adversarial attention learning.
• We evaluate our method on thermal image classification tasks and outperform the state-of-the-art RGB-to-thermal adaptation approach on two benchmarks.
• We demonstrate the versatility of our approach, leveraging it to perform thermal river scene segmentation; to the best of our knowledge, we are the first to utilize synthetic RGB data for thermal semantic segmentation.
II. RELATED WORK
Unsupervised Domain Adaptation: UDA has been suc-
cessfully applied to a variety of vision tasks including image
classification [19], [20], [21], [22], [23], semantic segmenta-
tion [15], [14], [24] and 2D/3D object detection [25], [26].
Domain alignment is the fundamental principle of UDA,
and can be achieved by two main methodologies: domain
mapping and domain-invariant feature learning [13]. Domain
mapping can be viewed as pixel-level alignment which maps
images from one domain to another via image translation.
For instance, PixelDA [27] and CyCADA [15] map source
training data into the target domain using conditional GANs
and train the downstream model on the fake target data.
Pixel-level alignment can remove the domain differences
in the input space to some extent but such differences
are primarily low-level [13]. Other works achieve domain
adaptation by domain-invariant feature learning or feature-
level alignment. By mapping source and target input data
to the same feature distribution, a downstream predictor
trained on such domain-invariant source-domain features can
also work well on the target domain. This is typically done
by minimizing a distance defined on distributions [21], or by
adversarial training via a domain discriminator that attempts
to distinguish between source and target features [19], [20],
[22], [14], [23]. Our method is similar to these works and can
be viewed as an instance of the general pipeline in [20] by
leveraging a multi-domain network and attention mechanisms.
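The discriminator-based alignment these works share can be sketched as a two-player objective: the domain discriminator learns to separate source from target features, while the target encoder is updated with inverted labels so its features fool the discriminator (as in ADDA-style pipelines). The following is a minimal pure-Python illustration under assumed sigmoid discriminator outputs; the function names are ours, not from any of the cited works.

```python
import math

def discriminator_loss(d_src, d_tgt):
    """Binary cross-entropy for the domain discriminator.

    d_src, d_tgt: discriminator sigmoid outputs in (0, 1) on source and
    target features. The discriminator is trained to output 1 for source
    features and 0 for target features.
    """
    loss_src = -sum(math.log(p) for p in d_src) / len(d_src)
    loss_tgt = -sum(math.log(1.0 - p) for p in d_tgt) / len(d_tgt)
    return loss_src + loss_tgt

def encoder_adversarial_loss(d_tgt):
    """Inverted-label loss for the target encoder: the encoder is updated
    so that the discriminator classifies target features as source (label 1)."""
    return -sum(math.log(p) for p in d_tgt) / len(d_tgt)
```

Alternating minimization of these two losses drives the target feature distribution toward the source one, which is the feature-level alignment described above.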
RGB-to-Thermal UDA: Despite the success of UDA
on visible images, adapting models from visible to thermal
remains challenging due to their larger domain gap. Existing
RGB-to-thermal adaptation works like MS-UDA [9] and
HeatNet [10] distill knowledge from a semantic segmentation
network pretrained on RGB datasets to their two-stream
network by pseudo-labeling RGB-thermal image pairs. How-
ever, as the pseudo-labels are generated for the RGB image
in a pair, the main domain gap here is intra-modal, between
the pretraining dataset and RGB images in the paired dataset,
rather than inter-modal.
Our work is mostly related to SGADA [23] and Marnissi
et al. [26] which aim to transfer knowledge from RGB
to thermal without requiring thermal annotations or RGB-
thermal pairs. For pedestrian detection, Marnissi et al. [26]
incorporate alignment at different levels into Faster R-CNN [28]
using adversarial training. SGADA [23] is built
upon ADDA [20] with an additional self-training procedure.
Its pseudo-labeling considers not only the model's prediction
and confidence but also the prediction and confidence of the
domain discriminator. SGADA achieves the best results on the
MS-COCO [2] to FLIR ADAS [3] adaptation benchmark;
however, its performance largely depends on the quality of
the pseudo-labels generated by ADDA.
Attention Networks: Attention mechanisms allow models
to dynamically attend to certain parts of the input that are
more effective for a task, and have become an important
concept in neural networks. Attention can be grouped into
several types, including sequence attention, channel
attention [29], and spatial attention [30]. For domain adaptation, Wang
et al. [17] and Zhang et al. [18] propose transferable atten-
tion networks using self-attention mechanisms to highlight
transferable features. The spatial attention they employ
attends to different regions in a feature map. Instead, we
use channel-wise attention [29] to attend to different feature
maps and use residual adapters [31] to align them, with the
intuition that certain types of features are more transferable
than others. For cross-modal domains, differences in
transferability across feature types (i.e., channels) matter
more than differences across spatial regions.
III. PROPOSED METHOD
A. Multi-Domain Attention Network
Our multi-domain attention network design draws ideas
from multi-domain learning [31] and task attention mech-
anisms in multi-task learning [32]. Both works use a
shared backbone network and domain/task-specific parameters
to separate a shared representation learned across all
domains/tasks from domain/task-specific modeling capabilities.
It has been shown that sharing weights across domains/tasks
promotes generalization. In contrast to encouraging
disentanglement in a supervised setup [31], [32], we
use domain-specific attention with adversarial learning to
facilitate domain-invariant feature extraction and alignment
for domain adaptation.
Our multi-domain attention network consists of an
encoder-decoder backbone, shared by both source and target
domains, with domain-specific attention modules attached at
various stages of the encoder. For UDA classification (Fig. 2),
the architecture consists of the shared backbone and classifier
(blue), source-specific (green), and target-specific (red)
attention modules. Hypothesizing that different sensor modalities
favor different types of features, we use channel-wise attention,
i.e., Squeeze-and-Excitation (SE) [29], to highlight
more domain-invariant and easily-transferable feature maps
among all shared features, and use residual adapters [31] to
align them across domains.
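As a concrete illustration of the channel-wise SE attention described above, the sketch below implements the standard squeeze (global average pooling per channel), excitation (two fully-connected layers with ReLU and a sigmoid gate), and channel re-scaling steps in pure Python; in our multi-domain setting, each domain would hold its own copy of the excitation weights while the feature maps come from the shared backbone. The list-based tensors and weight names are illustrative assumptions, not the paper's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_attention(feature_maps, w1, w2):
    """Channel-wise Squeeze-and-Excitation attention.

    feature_maps: list of C channels, each a flat list of spatial values.
    w1: reduction weights, shape (C//r, C); w2: expansion weights, (C, C//r).
    Returns the feature maps with each channel scaled by a learned gate
    in (0, 1), so domain-specific weights can emphasize transferable channels.
    """
    # Squeeze: global average pooling over each channel's spatial extent.
    z = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid per output channel.
    h = [max(0.0, sum(w * zj for w, zj in zip(row, z))) for row in w1]
    s = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w2]
    # Scale: reweight each shared channel by its domain-specific gate.
    return [[v * si for v in ch] for ch, si in zip(feature_maps, s)]
```

Because the gates act on whole channels rather than spatial positions, a target-specific SE module can suppress modality-specific channels while passing through the shared, domain-invariant ones, matching the intuition stated above.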