This shared backbone promotes generalization across domains, prevents feature over-alignment, and relaxes the thermal dataset size requirement. For feature alignment, we train the target-specific attention with adversarial learning to attend to, and thus transfer, the more domain-invariant and transferable features among all shared features, alleviating negative transfer. The main contributions of our work are as follows:
• We establish an unsupervised RGB-to-thermal domain adaptation method using a multi-domain attention network and adversarial attention learning.
• We evaluate our method on thermal image classification tasks and outperform the state-of-the-art RGB-to-thermal adaptation approach on two benchmarks.
• We demonstrate the versatility of our approach by leveraging it to perform thermal river scene segmentation and, to the best of our knowledge, are the first to utilize synthetic RGB data for thermal semantic segmentation.
II. RELATED WORK
Unsupervised Domain Adaptation: UDA has been successfully applied to a variety of vision tasks, including image classification [19], [20], [21], [22], [23], semantic segmentation [15], [14], [24], and 2D/3D object detection [25], [26].
Domain alignment is the fundamental principle of UDA and can be achieved by two main methodologies: domain mapping and domain-invariant feature learning [13]. Domain mapping can be viewed as pixel-level alignment, which maps images from one domain to another via image translation. For instance, PixelDA [27] and CyCADA [15] map source training data into the target domain using conditional GANs and train the downstream model on the fake target data. Pixel-level alignment can remove the domain differences in the input space to some extent, but such differences are primarily low-level [13]. Other works achieve domain adaptation by domain-invariant feature learning, i.e., feature-level alignment. By mapping source and target inputs to the same feature distribution, a downstream predictor trained on such domain-invariant source features can also work well on the target domain. This is typically done by minimizing a distance defined on feature distributions [21], or by adversarial training with a domain discriminator that attempts to distinguish between source and target features [19], [20], [22], [14], [23]. Our method is similar to these works and can be viewed as an instance of the general pipeline of [20], leveraging a multi-domain network and attention mechanisms.
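To make the adversarial feature-level alignment concrete, below is a minimal sketch of the two alternating losses used in discriminator-based UDA such as [19], [20]; the network sizes and module names are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn

feat_dim = 256  # assumed feature dimension
# Domain discriminator: predicts whether a feature came from the source domain.
discriminator = nn.Sequential(
    nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(f_src, f_tgt):
    # Train the discriminator to separate source (label 1) from target (label 0).
    logits = torch.cat([discriminator(f_src), discriminator(f_tgt)])
    labels = torch.cat([torch.ones(len(f_src), 1), torch.zeros(len(f_tgt), 1)])
    return bce(logits, labels)

def alignment_loss(f_tgt):
    # Train the target feature extractor to fool the discriminator, i.e.,
    # make target features look source-like (inverted labels, as in [20]).
    return bce(discriminator(f_tgt), torch.ones(len(f_tgt), 1))
```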
RGB-to-Thermal UDA: Despite the success of UDA on visible images, adapting models from the visible to the thermal modality remains challenging due to the larger domain gap. Existing RGB-to-thermal adaptation works such as MS-UDA [9] and HeatNet [10] distill knowledge from a semantic segmentation network pretrained on RGB datasets into their two-stream networks by pseudo-labeling RGB-thermal image pairs. However, since the pseudo-labels are generated for the RGB image of each pair, the main domain gap here is intra-modal, between the pretraining dataset and the RGB images of the paired dataset, rather than inter-modal.
Our work is most closely related to SGADA [23] and Marnissi et al. [26], which aim to transfer knowledge from RGB to thermal without requiring thermal annotations or RGB-thermal pairs. For pedestrian detection, Marnissi et al. [26] incorporate alignment at different levels into Faster R-CNN [28] using adversarial training. SGADA [23] is built upon ADDA [20] with an additional self-training procedure. For pseudo-labeling, it considers not only the predictions and confidences of the model but also those of the domain discriminator. It achieves the best results on the MS-COCO [2] to FLIR ADAS [3] adaptation benchmark; however, its performance largely depends on the quality of the pseudo-labels generated by ADDA.
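To illustrate this selection rule, here is a hedged sketch of SGADA-style pseudo-label filtering, where a target sample is kept only if both the classifier and the domain discriminator are confident; the thresholds and tensor shapes are assumptions for illustration, not values from [23].

```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(cls_logits, disc_logits,
                         cls_thresh=0.9, disc_thresh=0.8):
    """Keep target samples for self-training only when both the classifier
    and the domain discriminator are confident (thresholds assumed).

    cls_logits:  (N, num_classes) classifier outputs on target images.
    disc_logits: (N, 1) discriminator logits ('source-likeness').
    """
    cls_conf, pseudo_labels = F.softmax(cls_logits, dim=1).max(dim=1)
    disc_conf = torch.sigmoid(disc_logits).squeeze(1)
    # A target feature the discriminator judges source-like with high
    # confidence is considered well aligned, so its pseudo-label is trusted.
    keep = (cls_conf > cls_thresh) & (disc_conf > disc_thresh)
    return pseudo_labels[keep], keep
```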
Attention Networks: Attention mechanisms allow models to dynamically attend to the parts of the input that are most useful for a task, and have become an important component of neural networks. Attention can be grouped into different types, including sequence attention, channel attention [29], and spatial attention [30]. For domain adaptation, Wang et al. [17] and Zhang et al. [18] propose transferable attention networks that use self-attention mechanisms to highlight transferable features. The spatial attention they employ attends to different regions of a feature map. Instead, we use channel-wise attention [29] to attend to different feature maps and residual adapters [31] to align them, with the intuition that certain types of features are more transferable than others. For cross-modal domains, differences in transferability across feature types (i.e., channels) deserve more focus than differences across feature regions (i.e., spatial locations).
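For concreteness, below is a minimal sketch of the Squeeze-and-Excitation (SE) channel attention [29] that we build on; the reduction ratio of 16 is the common default, assumed here rather than taken from our configuration.

```python
import torch.nn as nn

class SEAttention(nn.Module):
    """Channel-wise (Squeeze-and-Excitation) attention [29]: re-weights
    whole feature maps (channels) rather than spatial locations."""
    def __init__(self, channels, reduction=16):  # reduction ratio assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                  # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))             # squeeze: global average pooling
        w = self.fc(w)                     # excite: per-channel weights in (0, 1)
        return x * w.view(*w.shape, 1, 1)  # scale each feature map
```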
III. PROPOSED METHOD
A. Multi-Domain Attention Network
Our multi-domain attention network design draws ideas from multi-domain learning [31] and the task attention mechanisms of multi-task learning [32]. Both works use a shared backbone network and domain/task-specific parameters to separate a shared representation, learned from all domains/tasks, from domain/task-specific modeling capabilities. It has been shown that sharing weights across domains/tasks promotes generalization. In contrast to [31], [32], which encourage disentanglement in a supervised setup, we use domain-specific attention with adversarial learning to facilitate domain-invariant feature extraction and alignment for domain adaptation.
Our multi-domain attention network consists of an encoder-decoder backbone, shared by both the source and target domains, with domain-specific attention modules attached at various stages of the encoder. For UDA classification (Fig. 2), the architecture consists of the shared backbone and classifier (blue), source-specific (green), and target-specific (red) attention modules. Hypothesizing that different sensor modalities favor different types of features, we use channel-wise attention, i.e., Squeeze-and-Excitation (SE) [29], to highlight the more domain-invariant and easily transferable feature maps among all shared features, and residual adapters [31] to align them across domains.
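A hedged sketch of one encoder stage with this design is given below, reusing the SEAttention module sketched earlier; the exact placement of the per-domain attention and the 1x1-convolution residual adapter [31] within the backbone is an assumption for illustration.

```python
import torch.nn as nn

class DomainAttentionStage(nn.Module):
    """One encoder stage: a shared convolutional block plus per-domain
    channel attention and a residual adapter [31] (illustrative sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.shared = nn.Sequential(      # weights shared by both domains
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU())
        domains = ("source", "target")
        self.attn = nn.ModuleDict(        # domain-specific SE attention
            {d: SEAttention(out_ch) for d in domains})
        self.adapter = nn.ModuleDict(     # domain-specific 1x1 adapters
            {d: nn.Conv2d(out_ch, out_ch, 1) for d in domains})

    def forward(self, x, domain):
        f = self.shared(x)
        # Highlight the more transferable feature maps for this domain,
        # then add a residual adaptation of the shared features.
        return self.attn[domain](f) + self.adapter[domain](f)
```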