few VQA architectures are investigated [1,3–7]. In [1], a convolutional neural network (CNN) is applied as an
image encoder in the feature extraction module. The natural language encoder is based on the skip-thoughts
[8] architecture, while the fusion module is defined as a simple, non-learnable point-wise multiplication of
the feature vectors. In [4], image modality representation is provided by a VGG-16 network [9], which extracts
two sets of image features: i) an image feature map extracted from the last convolutional layer; and ii) an image
feature vector derived from the final fully connected layer. The text modality representations are extracted by
a gated-recurrent unit (GRU) [10]. The two modality-specific representations are merged by the fusion module
that leverages a more complex mutual-attention mechanism. In [6], the authors reveal the potential of an
effective fusion module to build competitive RS VQA models. In the computer vision (CV) community, it has
been shown that to discover the underlying relation between the modalities, a model is required to learn a joint
representation instead of applying a simple combination (e.g., concatenation, addition, or multiplication) of the
modality-specific feature vectors [11]. In [3], this knowledge is transferred to the RS VQA domain by introducing
the first transformer-based architecture combined with an object detector pre-processor for the image features
(called CrossModal). The object detector is trained on an RS object detection dataset (i.e., the xView dataset
[12]) to recognize target objects such as cars, buildings, and ships. However, since the CrossModal architecture
leverages objects defined in xView, the model is specialized for VQA tasks that contain those objects. In contrast,
the largest RSVQA benchmark dataset (RSVQAxBEN [13]) includes questions regarding objects not included
in xView (e.g., the dataset contains questions that require a distinction between coniferous and broad-leaved
trees). In [7], a model called Prompt-RSVQA is proposed that takes advantage of the language transformer
DistilBERT [14]. To this end, the image feature extraction module uses a pre-trained multi-class classification
network to predict image labels, then projects the image labels into a word embedding space. Finally, the image
features encoded as words and the question text are merged in the transformer-based fusion module. Using word
embeddings of the image labels as image features in Prompt-RSVQA comes with limitations: i) the model is
tailored to the dataset that the classifier is trained on; ii) some questions are not covered in this formulation
(e.g., counting, comparison) since the relations between the class labels are not represented in the predicted label
set. The aforementioned models provide promising results and motivate the use of fusion modules that
exploit transformer-based architectures. However, they are limited to the datasets that the classifier or the object
detector is trained on, which prevents the models from answering questions that are not covered by this formulation
(e.g., counting, comparison, and objects that are not in the classifier/object detector training set).
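As an illustration of this distinction, the following minimal PyTorch sketch contrasts a fixed point-wise multiplication of modality-specific feature vectors (as used in [1]) with a learnable joint encoding by a small transformer; all modules and tensor sizes are illustrative placeholders and do not correspond to any of the cited architectures.

```python
import torch
import torch.nn as nn

# Toy modality-specific feature vectors (batch of 4, 512-dimensional each);
# the dimensions are illustrative only.
img_feat = torch.randn(4, 512)
txt_feat = torch.randn(4, 512)

# i) Non-learnable fusion: a simple point-wise multiplication of the two vectors.
fused_simple = img_feat * txt_feat                        # (4, 512)

# ii) Learnable joint representation: both modalities are treated as tokens and
# processed together by a transformer encoder, so cross-modal interactions can
# be learned instead of being fixed by the fusion operator.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
joint_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
tokens = torch.stack([img_feat, txt_feat], dim=1)         # (4, 2 tokens, 512)
fused_joint = joint_encoder(tokens).mean(dim=1)           # (4, 512)
```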
To overcome these issues, in this paper, we present a multi-modal transformer-based architecture that leverages
a user-defined number of multi-modal transformer layers of the VisualBERT [11] model as its fusion module.
Instead of applying an object detector, we utilize BoxExtractor, a more general box extractor that does not
overemphasize objects, within the image feature extraction module. The resulting boxes are fed into a ResNet
[15] to generate image tokens as embeddings of the extracted boxes. For the text modality, the BertTokenizer [16]
is used to tokenize the questions. Our fusion module takes modality-specific tokens as input and processes them
with a user-defined number of VisualBERT (VB) layers. The classification module consists of a Multi-Layer
Perceptron (MLP) that generates an output vector representing the answer.
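To make this processing chain concrete, the following minimal PyTorch sketch illustrates one possible realization; it assumes the Hugging Face VisualBertModel and a torchvision ResNet-50, and the checkpoint name, the placeholder constants NUM_ANSWERS and NUM_VB_LAYERS, and the predict_answer function (which expects already cropped boxes from the box extractor) are hypothetical and do not correspond to the exact implementation described in Section 2.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from transformers import BertTokenizer, VisualBertModel

NUM_ANSWERS = 1000    # placeholder size of the answer vocabulary
NUM_VB_LAYERS = 4     # user-defined number of VisualBERT layers (placeholder value)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
fusion = VisualBertModel.from_pretrained("uw-madison/visualbert-vqa-coco-pre")
# One possible way to keep only the first NUM_VB_LAYERS multi-modal layers.
fusion.encoder.layer = fusion.encoder.layer[:NUM_VB_LAYERS]

# ResNet backbone without its classification head, used to embed the extracted boxes.
backbone = nn.Sequential(*list(resnet50(weights=ResNet50_Weights.DEFAULT).children())[:-1])
to_visual = nn.Linear(2048, fusion.config.visual_embedding_dim)
classifier = nn.Sequential(nn.Linear(fusion.config.hidden_size, 512),
                           nn.ReLU(),
                           nn.Linear(512, NUM_ANSWERS))

def predict_answer(box_crops: torch.Tensor, question: str) -> torch.Tensor:
    """box_crops: (num_boxes, 3, H, W) image regions returned by the box extractor."""
    # Image tokens: one ResNet embedding per extracted box.
    visual_embeds = to_visual(backbone(box_crops).flatten(1)).unsqueeze(0)
    visual_mask = torch.ones(visual_embeds.shape[:2], dtype=torch.long)
    # Text tokens: the question is tokenized with the BertTokenizer.
    text = tokenizer(question, return_tensors="pt")
    # Fusion: text and image tokens are processed jointly by the VisualBERT layers.
    out = fusion(**text, visual_embeds=visual_embeds, visual_attention_mask=visual_mask)
    # Classification: an MLP maps the pooled joint representation to answer scores.
    return classifier(out.pooler_output)
```

Experimental results obtained on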
large-scale RS VQA benchmark datasets [1,13] (which only include the RGB bands of Sentinel-2 multispectral
images) demonstrate the success of the proposed architecture (called VBFusion) compared to the standard RS
VQA model. To analyze the importance of the other spectral bands for characterization of the complex informa-
tion content of Sentinel-2 images in the framework of VQA, we add the other 10m and all 20m spectral bands to
the RSVQAxBEN [13] dataset. The results show that the inclusion of these spectral bands significantly improves
the VQA performance, as the additional spectral information helps the model to take better advantage of the
complex image data. To the best of our knowledge, this is the first study to consider multispectral RS VQA that
is not limited to the RGB bands.
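As an illustration of how such a multispectral input can be assembled, the sketch below stacks the four 10m bands and the six 20m bands of a Sentinel-2 patch into a single tensor; the patch sizes (following BigEarthNet), the random placeholder values, and the bicubic resampling of the 20m bands are assumptions made for this example rather than a description of our exact pre-processing.

```python
import torch
import torch.nn.functional as F

# Random values stand in for actual Sentinel-2 reflectances; patch sizes follow
# BigEarthNet (120 x 120 pixels at 10m, 60 x 60 pixels at 20m).
bands_10m = {b: torch.rand(120, 120) for b in ("B02", "B03", "B04", "B08")}
bands_20m = {b: torch.rand(60, 60) for b in ("B05", "B06", "B07", "B8A", "B11", "B12")}

def stack_bands(bands_10m, bands_20m):
    # 10m bands are kept at their native resolution.
    channels = list(bands_10m.values())
    # 20m bands are resampled to the 10m grid so that all channels share the
    # same spatial size (bicubic interpolation is one possible choice).
    for band in bands_20m.values():
        up = F.interpolate(band[None, None], size=(120, 120), mode="bicubic",
                           align_corners=False)
        channels.append(up[0, 0])
    return torch.stack(channels)  # (10, 120, 120) multispectral input

patch = stack_bands(bands_10m, bands_20m)
```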
The remaining sections are organized as follows. Section 2 will introduce the proposed architecture, including
the feature extraction pipeline. In Section 3, we will introduce the datasets used in the experiments and the
experimental setup, while in Section 4 we will analyze our results. We will conclude our work in Section 5 and
provide an outlook on future research directions.