Multi-Modal Fusion Transformer for Visual Question
Answering in Remote Sensing
Tim Siebert, Kai Norman Clasen, Mahdyar Ravanbakhsh, and Begüm Demir
Technische Universität Berlin, Einsteinufer 17, 10587 Berlin, Germany
ABSTRACT
With the new generation of satellite technologies, the archives of remote sensing (RS) images are growing very
fast. To make the intrinsic information of each RS image easily accessible, visual question answering (VQA)
has been introduced in RS. VQA allows a user to formulate a free-form question concerning the content of RS
images to extract generic information. It has been shown that the fusion of the input modalities (i.e., image
and text) is crucial for the performance of VQA systems. Most of the current fusion approaches use modality-
specific representations in their fusion modules instead of joint representation learning. However, to discover
the underlying relation between the image and question modalities, the model is required to learn a joint
representation instead of simply combining (e.g., concatenating, adding, or multiplying) the modality-specific
representations. We propose a multi-modal transformer-based architecture to overcome this issue. Our proposed
architecture consists of three main modules: i) the feature extraction module for extracting the modality-specific
features; ii) the fusion module, which leverages a user-defined number of multi-modal transformer layers of the
VisualBERT model (VB); and iii) the classification module to obtain the answer. In contrast to recently proposed
transformer-based models in RS VQA, the presented architecture (called VBFusion) is not limited to specific
questions, e.g., questions concerning pre-defined objects. Experimental results obtained on the RSVQAxBEN
and RSVQA-LR datasets (which are made up of RGB bands of Sentinel-2 images) demonstrate the effectiveness
of VBFusion for VQA tasks in RS. To analyze the importance of using other spectral bands for the description
of the complex content of RS images in the framework of VQA, we extend the RSVQAxBEN dataset to include
all the spectral bands of Sentinel-2 images with 10m and 20m spatial resolution. Experimental results show the
importance of utilizing these bands to characterize the land-use land-cover classes present in the images in the
framework of VQA. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/multi-modal-fusion-transformer-for-vqa-in-rs.
Keywords: Multi-modal transformer, visual question answering, deep learning, remote sensing.

Further author information: (Send correspondence to Tim Siebert)
Tim Siebert: E-mail: siebert@tu-berlin.de
*Equal Contribution
arXiv:2210.04510v1 [cs.CV] 10 Oct 2022
1. INTRODUCTION
With advances in satellite technology, remote sensing (RS) image archives are rapidly growing, providing an
unprecedented amount of data, which is a great source for information extraction in the framework of several
different Earth observation applications. As a result, there has been an increased demand for systems that
provide an intuitive interface to this wealth of information. For this purpose, the development of accurate visual
question answering (VQA) systems has recently become an important research topic in RS [1]. VQA defines a
framework that allows retrieval of use-case-specific information [2]. By asking a free-form question about a selected
image, the user can query various types of information without any need for remote sensing-related expertise.
Most of the existing VQA architectures consist of three main modules: i) the feature extraction module; ii)
the fusion module; and iii) the classification module. The feature extraction module extracts high-level features
for both input modalities (i.e., image and question text). After encoding the input modalities, the fusion module
is required to discover a cross-interaction between both features. Since the model needs to select the relevant part
of the image concerning the question, combining the features in a meaningful way is crucial. Finally, the output of
the fusion module is passed to the classification module to generate the natural language answer. In RS, relatively
few VQA architectures have been investigated [1, 3-7]. In [1], a convolutional neural network (CNN) is applied as an
image encoder in the feature extraction module. The natural language encoder is based on the skip-thoughts
[8] architecture, while the fusion module is defined based on a simple, non-learnable point-wise multiplication of
the feature vectors. In [4], the image modality representation is provided by a VGG-16 network [9], which extracts
two sets of image features: i) an image feature map extracted from the last convolutional layer; and ii) an image
feature vector derived from the final fully connected layer. The text modality representations are extracted by
a gated recurrent unit (GRU) [10]. The two modality-specific representations are merged by a fusion module
that leverages a more complex mutual-attention component. In [6], the authors reveal the potential of an
effective fusion module to build competitive RS VQA models. In the computer vision (CV) community, it has
been shown that to discover the underlying relation between the modalities, a model is required to learn a joint
representation instead of applying a simple combination (e.g., concatenation, addition, or multiplication) of the
modality-specific feature vectors [11]. In [3], this knowledge is transferred to the RS VQA domain by introducing
the first transformer-based architecture combined with an object detector pre-processor for the image features
(called CrossModal). The object detector is trained on an RS object detection dataset (i.e., the xView dataset
[12]) to recognize target objects such as cars, buildings, and ships. However, since the CrossModal architecture
leverages objects defined in xView, the model is specialized for VQA tasks that contain those objects. In contrast,
the largest RSVQA benchmark dataset (RSVQAxBEN [13]) includes questions regarding objects not included
in xView (e.g., the dataset contains questions that require a distinction between coniferous and broad-leaved
trees). In [7], a model called Prompt-RSVQA is proposed that takes advantage of the language transformer
DistilBERT [14]. To this end, the image feature extraction module uses a pre-trained multi-class classification
network to predict image labels, then projects the image labels into a word embedding space. Finally, the image
features encoded as words and the question text are merged in the transformer-based fusion module. Using word
embeddings of the image labels as image features in Prompt-RSVQA comes with limitations: i) the model is
tailored to the dataset that the classifier is trained on; ii) some questions are not covered in this formulation
(e.g., counting, comparison) since the relations between the class labels are not represented in the predicted label
set. The aforementioned models provide promising results, guiding the motivation towards fusion modules that
exploit transformer-based architectures. However, they are limited to the datasets that the classifier or the object
detector is trained on, which makes the models unable to handle questions that are not covered by this formulation
(e.g., counting, comparison, and objects that are not in the classifier/object detector training set).
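To make this distinction concrete, the following minimal sketch illustrates a modality-specific fusion of the kind used in [1], where pooled image and question vectors only interact through a non-learnable point-wise multiplication; all module and dimension names are illustrative placeholders, not taken from the cited implementation.

```python
import torch
import torch.nn as nn


class PointwiseFusionVQA(nn.Module):
    """Illustrative baseline: fuse pooled modality-specific vectors by element-wise multiplication."""

    def __init__(self, image_dim: int, text_dim: int, hidden_dim: int, num_answers: int):
        super().__init__()
        # Project each modality-specific representation into a common space.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Classification module: maps the fused vector to the fixed answer vocabulary.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # Non-learnable fusion: the two representations only interact through a
        # point-wise product, so no joint image-question representation is learned.
        fused = self.image_proj(image_feat) * self.text_proj(text_feat)
        return self.classifier(fused)
```

Because the interaction is restricted to an element-wise product of two pooled vectors, individual question tokens can never attend to individual image regions; this is exactly the limitation that motivates transformer-based joint fusion.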
To overcome these issues, in this paper, we present a multi-modal transformer-based architecture that lever-
ages a user-defined number of multi-modal transformer layers of the VisualBERT [11] model as a fusion module.
Instead of applying an object detector, we utilize BoxExtractor, a more general box extraction component that does not
overemphasize pre-defined objects, within the image feature extraction module. The resulting boxes are fed into a ResNet
[15] to generate image tokens as embeddings of the extracted boxes. For the text modality, the BertTokenizer [16]
is used to tokenize the questions. Our fusion module takes modality-specific tokens as input and processes them
with a user-defined number l of VisualBERT (VB) layers. The classification module consists of a Multi-Layer
Perceptron (MLP) that generates an output vector that represents the answer. Experimental results obtained on
large-scale RS VQA benchmark datasets [1,13] (which only include the RGB bands of Sentinel-2 multispectral
images) demonstrate the success of the proposed architecture (called VBFusion) compared to the standard RS
VQA model. To analyze the importance of the other spectral bands for characterization of the complex informa-
tion content of Sentinel-2 images in the framework of VQA, we add the other 10m and all 20m spectral bands to
the RSVQAxBEN [13] dataset. The results show that the inclusion of these spectral bands significantly improves
the VQA performance, as the additional spectral information helps the model to take better advantage of the
complex image data. To the best of our knowledge, this is the first study to consider multispectral RS VQA that
is not limited to the RGB bands.
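A minimal sketch of the resulting pipeline, assembled from off-the-shelf torchvision and Hugging Face components, is given below. It follows the description above (box-wise ResNet image tokens, BERT question tokens, a user-defined number l of VisualBERT layers, and an MLP classifier), but the checkpoint names, hyper-parameters, and the handling of the box crops are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn
import torchvision
from transformers import BertTokenizer, VisualBertModel


class VBFusionSketch(nn.Module):
    """Rough sketch of a VBFusion-style model; names and settings are illustrative."""

    def __init__(self, num_answers: int, num_vb_layers: int = 4):
        super().__init__()
        # Feature extraction module (image side): a ResNet backbone embeds each
        # extracted box into a 2048-d vector (pretrained weights could be loaded here).
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()
        self.box_encoder = backbone
        # Fusion module: the first `num_vb_layers` multi-modal transformer layers of
        # VisualBERT (the checkpoint name is an assumption, not the authors' choice).
        self.visual_bert = VisualBertModel.from_pretrained(
            "uclanlp/visualbert-vqa-coco-pre", num_hidden_layers=num_vb_layers
        )
        self.visual_proj = nn.Linear(2048, self.visual_bert.config.visual_embedding_dim)
        # Classification module: an MLP over the pooled joint representation.
        hidden = self.visual_bert.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_answers)
        )

    def forward(self, box_crops: torch.Tensor, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # box_crops: (batch, num_boxes, 3, H, W) image regions from the box extractor.
        b, n = box_crops.shape[:2]
        visual_embeds = self.box_encoder(box_crops.flatten(0, 1)).view(b, n, -1)
        visual_embeds = self.visual_proj(visual_embeds)
        outputs = self.visual_bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            visual_embeds=visual_embeds,
            visual_attention_mask=torch.ones(b, n, device=input_ids.device),
        )
        return self.classifier(outputs.pooler_output)


# Question tokens come from the BertTokenizer, as described above.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
question = tokenizer("Is a beach area present in the image?", return_tensors="pt")
```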
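On the data side, the multispectral extension requires resampling the 20 m bands to the 10 m grid before stacking them with the 10 m bands into a single input tensor. The sketch below assumes one GeoTIFF per band and bilinear upsampling; the file layout and resampling choice are illustrative, not the exact preprocessing of the extended RSVQAxBEN dataset.

```python
import numpy as np
import rasterio
from rasterio.enums import Resampling

# Sentinel-2 bands at 10 m and 20 m spatial resolution (60 m bands are excluded).
BANDS_10M = ["B02", "B03", "B04", "B08"]                # blue, green, red, NIR
BANDS_20M = ["B05", "B06", "B07", "B8A", "B11", "B12"]  # red edge, narrow NIR, SWIR


def load_band(path: str, out_height: int, out_width: int) -> np.ndarray:
    """Read a single-band GeoTIFF and resample it to the target grid."""
    with rasterio.open(path) as src:
        return src.read(1, out_shape=(out_height, out_width),
                        resampling=Resampling.bilinear)


def stack_patch(patch_dir: str, size: int = 120) -> np.ndarray:
    """Return a (10, size, size) array with all 10 m and 20 m bands of one patch."""
    bands = [load_band(f"{patch_dir}/{band}.tif", size, size)
             for band in BANDS_10M + BANDS_20M]
    return np.stack(bands).astype(np.float32)
```

On the model side, the first convolutional layer of the image backbone then has to accept 10 input channels instead of 3.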
The remaining sections are organized as follows: Section 2 will introduce the proposed architecture, including
the feature extraction pipeline. In Section 3, we will introduce the datasets used in the experiments and the
experimental setup, while in Section 4 we will analyze our results. We will conclude our work in Section 5 and
provide an outlook on future research directions.