few VQA architectures are investigated [1,3–7]. In [1], a convolutional neural network (CNN) is applied as an
image encoder in the feature extraction module. The natural language encoder is based on the skip-thoughts
[8] architecture, while the fusion module is defined as a simple, non-learnable point-wise multiplication of
the feature vectors. In [4], image modality representation is provided by a VGG-16 network [9], which extracts
two sets of image features: i) an image feature map extracted from the last convolutional layer; and ii) an image
feature vector derived from the final fully connected layer. The text modality representations are extracted by
a gated-recurrent unit (GRU) [10]. The two modality-specific representations are merged by the fusion module
that leverages a more complex mutual-attention mechanism. In [6], the authors reveal the potential of an
effective fusion module to build competitive RS VQA models. In the computer vision (CV) community, it has
been shown that to discover the underlying relation between the modalities, a model is required to learn a joint
representation instead of applying a simple combination (e.g., concatenation, addition, or multiplication) of the
modality-specific feature vectors [11]. In [3], this knowledge is transferred to the RS VQA domain by introducing
the first transformer-based architecture combined with an object detector pre-processor for the image features
(called CrossModal). The object detector is trained on an RS object detection dataset (i.e., the xView dataset
[12]) to recognize target objects such as cars, buildings, and ships. However, since the CrossModal architecture
leverages objects defined in xView, the model is specialized for VQA tasks that contain those objects. In contrast,
the largest RSVQA benchmark dataset (RSVQAxBEN [13]) includes questions regarding objects not included
in xView (e.g., the dataset contains questions that require a distinction between coniferous and broad-leaved
trees). In [7], a model called Prompt-RSVQA is proposed that takes advantage of the language transformer
DistilBERT [14]. To this end, the image feature extraction module uses a pre-trained multi-class classification
network to predict image labels, then projects the image labels into a word embedding space. Finally, the image
features encoded as words and the question text are merged in the transformer-based fusion module. Using word
embeddings of the image labels as image features in Prompt-RSVQA comes with limitations: i) the model is
tailored to the dataset that the classifier is trained on; ii) some questions are not covered in this formulation
(e.g., counting, comparison) since the relations between the class labels are not represented in the predicted label
set. The aforementioned models provide promising results and motivate the use of fusion modules that
exploit transformer-based architectures. However, they are limited to the datasets that the classifier or the object
detector is trained on, which prevents the models from answering questions that are not covered by this formulation
(e.g., counting, comparison, and objects that are not in the classifier/object detector training set).
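As an illustration of this distinction, the following minimal PyTorch sketch contrasts a fixed point-wise multiplication of modality-specific feature vectors (as used in [1]) with a learnable joint encoding by a small transformer; all modules and tensor sizes are illustrative placeholders and do not correspond to any of the cited architectures.

```python
import torch
import torch.nn as nn

# Toy modality-specific feature vectors (batch of 4, 512-dimensional each);
# the dimensions are illustrative only.
img_feat = torch.randn(4, 512)
txt_feat = torch.randn(4, 512)

# i) Non-learnable fusion: a simple point-wise multiplication of the two vectors.
fused_simple = img_feat * txt_feat                        # (4, 512)

# ii) Learnable joint representation: both modalities are treated as tokens and
# processed together by a transformer encoder, so cross-modal interactions can
# be learned instead of being fixed by the fusion operator.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
joint_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
tokens = torch.stack([img_feat, txt_feat], dim=1)         # (4, 2 tokens, 512)
fused_joint = joint_encoder(tokens).mean(dim=1)           # (4, 512)
```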
To overcome these issues, in this paper, we present a multi-modal transformer-based architecture that leverages
a user-defined number of multi-modal transformer layers of the VisualBERT [11] model as its fusion module.
Instead of applying an object detector, we utilize BoxExtractor, a more general box extractor that does not
overemphasize objects, within the image feature extraction module. The resulting boxes are fed into a ResNet
[15] to generate image tokens as embeddings of the extracted boxes. For the text modality, the BertTokenizer [16]
is used to tokenize the questions. Our fusion module takes modality-specific tokens as input and processes them
with a user-defined number of VisualBERT (VB) layers. The classification module consists of a Multi-Layer
Perceptron (MLP) that generates an output vector representing the answer.
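To make this processing chain concrete, the following minimal PyTorch sketch illustrates one possible realization; it assumes the Hugging Face VisualBertModel and a torchvision ResNet-50, and the checkpoint name, the placeholder constants NUM_ANSWERS and NUM_VB_LAYERS, and the predict_answer function (which expects already cropped boxes from the box extractor) are hypothetical and do not correspond to the exact implementation described in Section 2.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from transformers import BertTokenizer, VisualBertModel

NUM_ANSWERS = 1000    # placeholder size of the answer vocabulary
NUM_VB_LAYERS = 4     # user-defined number of VisualBERT layers (placeholder value)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
fusion = VisualBertModel.from_pretrained("uw-madison/visualbert-vqa-coco-pre")
# One possible way to keep only the first NUM_VB_LAYERS multi-modal layers.
fusion.encoder.layer = fusion.encoder.layer[:NUM_VB_LAYERS]

# ResNet backbone without its classification head, used to embed the extracted boxes.
backbone = nn.Sequential(*list(resnet50(weights=ResNet50_Weights.DEFAULT).children())[:-1])
to_visual = nn.Linear(2048, fusion.config.visual_embedding_dim)
classifier = nn.Sequential(nn.Linear(fusion.config.hidden_size, 512),
                           nn.ReLU(),
                           nn.Linear(512, NUM_ANSWERS))

def predict_answer(box_crops: torch.Tensor, question: str) -> torch.Tensor:
    """box_crops: (num_boxes, 3, H, W) image regions returned by the box extractor."""
    # Image tokens: one ResNet embedding per extracted box.
    visual_embeds = to_visual(backbone(box_crops).flatten(1)).unsqueeze(0)
    visual_mask = torch.ones(visual_embeds.shape[:2], dtype=torch.long)
    # Text tokens: the question is tokenized with the BertTokenizer.
    text = tokenizer(question, return_tensors="pt")
    # Fusion: text and image tokens are processed jointly by the VisualBERT layers.
    out = fusion(**text, visual_embeds=visual_embeds, visual_attention_mask=visual_mask)
    # Classification: an MLP maps the pooled joint representation to answer scores.
    return classifier(out.pooler_output)
```

Experimental results obtained on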
large-scale RS VQA benchmark datasets [1,13] (which only include the RGB bands of Sentinel-2 multispectral
images) demonstrate the success of the proposed architecture (called VBFusion) compared to the standard RS
VQA model. To analyze the importance of the other spectral bands for characterization of the complex informa-
tion content of Sentinel-2 images in the framework of VQA, we add the other 10m and all 20m spectral bands to
the RSVQAxBEN [13] dataset. The results show that the inclusion of these spectral bands significantly improves
the VQA performance, as the additional spectral information helps the model to take better advantage of the
complex image data. To the best of our knowledge, this is the first study to consider multispectral RS VQA that
is not limited to the RGB bands.
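As an illustration of how such a multispectral input can be assembled, the sketch below stacks the four 10m bands and the six 20m bands of a Sentinel-2 patch into a single tensor; the patch sizes (following BigEarthNet), the random placeholder values, and the bicubic resampling of the 20m bands are assumptions made for this example rather than a description of our exact pre-processing.

```python
import torch
import torch.nn.functional as F

# Random values stand in for actual Sentinel-2 reflectances; patch sizes follow
# BigEarthNet (120 x 120 pixels at 10m, 60 x 60 pixels at 20m).
bands_10m = {b: torch.rand(120, 120) for b in ("B02", "B03", "B04", "B08")}
bands_20m = {b: torch.rand(60, 60) for b in ("B05", "B06", "B07", "B8A", "B11", "B12")}

def stack_bands(bands_10m, bands_20m):
    # 10m bands are kept at their native resolution.
    channels = list(bands_10m.values())
    # 20m bands are resampled to the 10m grid so that all channels share the
    # same spatial size (bicubic interpolation is one possible choice).
    for band in bands_20m.values():
        up = F.interpolate(band[None, None], size=(120, 120), mode="bicubic",
                           align_corners=False)
        channels.append(up[0, 0])
    return torch.stack(channels)  # (10, 120, 120) multispectral input

patch = stack_bands(bands_10m, bands_20m)
```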
The remaining sections are organized as follows. Section 2 will introduce the proposed architecture, including
the feature extraction pipeline. In Section 3, we will introduce the datasets used in the experiments and the
experimental setup, while in Section 4 we will analyze our results. We will conclude our work in Section 5 and
provide an outlook on future research directions.