
Hamburg Senate decided to revitalize the former warehouse
district at the Elbe riverbank that was abandoned due to the
containerization of goods, making the storage capacities of the
warehouses superfluous. This revitalization project was called
HafenCity and yielded a mixed-use urban district that includes the
Elbphilharmonie, the International Maritime Museum, and the
HafenCity University. The Elbphilharmonie
was designed to serve as an icon for this cultural upgrade
and a new landmark for Hamburg. The bottom of the building
is an old brick warehouse. On top of that is a modern glass
construction that imitates a hoisted sail or the sea’s waves.
This unique architecture stands out. However, many critics
argued that this modern architecture disturbs the view of
the historic part of the city. Moreover, the construction cost of
the Elbphilharmonie was €866 million, several times
the initially estimated price. Consequently,
the Elbphilharmonie is a controversial building, admired and
criticized simultaneously [5].
This paper aims to provide an overview of the different
opinions communicated on social media. The proposed data
pipeline processes domain-specific social media data and
yields a structured data analysis. As part of this pipeline,
the domain expert can select an image of a building as the
aspect of interest. Then, all images in the dataset depicting
the same building are retrieved, and a message-level and
aspect-based sentiment analysis is conducted on these posts.
The contribution of this paper is two-fold: On the one hand, a
new approach for multimodal aspect-based sentiment analysis
on social media data is proposed. On the other hand, this
approach was proven effective in an interdisciplinary project
between domain experts and computer scientists to conduct an
empirical study about the Elbphilharmonie in Hamburg. The
proposed pipeline can be applied to any building by selecting
other query images and aspects of interest. In addition, it can
be transferred to unseen data by extracting image features from
the respective data and updating the queries accordingly. Our
code and dataset are published on GitHub¹.
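The pipeline described above can be illustrated with a minimal sketch. It assumes that each post already carries a precomputed global image feature vector and that retrieval is done by thresholding cosine similarity to the query image; the names `Post`, `retrieve_posts`, `analyze`, the threshold value, and the pluggable `sentiment_fn` are hypothetical and are not taken from the published code.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Post:
    text: str                    # caption or description of the post
    image_features: List[float]  # precomputed global image descriptor (assumed)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_posts(query: List[float], posts: List[Post],
                   threshold: float = 0.8) -> List[Post]:
    """Keep the posts whose image is similar enough to the query image."""
    return [p for p in posts
            if cosine_similarity(query, p.image_features) >= threshold]


def analyze(query: List[float], posts: List[Post],
            sentiment_fn: Callable[[str], str],
            threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Retrieve matching posts, then label the text of each one."""
    matches = retrieve_posts(query, posts, threshold)
    return [(p.text, sentiment_fn(p.text)) for p in matches]
```

Applying the sketch to another building would only require a different query vector, which mirrors the transferability claim made above; `sentiment_fn` stands in for either a message-level or an aspect-based classifier.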
The remainder of this paper is structured as follows: Section II
discusses previous approaches towards image retrieval on
landmark images, sentiment analysis on social media, and the
combination of both. Section III showcases the study design
and the resulting dataset from the platform Flickr. Finally, in
Sections IV and V, different image retrieval and sentiment
analysis methods are compared on test datasets, and the best-
performing ones are applied to filter the Flickr dataset.
II. RELATED WORK
Social media and online review data have been used to mine
users’ opinions in different domains, for example, opinions
towards a specific brand [6], [7]. The methods used in these
studies include topic modeling [8] or aspect-based sentiment
analysis [6] and focus on textual data. To account for the
multimodal nature of social media posts, the authors in [9]
included the posts’ images in their study by clustering them
based on their depicted content. As a result, they could identify
different brands and products in the images and apply
sentiment analysis to the accompanying texts. Similarly, Fang
et al. [10] extracted aspects from images, such as buildings,
and performed aspect-based sentiment analysis on them. These
approaches focus on retrieving different targets addressed
in the data. In contrast, the authors in [11] conducted a
case study about a specific, pre-defined target, Mount
Rinjani, a popular tourist destination in Indonesia. They selected
images portraying the mountain in question from social media
platforms and performed a dictionary-based sentiment analysis
on the image descriptions to retrieve the tourists’ opinions
towards this destination. This paper reports on a similar study.
However, our advanced image-based aspect selection and
sentiment analysis approaches yield a more in-depth analysis.
¹ https://github.com/MiriUll/multimodal_ABSA_Elbphilharmonie
A. Image retrieval on landmark images
Image retrieval is the task of finding images showing similar
objects in an extensive database of images. The images are
transformed into a feature space, and the features are compared
to find similar contents. In contrast to image classification,
where models must assign all images of a class the same label regardless
of the intra-class diversity, image retrieval features must
account for precisely these differences [12]. Traditional techniques,
such as the scale-invariant feature transform (SIFT) [13]
or KAZE [14], describe images based on distinctive locations
and interest points in them [15]. To build a global feature
vector based on these local properties, the descriptors are
aggregated, e.g., by clustering them into visual words as in the
vector of locally aggregated descriptors (VLAD) [16]. Other
approaches learn image representations with neural networks
by fine-tuning pre-trained classification models for the retrieval
task. To fine-tune for retrieval on landmark images, the authors
in [17] published the Google Landmarks dataset and trained
their deep local features (DELF) model on it. Other landmark
retrieval models are the average precision model [18] or the
deep local and global features (DELG) model [19].
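The aggregation of local descriptors into a single global vector can be sketched as follows. This is a simplified illustration of the VLAD idea, assuming the visual-word centers have already been obtained (e.g., by k-means on training descriptors); the function names are hypothetical, and the additional normalization steps used in some VLAD variants are omitted.

```python
import math
from typing import List


def squared_distance(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vlad(descriptors: List[List[float]],
         centers: List[List[float]]) -> List[float]:
    """Aggregate local descriptors into one L2-normalized global vector.

    Each descriptor is assigned to its nearest center (visual word), and
    the residual (descriptor minus center) is accumulated per center.
    """
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for d in descriptors:
        nearest = min(range(len(centers)),
                      key=lambda k: squared_distance(d, centers[k]))
        for i in range(dim):
            residuals[nearest][i] += d[i] - centers[nearest][i]
    # Concatenate the per-center residual sums into one vector.
    flat = [v for r in residuals for v in r]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat
```

The resulting vector has a fixed length (number of centers times descriptor dimension) regardless of how many local descriptors an image yields, which is what makes two images directly comparable in the feature space.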
B. (Aspect-based) sentiment analysis
In this paper, two different types of sentiments are analyzed.
The message-level sentiment describes the overall sentiment
of a post. In contrast, aspect-based models investigate the
sentiment about a specific word or phrase in the post. With
this, the models can retrieve opinions about a specific topic,
independent of the overall sentiment of a message [20]. Neural
networks are a popular choice for sentiment
classifiers because their layered structure can perform an in-depth
analysis of the input data and therefore capture
complex text features [21]. The International
Workshop on Semantic Evaluation 2017 (SemEval-2017)
featured a task about sentiment analysis in Twitter
posts [22]. In this task, ensemble models, i.e., models that
combine different layer types, were among the most popular
and successful competitors. Among them are combinations of
convolutional neural networks (CNNs) and long short-term
memory networks (LSTMs) [23], [24] and a combination of