such as cross-modal retrieval (Li et al., 2021b) and image-sentence matching (Xu et al., 2020b; Liu et al., 2020). In fact, the hierarchical structure of both texts and images calls for composition-level modeling beyond single tokens or image patches (Socher et al., 2014). Exploring compositional semantics for sarcasm detection helps to identify more complex inconsistencies, e.g., the inconsistency between a pair of related entities and a group of image patches.
Moreover, since the figurativeness and subtlety inherent in sarcastic utterances can hinder sarcasm detection, some works (Li et al., 2021a; Veale and Hao, 2010) found that identifying sarcasm also relies on external world knowledge beyond the input texts and images as additional contextual information. Furthermore, incorporating external knowledge to boost machine learning algorithms has drawn increasing research interest in areas such as recommender systems (Sun et al., 2021) and relation extraction (Sun et al., 2022).
Indeed, several studies extracted image attributes (Cai et al., 2019) or adjective-noun pairs (ANPs) (Xu et al., 2020a) from images as visual semantic information to bridge the gap between texts and images. However, constrained by limited training data, such external knowledge may be insufficient or inaccurate for representing the images (as shown in Figure 1), which can in turn harm sarcasm detection. Therefore, how to choose and leverage external knowledge for sarcasm detection is also worth investigating.
To tackle the limitations mentioned above, in this work we propose a novel hierarchical framework for sarcasm detection. Specifically, our method models both atomic-level congruity between independent image objects and tokens, and composition-level congruity that accounts for object relations and semantic dependencies, to promote multi-modal sarcasm identification. To obtain atomic-level congruity, we first adopt the multi-head cross-attention mechanism (Vaswani et al., 2017) to project features from different modalities into the same space, and then compute a similarity score for each token-object pair via inner products.
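As a rough illustration of this step, the following PyTorch sketch projects textual and visual features into a shared space and scores every token-object pair; the dimensions, module names, and the use of `nn.MultiheadAttention` are our own assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class AtomicCongruity(nn.Module):
    """Sketch of atomic-level congruity: cross-modal attention + inner products."""
    def __init__(self, text_dim=768, vis_dim=2048, hidden_dim=512, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)  # map tokens into shared space
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)    # map objects into shared space
        # Multi-head cross attention: text tokens (queries) attend to visual objects.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, text_feats, vis_feats):
        # text_feats: (B, n_tokens, text_dim); vis_feats: (B, n_objects, vis_dim)
        q = self.text_proj(text_feats)
        kv = self.vis_proj(vis_feats)
        attended_text, _ = self.cross_attn(q, kv, kv)             # (B, n_tokens, hidden_dim)
        # Atomic-level congruity: inner product of every token with every object.
        congruity = torch.bmm(attended_text, kv.transpose(1, 2))  # (B, n_tokens, n_objects)
        return attended_text, kv, congruity
```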
Next, we obtain composition-level congruity based on the output features of the textual and visual modalities acquired in the previous step. Concretely, we construct textual graphs and visual graphs using semantic dependencies among words and spatial dependencies among object regions, respectively, and capture composition-level features for each modality using graph attention networks (Veličković et al., 2018). Our model concatenates the atomic-level and composition-level congruity features so that semantic mismatches between texts and images at different levels are jointly considered. To clarify the terminology used in this paper: congruity denotes the semantic consistency between an image and a text; if the meanings of an image-text pair are contradictory, the pair receives a lower congruity score. Atomic-level congruity is measured between a single token and an image patch, whereas composition-level congruity is measured between a group of tokens (a phrase) and a group of patches (a visual object).
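A minimal sketch of the composition-level step is given below, using `GATConv` from PyTorch Geometric as a stand-in for the graph attention networks; the graph construction, feature dimensions, and the way the two congruity levels are combined are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class CompositionCongruity(nn.Module):
    """Sketch of composition-level congruity via graph attention over both modalities."""
    def __init__(self, hidden_dim=512, num_heads=4):
        super().__init__()
        # One GAT per modality: dependency graph for text, spatial graph for objects.
        self.text_gat = GATConv(hidden_dim, hidden_dim // num_heads, heads=num_heads)
        self.vis_gat = GATConv(hidden_dim, hidden_dim // num_heads, heads=num_heads)

    def forward(self, text_nodes, text_edges, vis_nodes, vis_edges):
        # text_nodes: (n_tokens, hidden_dim); text_edges: (2, E_t) dependency arcs
        # vis_nodes:  (n_objects, hidden_dim); vis_edges: (2, E_v) spatial links
        h_text = self.text_gat(text_nodes, text_edges)  # composed phrase features
        h_vis = self.vis_gat(vis_nodes, vis_edges)      # composed region features
        # Composition-level congruity between composed phrases and composed regions.
        return h_text @ h_vis.t()                        # (n_tokens, n_objects)

# The atomic- and composition-level congruity features can then be concatenated,
# e.g., torch.cat([atomic, compositional], dim=-1), before the final classifier.
```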
Last but not least, we propose to adopt pre-trained transferable foundation models (e.g., CLIP (Radford et al., 2021, 2019)) to extract textual information from the visual modality as external knowledge to assist sarcasm detection. The rationale for applying transferable foundation models lies in their effectiveness on a comprehensive set of tasks (e.g., descriptive and objective caption generation) in the zero-shot setting. As such, the extracted text contains ample information about the image, which can be used to construct additional discriminative features for sarcasm detection. Similar to the original textual input, the generated external knowledge also carries hierarchical information and can thus be consistently incorporated into our framework to compute multi-granularity congruity against the original text input.
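To make the caption-as-knowledge idea concrete, the snippet below generates a zero-shot image caption with a publicly available checkpoint from Hugging Face transformers (BLIP here, purely as a stand-in for the CLIP-based captioner used in the paper); the checkpoint name and file path are illustrative.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative checkpoint; the paper itself relies on a CLIP-based foundation model.
ckpt = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("tweet_image.jpg").convert("RGB")   # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
# The caption is then treated as a second textual input and passed through the
# same hierarchical congruity modules against the original tweet text.
```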
The main contributions of this paper are summa-
rized as follows: 1) To the best of our knowledge,
we are the first to exploit hierarchical semantic in-
teractions between textual and visual modalities
to jointly model the atomic-level and composition-
level congruities for sarcasm detection; 2) We pro-
pose a novel kind of external knowledge for sarcasm detection by using a pre-trained foundation model to generate image captions, which can be naturally adopted as input to our proposed framework; 3) We conduct extensive experiments on a publicly available multi-modal sarcasm detection benchmark dataset, showing the superiority of our method over state-of-the-art approaches, with additional improvement when external knowledge is used.