Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity
Modeling with Knowledge Enhancement
Hui Liu1, Wenya Wang2,3, Haoliang Li1
1City University of Hong Kong
2Nanyang Technological University
3University of Washington
liuhui3-c@my.cityu.edu.hk, wangwy@ntu.edu.sg, haoliang.li@cityu.edu.hk
Abstract
Sarcasm is a linguistic phenomenon indicating a discrepancy between literal meanings and implied intentions. Due to its sophisticated nature, sarcasm is usually challenging to detect from the text alone. As a result, multi-modal sarcasm detection has received growing attention in both academia and industry. However, most existing techniques only modeled the atomic-level inconsistencies between the text input and its accompanying image, ignoring more complex compositions for both modalities. Moreover, they neglected the rich information contained in external knowledge, e.g., image captions. In this paper, we propose a novel hierarchical framework for sarcasm detection that explores both the atomic-level congruity based on a multi-head cross attention mechanism and the composition-level congruity based on graph neural networks, where a post with low congruity can be identified as sarcastic. In addition, we study the effect of various knowledge resources on sarcasm detection. Evaluation results on a public multi-modal sarcasm detection dataset based on Twitter demonstrate the superiority of our proposed model.
1 Introduction
Sarcasm refers to satirical or ironic statements in which the literal meaning of the words is contrary to the actual intention of the speaker, typically to insult someone or humorously criticize something. Sarcasm detection has received considerable attention because sarcastic utterances are ubiquitous on today's social media platforms such as Twitter and Reddit. However, distinguishing sarcastic posts remains a challenging task to date in light of their highly figurative nature and intricate linguistic synonymy (Pan et al., 2020; Tay et al., 2018).
Early sarcasm detection methods mainly relied on fixed textual patterns, e.g., lexical indicators, syntactic rules, specific hashtag labels and emoji occurrences (Davidov et al., 2010; Maynard and Greenwood, 2014; Felbo et al., 2017), which usually had poor performance and generalization ability because they failed to exploit contextual information. To resolve this issue, several works (Tay et al., 2018; Joshi et al., 2015; Ghosh and Veale, 2017; Xiong et al., 2019) considered sarcasm contexts or the sentiments of sarcasm writers as useful clues to model the congruity level within texts, gaining consistent improvements. However, purely text-based sarcasm detection methods may fail to discriminate certain sarcastic utterances, as shown in Figure 1. In this case, it is hard to identify the actual sentiment of the text in the absence of the image forecasting severe weather. As text-image pairs are commonly observed on current social platforms, multi-modal methods become more effective for sarcasm prediction by capturing congruity information between the textual and visual modalities (Pan et al., 2020; Xu et al., 2020a; Schifanella et al., 2016; Liu et al., 2021; Liang et al., 2021; Cai et al., 2019).

Figure 1: An example of sarcasm along with the corresponding image and different types of external knowledge extracted from the image. The sentence expresses a need for good news; however, the TV program in the image is switched to bad news depicting severe storms (bad weather), which contradicts the sentence.
However, most existing multi-modal techniques only considered the congruity between each token and image patch (Xu et al., 2020a; Tay et al., 2018) and ignored the importance of multi-granularity alignments (e.g., at the granularity of objects and of relations between objects), which have proven effective in related tasks such as cross-modal retrieval (Li et al., 2021b) and image-sentence matching (Xu et al., 2020b; Liu et al., 2020). In fact, the hierarchical structures of both texts and images call for composition-level modeling beyond single tokens or image patches (Socher et al., 2014). Exploring compositional semantics for sarcasm detection helps identify more complex inconsistencies, e.g., the inconsistency between a pair of related entities and a group of image patches.
Moreover, since the figurativeness and subtlety inherent in sarcastic utterances complicate detection, some works (Li et al., 2021a; Veale and Hao, 2010) found that identifying sarcasm also relies on external knowledge of the world, beyond the input texts and images, as additional contextual information. Incorporating knowledge to boost machine learning algorithms has likewise drawn increasing research interest in areas such as recommendation systems (Sun et al., 2021) and relation extraction (Sun et al., 2022). Indeed, several studies extracted image attributes (Cai et al., 2019) or adjective-noun pairs (ANPs) (Xu et al., 2020a) from images as visual semantic information to bridge the gap between texts and images. However, constrained by limited training data, such external knowledge may not be sufficient or accurate enough to represent the images (as shown in Figure 1), which may harm sarcasm detection. Therefore, how to choose and leverage external knowledge for sarcasm detection is also worth investigating.
To tackle the limitations mentioned above, we propose a novel hierarchical framework for sarcasm detection. Specifically, our method models both the atomic-level congruity between independent image objects and tokens, and the composition-level congruity that considers object relations and semantic dependencies, to promote multi-modal sarcasm identification. To obtain atomic-level congruity, we first adopt the multi-head cross attention mechanism (Vaswani et al., 2017) to project features from different modalities into the same space, and then compute a similarity score for each token-object pair via inner products.
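As a rough illustration of this step, the following PyTorch sketch aligns token features with image-patch features via multi-head cross attention and scores every token-patch pair by an inner product in the shared space. The module name, projection layers, and dimensions are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AtomicCongruity(nn.Module):
    """Sketch: align the two modalities with multi-head cross attention and
    score token-patch pairs by inner product in the shared space."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_proj = nn.Linear(d_model, d_model)
        self.img_proj = nn.Linear(d_model, d_model)

    def forward(self, text_feats, img_feats):
        # text_feats: (B, n, d) token features; img_feats: (B, m, d) patch features.
        # Text queries attend to image patches, producing visually grounded
        # token features that later feed the composition-level module.
        attended, _ = self.cross_attn(text_feats, img_feats, img_feats)
        # Pairwise atomic congruity: an inner product for every token-patch pair.
        t = self.text_proj(text_feats)               # (B, n, d)
        v = self.img_proj(img_feats)                 # (B, m, d)
        scores = torch.einsum("bnd,bmd->bnm", t, v)  # (B, n, m)
        return attended, scores
```

A low maximum over the score matrix would then indicate that some tokens have no congruent visual counterpart, which is the atomic-level signal for sarcasm.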
Next, we obtain composition-level congruity based on the output features of both the textual and visual modalities acquired in the previous step. Concretely, we construct textual graphs and visual graphs using semantic dependencies among words and spatial dependencies among object regions, respectively, to capture composition-level features for each modality using graph attention networks (Veličković et al., 2018). Our model concatenates both atomic-level and composition-level congruity features, so that semantic mismatches between the texts and images at different levels are jointly considered. To clarify our terminology: congruity denotes the semantic consistency between image and text; if the meanings of an image-text pair are contradictory, the pair receives a low congruity score. Atomic-level congruity is measured between a token and an image patch, while composition-level congruity is measured between a group of tokens (a phrase) and a group of patches (a visual object).
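To make the composition-level step concrete, below is a minimal single-head graph attention layer in the spirit of Veličković et al. (2018); the adjacency matrix encodes dependency arcs for the textual graph or spatial neighbourhoods for the visual graph. This is a simplified sketch (it assumes the adjacency matrix includes self-loops so every node has at least one neighbour), not the exact multi-head architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Minimal single-head graph attention layer. adj is a binary adjacency
    matrix: dependency arcs for the textual graph, spatial neighbourhoods
    for the visual graph; self-loops are assumed to be included."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.a = nn.Linear(2 * d_out, 1, bias=False)

    def forward(self, x, adj):
        # x: (B, N, d_in); adj: (B, N, N) with 1 where an edge exists.
        h = self.W(x)                              # (B, N, d_out)
        N = h.size(1)
        # Score every node pair, then mask non-edges before the softmax.
        hi = h.unsqueeze(2).expand(-1, -1, N, -1)  # (B, N, N, d_out)
        hj = h.unsqueeze(1).expand(-1, N, -1, -1)  # (B, N, N, d_out)
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)           # attention over neighbours
        return F.elu(alpha @ h)                    # (B, N, d_out)
```

Running one such layer per modality over the atomic-module outputs yields phrase-level and object-level node features, from which composition-level congruity can be scored analogously to the atomic level.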
Last but not least, we propose to adopt pre-trained transferable foundation models (e.g., CLIP (Radford et al., 2021, 2019)) to extract textual information from the visual modality as external knowledge to assist sarcasm detection. The rationale for applying transferable foundation models is their effectiveness on a comprehensive set of tasks (e.g., descriptive and objective caption generation) under the zero-shot setting. As such, the extracted text contains ample information about the image, which can be used to construct additional discriminative features for sarcasm detection. Similar to the original textual input, the generated external knowledge also contains hierarchical information and can be consistently incorporated into our framework to compute multi-granularity congruity against the original text input.
The main contributions of this paper are summarized as follows: 1) To the best of our knowledge, we are the first to exploit hierarchical semantic interactions between textual and visual modalities to jointly model the atomic-level and composition-level congruities for sarcasm detection; 2) We propose a novel kind of external knowledge for sarcasm detection by using a pre-trained foundation model to generate image captions, which can be naturally adopted as input to our framework; 3) We conduct extensive experiments on a publicly available multi-modal sarcasm detection benchmark, showing the superiority of our method over state-of-the-art methods, with additional improvements from external knowledge.
2 Related Work
2.1 Multi-modality Sarcasm Detection
With the rapid growth of multi-modal posts on modern social media, detecting sarcasm across text and image modalities has attracted increasing research attention. Schifanella et al. (2016) first defined the multi-modal sarcasm detection task. Cai et al. (2019) created a multi-modal sarcasm detection dataset based on Twitter and proposed a strong baseline fusing features extracted from both modalities. Xu et al. (2020a) modeled both cross-modality contrast and semantic associations by constructing a Decomposition and Relation Network to capture commonalities and discrepancies between images and texts. Pan et al. (2020) and Liang et al. (2021) modeled intra-modality and inter-modality incongruities utilizing transformers (Vaswani et al., 2017) and graph neural networks, respectively. However, these works neglect the important role played by hierarchical, multi-level cross-modality mismatches. To address this limitation, we capture multi-level associations between modalities with cross attention and graph neural networks to identify sarcasm.
2.2 Knowledge Enhanced Sarcasm Detection
Li et al. (2021a) and Veale and Hao (2010) pointed out that commonsense knowledge is crucial for sarcasm detection. For multi-modal sarcasm detection, Cai et al. (2019) proposed to predict five attributes for each image based on the pre-trained ResNet model (He et al., 2016) as a third modality. In a similar fashion, Xu et al. (2020a) extracted adjective-noun pairs (ANPs) from every image to reason about discrepancies between texts and ANPs. In addition, as some samples contain text within the images, Pan et al. (2020) and Liang et al. (2021) proposed to apply Optical Character Recognition (OCR) to acquire the text on the images. More recently, Liang et al. (2022) proposed to incorporate an object detection framework and label information of detected visual objects to mitigate the modality gap. However, the knowledge extracted by these methods is either not expressive enough to convey the information in the images or is restricted to a fixed set, e.g., roughly one thousand classes for image attributes or ANPs. Moreover, not every sarcastic post has text on its image. To this end, we propose to generate a descriptive caption with rich semantic information for each image based on the pre-trained ClipCap model (Mokady et al., 2021), which uses the CLIP (Radford et al., 2021) encoding as a prefix to the caption via a simple mapping network and then fine-tunes GPT-2 (Radford et al., 2019) to generate the image captions.
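For illustration, caption-style external knowledge of this kind can be obtained from any pre-trained zero-shot captioner. The snippet below uses a Hugging Face image-to-text pipeline as a stand-in for ClipCap; the model name and file path are illustrative assumptions, not the setup used in the paper.

```python
from transformers import pipeline

# Stand-in for ClipCap: any pre-trained zero-shot captioner yields the kind
# of descriptive external knowledge discussed above. The model below is an
# illustrative choice, not the one used in the paper.
captioner = pipeline("image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")

# Hypothetical input file; in practice, each tweet image is captioned once.
caption = captioner("tweet_image.jpg")[0]["generated_text"]

# The caption is then treated as a "virtual" textual modality and fed through
# the same hierarchical congruity pipeline against the original tweet text.
print(caption)
```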
3 Methodology
Our proposed framework contains four main components: Feature Extraction, Atomic-Level Cross-Modal Congruity, Composition-Level Cross-Modal Congruity, and Knowledge Enhancement. Given an input text-image pair, the feature extraction module generates text features and image features via a pre-trained text encoder and image encoder, respectively. These features are then fed into the atomic-level cross-modal congruity module to obtain congruity scores via multi-head cross attention (MCA). To produce composition-level congruity scores, we construct a textual graph and a visual graph and adopt graph attention networks (GAT) to exploit complex compositions of tokens as well as image objects. The input features to the GAT are taken from the output of the atomic-level module. Due to the page limitation, we place the illustration of our framework in Figure 6. Our model can flexibly incorporate external knowledge as a "virtual" modality, which is used to generate complementary features, analogous to the image modality, for congruity score computation.
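The following skeleton sketches how these four components could be wired together; the module interfaces, the pooling choices, and the classifier head are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalSarcasmDetector(nn.Module):
    """Illustrative wiring of the four components described above; the
    constructor takes the sub-modules sketched earlier as arguments."""

    def __init__(self, text_encoder, image_encoder, atomic,
                 textual_gat, visual_gat, d_model: int = 768):
        super().__init__()
        self.text_encoder = text_encoder    # pre-trained text encoder
        self.image_encoder = image_encoder  # pre-trained image encoder
        self.atomic = atomic                # multi-head cross attention (MCA)
        self.textual_gat = textual_gat      # GAT over the dependency graph
        self.visual_gat = visual_gat        # GAT over the spatial graph
        # 2 * d_model graph features + 1 pooled atomic congruity score.
        self.classifier = nn.Linear(2 * d_model + 1, 2)

    def forward(self, text, image, text_adj, img_adj):
        h_t = self.text_encoder(text)       # (B, n, d) token features
        h_v = self.image_encoder(image)     # (B, m, d) patch features
        # Atomic level: cross-modal alignment and token-patch congruity.
        h_t_aligned, atomic_scores = self.atomic(h_t, h_v)
        # Composition level: refine each modality over its own graph.
        g_t = self.textual_gat(h_t_aligned, text_adj).mean(dim=1)
        g_v = self.visual_gat(h_v, img_adj).mean(dim=1)
        # Combine congruity features from both levels, then classify.
        a = atomic_scores.mean(dim=(1, 2)).unsqueeze(-1)   # (B, 1)
        feats = torch.cat([g_t, g_v, a], dim=-1)
        return self.classifier(feats)       # logits over {not sarcastic, sarcastic}
```

External knowledge, once captioned, would enter the same forward pass as an additional "virtual" text input with its own graph, mirroring the image branch.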
3.1 Task Definition & Motivation
Multi-modal sarcasm detection aims to identify whether a given text associated with an image has a sarcastic meaning. Formally, given a multi-modal text-image pair $(X_T, X_I)$, where $X_T$ corresponds to a textual tweet and $X_I$ is the corresponding image, the goal is to produce an output label $y \in \{0, 1\}$, where $1$ indicates a sarcastic tweet and $0$ otherwise. Our model learns a hierarchical multi-modal sarcasm detector, taking both atomic-level and composition-level congruity into consideration, based on the textual modality, the image modality, and, if chosen, the external knowledge.
The reason for composition-level modeling is to cope with the complex structures inherent in the two modalities. For example, as shown in Figure 2, the semantic meaning of the sentence depends on composing "your life", "awesome" and "pretend" to reflect a