such as cross-modal retrieval (Li et al., 2021b) and image-sentence matching (Xu et al., 2020b; Liu et al., 2020). In fact, the hierarchical structure of both texts and images calls for composition-level modeling beyond single tokens or image patches (Socher et al., 2014). Exploring compositional semantics for sarcasm detection helps to identify more complex inconsistencies, e.g., the inconsistency between a pair of related entities and a group of image patches.
Moreover, since the figurativeness and subtlety inherent in sarcastic utterances can hinder sarcasm detection, some works (Li et al., 2021a; Veale and Hao, 2010) found that identifying sarcasm also relies on external world knowledge beyond the input texts and images as additional contextual information. Furthermore, incorporating external knowledge to boost machine learning algorithms has drawn increasing research interest in areas such as recommender systems (Sun et al., 2021) and relation extraction (Sun et al., 2022).
Indeed, several studies extracted image attributes (Cai et al., 2019) or adjective-noun pairs (ANPs) (Xu et al., 2020a) from images as visual semantic information to bridge the gap between texts and images. However, constrained by limited training data, such external knowledge may be insufficient or inaccurate for representing the images (as shown in Figure 1), which can in turn harm sarcasm detection. Therefore, how to choose and leverage external knowledge for sarcasm detection is also worth investigating.
To tackle the limitations mentioned above, in this work we propose a novel hierarchical framework for sarcasm detection. Specifically, our method models both atomic-level congruity between independent image objects and tokens, and composition-level congruity that accounts for object relations and semantic dependencies, to promote multi-modal sarcasm identification. To obtain atomic-level congruity, we first adopt the multi-head cross-attention mechanism (Vaswani et al., 2017) to project features from different modalities into the same space, and then compute a similarity score for each token-object pair via inner products.
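As a rough illustration of this step, the following PyTorch sketch projects textual and visual features into a shared space and scores every token-object pair; the dimensions, module names, and the use of `nn.MultiheadAttention` are our own assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class AtomicCongruity(nn.Module):
    """Sketch of atomic-level congruity: cross-modal attention + inner products."""
    def __init__(self, text_dim=768, vis_dim=2048, hidden_dim=512, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)  # map tokens into shared space
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)    # map objects into shared space
        # Multi-head cross attention: text tokens (queries) attend to visual objects.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, text_feats, vis_feats):
        # text_feats: (B, n_tokens, text_dim); vis_feats: (B, n_objects, vis_dim)
        q = self.text_proj(text_feats)
        kv = self.vis_proj(vis_feats)
        attended_text, _ = self.cross_attn(q, kv, kv)             # (B, n_tokens, hidden_dim)
        # Atomic-level congruity: inner product of every token with every object.
        congruity = torch.bmm(attended_text, kv.transpose(1, 2))  # (B, n_tokens, n_objects)
        return attended_text, kv, congruity
```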
Next, we obtain composition-level congruity based on the output features of the textual and visual modalities acquired in the previous step. Concretely, we construct textual graphs and visual graphs using semantic dependencies among words and spatial dependencies among object regions, respectively, and capture composition-level features for each modality using graph attention networks (Veličković et al., 2018). Our model concatenates the atomic-level and composition-level congruity features so that semantic mismatches between texts and images at different levels are jointly considered. To clarify the terminology used in this paper: congruity denotes the semantic consistency between an image and a text; if the meanings of an image-text pair are contradictory, the pair receives a lower congruity score. Atomic-level congruity is measured between a single token and an image patch, whereas composition-level congruity is measured between a group of tokens (a phrase) and a group of patches (a visual object).
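A minimal sketch of the composition-level step is given below, using `GATConv` from PyTorch Geometric as a stand-in for the graph attention networks; the graph construction, feature dimensions, and the way the two congruity levels are combined are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class CompositionCongruity(nn.Module):
    """Sketch of composition-level congruity via graph attention over both modalities."""
    def __init__(self, hidden_dim=512, num_heads=4):
        super().__init__()
        # One GAT per modality: dependency graph for text, spatial graph for objects.
        self.text_gat = GATConv(hidden_dim, hidden_dim // num_heads, heads=num_heads)
        self.vis_gat = GATConv(hidden_dim, hidden_dim // num_heads, heads=num_heads)

    def forward(self, text_nodes, text_edges, vis_nodes, vis_edges):
        # text_nodes: (n_tokens, hidden_dim); text_edges: (2, E_t) dependency arcs
        # vis_nodes:  (n_objects, hidden_dim); vis_edges: (2, E_v) spatial links
        h_text = self.text_gat(text_nodes, text_edges)  # composed phrase features
        h_vis = self.vis_gat(vis_nodes, vis_edges)      # composed region features
        # Composition-level congruity between composed phrases and composed regions.
        return h_text @ h_vis.t()                        # (n_tokens, n_objects)

# The atomic- and composition-level congruity features can then be concatenated,
# e.g., torch.cat([atomic, compositional], dim=-1), before the final classifier.
```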
Last but not least, we propose to adopt pre-trained transferable foundation models (e.g., CLIP (Radford et al., 2021, 2019)) to extract textual information from the visual modality as external knowledge to assist sarcasm detection. The rationale for applying transferable foundation models lies in their effectiveness on a comprehensive set of tasks (e.g., descriptive and objective caption generation) in the zero-shot setting. As such, the extracted text contains ample information about the image, which can be used to construct additional discriminative features for sarcasm detection. Similar to the original textual input, the generated external knowledge also carries hierarchical information and can thus be consistently incorporated into our framework to compute multi-granularity congruity against the original text input.
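To make the caption-as-knowledge idea concrete, the snippet below generates a zero-shot image caption with a publicly available checkpoint from Hugging Face transformers (BLIP here, purely as a stand-in for the CLIP-based captioner used in the paper); the checkpoint name and file path are illustrative.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative checkpoint; the paper itself relies on a CLIP-based foundation model.
ckpt = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("tweet_image.jpg").convert("RGB")   # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
# The caption is then treated as a second textual input and passed through the
# same hierarchical congruity modules against the original tweet text.
```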
The main contributions of this paper are summa-
rized as follows: 1) To the best of our knowledge,
we are the first to exploit hierarchical semantic in-
teractions between textual and visual modalities
to jointly model the atomic-level and composition-
level congruities for sarcasm detection; 2) We pro-
pose a novel kind of external knowledge for sarcasm detection by using a pre-trained foundation model to generate image captions, which can be naturally adopted as input to our proposed framework; 3) We conduct extensive experiments on a publicly available multi-modal sarcasm detection benchmark dataset, showing the superiority of our method over state-of-the-art approaches, with additional improvement when external knowledge is used.