
both modalities is learned. In contrast, late fusion approaches (Kiela et al., 2020) learn end-to-end models for each modality and combine their outputs. However, neither of these approaches is appropriate for hateful memes because the text in a meme does not play the role of an image caption.
Early fusion schemes are designed for tasks such as
captioning and visual question answering, where
there is a strong underlying assumption that the
associated text describes the contents of the image.
Hateful memes violate this assumption because the
text and image may imply different things. We believe that this phenomenon makes early fusion schemes suboptimal for hateful meme classification. In the example shown in the first row of Figure 1, the left meme is hateful because of the interaction between the image feature "skunk" and the text feature "you" in the context of the text feature "smell". On the other hand, the middle meme is non-hateful because "skunk" is replaced by "rose", and the right meme is also non-hateful because "you" is replaced by "skunk". Thus, the image and text features are related via common attributes. Since modeling such relationships is easier in the feature space, an intermediate fusion of image and text features is more suitable for hateful meme classification.
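To make the distinction concrete, the following PyTorch-style sketch contrasts late fusion, which combines the outputs of independent unimodal classifiers, with intermediate fusion, which combines the extracted features before classification. The dimensions, modules, and averaging rule are illustrative placeholders, not drawn from any cited system.

import torch
import torch.nn as nn

# Toy dimensions and random features, used purely for illustration.
d_img, d_txt, n_cls = 16, 16, 2
img_feat = torch.randn(1, d_img)  # stand-in for an image-encoder output
txt_feat = torch.randn(1, d_txt)  # stand-in for a text-encoder output

# Late fusion: separate unimodal classifiers; only their output scores are combined.
img_clf = nn.Linear(d_img, n_cls)
txt_clf = nn.Linear(d_txt, n_cls)
late_logits = 0.5 * (img_clf(img_feat) + txt_clf(txt_feat))

# Intermediate fusion: the features themselves are combined, so the classifier
# can model relationships between image and text attributes in the feature space.
joint_clf = nn.Linear(d_img + d_txt, n_cls)
inter_logits = joint_clf(torch.cat([img_feat, txt_feat], dim=-1))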
The ability to model relationships in the feature space also depends on the nature of the extracted image and text features. Existing intermediate fusion methods such as ConcatBERT (Kiela et al., 2020) pretrain the image and text encoders independently in a unimodal fashion. This can result in divergent image and text feature spaces, making it difficult to learn any relationship between them. Thus, there is a need to “align” the image and text features through multimodal pretraining. Moreover, hateful meme detection requires a faithful characterization of the interactions between fine-grained image and text attributes. Toward this goal, we make the following contributions in this paper:
• We propose an architecture called Hate-CLIPper for multimodal hateful meme classification, which relies on an intermediate fusion of aligned image and text representations obtained using the multimodally pretrained Contrastive Language-Image Pretraining (CLIP) encoders (Radford et al., 2021).
• We utilize bilinear pooling (outer product) for the intermediate fusion of the image and text features in Hate-CLIPper. We refer to this representation as the feature interaction matrix (FIM), which explicitly models the correlations between the dimensions of the image and text feature spaces (a minimal sketch of this fusion is given after this list). Owing to the expressiveness of the FIM representation built on the robust CLIP encoders, we show that a simple classifier trained for only a few epochs is sufficient to achieve state-of-the-art performance for hateful meme classification on three benchmark datasets, without any additional input features such as object bounding boxes, face detections, or text attributes.
• We demonstrate the interpretability of FIM by identifying salient locations in the FIM that trigger the classification decision and clustering the resulting trigger vectors. Results indicate that FIM indeed facilitates the learning of meaningful concepts.
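To illustrate the kind of intermediate fusion described above, the sketch below obtains image and text embeddings through the Hugging Face CLIP interface and forms a feature interaction matrix as an outer product, followed by a simple classification head. It is a minimal sketch under assumed choices (the "openai/clip-vit-base-patch32" checkpoint, placeholder inputs, L2 normalization, and a single linear head), not the exact Hate-CLIPper implementation.

import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with image/text projections would do.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder for a meme image
text = "placeholder meme text"        # placeholder for the overlaid text

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])         # (1, d)
    txt_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])      # (1, d)

# Feature interaction matrix (FIM): outer product of the (here L2-normalized)
# embeddings, pairing every image-feature dimension with every text-feature dimension.
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
fim = torch.einsum("bi,bj->bij", img_feat, txt_feat)  # (1, d, d)

# A simple classifier over the flattened FIM (placeholder head, 2 classes).
d = img_feat.shape[-1]
classifier = nn.Linear(d * d, 2)
logits = classifier(fim.flatten(start_dim=1))

Flattening the d x d interaction matrix is the simplest way to feed it to a classifier; lower-dimensional alternatives (e.g., keeping only element-wise products of the aligned dimensions) trade expressiveness for fewer parameters.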
2 Related Work
The Hateful Memes Challenge (HMC) competition (Kiela et al., 2020) established a benchmark dataset for hateful meme detection and evaluated the performance of humans as well as unimodal and multimodal ML models. The unimodal models in the HMC competition include: Image-Grid, based on ResNet-152 (He et al., 2016) features; Image-Region, based on Faster R-CNN (Ren et al., 2017) features; and Text-BERT, based on the original BERT (Devlin et al., 2018) features. The multimodal models include: Concat BERT, which uses a multilayer perceptron classifier on the concatenated ResNet-152 (image) and original BERT (text) features; MMBT (Kiela et al., 2019) models, with Image-Grid and Image-Region features; ViLBERT (Lu et al., 2019); and Visual BERT (Li et al., 2019). A late fusion approach based on the mean of the Image-Region and Text-BERT output scores was also considered. All the above models were benchmarked on the “test seen” split using the area under the receiver operating characteristic curve (AUROC) (Bradley, 1997) metric.
The results indicate a large performance gap between humans (AUROC of 82.65¹) and the best baseline using Visual BERT (AUROC of 75.44). The challenge report (Kiela et al., 2021), which was released after the end of the competition,

¹ https://ai.facebook.com/blog/hateful-memes-challenge-and-data-set/