Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features
Gokul Karthik Kumar Karthik Nandakumar
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Abu Dhabi, UAE
{gokul.kumar, karthik.nandakumar}@mbzuai.ac.ae
Abstract
Hateful memes are a growing menace on social media. While the image and its corresponding text in a meme are related, they do not necessarily convey the same meaning when viewed individually. Hence, detecting hateful memes requires careful consideration of both visual and textual information. Multimodal pre-training can be beneficial for this task because it effectively captures the relationship between the image and the text by representing them in a similar feature space. Furthermore, it is essential to model the interactions between the image and text features through intermediate fusion. Most existing methods either employ multimodal pre-training or intermediate fusion, but not both. In this work, we propose the Hate-CLIPper architecture, which explicitly models the cross-modal interactions between the image and text representations obtained using Contrastive Language-Image Pre-training (CLIP) encoders via a feature interaction matrix (FIM). A simple classifier based on the FIM representation is able to achieve state-of-the-art performance on the Hateful Memes Challenge (HMC) dataset with an AUROC of 85.8, which even surpasses the human performance of 82.65. Experiments on other meme datasets such as Propaganda Memes and TamilMemes also demonstrate the generalizability of the proposed approach. Finally, we analyze the interpretability of the FIM representation and show that cross-modal interactions can indeed facilitate the learning of meaningful concepts. The code for this work is available at https://github.com/gokulkarthik/hateclipper.
1 Introduction
Multimodal memes, which can be narrowly defined as images overlaid with text that spread from person to person, are a popular form of communication on social media (Kiela et al., 2020). While most Internet memes are harmless (and often humorous), some of them can represent hate speech. Given the scale of the Internet, it is impossible to manually detect such hateful memes and stop their spread. However, automated hateful meme detection is also challenging due to the multimodal nature of the problem.

Figure 1: Illustrative (not real) examples of multimodal hateful memes from Kiela et al. (2020). While the memes in the left column are hateful, the ones in the middle are non-hateful image confounders, and those on the right are non-hateful text confounders.
Research on automated hateful meme detection has recently been spurred by the Hateful Memes Challenge competition (Kiela et al., 2020) held at NeurIPS 2020, with a focus on identifying multimodal hateful memes. The memes in this challenge were curated in such a way that only a combination of visual and textual information could succeed. This was achieved by creating non-hateful "confounder" memes by changing only the image or text in the hateful memes, as shown in Figure 1. In these examples, an image/text can be harmless or hateful depending on subtle contextual information contained in the other modality. Thus, multimodal (image and text) machine learning (ML) models are a prerequisite for robust and accurate detection of such hateful memes.
In a multimodal system, the fusion of different modalities can occur at various levels. In early fusion schemes (Kiela et al., 2019; Lu et al., 2019; Li et al., 2019), the raw inputs (e.g., image and text) are combined and a joint representation of both modalities is learned. In contrast, late fusion approaches (Kiela et al., 2020) learn end-to-end models for each modality and combine their outputs. However, neither of these approaches is appropriate for hateful memes because the text in a meme does not play the role of an image caption. Early fusion schemes are designed for tasks such as captioning and visual question answering, where there is a strong underlying assumption that the associated text describes the contents of the image. Hateful memes violate this assumption because the text and image may imply different things. We believe that this phenomenon makes early fusion schemes non-optimal for hateful meme classification. In the example shown in the first row of Figure 1, the left meme is hateful because of the interaction between the image feature "skunk" and the text feature "you" in the context of the text feature "smell". On the other hand, the middle meme is non-hateful because "skunk" is replaced by "rose", and the right meme is also non-hateful because "you" is replaced by "skunk". Thus, the image and text features are related via common attribute(s). Since modeling such relationships is easier in the feature space, an intermediate fusion of image and text features is more suitable for hateful meme classification.
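To make the distinction concrete, the following minimal PyTorch sketch (not taken from any of the cited systems) contrasts late fusion, where each modality is classified independently and the outputs are averaged, with intermediate fusion, where the feature vectors are combined before classification. The feature dimensions and layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

D_IMG, D_TXT, N_CLASSES = 512, 512, 2  # hypothetical feature sizes


class LateFusion(nn.Module):
    """Each modality is classified independently; the output logits are averaged."""
    def __init__(self):
        super().__init__()
        self.img_head = nn.Linear(D_IMG, N_CLASSES)
        self.txt_head = nn.Linear(D_TXT, N_CLASSES)

    def forward(self, img_feat, txt_feat):
        return 0.5 * (self.img_head(img_feat) + self.txt_head(txt_feat))


class IntermediateFusion(nn.Module):
    """Features are combined first, so the classifier can model cross-modal interactions."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(D_IMG + D_TXT, 256), nn.ReLU(), nn.Linear(256, N_CLASSES)
        )

    def forward(self, img_feat, txt_feat):
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))


img_feat, txt_feat = torch.randn(4, D_IMG), torch.randn(4, D_TXT)
print(LateFusion()(img_feat, txt_feat).shape)          # torch.Size([4, 2])
print(IntermediateFusion()(img_feat, txt_feat).shape)  # torch.Size([4, 2])
```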
The ability to model relationships in the feature space also depends on the nature of the extracted image and text features. Existing intermediate fusion methods such as ConcatBERT (Kiela et al., 2020) pretrain the image and text encoders independently in a unimodal fashion. This could result in divergent image and text feature spaces, making it difficult to learn any relationship between them. Thus, there is a need to "align" the image and text features through multimodal pretraining. Moreover, hateful meme detection requires faithful characterization of interactions between fine-grained image and text attributes. Towards achieving this goal, we make the following contributions in this paper:
• We propose an architecture called Hate-CLIPper for multimodal hateful meme classification, which relies on an intermediate fusion of aligned image and text representations obtained using the multimodally pretrained Contrastive Language-Image Pre-training (CLIP) encoders (Radford et al., 2021).
• We utilize bilinear pooling (outer product) for the intermediate fusion of the image and text features in Hate-CLIPper. We refer to this representation as the feature interaction matrix (FIM), which explicitly models the correlations between the dimensions of the image and text feature spaces (see the illustrative sketch after this list). Due to the expressiveness of the FIM representation built on the robust CLIP encoders, we show that a simple classifier trained for a few epochs is sufficient to achieve state-of-the-art performance for hateful meme classification on three benchmark datasets, without any additional input features such as object bounding boxes, face detection, and text attributes.
• We demonstrate the interpretability of the FIM by identifying salient locations in the FIM that trigger the classification decision and clustering the resulting trigger vectors. Results indicate that the FIM indeed facilitates the learning of meaningful concepts.
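As a concrete illustration of the outer-product fusion described in the second contribution above, the sketch below is a minimal, simplified version of the idea, assuming CLIP ViT-B/32 features obtained via the Hugging Face transformers library; the projection size, classifier widths, and variable names are illustrative assumptions rather than the configuration used in the paper (see the released code for the actual implementation).

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

NAME = "openai/clip-vit-base-patch32"   # assumed checkpoint; larger CLIP variants also work
clip = CLIPModel.from_pretrained(NAME).eval()
processor = CLIPProcessor.from_pretrained(NAME)

N = 64  # hypothetical projected feature size used only for this illustration


class FIMClassifier(nn.Module):
    """Project frozen CLIP features, take their outer product (FIM), flatten, classify."""
    def __init__(self, clip_dim=512, n=N):
        super().__init__()
        self.img_proj = nn.Linear(clip_dim, n)
        self.txt_proj = nn.Linear(clip_dim, n)
        self.head = nn.Sequential(nn.Linear(n * n, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, img_feat, txt_feat):
        p_i = self.img_proj(img_feat)               # (B, n) projected image features
        p_t = self.txt_proj(txt_feat)               # (B, n) projected text features
        fim = torch.einsum("bi,bj->bij", p_t, p_i)  # (B, n, n) outer product: FIM[j, k] = p_t[j] * p_i[k]
        return self.head(fim.flatten(start_dim=1))  # hatefulness logit


meme_image = Image.new("RGB", (224, 224))           # stand-in for a real meme image
inputs = processor(text=["love the way you smell today"], images=meme_image,
                   return_tensors="pt", padding=True)
with torch.no_grad():                               # CLIP encoders stay frozen
    img_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

logit = FIMClassifier()(img_feat, txt_feat)
print(logit.shape)  # torch.Size([1, 1])
```

The cross-fusion variant indicated in Figure 2 flattens the full n × n matrix as above; the align-fusion variant presumably retains only the matching dimensions (an element-wise product), but the released code should be treated as authoritative on these details.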
2 Related Work
The Hateful Memes Challenge (HMC) competition (Kiela et al., 2020) established a benchmark dataset for hateful meme detection and evaluated the performance of humans as well as unimodal and multimodal ML models. The unimodal models in the HMC competition include: Image-Grid, based on ResNet-152 (He et al., 2016) features; Image-Region, based on Faster R-CNN (Ren et al., 2017) features; and Text-BERT, based on the original BERT (Devlin et al., 2018) features. The multimodal models include: Concat BERT, which uses a multilayer perceptron classifier based on the concatenated ResNet-152 (image) and original BERT (text) features; MMBT (Kiela et al., 2019) models, with Image-Grid and Image-Region features; ViLBERT (Lu et al., 2019); and Visual BERT (Li et al., 2019). A late fusion approach based on the mean of the Image-Region and Text-BERT output scores was also considered. All the above models were benchmarked on the "test seen" split using the area under the receiver operating characteristic curve (AUROC) (Bradley, 1997) metric. The results indicate a large performance gap between humans (AUROC of 82.65¹) and the best baseline using Visual BERT (AUROC of 75.44).
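For reference, AUROC is a threshold-free ranking metric over the model's hatefulness scores. The toy scikit-learn snippet below, with made-up labels and scores unrelated to any reported numbers, shows how the metric is typically computed.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                 # 1 = hateful, 0 = non-hateful (toy labels)
y_score = [0.1, 0.4, 0.8, 0.65, 0.3, 0.2]   # model's predicted probability of "hateful"
print(roc_auc_score(y_true, y_score))       # ranking quality in [0, 1]; 0.5 is chance level
```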
¹ https://ai.facebook.com/blog/hateful-memes-challenge-and-data-set/

Figure 2: Proposed architecture of Hate-CLIPper for Multimodal Hateful Meme Classification. (Diagram summary: the image and text are passed through frozen CLIP encoders and trainable projection layers; the projected features form the feature interaction matrix with entries p_{t,j} p_{i,k}, which is flattened and fed to pre-output and output layers, with align-fusion and cross-fusion variants indicated.)

The challenge report (Kiela et al., 2021), which was released after the end of the competition, showed that all the top five submissions (Zhu, 2020; Muennighoff, 2020; Velioglu and Rose, 2020; Lippe et al., 2020; Sandulescu, 2020) achieved better AUROC than the baseline methods. This improvement was achieved primarily through the use of ensemble models and/or external data and additional input features. For example, Zhu (2020) used a diverse ensemble of VL-BERT (Su et al., 2019), UNITER-ITM (Chen et al., 2019), VILLA-ITM (Gan et al., 2020), and ERNIE-ViL (Yu et al., 2020) with additional information about entity, race, and gender extracted using Cloud APIs and other models. This method achieved the best AUROC of 84.50 on the "test unseen" split.
Mathias et al. (2021) extended the HMC dataset with fine-grained labels for protected category and attack type. Protected category labels include race, disability, religion, nationality, sex, and empty protected category. Attack types were labeled as contempt, mocking, inferiority, slur, exclusion, dehumanizing, inciting violence, and empty attack. Zia et al. (2021) used CLIP (Radford et al., 2021) encoders to obtain image and text features, which were simply concatenated and passed to a logistic regression classifier. Separate classification models were learned for the two multilabel classification tasks: protected categories and attack types.
MOMENTA (Pramanick et al., 2021) also uses representations generated from CLIP encoders, but augments them with the additional feature representations of objects and faces using VGG-19 (Simonyan and Zisserman, 2014) and text attributes using DistilBERT (Sanh et al., 2019). Furthermore, MOMENTA uses cross-modality attention fusion (CMAF), which concatenates text and image features (weighted by their respective attention scores) and learns a cross-modal weight matrix to further modulate the concatenated features. MOMENTA reports performance only on the HarMeme dataset (Sandulescu, 2020).
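The CMAF description above can be read loosely as in the sketch below. This only mirrors the preceding sentence (attention-weighted concatenation followed by a learned cross-modal weighting) and is not MOMENTA's actual implementation; all dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class CMAFSketch(nn.Module):
    """Weight each modality by a learned attention score, concatenate,
    then modulate the concatenation with a learned cross-modal weight vector."""
    def __init__(self, d_img=512, d_txt=512):
        super().__init__()
        self.att_img = nn.Linear(d_img, 1)                # scalar attention per image feature vector
        self.att_txt = nn.Linear(d_txt, 1)
        self.cross_weight = nn.Parameter(torch.ones(d_img + d_txt))

    def forward(self, img_feat, txt_feat):
        a_i = torch.sigmoid(self.att_img(img_feat))       # (B, 1) image attention score
        a_t = torch.sigmoid(self.att_txt(txt_feat))       # (B, 1) text attention score
        fused = torch.cat([a_i * img_feat, a_t * txt_feat], dim=-1)
        return fused * self.cross_weight                  # cross-modal modulation

out = CMAFSketch()(torch.randn(2, 512), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 1024])
```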
Although bilinear pooling (Tenenbaum and Freeman, 2000), i.e., the outer product of different feature spaces, has shown improvements on various multimodal tasks (Fukui et al., 2016; Arevalo et al., 2017; Kiela et al., 2018), it has not been well explored with multimodally pretrained encoders with aligned feature spaces, such as CLIP, or for the hateful meme classification task.
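Concretely, for projected text and image features $p_t, p_i \in \mathbb{R}^n$ (the notation follows Figure 2), the bilinear-pooled representation is the outer product

$$\mathrm{FIM} = p_t\, p_i^{\top} \in \mathbb{R}^{n \times n}, \qquad \mathrm{FIM}_{jk} = p_{t,j}\, p_{i,k},$$

so each entry explicitly captures the pairwise interaction between one text feature dimension and one image feature dimension; in Hate-CLIPper this matrix is flattened and passed to the classification layers (Figure 2).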
3 Methodology
Our objective is to develop a simple end-to-end model for hateful meme classification that avoids the need for sophisticated ensemble approaches and any external data or labels. We hypothesize that there is sufficiently rich information available in the CLIP visual and text representations, and that the missing link is the failure to adequately model the interactions between these feature spaces. Hence, we propose the Hate-CLIPper architecture shown in Figure 2. In the proposed Hate-CLIPper architecture, the image i and text t are passed through pretrained CLIP image and text encoders (whose