
both modalities is learned. In contrast, late fusion approaches (Kiela et al., 2020) learn end-to-end models for each modality and combine their outputs. However, neither of these approaches is appropriate for hateful memes because the text in a meme does not play the role of an image caption.
Early fusion schemes are designed for tasks such as
captioning and visual question answering, where
there is a strong underlying assumption that the
associated text describes the contents of the image.
Hateful memes violate this assumption because the
text and image may imply different things. We believe that this phenomenon makes early fusion schemes suboptimal for hateful meme classification. In the example shown in the first row of Figure 1, the left meme is hateful because of the interaction between the image feature "skunk" and the text feature "you" in the context of the text feature "smell". On the other hand, the middle meme is non-hateful because "skunk" is replaced by "rose", and the right meme is also non-hateful because "you" is replaced by "skunk". Thus, the image and text features are related via common attributes. Since modeling such relationships is easier in the feature space, an intermediate fusion of image and text features is more suitable for hateful meme classification.
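To make the distinction concrete, the following PyTorch-style sketch contrasts late fusion, which combines the outputs of independent unimodal classifiers, with intermediate fusion, which combines the extracted features before classification. The dimensions, modules, and averaging rule are illustrative placeholders, not drawn from any cited system.

import torch
import torch.nn as nn

# Toy dimensions and random features, used purely for illustration.
d_img, d_txt, n_cls = 16, 16, 2
img_feat = torch.randn(1, d_img)  # stand-in for an image-encoder output
txt_feat = torch.randn(1, d_txt)  # stand-in for a text-encoder output

# Late fusion: separate unimodal classifiers; only their output scores are combined.
img_clf = nn.Linear(d_img, n_cls)
txt_clf = nn.Linear(d_txt, n_cls)
late_logits = 0.5 * (img_clf(img_feat) + txt_clf(txt_feat))

# Intermediate fusion: the features themselves are combined, so the classifier
# can model relationships between image and text attributes in the feature space.
joint_clf = nn.Linear(d_img + d_txt, n_cls)
inter_logits = joint_clf(torch.cat([img_feat, txt_feat], dim=-1))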
The ability to model relationships in the feature space also depends on the nature of the extracted image and text features. Existing intermediate fusion methods such as ConcatBERT (Kiela et al., 2020) pretrain the image and text encoders independently in a unimodal fashion. This can result in divergent image and text feature spaces, making it difficult to learn any relationship between them. Thus, there is a need to “align” the image and text features through multimodal pretraining. Moreover, hateful meme detection requires a faithful characterization of the interactions between fine-grained image and text attributes. Toward this goal, we make the following contributions in this paper:
• We propose an architecture called Hate-CLIPper for multimodal hateful meme classification, which relies on an intermediate fusion of aligned image and text representations obtained using the multimodally pretrained Contrastive Language-Image Pretraining (CLIP) encoders (Radford et al., 2021).
• We utilize bilinear pooling (outer product) for the intermediate fusion of the image and text features in Hate-CLIPper. We refer to this representation as the feature interaction matrix (FIM), which explicitly models the correlations between the dimensions of the image and text feature spaces (a minimal sketch of this fusion is given after this list). Owing to the expressiveness of the FIM representation built on the robust CLIP encoders, we show that a simple classifier trained for only a few epochs is sufficient to achieve state-of-the-art performance for hateful meme classification on three benchmark datasets, without any additional input features such as object bounding boxes, face detections, or text attributes.
• We demonstrate the interpretability of FIM by identifying salient locations in the FIM that trigger the classification decision and clustering the resulting trigger vectors. Results indicate that FIM indeed facilitates the learning of meaningful concepts.
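To illustrate the kind of intermediate fusion described above, the sketch below obtains image and text embeddings through the Hugging Face CLIP interface and forms a feature interaction matrix as an outer product, followed by a simple classification head. It is a minimal sketch under assumed choices (the "openai/clip-vit-base-patch32" checkpoint, placeholder inputs, L2 normalization, and a single linear head), not the exact Hate-CLIPper implementation.

import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with image/text projections would do.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder for a meme image
text = "placeholder meme text"        # placeholder for the overlaid text

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])         # (1, d)
    txt_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])      # (1, d)

# Feature interaction matrix (FIM): outer product of the (here L2-normalized)
# embeddings, pairing every image-feature dimension with every text-feature dimension.
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
fim = torch.einsum("bi,bj->bij", img_feat, txt_feat)  # (1, d, d)

# A simple classifier over the flattened FIM (placeholder head, 2 classes).
d = img_feat.shape[-1]
classifier = nn.Linear(d * d, 2)
logits = classifier(fim.flatten(start_dim=1))

Flattening the d x d interaction matrix is the simplest way to feed it to a classifier; lower-dimensional alternatives (e.g., keeping only element-wise products of the aligned dimensions) trade expressiveness for fewer parameters.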
2 Related Work
The Hateful Memes Challenge (HMC) competition (Kiela et al., 2020) established a benchmark dataset for hateful meme detection and evaluated the performance of humans as well as unimodal and multimodal ML models. The unimodal models in the HMC competition include: Image-Grid, based on ResNet-152 (He et al., 2016) features; Image-Region, based on Faster R-CNN (Ren et al., 2017) features; and Text-BERT, based on the original BERT (Devlin et al., 2018) features. The multimodal models include: Concat BERT, which uses a multilayer perceptron classifier on the concatenated ResNet-152 (image) and original BERT (text) features; MMBT (Kiela et al., 2019) models, with Image-Grid and Image-Region features; ViLBERT (Lu et al., 2019); and Visual BERT (Li et al., 2019). A late fusion approach based on the mean of the Image-Region and Text-BERT output scores was also considered. All the above models were benchmarked on the “test seen” split using the area under the receiver operating characteristic curve (AUROC) (Bradley, 1997) metric.
The results indicate a large performance gap between humans (AUROC of 82.65¹) and the best baseline using Visual BERT (AUROC of 75.44). The challenge report (Kiela et al., 2021), which was released after the end of the competition,

¹ https://ai.facebook.com/blog/hateful-memes-challenge-and-data-set/