
the overall gains in accuracy largely stem from increased performance on previously identified weaknesses, demonstrating that the strategies work as intended.
Overall, our primary contributions are the following:

C1 An error analysis of vanilla NLI-based zero-shot hate speech detection.

C2 The development of strategies that combine multiple hypotheses to improve zero-shot hate speech detection.

C3 An evaluation and error analysis of the proposed strategies.
2 Background and Related Work
Early approaches to hate speech detection focused on English social media posts, especially Twitter, and treated the task as binary or ternary text classification (Waseem and Hovy, 2016; Davidson et al., 2017; Founta et al., 2018). More recent work has introduced additional labels that indicate whether a post is group-directed, who the targeted group is, whether the post calls for violence, is aggressive, contains stereotypes, expresses hate implicitly, or uses sarcasm or irony (Mandl et al., 2019, 2020; Sap et al., 2020; ElSherief et al., 2021; Röttger et al., 2021; Mollas et al., 2022). Sometimes hate speech is not annotated directly; instead, labels such as racism, sexism, and homophobia, which already combine hostility with a specific target, are annotated and predicted (Waseem and Hovy, 2016; Waseem, 2016; Saha et al., 2018; Lavergne et al., 2020).
While early approaches relied on manual feature engineering (Waseem and Hovy, 2016), most current approaches are based on pre-trained transformer-based language models that are then fine-tuned on hate speech datasets (Florio et al., 2020; Uzan and HaCohen-Kerner, 2021; Banerjee et al., 2021; Lavergne et al., 2020; Das et al., 2021; Nghiem and Morstatter, 2021).
Some work has focused on reducing the need for labeled data through multi-task learning on different sets of hate speech labels (Kapil and Ekbal, 2020; Safi Samghabadi et al., 2020) or by adding sentiment analysis as an auxiliary task (Plaza-Del-Arco et al., 2021). Others have worked on reducing the need for non-English annotations by adapting hate speech detection models from high- to low-resource languages in a cross-lingual zero-shot setting (Stappen et al., 2020; Pamungkas et al., 2021). However, this approach has been criticized for being unreliable when encountering language-specific taboo interjections (Nozza, 2021).

name             # examples   classes
HateCheck        3,728        hateful (68.8%), non-hate (31.2%)
ETHOS (binary)   997          hate speech (64.1%), not-hate speech (35.9%)

Table 1: The number of examples and the class balance of the datasets.
2.1 Zero-Shot Text Classification
The advent of large language models has enabled zero-shot and few-shot text classification approaches such as prompting (Liu et al., 2021) and task descriptions (Raffel et al., 2020), which convert the target task into the pre-training objective and are usually only used in combination with large language models. Chiu and Alexander (2021) use the prompts "Is this text racist?" and "Is this text sexist?" to detect hate speech with GPT-3. Schick et al. (2021) show that toxicity in large generative language models can be avoided by using similar prompts to self-diagnose toxicity during decoding.
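As a rough illustration of this prompting setup (a sketch, not the exact procedure of Chiu and Alexander, 2021, who query GPT-3), a yes/no question can be appended to the input and scored with an open generative language model; the checkpoint, prompt wording, and helper function below are illustrative assumptions.

# Sketch: prompt-based zero-shot detection by comparing the model's
# likelihood of answering "Yes" vs. "No" to a question about the input.
# Model choice and prompt wording are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_log_prob(prompt: str, answer: str) -> float:
    # Log-probability of the answer tokens conditioned on the prompt.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        log_probs = model(input_ids).logits.log_softmax(dim=-1)
    total = 0.0
    for i in range(answer_ids.shape[1]):
        # Logits at position p predict the token at position p + 1.
        pos = prompt_ids.shape[1] + i - 1
        total += log_probs[0, pos, answer_ids[0, i]].item()
    return total

text = "some social media post"
prompt = f'"{text}"\nIs this text hateful? Answer:'
is_hateful = answer_log_prob(prompt, " Yes") > answer_log_prob(prompt, " No")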
In contrast, NLI-based prediction, in which a target task is converted to an NLI task and fed into an NLI model, converts the target task into the fine-tuning task. Here, a model is given a premise and a hypothesis and is tasked with predicting whether the premise entails the hypothesis, contradicts it, or is neutral towards it. Yin et al. (2019) proposed to use an NLI model for zero-shot topic classification by inputting the text to classify as the premise and constructing for each topic a hypothesis of the form "This text is about <topic>". They map the labels neutral and contradiction to not-entailment. We can then interpret a prediction of entailment as predicting that the input text belongs to the topic in the given hypothesis; conversely, not-entailment implies that the text is not about the topic. Wang et al. (2021) show for a range of tasks, including offensive language identification, that this task reformulation also benefits few-shot learning scenarios. Recently, AlKhamissi et al. (2022) obtained large performance improvements in few-shot learning for hate speech detection by (1) decomposing the task into four subtasks and (2) additionally training the few-shot model on a knowledge base.
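To make the NLI-based reformulation concrete, the sketch below uses the Hugging Face zero-shot-classification pipeline, which implements the premise/hypothesis setup described above; the checkpoint, candidate labels, and hypothesis wording are illustrative assumptions rather than the exact configuration of the cited works.

# Sketch: NLI-based zero-shot classification. The input text becomes the
# premise and each candidate label is inserted into a hypothesis template.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")  # assumed NLI checkpoint

result = classifier(
    "some social media post",
    candidate_labels=["hateful", "not hateful"],
    hypothesis_template="This text is {}.",
)
print(result["labels"][0], result["scores"][0])  # top label and its score

Internally, the pipeline scores how strongly the premise entails each filled-in hypothesis, which roughly corresponds to the entailment vs. not-entailment mapping of Yin et al. (2019).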