
then sample from non-hateful entries, so that the
proportion of hate in the two datasets increases to
50.0% and 22.0%, respectively.
2.1.2 Fine-Tuning 2: Target Language
For fine-tuning in the target language, we use one
of five datasets in five different target languages.
BAS19_ES, compiled by Basile et al. (2019) for SemEval 2019, contains 4,950 Spanish tweets, of which 41.5% are labelled as hateful. FOR19_PT by Fortuna et al. (2019) contains 5,670 Portuguese tweets, of which 31.5% are labelled as hateful. HAS21_HI, compiled by Modha et al. (2021) for HASOC 2021, contains 4,594 Hindi tweets, of which 12.3% are labelled as hateful. OUS19_AR by Ousidhoum et al. (2019) contains 3,353 Arabic tweets, of which 22.5% are labelled as hateful. SAN20_IT, compiled by Sanguinetti et al. (2020) for EvalIta 2020, contains 8,100 Italian tweets, of which 41.8% are labelled as hateful.
From each of these five target-language datasets, we randomly sample differently-sized subsets for target-language fine-tuning. As in English, we set aside 500 entries for development and 2,000 for testing (due to its limited size, we only set aside 300 development and 1,000 test entries for OUS19_AR, n=3,353). From the remaining data, we sample subsets in 12 different sizes – 10, 20, 30, 40, 50, 100, 200, 300, 400, 500, 1,000 and 2,000 entries – so that we can evaluate the effects of using more or less labelled data within and across different orders of magnitude. Every sample contains at least one hateful entry.
Zhao et al. (2021) show
that there can be large sampling effects when fine-
tuning on small amounts of data. To mitigate this
issue, we use 10 different random seeds for each
sample size, so that in total we have 120 different
samples in each language, and 600 samples across
the five non-English languages.
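To make the procedure concrete, the sketch below shows one way the subsampling could be implemented, assuming pandas DataFrames with a binary label column; the re-draw step is one possible way to guarantee that every sample contains at least one hateful entry, not necessarily our exact implementation.

```python
# Minimal sketch of the subsampling procedure: 12 sample sizes x 10 random
# seeds per target-language training set, drawn after the dev and test
# entries have been held out. Column names are illustrative assumptions.
import pandas as pd

SAMPLE_SIZES = [10, 20, 30, 40, 50, 100, 200, 300, 400, 500, 1_000, 2_000]
SEEDS = range(10)

def make_samples(train_df: pd.DataFrame) -> dict:
    """Return {(size, seed): subsample} for one target-language training set."""
    samples = {}
    for size in SAMPLE_SIZES:
        for seed in SEEDS:
            state = seed
            sample = train_df.sample(n=size, random_state=state)
            # re-draw until the subset contains at least one hateful entry
            # (labels are assumed to be 0 = non-hateful, 1 = hateful)
            while sample["label"].sum() == 0:
                state += 1_000  # shift to a fresh random state
                sample = train_df.sample(n=size, random_state=state)
            samples[(size, seed)] = sample
    return samples
```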
2.2 Models
Multilingual Models
We fine-tune and evaluate XLM-T (Barbieri et al., 2022), an XLM-R model (Conneau et al., 2020) pre-trained on an additional 198 million Twitter posts in over 30 languages. XLM-R is a widely-used architecture for multilingual language modelling, which has been shown to achieve near state-of-the-art performance on multilingual hate speech detection (Banerjee et al., 2021; Modha et al., 2021). We chose XLM-T because it strongly outperformed XLM-R across our target-language test sets in initial experiments.
Monolingual Models
For each of the five target languages, we also fine-tune and evaluate a monolingual transformer model from HuggingFace. For Spanish, we use RoBERTuito (Pérez et al., 2021). For Portuguese, we use BERTimbau (Souza et al., 2020). For Hindi, we use Hindi BERT. For Arabic, we use AraBERT v2 (Antoun et al., 2020). For Italian, we use UmBERTo. Details on model training can be found in Appendix A.
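As a rough illustration of the training setup, the sketch below outlines sequential fine-tuning (English first, then the target language) with the HuggingFace Trainer API; the model identifier, hyperparameters and dataset variables are illustrative assumptions rather than our exact configuration, which is given in Appendix A.

```python
# Minimal sketch of sequential fine-tuning with the HuggingFace Trainer API.
# Model ID and hyperparameters are assumptions; see Appendix A for details.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "cardiffnlp/twitter-xlm-roberta-base"  # assumed hub ID for XLM-T

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def finetune(model, train_dataset, output_dir):
    """Fine-tune the binary hate speech classifier on one tokenised dataset."""
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=train_dataset).train()
    return model

# e.g. X+KEN+AR: 20,000 KEN20_EN entries first, then one OUS19_AR sample
# (english_train and arabic_train would be tokenised datasets)
# model = finetune(model, english_train, "ckpt_en")
# model = finetune(model, arabic_train, "ckpt_ar")
```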
Model Notation
We denote all models by an additive code. The first part is either M for a monolingual model or X for XLM-T. For XLM-T, the second part of the code is DEN, FEN or KEN, for models fine-tuned on 20,000 entries from DYN21_EN, FOU18_EN or KEN20_EN. For all models, the final part of the code is ES, PT, HI, AR or IT, corresponding to the target language that the model was fine-tuned on. For example, M+IT denotes the monolingual Italian model, UmBERTo, fine-tuned on SAN20_IT, and X+KEN+AR denotes an XLM-T model fine-tuned first on 20,000 English entries from KEN20_EN and then on OUS19_AR.
2.3 Evaluation Setup
Held-Out Test Sets + MHC
We test all models on the held-out test sets corresponding to their target-language fine-tuning data, to evaluate their in-domain performance (§2.4). For example, we test X+KEN+IT, which was fine-tuned on SAN20_IT data, on the SAN20_IT test set. Additionally, we test all models on the matching target-language test suite from Multilingual HateCheck (MHC). MHC is a collection of around 3,000 test cases for different kinds of hate as well as challenging non-hate in each of ten different languages (Röttger et al., 2022). We use MHC to evaluate out-of-domain generalisability (§2.5).
Evaluation Metrics
We use macro F1 to evaluate model performance because most of our test sets as well as MHC are imbalanced. To give context for interpreting performance, we show baseline model results in all figures: macro F1 for always predicting the hateful class ("always hate"), for never predicting the hateful class ("never hate") and for predicting both classes with equal probability ("50/50"). We also show bootstrapped 95% confidence intervals around the average macro F1, which is calculated across the 10 random seeds for each sample size. These confidence intervals are expected to be wider for models fine-tuned on less data because of larger sampling effects.
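The sketch below illustrates how the baseline scores and bootstrapped confidence intervals can be computed with scikit-learn and NumPy; function names and the number of bootstrap resamples are illustrative assumptions.

```python
# Minimal sketch of the evaluation metrics: macro F1 for the three baselines
# and a bootstrapped 95% confidence interval over the per-seed macro F1 scores.
import numpy as np
from sklearn.metrics import f1_score

def baseline_macro_f1(y_true, seed=0):
    """Macro F1 for the 'always hate', 'never hate' and random '50/50' baselines."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)  # binary labels: 1 = hateful, 0 = non-hateful
    return {
        "always hate": f1_score(y_true, np.ones_like(y_true), average="macro"),
        "never hate": f1_score(y_true, np.zeros_like(y_true), average="macro"),
        "50/50": f1_score(y_true, rng.integers(0, 2, size=len(y_true)), average="macro"),
    }

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI around the mean macro F1 across the 10 seeds for one sample size."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])
```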