
guages, it is likely that similarly annotated data
in the source and target languages exist for other
tasks. A language model jointly fine-tuned on such
a task in the two languages can learn patterns and
knowledge that bridge the gap between the languages,
helping the hate speech detection model transfer
between them.
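The bridging idea above can be sketched as a joint training schedule: batches from the source-language hate speech data are mixed with auxiliary-task batches in both languages, so the shared multilingual encoder only ever sees the target language through the auxiliary task. This is a minimal illustrative sketch, not the paper's actual implementation; all function and dataset names are hypothetical.

```python
import random

def joint_schedule(hate_src, aux_src, aux_tgt, seed=0):
    """Mix batches from every available dataset into one training epoch.

    Each item is tagged with (task, language) so a trainer can route it
    to the right task-specific head on top of a shared multilingual
    encoder. Hate speech batches exist only in the source language;
    target-language supervision comes solely from the auxiliary task.
    """
    epoch = (
        [("hate", "src", b) for b in hate_src]
        + [("aux", "src", b) for b in aux_src]
        + [("aux", "tgt", b) for b in aux_tgt]
    )
    random.Random(seed).shuffle(epoch)
    return epoch

schedule = joint_schedule(
    hate_src=["hs-en-batch0", "hs-en-batch1"],  # hate speech, source only
    aux_src=["sent-en-batch0"],                 # e.g. sentiment, source
    aux_tgt=["sent-it-batch0", "sent-it-batch1"],  # e.g. sentiment, target
)
# No ("hate", "tgt", ...) entry can ever appear: the target language
# reaches the shared encoder only through the auxiliary task.
```

In a zero-shot setting, the hate speech head is then evaluated directly on target-language text it was never trained on, relying on the shared encoder updated by both tasks.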
In summary, our work focuses on zero-shot
cross-lingual multitask architectures where annotated
hate speech data is available only for one source
language, but some annotated data for other tasks
can be accessed in both the source and target
languages. Using a multitask architecture (van der
Goot et al., 2021b) on top of a multilingual model,
we investigate the impact on transfer effectiveness
of auxiliary tasks operating at different linguistic
levels of the sentence (part-of-speech (POS) tagging,
named entity recognition (NER), dependency parsing,
and sentiment analysis). Using Nozza (2021)'s
original set of languages and datasets (hate speech
against women and immigrants, from Twitter datasets
in English, Italian, and Spanish), our main
contributions are as follows.
• Building strictly comparable corpora across
languages,2 leading to a thorough evaluation
framework, we highlight cases where zero-shot
cross-lingual transfer of hate speech detection
models fails, and diagnose the effect of the choice
of the multilingual language model.
• We identify auxiliary tasks with a positive
impact on cross-lingual transfer when trained
jointly with hate speech detection: sentiment
analysis and NER. The impact of syntactic tasks
is more mixed.
•
Using the HateCheck test suite (Röttger et al.,
2021,2022), we identify which hate speech
classes of functionalities suffer the most from
cross-lingual transfer, highlighting the impact
of slurs; and which ones benefit from joint
training with multilingual auxiliary tasks.
2 Related Work
Intermediate task training.
To improve the performance of a pre-trained
language model on a given task, the model can
undergo preliminary fine-tuning on an intermediate
task before being fine-tuned again on the
downstream task. This idea
2 Our comparable datasets are available at https://github.com/ArijRB/Multilingual-Auxiliary-Tasks-Training-Bridging-the-Gap-between-Languages-for-Zero-Shot-Transfer-of-/.
was formalized as Supplementary Training on
Intermediate Labeled-data Tasks (STILT) by Phang
et al. (2018), who perform sequential task-to-task
pre-training. More recently, Pruksachatkun et al.
(2020) survey intermediate and target task pairs
to analyze the usefulness of this intermediate
fine-tuning, but only in a monolingual setting.
Phang et al. (2020) turn towards cross-lingual
STILT. They fine-tune a language model on nine
intermediate language-understanding tasks in
English and apply it to a set of non-English target
tasks. They show that machine-translating
intermediate task data for training, or using a
multilingual language model, does not improve the
transfer compared to English training data.
However, to the best of our knowledge, using
intermediate training data for a task available in
both the source and the target languages has not
been tested in the literature.
Auxiliary tasks for hate speech detection.
Auxiliary task training for hate speech detection
has been used almost exclusively with the sentiment
analysis task (Bauwelinck and Lefever, 2019;
Aroyehun and Gelbukh, 2021), and only in
monolingual scenarios. However, additional
information is sometimes injected into hate speech
classifiers in other ways. Gambino and Pirrone
(2020), among the best systems on the HaSpeeDe task
of EVALITA 2020, use POS-tagged text as input to
their classification systems, which proves highly
beneficial for Spanish and somewhat less so for
German and English. The effect of syntactic
information is also investigated by Narang and Brew
(2020), who use classifiers based on the syntactic
structure of the text for abusive language
detection. Markov et al. (2021) evaluate the impact
of manually extracted POS, stylometric, and
emotion-based features on hate speech detection,
showing that the latter two are robust features
across languages.
Zero-shot cross-lingual transfer for hate speech
detection.
Due to the lack of annotated data in many languages
and domains for hate speech detection, zero-shot
cross-lingual transfer has received considerable
attention in the literature. Among the most recent
work, Pelicon et al. (2021) investigate the impact
of preliminary training of a classification model
on hate speech data in languages different from the
target language; they show that language models
pre-trained on a small number of languages benefit
more from this intermediate training, and often out-