
tion (Caselli et al., 2021).
The rest of this paper is structured as follows: Section 2 explains the methods used in this study. The results, critical analysis with XAI, and discussion are in Section 3. Section 4 provides an overview of HS and prior work in the field. Section 5 gives the conclusion and possible future work.
2 Methodology
All the experiments were conducted on a shared DGX-1 machine with 8 × 32GB Nvidia V100 GPUs. The operating system (OS) of the server is Ubuntu 18 and it has 80 CPU cores. Each experiment is conducted 3 times and the average results are computed. Each experiment is run for a total of six epochs, and the model checkpoint with the lowest validation loss is saved and used for evaluation on the test set, where available. A linear schedule with warm-up is used for learning rate (LR) adjustment for T5 and RoBERTa. Only a limited set of hyperparameters is explored, through manual tuning, for all the models, due to time and resource constraints.
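For reference, a minimal sketch of setting up such a schedule with the HuggingFace transformers library is given below; the optimizer, learning rate and warm-up ratio shown are illustrative assumptions, not necessarily the exact values used in our experiments.

# Illustrative sketch: linear LR schedule with warm-up (values are assumptions).
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_scheduler(model, steps_per_epoch, epochs=6,
                                  lr=2e-5, warmup_ratio=0.1):
    total_steps = steps_per_epoch * epochs
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler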
Brief details about all the models used are given in Appendix B. Appendix C gives more information on the data preprocessing, the evaluation metrics, the ensemble, and cross-task training. The average time per epoch for training and evaluation on the validation set is 83.52, 7.82 & 22.29 seconds for the OLID, HASOC 2020 & HASOC 2021 datasets, respectively.3
2.1 Solving OOC Predictions in T5
Raffel et al. (2020) introduced T5 and noted the possibility of OOC predictions in the model. This occurs when the model predicts text (or an empty string) that was seen during training but is not among the class labels. The issue appears to be more common in the initial epochs of training and sometimes does not occur at all. We experienced this challenge in both libraries we attempted to develop with.4
In order to solve this, we first introduced integers (explicitly type-cast as strings) as class labels, which appears to make the model predictions more stable: when OOC predictions occurred in pilot studies, the issue was reduced by about 50%. For example, for the HASOC datasets, we substituted "1" and "0" for the labels "NOT" and "HOF", respectively. As a second step, we introduced a simple correction: replacing an OOC prediction (if it occurs) with the label of the largest class in the training set.

3 Restrictions (cpulimit) were implemented to avoid overloading the server, in fairness to other users. Hence, the average time for the test sets ranges from 2 to over 24 hours.
4 HuggingFace & Simple Transformers
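A minimal sketch of this two-step fix is given below, using the HASOC label mapping as an example; the function names are illustrative rather than our exact implementation.

# Sketch of the OOC handling described above (names are illustrative).
from collections import Counter

# Step 1: integer labels, type-cast as strings (HASOC: "NOT" -> "1", "HOF" -> "0").
LABEL2ID = {"NOT": "1", "HOF": "0"}
VALID_LABELS = set(LABEL2ID.values())

def majority_label(train_labels):
    # Most frequent label string in the training set.
    return Counter(train_labels).most_common(1)[0][0]

def fix_ooc_prediction(prediction, train_labels):
    # Step 2: replace an OOC prediction (including an empty string)
    # with the label of the largest class in the training set.
    if prediction not in VALID_LABELS:
        return majority_label(train_labels)
    return prediction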
2.2 Data Augmentation
The objective of data augmentation is to increase the number of training samples in order to improve the performance of models on the evaluation set (Feng et al., 2021). We experimented with two techniques: 1) word-level deletion of the start and end words of each sample and 2) conversational AI text generation (Table 2). Our work may be the first to use conversational AI for data augmentation. It doubles the number of samples and provides diversity; around 16 new words are generated per sample prompt, on average. More details about the two techniques are given in Appendix C.3.
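A minimal sketch of the first technique is given below; the exact variant (e.g., whether very short samples are skipped) is an assumption for illustration.

# Illustrative sketch: word-level deletion of the start and end words per sample.
def delete_start_end_words(text):
    words = text.split()
    if len(words) <= 2:          # too short to delete both ends meaningfully
        return None
    return " ".join(words[1:-1])

def augment_by_deletion(samples):
    # Apply start/end word deletion to every sample, skipping very short ones.
    new = [delete_start_end_words(s) for s in samples]
    return [s for s in new if s]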
Type        Sample
original    Son of a *** wrong "you’re"
augmented   son of a *** wrong youre No, that’s Saint Johns Chop House. I need a taxi to take me from the hotel to the restaurant, leaving the first at 5:45.
original    SO EXCITED TO GET MY CovidVaccine I hate you covid!
augmented   so excited to get my covidvaccine i hate you covid You should probably get that checked out by a gastroenterology department.
original    ModiKaVaccineJumla Who is responsible for oxygen? ModiResign Do you agree with me? ❤️ Don’t you agree with me?
augmented   modikavaccinejumla who is responsible for oxygen modiresign do you agree with me âï dont you agree with me Yes, I definitely do not want to work with them again. I appreciate your help..

Table 2: Original and conversational AI-augmented examples from the HASOC 2021 dataset (offensive words masked with "***").
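The sketch below illustrates the second technique as applied in Table 2: each (preprocessed) sample serves as a prompt to a dialogue model and the generated reply is appended to it, roughly doubling the data. The model name and generation settings are assumptions for illustration; the actual setup is described in Appendix C.3.

# Illustrative sketch: conversational AI augmentation (model and settings are assumptions).
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")

def conversational_augment(samples, max_new_tokens=20):
    augmented = []
    for text in samples:
        out = generator(text, max_new_tokens=max_new_tokens, do_sample=True,
                        pad_token_id=generator.tokenizer.eos_token_id)[0]["generated_text"]
        reply = out[len(text):].strip()       # keep only the generated reply
        augmented.append(f"{text} {reply}")   # original sample + new reply
    return augmented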
3 Results and Discussion
Tables 3, 4 and 5 (Appendix E) show the baseline results, additional results using the best model (T5), and the cross-task results with T5, respectively. Table 6 (Appendix E) shows results for other datasets and the HateBERT model (Caselli et al., 2021). The HatEval task is the only one directly comparable between our work and that of Caselli et al. (2021).
The Baselines: The Transformer-based models (T5 and RoBERTa) generally perform better than the other baselines (LSTM and CNN) (Zampieri et al., 2019b), except for RoBERTa on OLID subtask B and HASOC 2021 subtask A. T5 outperforms RoBERTa on all tasks. Based on the test set results, the LSTM obtains better results than