T5 for Hate Speech, Augmented Data and Ensemble
Tosin Adewumi, Sana Sabah Sabry, Nosheen Abid, Foteini Liwicki & Marcus Liwicki
ML Group, EISLAB, Luleå University of Technology, Sweden
firstname.lastname@ltu.se
Abstract
We conduct relatively extensive investigations of automatic hate speech (HS) detection using different state-of-the-art (SoTA) baselines over 11 subtasks of 6 different datasets. Our motivation is to determine which of the recent SoTA models is best for automatic hate speech detection and what advantage methods like data augmentation and ensemble may have on the best model, if any. We carry out 6 cross-task investigations. We achieve new SoTA on two subtasks: macro F1 scores of 91.73% and 53.21% for subtasks A and B of the HASOC 2020 dataset, where previous SoTA are 51.52% and 26.52%, respectively. We achieve near-SoTA on two others: macro F1 scores of 81.66% for subtask A of the OLID 2019 dataset and 82.54% for subtask A of the HASOC 2021 dataset, where SoTA are 82.9% and 83.05%, respectively. We perform error analysis and use two explainable artificial intelligence (XAI) algorithms (IG and SHAP) to reveal, using examples, how two of the models (Bi-LSTM and T5) make their predictions. Other contributions of this work are 1) the introduction of a simple, novel mechanism for correcting out-of-class (OOC) predictions in T5, 2) a detailed description of the data augmentation methods, 3) the revelation of the poor data annotations in the HASOC 2021 dataset by using several examples and XAI (buttressing the need for better quality control), and 4) the public release of our model checkpoints and codes to foster transparency.1
1 Introduction
Any disparaging remark targeted at an individual or group of persons is usually considered hate speech (HS) (Nockleby, 2000; Brown, 2017). It is considered unethical in many countries and illegal in some (Brown, 2017; Quintel and Ullrich, 2020; Fortuna and Nunes, 2018). Manual detection
1 Available after the anonymity period.
of HS content is a tedious task that can result in delays in stopping harmful behaviour.2
Automatic
hate speech detection is, therefore, crucial and has
been gaining increasing importance because of the
rising influence of social media in many societies.
It will facilitate the elimination or prevention of undesirable characteristics in data, and by extension in AI technologies such as conversational systems (Zhang et al., 2020; Adewumi et al., 2021). Examples of HS that may incite others to violence, from the offensive language identification dataset (OLID) (Zampieri et al., 2019a), are given in Table 1.
id     tweet
23352  @USER Antifa simply wants us to k*ll them. By the way. Most of us carry a back up. And a knife
61110  @USER @USER Her life is crappy because she is crappy. And she's threatening to k*ll everyone. Another nut job... Listen up FBI!
68130  @USER @USER @USER @USER @USER Yes usually in THOSE countries people k*ll gays cuz religion advise them to do it and try to point this out and antifa will beat you. No matter how u try in america to help gay in those countries it will have no effect cuz those ppl hate america.

Table 1: Inciteful examples from the OLID 2019 training set (parts of offensive words masked with "*").
Short details of the datasets in this work are provided in Appendix A. The datasets were selected based on the important subtasks covered with regard to HS or abusive language. The architectures employed include the Bidirectional Long Short-Term Memory network (Bi-LSTM), the Convolutional Neural Network (CNN), the Robustly optimized BERT approach (RoBERTa)-Base, and the Text-to-Text Transfer Transformer (T5)-Base, where the last two are pretrained models from the HuggingFace hub. As the best-performing baseline model, T5-Base is then used on the augmented data for the HASOC 2021 subtasks A & B and for an ensemble. In addition, we compare results from HateBERT, a re-trained BERT model for abusive language detection (Caselli et al., 2021).
2 bbc.com/news/world-europe-35105003
arXiv:2210.05480v1 [cs.CL] 11 Oct 2022
The rest of this paper is structured as follows: Section 2 explains the methods used in this study. The results, critical analysis with XAI and discussion are in Section 3. Section 4 provides an overview of HS and prior work in the field. Section 5 gives the conclusion and possible future work.
2 Methodology
All the experiments were conducted on a shared DGX-1 machine with 8 × 32GB Nvidia V100 GPUs. The operating system (OS) of the server is Ubuntu 18 and it has 80 CPU cores. Each experiment is conducted 3 times and the average results are computed. Each experiment runs for a total of six epochs, and the model checkpoint with the lowest validation loss is saved and used for evaluation on the test set, where available. A linear schedule with warm-up is used for the learning rate (LR) adjustment for T5 and RoBERTa. Only limited hyperparameters are explored, through manual tuning, for all the models due to time and resource constraints.
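The linear schedule with warm-up mentioned above can be sketched as a learning-rate multiplier (a minimal illustration under our own naming, not the exact library implementation used by the authors):

```python
def linear_schedule_with_warmup(step, warmup_steps, total_steps):
    """LR multiplier: ramps linearly from 0 to 1 over the warm-up steps,
    then decays linearly back to 0 by the final step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

At each optimizer step, the base LR is multiplied by this factor, so training starts gently, peaks at the end of warm-up, and cools down toward zero.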
Short details about all the models used are discussed in Appendix B. Appendix C gives more information on the data preprocessing, metrics for evaluation, the ensemble, and cross-task training. Average time per epoch for training and evaluation on the validation set is 83.52, 7.82 & 22.29 seconds for the OLID, HASOC 2020 & HASOC 2021 datasets, respectively.3
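Since results in this paper are reported as macro F1, a minimal sketch of the metric follows (the unweighted mean of per-class F1 scores, so every class counts equally regardless of size; the helper name is ours):

```python
def macro_f1(y_true, y_pred, labels):
    """Macro F1: average the per-class F1 scores with equal weight."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)
```

This equal weighting is why macro F1 can be low on imbalanced subtasks (e.g. subtask B) even when weighted F1 is high.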
2.1 Solving OOC Predictions in T5
Raffel et al. (2020) introduced T5 and noted the possibility of OOC predictions in the model. This occurs when the model predicts text (or an empty string) seen during training that is not among the class labels. The issue appears more common in the initial epochs of training and sometimes does not occur at all. We experienced this challenge in the two libraries we attempted to develop with.4 To solve this, we first introduced integers (explicitly type-cast as strings) as class labels, which appears to make the model predictions more stable; in pilot studies, occurrences of the issue were reduced by about 50%. For example, for the HASOC datasets, we substituted "1" and "0" for the labels "NOT" and "HOF", respectively. As a second step, we introduced a simple correction: replace the OOC prediction (if it occurs) with the label of the largest class in the training set.
3 Restrictions (cpulimit) were implemented to avoid server overloading, in fairness to other users. Hence, average time for the test sets ranges from 2 to over 24 hours.
4 HuggingFace & Simple Transformers
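The two-step mechanism can be sketched as follows (a minimal illustration; the function name is ours, and the label mapping follows the HASOC substitution described above):

```python
from collections import Counter

# Step 1: integers, explicitly cast as strings, replace the text labels.
LABEL_MAP = {"NOT": "1", "HOF": "0"}
VALID = set(LABEL_MAP.values())

def correct_ooc(predictions, train_labels):
    """Step 2: replace any out-of-class generation (e.g. an empty
    string or stray text) with the label of the largest training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [p if p in VALID else majority for p in predictions]
```

For instance, with training labels ["1", "1", "0"], the predictions ["1", "", "0", "hate"] are corrected to ["1", "1", "0", "1"], since "1" is the majority class.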
2.2 Data Augmentation
The objective of data augmentation is to increase the number of training samples in order to improve the performance of models on the evaluation set (Feng et al., 2021). We experimented with 2 techniques: 1) word-level deletion of the start and end words per sample and 2) conversational AI text generation (Table 2). Our work may be the first to use conversational AI for data augmentation. It doubles the number of samples and provides diversity. The average number of new words generated per sample prompt is around 16. More details about the 2 techniques are found in Appendix C.3.
Type       Sample
original   Son of a *** wrong "you're"
augmented  son of a *** wrong youre No, that's Saint Johns Chop House. I need a taxi to take me from the hotel to the restaurant, leaving the first at 5:45.
original   SO EXCITED TO GET MY CovidVaccine I hate you covid!
augmented  so excited to get my covidvaccine i hate you covid You should probably get that checked out by a gastroenterology department.
original   ModiKaVaccineJumla Who is responsible for oxygen? ModiResign Do you agree with me? ❤️ Don't you agree with me?
augmented  modikavaccinejumla who is responsible for oxygen modiresign do you agree with me dont you agree with me Yes, I definitely do not want to work with them again. I appreciate your help..

Table 2: Original and conversational AI-augmented examples from the HASOC 2021 dataset (offensive words masked with "***").
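The first technique (word-level deletion of the start and end words per sample) admits a simple sketch. One plausible reading, under our own helper name, emits two variants per sample: one without the first word and one without the last:

```python
def word_deletion_augment(sample):
    """Augment by deleting the start word and, separately, the end word.
    Very short samples are left unaugmented to avoid destroying meaning."""
    words = sample.split()
    if len(words) < 3:
        return []
    return [" ".join(words[1:]), " ".join(words[:-1])]
```

Each original sample then contributes up to two additional training samples.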
3 Results and Discussion
Tables 3, 4 and 5 (Appendix E) show baseline results, additional results using the best model (T5), and the cross-task results with T5, respectively. Table 6 (Appendix E) shows results for other datasets and the HateBERT model (Caselli et al., 2021). The HatEval task is the only one comparable between our work and that of Caselli et al. (2021).
The Baselines: The Transformer-based models (T5 and RoBERTa) generally perform better than the other baselines (Bi-LSTM and CNN) (Zampieri et al., 2019b), except for RoBERTa on the OLID subtask B and HASOC 2021 subtask A. T5 outperforms RoBERTa on all tasks. Based on the test set results, the Bi-LSTM obtains better results than
Task Weighted F1 (%) Macro F1 (%)
Dev (sd) Test (sd) Dev (sd) Test (sd)
Bi-LSTM
OLID A 79.59 (0.89) 83.89 (0.57) 78.48 (1.52) 79.49 (0)
OLID B 82.50 (1.70) 83.46 (0) 46.76 (0) 47.32 (0)
OLID C 49.75 (3.95) 43.82 (9.63) 35.65 (2.81) 36.82 (0)
Hasoc 2021 A 78.05 (0.85) 78.43 (0.84) 77.99 (1.79) 77.19 (0)
Hasoc 2021 B 50.65 (1.34) 52.19 (1.95) 43.19 (2.09) 42.25 (0)
CNN
OLID A 79.10 (0.26) 82.47 (0.56) 77.61 (0.39) 78.46 (0)
OLID B 82.43 (0.49) 83.46 (0) 46.76 (0) 47.88 (0)
OLID C 47.54 (1.36) 38.09 (3.91) 35.65 (0) 36.85 (0)
Hasoc 2021 A 77.22 (0.52) 77.63 (0.70) 74.28 (0.58) 75.67 (0)
Hasoc 2021 B 55.60 (0.61) 59.84 (0.41) 50.41 (0.41) 54.99 (0)
RoBERTa
OLID A 82.70 (0.55) 84.62 (0) 80.51 (0.76) 80.34 (0)
OLID B 82.70 (0.13) 83.46 (0) 46.76 (0.04) 47.02 (0)
Hasoc 2021 A 79.9 (0.57) 76.2 (0) 77.77 (0.75) 74 (0)
T5-Base
OLID A 92.90 (1.37) 85.57 (0) 92.93 (1.42) 81.66 (0)
OLID B 99.75 (0.43) 86.81 (0) 99.77 (0.44) 53.78 (0)
OLID C 58.35 (1.22) 54.99 (0) 33.09 (0.76) 43.12 (0)
Hasoc 2021 A 94.60 (1.98) 82.3 (0) 94.73 (5.26) 80.81 (0)
Hasoc 2021 B 65.40 (0.82) 62.74 (0) 62.43 (6.32) 59.21 (0)
(Zampieri et al., 2019b) best scores
OLID A 82.90 (-)
OLID B 75.50 (-)
OLID C 66 (-)
Table 3: Mean scores of model baselines for different subtasks (sd: standard deviation; bold values are best scores for a given task; '-' implies no information available).
Task Weighted F1 (%) Macro F1 (%)
Dev (sd) Test (sd) Dev (sd) Test (sd)
T5-Base
Hasoc 2020 A 96.77 (0.54) 91.12 (0.2) 96.76 (0.54) 91.12 (0.2)
Hasoc 2020 B 83.36 (1.59) 79.08 (1.15) 56.38 (5.09) 53.21 (2.87)
T5-Base+Augmented Data
Hasoc 2021 A 95.5 (3.27) 83 (0) 92.97 (2.20) 82.54 (0)
Hasoc 2021 B 64.74 (3.84) 66.85 (0) 65.56 (1.48) 62.71 (0)
Ensemble
Hasoc 2021 A 80.78 (0) 79.05 (0)
(Mandl et al., 2020) best scores
Hasoc 2020 A 51.52 (-)
Hasoc 2020 B 26.52 (-)
(Mandl et al., 2021) best scores
Hasoc 2021 A 83.05 (-)
Hasoc 2021 B 66.57 (-)
Table 4: T5 variants' mean scores over HASOC data (sd: standard deviation; bold values are best scores for a given task; '-' implies no information available).
the CNN in the OLID subtask A, HASOC 2020 subtask A, and HASOC 2021 subtask A, while the CNN does better on the others.
T5 Variants & Augmentation: The T5-Base model achieves new best scores on the HASOC 2020 subtasks. The augmented data, using the conversational AI technique, improves results on HASOC 2021.5
3.1 The Ensemble
The ensemble macro F1 result (79.05%) is closer to the T5-Base result (80.81%) and farther from the RoBERTa result (74%). The deciding factor is the T5-Small. Hence, a voting ensemble may not perform better than the strongest model in the collection if the other models are weaker at prediction.
5 The first technique is not reported because there was no improvement. This may be because the total number of samples is smaller than that of the conversational AI technique.
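A hard (majority) voting ensemble of this kind can be sketched as follows (a minimal illustration; we assume each model contributes an aligned list of predicted labels):

```python
from collections import Counter

def majority_vote(per_model_predictions):
    """Hard-voting ensemble: for each example, return the label
    predicted by the most models."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_model_predictions)]
```

With three models, two weaker ones that agree can outvote the strongest model on any example, which is exactly how the ensemble result above lands below the T5-Base result.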
3.2 Cross-Task Training
We obtain a new SoTA result (91.73%) for the HASOC 2020 subtask A after initial training on the OLID subtask A. The reason we outperform the previous SoTA is that they used an LSTM with GloVe embeddings (Mandl et al., 2020), instead of a pretrained deep model with the attention mechanism (Bahdanau et al., 2015) that gives a transfer learning advantage. The p-value (p < 0.0001) obtained for the difference of two means in the two-sample t-test is smaller than the alpha (0.05), showing that the results are statistically significant.
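For reference, the t statistic for a difference of two means can be computed as below (a pooled-variance sketch with illustrative scores of our own; the actual runs and p-value computation are the authors'):

```python
import math

def two_sample_t(a, b):
    """Pooled two-sample t statistic and degrees of freedom
    for testing the difference of two sample means."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2
```

The p-value then follows from the t distribution with the returned degrees of freedom; a large |t| with small variances, as with the HASOC 2020 gap reported above, yields p far below 0.05.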