
tion (Caselli et al., 2021).
The rest of this paper is structured as follows: Section 2 explains the methods used in this study. The results, critical analysis with XAI, and discussion are in Section 3. Section 4 provides an overview of HS and prior work in the field. Section 5 gives the conclusion and possible future work.
2 Methodology
All the experiments were conducted on a shared DGX-1 machine with 8 × 32GB Nvidia V100 GPUs. The operating system (OS) of the server is Ubuntu 18 and it has 80 CPU cores. Each experiment is conducted 3 times and the average results are computed. Each experiment is run for a total of six epochs, and the model checkpoint with the lowest validation loss is saved and used for evaluation on the test set, where available. A linear schedule with warm-up is used for learning rate (LR) adjustment for T5 and RoBERTa. Only a limited set of hyperparameters is explored, through manual tuning, for all the models, due to time and resource constraints.
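For reference, a minimal sketch of setting up such a schedule with the HuggingFace transformers library is given below; the optimizer, learning rate and warm-up ratio shown are illustrative assumptions, not necessarily the exact values used in our experiments.

# Illustrative sketch: linear LR schedule with warm-up (values are assumptions).
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_scheduler(model, steps_per_epoch, epochs=6,
                                  lr=2e-5, warmup_ratio=0.1):
    total_steps = steps_per_epoch * epochs
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler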
Brief details about all the models used are given in Appendix B. Appendix C gives more information on the data preprocessing, the evaluation metrics, the ensemble, and cross-task training. The average time per epoch for training and evaluation on the validation set is 83.52, 7.82 & 22.29 seconds for the OLID, HASOC 2020 & HASOC 2021 datasets, respectively.3
2.1 Solving OOC Predictions in T5
Raffel et al. (2020) introduced T5 and noted the possibility of OOC predictions in the model. This occurs when the model predicts text (or an empty string) that was seen during training but is not among the class labels. The issue appears to be more common in the initial epochs of training and sometimes does not occur at all. We experienced this challenge in both libraries we attempted to develop with.4
In order to solve this, we first introduced integers (explicitly type-cast as strings) as class labels, which appears to make the model predictions more stable: when OOC predictions occurred in pilot studies, the issue was reduced by about 50%. For example, for the HASOC datasets, we substituted "1" and "0" for the labels "NOT" and "HOF", respectively. As a second step, we introduced a simple correction: replacing an OOC prediction (if it occurs) with the label of the largest class in the training set.

3 Restrictions (cpulimit) were implemented to avoid overloading the server, in fairness to other users. Hence, the average time for the test sets ranges from 2 to over 24 hours.
4 HuggingFace & Simple Transformers
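A minimal sketch of this two-step fix is given below, using the HASOC label mapping as an example; the function names are illustrative rather than our exact implementation.

# Sketch of the OOC handling described above (names are illustrative).
from collections import Counter

# Step 1: integer labels, type-cast as strings (HASOC: "NOT" -> "1", "HOF" -> "0").
LABEL2ID = {"NOT": "1", "HOF": "0"}
VALID_LABELS = set(LABEL2ID.values())

def majority_label(train_labels):
    # Most frequent label string in the training set.
    return Counter(train_labels).most_common(1)[0][0]

def fix_ooc_prediction(prediction, train_labels):
    # Step 2: replace an OOC prediction (including an empty string)
    # with the label of the largest class in the training set.
    if prediction not in VALID_LABELS:
        return majority_label(train_labels)
    return prediction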
2.2 Data Augmentation
The objective of data augmentation is to increase the number of training samples in order to improve the performance of models on the evaluation set (Feng et al., 2021). We experimented with two techniques: 1) word-level deletion of the start and end words of each sample and 2) conversational AI text generation (Table 2). Our work may be the first to use conversational AI for data augmentation. It doubles the number of samples and provides diversity; around 16 new words are generated per sample prompt, on average. More details about the two techniques are given in Appendix C.3.
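A minimal sketch of the first technique is given below; the exact variant (e.g., whether very short samples are skipped) is an assumption for illustration.

# Illustrative sketch: word-level deletion of the start and end words per sample.
def delete_start_end_words(text):
    words = text.split()
    if len(words) <= 2:          # too short to delete both ends meaningfully
        return None
    return " ".join(words[1:-1])

def augment_by_deletion(samples):
    # Apply start/end word deletion to every sample, skipping very short ones.
    new = [delete_start_end_words(s) for s in samples]
    return [s for s in new if s]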
Type        Sample
original    Son of a *** wrong "you’re"
augmented   son of a *** wrong youre No, that’s Saint Johns Chop House. I need a taxi to take me from the hotel to the restaurant, leaving the first at 5:45.
original    SO EXCITED TO GET MY CovidVaccine I hate you covid!
augmented   so excited to get my covidvaccine i hate you covid You should probably get that checked out by a gastroenterology department.
original    ModiKaVaccineJumla Who is responsible for oxygen? ModiResign Do you agree with me? ❤️ Don’t you agree with me?
augmented   modikavaccinejumla who is responsible for oxygen modiresign do you agree with me âï dont you agree with me Yes, I definitely do not want to work with them again. I appreciate your help..

Table 2: Original and conversational AI-augmented examples from the HASOC 2021 dataset (offensive words masked with "***").
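The sketch below illustrates the second technique as applied in Table 2: each (preprocessed) sample serves as a prompt to a dialogue model and the generated reply is appended to it, roughly doubling the data. The model name and generation settings are assumptions for illustration; the actual setup is described in Appendix C.3.

# Illustrative sketch: conversational AI augmentation (model and settings are assumptions).
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")

def conversational_augment(samples, max_new_tokens=20):
    augmented = []
    for text in samples:
        out = generator(text, max_new_tokens=max_new_tokens, do_sample=True,
                        pad_token_id=generator.tokenizer.eos_token_id)[0]["generated_text"]
        reply = out[len(text):].strip()       # keep only the generated reply
        augmented.append(f"{text} {reply}")   # original sample + new reply
    return augmented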
3 Results and Discussion
Tables 3, 4 and 5 (Appendix E) show the baseline results, additional results using the best model (T5), and the cross-task results with T5, respectively. Table 6 (Appendix E) shows results for other datasets and the HateBERT model (Caselli et al., 2021). The HatEval task is the only one directly comparable between our work and that of Caselli et al. (2021).
The Baselines: The Transformer-based models (T5 and RoBERTa) generally perform better than the other baselines (LSTM and CNN) (Zampieri et al., 2019b), except for RoBERTa on OLID subtask B and HASOC 2021 subtask A. T5 outperforms RoBERTa on all tasks. Based on the test set results, the LSTM obtains better results than