Hypothesis Engineering for Zero-Shot Hate Speech Detection

Janis Goldzycher and Gerold Schneider
Department of Computational Linguistics
University of Zurich
{goldzycher,gschneid}@cl.uzh.ch
Abstract
Standard approaches to hate speech detection
rely on sufficient available hate speech anno-
tations. Extending previous work that repur-
poses natural language inference (NLI) mod-
els for zero-shot text classification, we propose
a simple approach that combines multiple hy-
potheses to improve English NLI-based zero-
shot hate speech detection. We first conduct an
error analysis for vanilla NLI-based zero-shot
hate speech detection and then develop four
strategies based on this analysis. The strate-
gies use multiple hypotheses to predict vari-
ous aspects of an input text and combine these
predictions into a final verdict. We find that
the zero-shot baseline used for the initial error
analysis already outperforms commercial sys-
tems and fine-tuned BERT-based hate speech
detection models on HateCheck. The com-
bination of the proposed strategies further in-
creases the zero-shot accuracy of 79.4% on
HateCheck by 7.9 percentage points (pp), and
the accuracy of 69.6% on ETHOS by 10.0pp.1
1 Introduction
With the increasing popularity of social media and
online forums, phenomena such as hate speech, of-
fensive and abusive language, and personal attacks
have gained a powerful medium through which
they can propagate fast. Due to the sheer number
of posts and comments on social media, manual
content moderation has become unfeasible, thus the
automatic detection of harmful content becomes
essential. In natural language processing, there
now exist established tasks with the goal of de-
tecting offensive language (Pradhan et al.,2020),
abusive language (Nakov et al.,2021), hate speech
(Fortuna and Nunes,2018) and other related types
of harmful content (Poletto et al.,2021). In this
work, we focus on the detection of hate speech,¹ which is typically defined as attacking, abusive, or discriminatory language that targets people on the basis of identity-defining group characteristics such as gender, sexual orientation, disability, race, religion, national origin, etc. (Fortuna and Nunes, 2018; Poletto et al., 2021; Yin and Zubiaga, 2021). Most current hate speech detection
approaches rely on either training models from
scratch or fine-tuning pre-trained language mod-
els (Jahan and Oussalah,2021). Both types of
approaches need large amounts of labeled data
which are only available for a few high-resource
languages (Poletto et al.,2021) and costly to cre-
ate. Therefore, exploring data-efficient methods for
hate speech detection is an attractive alternative.

¹ The code and instructions to reproduce the experiments are available at https://github.com/jagol/nli-for-hate-speech-detection.
In this paper, we build on Yin et al. (2019) who
proposed to re-frame text classification tasks as
natural language inference, enabling high accuracy
zero-shot classification. We exploit the fact that we
can create arbitrary hypotheses to predict aspects of
an input text that might be relevant for hate speech
detection. To identify effective hypotheses, we
first find a well-performing hypothesis formulation
that claims that the input text contains hate speech.
An error analysis based on HateCheck (Röttger
et al.,2021) shows that given a well-performing
formulation the model still struggles with multiple
phenomena, including (1) abusive or profane lan-
guage that does not target people based on identity-
defining group characteristics, (2) counterspeech,
(3) reclaimed slurs, and (4) implicit hate speech.
To mitigate these misclassifications, we develop
four strategies. Each strategy consists of multiple
hypotheses and rules that combine these hypothe-
ses in order to address one of the four identified
error types.
We show that the combination of all proposed
strategies improves the accuracy of vanilla NLI-
based zero-shot prediction by 7.9pp on HateCheck
(Röttger et al.,2021) and 10.0pp on ETHOS (Mol-
las et al.,2022). An error analysis shows that
the overall gains in accuracy largely stem from in-
creased performance on previously identified weak-
nesses, demonstrating that the strategies work as
intended.
Overall, our primary contributions are the fol-
lowing:
C1: An error analysis of vanilla NLI-based zero-shot hate speech detection.
C2: Developing strategies that combine multiple hypotheses to improve zero-shot hate speech detection.
C3: An evaluation and error analysis of the proposed strategies.
2 Background and Related Work
Early approaches to hate speech detection have
focused on English social media posts, especially
Twitter, and treated the task as binary or ternary text
classification (Waseem and Hovy,2016;Davidson
et al.,2017;Founta et al.,2018). In more recent
work, additional labels have been introduced that
indicate whether the post is group-directed or not,
who the targeted group is, if the post calls for vi-
olence, is aggressive, contains stereotypes, if the
hate is expressed implicitly, or if sarcasm or irony
is used (Mandl et al.,2019,2020;Sap et al.,2020;
ElSherief et al.,2021;Röttger et al.,2021;Mollas
et al.,2022). Sometimes hate speech is not directly
annotated; instead, labels such as racism, sexism, and homophobia, which already combine hostility with a specific target, are annotated and predicted
(Waseem and Hovy,2016;Waseem,2016;Saha
et al.,2018;Lavergne et al.,2020).
While early approaches relied on manual fea-
ture engineering (Waseem and Hovy,2016),
most current approaches are based on pre-trained
transformer-based language models that are then
fine-tuned on hate speech datasets (Florio et al.,
2020;Uzan and HaCohen-Kerner,2021;Banerjee
et al.,2021;Lavergne et al.,2020;Das et al.,2021;
Nghiem and Morstatter,2021).
Some work has focused on reducing the need
for labeled data by multi-task learning on differ-
ent sets of hate speech labels (Kapil and Ekbal,
2020;Safi Samghabadi et al.,2020) or adding senti-
ment analysis as an auxiliary task (Plaza-Del-Arco
et al.,2021). Others have worked on reducing
the need for non-English annotations by adapting
hate speech detection models from high- to low-
resource languages in a cross-lingual zero-shot set-
ting (Stappen et al., 2020; Pamungkas et al., 2021).
However, the approach has been criticized for being
unreliable when encountering language-specific
taboo interjections (Nozza,2021).
2.1 Zero-Shot Text Classification
The advent of large language models has en-
abled zero-shot and few-shot text classification ap-
proaches such as prompting (Liu et al.,2021), and
task descriptions (Raffel et al.,2020), which con-
vert the target task to the pre-training objective and
are usually only used in combination with large
language models. Chiu and Alexander (2021) use
the prompts “Is this text racist?” and “Is this text
sexist?” to detect hate speech with GPT-3. Schick
et al. (2021) show that toxicity in large generative
language models can be avoided by using similar
prompts to self-diagnose toxicity during the decod-
ing.
In contrast, NLI-based prediction, in which the target task is reformulated as an NLI task and fed into an NLI model, converts the target task to the fine-tuning task. Here, a model is given a premise and a hypothesis and tasked to predict if the premise entails the hypothesis, contradicts it, or is neutral towards it. Yin et al. (2019) proposed to use an NLI model for zero-shot topic classification, by inputting the text to classify as the premise and constructing for each topic a hypothesis of the form “This text is about <topic>”. They map the labels
neutral and contradiction to not-entailment. We
can then interpret a prediction of entailment as pre-
dicting that the input text belongs to the topic in
the given hypothesis. Conversely, not-entailment
implies that the text is not about the topic. Wang
et al. (2021) show for a range of tasks, including
offensive language identification, that this task re-
formulation also benefits few-shot learning scenar-
ios. Recently, AlKhamissi et al. (2022) obtained
large performance improvements in few-shot learn-
ing for hate speech detection by (1) decomposing
the task into four subtasks and (2) additionally train-
ing the few-shot model on a knowledge base.
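To make the NLI-based reformulation concrete, the following minimal sketch (an illustration, not the authors' released code) performs zero-shot topic classification in the style of Yin et al. (2019), assuming the Huggingface zero-shot-classification pipeline and the public facebook/bart-large-mnli checkpoint:

from transformers import pipeline

# Load an off-the-shelf MNLI model as a zero-shot classifier.
# Assumption: the public facebook/bart-large-mnli checkpoint.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

text = "The new graphics card doubles frame rates in most games."
topics = ["technology", "politics", "sports"]

# Each candidate topic is inserted into the hypothesis template; the model
# scores how strongly the input text (the premise) entails each hypothesis.
result = classifier(text, candidate_labels=topics,
                    hypothesis_template="This text is about {}.")
print(result["labels"][0], result["scores"][0])  # highest-scoring topic

Internally, the pipeline constructs one premise-hypothesis pair per candidate label and, by default, normalizes the entailment scores across labels.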
3 Data
HateCheck
Röttger et al. (2021) introduce this
English, synthetic, evaluation-only dataset, anno-
tated for a binary decision between hate speech
and not-hate speech. It covers 29 functionalities
that are either a type of hate speech or challenging
types of non-hate speech that could be mistaken
for hate speech by a classifier. The examples for each of these functionalities have been generated from templates constructed on the basis of conversations with NGO workers. Each of these templates contains one blank space to be filled with a protected group. The authors
fill these templates with seven protected groups,
namely: women, gay people, transgender people,
black people, Muslims, immigrants, and disabled
people. Overall the dataset contains 3,728 exam-
ples.
ETHOS
The ETHOS dataset (Mollas et al.,
2022) is split into two parts: one part is annotated
for the presence of hate speech. The other part con-
tains fine-grained annotations that indicate which
characteristics have been targeted (gender, sexual
orientation, race, ethnicity, religion, national origin,
disability), whether the utterance calls for violence,
and whether it is directed at an individual or a gen-
eral statement about a group. The dataset is based
on English comments from Youtube and Reddit.
For this work, we will only make use of the binary
hate speech annotations. These annotations are continuous values between 0 (indicating no hate speech at all) and 1 (indicating clear hate speech). We rounded all annotations to either 0 or 1 using a threshold of 0.5.
Table 1 displays the class balances of the two datasets.

name             # examples   classes
HateCheck        3,728        hateful (68.8%), non-hate (31.2%)
ETHOS (binary)   997          hate speech (64.1%), not-hate speech (25.9%)

Table 1: The number of examples and the class balance of the datasets.
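As a minimal sketch of this binarization step (the file and column names are assumptions for illustration and may differ from the released ETHOS data):

import pandas as pd

# Hypothetical file and column names; the released ETHOS data may differ.
df = pd.read_csv("ethos_binary.csv", sep=";")
# Scores of 0.5 or higher are rounded to 1 (hate speech), the rest to 0.
df["label"] = (df["isHate"] >= 0.5).astype(int)
print(df["label"].value_counts(normalize=True))  # class balance as in Table 1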
4 Evaluating Standard Zero-Shot
Prediction
The evaluation of standard zero-shot NLI-based
hate speech detection has two goals: To (1) obtain
an error analysis that serves as the starting point for
developing zero-shot strategies in Section 5, and
(2) establish a baseline for those strategies.
system                                              acc. (%)
BART-MNLI 0-shot results
That example is hate speech. / That is hateful.     66.6
That contains hate speech.                          79.4
average                                             75.1
Systems evaluated by Röttger et al. (2021)
SiftNinja                                           33.2
BERT fine-tuned on Davidson et al. (2017)           60.2
BERT fine-tuned on Founta et al. (2018)             63.2
Google Jigsaw Perspective²                          76.6

Table 2: Evaluation of hypotheses for zero-shot hate speech detection on HateCheck. The top rows contain the two lowest scoring hypotheses, the highest scoring hypothesis, and the average score for all tested hypotheses. The bottom rows contain the HateCheck baselines computed by Röttger et al. (2021). The full results for all tested hypotheses are listed in Appendix A.

² Google Jigsaw has since released a new version of the model powering the Perspective API (Lees et al., 2022). We assume that the new model would score higher on HateCheck.

Experiment setup
To test if an input text contains hate speech, we need a hypothesis expressing that claim. However, there are many ways in which the claim that a given text contains hate speech can be expressed. Choosing a sub-optimal
way to express this claim will result in lower ac-
curacy. Wang et al. (2021) already tested four
different hypotheses for hate speech or offensive
language. We conduct an extensive evaluation
by constructing and testing all grammatically cor-
rect sentences built with the following building
blocks: It/That/This + example/text + contains/is
+ hate speech/hateful/hateful content. We con-
duct all experiments with a BART-large model
(Lewis et al., 2020) that was fine-tuned on the Multi-Genre Natural Language Inference dataset (MNLI) (Williams et al., 2018) and has been made available via the Huggingface transformers library (Wolf et al., 2020) as bart-large-mnli. This
model predicts either contradiction, neutral, or entailment. We follow the recommendation of the model creators to ignore the logits for neutral and perform a softmax over the logits of contradiction and entailment. If the probability of entailment is equal to or higher than 0.5, we consider this a prediction of entailment and thus hate speech.³ We evaluate on HateCheck since the functionalities in this dataset allow for an automatic in-depth error analysis, and compare our results to the baselines provided by Röttger et al. (2021).

³ This procedure is equal to taking the argmax over contradiction and entailment.
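A hedged sketch of this scoring procedure (an approximation for illustration, not the authors' released code) is given below; the assumed logit order [contradiction, neutral, entailment] matches the published bart-large-mnli configuration but should be verified via model.config.id2label:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
CONTRADICTION, ENTAILMENT = 0, 2  # assumed logit positions; check config.id2label

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Entailment probability after dropping the neutral logit."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    # Softmax over contradiction and entailment only; index 1 is entailment.
    two_way = torch.softmax(logits[[CONTRADICTION, ENTAILMENT]], dim=-1)
    return two_way[1].item()

text = "I hate all members of that group, they are subhuman."
if entailment_prob(text, "That contains hate speech.") >= 0.5:
    print("hate speech")
else:
    print("not-hate speech")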
Results
Table 2 shows an abbreviated version of the results. The full results are given in Appendix A. The hypothesis “That contains hate speech.” obtains the highest accuracy and beats the Google Jigsaw Perspective API by 2.8pp. This is remarkable, since we can assume that the commercial systems were all trained to detect hateful content or hate speech, while this model has not been trained on a single example of hate speech detection or a similar task. The two lowest scoring hypotheses lead to an accuracy of 66.6%, meaning that an unlucky choice of hypothesis can cost more than 12pp of accuracy.

Figure 1: FBT. Standard zero-shot entailment prediction would wrongly predict the input text as containing hate speech. Using additional hypotheses, it is possible to check whether a protected group is targeted and, if necessary, to override the original prediction.

Figure 2: FCS. If a text contains quotations, the quoted text is replaced with a variable X using a regular expression. Then two hypotheses are tested: the first hypothesis serves as a test checking whether the text inside the quotes is hate speech. If that is predicted to be the case, the second hypothesis is used to predict whether the quoted text is supported or denounced by the post.
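Figure 2 describes the quote handling only at a high level. The snippet below is a possible reconstruction of its first step, replacing a quoted span with the variable X via a regular expression; the pattern and the hypothesis wordings in the comments are assumptions, not necessarily those used in the paper.

import re

QUOTE_PATTERN = re.compile(r'"([^"]+)"')  # assumes straight double quotes

def replace_quote(text: str):
    """Return (text with the quote replaced by X, quoted span), or (text, None)."""
    match = QUOTE_PATTERN.search(text)
    if match is None:
        return text, None
    return QUOTE_PATTERN.sub("X", text, count=1), match.group(1)

post = 'Saying "that group should be wiped out" is never acceptable.'
frame, quoted = replace_quote(post)
print(frame)   # Saying X is never acceptable.
print(quoted)  # that group should be wiped out
# Step 1 (hypothetical wording): is the quoted span itself hate speech?
#   entailment_prob(quoted, "That contains hate speech.") >= 0.5
# Step 2 (hypothetical wording): does the post support or denounce X?
#   entailment_prob(frame, "The author of this text agrees with X.")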
Error Analysis
Column “No Strat.” in Table 4
shows the accuracy per HateCheck functionality
for the hypothesis “That contains hate speech.”.
Most notably, the model wrongly predicted all de-
nouncements of hate (F20 and F21) as hate speech.
In four functionalities (F22, F11, F23, F20) the
model predicted hate speech even though no one
or no relevant group was targeted. Finally, we see
that the model often fails at analyzing sentences
with negations (F15) and that it fails at recognizing
when slurs are reclaimed and used in a positive way
(F9). In what follows, we will present and evaluate
strategies to avoid these errors.
5 Methods
In this section, we present four methods, which we
call strategies, that aim to improve zero-shot hate
speech detection. A strategy has the following com-
ponents and structure: The aim is to assign a label y ∈ {0, 1} to an input text t, where 1 corresponds to the class hate speech and 0 corresponds to the class not-hate speech. The input text t can be used in one or multiple premises p_0 to p_m, which are used in conjunction with the main hypothesis h_0 and one or multiple supporting hypotheses [h_1, ..., h_n] to obtain NLI model predictions m(p_i, h_j) ∈ {0, 1}, where 0 corresponds to contradiction and 1 corresponds to entailment. The variables i and j are defined as i ∈ [0, ..., m] and j ∈ [0, ..., n]. The rules for how to combine model predictions to obtain the final label y are given by the individual strategies. As the main hypothesis we use “That contains hate speech.”, since it led to the highest accuracy on HateCheck in Section 4. The supporting hypotheses used to implement the strategies are listed in Table 3.
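A minimal sketch of this structure is shown below (illustrative only; the concrete rules and hypothesis wordings are defined by the individual strategies and the released code). The predictions m(p_i, h_j) are realized with the thresholded entailment_prob function from the sketch in Section 4, and for simplicity the input text t serves as the only premise:

from typing import Callable, List

MAIN_HYPOTHESIS = "That contains hate speech."

def m(premise: str, hypothesis: str) -> int:
    """NLI prediction m(p_i, h_j): 1 = entailment, 0 = contradiction."""
    return int(entailment_prob(premise, hypothesis) >= 0.5)

def apply_strategy(text: str,
                   supporting_hypotheses: List[str],
                   rule: Callable[[int, List[int]], int]) -> int:
    """Combine the main and supporting predictions into the final label y."""
    main_pred = m(text, MAIN_HYPOTHESIS)                          # m(t, h_0)
    support_preds = [m(text, h) for h in supporting_hypotheses]   # m(t, h_1..h_n)
    return rule(main_pred, support_preds)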
5.1 Filtering By Target (FBT)
The error analysis showed that we can improve
zero-shot classification accuracy significantly by
avoiding predictions of hate speech where no rele-
vant target group occurs. We thus propose to avoid
false positives by constructing a set of supporting
hypotheses [h_1, ..., h_n] to predict if text t actually targets or mentions a protected group or characteristic. If no protected group or characteristic is predicted to be targeted or mentioned, the original prediction of hate speech is overridden and the text is labeled as not-hate speech.
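Building on the sketch above, a hedged reconstruction of the FBT rule (following Figure 1) could look as follows; the protected groups are those listed for HateCheck in Section 3, while the hypothesis template is an illustrative placeholder rather than the exact wording from Table 3:

from typing import List

# Illustrative hypothesis template; the paper's exact wordings are in Table 3.
TARGET_HYPOTHESES = [
    f"That is about {group}."
    for group in ["women", "gay people", "transgender people", "black people",
                  "Muslims", "immigrants", "disabled people"]
]

def fbt_rule(main_pred: int, support_preds: List[int]) -> int:
    """Keep a hate speech prediction only if some protected group is targeted."""
    if main_pred == 1 and not any(support_preds):
        return 0  # override: no protected group mentioned or targeted
    return main_pred

# y = apply_strategy(text, TARGET_HYPOTHESES, fbt_rule)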