Hypothesis Engineering for Zero-Shot Hate Speech Detection

Janis Goldzycher and Gerold Schneider
Department of Computational Linguistics
University of Zurich
{goldzycher,gschneid}@cl.uzh.ch
Abstract
Standard approaches to hate speech detection
rely on sufficient available hate speech anno-
tations. Extending previous work that repur-
poses natural language inference (NLI) mod-
els for zero-shot text classification, we propose
a simple approach that combines multiple hy-
potheses to improve English NLI-based zero-
shot hate speech detection. We first conduct an
error analysis for vanilla NLI-based zero-shot
hate speech detection and then develop four
strategies based on this analysis. The strate-
gies use multiple hypotheses to predict vari-
ous aspects of an input text and combine these
predictions into a final verdict. We find that
the zero-shot baseline used for the initial error
analysis already outperforms commercial sys-
tems and fine-tuned BERT-based hate speech
detection models on HateCheck. The com-
bination of the proposed strategies further in-
creases the zero-shot accuracy of 79.4% on
HateCheck by 7.9 percentage points (pp), and
the accuracy of 69.6% on ETHOS by 10.0pp.1
1 Introduction
With the increasing popularity of social media and
online forums, phenomena such as hate speech, of-
fensive and abusive language, and personal attacks
have gained a powerful medium through which
they can propagate fast. Due to the sheer number
of posts and comments on social media, manual
content moderation has become unfeasible, thus the
automatic detection of harmful content becomes
essential. In natural language processing, there
now exist established tasks with the goal of de-
tecting offensive language (Pradhan et al.,2020),
abusive language (Nakov et al.,2021), hate speech
(Fortuna and Nunes,2018) and other related types
of harmful content (Poletto et al.,2021). In this
work, we focus on the detection of hate speech,¹ which is typically defined as attacking, abusive, or discriminatory language that targets people on the basis of identity-defining group characteristics such as gender, sexual orientation, disability, race, religion, national origin, etc. (Fortuna and Nunes, 2018; Poletto et al., 2021; Yin and Zubiaga, 2021). Most current hate speech detection
approaches rely on either training models from
scratch or fine-tuning pre-trained language mod-
els (Jahan and Oussalah,2021). Both types of
approaches need large amounts of labeled data
which are only available for a few high-resource
languages (Poletto et al.,2021) and costly to cre-
ate. Therefore, exploring data-efficient methods for
hate speech detection is an attractive alternative.

¹ The code and instructions to reproduce the experiments are available at https://github.com/jagol/nli-for-hate-speech-detection.
In this paper, we build on Yin et al. (2019) who
proposed to re-frame text classification tasks as
natural language inference, enabling high accuracy
zero-shot classification. We exploit the fact that we
can create arbitrary hypotheses to predict aspects of
an input text that might be relevant for hate speech
detection. To identify effective hypotheses, we
first find a well-performing hypothesis formulation
that claims that the input text contains hate speech.
An error analysis based on HateCheck (Röttger
et al.,2021) shows that given a well-performing
formulation the model still struggles with multiple
phenomena, including (1) abusive or profane lan-
guage that does not target people based on identity-
defining group characteristics, (2) counterspeech,
(3) reclaimed slurs, and (4) implicit hate speech.
To mitigate these misclassifications, we develop
four strategies. Each strategy consists of multiple
hypotheses and rules that combine these hypothe-
ses in order to address one of the four identified
error types.
We show that the combination of all proposed
strategies improves the accuracy of vanilla NLI-
based zero-shot prediction by 7.9pp on HateCheck
(Röttger et al.,2021) and 10.0pp on ETHOS (Mol-
las et al.,2022). An error analysis shows that
the overall gains in accuracy largely stem from in-
creased performance on previously identified weak-
nesses, demonstrating that the strategies work as
intended.
Overall, our primary contributions are the fol-
lowing:
C1: An error analysis of vanilla NLI-based zero-shot hate speech detection.
C2: Developing strategies that combine multiple hypotheses to improve zero-shot hate speech detection.
C3: An evaluation and error analysis of the proposed strategies.
2 Background and Related Work
Early approaches to hate speech detection have
focused on English social media posts, especially
Twitter, and treated the task as binary or ternary text
classification (Waseem and Hovy,2016;Davidson
et al.,2017;Founta et al.,2018). In more recent
work, additional labels have been introduced that
indicate whether the post is group-directed or not,
who the targeted group is, if the post calls for vi-
olence, is aggressive, contains stereotypes, if the
hate is expressed implicitly, or if sarcasm or irony
is used (Mandl et al.,2019,2020;Sap et al.,2020;
ElSherief et al.,2021;Röttger et al.,2021;Mollas
et al.,2022). Sometimes hate speech is not directly
annotated; instead, labels such as racism, sexism, and homophobia, which already combine hostility with a specific target, are annotated and predicted
(Waseem and Hovy,2016;Waseem,2016;Saha
et al.,2018;Lavergne et al.,2020).
While early approaches relied on manual fea-
ture engineering (Waseem and Hovy,2016),
most current approaches are based on pre-trained
transformer-based language models that are then
fine-tuned on hate speech datasets (Florio et al.,
2020;Uzan and HaCohen-Kerner,2021;Banerjee
et al.,2021;Lavergne et al.,2020;Das et al.,2021;
Nghiem and Morstatter,2021).
Some work has focused on reducing the need
for labeled data by multi-task learning on differ-
ent sets of hate speech labels (Kapil and Ekbal,
2020;Safi Samghabadi et al.,2020) or adding senti-
ment analysis as an auxiliary task (Plaza-Del-Arco
et al.,2021). Others have worked on reducing
the need for non-English annotations by adapting
hate speech detection models from high- to low-
resource languages in a cross-lingual zero-shot set-
ting (Stappen et al., 2020; Pamungkas et al., 2021).
However, the approach has been criticized for being
unreliable when encountering language-specific
taboo interjections (Nozza,2021).
2.1 Zero-Shot Text Classification
The advent of large language models has en-
abled zero-shot and few-shot text classification ap-
proaches such as prompting (Liu et al.,2021), and
task descriptions (Raffel et al.,2020), which con-
vert the target task to the pre-training objective and
are usually only used in combination with large
language models. Chiu and Alexander (2021) use
the prompts “Is this text racist?” and “Is this text
sexist?” to detect hate speech with GPT-3. Schick
et al. (2021) show that toxicity in large generative
language models can be avoided by using similar
prompts to self-diagnose toxicity during the decod-
ing.
In contrast, NLI-based prediction, in which the target task is reformulated as an NLI task and fed into an NLI model, converts the target task to the fine-tuning task. Here, a model is given a premise and a hypothesis and tasked to predict if the premise entails the hypothesis, contradicts it, or is neutral towards it. Yin et al. (2019) proposed to use an NLI model for zero-shot topic classification, by inputting the text to classify as the premise and constructing for each topic a hypothesis of the form “This text is about <topic>”. They map the labels
neutral and contradiction to not-entailment. We
can then interpret a prediction of entailment as pre-
dicting that the input text belongs to the topic in
the given hypothesis. Conversely, not-entailment
implies that the text is not about the topic. Wang
et al. (2021) show for a range of tasks, including
offensive language identification, that this task re-
formulation also benefits few-shot learning scenar-
ios. Recently, AlKhamissi et al. (2022) obtained
large performance improvements in few-shot learn-
ing for hate speech detection by (1) decomposing
the task into four subtasks and (2) additionally train-
ing the few-shot model on a knowledge base.
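To make the NLI-based reformulation concrete, the following minimal sketch (an illustration, not the authors' released code) performs zero-shot topic classification in the style of Yin et al. (2019), assuming the Huggingface zero-shot-classification pipeline and the public facebook/bart-large-mnli checkpoint:

from transformers import pipeline

# Load an off-the-shelf MNLI model as a zero-shot classifier.
# Assumption: the public facebook/bart-large-mnli checkpoint.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

text = "The new graphics card doubles frame rates in most games."
topics = ["technology", "politics", "sports"]

# Each candidate topic is inserted into the hypothesis template; the model
# scores how strongly the input text (the premise) entails each hypothesis.
result = classifier(text, candidate_labels=topics,
                    hypothesis_template="This text is about {}.")
print(result["labels"][0], result["scores"][0])  # highest-scoring topic

Internally, the pipeline constructs one premise-hypothesis pair per candidate label and, by default, normalizes the entailment scores across labels.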
3 Data
HateCheck
Röttger et al. (2021) introduce this
English, synthetic, evaluation-only dataset, anno-
tated for a binary decision between hate speech
and not-hate speech. It covers 29 functionalities
that are either a type of hate speech or challenging
types of non-hate speech that could be mistaken
for hate speech by a classifier. The examples for each of these functionalities have been generated from templates constructed on the basis of conversations with NGO workers. Each of these templates contains one blank space to be filled with a protected group. The authors
fill these templates with seven protected groups,
namely: women, gay people, transgender people,
black people, Muslims, immigrants, and disabled
people. Overall the dataset contains 3,728 exam-
ples.
ETHOS
The ETHOS dataset (Mollas et al.,
2022) is split into two parts: one part is annotated
for the presence of hate speech. The other part con-
tains fine-grained annotations that indicate which
characteristics have been targeted (gender, sexual
orientation, race, ethnicity, religion, national origin,
disability), whether the utterance calls for violence,
and whether it is directed at an individual or a gen-
eral statement about a group. The dataset is based
on English comments from Youtube and Reddit.
For this work, we will only make use of the binary
hate speech annotations. These annotations are continuous values between 0 (indicating no hate speech at all) and 1 (indicating clear hate speech). We rounded all annotations to either 0 or 1 using a threshold of 0.5.
Table 1 displays the class balances of the two datasets.

name             # examples   classes
HateCheck        3,728        hateful (68.8%), non-hate (31.2%)
ETHOS (binary)   997          hate speech (64.1%), not-hate speech (25.9%)

Table 1: The number of examples and the class balance of the datasets.
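As a minimal sketch of this binarization step (the file and column names are assumptions for illustration and may differ from the released ETHOS data):

import pandas as pd

# Hypothetical file and column names; the released ETHOS data may differ.
df = pd.read_csv("ethos_binary.csv", sep=";")
# Scores of 0.5 or higher are rounded to 1 (hate speech), the rest to 0.
df["label"] = (df["isHate"] >= 0.5).astype(int)
print(df["label"].value_counts(normalize=True))  # class balance as in Table 1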
4 Evaluating Standard Zero-Shot
Prediction
The evaluation of standard zero-shot NLI-based
hate speech detection has two goals: To (1) obtain
an error analysis that serves as the starting point for
developing zero-shot strategies in Section 5, and
(2) establish a baseline for those strategies.
system                                              acc. (%)
BART-MNLI 0-shot results
That example is hate speech. / That is hateful.     66.6
That contains hate speech.                          79.4
average                                             75.1
Systems evaluated by Röttger et al. (2021)
SiftNinja                                           33.2
BERT fine-tuned on Davidson et al. (2017)           60.2
BERT fine-tuned on Founta et al. (2018)             63.2
Google Jigsaw Perspective²                          76.6

Table 2: Evaluation of hypotheses for zero-shot hate speech detection on HateCheck. The top rows contain the two lowest scoring hypotheses, the highest scoring hypothesis, and the average score for all tested hypotheses. The bottom rows contain the HateCheck baselines computed by Röttger et al. (2021). The full results for all tested hypotheses are listed in Appendix A.

² Google Jigsaw has since released a new version of the model powering the Perspective API (Lees et al., 2022). We assume that the new model would score higher on HateCheck.

Experiment setup
To test if an input text contains hate speech, we need a hypothesis expressing that claim. However, there are many ways in which the claim that a given text contains hate speech can be expressed. Choosing a sub-optimal
way to express this claim will result in lower ac-
curacy. Wang et al. (2021) already tested four
different hypotheses for hate speech or offensive
language. We conduct an extensive evaluation
by constructing and testing all grammatically cor-
rect sentences built with the following building
blocks: It/That/This + example/text + contains/is
+ hate speech/hateful/hateful content. We con-
duct all experiments with a BART-large model
(Lewis et al., 2020) that was fine-tuned on the Multi-Genre Natural Language Inference dataset (MNLI) (Williams et al., 2018) and has been made available via the Huggingface transformers library (Wolf et al., 2020) as bart-large-mnli. This
model predicts either contradiction, neutral, or entailment. We follow the recommendation of the model creators to ignore the logits for neutral and perform a softmax over the logits of contradiction and entailment. If the probability of entailment is equal to or higher than 0.5, we consider this a prediction of entailment and thus hate speech.³ We evaluate on HateCheck since the functionalities in this dataset allow for an automatic in-depth error analysis, and compare our results to the baselines provided by Röttger et al. (2021).

³ This procedure is equal to taking the argmax over contradiction and entailment.
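A hedged sketch of this scoring procedure (an approximation for illustration, not the authors' released code) is given below; the assumed logit order [contradiction, neutral, entailment] matches the published bart-large-mnli configuration but should be verified via model.config.id2label:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
CONTRADICTION, ENTAILMENT = 0, 2  # assumed logit positions; check config.id2label

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Entailment probability after dropping the neutral logit."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    # Softmax over contradiction and entailment only; index 1 is entailment.
    two_way = torch.softmax(logits[[CONTRADICTION, ENTAILMENT]], dim=-1)
    return two_way[1].item()

text = "I hate all members of that group, they are subhuman."
if entailment_prob(text, "That contains hate speech.") >= 0.5:
    print("hate speech")
else:
    print("not-hate speech")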
Results
Table 2 shows an abbreviated version of the results. The full results are given in Appendix A. The hypothesis “That contains hate speech.” obtains the highest accuracy and beats the Google Jigsaw Perspective API by 2.8pp. This is remarkable, since we can assume that the commercial systems were all trained to detect hateful content or hate speech, while this model has not been trained on a single example of hate speech detection or a similar task. The two lowest scoring hypotheses lead to an accuracy of 66.6%, meaning that an unlucky choice of hypothesis can cost more than 12pp of accuracy.

Figure 1: FBT. Standard zero-shot entailment prediction would wrongly predict the input text as containing hate speech. Using additional hypotheses, it is possible to check whether a protected group is targeted and, if necessary, to override the original prediction.

Figure 2: FCS. If a text contains quotations, the quoted text is replaced with a variable X using a regular expression. Then two hypotheses are tested: the first hypothesis serves as a test checking whether the text inside the quotes is hate speech. If that is predicted to be the case, the second hypothesis is used to predict whether the quoted text is supported or denounced by the post.
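Figure 2 describes the quote handling only at a high level. The snippet below is a possible reconstruction of its first step, replacing a quoted span with the variable X via a regular expression; the pattern and the hypothesis wordings in the comments are assumptions, not necessarily those used in the paper.

import re

QUOTE_PATTERN = re.compile(r'"([^"]+)"')  # assumes straight double quotes

def replace_quote(text: str):
    """Return (text with the quote replaced by X, quoted span), or (text, None)."""
    match = QUOTE_PATTERN.search(text)
    if match is None:
        return text, None
    return QUOTE_PATTERN.sub("X", text, count=1), match.group(1)

post = 'Saying "that group should be wiped out" is never acceptable.'
frame, quoted = replace_quote(post)
print(frame)   # Saying X is never acceptable.
print(quoted)  # that group should be wiped out
# Step 1 (hypothetical wording): is the quoted span itself hate speech?
#   entailment_prob(quoted, "That contains hate speech.") >= 0.5
# Step 2 (hypothetical wording): does the post support or denounce X?
#   entailment_prob(frame, "The author of this text agrees with X.")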
Error Analysis
Column “No Strat.” in Table 4
shows the accuracy per HateCheck functionality
for the hypothesis “That contains hate speech.”.
Most notably, the model wrongly predicted all de-
nouncements of hate (F20 and F21) as hate speech.
In four functionalities (F22, F11, F23, F20) the
model predicted hate speech even though no one
or no relevant group was targeted. Finally, we see
that the model often fails at analyzing sentences
with negations (F15) and that it fails at recognizing
when slurs are reclaimed and used in a positive way
(F9). In what follows, we will present and evaluate
strategies to avoid these errors.
5 Methods
In this section, we present four methods, which we
call strategies, that aim to improve zero-shot hate
speech detection. A strategy has the following com-
ponents and structure: The aim is to assign a label y ∈ {0, 1} to an input text t, where 1 corresponds to the class hate speech and 0 corresponds to the class not-hate speech. The input text t can be used in one or multiple premises p_0 to p_m, which are used in conjunction with the main hypothesis h_0 and one or multiple supporting hypotheses [h_1, ..., h_n] to obtain NLI model predictions m(p_i, h_j) ∈ {0, 1}, where 0 corresponds to contradiction and 1 corresponds to entailment. The variables i and j are defined as i ∈ [0, ..., m] and j ∈ [0, ..., n]. The rules for how to combine model predictions to obtain the final label y are given by the individual strategies. As the main hypothesis we use “That contains hate speech.”, since it led to the highest accuracy on HateCheck in Section 4. The supporting hypotheses used to implement the strategies are listed in Table 3.
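A minimal sketch of this structure is shown below (illustrative only; the concrete rules and hypothesis wordings are defined by the individual strategies and the released code). The predictions m(p_i, h_j) are realized with the thresholded entailment_prob function from the sketch in Section 4, and for simplicity the input text t serves as the only premise:

from typing import Callable, List

MAIN_HYPOTHESIS = "That contains hate speech."

def m(premise: str, hypothesis: str) -> int:
    """NLI prediction m(p_i, h_j): 1 = entailment, 0 = contradiction."""
    return int(entailment_prob(premise, hypothesis) >= 0.5)

def apply_strategy(text: str,
                   supporting_hypotheses: List[str],
                   rule: Callable[[int, List[int]], int]) -> int:
    """Combine the main and supporting predictions into the final label y."""
    main_pred = m(text, MAIN_HYPOTHESIS)                          # m(t, h_0)
    support_preds = [m(text, h) for h in supporting_hypotheses]   # m(t, h_1..h_n)
    return rule(main_pred, support_preds)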
5.1 Filtering By Target (FBT)
The error analysis showed that we can improve
zero-shot classification accuracy significantly by
avoiding predictions of hate speech where no rele-
vant target group occurs. We thus propose to avoid
false positives by constructing a set of supporting
hypotheses [h_1, ..., h_n] to predict if text t actually targets or mentions a protected group or characteristic. If no protected group or characteristic is predicted to be targeted or mentioned, the original prediction of hate speech is overridden and the text is labeled as not-hate speech.
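Building on the sketch above, a hedged reconstruction of the FBT rule (following Figure 1) could look as follows; the protected groups are those listed for HateCheck in Section 3, while the hypothesis template is an illustrative placeholder rather than the exact wording from Table 3:

from typing import List

# Illustrative hypothesis template; the paper's exact wordings are in Table 3.
TARGET_HYPOTHESES = [
    f"That is about {group}."
    for group in ["women", "gay people", "transgender people", "black people",
                  "Muslims", "immigrants", "disabled people"]
]

def fbt_rule(main_pred: int, support_preds: List[int]) -> int:
    """Keep a hate speech prediction only if some protected group is targeted."""
    if main_pred == 1 and not any(support_preds):
        return 0  # override: no protected group mentioned or targeted
    return main_pred

# y = apply_strategy(text, TARGET_HYPOTHESES, fbt_rule)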