Detecting Unintended Social Bias in Toxic Language Datasets
Nihar Sahoo, Himanshu Gupta, Pushpak Bhattacharyya
CFILT, Indian Institute of Technology Bombay, India
{nihar, himanshug, pb}@cse.iitb.ac.in
Abstract
Warning: This paper contains content that may be offensive or upsetting; however, this cannot be avoided owing to the nature of the work.
With the rise of online hate speech, automatic detection of hate speech and offensive text as a natural language processing task is gaining popularity. However, very little research has been done on detecting unintended social bias in these toxic language datasets. This paper introduces a new dataset, ToxicBias, curated from the existing dataset of the Kaggle competition named "Jigsaw Unintended Bias in Toxicity Classification". We aim to detect social biases, their categories, and targeted groups. The dataset contains instances annotated for five different bias categories, viz., gender, race/ethnicity, religion, political, and LGBTQ. We train transformer-based models using our curated dataset and report baseline performance for bias identification, target generation, and bias implications. Model biases and their mitigation are also discussed in detail. Our study motivates a systematic extraction of social bias data from toxic language datasets. All the code and the dataset used for the experiments in this work are publicly available1.
1 Introduction
In the age of social media and communication, it is easier than ever to openly express one's opinions
on a wide range of issues. This openness results in a
flood of useful information that can assist people in
being more productive and making better decisions.
According to Statista2, the global number of active social media users has just surpassed four billion, accounting for more than half of the world's population. The user base is expected to grow steadily
over the next five years.

*These authors contributed equally to this work.
1 https://github.com/sahoonihar/ToxicBias_CoNLL_2022
2 https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/

Figure 1: An illustrative example of ToxicBias. During the annotation process, hate speech/offensive text is provided without context. Annotators are asked to mark it as biased/neutral and to provide category, target, and implication if it has biases.
Various studies (Plaisime et al., 2020) say that children and teenagers, who are particularly vulnerable, make up a large share of social media users. Unfortunately, this growing number of social media users also leads to an increase in toxicity (Matamoros-Fernández and Farkas, 2021). Sometimes this toxicity gives rise to violence and hate crimes. It does not harm just an individual; most of the time, the entire community suffers due to its intensity.
We have different perspectives based on race, gender, religion, sexual orientation, and many other factors. These perspectives sometimes lead to biases that influence how we see the world, even if we are unaware of them. Such biases can lead us to make decisions that are neither intelligent nor just. Furthermore, when these biases are expressed as hate speech and offensive text, they become painful for specific communities. While some of these biases are implicit, most explicit biases can be found in the form of hate speech and offensive texts.
The use of hate speech incites violence and
sometimes leads to societal and political instability.
The BLM (Black Lives Matter) movement is a consequence of one such bias in America. So, to address
these biases, we must first identify them. While
the concepts of Social Bias and Hate Speech may
appear to be the same, there are subtle differences.
This paper expands on the above ideas and proposes a new dataset, ToxicBias, for detecting social bias from toxic language datasets. The main contributions can be summarized as follows:
• To the best of our knowledge, this is the first study to extract social biases from toxic language datasets in English.
• We release a curated dataset of 5409 instances for detection of social bias, its categories, targets, and bias reasoning.
• We present methods to reduce lexical overfitting using counter-narrative data augmentation.
In the following section, we discuss various established works that are aligned with ours. Section 3 provides information about our dataset, terminology, annotation procedure, and challenges. In Section 4, we describe our experiments and results, followed by a discussion of lexical overfitting reduction via data augmentation in Section 5. Section 6 discusses the conclusion and future work.
2 Related Work
Offensive Text:
Unfortunately, offensive content poses some unique challenges to researchers and practitioners. First and foremost, determining what constitutes abusive/offensive behaviour is difficult. Unlike other types of malicious activity, e.g., spam or malware, the accounts carrying out this type of behaviour are usually controlled by humans, not bots (Founta et al., 2018). The term "offensive language" refers to a broad range of content, including hate speech, vulgarity, threats, cyberbullying, and other ethnic and racial insults (Kaur et al., 2021). There is no single definition of abuse, and phrases like "harassment," "abusive language," and "damaging speech" are frequently used interchangeably.
Hate Speech:
Hate speech is defined as speech that targets disadvantaged social groups in a way that may be damaging to them (Davidson et al., 2017). Fortuna and Nunes (2018) define hate speech as follows: "Hate speech is a language that attacks or diminishes, that incites violence or hate against groups, based on specific characteristics such as physical appearance, religion, national or ethnic origin, sexual orientation, gender identity or other, and it can occur with different linguistic styles, even in subtle forms or when humor is used".
Bias in Embedding:
The initial works exploring bias in language representations aimed at detecting gender, race, and religion biases in word representations (Bolukbasi et al., 2016; Caliskan et al., 2017; Manzini et al., 2019). Some recent works have focused on bias detection from sentence representations (May et al., 2019; Kurita et al., 2019) using BERT embeddings.
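The association tests used in these works boil down to comparing how close a target word's vector lies to two sets of attribute vectors. Below is a minimal, illustrative sketch of that idea in Python, using toy random vectors in place of real pretrained embeddings; the word lists, vectors, and helper names are our own assumptions, not taken from the cited papers.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association_score(word_vec, attr_a, attr_b):
    # WEAT-style association: mean similarity to attribute set A minus
    # mean similarity to attribute set B. Positive values mean the word
    # leans towards A, negative towards B.
    sim_a = np.mean([cosine(word_vec, v) for v in attr_a])
    sim_b = np.mean([cosine(word_vec, v) for v in attr_b])
    return sim_a - sim_b

# Toy 4-dimensional "embeddings" standing in for real word vectors.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in
       ["doctor", "nurse", "he", "she", "man", "woman"]}

male_attrs = [emb["he"], emb["man"]]
female_attrs = [emb["she"], emb["woman"]]

for occupation in ["doctor", "nurse"]:
    score = association_score(emb[occupation], male_attrs, female_attrs)
    print(f"{occupation}: male-vs-female association = {score:+.3f}")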
In addition, there have been many notable efforts towards the detection of data bias in hate speech and offensive language (Waseem and Hovy, 2016; Davidson et al., 2019; Sap et al., 2019; Mozafari et al., 2020). Borkan et al. (2019) discussed the presence of unintended bias in hate speech detection models for identity terms like islam, lesbian, bisexual, etc. The biased association of different marginalized groups is still a major challenge for models trained for toxic language detection (Kim et al., 2020; Xia et al., 2020). This is mainly due to bias in the annotated data, which creates wrong associations of many lexical features with specific labels (Dixon et al., 2018). Lack of social context about the post creator also affects the annotation process, leading to bias against certain communities in the dataset (Sap et al., 2019).
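One simple way to surface such unintended bias, in the spirit of the per-identity metrics of Borkan et al. (2019), is to compare a classifier's false positive rate on non-toxic comments that mention an identity term with its overall false positive rate. The sketch below shows this check on synthetic predictions; the data, the chosen identity term, and the function names are illustrative assumptions, not the exact metrics of the cited work.

import pandas as pd

def false_positive_rate(df, pred_col="pred_toxic", label_col="is_toxic"):
    # Fraction of truly non-toxic comments that the model flags as toxic.
    non_toxic = df[df[label_col] == 0]
    if len(non_toxic) == 0:
        return float("nan")
    return float(non_toxic[pred_col].mean())

# Synthetic predictions; the model wrongly flags every comment that
# mentions an identity term, illustrating a spurious lexical association.
data = pd.DataFrame({
    "text": ["have a nice day", "i am a proud muslim", "you are awful",
             "muslims celebrate eid", "great weather today"],
    "is_toxic":   [0, 0, 1, 0, 0],   # gold labels
    "pred_toxic": [0, 1, 1, 1, 0],   # hypothetical model outputs
})
data["mentions_identity"] = data["text"].str.contains("muslim")

overall_fpr = false_positive_rate(data)                               # 0.50
subgroup_fpr = false_positive_rate(data[data["mentions_identity"]])   # 1.00
print(f"overall FPR: {overall_fpr:.2f}, identity-subgroup FPR: {subgroup_fpr:.2f}")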
Social bias datasets:
More recently, many datasets (Nadeem et al., 2021; Nangia et al., 2020) have been created to measure and detect social biases related to gender, race, profession, religion, age, etc. However, Blodgett et al. (2021) reported that many of these datasets lack clear definitions and have ambiguities and inconsistencies in their annotations. A similar study was done by Sap et al. (2020), whose dataset has both categorical and free-text annotations and uses a generation framework as the core model.
There have been a few studies on data augmentation (Nozza et al., 2019; Bartl et al., 2020) aimed at decreasing the incorrect association of lexical characteristics in these datasets. Hartvigsen et al. (2022) proposed a prompt-based framework to generate a large dataset of toxic and neutral statements to reduce spurious correlations in hate speech detection.
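As a simplified illustration of this line of work (not the prompt-based generation of Hartvigsen et al. (2022), and not the counter-narrative augmentation we describe later), one can pair identity terms with neutral templates and add the resulting non-toxic examples to the training data; the templates and terms below are our own toy choices.

import itertools

# Illustrative neutral templates and identity terms (toy lists, not the paper's).
TEMPLATES = [
    "I am a {} person.",
    "My neighbour is {}.",
    "Being {} is part of my identity.",
]
IDENTITY_TERMS = ["muslim", "gay", "black", "christian", "transgender"]

def neutral_augmentation_examples():
    # Build non-toxic sentences that mention identity terms, so that a
    # classifier trained on the augmented data is less likely to associate
    # the terms themselves with the toxic label.
    return [(template.format(term), 0)          # label 0 = non-toxic
            for template, term in itertools.product(TEMPLATES, IDENTITY_TERMS)]

augmented = neutral_augmentation_examples()
print(len(augmented), "neutral examples, e.g.", augmented[0])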
However, no study has been done on detecting social biases from toxic language datasets, which is a challenging task due to the conceptual overlap between hate speech and social bias. Using a thorough guideline, we attempt to uncover harmful biases in toxic language datasets. The curated dataset is discussed at length in the next section, as are the definitions of each category label and the annotation procedure.
3 ToxicBias Dataset
We develop the manually annotated ToxicBias dataset to enable models to correctly identify social biases in a publicly available toxicity dataset. Below, we define social bias and the categories taken into account in our dataset. We then describe the comprehensive annotation process that we use for dataset acquisition.
3.1 Social Bias
People typically hold preconceptions and stereotypes about, and discriminate against, others who do not belong to their social group. Positive and negative social bias refer to a preference for or against persons or groups, respectively, based on their social identities (e.g., race, gender). Only the negative biases, however, have the capacity to harm target groups (Crawford, 2017). As a result, in our study we focus on identifying negative biases in order to prevent harmful repercussions for targeted groups. Members of specific social groups (e.g., women, Muslims, and transgender individuals) are more likely to face prejudice as a result of living in a culture that does not sufficiently support fairness. In this work, we have considered five prevalent social biases:
• Gender: Favouritism towards one gender over the other. It can be of the following types: Alpha, Beta, or Sexism (Park et al., 2018).
• Religion: Bias against individuals on the basis of religion or religious belief, e.g., Christianity, Islam, Scientology, etc. (Muralidhar, 2021).
• Race: Favouritism for a group of people having common visible physical traits, common origins, language, etc. It is related to dialect, color, appearance, and regional or societal perception (Sap et al., 2019).
• LGBTQ: Prejudice towards members of the LGBTQ community. It can be due to societal perception or physical appearance.
• Political: Prejudice against/towards individuals on the basis of their political beliefs, for example, liberals, conservatives, etc.
Categories   Targets
Political    liberal, conservative, feminist, etc.
Religion     christian, jew, hindu, atheist, etc.
Gender       men, women
LGBTQ        gay, lesbian, homosexual, etc.
Race         black, white, asian, canadians, etc.

Table 1: Bias categories and corresponding targets.
For all of these categories, the target terms denote the communities towards which the bias is directed.
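For illustration, the mapping in Table 1 can be kept as a simple lookup, for example to sanity-check annotated (category, target) pairs. The sketch below is our own and lists only the example targets from the table, not the full target vocabulary of the released dataset.

# Example targets per bias category, as listed in Table 1 (not exhaustive).
CATEGORY_TARGETS = {
    "political": ["liberal", "conservative", "feminist"],
    "religion":  ["christian", "jew", "hindu", "atheist"],
    "gender":    ["men", "women"],
    "lgbtq":     ["gay", "lesbian", "homosexual"],
    "race":      ["black", "white", "asian", "canadians"],
}

def is_consistent(category: str, target: str) -> bool:
    # True if the annotated target appears among the example targets for
    # the given category; unknown targets simply fail the check.
    return target.lower() in CATEGORY_TARGETS.get(category.lower(), [])

print(is_consistent("Religion", "hindu"))    # True
print(is_consistent("Gender", "liberal"))    # False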
3.2 Social Bias vs. Hate Speech
While social bias and hate speech may appear the same at first glance, they are not; the differences between them are quite subtle. While hate speech is always associated with negative sentiment, social bias can also carry positive sentiment. Social bias is a preconceived belief towards or against specific social identities, whereas hate speech is an explicit comment expressing hatred against an individual or a group. Not all hate speech is biased, and not all biased speech is hate speech. We use the following examples to demonstrate the differences:
• Some comments are merely toxic without containing any social bias, e.g., "IM FREEEEE!!!! WORST EXPERIENCE OF MY F**K-ING LIFE"
• Toxic comments can be hate speech but not necessarily biased, e.g., "you gotta be kidding. trump a Christian, nope, he is the devil, he hates blacks, Hispanics, muslims, gays, Asians, etc."
• Some comments are just biased with negative sentiment, without containing any toxicity or hate speech, e.g., "All Asian people are bad drivers."
3.3 Annotation Process
The dataset we use for annotation is collected from the Kaggle competition named "Jigsaw Unintended Bias in Toxicity Classification" (jig, 2019; Research Data, 2018). It has around two million comments from the Civil Comments platform annotated for toxicity. The data also has several other toxicity subtype attributes, such as severe_toxicity, obscene, threat, insult, identity_attack, and sexual_explicit.
We discovered that, with the exception of the identity_attack column, all of the other columns in this dataset are redundant for the social bias detection task.
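As a concrete illustration of this filtering step, the sketch below keeps only comments whose identity_attack score exceeds a threshold; the local file name and the 0.5 cut-off are assumptions made for illustration, not the exact selection criteria used to build ToxicBias.

import pandas as pd

# train.csv is the file released for the Kaggle "Jigsaw Unintended Bias in
# Toxicity Classification" competition (local path assumed for illustration).
df = pd.read_csv("train.csv")

# Keep only the columns relevant to social bias annotation.
subset = df[["id", "comment_text", "identity_attack"]]

# Candidate comments for bias annotation: high identity_attack score.
# The 0.5 threshold is an illustrative choice, not the paper's setting.
candidates = subset[subset["identity_attack"] > 0.5]

print(f"{len(candidates)} candidate comments selected for annotation")
candidates.to_csv("bias_annotation_candidates.csv", index=False)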