Detecting Unintended Social Bias in Toxic Language Datasets
Nihar Sahoo, Himanshu Gupta, Pushpak Bhattacharyya
CFILT, Indian Institute of Technology Bombay, India
{nihar, himanshug, pb}@cse.iitb.ac.in
Abstract
Warning: This paper contains content that may be offensive or upsetting; however, this cannot be avoided owing to the nature of the work.
With the rise of online hate speech, automatic detection of hate speech and offensive text as a natural language processing task is gaining popularity. However, very little research has been done on detecting unintended social bias in these toxic language datasets. This paper introduces a new dataset, ToxicBias, curated from the existing dataset of the Kaggle competition named "Jigsaw Unintended Bias in Toxicity Classification". We aim to detect social biases, their categories, and targeted groups. The dataset contains instances annotated for five different bias categories, viz., gender, race/ethnicity, religion, political, and LGBTQ. We train transformer-based models using our curated dataset and report baseline performance for bias identification, target generation, and bias implications. Model biases and their mitigation are also discussed in detail. Our study motivates a systematic extraction of social bias data from toxic language datasets. All the code and the dataset used for the experiments in this work are publicly available1.
1 Introduction
In the age of social media and communication, it is easier than ever to openly express one's opinions
on a wide range of issues. This openness results in a
flood of useful information that can assist people in
being more productive and making better decisions.
According to Statista2, the global number of active social media users has just surpassed four billion, accounting for more than half of the world's population. The user base is expected to grow steadily
over the next five years.

*These authors contributed equally to this work.
1 https://github.com/sahoonihar/ToxicBias_CoNLL_2022
2 https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/

Figure 1: An illustrative example of ToxicBias. During the annotation process, hate speech/offensive text is provided without context. Annotators are asked to mark it as biased/neutral and to provide category, target, and implication if it has biases.
Various studies (Plaisime et al., 2020) say that children and teenagers, who are particularly vulnerable, make up a large share of social media users. Unfortunately, this growing number of social media users also leads to an increase in toxicity (Matamoros-Fernández and Farkas, 2021). Sometimes this toxicity gives rise to violence and hate crimes. It does not harm just an individual; most of the time, the entire community suffers due to its intensity.
We have different perspectives based on race, gender, religion, sexual orientation, and many other factors. These perspectives sometimes lead to biases that influence how we see the world, even if we are unaware of them. Such biases can lead us to make decisions that are neither intelligent nor just. Furthermore, when these biases are expressed as hate speech and offensive text, they become painful for specific communities. While some of these biases are implicit, most explicit biases can be found in the form of hate speech and offensive texts.
The use of hate speech incites violence and
sometimes leads to societal and political instability.
The BLM (Black Lives Matter) movement is a consequence of one such bias in America. So, to address
these biases, we must first identify them. While
the concepts of Social Bias and Hate Speech may
appear to be the same, there are subtle differences.
This paper expands on the above ideas and proposes a new dataset, ToxicBias, for detecting social bias from toxic language datasets. The main contributions can be summarized as follows:
• To the best of our knowledge, this is the first study to extract social biases from toxic language datasets in English.
• We release a curated dataset of 5409 instances for detection of social bias, its categories, targets, and bias reasoning.
• We present methods to reduce lexical overfitting using counter-narrative data augmentation.
In the following section, we discuss various established works that are aligned with ours. Section 3 provides information about our dataset, terminology, annotation procedure, and challenges. In Section 4, we describe our experiments and results, followed by a discussion of lexical overfitting reduction via data augmentation in Section 5. Section 6 discusses the conclusion and future work.
2 Related Work
Offensive Text:
Unfortunately, offensive content poses some unique challenges to researchers and practitioners. First and foremost, determining what constitutes abusive/offensive behaviour is difficult. Unlike other types of malicious activity, e.g., spam or malware, the accounts carrying out this type of behaviour are usually controlled by humans, not bots (Founta et al., 2018). The term "offensive language" refers to a broad range of content, including hate speech, vulgarity, threats, cyberbullying, and other ethnic and racial insults (Kaur et al., 2021). There is no single definition of abuse, and phrases like "harassment," "abusive language," and "damaging speech" are frequently used interchangeably.
Hate Speech:
Hate speech is defined as speech that targets disadvantaged social groups in a way that may be damaging to them (Davidson et al., 2017). Fortuna and Nunes (2018) define hate speech as follows: "Hate speech is a language that attacks or diminishes, that incites violence or hate against groups, based on specific characteristics such as physical appearance, religion, national or ethnic origin, sexual orientation, gender identity or other, and it can occur with different linguistic styles, even in subtle forms or when humor is used".
Bias in Embedding:
The initial works exploring bias in language representations aimed at detecting gender, race, and religion biases in word representations (Bolukbasi et al., 2016; Caliskan et al., 2017; Manzini et al., 2019). Some recent works have focused on bias detection from sentence representations (May et al., 2019; Kurita et al., 2019) using BERT embeddings.
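The association tests used in these works boil down to comparing how close a target word's vector lies to two sets of attribute vectors. Below is a minimal, illustrative sketch of that idea in Python, using toy random vectors in place of real pretrained embeddings; the word lists, vectors, and helper names are our own assumptions, not taken from the cited papers.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association_score(word_vec, attr_a, attr_b):
    # WEAT-style association: mean similarity to attribute set A minus
    # mean similarity to attribute set B. Positive values mean the word
    # leans towards A, negative towards B.
    sim_a = np.mean([cosine(word_vec, v) for v in attr_a])
    sim_b = np.mean([cosine(word_vec, v) for v in attr_b])
    return sim_a - sim_b

# Toy 4-dimensional "embeddings" standing in for real word vectors.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in
       ["doctor", "nurse", "he", "she", "man", "woman"]}

male_attrs = [emb["he"], emb["man"]]
female_attrs = [emb["she"], emb["woman"]]

for occupation in ["doctor", "nurse"]:
    score = association_score(emb[occupation], male_attrs, female_attrs)
    print(f"{occupation}: male-vs-female association = {score:+.3f}")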
In addition, there have been many notable efforts towards the detection of data bias in hate speech and offensive language (Waseem and Hovy, 2016; Davidson et al., 2019; Sap et al., 2019; Mozafari et al., 2020). Borkan et al. (2019) discussed the presence of unintended bias in hate speech detection models for identity terms like islam, lesbian, bisexual, etc. The biased association of different marginalized groups is still a major challenge for models trained for toxic language detection (Kim et al., 2020; Xia et al., 2020). This is mainly due to bias in the annotated data, which creates wrong associations of many lexical features with specific labels (Dixon et al., 2018). Lack of social context about the post creator also affects the annotation process, leading to bias against certain communities in the dataset (Sap et al., 2019).
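One simple way to surface such unintended bias, in the spirit of the per-identity metrics of Borkan et al. (2019), is to compare a classifier's false positive rate on non-toxic comments that mention an identity term with its overall false positive rate. The sketch below shows this check on synthetic predictions; the data, the chosen identity term, and the function names are illustrative assumptions, not the exact metrics of the cited work.

import pandas as pd

def false_positive_rate(df, pred_col="pred_toxic", label_col="is_toxic"):
    # Fraction of truly non-toxic comments that the model flags as toxic.
    non_toxic = df[df[label_col] == 0]
    if len(non_toxic) == 0:
        return float("nan")
    return float(non_toxic[pred_col].mean())

# Synthetic predictions; the model wrongly flags every comment that
# mentions an identity term, illustrating a spurious lexical association.
data = pd.DataFrame({
    "text": ["have a nice day", "i am a proud muslim", "you are awful",
             "muslims celebrate eid", "great weather today"],
    "is_toxic":   [0, 0, 1, 0, 0],   # gold labels
    "pred_toxic": [0, 1, 1, 1, 0],   # hypothetical model outputs
})
data["mentions_identity"] = data["text"].str.contains("muslim")

overall_fpr = false_positive_rate(data)                               # 0.50
subgroup_fpr = false_positive_rate(data[data["mentions_identity"]])   # 1.00
print(f"overall FPR: {overall_fpr:.2f}, identity-subgroup FPR: {subgroup_fpr:.2f}")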
Social bias datasets:
More recently, many datasets (Nadeem et al., 2021; Nangia et al., 2020) have been created to measure and detect social biases related to gender, race, profession, religion, age, etc. However, Blodgett et al. (2021) reported that many of these datasets lack clear definitions and have ambiguities and inconsistencies in their annotations. A similar study was done by Sap et al. (2020), whose dataset has both categorical and free-text annotations and uses a generation framework as the core model.
There have been a few studies on data augmentation (Nozza et al., 2019; Bartl et al., 2020) aimed at decreasing the incorrect association of lexical characteristics in these datasets. Hartvigsen et al. (2022) proposed a prompt-based framework to generate a large dataset of toxic and neutral statements to reduce spurious correlations in hate speech detection.
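As a simplified illustration of this line of work (not the prompt-based generation of Hartvigsen et al. (2022), and not the counter-narrative augmentation we describe later), one can pair identity terms with neutral templates and add the resulting non-toxic examples to the training data; the templates and terms below are our own toy choices.

import itertools

# Illustrative neutral templates and identity terms (toy lists, not the paper's).
TEMPLATES = [
    "I am a {} person.",
    "My neighbour is {}.",
    "Being {} is part of my identity.",
]
IDENTITY_TERMS = ["muslim", "gay", "black", "christian", "transgender"]

def neutral_augmentation_examples():
    # Build non-toxic sentences that mention identity terms, so that a
    # classifier trained on the augmented data is less likely to associate
    # the terms themselves with the toxic label.
    return [(template.format(term), 0)          # label 0 = non-toxic
            for template, term in itertools.product(TEMPLATES, IDENTITY_TERMS)]

augmented = neutral_augmentation_examples()
print(len(augmented), "neutral examples, e.g.", augmented[0])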
However, no study has been done on detecting social biases from toxic language datasets, which is a challenging task due to the conceptual overlap between hate speech and social bias. Using a thorough guideline, we attempt to uncover harmful biases in toxic language datasets. The curated dataset is discussed at length in the next section, as are the definitions of each category label and the annotation procedure.
3 ToxicBias Dataset
We develop the manually annotated ToxicBias dataset to enable models to correctly identify social biases in a publicly available toxicity dataset. Below, we define social bias and the categories taken into account in our dataset. We then describe the comprehensive annotation process that we use for dataset acquisition.
3.1 Social Bias
People typically hold preconceptions and stereotypes about, and discriminate against, others who do not belong to their social group. Positive and negative social bias refer to a preference for or against persons or groups, respectively, based on their social identities (e.g., race, gender). Only the negative biases, however, have the capacity to harm target groups (Crawford, 2017). As a result, in our study we focus on identifying negative biases in order to prevent harmful repercussions for targeted groups. Members of specific social groups (e.g., women, Muslims, and transgender individuals) are more likely to face prejudice as a result of living in a culture that does not sufficiently support fairness. In this work, we have considered five prevalent social biases:
• Gender: Favouritism towards one gender over the other. It can be of the following types: Alpha, Beta, or Sexism (Park et al., 2018).
• Religion: Bias against individuals on the basis of religion or religious belief, e.g., Christianity, Islam, Scientology, etc. (Muralidhar, 2021).
• Race: Favouritism for a group of people having common visible physical traits, common origins, language, etc. It is related to dialect, color, appearance, and regional or societal perception (Sap et al., 2019).
• LGBTQ: Prejudice towards members of the LGBTQ community. It can be due to societal perception or physical appearance.
• Political: Prejudice against/towards individuals on the basis of their political beliefs, for example, liberals, conservatives, etc.
Categories   Targets
Political    liberal, conservative, feminist, etc.
Religion     christian, jew, hindu, atheist, etc.
Gender       men, women
LGBTQ        gay, lesbian, homosexual, etc.
Race         black, white, asian, canadians, etc.

Table 1: Bias categories and corresponding targets.
For all of these categories, the target terms denote the communities towards which the bias is directed.
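For illustration, the mapping in Table 1 can be kept as a simple lookup, for example to sanity-check annotated (category, target) pairs. The sketch below is our own and lists only the example targets from the table, not the full target vocabulary of the released dataset.

# Example targets per bias category, as listed in Table 1 (not exhaustive).
CATEGORY_TARGETS = {
    "political": ["liberal", "conservative", "feminist"],
    "religion":  ["christian", "jew", "hindu", "atheist"],
    "gender":    ["men", "women"],
    "lgbtq":     ["gay", "lesbian", "homosexual"],
    "race":      ["black", "white", "asian", "canadians"],
}

def is_consistent(category: str, target: str) -> bool:
    # True if the annotated target appears among the example targets for
    # the given category; unknown targets simply fail the check.
    return target.lower() in CATEGORY_TARGETS.get(category.lower(), [])

print(is_consistent("Religion", "hindu"))    # True
print(is_consistent("Gender", "liberal"))    # False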
3.2 Social Bias vs. Hate Speech
While social bias and hate speech may appear the same at first glance, they are not; the differences between them are quite subtle. While hate speech is always associated with negative sentiment, social bias can also carry positive sentiment. Social bias is a preconceived belief towards or against specific social identities, whereas hate speech is an explicit comment expressing hatred against an individual or a group. Not all hate speech is biased, and not all biased speech is hate speech. We use the following examples to demonstrate the differences:
• Some comments are merely toxic without containing any social bias, e.g., "IM FREEEEE!!!! WORST EXPERIENCE OF MY F**K-ING LIFE"
• Toxic comments can be hate speech but not necessarily biased, e.g., "you gotta be kidding. trump a Christian, nope, he is the devil, he hates blacks, Hispanics, muslims, gays, Asians, etc."
• Some comments are just biased with negative sentiment, without containing any toxicity or hate speech, e.g., "All Asian people are bad drivers."
3.3 Annotation Process
The dataset we use for annotation is collected from the Kaggle competition named "Jigsaw Unintended Bias in Toxicity Classification" (jig, 2019; Research Data, 2018). It has around two million comments from the Civil Comments platform annotated for toxicity. The data also has several other toxicity subtype attributes, such as severe_toxicity, obscene, threat, insult, identity_attack, and sexual_explicit.
We discovered that, with the exception of the identity_attack column, all of the other columns in this dataset are redundant for the social bias detection task.
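As a concrete illustration of this filtering step, the sketch below keeps only comments whose identity_attack score exceeds a threshold; the local file name and the 0.5 cut-off are assumptions made for illustration, not the exact selection criteria used to build ToxicBias.

import pandas as pd

# train.csv is the file released for the Kaggle "Jigsaw Unintended Bias in
# Toxicity Classification" competition (local path assumed for illustration).
df = pd.read_csv("train.csv")

# Keep only the columns relevant to social bias annotation.
subset = df[["id", "comment_text", "identity_attack"]]

# Candidate comments for bias annotation: high identity_attack score.
# The 0.5 threshold is an illustrative choice, not the paper's setting.
candidates = subset[subset["identity_attack"] > 0.5]

print(f"{len(candidates)} candidate comments selected for annotation")
candidates.to_csv("bias_annotation_candidates.csv", index=False)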