
et al., 2018; Borkan et al., 2019), exhibit false positive bias (Dixon et al., 2018; Borkan et al., 2019).
Research also indicates that annotator bias against content written in AAVE (African-American Vernacular English) likely contributes to model bias against the Black community (Sap et al., 2019; Ball-Burack et al., 2021; Halevy et al., 2021). Harris et al. (2022) find evidence that the use of profanity and differing word choice conventions contribute more strongly to bias against AAVE than other grammatical features of the dialect.
Counterspeech (Haimson et al., 2021) and reclaimed speech (Halevy et al., 2021; Sap et al., 2019) from marginalized communities are also commonly penalized by models. In summary, false positive bias on social media is a type of representational harm, in which content that either concerns marginalized communities (in the case of counterspeech or identity terms) or is produced by marginalized communities (in the case of dialect bias or reclaimed speech) receives less amplification than other content. This can also lead to downstream allocative harms, such as fewer impressions or followers for content creators.
Determining what counts as harmful is an inherently subjective task, which poses challenges for equitable content governance. The operationalization of abstract theoretical constructs into observable properties is frequently the source of many fairness-related harms (Jacobs and Wallach, 2021). Annotators' country of origin (Salminen et al., 2018), socio-demographic traits (Prabhakaran et al., 2021; Goyal et al., 2022), political views (Waseem, 2016), and lived experiences (Waseem, 2016; Prabhakaran et al., 2021) can affect their interpretations. Hate speech annotations have notoriously low inter-annotator agreement, suggesting that increasing the quality and detail of annotation guidelines is crucial for improving predictions (Ross et al., 2017). This problem is exacerbated for borderline content, as inter-annotator agreement tends to be lower for content deemed moderately hateful than for content rated as more severely hateful (Salminen et al., 2019).
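Since inter-annotator agreement figures so centrally in this discussion, the toy sketch below shows how a chance-corrected agreement statistic (Fleiss' kappa here) is computed over multi-annotator labels; the cited studies do not necessarily use this exact statistic, and the labels below are invented purely for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for N items, each rated by the same number of annotators.

    ratings[i] holds the category labels the annotators assigned to item i.
    """
    n = len(ratings[0])                                   # annotators per item
    categories = sorted({c for item in ratings for c in item})
    counts = [Counter(item) for item in ratings]          # per-item category counts
    # observed agreement, averaged over items
    p_i = [(sum(c[cat] ** 2 for cat in categories) - n) / (n * (n - 1)) for c in counts]
    p_bar = sum(p_i) / len(ratings)
    # expected (chance) agreement from the overall category proportions
    p_j = [sum(c[cat] for c in counts) / (len(ratings) * n) for cat in categories]
    p_e = sum(p ** 2 for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Toy data: 4 posts, 5 annotators each; the borderline posts split the annotators.
toy = [
    ["hateful", "hateful", "not", "hateful", "not"],      # moderately hateful, low agreement
    ["not", "not", "not", "not", "not"],
    ["hateful", "hateful", "hateful", "hateful", "hateful"],
    ["not", "hateful", "not", "not", "hateful"],          # moderately hateful, low agreement
]
print(round(fleiss_kappa(toy), 3))  # 0.4 here; values near 0 mean agreement barely above chance
```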
3 Methodology
3.1 English marginal abuse modeling at Twitter
While Twitter does remove content that violates rules on abusive behavior and hateful conduct, content that falls into the margins (known as "marginal abuse") often stays on the platform and risks posing harm to some users.

training set             abusive    non-abusive    overall
FDR                       39,018         89,050    128,068
prevalence                 8,175        378,415    386,590
baseline model total      47,193        467,465    514,658
mitigation sample          7,987         36,039     46,414
mitigated model total     55,180        503,504    561,072
test set (Table 3)           916         20,770     21,686

Table 1: Size of the training data for the baseline model and the mitigated model, split by sampling type. The baseline model is trained only on the FDR and prevalence samples, whereas the mitigated model also includes the mitigation sample.
Twitter uses a machine learning model for English to try to prevent marginally abusive content from being recommended to users who do not follow the author of such content. The model is trained to predict whether or not a Tweet qualifies as one of the following content types¹: advocate for violence, dehumanization or incitement of fear, sexual harassment, allegation of criminal behavior, advocates for other consequences (e.g., job loss or imprisonment), malicious cursing/profanity/slurs, claims of mental inferiority, claims of moral inferiority, or other insult.
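As a concrete illustration of the labeling scheme above, the sketch below shows one plausible way to map a Tweet's annotated categories to a binary marginal-abuse training target and to flag which categories trigger de-amplification (per the footnote at the end of this subsection); the constants and function names are hypothetical, not Twitter's internal code.

```python
# Hypothetical encoding of the annotation schema described above; names and
# structure are illustrative assumptions, not Twitter's production code.
MARGINAL_ABUSE_CATEGORIES = {
    "advocate_for_violence",
    "dehumanization_or_incitement_of_fear",
    "sexual_harassment",
    "allegation_of_criminal_behavior",
    "advocates_for_other_consequences",   # e.g., job loss or imprisonment
    "malicious_cursing_profanity_slurs",
    "claims_of_mental_inferiority",
    "claims_of_moral_inferiority",
    "other_insult",
}

# Per the footnote: these labels are collected but not subject to de-amplification.
NON_DEAMPLIFIED_CATEGORIES = {
    "allegation_of_criminal_behavior",
    "claims_of_moral_inferiority",
    "advocates_for_other_consequences",
    "other_insult",
}


def to_training_label(annotated: set[str]) -> int:
    """Binary target: 1 if the Tweet was labeled with any of the content types."""
    return int(bool(annotated & MARGINAL_ABUSE_CATEGORIES))


def triggers_deamplification(annotated: set[str]) -> bool:
    """True only if at least one labeled category is subject to de-amplification."""
    return bool(annotated & (MARGINAL_ABUSE_CATEGORIES - NON_DEAMPLIFIED_CATEGORIES))


# Example: an "other insult" label contributes to the training target (1) but,
# per the footnote, does not by itself trigger de-amplification (False).
print(to_training_label({"other_insult"}), triggers_deamplification({"other_insult"}))
```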
Twitter regularly samples Tweets in English to be reviewed by human annotators for whether or not they fall into one of the content categories listed above, and these annotations are used as ground-truth labels to train the marginal abuse model. Each Tweet sampled for human annotation is reviewed by 5 separate annotators and the majority-vote label is used. The training and evaluation data Twitter uses for the marginal abuse model is primarily sampled via two mechanisms: FDR (false discovery rate) sampling and prevalence-based sampling.
Prevalence-based sampling is random sampling weighted by the number of times a Tweet was viewed, and is generally used to measure the prevalence of marginally abusive content being viewed on the platform. In contrast, FDR sampling selects Tweets that have a high predicted marginal abuse score (using the current marginal abuse model in production) or a high probability of being reported. This helps collect marginally abu-
¹ While they are collected, labels from the following categories are not subject to de-amplification: allegation of criminal behavior, claims of moral inferiority, advocates for other consequences, and other insult.
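To tie the pieces of this subsection together, here is a minimal sketch of the majority-vote label aggregation and of the two sampling mechanisms; the Tweet fields, thresholds, and function names are illustrative assumptions rather than the production pipeline.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    impressions: int           # number of times the Tweet was viewed
    model_score: float         # predicted marginal abuse score from the current production model
    report_probability: float  # predicted probability of being reported


def majority_vote(labels: list[int]) -> int:
    """Aggregate the 5 annotator judgments (1 = marginally abusive) by majority vote."""
    return Counter(labels).most_common(1)[0][0]


def prevalence_sample(tweets: list[Tweet], k: int) -> list[Tweet]:
    """Random sampling weighted by view counts, used to measure how much
    marginally abusive content is being viewed on the platform."""
    weights = [t.impressions for t in tweets]
    return random.choices(tweets, weights=weights, k=k)


def fdr_sample(tweets: list[Tweet], k: int,
               score_threshold: float = 0.8,
               report_threshold: float = 0.5) -> list[Tweet]:
    """Sample Tweets with a high predicted marginal abuse score or a high
    probability of being reported; the thresholds are arbitrary placeholders."""
    candidates = [t for t in tweets
                  if t.model_score >= score_threshold
                  or t.report_probability >= report_threshold]
    return random.sample(candidates, min(k, len(candidates)))
```

In this sketch, prevalence sampling preserves a view-weighted picture of what users actually encounter, while FDR sampling concentrates annotation effort on Tweets the current system already considers likely to be abusive or reportable.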