A Keyword Based Approach to Understanding the Overpenalization of
Marginalized Groups by English Marginal Abuse Models on Twitter

Kyra Yee, Alice Schoenauer Sebag, Olivia Redfield, Emily Sheng, Matthias Eck, Luca Belli
Twitter

Work done while at Twitter
Abstract

Content warning: contains references to offensive language

Harmful content detection models tend to have higher false positive rates for content from marginalized groups. In the context of marginal abuse modeling on Twitter, such disproportionate penalization poses the risk of reduced visibility, where marginalized communities lose the opportunity to voice their opinion on the platform. Current approaches to algorithmic harm mitigation and bias detection for NLP models are often very ad hoc and subject to human bias. We make two main contributions in this paper. First, we design a novel methodology, which provides a principled approach to detecting and measuring the severity of potential harms associated with a text-based model. Second, we apply our methodology to audit Twitter's English marginal abuse model, which is used for removing amplification eligibility of marginally abusive content. Without utilizing demographic labels or dialect classifiers, we are still able to detect and measure the severity of issues related to the over-penalization of the speech of marginalized communities, such as the use of reclaimed speech, counterspeech, and identity-related terms. In order to mitigate the associated harms, we experiment with adding additional true negative examples and find that doing so provides improvements to our fairness metrics without large degradations in model performance.
1 Introduction
Because of the sheer volume of content, automatic content governance has been a crucial tool to avoid amplifying abusive content on Twitter. Harmful content detection models are used to reduce the amplification of harmful content online. These models are especially important to historically marginalized groups, who are more frequently the target of online harassment and hate speech (International, 2018; Vogels, 2021). However, previous research indicates that these models often have higher false positive rates for marginalized communities, such as the Black community, women, and the LGBTQ community (Sap et al., 2019; Oliva et al., 2021; Park et al., 2018). Within the context of social media, higher false positive rates for a specific subgroup pose the risk of reduced visibility, where the community loses the opportunity to voice their opinion on the platform. Unfortunately, there are many contributing factors to over-penalization, including linguistic variation, sampling bias, annotator bias, label subjectivity, and modeling decisions (Park et al., 2018; Sap et al., 2019; Wich et al., 2020; Ball-Burack et al., 2021). This type of over-penalization risks hurting the very communities content governance is meant to protect. Algorithmic audits have become an important tool to surface these types of problems. However, determining the proper subgroups for analysis in global settings, and collecting high quality demographic information, can be extremely challenging and pose the risk of misuse (Andrus et al., 2021; Holstein et al., 2019). Current approaches to harm mitigation are often reactive and subject to human bias (Holstein et al., 2019). In this work, we present a more principled and proactive approach to detecting and measuring the severity of potential harms associated with a text-based model, and conduct an audit of one of the English marginal abuse models used by Twitter for preventing potentially harmful out-of-network recommendations. We develop a list of keywords for evaluation by analyzing the text of previous false positives to understand trends in the model's errors. This allows us to alleviate concerns of false positive bias in content concerning or created by marginalized groups without using demographic data.
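The keyword-derivation procedure is developed in detail later in the paper; as a rough, non-authoritative illustration of the idea, the sketch below surfaces candidate keywords by comparing token frequencies in previously observed false positives against correctly scored non-abusive Tweets, using a smoothed log-odds ratio. The tokenizer, smoothing constant, and count threshold are our own assumptions, not Twitter's implementation.

```python
from collections import Counter
import math
import re

def tokenize(text):
    """Lowercase word tokenizer; a production system would use a proper tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def candidate_keywords(false_positives, true_negatives, alpha=0.5, min_count=20):
    """Rank tokens over-represented in false positives relative to true negatives,
    using a smoothed log-odds ratio. Returns (token, score) pairs, highest first."""
    fp_counts = Counter(t for text in false_positives for t in tokenize(text))
    tn_counts = Counter(t for text in true_negatives for t in tokenize(text))
    fp_total, tn_total = sum(fp_counts.values()), sum(tn_counts.values())
    vocab_size = len(set(fp_counts) | set(tn_counts))

    scores = {}
    for tok, fp_c in fp_counts.items():
        if fp_c < min_count:
            continue
        tn_c = tn_counts.get(tok, 0)
        p_fp = (fp_c + alpha) / (fp_total + alpha * vocab_size)
        p_tn = (tn_c + alpha) / (tn_total + alpha * vocab_size)
        scores[tok] = math.log(p_fp / p_tn)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage: ranked = candidate_keywords(fp_tweets, tn_tweets)
# Manually review ranked[:50] for identity terms, reclaimed speech, counterspeech, etc.
```

The point of such a ranking is only to propose candidates; as the paper emphasizes, the surfaced terms still need human review to decide which error trends matter.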
2 Related Work
2.1 Challenges in Algorithmic Auditing in Industry
As issues of algorithmic bias have become more prominent, algorithmic auditing has received increasing attention both in academia and by industry practitioners (Yee et al., 2021; Raji et al., 2020; Buolamwini and Gebru, 2018). However, substantial challenges remain to proactively detecting and mitigating problems:

1. Determining the appropriate subgroups for bias analysis: Although algorithmic auditing has become a crucial tool to uncover issues of bias in algorithmic systems, audits can often suffer major blindspots and fail to uncover crucial problems that are not caught until after deployment or public outcry (Shen et al., 2021; Holstein et al., 2019; Yee et al., 2021). This is often due to limited positionality and cultural blindspots of the auditors involved, or sociotechnical considerations that are difficult to anticipate before the system is deployed (Shen et al., 2021; Holstein et al., 2019). Current approaches to bias detection often rely on predetermining an axis of injustice and acquiring demographic data, or, for NLP models, pre-defining a lexicon of terms that are relevant to different subgroups (Dixon et al., 2018; Ghosh et al., 2021; Sap et al., 2019). Without domain expertise and nuanced local cultural knowledge, it may be difficult to anticipate problems or to know what relevant categories or combinations of categories should be focused on (Andrus et al., 2021; Holstein et al., 2019). For products such as Twitter that have global reach, this problem is exacerbated due to the huge amount of cultural and demographic diversity globally, and "efforts to recruit more diverse teams may be helpful yet insufficient" (Holstein et al., 2019). Even in cases where audits are conducted proactively, inquiries into problem areas are often subject to human bias. Biases in non-Western contexts are also frequently overlooked (Sambasivan et al., 2021).
2. Sensitivity of demographic data: Most metrics used to measure disparate impact of algorithmic systems rely on demographic information (Barocas et al., 2017; Narayanan, 2018); a minimal example of such a metric is sketched after this list. However, in industry settings, high quality demographic information can be difficult to procure (Andrus et al., 2021). Additionally, many scholars have called into question harms associated with the uncritical conceptualization of demographic traits such as gender, race, and disability (Hanna et al., 2020; Keyes, 2018; Hamidi et al., 2018; Khan and Fu, 2021; Hu and Kohler-Hausmann, 2020; Bennett and Keyes, 2020). There are fundamental concerns that the use of demographic data poses the risk of naturalizing or essentializing socially constructed categories (Benthall and Haynes, 2019; Hanna et al., 2020; Fields and Fields, 2014; Keyes, 2018). Lastly, in industry settings, clients or users may be uncomfortable with organizations collecting or inferring sensitive information about them due to misuse or privacy concerns (Andrus et al., 2021). Additionally, inferring demographic information may pose dignitary concerns or risks of stereotyping (Keyes, 2018; Hamidi et al., 2018; Andrus et al., 2021). Despite these risks and limitations, this is not to suggest that demographic data should never be used. Demographic data can certainly be appropriate and even necessary for addressing fairness-related concerns in many cases. However, because of the challenges discussed here, there is increasing interest in developing strategies to detect and mitigate bias without demographic labels (Benthall and Haynes, 2019; Lazovich et al., 2022; Rios, 2020).
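For contrast with the label-free approach developed in this paper, the snippet below shows the kind of computation that standard disparate-impact metrics require: a per-group false positive rate over non-abusive content, which presupposes a demographic label for every example. This is a minimal sketch; the dataframe columns (true_label, predicted_label, demographic_group) are hypothetical.

```python
import pandas as pd

def fpr_by_group(df: pd.DataFrame, group_col: str = "demographic_group") -> pd.Series:
    """False positive rate per group: share of non-abusive examples (true_label == 0)
    that the model flags (predicted_label == 1). Requires a demographic label column,
    which is exactly what is often unavailable or risky to collect in industry settings."""
    negatives = df[df["true_label"] == 0]
    return negatives.groupby(group_col)["predicted_label"].mean()

# fpr = fpr_by_group(eval_df)
# print(fpr, fpr.max() - fpr.min())  # gap between the most- and least-penalized groups
```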
2.2 Bias in automated content governance
One key challenge in quantifying bias in machine learning systems is the lack of a universal formalized notion of fairness; rather, different fairness metrics imply different normative values and have different appropriate use cases and limitations (Narayanan, 2018; Barocas et al., 2017). For the purposes of this study, we are primarily concerned with false positive bias in marginal abuse modeling. Previous research indicates that models used to detect harmful content often have higher false positive rates for content about and produced by marginalized groups. Previous work has demonstrated this can happen for several reasons. Because they appear more frequently in abusive comments than in non-abusive ones, identity terms such as "muslim" and "gay", as well as terms associated with disability (Hutchinson et al., 2020) and gender (Park et al., 2018; Borkan et al., 2019), exhibit false positive bias (Dixon et al., 2018; Borkan et al., 2019). Research also indicates that annotator bias against content written in AAVE (African-American Vernacular English) is likely a contributing factor to model bias against the Black community (Sap et al., 2019; Ball-Burack et al., 2021; Halevy et al., 2021). Harris et al. (2022) find evidence that the use of profanity and different word choice conventions are a stronger contributor to bias against AAVE than other grammatical features of AAVE. Counterspeech (Haimson et al., 2021) and reclaimed speech (Halevy et al., 2021; Sap et al., 2019) from marginalized communities are also commonly penalized by models. In summary, false positive bias on social media is a type of representational harm, where content concerning marginalized communities (in the case of counterspeech or identity terms) or produced by marginalized communities (in the case of dialect bias or reclaimed speech) receives less amplification than other content. This can also lead to downstream allocative harms, such as fewer impressions or followers for content creators.
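One common way to quantify this kind of false positive bias without group labels, in the spirit of Dixon et al. (2018), is to compare the false positive rate on non-abusive Tweets containing a given identity or reclaimed term against the overall false positive rate on non-abusive Tweets. A minimal sketch, with hypothetical column names:

```python
import pandas as pd

def keyword_fpr_gap(df: pd.DataFrame, keyword: str) -> dict:
    """Compare FPR on non-abusive Tweets containing `keyword` vs. all non-abusive Tweets.
    df needs: 'text', 'true_label' (1 = abusive), 'predicted_label' (1 = flagged)."""
    negatives = df[df["true_label"] == 0]
    has_kw = negatives["text"].str.contains(keyword, case=False, regex=False)
    overall_fpr = negatives["predicted_label"].mean()
    keyword_fpr = negatives.loc[has_kw, "predicted_label"].mean()
    return {
        "keyword": keyword,
        "keyword_fpr": keyword_fpr,
        "overall_fpr": overall_fpr,
        "fpr_ratio": keyword_fpr / overall_fpr if overall_fpr > 0 else float("nan"),
    }

# for kw in ["gay", "muslim", "disabled"]:
#     print(keyword_fpr_gap(eval_df, kw))
```

A ratio well above 1 for a term indicates that benign content mentioning it is disproportionately flagged, which is the pattern the studies above report for identity terms, counterspeech, and reclaimed speech.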
Determining what counts as harmful is an inherently subjective task, which poses challenges for equitable content governance. The operationalization of abstract theoretical constructs into observable properties is frequently the source of many fairness-related harms (Jacobs and Wallach, 2021). Annotators' country of origin (Salminen et al., 2018), socio-demographic traits (Prabhakaran et al., 2021; Goyal et al., 2022), political views (Waseem, 2016) and lived experiences (Waseem, 2016; Prabhakaran et al., 2021) can affect their interpretations. Hate speech annotations have notoriously low inter-annotator agreement, suggesting that increasing the quality and detail of annotation guidelines is crucial for improving predictions (Ross et al., 2017). This problem is exacerbated for borderline content, as inter-annotator agreement tends to be lower for content that was deemed moderately hateful in comparison with content rated as more severely hateful (Salminen et al., 2019).
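As a concrete illustration of what "low inter-annotator agreement" means in practice (not the specific statistic used in the studies cited above), the sketch below computes average pairwise agreement over multi-annotator labels of the kind described in Section 3.1 (five annotators per Tweet), together with the majority-vote label typically used as ground truth:

```python
from collections import Counter
from itertools import combinations

def majority_vote(labels):
    """Most common label among annotators (ties broken arbitrarily by Counter)."""
    return Counter(labels).most_common(1)[0][0]

def avg_pairwise_agreement(annotations):
    """annotations: list of per-item label lists, e.g. [[1, 1, 0, 1, 0], ...].
    Returns the fraction of annotator pairs that agree, averaged over items."""
    per_item = []
    for labels in annotations:
        pairs = list(combinations(labels, 2))
        per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_item) / len(per_item)

# Toy example: borderline content tends to show more disagreement than severe content.
severe = [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]
borderline = [[1, 0, 1, 0, 0], [0, 1, 0, 1, 1]]
print(avg_pairwise_agreement(severe), avg_pairwise_agreement(borderline))  # 0.8 0.4
print([majority_vote(x) for x in borderline])  # majority labels despite disagreement
```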
3 Methodology
3.1 English marginal abuse modeling at Twitter
While Twitter does remove content that violates rules on abusive behavior and hateful conduct, content that falls into the margins (known as "marginal abuse") often stays on the platform and risks posing harm to some users.

training set             abusive    non-abusive    overall
FDR                       39,018         89,050    128,068
prevalence                 8,175        378,415    386,590
baseline model total      47,193        467,465    514,658
mitigation sample          7,987         36,039     46,414
mitigated model total     55,180        503,504    561,072
Test set (Table 3)           916         20,770     21,686

Table 1: Size of the training data for the baseline model and mitigated model, split by sampling type. The baseline model is trained only on the FDR and prevalence samples, whereas the mitigated model also includes the mitigation sample.
Twitter uses a machine learning model for English to try to prevent marginally abusive content from being recommended to users who do not follow the author of such content. The model is trained to predict whether or not a Tweet qualifies as one of the following content types:¹ advocate for violence, dehumanization or incitement of fear, sexual harassment, allegation of criminal behavior, advocates for other consequences (e.g., job loss or imprisonment), malicious cursing/profanity/slurs, claims of mental inferiority, claims of moral inferiority, other insult.

Twitter regularly samples Tweets in English to be reviewed by human annotators for whether or not they fall into one of the content categories listed above, and these annotations are used as ground-truth labels to train the marginal abuse model. Each Tweet sampled for human annotation is reviewed by 5 separate annotators, and the majority vote label is used. The training and evaluation data Twitter uses for the marginal abuse model is primarily sampled via two mechanisms: FDR (false discovery rate) sampling and prevalence-based sampling. Prevalence-based sampling is random sampling weighted by how many times a Tweet was viewed, and is generally used to measure the prevalence of marginally abusive content being viewed on the platform. In contrast, FDR sampling selects Tweets that have a high predicted marginal abuse score (using the current marginal abuse model in production) or a high probability of being reported. This helps collect marginally abusive
¹ While they are collected, labels from the following categories are not subject to de-amplification: allegation of criminal behavior, claims of moral inferiority, advocates for other consequences, and other insult.
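A minimal sketch of the two sampling mechanisms described above, assuming a dataframe of candidate Tweets with view counts, production model scores, and report probabilities. The column names and thresholds are hypothetical; Twitter's actual pipeline is not specified at this level of detail in the paper.

```python
import numpy as np
import pandas as pd

def prevalence_sample(tweets: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Random sample weighted by impression (view) count, approximating what a
    randomly viewed Tweet on the platform looks like."""
    rng = np.random.default_rng(seed)
    p = tweets["view_count"] / tweets["view_count"].sum()
    idx = rng.choice(tweets.index.to_numpy(), size=n, replace=False, p=p.to_numpy())
    return tweets.loc[idx]

def fdr_sample(tweets: pd.DataFrame, n: int,
               score_threshold: float = 0.8, report_threshold: float = 0.5) -> pd.DataFrame:
    """Sample from Tweets the production model already scores as likely marginal abuse,
    or that are likely to be reported, to enrich the annotation pool with positives."""
    candidates = tweets[
        (tweets["marginal_abuse_score"] >= score_threshold)
        | (tweets["report_probability"] >= report_threshold)
    ]
    return candidates.sample(n=min(n, len(candidates)), random_state=0)
```

In this reading, prevalence sampling supports unbiased measurement of how much marginally abusive content is actually seen, while FDR sampling concentrates annotation effort where positives are likely.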