
et al., 2018; Borkan et al., 2019), exhibit false positive bias (Dixon et al., 2018; Borkan et al., 2019).
Research also indicates that annotator bias against content written in AAVE (African-American Vernacular English) likely contributes to model bias against the Black community (Sap et al., 2019; Ball-Burack et al., 2021; Halevy et al., 2021). Harris et al. (2022) find evidence that the use of profanity and differing word choice conventions contribute more strongly to bias against AAVE than other grammatical features of the dialect.
Counterspeech (Haimson et al., 2021) and reclaimed speech (Halevy et al., 2021; Sap et al., 2019) from marginalized communities are also commonly penalized by models. In summary, false positive bias on social media is a type of representational harm, in which content that either concerns marginalized communities (in the case of counterspeech or identity terms) or is produced by marginalized communities (in the case of dialect bias or reclaimed speech) receives less amplification than other content. This can also lead to downstream allocative harms, such as fewer impressions or followers for content creators.
Determining what counts as harmful is an inherently subjective task, which poses challenges for equitable content governance. The operationalization of abstract theoretical constructs into observable properties is frequently the source of many fairness-related harms (Jacobs and Wallach, 2021). Annotators' country of origin (Salminen et al., 2018), socio-demographic traits (Prabhakaran et al., 2021; Goyal et al., 2022), political views (Waseem, 2016), and lived experiences (Waseem, 2016; Prabhakaran et al., 2021) can affect their interpretations. Hate speech annotations have notoriously low inter-annotator agreement, suggesting that increasing the quality and detail of annotation guidelines is crucial for improving predictions (Ross et al., 2017). This problem is exacerbated for borderline content, as inter-annotator agreement tends to be lower for content deemed moderately hateful than for content rated as more severely hateful (Salminen et al., 2019).
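Since inter-annotator agreement figures so centrally in this discussion, the toy sketch below shows how a chance-corrected agreement statistic (Fleiss' kappa here) is computed over multi-annotator labels; the cited studies do not necessarily use this exact statistic, and the labels below are invented purely for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for N items, each rated by the same number of annotators.

    ratings[i] holds the category labels the annotators assigned to item i.
    """
    n = len(ratings[0])                                   # annotators per item
    categories = sorted({c for item in ratings for c in item})
    counts = [Counter(item) for item in ratings]          # per-item category counts
    # observed agreement, averaged over items
    p_i = [(sum(c[cat] ** 2 for cat in categories) - n) / (n * (n - 1)) for c in counts]
    p_bar = sum(p_i) / len(ratings)
    # expected (chance) agreement from the overall category proportions
    p_j = [sum(c[cat] for c in counts) / (len(ratings) * n) for cat in categories]
    p_e = sum(p ** 2 for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Toy data: 4 posts, 5 annotators each; the borderline posts split the annotators.
toy = [
    ["hateful", "hateful", "not", "hateful", "not"],      # moderately hateful, low agreement
    ["not", "not", "not", "not", "not"],
    ["hateful", "hateful", "hateful", "hateful", "hateful"],
    ["not", "hateful", "not", "not", "hateful"],          # moderately hateful, low agreement
]
print(round(fleiss_kappa(toy), 3))  # 0.4 here; values near 0 mean agreement barely above chance
```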
3 Methodology
3.1 English marginal abuse modeling at Twitter
While Twitter does remove content that violates rules on abusive behavior and hateful conduct, content that falls into the margins (known as "marginal abuse") often stays on the platform and risks posing harm to some users.

training set             abusive    non-abusive    overall
FDR                       39,018         89,050    128,068
prevalence                 8,175        378,415    386,590
baseline model total      47,193        467,465    514,658
mitigation sample          7,987         36,039     46,414
mitigated model total     55,180        503,504    561,072
test set (Table 3)           916         20,770     21,686

Table 1: Size of the training data for the baseline model and the mitigated model, split by sampling type. The baseline model is trained only on the FDR and prevalence samples, whereas the mitigated model also includes the mitigation sample.
Twitter uses a machine learning model for English to try to prevent marginally abusive content from being recommended to users who do not follow the author of such content. The model is trained to predict whether or not a Tweet qualifies as one of the following content types¹: advocate for violence, dehumanization or incitement of fear, sexual harassment, allegation of criminal behavior, advocates for other consequences (e.g., job loss or imprisonment), malicious cursing/profanity/slurs, claims of mental inferiority, claims of moral inferiority, or other insult.
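As a concrete illustration of the labeling scheme above, the sketch below shows one plausible way to map a Tweet's annotated categories to a binary marginal-abuse training target and to flag which categories trigger de-amplification (per the footnote at the end of this subsection); the constants and function names are hypothetical, not Twitter's internal code.

```python
# Hypothetical encoding of the annotation schema described above; names and
# structure are illustrative assumptions, not Twitter's production code.
MARGINAL_ABUSE_CATEGORIES = {
    "advocate_for_violence",
    "dehumanization_or_incitement_of_fear",
    "sexual_harassment",
    "allegation_of_criminal_behavior",
    "advocates_for_other_consequences",   # e.g., job loss or imprisonment
    "malicious_cursing_profanity_slurs",
    "claims_of_mental_inferiority",
    "claims_of_moral_inferiority",
    "other_insult",
}

# Per the footnote: these labels are collected but not subject to de-amplification.
NON_DEAMPLIFIED_CATEGORIES = {
    "allegation_of_criminal_behavior",
    "claims_of_moral_inferiority",
    "advocates_for_other_consequences",
    "other_insult",
}


def to_training_label(annotated: set[str]) -> int:
    """Binary target: 1 if the Tweet was labeled with any of the content types."""
    return int(bool(annotated & MARGINAL_ABUSE_CATEGORIES))


def triggers_deamplification(annotated: set[str]) -> bool:
    """True only if at least one labeled category is subject to de-amplification."""
    return bool(annotated & (MARGINAL_ABUSE_CATEGORIES - NON_DEAMPLIFIED_CATEGORIES))


# Example: an "other insult" label contributes to the training target (1) but,
# per the footnote, does not by itself trigger de-amplification (False).
print(to_training_label({"other_insult"}), triggers_deamplification({"other_insult"}))
```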
Twitter regularly samples Tweets in English to be reviewed by human annotators for whether or not they fall into one of the content categories listed above, and these annotations are used as ground-truth labels to train the marginal abuse model. Each Tweet sampled for human annotation is reviewed by 5 separate annotators and the majority-vote label is used. The training and evaluation data Twitter uses for the marginal abuse model is primarily sampled via two mechanisms: FDR (false discovery rate) sampling and prevalence-based sampling.
Prevalence-based sampling is random sampling weighted by the number of times a Tweet was viewed, and is generally used to measure the prevalence of marginally abusive content being viewed on the platform. In contrast, FDR sampling selects Tweets that have a high predicted marginal abuse score (using the current marginal abuse model in production) or a high probability of being reported. This helps collect marginally abu-
¹ While they are collected, labels from the following categories are not subject to de-amplification: allegation of criminal behavior, claims of moral inferiority, advocates for other consequences, and other insult.
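To tie the pieces of this subsection together, here is a minimal sketch of the majority-vote label aggregation and of the two sampling mechanisms; the Tweet fields, thresholds, and function names are illustrative assumptions rather than the production pipeline.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    impressions: int           # number of times the Tweet was viewed
    model_score: float         # predicted marginal abuse score from the current production model
    report_probability: float  # predicted probability of being reported


def majority_vote(labels: list[int]) -> int:
    """Aggregate the 5 annotator judgments (1 = marginally abusive) by majority vote."""
    return Counter(labels).most_common(1)[0][0]


def prevalence_sample(tweets: list[Tweet], k: int) -> list[Tweet]:
    """Random sampling weighted by view counts, used to measure how much
    marginally abusive content is being viewed on the platform."""
    weights = [t.impressions for t in tweets]
    return random.choices(tweets, weights=weights, k=k)


def fdr_sample(tweets: list[Tweet], k: int,
               score_threshold: float = 0.8,
               report_threshold: float = 0.5) -> list[Tweet]:
    """Sample Tweets with a high predicted marginal abuse score or a high
    probability of being reported; the thresholds are arbitrary placeholders."""
    candidates = [t for t in tweets
                  if t.model_score >= score_threshold
                  or t.report_probability >= report_threshold]
    return random.sample(candidates, min(k, len(candidates)))
```

In this sketch, prevalence sampling preserves a view-weighted picture of what users actually encounter, while FDR sampling concentrates annotation effort on Tweets the current system already considers likely to be abusive or reportable.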