Cao et al. (2022b) focus on identifying stereotyped group-trait associations in language models by introducing a sensitivity test for measuring stereotypical associations. They compare US-based human judgements to language model stereotypes and find moderate correlations. They also extend their framework to measure language model stereotyping of intersectional identities, finding that emergent intersectional stereotypes are difficult to identify. Our work differs from theirs in that we additionally perform debiasing informed by the SCM.
Overall, our methodology differs from most other contributions in this field in that it targets social bias specifically, and we propose a fine-tuning debiasing approach that requires few human or computational resources and is not limited to a small number of demographic groups.
3 Data sets and tasks
3.1 Data for Debiasing Procedure
3.1.1 Identity terms (targets)
We established two sets of identity terms (targets)
for use with the context debiasing algorithm. The
first set relates to racial bias (bias against people of colour based on their (perceived) race). BERT has been shown to demonstrate racial bias in both intrinsic (Guo and Caliskan, 2021) and extrinsic measures (Nadeem et al., 2021; Sheng et al., 2019). To reduce bias against Black people relative to white people, we created a list of 20 African American (AA) and 20 European American (EA) names (10 male and 10 female for each group) for use in the debiasing procedure. We used names from Guo and Caliskan (2021) (excluding any included in the CEAT tests we deploy; see Section 3.2) and supplemented these lists with common names from a database of US first names (Tzioumis, 2018). Excluding names from the CEAT tests was crucial to
ensure a reduction in bias was due to a restructur-
ing of the embedding space and an overall change
in how Black individuals were represented, and not
due to bias reduction for the specific names we ran
the debiasing procedure with.
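The separation between debiasing targets and evaluation names can be sketched as a simple filtering step. The names below are illustrative placeholders, not the actual lists from Guo and Caliskan (2021) or the CEAT tests.

```python
# Hypothetical names for illustration only; the real lists contain
# 10 names per gender per group, drawn from Guo and Caliskan (2021)
# and Tzioumis (2018).
ceat_names = {"Aisha", "Ebony"}  # reserved for the CEAT evaluation
candidate_names = ["Aisha", "Keisha", "Latoya", "Tamika"]

# Keep only candidates that never appear in the evaluation tests, so any
# measured bias reduction reflects a restructured embedding space rather
# than memorisation of the specific debiasing names.
debias_targets = [n for n in candidate_names if n not in ceat_names]
print(debias_targets)  # → ["Keisha", "Latoya", "Tamika"]
```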
The second set relates to intersectional bias against Mexican American (MA) women, that is, bias against women based on both patriarchal beliefs about their gender and prejudice against their ethnicity. This intersectional bias is evident in the contextualised embeddings BERT produces (Guo and Caliskan, 2021). To reduce bias against MA women relative to white men, we additionally took 10 common Hispanic female names from Tzioumis (2018), manually confirming via a Google search that each is used within the Mexican American community.
The validity of using names to represent demo-
graphic groups has been questioned (Blodgett et al.,
2021). However, we assume that reducing bias
present in the representations of these names will
go some way to reducing racial bias in the model.
3.1.2 Stereotype Content terms (attributes)
Following Fraser et al. (2021), we use the Stereotype Content terms from Nicolas et al. (2021), whereby high-morality and high-sociability terms are taken to indicate warmth; low-morality and low-sociability terms, coldness; high-ability and high-agency terms, competence; and low-ability and low-agency terms, incompetence. We selected the 32 most frequent terms from each list (as measured on the Brown Corpus using the NLTK toolkit) to increase the likelihood of finding a large number of example sentences for each. During fine-tuning, we wish these terms to maintain their projection along the warmth/coldness or competence/incompetence direction, respectively, whilst removing the projection along these directions for the target terms (see Section 4 and Figure 1).
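The frequency-based selection of attribute terms can be sketched as below. The term list and corpus tokens are toy placeholders; in the paper's setting, the tokens would come from the Brown Corpus via NLTK (`nltk.corpus.brown.words()`), omitted here to keep the sketch self-contained.

```python
from collections import Counter

# Hypothetical warmth (high-morality/high-sociability) terms; the real
# lists come from Nicolas et al. (2021) and are much longer.
warmth_terms = ["friendly", "honest", "kind", "warm", "sociable", "trustworthy"]

# Stand-in for corpus tokens (the paper uses the Brown Corpus via NLTK).
corpus_tokens = ["honest", "kind", "kind", "friendly", "the", "warm", "kind"]
freq = Counter(t.lower() for t in corpus_tokens)

TOP_K = 3  # the paper keeps the 32 most frequent terms per list
selected = sorted(warmth_terms, key=lambda w: freq[w], reverse=True)[:TOP_K]
print(selected)  # most corpus-frequent warmth terms first
```

Selecting by corpus frequency is a pragmatic choice: frequent terms are more likely to yield many example sentences for fine-tuning.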
Whilst the exact “position” of demographic groups in this conceptual space would vary depending on who is describing them, in this work we always assume the minority group is represented in the original model as cold and incompetent, in other words, as the most disfavoured and the most likely to experience harm (Cuddy et al., 2008). This minimises workload (there is no need to establish likely predictions for every demographic considered, beyond identifying the more marginalised group) and centres our approach on improving results for the most negatively represented identity terms. Note that there is no harm in running our debiasing procedure on identities that are already equally associated with one concept (i.e. warmth) whilst also reducing stereotyped associations with the other concept (i.e. competence).
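The notion of a target term's projection onto a stereotype-content direction, which the fine-tuning objective suppresses, can be illustrated with vectors. Everything here is a stand-in: the embeddings are random, and defining the direction as the difference between mean warm and cold attribute embeddings is one common construction, not necessarily the exact formulation of Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)
warm_vecs = rng.normal(size=(5, 8))  # stand-in warmth attribute embeddings
cold_vecs = rng.normal(size=(5, 8))  # stand-in coldness attribute embeddings

# One illustrative warmth/coldness direction: difference of the group means,
# normalised to unit length.
direction = warm_vecs.mean(axis=0) - cold_vecs.mean(axis=0)
direction /= np.linalg.norm(direction)

# Scalar projection of a (stand-in) target name embedding onto the direction.
target_vec = rng.normal(size=8)
projection = float(target_vec @ direction)

# Debiasing would drive this scalar toward zero for target (name) terms,
# while attribute terms keep their projection along the same direction.
print(projection)
```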
3.1.3 Fine-tuning data
Having established the list of attribute and target
terms, we follow an adapted version of Kaneko
and Bollegala (2021)’s procedure for generating
fine-tuning development data. During early analy-