
resentation space, wherein similarity measures between sentences better correspond to their semantic meanings (Gao et al., 2021). Meanwhile, our proposed alignment loss, which pulls together identical sentences along contrasting gender directions, is well-suited to learning a fairer semantic space.
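To make this objective concrete, the sketch below writes such an alignment loss as an InfoNCE-style contrastive objective over sentence pairs that are identical up to gender-swapped terms; the function name, use of in-batch negatives, and temperature value are illustrative assumptions rather than MABEL's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(orig_emb, swapped_emb, temperature=0.05):
    """InfoNCE-style alignment over (original, gender-swapped) sentence pairs.

    orig_emb, swapped_emb: [batch, dim] embeddings of the same sentences
    before and after swapping gendered terms (e.g., "he" <-> "she").
    Each original sentence is pulled toward its own swapped counterpart
    and pushed away from the other swapped sentences in the batch.
    """
    orig_emb = F.normalize(orig_emb, dim=-1)
    swapped_emb = F.normalize(swapped_emb, dim=-1)
    sim = orig_emb @ swapped_emb.T / temperature            # [batch, batch] cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)   # positives lie on the diagonal
    return F.cross_entropy(sim, labels)
```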
We systematically evaluate MABEL on a com-
prehensive suite of intrinsic and extrinsic measures
spanning language modeling, text classification,
NLI, and coreference resolution. MABEL per-
forms well against existing gender debiasing ef-
forts in terms of both fairness and downstream task
performance, and it also preserves language under-
standing on the GLUE benchmark (Wang et al., 2019). Altogether, these results demonstrate the
effectiveness of harnessing NLI data for bias at-
tenuation, and underscore MABEL’s potential as a
general-purpose fairer encoder.
Lastly, we identify two major issues in existing
gender bias mitigation literature. First, many pre-
vious approaches solely quantify bias through the
Sentence Encoder Association Test (SEAT) (May et al., 2019), a metric that compares the geometric
relations between sentence representations. De-
spite scoring well on SEAT, many debiasing meth-
ods do not show the same fairness gains across
other evaluation settings. Second, previous ap-
proaches evaluate on extrinsic benchmarks in an
inconsistent manner. For a fairer comparison, we
either reproduce or summarize the performance of
many recent methodologies on major evaluation
tasks. We believe that unifying the evaluation set-
tings lays the groundwork for more meaningful
methodological comparisons in future research.
2 Background
2.1 Debiasing Contextualized Representations
Debiasing attempts in NLP can be divided into two
categories. In the first category, the model learns to
disregard the influence of sensitive attributes in rep-
resentations during fine-tuning, through projection-based (Ravfogel et al., 2020, 2022), adversarial (Han et al., 2021a,b), or contrastive (Shen et al., 2021; Chi et al., 2022) downstream objectives. This approach is task-specific, as it requires fine-tuning data that is annotated for the sensitive attribute. The second type, task-agnostic training, mitigates bias by leveraging textual information from general corpora. This can involve computing a gender subspace and eliminating it from encoded representations (Dev et al., 2020; Liang et al., 2020; Dev et al., 2021; Kaneko and Bollegala, 2021), or re-training the encoder with higher dropout (Webster et al., 2020) or with equalizing objectives (Cheng et al., 2021; Guo et al., 2022) to alleviate unwanted gender associations.
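To illustrate the subspace-removal family of methods, the sketch below estimates a one-dimensional gender direction from difference vectors of gendered word pairs and projects it out of each representation. This is a simplified rendering of the common recipe, not the algorithm of any single cited work; the word-pair list and helper names are assumptions made for the example.

```python
import numpy as np

def gender_direction(word_vecs, pairs=(("he", "she"), ("man", "woman"), ("his", "her"))):
    """Estimate a 1-D gender subspace as the top principal direction of
    difference vectors between gendered word pairs."""
    diffs = np.stack([word_vecs[a] - word_vecs[b] for a, b in pairs])
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)   # top right-singular vector
    return vt[0]

def remove_subspace(embeddings, direction):
    """Project the gender direction out of each row of `embeddings`."""
    direction = direction / np.linalg.norm(direction)
    return embeddings - np.outer(embeddings @ direction, direction)
```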
We summarize recent efforts of both task-
specific and task-agnostic approaches in Table 1.
Compared to task-specific approaches that only
debias for the task at hand, task-agnostic models
produce fair encoded representations that can be
used toward a variety of applications. MABEL
is task-agnostic, as it produces a general-purpose
debiased model. Some recent efforts have broad-
ened the scope of task-specific approaches. For in-
stance, Meade et al. (2022) adapt the task-specific
Iterative Nullspace Linear Projection (INLP) (Ravfogel et al., 2020) algorithm to rely on Wikipedia data for language model probing. While task-specific approaches can potentially be adapted to general-purpose debiasing, we primarily consider other task-agnostic approaches in this work.
2.2 Evaluating Biases in NLP
The recent surge of interest in fairer NLP systems
has surfaced a key question: how should bias be
quantified? Intrinsic metrics directly probe the up-
stream language model, whether by measuring the
geometry of the embedding space (Caliskan et al., 2017; May et al., 2019; Guo and Caliskan, 2021), or through likelihood scoring (Kurita et al., 2019; Nangia et al., 2020; Nadeem et al., 2021). Extrinsic metrics evaluate fairness by comparing the system's predictions across different populations on a downstream task (De-Arteaga et al., 2019a; Zhao et al., 2019; Dev et al., 2020). Though opaque, intrinsic metrics are fast and cheap to compute, which makes them popular among contemporary works (Meade et al., 2022; Qian et al., 2022). Comparatively, though extrinsic metrics are more interpretable and reflect tangible social harms, they are often time- and compute-intensive, and so tend to be less frequently used.2
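As one concrete example of an extrinsic measurement, occupation-classification benchmarks in the style of De-Arteaga et al. (2019a) often report the gap in true positive rates between gender groups for each profession; a minimal sketch of that computation (function name and group labels are illustrative) is:

```python
import numpy as np

def tpr_gap(y_true, y_pred, gender):
    """True-positive-rate gap between two groups for a single binary label.

    y_true, y_pred: binary arrays for one occupation label.
    gender: array of group labels, e.g. "F" / "M".
    """
    def tpr(mask):
        pos = (y_true == 1) & mask
        return (y_pred[pos] == 1).mean() if pos.any() else 0.0
    return tpr(gender == "F") - tpr(gender == "M")
```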
To date, the most popular bias metric among
task-agnostic approaches is the Sentence Encoder
Association Test (SEAT) (May et al., 2019), which
compares the relative distance between the encoded
representations. Recent studies have cast doubt on
the predictive power of these intrinsic indicators.
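For reference, SEAT applies the WEAT effect size of Caliskan et al. (2017) to sentence embeddings: it contrasts the mean cosine similarity of two target sets against two attribute sets. A minimal sketch, assuming precomputed sentence embeddings and illustrative helper names, is:

```python
import numpy as np

def _assoc(w, A, B):
    """Difference of mean cosine similarity of w to attribute sets A and B."""
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def seat_effect_size(X, Y, A, B):
    """WEAT/SEAT effect size between target sets X, Y and attribute sets A, B
    (all given as lists of sentence-embedding vectors)."""
    x_assoc = np.array([_assoc(x, A, B) for x in X])
    y_assoc = np.array([_assoc(y, A, B) for y in Y])
    pooled_std = np.std(np.concatenate([x_assoc, y_assoc]), ddof=1)
    return (x_assoc.mean() - y_assoc.mean()) / pooled_std
```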
2 As Table 17 in Appendix F indicates, many previous bias mitigation approaches limit evaluation to 1 or 2 metrics.