
[Figure 1 graphic: a language model is evaluated on a sentiment analysis bias benchmark using the original template “The situation makes [PERSON] feel [EMOTION WORD].” (e.g., “The situation makes her/him feel angry.”) and the modified template “[PERSON] is feeling [EMOTION WORD] due to the situation.” (e.g., “She/He is feeling angry due to the situation.”). The original template shows statistically significant bias (paired t-test p = 0.01); the modified template does not (p = 0.7).]
Figure 1: Example of the fragility of bias measurements for sentiment analysis. Although the sentiment analysis model demonstrates significant bias on the original template, the modified template (which rephrases the original while preserving its content) instead results in a different conclusion!
as “The situation makes [PERSON] feel [EMOTION WORD].” to analyze whether
sentiment analysis systems exhibit statistically significant gender bias.
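As a concrete illustration, such a template can be expanded into paired female/male instances along the lines of the sketch below; the person and emotion terms shown are illustrative placeholders, not the benchmark's actual lexicons.

```python
from itertools import product

# Illustrative fill-in terms; actual benchmarks use larger, curated lexicons.
PERSON_PAIRS = [("her", "him"), ("my sister", "my brother")]
EMOTION_WORDS = ["angry", "sad", "happy", "anxious"]

TEMPLATE = "The situation makes {person} feel {emotion}."

def instantiate(template, person_pairs, emotion_words):
    """Expand a template into (female_variant, male_variant) sentence pairs."""
    return [
        (template.format(person=f, emotion=e), template.format(person=m, emotion=e))
        for (f, m), e in product(person_pairs, emotion_words)
    ]

for female_sent, male_sent in instantiate(TEMPLATE, PERSON_PAIRS, EMOTION_WORDS):
    print(female_sent, "|", male_sent)
```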
Although templates are a convenient, easy-to-use, and scalable diagnostic tool for model biases,
these very benefits can also lead to notable limitations. Due to the fill-in-the-blank nature of tem-
plates, they tend to be extremely short and convey a single idea. Therefore, templates may not
represent the structural and stylistic variations that occur in natural text. Furthermore, because each template scales to many filled-in instances, most works tend to include only a small set of templates (often in the single digits), as opposed to a more diverse, comprehensive set.
behavior, it is often unclear why template datasets are constructed the way they are, i.e., why certain
templates are included vs. excluded and why templates are phrased in a specific way. Therefore,
template evaluation may depict a limited and misleading picture of model bias. As highlighted in
Figure 1, the sentiment analysis model demonstrates statistically significant bias on an original tem-
plate from Kiritchenko and Mohammad [2018]. On the other hand, slightly modifying this template
results in a completely different conclusion.
In this paper, we ask: How brittle is template data evaluation for assessing model fairness? To
answer this question, we examine how sensitive bias measures are to meaning-preserving changes in
templates. Ideally, we would expect the original and modified templates, conveying similar content
and containing identical fill-in-the-blank terms, to result in close predictions and therefore capture
similar bias. We consider four tasks — sentiment analysis, toxicity detection, natural language
inference (NLI), and masked language modeling (MLM) — and draw on existing template datasets
for each. Template modifications are made manually and held fixed, rather than generated with an adversarial or human-in-the-loop procedure (an example modification is shown in Figure 1). This choice both ensures that modified templates remain coherent and similar to the original versions and yields model-agnostic modifications.
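As a rough sketch of the kind of per-template significance test behind Figure 1, the snippet below runs a paired t-test over a sentiment model's scores for female- versus male-filled instantiations of a single template. The sentiment_score function is a stand-in for whichever system is being audited, and the exact test setup in prior work may differ.

```python
from scipy.stats import ttest_rel

def gender_bias_pvalue(template, emotion_words, sentiment_score,
                       female_term="her", male_term="him"):
    """Paired t-test on sentiment scores for female vs. male fillings of one
    template; a small p-value indicates statistically significant gender bias."""
    female_scores = [sentiment_score(template.format(person=female_term, emotion=e))
                     for e in emotion_words]
    male_scores = [sentiment_score(template.format(person=male_term, emotion=e))
                   for e in emotion_words]
    return ttest_rel(female_scores, male_scores).pvalue

# Usage sketch: the same test on an original template and a content-preserving
# modification can produce very different p-values (cf. 0.01 vs. 0.7 in Figure 1).
# p_orig = gender_bias_pvalue("The situation makes {person} feel {emotion}.",
#                             EMOTION_WORDS, sentiment_score)
# p_mod  = gender_bias_pvalue("{person} is feeling {emotion} due to the situation.",
#                             EMOTION_WORDS, sentiment_score,
#                             female_term="She", male_term="He")
```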
We find, however, that bias varies considerably across modified templates and differs from the original measurements on all four NLP tasks. For example, by categorizing examples based on statistical test outcomes for gender bias, we observe that 33% of modified templates result in different
categorizations for sentiment analysis. We also observe that task-specific bias measures change up
to 81% in NLI, 127% in toxicity detection, and 162% in MLM after modifications. Our results
raise important questions about how fairness is currently evaluated in LLMs. They indicate
that bias measurements, and any subsequent conclusions made from these measurements, are in-
consistent and highly template-specific. As a result, the process of comparing models and choosing
the “least biased” model to deploy can lead to different decisions based solely on subtle wording and
phrasing choices in templates. We strongly advise researchers to leverage handcrafted fairness eval-
uation datasets when available and appropriate, or to place greater emphasis on generating more
comprehensive and diverse sets of templates for bias evaluation.
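One natural reading of these percentages is as relative changes of the underlying bias measure; the minimal sketch below makes that assumed definition explicit, with purely illustrative values.

```python
def relative_change(original_bias: float, modified_bias: float) -> float:
    """Percent change of a task-specific bias measure after template modification
    (assumed definition; values above 100% mean the measure more than doubled)."""
    return abs(modified_bias - original_bias) / abs(original_bias) * 100.0

# Purely illustrative values, not measurements reported in this paper.
print(relative_change(0.10, 0.23))  # ~130.0
```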
2 Behavioral Testing for Fairness
In this section, we provide an overview of template-based bias evaluation for different NLP tasks,
as well as the template modification and training procedures.