Cascading Biases: Investigating the Effect of Heuristic Annotation
Strategies on Data and Models
Chaitanya Malaviya and Sudeep Bhatia and Mark Yatskar
University of Pennsylvania
{cmalaviy, bhatiasu, myatskar}@upenn.edu
Abstract
Cognitive psychologists have documented that
humans use cognitive heuristics, or mental
shortcuts, to make quick decisions while ex-
pending less effort. We hypothesize that when annotators perform annotation work on crowdsourcing platforms, such heuristic use cascades onto data quality and model robustness. In this work, we study
cognitive heuristic use in the context of an-
notating multiple-choice reading comprehen-
sion datasets. We propose tracking annotator
heuristic traces, where we tangibly measure
low-effort annotation strategies that could indi-
cate usage of various cognitive heuristics. We
find evidence that annotators might be using
multiple such heuristics, based on correlations
with a battery of psychological tests. Impor-
tantly, heuristic use among annotators deter-
mines data quality along several dimensions:
(1) known biased models, such as partial in-
put models, more easily solve examples au-
thored by annotators that rate highly on heuris-
tic use, (2) models trained on annotators scor-
ing highly on heuristic use don’t generalize
as well, and (3) heuristic-seeking annotators
tend to create qualitatively less challenging ex-
amples. Our findings suggest that tracking
heuristic usage among annotators can poten-
tially help with collecting challenging datasets
and diagnosing model biases.
1 Introduction
While crowdsourcing is an effective and widely-
used data collection method in NLP, it comes with
caveats. Crowdsourced datasets have been found to contain artifacts from the annotation process, and models trained on such data can be brittle and fail to generalize under distribution shift (Gururangan et al., 2018; Kaushik and Lipton, 2018; McCoy et al., 2019). In this work, we ask whether systematic patterns in annotator behavior influence the quality of collected data.
We hypothesize that usage of cognitive heuristics, which are mental shortcuts that humans employ in everyday life, can cascade onto data quality
and model robustness. For example, an annotator
asked to write a question based on a passage might
not read the entire passage or might use just one
sentence to frame a question. Annotators may seek
shortcuts to economize on the amount of time and
effort they put into a task. This behavior in annota-
tors, characterized by examples that are acceptable
but not high-quality, can be problematic.
We analyze the extent to which annotators engage in various low-effort strategies, akin to cognitive heuristics, by tracking indicative features from their annotation data in the form of annotator heuristic traces. First, we crowdsource reading comprehension questions where we instruct workers to write hard questions. Inspired by research on human cognition (Simon, 1956; Tversky and Kahneman, 1974), we identify several heuristics that could be employed by annotators for our task, such as satisficing (Simon, 1956), availability (Tversky and Kahneman, 1973) and representativeness (Kahneman and Tversky, 1972). We measure their potential usage by featurizing the collected data and annotation metadata (e.g., time spent and keystrokes entered) (§4). Further, we identify instantiations of these heuristics that correlate well with psychological tests measuring heuristic thinking tendencies in humans, such as the cognitive reflection test (Frederick, 2005; Toplak et al., 2014; Sirota et al., 2021). Our psychologically plausible measures of heuristic use during annotation can be aggregated per annotator, forming a holistic summary of the data they produce.
Based on these statistics, we analyze differences
between examples created by annotators who en-
gage in different levels of heuristics use. Our
first finding is that examples created by strongly
heuristic-seeking annotators are also easier for
models to solve using heuristics (§5). We eval-
uate models that exploit a few known biases and
find that examples from annotators who use cogni-
tive heuristics are more easily solvable by biased
models. We also examine what impact heuristics
have on trained models. Previous work (Geva et al.,
2019) shows that models generalize poorly when
datasets are split randomly by annotators, likely
due to the existence of artifacts. We replicate this
result and find that models generalize even worse
when trained on examples from heuristic-seeking
annotators.
To understand which parts of the annotation
pipeline contribute to heuristic-seeking behavior in
annotators, we also tease apart the effect of compo-
nents inherent to the task (e.g., passage difficulty)
as opposed to the annotators themselves (e.g., an-
notator fatigue) (§6). Unfortunately, we don't discover simple predictors (e.g., passage difficulty) of when annotators are likely to use heuristics.
A qualitative analysis of the collected data re-
veals that heuristic-seeking annotators are more
likely to create examples that are not valid, and
require simpler word-matching on explicitly stated
information (§7). Crucially, this suggests that mea-
surements of heuristic usage, such as those exam-
ined in this paper, can provide a general method
to find unreliable examples in crowdsourced data,
and direct our search for discovering artifacts in the
data. Because we implicate heuristic use in both model robustness and data quality, we suggest that future dataset creators track similar features and evaluate
model sensitivity to annotator heuristic use.1
2 Background and Related Work
Cognitive Heuristics. The study of heuristics in human judgment, decision making, and reasoning is a popular and influential topic of research (Simon, 1956; Tversky and Kahneman, 1974). Heuristics can be defined as mental shortcuts that we use in everyday tasks for fast decision-making. For example, Tversky and Kahneman (1974) asked participants whether more English words begin with the letter K or contain K as the third letter, and more than 70% of participants chose the former because words that begin with K are easier to recall, although that is incorrect. This is an example of the availability heuristic. Systematic use of such heuristics can lead to cognitive biases, which are irrational patterns in our thinking.
1 Our code and collected data are available at https://github.com/chaitanyamalaviya/annotator-heuristics.
At first glance, it may seem that heuristics are
always suboptimal, but previous work has argued
that heuristics can lead to accurate inferences under
uncertainty, compared to optimization (Gigerenzer and Gaissmaier, 2011). We hypothesize that heuris-
tics can play a considerable role in determining
data quality and their impact depends on the exact
nature of the heuristic. Previous work has shown
that crowdworkers are susceptible to cognitive bi-
ases in a relevance judgment task (Eickhoff, 2018), and has provided a checklist to combat these biases (Draws et al., 2021). In contrast, our work focuses
on how potential use of such heuristics can be mea-
sured in a writing task, and provides evidence that
heuristic use is linked to model brittleness.
Features of annotator behavior have previously
been useful in estimating annotator task accuracies
(Rzeszotarski and Kittur, 2011; Goyal et al., 2018).
Annotator identities have also been found to influ-
ence their annotations (Hube et al., 2019; Sap et al., 2022). Our work builds on these results and esti-
mates heuristic use with features to capture implicit
clues about data quality.
Mitigating and discovering biases.
The pres-
ence of artifacts or biases in datasets is well-
documented in NLP, in tasks such as natural lan-
guage inference, question answering and argu-
ment comprehension (Gururangan et al., 2018; McCoy et al., 2019; Niven and Kao, 2019, inter alia). These artifacts allow models to solve NLP problems using unreliable shortcuts (Geirhos et al., 2020). Several researchers have proposed
approaches to achieve robustness against known
biases. We refer the reader to Wang et al. (2022)
for a comprehensive review of these methods. Tar-
geting biases that are unknown continues to be a
challenge, and our work can help find examples
which are likely to contain artifacts, by identifying
heuristic-seeking annotators.
Prior work has proposed methods to discover shortcuts using explanations of model predictions (Lertvittayakumjorn and Toni, 2021), including sample-based explanations (Han et al., 2020) and input feature attributions (Bastings et al., 2021; Pezeshkpour et al., 2022). Other techniques that can be helpful in diagnosing model biases include building a checklist of test cases (Ribeiro et al., 2020; Ribeiro and Lundberg, 2022), constructing contrastive (Gardner et al., 2020) or counterfactual (Wu et al., 2021) examples, and statistical tests (Gururangan et al., 2018; Gardner et al., 2021). Our
work is complementary to these approaches, as we
provide an alternative approach to bias discovery
that is tied to annotators.
Improved crowdsourcing.
A related line of
work has studied modifications to crowdsourcing
protocols to improve data quality (Bowman et al., 2020; Nangia et al., 2021). In addition, model-in-the-loop crowdsourcing methods such as adversarial data collection (Nie et al., 2020) and the use of generative models (Bartolo et al., 2022; Liu et al., 2022) have been shown to be helpful in creating more challenging examples. We believe that tracking annotator heuristic use can help make informed adjustments to crowdsourcing protocols.
3 Annotation Protocol
We consider multiple-choice reading comprehen-
sion as our crowdsourcing task, because of the rich-
ness of responses and interaction we can get from
annotators, which allows us to explore a range
of hypothetical heuristics. We describe here the
methodology for our data collection.
We provided annotators on Amazon Mechanical Turk with passages and asked them to write a multiple-choice question with four options. We used the first paragraphs of ‘vital articles’ from the English Wikipedia (Level 4 vital articles: https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/4), and ensured that passages were at least 50 words and at most 250 words long. Passages spanned 11 genres, including arts, history, physical sciences, and others, and were randomly sampled from a pool of 10K passages.
tators were asked to write challenging questions
that cannot be answered by reading just the ques-
tion or passage alone, and have a single correct
answer. Further, they were asked to ensure that
passages provided sufficient information to answer
the question while allowing questions to require
basic inferences using commonsense or causality.
Annotators were first qualified to screen out spamming behavior. This qualification checked for spamming in the form of invalid questions, not for example quality. Annotators were then asked to write a multiple-choice question for each of 4 passages in a single HIT on MTurk, and were asked not to work on more than 8 HITs. We collected 1225 multiple-choice question-answer pairs from 73 annotators. In addition, we logged their keystrokes and the time taken to complete an example (ensuring that time away from the screen
was not counted). Our annotation interface was
built upon Nangia et al. (2021). For other details
about our annotation protocol, please refer to Ap-
pendix A.
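To make the passage selection step concrete, here is a minimal Python sketch (not the released annotation pipeline) that keeps only passages within the 50-250 word bounds and samples a set for a HIT; the passage pool and genre labels below are hypothetical placeholders.

```python
import random

# Hypothetical pool: each entry stands in for the first paragraph of a
# Wikipedia 'vital article' together with a genre label.
passages = [
    {"id": "P1", "genre": "history", "text": " ".join(["history"] * 120)},
    {"id": "P2", "genre": "arts", "text": " ".join(["arts"] * 30)},        # too short
    {"id": "P3", "genre": "science", "text": " ".join(["science"] * 300)},  # too long
    {"id": "P4", "genre": "geography", "text": " ".join(["geography"] * 200)},
]

def within_length_bounds(text: str, lo: int = 50, hi: int = 250) -> bool:
    """Keep passages that are at least 50 and at most 250 words long."""
    n_words = len(text.split())
    return lo <= n_words <= hi

eligible = [p for p in passages if within_length_bounds(p["text"])]

# Randomly sample passages to show to annotators (4 per HIT in this paper's setup).
random.seed(0)
hit_passages = random.sample(eligible, k=min(4, len(eligible)))
print([p["id"] for p in hit_passages])
```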
4 Cognitive Heuristics in Crowdsourcing
Cognitive heuristics are mental shortcuts that humans employ in problem-solving tasks to make quick judgments (Simon, 1956; Tversky and Kahneman, 1974). Annotators, tasked with authoring natural language examples, are not immune to using such heuristics. We hypothesize that, in writing
tasks, reliance on heuristics is a traceable indicator
of poor data quality. In this section, we identify
several heuristics, their consequences in annotator
behavior, and features to track them. Later, we also
show they are predictors of qualitatively important
dimensions of data.
4.1 Methodology
To test the above hypothesis, we consider several
known cognitive heuristics which could be rele-
vant for our task. This list is not comprehensive,
and we refer the readers to prior work for a thor-
ough overview of cognitive biases (Shah and Oppenheimer, 2008; Draws et al., 2021). To tangibly
measure the potential usage of a heuristic, we fea-
turize each heuristic into a measurable quantity that
can be computed automatically for an example (see
Table 1). While we do not conclusively determine
that an annotator is using a heuristic, we explore
various featurizations that align with the intuition
behind each heuristic. These featurizations can
sometimes be mapped to multiple heuristics that
interact together, but for ease of presentation, we
list them under the most related cognitive heuristic.
These help us create annotator heuristic traces, which contain average heuristic values across all of an annotator's examples.
To verify if our instantiation of a heuristic aligns
with heuristic-seeking tendencies in annotators, we
measure correlations of heuristic values with annotator performance on a battery of psychological tests (Frederick, 2005; Toplak et al., 2014; Sirota et al., 2021), described in §4.4.
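A sketch of this verification step might look as follows; the choice of Spearman rank correlation and the example scores are our assumptions for illustration, not the paper's exact analysis.

```python
from scipy.stats import spearmanr

# Hypothetical per-annotator values: one heuristic trace feature and a
# cognitive reflection test (CRT) score per annotator.
annotators = ["A1", "A2", "A3", "A4", "A5"]
log_time_per_token = [0.9, 1.4, 2.1, 0.7, 1.8]   # lower = faster, more heuristic-like
crt_scores = [1, 2, 3, 0, 3]                      # higher = more reflective thinking

# A positive rank correlation here would suggest that annotators who spend
# more time per token also tend to score higher on the reflection test.
rho, p_value = spearmanr(log_time_per_token, crt_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```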
4.2 Heuristics Studied
Table 1: Consequences of cognitive heuristics and featurizations for multiple-choice reading comprehension data.
Satisficing (lowtime): (1) time, (2) log(time), (3) time / doc length, (4) log(time / doc length)
Satisficing (loweffort): (1) question length, (2) keystroke length, (3) question + options length, (4) (question + options length) / keystroke length
Availability (first option bias): first option is marked as the correct answer
Availability (serial position): correct answer matches a span in the first or last sentence of the passage
Representativeness (word overlap): average word overlap in all pairs of examples by an annotator
Representativeness (copying): (1) length of the longest common subsequence (LCS) between doc and question, (2) max of the normalized LCS length between doc and {question, options}, (3) normalized average of LCS lengths between doc and {question, options}

Satisficing: Satisficing is a cognitive heuristic that involves making a satisfactory choice, rather than an optimal one (Simon, 1956). In terms of
mental process, strong satisficing can involve inat-
tention to information and lack of information syn-
thesis. In social cognition, Krosnick (1991) de-
scribed how satisficing can manifest in various
patterns in survey responses. For example, survey-
takers might pick the same response to several ques-
tions in sequence, pick a random response, or use
the acquiescence bias (where they always choose
to agree with the given statement). A potential out-
come of satisficing in our task is low time spent on
the task and low effort put into forming a question.
Assuming the working time is t and the number of tokens in passage d is l_d, we consider the following lowtime featurizations: (1) t, (2) log t, (3) t / l_d, (4) log(t / l_d). Previous work shows that taking the logarithm normalizes the response time distribution (Whelan, 2008).
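A minimal sketch of the lowtime featurization, assuming the working time is logged in seconds and the passage is whitespace-tokenized (the paper does not specify the exact tokenizer):

```python
import math

def lowtime_features(working_time_s: float, passage: str) -> dict:
    """Four lowtime featurizations: t, log t, t / l_d, log(t / l_d)."""
    l_d = len(passage.split())  # number of tokens in the passage (whitespace split)
    return {
        "time": working_time_s,
        "log_time": math.log(working_time_s),
        "time_per_token": working_time_s / l_d,
        "log_time_per_token": math.log(working_time_s / l_d),
    }

# Example: 90 seconds spent on a 120-word passage.
print(lowtime_features(90.0, " ".join(["word"] * 120)))
```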
We estimate an annotator’s amount of effort
through their responses. An annotator who is con-
sistently editing their work or writing long ques-
tions might be attempting to thoughtfully draft their
question. While this may not always be true (e.g., a worker might spend time thinking about their question and only start writing later), we hypothesize that often, short responses can be indicators of satisficing. Given that the number of words found in the stream of keystrokes k, in the question q, and in all options o_i is l_k, l_q, and l_o respectively, we consider these loweffort featurizations: (1) l_q, (2) l_k, (3) l_q + l_o, (4) (l_q + l_o) / l_k.
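A corresponding sketch of the loweffort featurization, assuming the keystroke stream has been reconstructed into text so that word counts can be taken (a simplifying assumption):

```python
def loweffort_features(keystroke_text: str, question: str, options: list[str]) -> dict:
    """Four loweffort featurizations based on word counts:
    l_q, l_k, l_q + l_o, and (l_q + l_o) / l_k."""
    l_k = len(keystroke_text.split())            # words typed across all keystrokes
    l_q = len(question.split())                  # words in the final question
    l_o = sum(len(o.split()) for o in options)   # words across all options
    return {
        "question_length": l_q,
        "keystroke_length": l_k,
        "question_plus_options_length": l_q + l_o,
        # guard against an empty keystroke log to avoid division by zero
        "question_plus_options_over_keystrokes": (l_q + l_o) / max(l_k, 1),
    }

print(loweffort_features(
    keystroke_text="What year did the city become the capital capital",
    question="What year did the city become the capital?",
    options=["1801", "1856", "1901", "1923"],
))
```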
Availability heuristic:
The tendency to rely
upon information that is more readily retrievable
from our memory is the availability heuristic (Tversky and Kahneman, 1973). For example, after hear-
ing about a plane crash on the news, people may
overstate the dangers of flying. For our task, once
an annotator has read a passage and formulated a
question, the question and the correct answer are
likely to be readily available in their mind. This
could cause them to write that information before any of the distractor options. Therefore, we check
whether the first option specified for an example is
also the correct answer (first option bias).
Another consequence of this heuristic is the
serial-position effect. When presented with a series
of items like a list of words or items in a grocery
list, people recall the first and last few items from
the series better than the middle ones (Murdock Jr, 1962; Ebbinghaus, 1964) because of their easier
availability. This effect can also be explained as
a combination of the primacy effect and recency
effect. To test if an annotator anchors their ques-
tions on the first or last sentence of the passage
due to this heuristic, we check if the correct answer
marked for an example matches a span in the first
or last sentence of the passage (serial position).
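The two availability featurizations can be sketched as follows; the substring-based span matching and the naive sentence splitting are our assumptions, since the paper does not spell out the matching procedure.

```python
def first_option_bias(options: list[str], correct: str) -> bool:
    """Availability (first option bias): is the first listed option the correct answer?"""
    return options[0] == correct

def serial_position(passage: str, correct: str) -> bool:
    """Availability (serial position): does the correct answer appear as a span
    in the first or last sentence of the passage? (Naive sentence split on '. ')"""
    sentences = [s.strip() for s in passage.split(". ") if s.strip()]
    if not sentences:
        return False
    return correct in sentences[0] or correct in sentences[-1]

passage = ("Ottawa is the capital of Canada. It lies on the Ottawa River. "
           "The city was chosen as capital in 1857")
print(first_option_bias(["1857", "1867", "1901", "1931"], "1857"))  # True
print(serial_position(passage, "1857"))                              # True (last sentence)
```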
Representativeness heuristic:
The representa-
tiveness heuristic is our tendency to use the simi-
larity of items to make decisions (Kahneman and Tversky, 1972). For example, if a person is picking
a movie to watch, they might think of movies they
previously liked and look for those attributes in
a new movie. Similarly, an annotator may repeat
the same construction in their questions to ease
decision-making (e.g., "which of the following is
true?" or "what year did [event] happen?"). This
could mean either that they are not fully engaged or that they have found a writing strategy that works well and choose to stick to it. We measure this
tendency by computing the average word overlap
across all pairs of questions from an annotator.
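A sketch of the word-overlap featurization, using Jaccard overlap over lowercased word sets as one plausible instantiation (the paper's exact overlap definition may differ):

```python
from itertools import combinations

def word_overlap(q1: str, q2: str) -> float:
    """Jaccard overlap between the word sets of two questions."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0

def avg_pairwise_overlap(questions: list[str]) -> float:
    """Average word overlap across all pairs of an annotator's questions."""
    pairs = list(combinations(questions, 2))
    if not pairs:
        return 0.0
    return sum(word_overlap(a, b) for a, b in pairs) / len(pairs)

questions = [
    "Which of the following is true about the author?",
    "Which of the following is true about the city?",
    "What year did the event happen?",
]
print(round(avg_pairwise_overlap(questions), 3))
```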
A different manner in which this heuristic can
manifest is using similarity with the provided con-
text, i.e., through copying. Copying, or imita-
tion, is a common building block that guides hu-
man behavior and decision making. In deciding
what clothes to buy or which book to read, hu-
mans use imitation-of-the-majority to make quicker inferences with less cognitive effort (Garcia-Retamero et al., 2009; Gigerenzer and Gaissmaier,