Cascading Biases: Investigating the Effect of Heuristic Annotation
Strategies on Data and Models
Chaitanya Malaviya and Sudeep Bhatia and Mark Yatskar
University of Pennsylvania
{cmalaviy, bhatiasu, myatskar}@upenn.edu
Abstract
Cognitive psychologists have documented that
humans use cognitive heuristics, or mental
shortcuts, to make quick decisions while ex-
pending less effort. We hypothesize that when annotators perform annotation work on crowdsourcing platforms, such heuristic use cascades onto data quality and model robustness. In this work, we study
cognitive heuristic use in the context of an-
notating multiple-choice reading comprehen-
sion datasets. We propose tracking annotator
heuristic traces, where we tangibly measure
low-effort annotation strategies that could indi-
cate usage of various cognitive heuristics. We
find evidence that annotators might be using
multiple such heuristics, based on correlations
with a battery of psychological tests. Impor-
tantly, heuristic use among annotators deter-
mines data quality along several dimensions:
(1) known biased models, such as partial in-
put models, more easily solve examples au-
thored by annotators that rate highly on heuris-
tic use, (2) models trained on annotators scor-
ing highly on heuristic use don’t generalize
as well, and (3) heuristic-seeking annotators
tend to create qualitatively less challenging ex-
amples. Our findings suggest that tracking
heuristic usage among annotators can poten-
tially help with collecting challenging datasets
and diagnosing model biases.
1 Introduction
While crowdsourcing is an effective and widely-
used data collection method in NLP, it comes with
caveats. Crowdsourced datasets have been found to contain artifacts from the annotation process, and models trained on such data can be brittle and fail to generalize under distribution shift (Gururangan et al., 2018; Kaushik and Lipton, 2018; McCoy et al., 2019). In this work, we ask whether systematic patterns in annotator behavior influence the quality of collected data.
We hypothesize that usage of cognitive heuristics, which are mental shortcuts that humans employ in everyday life, can cascade onto data quality
and model robustness. For example, an annotator
asked to write a question based on a passage might
not read the entire passage or might use just one
sentence to frame a question. Annotators may seek
shortcuts to economize on the amount of time and
effort they put into a task. This behavior in annota-
tors, characterized by examples that are acceptable
but not high-quality, can be problematic.
We analyze the extent to which annotators engage in various low-effort strategies, akin to cognitive heuristics, by tracking indicative features from their annotation data in the form of annotator heuristic traces. First, we crowdsource reading comprehension questions where we instruct workers to write hard questions. Inspired by research on human cognition (Simon, 1956; Tversky and Kahneman, 1974), we identify several heuristics that could be employed by annotators for our task, such as satisficing (Simon, 1956), availability (Tversky and Kahneman, 1973) and representativeness (Kahneman and Tversky, 1972). We measure their potential usage by featurizing the collected data and annotation metadata (e.g., time spent and keystrokes entered) (§4). Further, we identify instantiations of these heuristics that correlate well with psychological tests measuring heuristic thinking tendencies in humans, such as the cognitive reflection test (Frederick, 2005; Toplak et al., 2014; Sirota et al., 2021). Our psychologically plausible measures of heuristic use during annotation can be aggregated per annotator, forming a holistic summary of the data they produce.
Based on these statistics, we analyze differences
between examples created by annotators who en-
gage in different levels of heuristics use. Our
first finding is that examples created by strongly
heuristic-seeking annotators are also easier for
models to solve using heuristics (§5). We eval-
uate models that exploit a few known biases and
find that examples from annotators who use cogni-
tive heuristics are more easily solvable by biased
models. We also examine what impact heuristics
have on trained models. Previous work (Geva et al.,
2019) shows that models generalize poorly when
datasets are split randomly by annotators, likely
due to the existence of artifacts. We replicate this
result and find that models generalize even worse
when trained on examples from heuristic-seeking
annotators.
To understand which parts of the annotation
pipeline contribute to heuristic-seeking behavior in
annotators, we also tease apart the effect of compo-
nents inherent to the task (e.g., passage difficulty)
as opposed to the annotators themselves (e.g., an-
notator fatigue) (§6). Unfortunately, we don't discover simple predictors (e.g., passage difficulty) of when annotators are likely to use heuristics.
A qualitative analysis of the collected data re-
veals that heuristic-seeking annotators are more
likely to create examples that are not valid, and
require simpler word-matching on explicitly stated
information (§7). Crucially, this suggests that mea-
surements of heuristic usage, such as those exam-
ined in this paper, can provide a general method
to find unreliable examples in crowdsourced data,
and direct our search for discovering artifacts in the
data. Because we implicate heuristic use in both model robustness and data quality, we suggest that future dataset creators track similar features and evaluate
model sensitivity to annotator heuristic use.1
2 Background and Related Work
Cognitive Heuristics. The study of heuristics in human judgment, decision making, and reasoning is a popular and influential topic of research (Simon, 1956; Tversky and Kahneman, 1974). Heuristics can be defined as mental shortcuts that we use in everyday tasks for fast decision-making. For example, Tversky and Kahneman (1974) asked participants whether more English words begin with the letter K or contain K as the third letter, and more than 70% of participants chose the former because words that begin with K are easier to recall, although that is incorrect. This is an example of the availability heuristic. Systematic use of such heuristics can lead to cognitive biases, which are irrational patterns in our thinking.
1 Our code and collected data are available at https://github.com/chaitanyamalaviya/annotator-heuristics.
At first glance, it may seem that heuristics are
always suboptimal, but previous work has argued
that heuristics can lead to accurate inferences under
uncertainty, compared to optimization (Gigerenzer and Gaissmaier, 2011). We hypothesize that heuris-
tics can play a considerable role in determining
data quality and their impact depends on the exact
nature of the heuristic. Previous work has shown
that crowdworkers are susceptible to cognitive bi-
ases in a relevance judgment task (Eickhoff, 2018), and has provided a checklist to combat these biases (Draws et al., 2021). In contrast, our work focuses
on how potential use of such heuristics can be mea-
sured in a writing task, and provides evidence that
heuristic use is linked to model brittleness.
Features of annotator behavior have previously
been useful in estimating annotator task accuracies
(Rzeszotarski and Kittur, 2011; Goyal et al., 2018).
Annotator identities have also been found to influ-
ence their annotations (Hube et al., 2019; Sap et al., 2022). Our work builds on these results and esti-
mates heuristic use with features to capture implicit
clues about data quality.
Mitigating and discovering biases.
The pres-
ence of artifacts or biases in datasets is well-
documented in NLP, in tasks such as natural lan-
guage inference, question answering and argu-
ment comprehension (Gururangan et al., 2018; McCoy et al., 2019; Niven and Kao, 2019, inter alia). These artifacts allow models to solve NLP problems using unreliable shortcuts (Geirhos et al., 2020). Several researchers have proposed
approaches to achieve robustness against known
biases. We refer the reader to Wang et al. (2022)
for a comprehensive review of these methods. Tar-
geting biases that are unknown continues to be a
challenge, and our work can help find examples
which are likely to contain artifacts, by identifying
heuristic-seeking annotators.
Prior work has proposed methods to discover shortcuts using explanations of model predictions (Lertvittayakumjorn and Toni, 2021), including sample-based explanations (Han et al., 2020) and input feature attributions (Bastings et al., 2021; Pezeshkpour et al., 2022). Other techniques that can be helpful in diagnosing model biases include building a checklist of test cases (Ribeiro et al., 2020; Ribeiro and Lundberg, 2022), constructing contrastive (Gardner et al., 2020) or counterfactual (Wu et al., 2021) examples, and statistical tests (Gururangan et al., 2018; Gardner et al., 2021). Our
work is complementary to these approaches, as we
provide an alternative approach to bias discovery
that is tied to annotators.
Improved crowdsourcing.
A related line of
work has studied modifications to crowdsourcing
protocols to improve data quality (Bowman et al., 2020; Nangia et al., 2021). In addition, model-in-the-loop crowdsourcing methods such as adversarial data collection (Nie et al., 2020) and the use of generative models (Bartolo et al., 2022; Liu et al., 2022) have been shown to be helpful in creating more challenging examples. We believe that tracking annotator heuristic use can help make informed adjustments to crowdsourcing protocols.
3 Annotation Protocol
We consider multiple-choice reading comprehen-
sion as our crowdsourcing task, because of the rich-
ness of responses and interaction we can get from
annotators, which allows us to explore a range
of hypothetical heuristics. We describe here the
methodology for our data collection.
We provided annotators on Amazon Mechanical Turk with passages and asked them to write a multiple-choice question with four options. We used the first paragraphs of ‘vital articles’ from the English Wikipedia (Level 4 vital articles: https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/4), and ensured that passages were at least 50 words and at most 250 words long. Passages spanned 11 genres, including arts, history, physical sciences, and others, and were randomly sampled from a pool of 10K passages.
tators were asked to write challenging questions
that cannot be answered by reading just the ques-
tion or passage alone, and have a single correct
answer. Further, they were asked to ensure that
passages provided sufficient information to answer
the question while allowing questions to require
basic inferences using commonsense or causality.
Annotators were first qualified to screen out spamming behavior. This qualification checked for spamming in the form of invalid questions, not for example quality. Annotators were then asked to write a multiple-choice question for each of 4 passages in a single HIT on MTurk, and were asked not to work on more than 8 HITs. We collected 1225 multiple-choice question-answer pairs from 73 annotators. In addition, we logged their keystrokes and the time taken to complete an example (ensuring that time away from the screen
was not counted). Our annotation interface was
built upon Nangia et al. (2021). For other details
about our annotation protocol, please refer to Ap-
pendix A.
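To make the passage selection step concrete, here is a minimal Python sketch (not the released annotation pipeline) that keeps only passages within the 50-250 word bounds and samples a set for a HIT; the passage pool and genre labels below are hypothetical placeholders.

```python
import random

# Hypothetical pool: each entry stands in for the first paragraph of a
# Wikipedia 'vital article' together with a genre label.
passages = [
    {"id": "P1", "genre": "history", "text": " ".join(["history"] * 120)},
    {"id": "P2", "genre": "arts", "text": " ".join(["arts"] * 30)},        # too short
    {"id": "P3", "genre": "science", "text": " ".join(["science"] * 300)},  # too long
    {"id": "P4", "genre": "geography", "text": " ".join(["geography"] * 200)},
]

def within_length_bounds(text: str, lo: int = 50, hi: int = 250) -> bool:
    """Keep passages that are at least 50 and at most 250 words long."""
    n_words = len(text.split())
    return lo <= n_words <= hi

eligible = [p for p in passages if within_length_bounds(p["text"])]

# Randomly sample passages to show to annotators (4 per HIT in this paper's setup).
random.seed(0)
hit_passages = random.sample(eligible, k=min(4, len(eligible)))
print([p["id"] for p in hit_passages])
```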
4 Cognitive Heuristics in Crowdsourcing
Cognitive heuristics are mental shortcuts that humans employ in problem-solving tasks to make quick judgments (Simon, 1956; Tversky and Kahneman, 1974). Annotators, tasked with authoring natural language examples, are not immune to using such heuristics. We hypothesize that, in writing
tasks, reliance on heuristics is a traceable indicator
of poor data quality. In this section, we identify
several heuristics, their consequences in annotator
behavior, and features to track them. Later, we also
show they are predictors of qualitatively important
dimensions of data.
4.1 Methodology
To test the above hypothesis, we consider several
known cognitive heuristics which could be rele-
vant for our task. This list is not comprehensive,
and we refer the readers to prior work for a thor-
ough overview of cognitive biases (Shah and Oppenheimer, 2008; Draws et al., 2021). To tangibly
measure the potential usage of a heuristic, we fea-
turize each heuristic into a measurable quantity that
can be computed automatically for an example (see
Table 1). While we do not conclusively determine
that an annotator is using a heuristic, we explore
various featurizations that align with the intuition
behind each heuristic. These featurizations can
sometimes be mapped to multiple heuristics that
interact together, but for ease of presentation, we
list them under the most related cognitive heuristic.
These help us create annotator heuristic traces, which contain average heuristic values across all of an annotator's examples.
To verify if our instantiation of a heuristic aligns
with heuristic-seeking tendencies in annotators, we
measure correlations of heuristic values with annotator performance on a battery of psychological tests (Frederick, 2005; Toplak et al., 2014; Sirota et al., 2021), described in §4.4.
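A sketch of this verification step might look as follows; the choice of Spearman rank correlation and the example scores are our assumptions for illustration, not the paper's exact analysis.

```python
from scipy.stats import spearmanr

# Hypothetical per-annotator values: one heuristic trace feature and a
# cognitive reflection test (CRT) score per annotator.
annotators = ["A1", "A2", "A3", "A4", "A5"]
log_time_per_token = [0.9, 1.4, 2.1, 0.7, 1.8]   # lower = faster, more heuristic-like
crt_scores = [1, 2, 3, 0, 3]                      # higher = more reflective thinking

# A positive rank correlation here would suggest that annotators who spend
# more time per token also tend to score higher on the reflection test.
rho, p_value = spearmanr(log_time_per_token, crt_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```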
4.2 Heuristics Studied
Table 1: Consequences of cognitive heuristics and featurizations for multiple-choice reading comprehension data.
Satisficing (lowtime): (1) time, (2) log(time), (3) time / doc length, (4) log(time / doc length)
Satisficing (loweffort): (1) question length, (2) keystroke length, (3) question + options length, (4) (question + options length) / keystroke length
Availability (first option bias): first option is marked as the correct answer
Availability (serial position): correct answer matches a span in the first or last sentence of the passage
Representativeness (word overlap): average word overlap in all pairs of examples by an annotator
Representativeness (copying): (1) length of the longest common subsequence (LCS) between doc and question, (2) max of the normalized LCS length between doc and {question, options}, (3) normalized average of LCS lengths between doc and {question, options}

Satisficing: Satisficing is a cognitive heuristic that involves making a satisfactory choice, rather than an optimal one (Simon, 1956). In terms of
mental process, strong satisficing can involve inat-
tention to information and lack of information syn-
thesis. In social cognition, Krosnick (1991) de-
scribed how satisficing can manifest in various
patterns in survey responses. For example, survey-
takers might pick the same response to several ques-
tions in sequence, pick a random response, or use
the acquiescence bias (where they always choose
to agree with the given statement). A potential out-
come of satisficing in our task is low time spent on
the task and low effort put into forming a question.
Assuming the working time is t and the number of tokens in passage d is l_d, we consider the following lowtime featurizations: (1) t, (2) log t, (3) t / l_d, (4) log(t / l_d). Previous work shows that taking the logarithm normalizes the response time distribution (Whelan, 2008).
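A minimal sketch of the lowtime featurization, assuming the working time is logged in seconds and the passage is whitespace-tokenized (the paper does not specify the exact tokenizer):

```python
import math

def lowtime_features(working_time_s: float, passage: str) -> dict:
    """Four lowtime featurizations: t, log t, t / l_d, log(t / l_d)."""
    l_d = len(passage.split())  # number of tokens in the passage (whitespace split)
    return {
        "time": working_time_s,
        "log_time": math.log(working_time_s),
        "time_per_token": working_time_s / l_d,
        "log_time_per_token": math.log(working_time_s / l_d),
    }

# Example: 90 seconds spent on a 120-word passage.
print(lowtime_features(90.0, " ".join(["word"] * 120)))
```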
We estimate an annotator’s amount of effort
through their responses. An annotator who is con-
sistently editing their work or writing long ques-
tions might be attempting to thoughtfully draft their
question. While this may not always be true (e.g., a worker might spend time thinking about their question and only start writing later), we hypothesize that often, short responses can be indicators of satisficing. Given that the number of words found in the stream of keystrokes k, in the question q, and in all options o_i is l_k, l_q, and l_o respectively, we consider these loweffort featurizations: (1) l_q, (2) l_k, (3) l_q + l_o, (4) (l_q + l_o) / l_k.
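A corresponding sketch of the loweffort featurization, assuming the keystroke stream has been reconstructed into text so that word counts can be taken (a simplifying assumption):

```python
def loweffort_features(keystroke_text: str, question: str, options: list[str]) -> dict:
    """Four loweffort featurizations based on word counts:
    l_q, l_k, l_q + l_o, and (l_q + l_o) / l_k."""
    l_k = len(keystroke_text.split())            # words typed across all keystrokes
    l_q = len(question.split())                  # words in the final question
    l_o = sum(len(o.split()) for o in options)   # words across all options
    return {
        "question_length": l_q,
        "keystroke_length": l_k,
        "question_plus_options_length": l_q + l_o,
        # guard against an empty keystroke log to avoid division by zero
        "question_plus_options_over_keystrokes": (l_q + l_o) / max(l_k, 1),
    }

print(loweffort_features(
    keystroke_text="What year did the city become the capital capital",
    question="What year did the city become the capital?",
    options=["1801", "1856", "1901", "1923"],
))
```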
Availability heuristic:
The tendency to rely
upon information that is more readily retrievable
from our memory is the availability heuristic (Tversky and Kahneman, 1973). For example, after hear-
ing about a plane crash on the news, people may
overstate the dangers of flying. For our task, once
an annotator has read a passage and formulated a
question, the question and the correct answer are
likely to be readily available in their mind. This
could cause them to write that information before any of the distractor options. Therefore, we check
whether the first option specified for an example is
also the correct answer (first option bias).
Another consequence of this heuristic is the
serial-position effect. When presented with a series
of items like a list of words or items in a grocery
list, people recall the first and last few items from
the series better than the middle ones (Murdock Jr, 1962; Ebbinghaus, 1964) because of their easier
availability. This effect can also be explained as
a combination of the primacy effect and recency
effect. To test if an annotator anchors their ques-
tions on the first or last sentence of the passage
due to this heuristic, we check if the correct answer
marked for an example matches a span in the first
or last sentence of the passage (serial position).
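The two availability featurizations can be sketched as follows; the substring-based span matching and the naive sentence splitting are our assumptions, since the paper does not spell out the matching procedure.

```python
def first_option_bias(options: list[str], correct: str) -> bool:
    """Availability (first option bias): is the first listed option the correct answer?"""
    return options[0] == correct

def serial_position(passage: str, correct: str) -> bool:
    """Availability (serial position): does the correct answer appear as a span
    in the first or last sentence of the passage? (Naive sentence split on '. ')"""
    sentences = [s.strip() for s in passage.split(". ") if s.strip()]
    if not sentences:
        return False
    return correct in sentences[0] or correct in sentences[-1]

passage = ("Ottawa is the capital of Canada. It lies on the Ottawa River. "
           "The city was chosen as capital in 1857")
print(first_option_bias(["1857", "1867", "1901", "1931"], "1857"))  # True
print(serial_position(passage, "1857"))                              # True (last sentence)
```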
Representativeness heuristic:
The representa-
tiveness heuristic is our tendency to use the simi-
larity of items to make decisions (Kahneman and Tversky, 1972). For example, if a person is picking
a movie to watch, they might think of movies they
previously liked and look for those attributes in
a new movie. Similarly, an annotator may repeat
the same construction in their questions to ease
decision-making (e.g., "which of the following is
true?" or "what year did [event] happen?"). This
could mean either that they are not fully engaged or that they have found a writing strategy that works well and choose to stick to it. We measure this
tendency by computing the average word overlap
across all pairs of questions from an annotator.
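A sketch of the word-overlap featurization, using Jaccard overlap over lowercased word sets as one plausible instantiation (the paper's exact overlap definition may differ):

```python
from itertools import combinations

def word_overlap(q1: str, q2: str) -> float:
    """Jaccard overlap between the word sets of two questions."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0

def avg_pairwise_overlap(questions: list[str]) -> float:
    """Average word overlap across all pairs of an annotator's questions."""
    pairs = list(combinations(questions, 2))
    if not pairs:
        return 0.0
    return sum(word_overlap(a, b) for a, b in pairs) / len(pairs)

questions = [
    "Which of the following is true about the author?",
    "Which of the following is true about the city?",
    "What year did the event happen?",
]
print(round(avg_pairwise_overlap(questions), 3))
```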
A different manner in which this heuristic can
manifest is using similarity with the provided con-
text, i.e., through copying. Copying, or imita-
tion, is a common building block that guides hu-
man behavior and decision making. In deciding
what clothes to buy or which book to read, hu-
mans use imitation-of-the-majority to make quicker inferences with less cognitive effort (Garcia-Retamero et al., 2009; Gigerenzer and Gaissmaier,