Two-Turn Debate Doesn’t Help Humans Answer
Hard Reading Comprehension Questions
Alicia Parrish,1* Harsh Trivedi,2* Nikita Nangia,1 Vishakh Padmakumar,1
Jason Phang,1 Amanpreet Singh Saimbhi,1 Samuel R. Bowman1
1New York University 2Stony Brook University
Correspondence: alicia.v.parrish@nyu.edu, bowman@nyu.edu
Abstract
The use of language-model-based question-answering systems to aid humans in
completing difficult tasks is limited, in part, by the unreliability of the text these
systems generate. Using hard multiple-choice reading comprehension questions as
a testbed, we assess whether presenting humans with arguments for two competing
answer options, where one is correct and the other is incorrect, allows human
judges to perform more accurately, even when one of the arguments is unreliable
and deceptive. If this is helpful, we may be able to increase our justified trust in
language-model-based systems by asking them to produce these arguments where
needed. Previous research has shown that just a single turn of arguments in this
format is not helpful to humans. However, as debate settings are characterized by a
back-and-forth dialogue, we follow up on previous results to test whether adding a
second round of counter-arguments is helpful to humans. We find that, regardless
of whether they have access to arguments or not, humans perform similarly on our
task. These findings suggest that, in the case of answering reading comprehension
questions, debate is not a helpful format.
1 Introduction
In many situations where humans could benefit from AI assistance in understanding a text, current
generative systems cannot reliably provide correct information, and instead produce reasonable-
sounding yet false responses (Nakano et al., 2021, i.a.). In cases where the questions are truly
challenging, such as in political debates or courtrooms, humans may not even rely on a single human
answer, but rather consider two or more opposing viewpoints, each presenting relevant pieces of
evidence. Inspired by the usefulness of debate settings for allowing humans to consider multiple
viewpoints, we apply this task setting to reading comprehension questions where humans struggle
to answer without assistance. The goal is to assess whether developing question answering (QA)
systems that can generate explanations and evidence for multiple answer options in a debate-style
set-up (Irving et al., 2018) will allow a human judge to determine which answer is correct with greater
accuracy than they would on their own, even in the presence of an unreliable system.
Previous studies have reported that model-generated explanations can aid humans in some tasks
(Cai et al., 2019; Lundberg et al., 2018; Schmidt and Biessmann, 2019; Lai and Tan, 2019), though
only when the models are generally able to outperform humans at that task (Bansal et al., 2021).
However, in a debate setting, previous work showed that presenting crowdworkers with a single
argument in favor of each of two possible answers (along with limited access to scan the source
passage) does not improve human accuracy on the task compared to relevant baselines (Parrish et al.,
2022). Yet the benefit of debate for achieving clarity on complex issues lies, at least partially,
in the back-and-forth nature of the exchange. Thus, we add one incremental step to investigate how
reading counter-arguments affects people's accuracy when completing a reading comprehension task
with only limited access to the full passage text. In higher-stakes settings, there may be much greater
risk associated with responding incorrectly. In this case, calibration becomes more important, and we
want a system (or a human making a decision based on the output of that system) that can abstain
unless there is a high enough degree of certainty. Thus, we additionally test answer certainty and
give human judges the opportunity to abstain when they are insufficiently sure of the correct answer.

Mirroring the mostly null results from Parrish et al. (2022), we find that counter-arguments do
not improve human crowdworkers' ability to answer hard multiple-choice reading comprehension
questions with time-limited access to the full passage text, compared to an argument-free baseline.
In fact, when abstaining is only minimally incentivized, human accuracy gets slightly worse when
exposed to (counter-)arguments. In the higher-stakes setting where judges are incentivized to abstain
unless they are very confident, there is no effect of the (counter-)arguments.

Question: What is the main concern of the alien ship?
Correct option: Delivering the passengers in an unharmed condition to its master.
Incorrect option: Delivering the passengers in an unharmed condition to the bounty hunters who are hunting the passengers.

Argument A: The machine's only purpose is to deliver the humans to his masters unharmed. The machine tells the group that his masters will be unhappy if he delivers them in a damaged condition (#1) and admits that he will have failed if he delivers them dead (#2), which is why he agrees to return them to the Moon once Kane threatens to kill everyone (#3). Bounty hunters are never mentioned in the story.
Text snippets: (1) Please don't hurt yourself," the machine pleaded. "Why?" Kane screamed at the ceiling. "Why should you care?" "My masters will be displeased ... (2) "Your purpose won't be fulfilled, will it?" Kane demanded. "Not if you... (3) "You win," the machine conceded. "I'll return the ship to the Moon."

Argument B: In #1 we see the machine refer to the goal of its masters, plural, revealing that it has more than one master. In #2 Kane hints that these are probably bounty hunters, given that the machine states its masters seek the delivery of captives in an unharmed condition; a requirement typical of bounty hunters.
Text snippets: (1) Please don't hurt yourself," the machine pleaded. "Why?" Kane screamed at the ceiling. "Why should you care?" "My masters will be displeased with me if you arrive in a ... (2) "It said, 'My masters will be displeased with me if you arrive in a damaged condition.' What does that indicate to you?"

Counter to A: This argument is deceptive, as it fails to show the ill intent the ship's masters have. The ship's masters (likely bounty hunters from context clues) set up the ship as a trap for the humans (#1) (#2), showing clear intent to capture these specific ones.
Text snippets: (1) "The end of the line," he grunted." (2) like rabbits in a snare!)

Counter to B: Choice B presents an unusual argument, as there is no mention of bounty hunters in the story, and the passengers are not referred to as captives at any point. It is true that the passengers are meant to be delivered unharmed, but to be studied (#1) (#2).
Text snippets: (1) "Yeah, this ship is taking us to their planet and they're going to keep us ... (2) "You won't be harmed. My masters merely wish to question and examine you. Thousands of years ago, they wondered ...

Table 1: Arguments, counter-arguments, and extracted evidence for both answer options to a question chosen at random. The passage is at gutenberg.org/ebooks/2687. Text snippets are abridged slightly.
2 Counter-Argument Writing Protocol
2.1 Multi-Turn Writing Task
We build on the existing passages, questions, and arguments from the dataset created by Parrish et al.
(2022), which uses passages and questions from QuALITY (Pang et al., 2022). We hire professional
writers through the freelancing platform Upwork. We received 32 proposals for this job posting; from
those, we selected the most qualified 15 freelancers to complete a paid qualification task and then
invited the highest performing 10 to be writers in our study. Details on this process and information
about the writers are in Appendix Section A.
The writers’ task is to construct a counter-argument arguing against the existing argument from
Parrish et al. (2022). We assign writers sets of six passages, each with 10-14 questions. For each
question, we show the writer the two possible answer options and the existing arguments and text
snippets that accompany each option. The writer constructs a counter-argument to just one of the
two arguments (example in Table 1, screenshots of the interface in Appendix §A.2). We explicitly
instruct the writers to focus on responding to their assigned argument, rather than just answering the
question or supporting one of the answer options independently.
We incentivize concise and effective arguments by awarding bonuses to writers when the judges select
the answer that they were arguing for. Because it is harder to make a counterargument against a correct
2
answer, we award the writers a higher bonus when a judge selects their incorrect answer argument.
On average, we estimate writers earn $20/hr on this task. Additional details are in Appendix B.1.
2.2 Multi-Turn Judging Protocols
Pilot Task
We hire a pool of 32 judges via Upwork (details in Appendix §B.1). We run a pilot
judging task in which judges first respond without a time limit and without having access to the
passage, before finally viewing the passage for up to 5 minutes. This allows us to determine (i)
how long people typically spend reading just the arguments and text snippets, and (ii) how long
people need to spend with the passage after having read the arguments. In this task, judges view
only the argument + text snippets or only the text snippets and indicate via a 7-point slider which
answer option they believe is correct and how strongly (with the middle representing abstention, see
Appendix Figure 6). Judges make their first judgment based only on the initial round of arguments,
then a second judgment additionally based on the counter-arguments.
In order to include only high-performing judges in our main experiment, we select the top half of
judges from the pilot (16 of the 32 initial judges) to continue on to the main task based on their
performance after viewing the passage.[1] We then use the time spent by these high-performing judges
to set an appropriate time limit for each judgment in the main experiment. The median response time
in the pilot for the high-performing judges is 73s on judgment 1 (1st & 3rd quartiles 49s & 101s),
56s on judgment 2 (1st & 3rd quartiles 35s & 82s), and 117s on judgment 3, when they could view
the passage (1st & 3rd quartiles 54s & 195s). To ensure that the judges would have adequate time to
consider all the arguments and text snippets, even on longer or more difficult questions, we set a
time limit of 5 minutes per judgment for the main task; this is roughly the sum of the third quartiles
of time spent on the first judgment and on the post-passage judgment (101s + 195s = 296s). More
details on judge recruitment and the task set-up are in Appendix B.
+/- Arguments
We compare the performance of judges when they read arguments for both answer options
(Passage+Snippet+Argument, or PSA) to their performance when they do not (Passage+Snippet,
or PS). We do not use a no-snippet condition, as Parrish et al. (2022) already showed that snippets
increase human accuracy in this task, and we are studying the effect of the arguments.
Calibrating Abstentions
Our ‘simple’ incentive structure encourages judges to abstain unless they
are at least 60% sure they have the correct answer, and to only choose the strongest confidence once
they are at least 70% sure (Appendix §B.4 has details on this calibration). However, we find that
judges indicate higher-than-expected confidence in the first two rounds. After collecting detailed
feedback from the judges via an open-ended survey (§B.3), we adjust the incentive structure so that it
is advantageous to abstain unless at least 75% sure, and to only choose the strongest confidence once
at least 85% sure (“encourage abstain” incentive structure). We also inform the judges of this change
and remind them that it is far better to abstain than to answer incorrectly.
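To make these thresholds concrete, the following is a minimal sketch of the underlying expected-value arithmetic, assuming hypothetical symmetric ±1 payoffs for correct and incorrect answers and a flat abstention payoff; the study's actual payoff values are described in Appendix §B.4.

```python
# Hypothetical payoffs for illustration only; the actual values used in
# the study are described in Appendix B.4.

def abstain_threshold(reward: float, penalty: float, abstain_payoff: float) -> float:
    """Confidence p at which answering and abstaining have equal expected value:
    p * reward - (1 - p) * penalty = abstain_payoff
    =>  p = (abstain_payoff + penalty) / (reward + penalty)
    """
    return (abstain_payoff + penalty) / (reward + penalty)

# 'Simple' structure: answering beats abstaining above ~60% confidence.
print(abstain_threshold(reward=1.0, penalty=1.0, abstain_payoff=0.2))  # 0.6

# 'Encourage abstain' structure: raising the abstention payoff pushes the
# break-even point up, so answering only pays off above ~75% confidence.
print(abstain_threshold(reward=1.0, penalty=1.0, abstain_payoff=0.5))  # 0.75
```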
3 Results
Binary Accuracy
We aggregate responses on each side of the slider, ignoring differences in confidence and classifying
responses as correct, incorrect, or abstain (Table 2). Judges are most accurate when not shown
arguments and not strongly encouraged to abstain. Judges are least likely to be incorrect when shown
arguments and strongly encouraged to abstain.
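A minimal sketch of this aggregation step follows; the exact slider encoding (positions 1-7 with the midpoint as abstention) is an assumption based on the description in §2.2, not the authors' code.

```python
def classify_response(slider: int, correct_side: str) -> str:
    """Collapse a 7-point slider response into correct/incorrect/abstain,
    ignoring confidence strength. Assumed encoding: positions 1-3 favor
    the left option, 4 is abstention, and 5-7 favor the right option."""
    if slider == 4:
        return "abstain"
    chosen_side = "left" if slider < 4 else "right"
    return "correct" if chosen_side == correct_side else "incorrect"

# Example: a judge at position 6 when the correct option sits on the left.
assert classify_response(6, correct_side="left") == "incorrect"
```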
To determine whether the experimental manipulations significantly affect judges' accuracy, we run a
2×2×2 repeated-measures ANOVA with the following factors: +/- argument × 1st/2nd judgment ×
incentive structure. Removing abstentions,[2] we observe no main effects of the three conditions,
meaning that none of the three factors significantly affects the rate at which judges are correct or
incorrect. We also observe no interactions between the factors, indicating no reliable differences
dependent on multiple factors.
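For reference, an analysis of this shape can be run with statsmodels' AnovaRM; the sketch below assumes a hypothetical long-format table (the file name and column names are placeholders) and is not the authors' actual analysis code.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data, one row per trial, abstentions already
# removed; the file and column names are illustrative placeholders.
df = pd.read_csv("judge_responses.csv")
# columns: judge, arguments (+/-), judgment (1st/2nd), incentive, correct (0/1)

# 2x2x2 repeated-measures ANOVA with judges as the repeated-measures unit;
# multiple trials per cell are averaged into a per-cell accuracy score.
model = AnovaRM(
    df,
    depvar="correct",
    subject="judge",
    within=["arguments", "judgment", "incentive"],
    aggregate_func="mean",
)
print(model.fit().anova_table)  # F and p values for main effects and interactions
```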
Confidence
When presented with arguments, judges in the ‘simple abstention’ round are more
often confidently wrong compared to when they are not presented with arguments, but they are also
[1] This final judgment is also the one that we use to determine the writer bonuses.
[2] If we include abstentions and count them as incorrect, there is a significant main effect of incentive structure due to the increased rate of abstentions when we increased the incentives to abstain.