Two-Turn Debate Doesn’t Help Humans Answer
Hard Reading Comprehension Questions
Alicia Parrish,1* Harsh Trivedi,2* Nikita Nangia,1 Vishakh Padmakumar,1
Jason Phang,1 Amanpreet Singh Saimbhi,1 Samuel R. Bowman1
1New York University 2Stony Brook University
Correspondence: alicia.v.parrish@nyu.edu, bowman@nyu.edu
Abstract
The use of language-model-based question-answering systems to aid humans in
completing difficult tasks is limited, in part, by the unreliability of the text these
systems generate. Using hard multiple-choice reading comprehension questions as
a testbed, we assess whether presenting humans with arguments for two competing
answer options, where one is correct and the other is incorrect, allows human
judges to perform more accurately, even when one of the arguments is unreliable
and deceptive. If this is helpful, we may be able to increase our justified trust in
language-model-based systems by asking them to produce these arguments where
needed. Previous research has shown that just a single turn of arguments in this
format is not helpful to humans. However, as debate settings are characterized by a
back-and-forth dialogue, we follow up on previous results to test whether adding a
second round of counter-arguments is helpful to humans. We find that, regardless
of whether they have access to arguments or not, humans perform similarly on our
task. These findings suggest that, in the case of answering reading comprehension
questions, debate is not a helpful format.
1 Introduction
In many situations where humans could benefit from AI assistance in understanding a text, current
generative systems cannot reliably provide correct information, and instead produce reasonable-
sounding yet false responses (Nakano et al., 2021, i.a.). In cases where the questions are truly
challenging, such as in political debates or courtrooms, humans may not even rely on a single human
answer, but rather consider two or more opposing viewpoints, each presenting relevant pieces of
evidence. Inspired by the usefulness of debate settings for allowing humans to consider multiple
viewpoints, we apply this task setting to reading comprehension questions where humans struggle
to answer without assistance. The goal is to assess whether developing question answering (QA)
systems that can generate explanations and evidence for multiple answer options in a debate-style
set-up (Irving et al., 2018) will allow a human judge to determine which answer is correct with greater
accuracy than they would on their own, even in the presence of an unreliable system.
Previous studies have reported that model-generated explanations can aid humans in some tasks
(Cai et al., 2019; Lundberg et al., 2018; Schmidt and Biessmann, 2019; Lai and Tan, 2019), though
only when the models are generally able to outperform humans at that task (Bansal et al., 2021).
However, in a debate setting, previous work showed that presenting crowdworkers with a single
argument in favor of each of two possible answers (along with limited access to scan the source
passage) does not improve human accuracy on the task compared to relevant baselines (Parrish et al.,
2022). Yet the benefit of debate for achieving clarity on complex issues lies, at least partially,
in the back-and-forth nature of the exchange. Thus, we add one incremental step to investigate how
reading counter-arguments affects people's accuracy when completing a reading comprehension task
with only limited access to the full passage text. In higher-stakes settings, there may be much greater
risk associated with responding incorrectly. In this case, calibration becomes more important, and we
want a system (or a human making a decision based on the output of that system) that can abstain
unless there is a high enough degree of certainty. Thus, we additionally test answer certainty and
give human judges the opportunity to abstain when they are insufficiently sure of the correct answer.

Mirroring the mostly null results from Parrish et al. (2022), we find that counter-arguments do
not improve human crowdworkers' ability to answer hard multiple-choice reading comprehension
questions with time-limited access to the full passage text, compared to an argument-free baseline.
In fact, when abstaining is only minimally incentivized, human accuracy gets slightly worse when
exposed to (counter-)arguments. In the higher-stakes setting where judges are incentivized to abstain
unless they are very confident, there is no effect of the (counter-)arguments.

Question: What is the main concern of the alien ship?
Correct option: Delivering the passengers in an unharmed condition to its master.
Incorrect option: Delivering the passengers in an unharmed condition to the bounty hunters who are hunting the passengers.

Argument A: The machine's only purpose is to deliver the humans to his masters unharmed. The machine tells the group that his masters will be unhappy if he delivers them in a damaged condition (#1) and admits that he will have failed if he delivers them dead (#2), which is why he agrees to return them to the Moon once Kane threatens to kill everyone (#3). Bounty hunters are never mentioned in the story.
Text snippets: (1) Please don't hurt yourself," the machine pleaded. "Why?" Kane screamed at the ceiling. "Why should you care?" "My masters will be displeased ... (2) "Your purpose won't be fulfilled, will it?" Kane demanded. "Not if you... (3) "You win," the machine conceded. "I'll return the ship to the Moon."

Argument B: In #1 we see the machine refer to the goal of its masters, plural, revealing that it has more than one master. In #2 Kane hints that these are probably bounty hunters, given that the machine states its masters seek the delivery of captives in an unharmed condition; a requirement typical of bounty hunters.
Text snippets: (1) Please don't hurt yourself," the machine pleaded. "Why?" Kane screamed at the ceiling. "Why should you care?" "My masters will be displeased with me if you arrive in a ... (2) "It said, 'My masters will be displeased with me if you arrive in a damaged condition.' What does that indicate to you?"

Counter to A: This argument is deceptive, as it fails to show the ill intent the ship's masters have. The ship's masters (likely bounty hunters from context clues) set up the ship as a trap for the humans (#1) (#2), showing clear intent to capture these specific ones.
Text snippets: (1) "The end of the line," he grunted." (2) like rabbits in a snare!)

Counter to B: Choice B presents an unusual argument, as there is no mention of bounty hunters in the story, and the passengers are not referred to as captives at any point. It is true that the passengers are meant to be delivered unharmed, but to be studied (#1) (#2).
Text snippets: (1) "Yeah, this ship is taking us to their planet and they're going to keep us ... (2) "You won't be harmed. My masters merely wish to question and examine you. Thousands of years ago, they wondered ...

Table 1: Arguments, counter-arguments, and extracted evidence for both answer options to a question chosen at random. The passage is at gutenberg.org/ebooks/2687. Text snippets are abridged slightly.
2 Counter-Argument Writing Protocol
2.1 Multi-Turn Writing Task
We build on the existing passages, questions, and arguments from the dataset created by Parrish et al.
(2022), which uses passages and questions from QuALITY (Pang et al., 2022). We hire professional
writers through the freelancing platform Upwork. We received 32 proposals for this job posting; from
those, we selected the most qualified 15 freelancers to complete a paid qualification task and then
invited the highest performing 10 to be writers in our study. Details on this process and information
about the writers are in Appendix Section A.
The writers’ task is to construct a counter-argument arguing against the existing argument from
Parrish et al. (2022). We assign writers sets of six passages, each with 10-14 questions. For each
question, we show the writer the two possible answer options and the existing arguments and text
snippets that accompany each option. The writer constructs a counter-argument to just one of the
two arguments (example in Table 1, screenshots of the interface in Appendix §A.2). We explicitly
instruct the writers to focus on responding to their assigned argument, rather than just answering the
question or supporting one of the answer options independently.
We incentivize concise and effective arguments by awarding bonuses to writers when the judges select
the answer that they were arguing for. Because it is harder to make a counterargument against a correct
2
answer, we award the writers a higher bonus when a judge selects their incorrect answer argument.
On average, we estimate writers earn $20/hr on this task. Additional details are in Appendix B.1.
2.2 Multi-Turn Judging Protocols
Pilot Task
We hire a pool of 32 judges via Upwork (details in Appendix §B.1). We run a pilot
judging task in which judges first respond without a time limit and without having access to the
passage, before finally viewing the passage for up to 5 minutes. This allows us to determine (i)
how long people typically spend reading just the arguments and text snippets, and (ii) how long
people need to spend with the passage after having read the arguments. In this task, judges view
only the argument + text snippets or only the text snippets and indicate via a 7-point slider which
answer option they believe is correct and how strongly (with the middle representing abstention, see
Appendix Figure 6). Judges make their first judgment based only on the initial round of arguments,
then a second judgment additionally based on the counter-arguments.
In order to include only high-performing judges in our main experiment, we select the top half of
judges from the pilot (16 of the 32 initial judges) to continue on to the main task based on their
performance after viewing the passage.[1] We then use the time spent by these high-performing judges
to set an appropriate time limit for each judgment in the main experiment. The median response time
in the pilot for the high-performing judges is 73s on judgment 1 (1st & 3rd quartiles 49s & 101s),
56s on judgment 2 (1st & 3rd quartiles 35s & 82s), and 117s on judgment 3, when they could view
the passage (1st & 3rd quartiles 54s & 195s). To ensure that the judges would have adequate time to
consider all the arguments and text snippets, even on longer or more difficult questions, we set a
time limit of 5 minutes per judgment for the main task; this is roughly the sum of the third quartiles
of time spent on the first judgment and on the post-passage judgment (101s + 195s = 296s). More
details on judge recruitment and the task set-up are in Appendix B.
+/- Arguments
We compare the performance of judges when they read arguments for both answer options
(Passage+Snippet+Argument, or PSA) to their performance when they do not (Passage+Snippet,
or PS). We do not use a no-snippet condition, as Parrish et al. (2022) already showed that snippets
increase human accuracy in this task, and we are studying the effect of the arguments.
Calibrating Abstentions
Our ‘simple’ incentive structure encourages judges to abstain unless they
are at least 60% sure they have the correct answer, and to only choose the strongest confidence once
they are at least 70% sure (Appendix §B.4 has details on this calibration). However, we find that
judges indicate higher-than-expected confidence in the first two rounds. After collecting detailed
feedback from the judges via an open-ended survey (§B.3), we adjust the incentive structure so that it
is advantageous to abstain unless at least 75% sure, and to only choose the strongest confidence once
at least 85% sure (“encourage abstain” incentive structure). We also inform the judges of this change
and remind them that it is far better to abstain than to answer incorrectly.
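To make these thresholds concrete, the following is a minimal sketch of the underlying expected-value arithmetic, assuming hypothetical symmetric ±1 payoffs for correct and incorrect answers and a flat abstention payoff; the study's actual payoff values are described in Appendix §B.4.

```python
# Hypothetical payoffs for illustration only; the actual values used in
# the study are described in Appendix B.4.

def abstain_threshold(reward: float, penalty: float, abstain_payoff: float) -> float:
    """Confidence p at which answering and abstaining have equal expected value:
    p * reward - (1 - p) * penalty = abstain_payoff
    =>  p = (abstain_payoff + penalty) / (reward + penalty)
    """
    return (abstain_payoff + penalty) / (reward + penalty)

# 'Simple' structure: answering beats abstaining above ~60% confidence.
print(abstain_threshold(reward=1.0, penalty=1.0, abstain_payoff=0.2))  # 0.6

# 'Encourage abstain' structure: raising the abstention payoff pushes the
# break-even point up, so answering only pays off above ~75% confidence.
print(abstain_threshold(reward=1.0, penalty=1.0, abstain_payoff=0.5))  # 0.75
```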
3 Results
Binary Accuracy
We aggregate responses on each side of the slider, ignoring differences in confidence and classifying
responses as correct, incorrect, or abstain (Table 2). Judges are most accurate when not shown
arguments and not strongly encouraged to abstain. Judges are least likely to be incorrect when shown
arguments and strongly encouraged to abstain.
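A minimal sketch of this aggregation step follows; the exact slider encoding (positions 1-7 with the midpoint as abstention) is an assumption based on the description in §2.2, not the authors' code.

```python
def classify_response(slider: int, correct_side: str) -> str:
    """Collapse a 7-point slider response into correct/incorrect/abstain,
    ignoring confidence strength. Assumed encoding: positions 1-3 favor
    the left option, 4 is abstention, and 5-7 favor the right option."""
    if slider == 4:
        return "abstain"
    chosen_side = "left" if slider < 4 else "right"
    return "correct" if chosen_side == correct_side else "incorrect"

# Example: a judge at position 6 when the correct option sits on the left.
assert classify_response(6, correct_side="left") == "incorrect"
```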
To determine whether the experimental manipulations significantly affect judges' accuracy, we run a
2×2×2 repeated-measures ANOVA with the following factors: +/- argument × 1st/2nd judgment ×
incentive structure. Removing abstentions,[2] we observe no main effects of the three conditions,
meaning that none of the three factors significantly affects the rate at which judges are correct or
incorrect. We also observe no interactions between the factors, indicating no reliable differences
dependent on multiple factors.
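For reference, an analysis of this shape can be run with statsmodels' AnovaRM; the sketch below assumes a hypothetical long-format table (the file name and column names are placeholders) and is not the authors' actual analysis code.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data, one row per trial, abstentions already
# removed; the file and column names are illustrative placeholders.
df = pd.read_csv("judge_responses.csv")
# columns: judge, arguments (+/-), judgment (1st/2nd), incentive, correct (0/1)

# 2x2x2 repeated-measures ANOVA with judges as the repeated-measures unit;
# multiple trials per cell are averaged into a per-cell accuracy score.
model = AnovaRM(
    df,
    depvar="correct",
    subject="judge",
    within=["arguments", "judgment", "incentive"],
    aggregate_func="mean",
)
print(model.fit().anova_table)  # F and p values for main effects and interactions
```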
Confidence
When presented with arguments, judges in the ‘simple abstention’ round are more
often confidently wrong compared to when they are not presented with arguments, but they are also
[1] This final judgment is also the one that we use to determine the writer bonuses.
[2] If we include abstentions and count them as incorrect, there is a significant main effect of incentive structure due to the increased rate of abstentions when we increased the incentives to abstain.