Preprint.
WIKIWHY: ANSWERING AND EXPLAINING
CAUSE-AND-EFFECT QUESTIONS
Matthew Ho∗, Aditya Sharma∗, Justin Chang∗,
Michael Saxon, Sharon Levy, Yujie Lu, William Yang Wang
Department of Computer Science, University of California, Santa Barbara, USA
{msho,adityasharma,justinchang}@ucsb.edu,
{saxon,sharonlevy,yujielu}@ucsb.edu, william@cs.ucsb.edu
∗Equal contribution
ABSTRACT
As large language models (LLMs) grow larger and more sophisticated, assessing
their “reasoning” capabilities in natural language grows more challenging. Re-
cent question answering (QA) benchmarks that attempt to assess reasoning are
often limited by a narrow scope of covered situations and subject matters. We in-
troduce WIKIWHY, a QA dataset built around a novel auxiliary task: explaining
why an answer is true in natural language. WIKIWHY contains over 9,000 “why”
question-answer-rationale triples, grounded on Wikipedia facts across a diverse set
of topics. Each rationale is a set of supporting statements connecting the question
to the answer. WIKIWHY serves as a benchmark for the reasoning capabilities of
LLMs because it demands rigorous explicit rationales for each answer to demon-
strate the acquisition of implicit commonsense knowledge, which is unlikely to be
easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correct-
ness in the end-to-end answer & explain condition, leaving significant room for
future improvements.
1 INTRODUCTION
Error analyses of practical NLP systems in recent history demonstrate that some of the mistakes
made by state-of-the-art models would be avoided by basic human intuition (Shuster et al., 2022),
and some of the most challenging tasks for models are the same ones that might be trivial to hu-
man children. With modern systems’ impressive performance on tasks such as grammar correction
showing that manipulating language is not the issue, LLMs seem to face a fundamental lack of
common sense: an understanding of everyday phenomena and how they interact with each other and the
world at large. As striking gains in subjective performance on summarization, creative text genera-
tion, and apparent language understanding continue to be called into question, the development of
strong benchmarks to assess reasoning capabilities for these LLMs grows more important.
One popular approach to measuring reasoning capability is through performance on question an-
swering (QA) benchmark tasks where direct queries for information act as a straightforward exam-
ination of a system’s “understanding.” Classic QA datasets, however, are primarily concerned with
retrieving factoids to answer questions of “Who”, “What”, “When”, and “Where”. These questions
have been shown to be answerable (with high accuracy) by simple pattern-matching approaches
(Wadhwa et al., 2018), thereby limiting their ability to measure the aforementioned reasoning capa-
bility. Looking to maintain the breadth of topics covered while increasing the difficulty of the QA
task, researchers introduced multi-hop QA datasets like HotpotQA (Yang et al., 2018). While chal-
lenging, the task’s extra complexity mostly leads to unnatural questions that can be addressed with
iterated factoid retrieval and entity resolution, rather than a necessary understanding of how different
entities interact. Noticeably absent from these prior datasets are “why” questions, which prompt
not for factoids, but for explanations: reasoning made explicit.
The task of explanation uses reasoning and produces explicit, interpretable “thought” processes.
Capitalizing on these properties, this paper introduces WIKIWHY, a novel dataset containing “why”
question-answer pairs. Each WIKIWHY entry contains a rationale explaining the QA pair’s causal relation (Figure 1), summing to a total of 14,238 explanation elements. In the context of recent multimodal, self-supervised approaches aiming to capture intuitions unlearnable from text alone (Chadha & Jain, 2021), WIKIWHY presents an opportunity to investigate a specific kind of information absent in text: implicit commonsense assumptions. Compared to other QA datasets with rationales, WIKIWHY covers a significantly broader range of 11 topics, which may prove valuable for developing applied reasoning across varied, specific situations.

[Figure 1]
PASSAGE: “... Numerous plans for the Second Avenue Subway appeared throughout the 20th century, but these were usually deferred due to lack of funds ...”
QUESTION: Why were numerous plans for the Second Avenue Subway of New York City deferred throughout the 20th century?
ANSWER: Lack of funds.
REASONING: Contractors complete construction. Contractors need to be compensated.
CAUSE: Lack of funds. EFFECT: Numerous plans for the Second Avenue Subway of New York City were deferred throughout the 20th century.
Figure 1: A simple example of an entry from WIKIWHY: a cause and effect sourced from a Wikipedia passage, a “why” question and its answer about this relation, and, most importantly, a rationale that explains why the cause leads to the effect.
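To make the structure shown in Figure 1 concrete, the sketch below renders one such entry as a small Python record. The class and field names (WikiWhyEntry, passage, rationale, etc.) are illustrative assumptions for exposition, not the released dataset’s actual schema.

```python
# Illustrative sketch of one WIKIWHY-style entry (field names are assumptions,
# not the released dataset's schema). The rationale is a list of supporting
# statements connecting the cause to the effect.
from dataclasses import dataclass, field
from typing import List

@dataclass
class WikiWhyEntry:
    passage: str            # grounding Wikipedia passage
    question: str           # "why" question about the effect
    answer: str             # the cause, stated as the answer
    cause: str
    effect: str
    rationale: List[str] = field(default_factory=list)

entry = WikiWhyEntry(
    passage="Numerous plans for the Second Avenue Subway appeared throughout "
            "the 20th century, but these were usually deferred due to lack of funds.",
    question="Why were numerous plans for the Second Avenue Subway of New York "
             "City deferred throughout the 20th century?",
    answer="Lack of funds.",
    cause="Lack of funds.",
    effect="Numerous plans for the Second Avenue Subway of New York City were "
           "deferred throughout the 20th century.",
    rationale=[
        "Contractors complete construction.",
        "Contractors need to be compensated.",
    ],
)
```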
Our experiments in explanation generation and human evaluation demonstrate that state-of-the-art
generative models struggle with producing satisfying explanations for WIKIWHY cause-effect rela-
tions. Our experiments also demonstrate how our proposed task might be used to diagnose a lack of
“understanding” in certain relations. Our key contributions are thus:
• We propose explanation within cause-effect relations as a novel problem formulation for exploring LLM reasoning ability.
• We create WIKIWHY, the first question-answering dataset focusing on reasoning within causal relations, spanning 11 topics.
• We perform experiments on state-of-the-art, generative models to investigate various settings and establish baseline results with sizable room for improvement.
• We introduce idea-level evaluation metrics for free-form text (explanation) generation and a human judgment correlation analysis, demonstrating that (1) reference similarity is strongly correlated with explanation correctness, and (2) the metrics we introduced correlate with this proxy (an illustrative sketch of such an idea-level comparison follows below).
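As a rough illustration only, and not the paper’s actual metric, idea-level comparison can be pictured as matching each reference explanation step to its most similar generated step with a sentence-embedding model. The sketch below assumes the sentence-transformers library and a hypothetical idea_overlap helper.

```python
# Illustrative sketch of an "idea-level" overlap score: the fraction of
# reference ideas covered by some predicted explanation step, judged by
# embedding cosine similarity. The helper name, threshold, and model choice
# are assumptions, not WIKIWHY's released metric.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def idea_overlap(predicted_steps, reference_steps, threshold=0.6):
    """Return the fraction of reference ideas matched by a predicted step."""
    pred_emb = _model.encode(predicted_steps, convert_to_tensor=True)
    ref_emb = _model.encode(reference_steps, convert_to_tensor=True)
    sims = util.cos_sim(ref_emb, pred_emb)            # shape [num_ref, num_pred]
    covered = sims.max(dim=1).values >= threshold     # best match per reference idea
    return covered.float().mean().item()

print(idea_overlap(
    ["Contractors must be paid to build the subway."],
    ["Contractors complete construction.", "Contractors need to be compensated."],
))
```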
2 RELATED WORK
Cause and Effect. Causality has been a subject of rigorous work in various fields. In the philosophy
of science, Pearl (2009) has contributed seminal work relating to causal models, Bayesian networks,
and causal strength via interventions and counterfactuals. These ideas have even been incorporated
into QA tasks through Knowledge Graph approaches, such as filtering spurious latent correlations
(Sui et al., 2022). While our work emphasizes cause-and-effect, we are unconcerned with causal
strength as we begin with Wikipedia-grounded relations and are interested in the information en-
coded into LLMs rather than augmented structures such as knowledge graphs.
Multi-hop Question Answering. While datasets such as HotpotQA (Yang et al., 2018) and Hy-
bridQA (Chen et al., 2020) are instrumental in gauging models’ ability to handle multiple sources
and modalities, they are focused on iterated factoid retrieval. Although chaining multiple facts into
a multi-hop answer is useful in practical applications, WIKIWHY focuses on in-filling rationales to demonstrate
reasoning.
Table 1: A comparison of WIKIWHY with previous QA datasets relating to explanation
Dataset           Size    Answer Type  Explanation Type  Topics  Source
CoS-E¹            9,500   MCQ          1-step            1       ConceptNet
eQASC²            29,980  MCQ          2-step            1       WorldTree
CausalQA³         24,000  Short        None              1       Yahoo Finance
EntailmentBank⁴   1,840   Short        Tree              1       WorldTree
WIKIWHY           9,406   Short        Set/Chain         11      Wikipedia
¹ Rajani et al. (2019)  ² Jhamtani & Clark (2020)  ³ Yang et al. (2022)  ⁴ Dalvi et al. (2021)
Visual Question Answering. Vision and language tasks have also intersected with both QA
and reasoning. The Visual Question Answering (VQA) dataset (Agrawal et al., 2015) prompts
textual answers to questions about images. However, the caption-based generation leads to surface-
level questions that require little reasoning ability, and the multiple-choice output format precludes
explicit reasoning. The vision-based Sherlock dataset (Hessel et al., 2022) is much closer to our
work, focusing on abductive reasoning (working backward from a consequence). Setting aside
modality differences, WIKIWHY requires deeper reasoning with its multi-hop explanations.
Explainable QA. One previous approach to building explanation resources collects direct answers
to “why” questions. TellMeWhy (Lal et al., 2021) features question-answer pairs tied to short story
narrative contexts. The dataset skips step-wise explanations, prioritizing reading comprehension
instead. On the other hand, ELI5 (Fan et al., 2019) dives deep into reasoning with long-form,
detailed explanations. However, the open-endedness (compared to explaining a specific cause-effect
relation) complicates evaluating candidate responses.
Another line of QA work emphasizes a rationale component as support for answer predictions.
Datasets like CoS-E (Rajani et al., 2019), eQASC (Jhamtani & Clark, 2020), and EntailmentBank
(Dalvi et al., 2021) focus on explanation and reasoning much like WIKIWHY, albeit with significant
differences (Table 1). CoS-E’s explanations for CommonsenseQA (Talmor et al., 2019) mark an
important first step, but the commonsense explanations have limited depth, often requiring a sin-
gle hop of reasoning. eQASC and EntailmentBank feature richer explanations with more complex
structure, tightly focusing on grade-school-level science facts. Regarding structure, the fixed-length
rationales in CoS-E, eQASC, FEVER (Thorne et al., 2018), and e-SNLI (Camburu et al., 2018) capture
less granularity, while entailment trees accept limitations in scale and naturalness in exchange for
complete ordering information. Previous datasets tend towards retrieval tasks with eQASC’s corpus
of all rationale sentences and EntailmentBank’s collection of root causes. Retrieval enables simple
evaluation, at the cost of decreased difficulty, the possibility for exploiting spurious artifacts, and
reduced debugging opportunity.
3 BACKGROUND
3.1 WHY FOCUS ON “WHY” QUESTIONS?
“Why” questions are underrepresented in other QA datasets. Users tend to ask straightforward
questions that use words like “who”, “what”, “when”, or “where.” Questions of this more common
form have simple answers that state standalone facts, which may be elaborated on but do not require
explanation. Consider the pair “Q: Where do the Tigris and Euphrates rivers meet? A: The Persian
Gulf.” The answer is straightforward.
In contrast, a “why” QA-pair encodes a cause-effect relation. Take, for example, “Q: Why are
precipitation levels falling in the Tigris and Euphrates river basin? A: Climate Change.” This pair
encodes the causal relation “Climate change is reducing the amount of precipitation in the Tigris and
Euphrates river basin” (Figure 2). The answer to a “why”-question is an explanation itself (climate
change explains reduced precipitation), but we can take it a step further and ask “why” again to
request the understanding or intuition of this process. While there are some processes at the edge
of human understanding or taken as axioms, we assert that there are valid explanations for most
processes due to the layered nature of human understanding. This extra step is especially worth
taking since it allows WIKIWHY to not only test if a model “knows” that “climate change causes reduced precipitation” but also if it “understands” the underlying mechanics of why that is the case.

[Figure 2]
Step Sequence example (C → S1 → S2 → E). CAUSE: Climate change around the Tigris and Euphrates river basins. ELEMENT 1: Climate change increases temperature. ELEMENT 2: Higher temperatures increase the atmosphere’s water-storing capacity. EFFECT: There will be less precipitation in the Tigris and Euphrates river basins.
Rationale Set example ({C, S1, S2, S3} ⇒ E). CAUSE: The USSR mostly traded with Eastern Bloc neighbors. ELEMENT 1: Eastern Bloc countries are connected by land. ELEMENT 2: A merchant marine trades by sea. ELEMENT 3: More direct routes are preferable in trading. EFFECT: The merchant marine was not used much under Joseph Stalin.
Figure 2: Explanation topologies in WIKIWHY mainly vary between a sequence of intermediate conclusions (chain-like) and a set of rationales that combine with the original cause to entail the final effect.
3.2 TASK FORMULATION
Formally defined in §5, we propose a generative explanation task. Previous works have made
strides in assessing reasoning through multiple choice (Lu et al., 2022), retrieval (Asai et al., 2019),
and partial generation (Dalvi et al., 2021). While these works are undoubtedly crucial towards the
end goal of understanding and reasoning, their task formulations have some drawbacks. Referring
back to education, studies on human students have shown that multiple choice questions “obscure
nuance in student thinking” (Hubbard et al., 2017). Likewise, a selection decision can be correct
for retriever systems but for the wrong reasons. Augmenting multi-hop factoid questions with an
additional task of selecting the relevant supporting facts from the context passage, Inoue et al. (2020)
emphasizes that interpretability is lost in the absence of explanation. Furthermore, text generation
to combine existing ideas is arguably a different task than generating from scratch. The field of
psychology defines recall (mental retrieval of information) as a distinct process from recognition
(mental familiarity with the cue) (Mohr et al., 1989). Neural nets’ biological inspiration suggests
that there might be a similar difference between cue-aided retrieval and freeform generation. In
the context of NLP, we are interested in the implicit understandings and assumptions embedded in
LLMs and hypothesize that an entirely generative approach is most conducive to this study.
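A minimal sketch of what such a fully generative answer-and-explain query could look like is shown below. The prompt wording and the llm_generate placeholder are illustrative assumptions, not the paper’s exact prompts or API calls; any generative LLM wrapper could be substituted.

```python
# Minimal sketch of the fully generative "answer & explain" setting: given only
# the question, the model must produce both the answer and a step-by-step
# rationale. The prompt text and `llm_generate` helper are placeholders.

def build_answer_and_explain_prompt(question: str) -> str:
    return (
        f"Question: {question}\n"
        "Answer the question, then explain step by step why the cause "
        "leads to the effect described in the question.\n"
        "Answer:"
    )

def answer_and_explain(question: str, llm_generate) -> str:
    """`llm_generate` is any callable wrapping a generative LLM (e.g. GPT-3)."""
    prompt = build_answer_and_explain_prompt(question)
    return llm_generate(prompt)

# Example with a stub in place of a real model call:
print(answer_and_explain(
    "Why are precipitation levels falling in the Tigris and Euphrates river basin?",
    llm_generate=lambda p: "Climate change.\nExplanation: ...",
))
```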
3.3 EXPLANATION STRUCTURE
Explanations come in various structures, as seen in the typology defined by Ribeiro et al. (2022).
Shown in Figure 2, our work focuses on a subset of said typology. WIKIWHY includes two struc-
tures that explain cause-and-effect relations: (1) multi-hop step sequences and (2) rationale sets.
While the chain structure adds intermediate conclusions between cause and effect, rationale sets
contain elements that support the relation from without. The rationale set topology acts as our gen-
eral, catch-all case that other structures can be condensed to. Since our data collection procedure
promotes a stepwise, ordered approach, we also consider the sequential topology to respect the
structure exhibited in applicable explanations. We forgo the unstructured approach, as even limited
structure helps bring the evaluation of freeform generated text within reach. Finally, we opt against
pursuing the most complex entailment tree organization to maintain naturalness and facilitate crowd-
sourcing scalability.
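As a small illustration of these two topologies, the sketch below models the Figure 2 examples as simple Python records: an ordered list of intermediate conclusions for the chain case, and an unordered set of supporting statements for the rationale-set case. The type names are assumptions for clarity, not part of WIKIWHY’s release.

```python
# Illustrative sketch of the two explanation topologies in Figure 2
# (type names are assumptions). A chain orders intermediate conclusions
# between cause and effect; a rationale set supports the relation without
# an internal ordering.
from dataclasses import dataclass
from typing import List, Set

@dataclass
class ChainExplanation:
    cause: str
    steps: List[str]      # ordered intermediate conclusions: C -> S1 -> ... -> E
    effect: str

@dataclass
class SetExplanation:
    cause: str
    rationale: Set[str]   # unordered statements that, with the cause, entail the effect
    effect: str

chain = ChainExplanation(
    cause="Climate change around the Tigris and Euphrates river basins.",
    steps=[
        "Climate change increases temperature.",
        "Higher temperatures increase the atmosphere's water-storing capacity.",
    ],
    effect="There will be less precipitation in the Tigris and Euphrates river basins.",
)

rationale_set = SetExplanation(
    cause="The USSR mostly traded with Eastern Bloc neighbors.",
    rationale={
        "Eastern Bloc countries are connected by land.",
        "A merchant marine trades by sea.",
        "More direct routes are preferable in trading.",
    },
    effect="The merchant marine was not used much under Joseph Stalin.",
)
```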