Preprint.
WIKIWHY: ANSWERING AND EXPLAINING
CAUSE-AND-EFFECT QUESTIONS
Matthew Ho∗, Aditya Sharma∗, Justin Chang∗,
Michael Saxon, Sharon Levy, Yujie Lu, William Yang Wang
Department of Computer Science, University of California, Santa Barbara, USA
{msho,adityasharma,justinchang}@ucsb.edu,
{saxon,sharonlevy,yujielu}@ucsb.edu, william@cs.ucsb.edu
∗Equal contribution
ABSTRACT
As large language models (LLMs) grow larger and more sophisticated, assessing
their “reasoning” capabilities in natural language grows more challenging. Re-
cent question answering (QA) benchmarks that attempt to assess reasoning are
often limited by a narrow scope of covered situations and subject matters. We in-
troduce WIKIWHY, a QA dataset built around a novel auxiliary task: explaining
why an answer is true in natural language. WIKIWHY contains over 9,000 “why”
question-answer-rationale triples, grounded on Wikipedia facts across a diverse set
of topics. Each rationale is a set of supporting statements connecting the question
to the answer. WIKIWHY serves as a benchmark for the reasoning capabilities of
LLMs because it demands rigorous explicit rationales for each answer to demon-
strate the acquisition of implicit commonsense knowledge, which is unlikely to be
easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correct-
ness in the end-to-end answer & explain condition, leaving significant room for
future improvements.
1 INTRODUCTION
Error analyses of practical NLP systems in recent history demonstrate that some of the mistakes
made by state-of-the-art models would be avoided by basic human intuition (Shuster et al., 2022),
and some of the most challenging tasks for models are the same ones that might be trivial to hu-
man children. With modern systems’ impressive performance on tasks such as grammar correction
showing that manipulating language is not the issue, LLMs seem to face a fundamental lack of
common sense: an understanding of everyday phenomena and how they interact with each other and the
world at large. As striking gains in subjective performance on summarization, creative text genera-
tion, and apparent language understanding continue to be called into question, the development of
strong benchmarks to assess reasoning capabilities for these LLMs grows more important.
One popular approach to measuring reasoning capability is through performance on question an-
swering (QA) benchmark tasks where direct queries for information act as a straightforward exam-
ination of a system’s “understanding.” Classic QA datasets, however, are primarily concerned with
retrieving factoids to answer questions of “Who”, “What”, “When”, and “Where”. These questions
have been shown to be answerable (with high accuracy) by simple pattern-matching approaches
(Wadhwa et al., 2018), thereby limiting their ability to measure the aforementioned reasoning capa-
bility. Looking to maintain the breadth of topics covered while increasing the difficulty of the QA
task, researchers introduced multi-hop QA datasets like HotpotQA (Yang et al., 2018). While chal-
lenging, the task’s extra complexity mostly leads to unnatural questions that can be addressed with
iterated factoid retrieval and entity resolution, rather than a necessary understanding of how different
entities interact. Noticeably absent from these prior datasets are “why” questions, which prompt
not for factoids, but for explanations: reasoning made explicit.
The task of explanation uses reasoning and produces explicit, interpretable “thought” processes.
Capitalizing on these properties, this paper introduces WIKIWHY, a novel dataset containing “why”
question-answer pairs. Each WIKIWHY entry contains a rationale explaining the QA pair’s causal relation (Figure 1), summing to a total of 14,238 explanation elements. In the context of recent multimodal, self-supervised approaches aiming to capture intuitions unlearnable from text alone (Chadha & Jain, 2021), WIKIWHY presents an opportunity to investigate a specific kind of information absent in text: implicit commonsense assumptions. Compared to other QA datasets with rationales, WIKIWHY covers a significantly broader range of 11 topics, which may prove valuable for developing applied reasoning across varied, specific situations.

[Figure 1]
PASSAGE: “... Numerous plans for the Second Avenue Subway appeared throughout the 20th century, but these were usually deferred due to lack of funds ...”
QUESTION: Why were numerous plans for the Second Avenue Subway of New York City deferred throughout the 20th century?
ANSWER: Lack of funds.
REASONING: Contractors complete construction. Contractors need to be compensated.
CAUSE: Lack of funds. EFFECT: Numerous plans for the Second Avenue Subway of New York City were deferred throughout the 20th century.
Figure 1: A simple example of an entry from WIKIWHY: a cause and effect sourced from a Wikipedia passage, a “why” question and its answer about this relation, and, most importantly, a rationale that explains why the cause leads to the effect.
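To make the structure shown in Figure 1 concrete, the sketch below renders one such entry as a small Python record. The class and field names (WikiWhyEntry, passage, rationale, etc.) are illustrative assumptions for exposition, not the released dataset’s actual schema.

```python
# Illustrative sketch of one WIKIWHY-style entry (field names are assumptions,
# not the released dataset's schema). The rationale is a list of supporting
# statements connecting the cause to the effect.
from dataclasses import dataclass, field
from typing import List

@dataclass
class WikiWhyEntry:
    passage: str            # grounding Wikipedia passage
    question: str           # "why" question about the effect
    answer: str             # the cause, stated as the answer
    cause: str
    effect: str
    rationale: List[str] = field(default_factory=list)

entry = WikiWhyEntry(
    passage="Numerous plans for the Second Avenue Subway appeared throughout "
            "the 20th century, but these were usually deferred due to lack of funds.",
    question="Why were numerous plans for the Second Avenue Subway of New York "
             "City deferred throughout the 20th century?",
    answer="Lack of funds.",
    cause="Lack of funds.",
    effect="Numerous plans for the Second Avenue Subway of New York City were "
           "deferred throughout the 20th century.",
    rationale=[
        "Contractors complete construction.",
        "Contractors need to be compensated.",
    ],
)
```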
Our experiments in explanation generation and human evaluation demonstrate that state-of-the-art
generative models struggle with producing satisfying explanations for WIKIWHY cause-effect rela-
tions. Our experiments also demonstrate how our proposed task might be used to diagnose a lack of
“understanding” in certain relations. Our key contributions are thus:
• We propose explanation within cause-effect relations as a novel problem formulation for exploring LLM reasoning ability.
• We create WIKIWHY, the first question-answering dataset focusing on reasoning within causal relations, spanning 11 topics.
• We perform experiments on state-of-the-art, generative models to investigate various settings and establish baseline results with sizable room for improvement.
• We introduce idea-level evaluation metrics for free-form text (explanation) generation and a human judgment correlation analysis, demonstrating that (1) reference similarity is strongly correlated with explanation correctness, and (2) the metrics we introduced correlate with this proxy (an illustrative sketch of such an idea-level comparison follows below).
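As a rough illustration only, and not the paper’s actual metric, idea-level comparison can be pictured as matching each reference explanation step to its most similar generated step with a sentence-embedding model. The sketch below assumes the sentence-transformers library and a hypothetical idea_overlap helper.

```python
# Illustrative sketch of an "idea-level" overlap score: the fraction of
# reference ideas covered by some predicted explanation step, judged by
# embedding cosine similarity. The helper name, threshold, and model choice
# are assumptions, not WIKIWHY's released metric.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def idea_overlap(predicted_steps, reference_steps, threshold=0.6):
    """Return the fraction of reference ideas matched by a predicted step."""
    pred_emb = _model.encode(predicted_steps, convert_to_tensor=True)
    ref_emb = _model.encode(reference_steps, convert_to_tensor=True)
    sims = util.cos_sim(ref_emb, pred_emb)            # shape [num_ref, num_pred]
    covered = sims.max(dim=1).values >= threshold     # best match per reference idea
    return covered.float().mean().item()

print(idea_overlap(
    ["Contractors must be paid to build the subway."],
    ["Contractors complete construction.", "Contractors need to be compensated."],
))
```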
2 RELATED WORK
Cause and Effect. Causality has been a subject of rigorous work in various fields. In the philosophy
of science, Pearl (2009) has contributed seminal work relating to causal models, Bayesian networks,
and causal strength via interventions and counterfactuals. These ideas have even been incorporated
into QA tasks through Knowledge Graph approaches, such as filtering spurious latent correlations
(Sui et al., 2022). While our work emphasizes cause-and-effect, we are unconcerned with causal
strength as we begin with Wikipedia-grounded relations and are interested in the information en-
coded into LLMs rather than augmented structures such as knowledge graphs.
Multi-hop Question Answering. While datasets such as HotpotQA (Yang et al., 2018) and Hy-
bridQA (Chen et al., 2020) are instrumental in gauging models’ ability to handle multiple sources
and modalities, they are focused on iterated factoid retrieval. Although chaining multiple facts into
a multi-hop answer is useful in practical applications, WIKIWHY focuses on in-filling rationales to demonstrate
reasoning.
Table 1: A comparison of WIKIWHY with previous QA datasets relating to explanation
Dataset           Size    Answer Type  Explanation Type  Topics  Source
CoS-E¹            9,500   MCQ          1-step            1       ConceptNet
eQASC²            29,980  MCQ          2-step            1       WorldTree
CausalQA³         24,000  Short        None              1       Yahoo Finance
EntailmentBank⁴   1,840   Short        Tree              1       WorldTree
WIKIWHY           9,406   Short        Set/Chain         11      Wikipedia
¹ Rajani et al. (2019)  ² Jhamtani & Clark (2020)  ³ Yang et al. (2022)  ⁴ Dalvi et al. (2021)
Visual Question Answering. Vision and language tasks have also intersected with both QA
and reasoning. The Visual Question Answering (VQA) dataset (Agrawal et al., 2015) prompts
textual answers to questions about images. However, the caption-based generation leads to surface-
level questions that require little reasoning ability, and the multiple-choice output format precludes
explicit reasoning. The vision-based Sherlock dataset (Hessel et al., 2022) is much closer to our
work, focusing on abductive reasoning (working backward from a consequence). Setting aside
modality differences, WIKIWHY requires deeper reasoning with its multi-hop explanations.
Explainable QA. One previous approach to building explanation resources collects direct answers
to “why” questions. TellMeWhy (Lal et al., 2021) features question-answer pairs tied to short story
narrative contexts. The dataset skips step-wise explanations, prioritizing reading comprehension
instead. On the other hand, ELI5 (Fan et al., 2019) dives deep into reasoning with long-form,
detailed explanations. However, the open-endedness (compared to explaining a specific cause-effect
relation) complicates evaluating candidate responses.
Another line of QA work emphasizes a rationale component as support for answer predictions.
Datasets like CoS-E (Rajani et al., 2019), eQASC (Jhamtani & Clark, 2020), and EntailmentBank
(Dalvi et al., 2021) focus on explanation and reasoning much like WIKIWHY, albeit with significant
differences (Table 1). CoS-E’s explanations for CommonsenseQA (Talmor et al., 2019) mark an
important first step, but the commonsense explanations have limited depth, often requiring a sin-
gle hop of reasoning. eQASC and EntailmentBank feature richer explanations with more complex
structure, tightly focusing on grade-school-level science facts. Regarding structure, the fixed-length
rationales in CoS-E, eQASC, FEVER (Thorne et al., 2018), and e-SNLI (Camburu et al., 2018) capture
less granularity, while entailment trees accept limitations in scale and naturalness in exchange for
complete ordering information. Previous datasets tend towards retrieval tasks with eQASC’s corpus
of all rationale sentences and EntailmentBank’s collection of root causes. Retrieval enables simple
evaluation, at the cost of decreased difficulty, the possibility for exploiting spurious artifacts, and
reduced debugging opportunity.
3 BACKGROUND
3.1 WHY FOCUS ON “WHY” QUESTIONS?
“Why” questions are underrepresented in other QA datasets. Users tend to ask straightforward
questions that use words like “who”, “what”, “when”, or “where.” Questions of this more common
form have simple answers that state standalone facts, which may be elaborated on but do not require
explanation. Consider the pair “Q: Where do the Tigris and Euphrates rivers meet? A: The Persian
Gulf.” The answer is straightforward.
In contrast, a “why” QA-pair encodes a cause-effect relation. Take, for example, “Q: Why are
precipitation levels falling in the Tigris and Euphrates river basin? A: Climate Change.” This pair
encodes the causal relation “Climate change is reducing the amount of precipitation in the Tigris and
Euphrates river basin” (Figure 2). The answer to a “why”-question is an explanation itself (climate
change explains reduced precipitation), but we can take it a step further and ask “why” again to
request the understanding or intuition of this process. While there are some processes at the edge
of human understanding or taken as axioms, we assert that there are valid explanations for most
processes due to the layered nature of human understanding. This extra step is especially worth
taking since it allows WIKIWHY to not only test if a model “knows” that “climate change causes reduced precipitation” but also if it “understands” the underlying mechanics of why that is the case.

[Figure 2]
Step Sequence example (C → S1 → S2 → E). CAUSE: Climate change around the Tigris and Euphrates river basins. ELEMENT 1: Climate change increases temperature. ELEMENT 2: Higher temperatures increase the atmosphere’s water-storing capacity. EFFECT: There will be less precipitation in the Tigris and Euphrates river basins.
Rationale Set example ({C, S1, S2, S3} ⇒ E). CAUSE: The USSR mostly traded with Eastern Bloc neighbors. ELEMENT 1: Eastern Bloc countries are connected by land. ELEMENT 2: A merchant marine trades by sea. ELEMENT 3: More direct routes are preferable in trading. EFFECT: The merchant marine was not used much under Joseph Stalin.
Figure 2: Explanation topologies in WIKIWHY mainly vary between a sequence of intermediate conclusions (chain-like) and a set of rationales that combine with the original cause to entail the final effect.
3.2 TASK FORMULATION
Formally defined in §5, we propose a generative explanation task. Previous works have made
strides in assessing reasoning through multiple choice (Lu et al., 2022), retrieval (Asai et al., 2019),
and partial generation (Dalvi et al., 2021). While these works are undoubtedly crucial towards the
end goal of understanding and reasoning, their task formulations have some drawbacks. Referring
back to education, studies on human students have shown that multiple choice questions “obscure
nuance in student thinking” (Hubbard et al., 2017). Likewise, a selection decision can be correct
for retriever systems but for the wrong reasons. Augmenting multi-hop factoid questions with an
additional task of selecting the relevant supporting facts from the context passage, Inoue et al. (2020)
emphasizes that interpretability is lost in the absence of explanation. Furthermore, text generation
to combine existing ideas is arguably a different task than generating from scratch. The field of
psychology defines recall (mental retrieval of information) as a distinct process from recognition
(mental familiarity with the cue) (Mohr et al., 1989). Neural nets’ biological inspiration suggests
that there might be a similar difference between cue-aided retrieval and freeform generation. In
the context of NLP, we are interested in the implicit understandings and assumptions embedded in
LLMs and hypothesize that an entirely generative approach is most conducive to this study.
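A minimal sketch of what such a fully generative answer-and-explain query could look like is shown below. The prompt wording and the llm_generate placeholder are illustrative assumptions, not the paper’s exact prompts or API calls; any generative LLM wrapper could be substituted.

```python
# Minimal sketch of the fully generative "answer & explain" setting: given only
# the question, the model must produce both the answer and a step-by-step
# rationale. The prompt text and `llm_generate` helper are placeholders.

def build_answer_and_explain_prompt(question: str) -> str:
    return (
        f"Question: {question}\n"
        "Answer the question, then explain step by step why the cause "
        "leads to the effect described in the question.\n"
        "Answer:"
    )

def answer_and_explain(question: str, llm_generate) -> str:
    """`llm_generate` is any callable wrapping a generative LLM (e.g. GPT-3)."""
    prompt = build_answer_and_explain_prompt(question)
    return llm_generate(prompt)

# Example with a stub in place of a real model call:
print(answer_and_explain(
    "Why are precipitation levels falling in the Tigris and Euphrates river basin?",
    llm_generate=lambda p: "Climate change.\nExplanation: ...",
))
```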
3.3 EXPLANATION STRUCTURE
Explanations come in various structures, as seen in the typology defined by Ribeiro et al. (2022).
Shown in Figure 2, our work focuses on a subset of said typology. WIKIWHY includes two struc-
tures that explain cause-and-effect relations: (1) multi-hop step sequences and (2) rationale sets.
While the chain structure adds intermediate conclusions between cause and effect, rationale sets
contain elements that support the relation from without. The rationale set topology acts as our gen-
eral, catch-all case that other structures can be condensed to. Since our data collection procedure
promotes a stepwise, ordered approach, we also consider the sequential topology to respect the
structure exhibited in applicable explanations. We forgo the unstructured approach, as even limited
structure helps bring the evaluation of freeform generated text within reach. Finally, we opt against
pursuing the most complex entailment tree organization to maintain naturalness and facilitate crowd-
sourcing scalability.
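As a small illustration of these two topologies, the sketch below models the Figure 2 examples as simple Python records: an ordered list of intermediate conclusions for the chain case, and an unordered set of supporting statements for the rationale-set case. The type names are assumptions for clarity, not part of WIKIWHY’s release.

```python
# Illustrative sketch of the two explanation topologies in Figure 2
# (type names are assumptions). A chain orders intermediate conclusions
# between cause and effect; a rationale set supports the relation without
# an internal ordering.
from dataclasses import dataclass
from typing import List, Set

@dataclass
class ChainExplanation:
    cause: str
    steps: List[str]      # ordered intermediate conclusions: C -> S1 -> ... -> E
    effect: str

@dataclass
class SetExplanation:
    cause: str
    rationale: Set[str]   # unordered statements that, with the cause, entail the effect
    effect: str

chain = ChainExplanation(
    cause="Climate change around the Tigris and Euphrates river basins.",
    steps=[
        "Climate change increases temperature.",
        "Higher temperatures increase the atmosphere's water-storing capacity.",
    ],
    effect="There will be less precipitation in the Tigris and Euphrates river basins.",
)

rationale_set = SetExplanation(
    cause="The USSR mostly traded with Eastern Bloc neighbors.",
    rationale={
        "Eastern Bloc countries are connected by land.",
        "A merchant marine trades by sea.",
        "More direct routes are preferable in trading.",
    },
    effect="The merchant marine was not used much under Joseph Stalin.",
)
```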