
Table 1: A comparison of WIKIWHY with previous QA datasets relating to explanation
Dataset          Size    Answer Type  Explanation Type  Topics  Source
CoS-E¹           9,500   MCQ          1-step            1       ConceptNet
eQASC²           9,980   MCQ          2-step            1       WorldTree
CausalQA³        24,000  Short        None              1       Yahoo Finance
EntailmentBank⁴  1,840   Short        Tree              1       WorldTree
WIKIWHY          9,406   Short        Set/Chain         11      Wikipedia
¹(Rajani et al., 2019), ²(Jhamtani & Clark, 2020), ³(Yang et al., 2022), ⁴(Dalvi et al., 2021)
Visual Question Answering. Vision and language tasks have also intersected with both QA
and reasoning. The Visual Question Answering (VQA) dataset (Agrawal et al., 2015) prompts
textual answers to questions about images. However, the caption-based generation leads to surface-
level questions that require little reasoning ability, and the multiple-choice output format precludes
explicit reasoning. The vision-based Sherlock dataset (Hessel et al., 2022) is much closer to our
work, focusing on abductive reasoning (working backward from a consequence). Setting aside
modality differences, WIKIWHY requires deeper reasoning with its multi-hop explanations.
Explainable QA. One previous approach to building explanation resources collects direct answers
to “why” questions. TellMeWhy (Lal et al., 2021) features question-answer pairs tied to short story
narrative contexts. The dataset skips step-wise explanations, prioritizing reading comprehension
instead. On the other hand, ELI5 (Fan et al., 2019) dives deep into reasoning with long-form,
detailed explanations. However, the open-endedness (compared to explaining a specific cause-effect
relation) complicates evaluating candidate responses.
Another line of QA work emphasizes a rationale component as support for answer predictions.
Datasets like CoS-E (Rajani et al., 2019), eQASC (Jhamtani & Clark, 2020), and EntailmentBank
(Dalvi et al., 2021) focus on explanation and reasoning much like WIKIWHY, albeit with significant
differences (Table 1). CoS-E’s explanations for CommonsenseQA (Talmor et al., 2019) mark an
important first step, but the commonsense explanations have limited depth, often requiring only a
single hop of reasoning. eQASC and EntailmentBank feature richer explanations with more complex
structure, but focus tightly on grade-school-level science facts. Regarding structure, the fixed-length
rationales in CoS-E, eQASC, FEVER (Thorne et al., 2018), and e-SNLI (Camburu et al., 2018) capture
less granularity, while entailment trees accept limitations in scale and naturalness in exchange for
complete ordering information. Previous datasets also tend towards retrieval tasks, as seen in eQASC’s
corpus of all rationale sentences and EntailmentBank’s collection of root causes. Retrieval enables
simple evaluation at the cost of decreased difficulty, the possibility of exploiting spurious artifacts,
and reduced debugging opportunity.
3 BACKGROUND
3.1 WHY FOCUS ON “WHY” QUESTIONS?
“Why” questions are underrepresented in other QA datasets. Users tend to ask straightforward
questions that use words like “who”, “what”, “when”, or “where.” Questions of this more common
form have simple answers: standalone facts that may be elaborated upon but do not require
explanation. Consider the pair “Q: Where do the Tigris and Euphrates rivers meet? A: The Persian
Gulf.” The answer is straightforward.
In contrast, a “why” QA pair encodes a cause-effect relation. Take, for example, “Q: Why are
precipitation levels falling in the Tigris and Euphrates river basin? A: Climate Change.” This pair
encodes the causal relation “Climate change is reducing the amount of precipitation in the Tigris and
Euphrates river basin” (Figure 2). The answer to a “why” question is itself an explanation (climate
change explains reduced precipitation), but we can take it a step further and ask “why” again to
probe for the understanding or intuition behind this process. While there are some processes at the edge
of human understanding or taken as axioms, we assert that there are valid explanations for most
processes due to the layered nature of human understanding. This extra step is especially worth