Entailer: Answering Questions with Faithful and Truthful Chains of
Reasoning
Oyvind Tafjord, Bhavana Dalvi Mishra, Peter Clark
Allen Institute for AI, Seattle, WA
{oyvindt,bhavanad,peterc}@allenai.org
Abstract
Our goal is a question-answering (QA) system
that can show how its answers are implied by
its own internal beliefs via a systematic chain
of reasoning. Such a capability would allow
better understanding of why a model produced
the answer it did. Our approach is to recur-
sively combine a trained backward-chaining
model, capable of generating a set of premises
entailing an answer hypothesis, with a verifier
that checks that the model itself believes those
premises (and the entailment itself) through
self-querying. To our knowledge, this is the
first system to generate multistep chains that
are both faithful (the answer follows from the
reasoning) and truthful (the chain reflects the
system’s own internal beliefs). In evaluation
using two different datasets, users judge that
a majority (70%+) of generated chains clearly
show how an answer follows from a set of facts
- substantially better than a high-performance
baseline - while preserving answer accuracy.
By materializing model beliefs that systemat-
ically support an answer, new opportunities
arise for understanding the model’s system of
belief, and diagnosing and correcting its mis-
understandings when an answer is wrong.
1 Introduction
Although pretrained language models (PTLMs)
have shown remarkable question-answering (QA)
performance, it is often unclear why their answers
follow from what they know. While there has been
substantial work on training models to also gener-
ate explanations for their answers (Wiegreffe and
Marasović, 2021), or produce them via few-shot
prompting, e.g., “chains of thought” (Wei et al.,
2022), those explanations may not be faithful (the
answer does not necessarily follow from them) and
may not be truthful, in the sense that the language
model itself does not believe¹ the explanation statements
that it generated.

¹ We here adopt a simple operational definition of belief,
namely that a model believes X if it answers "yes" to the
question "Is X true?". Other definitions could also be used.

Figure 1: Given a question, Entailer searches for an
answer hypothesis that is supported by an entailment
proof. First it over-generates candidate proofs, then it
removes those that the model itself does not "believe"
(i.e., confirms via self-querying that it considers all the
generated proof elements to be true). Finally it selects
the best verified proof. Multistep proofs are generated
by iteratively backward chaining on the premises (Sec-
tion 3.2).

Rather, our goal is to gen-
erate answers that systematically follow from the
model’s own internal beliefs, materializing those
beliefs as explicit statements that can then be in-
spected. Such a capability offers new opportunities
for understanding, diagnosing, and ultimately cor-
recting errors in a language model’s behavior.
Our approach uses a combination of generation
and verification, implemented in a system called
Entailer.² Chains are constructed by backward
chaining from candidate answers, recursively us-
ing a language model (LM) trained for a single
backward-chaining step. For each step, Entailer
over-generates candidate entailments, then filters

² Entailer data and models are available at https://allenai.org/data/entailer
Q: A magnet will stick to (A) a belt buckle (B) a wooden table (C) a plastic cup (D) a paper plate
A: (A) A magnet will stick to a belt buckle because
A belt buckle is sometimes magnetic. because
Metal is sometimes magnetic.
A belt buckle is made of metal.
A magnet will stick to magnetic metals.
Q: You can make a telescope with a (A) straw (B) Glass (C) Candle (D) mailing tube
A: (D) You can make a telescope with a mailing tube. because
A telescope is made of a tube for observing / seeing.
A mailing tube is a kind of tube.
Q: Quartz may produce rainbows when light is shined (A) around the crystal’s area (B) through any of its sides (C)
in the room its in (D) in to a mirror at it
A: (B) Quartz may produce rainbows when light is shined through any of its sides. because
A rainbow is produced when light shines through a prism. because
The rainbow is made of all different colors in visible light.
A prism can split light into different colors.
A quartz is a kind of prism.
Figure 2: Questions (from the OBQA dataset) and Entailer’s answers, showing its chain of reasoning.
out those that do not conform to its own internal
knowledge (“beliefs”) by self-querying, asking it-
self whether (a) the generated premises (leaves of
the proof step) are true, and (b) each entailment step
is valid (Figure 1). It then recursively backward-
chains on premises until the overall proof confi-
dence cannot be further improved (or a depth limit d
is reached). Finally, the candidate answer sup-
ported by the highest-scoring chain of reasoning
is returned. As a result, the system has material-
ized some of its latent knowledge from which the
selected answer follows. Most significantly, the
resulting proof is thus both faithful (the answer fol-
lows from the proof) and truthful (the proof reflects
the system’s beliefs), providing a previously un-
available window into the model’s beliefs about the
world and their implications, e.g., Figure 2.
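To make this control flow concrete, here is a minimal sketch (in Python; not the authors' code) of the generate-and-verify recursion just described. The EntailmentModel interface and the method names generate_premises, direct_score, and entailment_score are assumptions standing in for the model's three angles (detailed in Section 3.2), and the pruning of unhelpful expansions is simplified to a depth limit.

from typing import Dict, List, Tuple

# Hypothetical interface standing in for the trained model's three angles
# (premise generation, direct scoring, entailment scoring); not the authors' API.
class EntailmentModel:
    def generate_premises(self, hypothesis: str, k: int) -> List[List[str]]:
        raise NotImplementedError
    def direct_score(self, hypothesis: str) -> float:
        raise NotImplementedError
    def entailment_score(self, premises: List[str], hypothesis: str) -> float:
        raise NotImplementedError

def prove(hypothesis: str, model: EntailmentModel, depth: int = 0,
          max_depth: int = 3) -> Tuple[float, Dict]:
    # A node's overall score is the higher of its direct ("fast") score and the
    # best proof ("slow") score found by backward chaining on generated premises.
    direct = model.direct_score(hypothesis)              # does the model believe H as-is?
    node: Dict = {"statement": hypothesis, "score": direct, "children": []}
    if depth >= max_depth:
        return direct, node
    best = direct
    for premises in model.generate_premises(hypothesis, k=5):   # over-generate candidates
        entail = model.entailment_score(premises, hypothesis)   # is the step P |- H valid?
        children = [prove(p, model, depth + 1, max_depth) for p in premises]
        step = entail
        for child_score, _ in children:
            step *= child_score                                  # product over premise scores
        if step > best:                                          # expansion improved this node
            best = step
            node["children"] = [tree for _, tree in children]
    node["score"] = best
    return best, node

Answer selection then amounts to calling prove on each candidate hypothesis and returning the one whose proof scores highest.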
To train the Entailer model, we use a combina-
tion of the existing EntailmentBank dataset (Dalvi
et al.,2021), plus a new crowd-annotated dataset
that we construct by bootstrapping (train an ini-
tial model, generate candidate entailment examples
with it, then annotate those examples as extra train-
ing data). The model is then frozen, and Entailer
is then applied zero-shot to new datasets, i.e., En-
tailer is treated as a general-purpose, fixed model
specialized for reasoning, rather than requiring fine-
tuning for new tasks.
We evaluate Entailer on two existing datasets,
OBQA (Mihaylov et al.,2018) and QuaRTz
(Tafjord et al.,2019). We find that its reasoning-
based QA accuracy is similar to its direct QA ac-
curacy, with the advantage that a supporting chain
of reasoning is also produced. We also perform a
human evaluation, and find that 70% of the time users
judge the chains to clearly show how an answer
followed from their premises, substantially higher
than for explanations produced by a comparable
high-performance QA system, Macaw (Tafjord and
Clark,2021). Our contributions are thus:
1. The first system to generate chains of reason-
ing showing how answers are systematically
implied by a model's own internal beliefs,
making relevant model beliefs explicit. The
chains are both faithful (the answer follows
from the reasoning) and truthful (the chain
reflects the system's own beliefs).
2. A new, crowdsourced dataset of multi-premise
entailments, doubling the amount of data
available in EntailmentBank (Dalvi et al.,
2021), and including examples of both pos-
itive and negative entailments (Entailment-
Bank only includes positive examples).³

³ The dataset is provided in the supplementary material.
2 Related Work
Systematic Reasoning:
Several recent systems
have demonstrated the ability to perform systematic
reasoning directly over natural language (Natural
Language Inference (Manning and MacCartney,
2009)), namely deriving conclusions from known
facts via step-wise application of well-defined infer-
ence operations. One approach is to retrain a black-
box model end-to-end (Clark et al.,2020), but has
been limited to small rulebases. An alternative ap-
proach, which we follow here, is to have an outside
loop around a model, where the model generates
individual inference steps (i.e., rules), and a con-
troller chains them together. SCSearch (Bostrom
et al.,2022), NLProofS (Yang et al.,2022), IRGR
(Ribeiro et al.,2022), ProofWriter (iterative ver-
sion) (Tafjord et al.,2020), and Selection-Inference
(Creswell et al.,2022) do this in a forward-chaining
fashion, MetGen (Hong et al.,2022) does this bidi-
rectionally, while Braid (Kalyanpur et al.,2020)
(like us) does this in a backward-chaining fashion.
In all these systems, the required facts were ex-
pected to be provided explicitly to the model. In
contrast, Entailer’s reasoning uses its own inter-
nal, latent knowledge, as well as (optionally) exter-
nally provided facts. LeapOfThought (Talmor et al.,
2020) demonstrated that reasoning with a combina-
tion of implicit and explicit knowledge was possi-
ble for simple 1-step inferences. We expand this for
multi-step inference, and (unlike LeapOfThought)
have the system also explicitly articulate the im-
plicit knowledge it uses, and its chain of reasoning.
Recent work has shown that generating a free-
form explanation (“chain of thought”) before an
answer can also improve performance on a variety
of tasks (Wei et al.,2022;Cobbe et al.,2021;Nye
et al.,2021). In these works, however, the expla-
nations are unstructured, and there are no claims
of faithfulness that the answer follows from the
generation, nor that the explanations themselves
represent model beliefs.
Materializing a Model’s Internal Knowledge:
Pretrained LMs contain a vast amount of knowl-
edge, and can be thought of as a kind of knowledge
base to tap into (Petroni et al.,2019). Recent work
has shown that this latent knowledge can be materi-
alized as explicit English sentences or a knowledge
graph using generative techniques, e.g., COMeT
(Bosselut et al.,2019), ParaCOMET (Gabriel et al.,
2021). Our work with Entailer similarly mate-
rializes its latent knowledge, but here in a goal-
directed way, namely by producing a faithful chain
of reasoning from facts it validates (“believes”) as
true to an answer. This articulation can be seen as
a kind of self-talk, where a self-generated context
can improve QA (Shwartz et al.,2020). However,
here our generations are not used as context for
opaque problem-solving, but are assembled into a
well-defined chain of reasoning.
Beliefs:
We refer to the model’s factual opinions
as “beliefs” rather than “knowledge” because those
opinions may be wrong. In general, an agent can
be said to believe p if it acts as if p was true
(Schwitzgebel, 2019).

Figure 3: An entailment tree is a set of multi-premise,
1-step entailments (red boxes) showing how the hypoth-
esis (root node, green) is entailed from the leaf nodes
(white). If all the leaf nodes are true, and all the 1-step
entailment relations are valid, then we say the tree is a
valid chain of reasoning for the hypothesis.

Following (Kassner et al., 2021), we take a simple, syntactic operationaliza-
tion of this, namely the agent answers “yes” to the
question “p?”, but also note that more semantic
versions could be used, e.g., the agent also answers
“yes” to paraphrases and implications of p. In gen-
eral, models can sometimes be inconsistent in their
beliefs (Elazar et al.,2021;Kassner and Schütze,
2020;Ribeiro et al.,2019). For our purposes here,
we simply note that such inconsistencies may oc-
casionally exist, and that techniques for inconsis-
tency resolution could be applied in future to re-
duce these, e.g., (Kassner et al.,2021;Li et al.,
2019).
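As a concrete illustration of this operationalization, the snippet below simply poses the yes/no question and reads off the answer. It is a sketch only: ask_model is a hypothetical stand-in for the model's QA interface, not part of Entailer.

def ask_model(question: str) -> str:
    # Placeholder: return the model's free-text answer to a yes/no question.
    raise NotImplementedError

def believes(statement: str) -> bool:
    # Syntactic operationalization of belief: the model believes X if it
    # answers "yes" to the question "Is X true?".
    answer = ask_model(f'Is "{statement}" true?')
    return answer.strip().lower().startswith("yes")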
3 Approach
Like several previous systems (Section 2), Entailer
treats reasoning as Natural Language Inference
(NLI). In NLI, the basic unit of knowledge is (rep-
resented as) a sentence rather than a structure, and
a proof⁴ is a tree of multi-step, multi-premise en-
tailments, e.g., Figures 2 and 3.
Within this framework, given a question, En-
tailer first generates candidate answers, then tries
to prove each one, selecting the answer with the
highest-scoring proof. We now describe these steps.
3.1 Hypothesis Generation
Given a question, Entailer first generates candi-
date answers and converts these into declarative
hypotheses (e.g., "Is the sky (A) blue (B) yellow"
→ {H1 = "The sky is blue.", H2 = "The sky is
yellow."}).⁵ An N-way multiple choice question
yields N hypotheses.

⁴ We use the word "proof" for convenience but note that
the term is somewhat approximate, as entailment "proofs" do
not have the guarantees of formal, deductive proofs.
⁵ Conversion of a QA pair to a declarative hypothesis D
uses a custom T5-11B model trained on the QA2D dataset
(Demszky et al., 2018).
Angle | Input | Output (example)
H → P | "H: A paperclip is made of metal. P:" | "[PREMISE] A paperclip is made of steel. [PREMISE] Steel is a metal."
H → Sd | "H: A paperclip is made of steel. V:" | 0.995
PH → Se | "H: A paperclip is made of metal. P: [PREMISE] A paperclip is made of steel. [PREMISE] Steel is a metal. I:" | 0.998

Table 1: Examples of the three input/output angles used by Entailer. The first generates a candidate entailment rule
P ⊢ H given H. The second and third score whether each premise, and the entailment itself, is valid, using tokens
V/I in the input to indicate that Sd/Se is the desired output.
Figure 4: The entailment tree is grown recursively, with the algorithm searching for the best tree (the one that
maximizes the overall score of the root node). Each node has a fixed, direct ("fast") score (in red), and (for internal
nodes) a proof ("slow") score (in blue) computed from its children. The overall node score (highlighted) is the
higher of the two. If expanding a node increases its overall score (e.g., step 3), that increase is propagated upwards
and recursion continues. If expansions cannot improve a node's score further (e.g., steps 2 and 4), the expansions
are pruned and that node becomes a leaf (red bars).
A true/false question yields 2 hypotheses. For
open-ended questions, Entailer first collects N
candidate answers generated by an external source
(Macaw (Tafjord and Clark, 2021) using nucleus
sampling (Holtzman et al., 2019)), then forms N
hypotheses from them.
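For illustration, a minimal sketch of this step is below. Here qa_to_declarative is a hypothetical wrapper around the QA2D-trained converter of footnote 5; its name and signature are assumptions, not the actual interface.

from typing import List

def qa_to_declarative(question: str, answer: str) -> str:
    # Placeholder for the trained question+answer -> declarative-sentence model.
    raise NotImplementedError

def make_hypotheses(question: str, candidate_answers: List[str]) -> List[str]:
    # One declarative hypothesis per candidate answer, each to be proved separately.
    return [qa_to_declarative(question, a) for a in candidate_answers]

# e.g., make_hypotheses("Is the sky blue or yellow?", ["blue", "yellow"])
#       would be expected to yield ["The sky is blue.", "The sky is yellow."]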
3.2 Generating Entailment Trees
3.2.1 Generating a Backward-Chaining Step
Models:
The core of Entailer is generating and validating a
single entailment step that entails a hypothesis. We
define the following data types:
H: A hypothesis (English statement) to prove.
P: A set of premises {p1, ..., pi} (sentences) that
together may entail the hypothesis H. To-
gether, P and H form a one-deep entailment
step, denoted by P ⊢ H.
Q: A question posed to Entailer.
A: A candidate answer for consideration.
C: An optional context (set of sentences) relevant
to the problem. This allows Entailer to also
use external knowledge, if available, when
generating a tree.
We train a model (details in Section 4) with the
three input/output behaviors below (optional inputs
shown in parentheses):
(QAC)H → P: Given a hypothesis H, generate a
set of premises P that may entail H.
(QAC)H → Sd: Score whether the model be-
lieves that a hypothesis H (or premise pi) is
true (Sd > 0.5) or not, i.e., perform yes/no
QA. We call this the direct score (range 0-1).
(QAC)PH → Se: Score whether the model be-
lieves a candidate entailment (P ⊢ H) is valid
(Se > 0.5) or not, i.e., P validly entails H
(range 0-1).
Examples of these three angles are in Table 1.
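To show how the three angles reduce to different inputs for a single model, the sketch below assembles prompt strings following the formats in Table 1. The helper names are illustrative assumptions, and the details of how the scores Sd/Se are decoded from the model's output are omitted.

from typing import List

def premise_generation_input(h: str) -> str:
    # H -> P angle: ask the model to generate entailing premises.
    return f"H: {h} P:"

def direct_score_input(h: str) -> str:
    # H -> Sd angle: the trailing "V:" indicates the direct (belief) score is wanted.
    return f"H: {h} V:"

def entailment_score_input(premises: List[str], h: str) -> str:
    # PH -> Se angle: the trailing "I:" indicates the entailment-validity score is wanted.
    joined = " ".join(f"[PREMISE] {p}" for p in premises)
    return f"H: {h} P: {joined} I:"

def parse_premises(generated: str) -> List[str]:
    # The H -> P output is a "[PREMISE] ..."-delimited list, as in Table 1.
    return [p.strip() for p in generated.split("[PREMISE]") if p.strip()]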
Algorithm:
To generate a single backward-chaining step we
adopt an overgenerate-and-filter approach, also
found useful elsewhere (Yang et al., 2022; Cobbe
et al., 2021; Li et al., 2022). First, given H, we use
the angle H → P to generate a set of premises P
that may entail H. We then check that the model
believes all the premises pi ∈ P using the
H(= pi) → Sd angle, and that it also believes the
inference step P ⊢ H itself is valid (independent of
whether the pi are true or not) using the PH → Se
angle. The proof score, denoting how well the 1-
step proof supports the hypothesis, is the product
of the premises' and entailment scores:

  s1-deep(H) = ( Πi sd(pi) ) · se(P ⊢ H)

We repeat this k times using nucleus sampling to
obtain a diversity of alternative proof steps, and
then select the highest-scoring one, P ⊢ H, as
illustrated in Figure 1.
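A minimal sketch of this single step follows, assuming (hypothetically) that the trained model is wrapped in an object exposing sample_premises, direct_score, and entailment_score methods; thresholds on Sd/Se and other filtering details are omitted.

from math import prod
from typing import List, Optional, Tuple

def best_single_step(model, hypothesis: str, k: int = 10
                     ) -> Optional[Tuple[List[str], float]]:
    # Overgenerate-and-filter: sample k candidate premise sets, score each with
    # the direct and entailment angles, and keep the highest-scoring one.
    best: Optional[Tuple[List[str], float]] = None
    for _ in range(k):
        premises = model.sample_premises(hypothesis)                 # H -> P, nucleus sampling
        premise_scores = [model.direct_score(p) for p in premises]   # H -> Sd per premise
        entail_score = model.entailment_score(premises, hypothesis)  # PH -> Se
        score = prod(premise_scores) * entail_score                  # s1-deep(H)
        if best is None or score > best[1]:
            best = (premises, score)
    return best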