Entailer: Answering Questions with Faithful and Truthful Chains of
Reasoning
Oyvind Tafjord, Bhavana Dalvi Mishra, Peter Clark
Allen Institute for AI, Seattle, WA
{oyvindt,bhavanad,peterc}@allenai.org
Abstract
Our goal is a question-answering (QA) system
that can show how its answers are implied by
its own internal beliefs via a systematic chain
of reasoning. Such a capability would allow
better understanding of why a model produced
the answer it did. Our approach is to recur-
sively combine a trained backward-chaining
model, capable of generating a set of premises
entailing an answer hypothesis, with a verifier
that checks that the model itself believes those
premises (and the entailment itself) through
self-querying. To our knowledge, this is the
first system to generate multistep chains that
are both faithful (the answer follows from the
reasoning) and truthful (the chain reflects the
system’s own internal beliefs). In evaluation
using two different datasets, users judge that
a majority (70%+) of generated chains clearly
show how an answer follows from a set of facts
- substantially better than a high-performance
baseline - while preserving answer accuracy.
By materializing model beliefs that systemat-
ically support an answer, new opportunities
arise for understanding the model’s system of
belief, and diagnosing and correcting its mis-
understandings when an answer is wrong.
1 Introduction
Although pretrained language models (PTLMs)
have shown remarkable question-answering (QA)
performance, it is often unclear why their answers
follow from what they know. While there has been
substantial work on training models to also gener-
ate explanations for their answers (Wiegreffe and
Marasović, 2021), or produce them via few-shot
prompting, e.g., “chains of thought” (Wei et al.,
2022), those explanations may not be faithful (the
answer does not necessarily follow from them) and
may not be truthful, in the sense that the language
model itself does not believe¹ the explanation statements
that it generated.

¹ We here adopt a simple operational definition of belief,
namely that a model believes X if it answers "yes" to the
question "Is X true?". Other definitions could also be used.

Figure 1: Given a question, Entailer searches for an
answer hypothesis that is supported by an entailment
proof. First it over-generates candidate proofs, then it
removes those that the model itself does not "believe"
(i.e., confirms via self-querying that it considers all the
generated proof elements to be true). Finally it selects
the best verified proof. Multistep proofs are generated
by iteratively backward chaining on the premises (Sec-
tion 3.2).

Rather, our goal is to gen-
erate answers that systematically follow from the
model’s own internal beliefs, materializing those
beliefs as explicit statements that can then be in-
spected. Such a capability offers new opportunities
for understanding, diagnosing, and ultimately cor-
recting errors in a language model’s behavior.
Our approach uses a combination of generation
and verification, implemented in a system called
Entailer.² Chains are constructed by backward
chaining from candidate answers, recursively us-
ing a language model (LM) trained for a single
backward-chaining step. For each step, Entailer
over-generates candidate entailments, then filters

² Entailer data and models are available at https://allenai.org/data/entailer
Q: A magnet will stick to (A) a belt buckle (B) a wooden table (C) a plastic cup (D) a paper plate
A: (A) A magnet will stick to a belt buckle because
A belt buckle is sometimes magnetic. because
Metal is sometimes magnetic.
A belt buckle is made of metal.
A magnet will stick to magnetic metals.
Q: You can make a telescope with a (A) straw (B) Glass (C) Candle (D) mailing tube
A: (D) You can make a telescope with a mailing tube. because
A telescope is made of a tube for observing / seeing.
A mailing tube is a kind of tube.
Q: Quartz may produce rainbows when light is shined (A) around the crystal’s area (B) through any of its sides (C)
in the room its in (D) in to a mirror at it
A: (B) Quartz may produce rainbows when light is shined through any of its sides. because
A rainbow is produced when light shines through a prism. because
The rainbow is made of all different colors in visible light.
A prism can split light into different colors.
A quartz is a kind of prism.
Figure 2: Questions (from the OBQA dataset) and Entailer’s answers, showing its chain of reasoning.
out those that do not conform to its own internal
knowledge (“beliefs”) by self-querying, asking it-
self whether (a) the generated premises (leaves of
the proof step) are true, and (b) each entailment step
is valid (Figure 1). It then recursively backward-
chains on premises until the overall proof confi-
dence cannot be further improved (or a depth limit d
is reached). Finally, the candidate answer sup-
ported by the highest-scoring chain of reasoning
is returned. As a result, the system has material-
ized some of its latent knowledge from which the
selected answer follows. Most significantly, the
resulting proof is thus both faithful (the answer fol-
lows from the proof) and truthful (the proof reflects
the system’s beliefs), providing a previously un-
available window into the model’s beliefs about the
world and their implications, e.g., Figure 2.
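To make this control flow concrete, here is a minimal sketch (in Python; not the authors' code) of the generate-and-verify recursion just described. The EntailmentModel interface and the method names generate_premises, direct_score, and entailment_score are assumptions standing in for the model's three angles (detailed in Section 3.2), and the pruning of unhelpful expansions is simplified to a depth limit.

from typing import Dict, List, Tuple

# Hypothetical interface standing in for the trained model's three angles
# (premise generation, direct scoring, entailment scoring); not the authors' API.
class EntailmentModel:
    def generate_premises(self, hypothesis: str, k: int) -> List[List[str]]:
        raise NotImplementedError
    def direct_score(self, hypothesis: str) -> float:
        raise NotImplementedError
    def entailment_score(self, premises: List[str], hypothesis: str) -> float:
        raise NotImplementedError

def prove(hypothesis: str, model: EntailmentModel, depth: int = 0,
          max_depth: int = 3) -> Tuple[float, Dict]:
    # A node's overall score is the higher of its direct ("fast") score and the
    # best proof ("slow") score found by backward chaining on generated premises.
    direct = model.direct_score(hypothesis)              # does the model believe H as-is?
    node: Dict = {"statement": hypothesis, "score": direct, "children": []}
    if depth >= max_depth:
        return direct, node
    best = direct
    for premises in model.generate_premises(hypothesis, k=5):   # over-generate candidates
        entail = model.entailment_score(premises, hypothesis)   # is the step P |- H valid?
        children = [prove(p, model, depth + 1, max_depth) for p in premises]
        step = entail
        for child_score, _ in children:
            step *= child_score                                  # product over premise scores
        if step > best:                                          # expansion improved this node
            best = step
            node["children"] = [tree for _, tree in children]
    node["score"] = best
    return best, node

Answer selection then amounts to calling prove on each candidate hypothesis and returning the one whose proof scores highest.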
To train the Entailer model, we use a combina-
tion of the existing EntailmentBank dataset (Dalvi
et al.,2021), plus a new crowd-annotated dataset
that we construct by bootstrapping (train an ini-
tial model, generate candidate entailment examples
with it, then annotate those examples as extra train-
ing data). The model is then frozen, and Entailer
is then applied zero-shot to new datasets, i.e., En-
tailer is treated as a general-purpose, fixed model
specialized for reasoning, rather than requiring fine-
tuning for new tasks.
We evaluate Entailer on two existing datasets,
OBQA (Mihaylov et al.,2018) and QuaRTz
(Tafjord et al.,2019). We find that its reasoning-
based QA accuracy is similar to its direct QA ac-
curacy, with the advantage that a supporting chain
of reasoning is also produced. We also perform a
human evaluation, and find that 70% of the time users
judge the chains to clearly show how an answer
followed from their premises, substantially higher
than for explanations produced by a comparable
high-performance QA system, Macaw (Tafjord and
Clark,2021). Our contributions are thus:
1. The first system to generate chains of reason-
ing showing how answers are systematically
implied by a model's own internal beliefs,
making relevant model beliefs explicit. The
chains are both faithful (the answer follows
from the reasoning) and truthful (the chain
reflects the system's own beliefs).
2. A new, crowdsourced dataset of multi-premise
entailments, doubling the amount of data
available in EntailmentBank (Dalvi et al.,
2021), and including examples of both pos-
itive and negative entailments (Entailment-
Bank only includes positive examples).³

³ The dataset is provided in the supplementary material.
2 Related Work
Systematic Reasoning:
Several recent systems
have demonstrated the ability to perform systematic
reasoning directly over natural language (Natural
Language Inference (Manning and MacCartney,
2009)), namely deriving conclusions from known
facts via step-wise application of well-defined infer-
ence operations. One approach is to retrain a black-
box model end-to-end (Clark et al.,2020), but has
been limited to small rulebases. An alternative ap-
proach, which we follow here, is to have an outside
loop around a model, where the model generates
individual inference steps (i.e., rules), and a con-
troller chains them together. SCSearch (Bostrom
et al.,2022), NLProofS (Yang et al.,2022), IRGR
(Ribeiro et al.,2022), ProofWriter (iterative ver-
sion) (Tafjord et al.,2020), and Selection-Inference
(Creswell et al.,2022) do this in a forward-chaining
fashion, MetGen (Hong et al.,2022) does this bidi-
rectionally, while Braid (Kalyanpur et al.,2020)
(like us) does this in a backward-chaining fashion.
In all these systems, the required facts were ex-
pected to be provided explicitly to the model. In
contrast, Entailer’s reasoning uses its own inter-
nal, latent knowledge, as well as (optionally) exter-
nally provided facts. LeapOfThought (Talmor et al.,
2020) demonstrated that reasoning with a combina-
tion of implicit and explicit knowledge was possi-
ble for simple 1-step inferences. We expand this for
multi-step inference, and (unlike LeapOfThought)
have the system also explicitly articulate the im-
plicit knowledge it uses, and its chain of reasoning.
Recent work has shown that generating a free-
form explanation (“chain of thought”) before an
answer can also improve performance on a variety
of tasks (Wei et al.,2022;Cobbe et al.,2021;Nye
et al.,2021). In these works, however, the expla-
nations are unstructured, and there are no claims
of faithfulness that the answer follows from the
generation, nor that the explanations themselves
represent model beliefs.
Materializing a Model’s Internal Knowledge:
Pretrained LMs contain a vast amount of knowl-
edge, and can be thought of as a kind of knowledge
base to tap into (Petroni et al.,2019). Recent work
has shown that this latent knowledge can be materi-
alized as explicit English sentences or a knowledge
graph using generative techniques, e.g., COMeT
(Bosselut et al.,2019), ParaCOMET (Gabriel et al.,
2021). Our work with Entailer similarly mate-
rializes its latent knowledge, but here in a goal-
directed way, namely by producing a faithful chain
of reasoning from facts it validates (“believes”) as
true to an answer. This articulation can be seen as
a kind of self-talk, where a self-generated context
can improve QA (Shwartz et al.,2020). However,
here our generations are not used as context for
opaque problem-solving, but are assembled into a
well-defined chain of reasoning.
Beliefs:
We refer to the model’s factual opinions
as “beliefs” rather than “knowledge” because those
opinions may be wrong. In general, an agent can
be said to believe p if it acts as if p was true
(Schwitzgebel, 2019).

Figure 3: An entailment tree is a set of multi-premise,
1-step entailments (red boxes) showing how the hypoth-
esis (root node, green) is entailed from the leaf nodes
(white). If all the leaf nodes are true, and all the 1-step
entailment relations are valid, then we say the tree is a
valid chain of reasoning for the hypothesis.

Following (Kassner et al., 2021), we take a simple, syntactic operationaliza-
tion of this, namely the agent answers “yes” to the
question “p?”, but also note that more semantic
versions could be used, e.g., the agent also answers
“yes” to paraphrases and implications of p. In gen-
eral, models can sometimes be inconsistent in their
beliefs (Elazar et al.,2021;Kassner and Schütze,
2020;Ribeiro et al.,2019). For our purposes here,
we simply note that such inconsistencies may oc-
casionally exist, and that techniques for inconsis-
tency resolution could be applied in future to re-
duce these, e.g., (Kassner et al.,2021;Li et al.,
2019).
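As a concrete illustration of this operationalization, the snippet below simply poses the yes/no question and reads off the answer. It is a sketch only: ask_model is a hypothetical stand-in for the model's QA interface, not part of Entailer.

def ask_model(question: str) -> str:
    # Placeholder: return the model's free-text answer to a yes/no question.
    raise NotImplementedError

def believes(statement: str) -> bool:
    # Syntactic operationalization of belief: the model believes X if it
    # answers "yes" to the question "Is X true?".
    answer = ask_model(f'Is "{statement}" true?')
    return answer.strip().lower().startswith("yes")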
3 Approach
Like several previous systems (Section 2), Entailer
treats reasoning as Natural Language Inference
(NLI). In NLI, the basic unit of knowledge is (rep-
resented as) a sentence rather than a structure, and
a proof⁴ is a tree of multi-step, multi-premise en-
tailments, e.g., Figures 2 and 3.
Within this framework, given a question, En-
tailer first generates candidate answers, then tries
to prove each one, selecting the answer with the
highest-scoring proof. We now describe these steps.
3.1 Hypothesis Generation
Given a question, Entailer first generates candi-
date answers and converts these into declarative
hypotheses (e.g., "Is the sky (A) blue (B) yellow"
→ {H1 = "The sky is blue.", H2 = "The sky is
yellow."}).⁵ An N-way multiple choice question
yields N hypotheses.

⁴ We use the word "proof" for convenience but note that
the term is somewhat approximate, as entailment "proofs" do
not have the guarantees of formal, deductive proofs.
⁵ Conversion of a QA pair to a declarative hypothesis D
uses a custom T5-11B model trained on the QA2D dataset
(Demszky et al., 2018).
Angle | Input | Output (example)
H → P | "H: A paperclip is made of metal. P:" | "[PREMISE] A paperclip is made of steel. [PREMISE] Steel is a metal."
H → Sd | "H: A paperclip is made of steel. V:" | 0.995
PH → Se | "H: A paperclip is made of metal. P: [PREMISE] A paperclip is made of steel. [PREMISE] Steel is a metal. I:" | 0.998

Table 1: Examples of the three input/output angles used by Entailer. The first generates a candidate entailment rule
P ⊢ H given H. The second and third score whether each premise, and the entailment itself, is valid, using tokens
V/I in the input to indicate that Sd/Se is the desired output.
Figure 4: The entailment tree is grown recursively, with the algorithm searching for the best tree (the one that
maximizes the overall score of the root node). Each node has a fixed, direct ("fast") score (in red), and (for internal
nodes) a proof ("slow") score (in blue) computed from its children. The overall node score (highlighted) is the
higher of the two. If expanding a node increases its overall score (e.g., step 3), that increase is propagated upwards
and recursion continues. If expansions cannot improve a node's score further (e.g., steps 2 and 4), the expansions
are pruned and that node becomes a leaf (red bars).
A true/false question yields 2 hypotheses. For
open-ended questions, Entailer first collects N
candidate answers generated by an external source
(Macaw (Tafjord and Clark, 2021) using nucleus
sampling (Holtzman et al., 2019)), then forms N
hypotheses from them.
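For illustration, a minimal sketch of this step is below. Here qa_to_declarative is a hypothetical wrapper around the QA2D-trained converter of footnote 5; its name and signature are assumptions, not the actual interface.

from typing import List

def qa_to_declarative(question: str, answer: str) -> str:
    # Placeholder for the trained question+answer -> declarative-sentence model.
    raise NotImplementedError

def make_hypotheses(question: str, candidate_answers: List[str]) -> List[str]:
    # One declarative hypothesis per candidate answer, each to be proved separately.
    return [qa_to_declarative(question, a) for a in candidate_answers]

# e.g., make_hypotheses("Is the sky blue or yellow?", ["blue", "yellow"])
#       would be expected to yield ["The sky is blue.", "The sky is yellow."]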
3.2 Generating Entailment Trees
3.2.1 Generating a Backward-Chaining Step
Models:
The core of Entailer is generating and validating a
single entailment step that entails a hypothesis. We
define the following data types:
H: A hypothesis (English statement) to prove.
P: A set of premises {p1, ..., pi} (sentences) that
together may entail the hypothesis H. To-
gether, P and H form a one-deep entailment
step, denoted by P ⊢ H.
Q: A question posed to Entailer.
A: A candidate answer for consideration.
C: An optional context (set of sentences) relevant
to the problem. This allows Entailer to also
use external knowledge, if available, when
generating a tree.
We train a model (details in Section 4) with the
three input/output behaviors below (optional inputs
shown in parentheses):
(QAC)H → P: Given a hypothesis H, generate a
set of premises P that may entail H.
(QAC)H → Sd: Score whether the model be-
lieves that a hypothesis H (or premise pi) is
true (Sd > 0.5) or not, i.e., perform yes/no
QA. We call this the direct score (range 0-1).
(QAC)PH → Se: Score whether the model be-
lieves a candidate entailment (P ⊢ H) is valid
(Se > 0.5) or not, i.e., P validly entails H
(range 0-1).
Examples of these three angles are in Table 1.
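To show how the three angles reduce to different inputs for a single model, the sketch below assembles prompt strings following the formats in Table 1. The helper names are illustrative assumptions, and the details of how the scores Sd/Se are decoded from the model's output are omitted.

from typing import List

def premise_generation_input(h: str) -> str:
    # H -> P angle: ask the model to generate entailing premises.
    return f"H: {h} P:"

def direct_score_input(h: str) -> str:
    # H -> Sd angle: the trailing "V:" indicates the direct (belief) score is wanted.
    return f"H: {h} V:"

def entailment_score_input(premises: List[str], h: str) -> str:
    # PH -> Se angle: the trailing "I:" indicates the entailment-validity score is wanted.
    joined = " ".join(f"[PREMISE] {p}" for p in premises)
    return f"H: {h} P: {joined} I:"

def parse_premises(generated: str) -> List[str]:
    # The H -> P output is a "[PREMISE] ..."-delimited list, as in Table 1.
    return [p.strip() for p in generated.split("[PREMISE]") if p.strip()]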
Algorithm:
To generate a single backward-chaining step we
adopt an overgenerate-and-filter approach, also
found useful elsewhere (Yang et al., 2022; Cobbe
et al., 2021; Li et al., 2022). First, given H, we use
the angle H → P to generate a set of premises P
that may entail H. We then check that the model
believes all the premises pi ∈ P using the
H(= pi) → Sd angle, and that it also believes the
inference step P ⊢ H itself is valid (independent of
whether the pi are true or not) using the PH → Se
angle. The proof score, denoting how well the 1-
step proof supports the hypothesis, is the product
of the premises' and entailment scores:

  s1-deep(H) = ( Πi sd(pi) ) · se(P ⊢ H)

We repeat this k times using nucleus sampling to
obtain a diversity of alternative proof steps, and
then select the highest-scoring one, P ⊢ H, as
illustrated in Figure 1.
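A minimal sketch of this single step follows, assuming (hypothetically) that the trained model is wrapped in an object exposing sample_premises, direct_score, and entailment_score methods; thresholds on Sd/Se and other filtering details are omitted.

from math import prod
from typing import List, Optional, Tuple

def best_single_step(model, hypothesis: str, k: int = 10
                     ) -> Optional[Tuple[List[str], float]]:
    # Overgenerate-and-filter: sample k candidate premise sets, score each with
    # the direct and entailment angles, and keep the highest-scoring one.
    best: Optional[Tuple[List[str], float]] = None
    for _ in range(k):
        premises = model.sample_premises(hypothesis)                 # H -> P, nucleus sampling
        premise_scores = [model.direct_score(p) for p in premises]   # H -> Sd per premise
        entail_score = model.entailment_score(premises, hypothesis)  # PH -> Se
        score = prod(premise_scores) * entail_score                  # s1-deep(H)
        if best is None or score > best[1]:
            best = (premises, score)
    return best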