
troller chains them together. SCSearch (Bostrom et al., 2022), NLProofS (Yang et al., 2022), IRGR (Ribeiro et al., 2022), ProofWriter (iterative version) (Tafjord et al., 2020), and Selection-Inference (Creswell et al., 2022) do this in a forward-chaining fashion, MetGen (Hong et al., 2022) does this bidirectionally, while Braid (Kalyanpur et al., 2020) (like us) does this in a backward-chaining fashion.
In all these systems, the required facts are expected to be provided explicitly to the model. In contrast, Entailer reasons using its own internal, latent knowledge, as well as (optionally) externally provided facts. LeapOfThought (Talmor et al., 2020) demonstrated that reasoning with a combination of implicit and explicit knowledge is possible for simple 1-step inferences. We extend this to multi-step inference and, unlike LeapOfThought, have the system also explicitly articulate the implicit knowledge it uses and its chain of reasoning.
Recent work has shown that generating a free-form explanation (“chain of thought”) before an answer can also improve performance on a variety of tasks (Wei et al., 2022; Cobbe et al., 2021; Nye et al., 2021). In these works, however, the explanations are unstructured, and there is no claim of faithfulness, i.e., that the answer actually follows from the generated explanation, nor that the explanations themselves represent the model's beliefs.
Materializing a Model’s Internal Knowledge:
Pretrained LMs contain a vast amount of knowl-
edge, and can be thought of as a kind of knowledge
base to tap into (Petroni et al., 2019). Recent work
has shown that this latent knowledge can be materi-
alized as explicit English sentences or a knowledge
graph using generative techniques, e.g., COMeT
(Bosselut et al., 2019), ParaCOMET (Gabriel et al.,
2021). Our work with Entailer similarly mate-
rializes its latent knowledge, but here in a goal-
directed way, namely by producing a faithful chain
of reasoning from facts it validates (“believes”) as
true to an answer. This articulation can be seen as
a kind of self-talk, where a self-generated context
can improve QA (Shwartz et al., 2020). However,
here our generations are not used as context for
opaque problem-solving, but are assembled into a
well-defined chain of reasoning.
Figure 3: An entailment tree is a set of multi-premise, 1-step entailments (red boxes) showing how the hypothesis (root node, green) is entailed from the leaf nodes (white). If all the leaf nodes are true, and all the 1-step entailment relations are valid, then we say the tree is a valid chain of reasoning for the hypothesis.

Beliefs:
We refer to the model's factual opinions as “beliefs” rather than “knowledge” because those opinions may be wrong. In general, an agent can be said to believe p if it acts as if p was true (Schwitzgebel, 2019). Following (Kassner et al., 2021), we take a simple, syntactic operationalization of this, namely the agent answers “yes” to the
question “p?”, but also note that more semantic
versions could be used, e.g., the agent also answers
“yes” to paraphrases and implications of p. In gen-
eral, models can sometimes be inconsistent in their
beliefs (Elazar et al., 2021; Kassner and Schütze,
2020;Ribeiro et al.,2019). For our purposes here,
we simply note that such inconsistencies may oc-
casionally exist, and that techniques for inconsis-
tency resolution could be applied in future to re-
duce these, e.g., (Kassner et al., 2021; Li et al.,
2019).
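As a concrete illustration of this syntactic operationalization, the sketch below checks whether a model “believes” a statement by asking the corresponding yes/no question. The `ask_model` helper, the question template, and the answer parsing are assumptions of this sketch, not Entailer's actual components.

```python
def ask_model(prompt: str) -> str:
    """Placeholder for a call to the underlying QA model; a real system
    would run the model on `prompt` and return its answer string."""
    raise NotImplementedError

def believes(p: str) -> bool:
    """Syntactic operationalization of belief: the agent believes p
    if it answers "yes" to the question "p?"."""
    # Illustrative way of turning the declarative statement p into a question,
    # e.g. "The sky is blue." -> "Is it true that the sky is blue?"
    question = f"Is it true that {p.rstrip('.')}?"
    answer = ask_model(question)
    return answer.strip().lower().startswith("yes")
```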
3 Approach
Like several previous systems (Section 2), Entailer
treats reasoning as Natural Language Inference
(NLI). In NLI, the basic unit of knowledge is (rep-
resented as) a sentence rather than a structure, and
a proof⁴ is a tree of multi-step, multi-premise entailments, e.g., Figures 2 and 3.
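To make this proof structure concrete, here is a minimal sketch (ours, not from the paper) of an entailment tree and the validity condition stated in Figure 3: the tree is a valid chain of reasoning if every leaf is true and every multi-premise, 1-step entailment is valid. The `fact_is_true` and `entailment_is_valid` callables stand in for whatever models verify facts and entailment steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    """A node in an entailment tree: a sentence plus the premises that
    jointly entail it in one step. A leaf node has no premises."""
    sentence: str
    premises: List["Node"] = field(default_factory=list)

def is_valid_chain(node: Node,
                   fact_is_true: Callable[[str], bool],
                   entailment_is_valid: Callable[[List[str], str], bool]) -> bool:
    """A tree is a valid chain of reasoning for its root hypothesis iff
    all leaves are true and every 1-step entailment in it is valid."""
    if not node.premises:          # leaf node: check the fact itself
        return fact_is_true(node.sentence)
    premises = [p.sentence for p in node.premises]
    return (entailment_is_valid(premises, node.sentence) and
            all(is_valid_chain(p, fact_is_true, entailment_is_valid)
                for p in node.premises))
```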
Within this framework, given a question, En-
tailer first generates candidate answers, then tries
to prove each one, selecting the answer with the
highest-scoring proof. We now describe these steps.
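The control flow just described can be sketched roughly as follows; `generate_hypotheses` and `prove` are hypothetical stand-ins for the hypothesis-generation and proof-search components described below, and the proof scoring itself is defined later in the paper.

```python
from typing import Callable, Iterable, Optional, Tuple

def answer_question(
    question: str,
    generate_hypotheses: Callable[[str], Iterable[Tuple[str, str]]],
    prove: Callable[[str], Optional[float]],
) -> Optional[str]:
    """Generate one declarative hypothesis per candidate answer, try to
    prove each, and return the answer whose proof scores highest."""
    best_answer, best_score = None, float("-inf")
    for answer, hypothesis in generate_hypotheses(question):
        score = prove(hypothesis)          # None if no proof was found
        if score is not None and score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```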
3.1 Hypothesis Generation
Given a question, Entailer first generates candidate answers and converts these into declarative hypotheses (e.g., “Is the sky (A) blue (B) yellow” → {H1 = “The sky is blue.”, H2 = “The sky is yellow.”}).⁵ An N-way multiple choice question yields N hypotheses. A true/false question yields
⁴ We use the word “proof” for convenience but note that the term is somewhat approximate, as entailment “proofs” do not have the guarantees of formal, deductive proofs.
⁵ Conversion of a QA pair to a declarative hypothesis D uses a custom T5-11B model trained on the QA2D dataset (Demszky et al., 2018).
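For illustration only, the QA-to-declarative conversion described in footnote 5 could be invoked along these lines with Hugging Face Transformers; the checkpoint name and prompt format below are placeholders (the actual converter is a custom T5-11B fine-tuned on QA2D, not the stock “t5-base”), so this sketches the interface rather than the real Entailer component.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder checkpoint: the paper's converter is a custom T5-11B
# fine-tuned on QA2D; "t5-base" here only illustrates the interface.
MODEL_NAME = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def to_hypothesis(question: str, answer: str) -> str:
    """Convert a (question, candidate answer) pair into a declarative
    hypothesis, e.g. ("Is the sky blue or yellow?", "blue")
    -> "The sky is blue."  The prompt format is illustrative."""
    prompt = f"question: {question} answer: {answer}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def hypotheses_for(question: str, answers: list) -> list:
    """An N-way multiple-choice question yields N hypotheses."""
    return [to_hypothesis(question, a) for a in answers]
```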