also obtain robust performance on the original dataset. We formalize a causal graph to reflect the causal relationships between the question (Q), the contexts, and the answer (Y). To evaluate disconnected reasoning, the contexts are further divided into two subsets: S, a single supporting fact, and C, the remaining supporting facts. Hence, we can formulate disconnected reasoning as the two natural direct causal effects of (Q, S) and (Q, C) on Y, as shown in Fig. 1. With the proposed causal graph, we can alleviate disconnected reasoning by disentangling these two natural direct effects and the true multi-hop reasoning from the total causal effect. We propose a novel counterfactual multi-hop QA model to perform this disentanglement.
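In standard counterfactual notation (assumed here for illustration; the paper's own definitions may differ in detail), the decomposition described above can be sketched as follows, where starred values denote counterfactual reference inputs:

\begin{align}
\mathrm{TE} &= Y_{q,s,c} - Y_{q^*,s^*,c^*},\\
\mathrm{NDE}_{Q,S} &= Y_{q,s,c^*} - Y_{q^*,s^*,c^*},\\
\mathrm{NDE}_{Q,C} &= Y_{q,s^*,c} - Y_{q^*,s^*,c^*}.
\end{align}

Here $Y_{q,s,c}$ is the answer outcome under the factual inputs, and each natural direct effect holds one context subset at its reference value. The true multi-hop reasoning effect is then what remains of the total effect $\mathrm{TE}$ after the two natural direct effects are removed.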
We use the probing dataset and the DiRe metric proposed by Trivedi et al. (2020) to measure how much the proposed multi-hop QA model reduces disconnected reasoning. Experimental results show that our approach substantially decreases disconnected reasoning while maintaining strong performance on the original test set, indicating that it reduces disconnected reasoning and improves true multi-hop reasoning capability.
The main contributions of this paper are threefold. First, our counterfactual multi-hop QA model formulates disconnected reasoning as two direct causal effects on the answer, which offers a new perspective and technique for learning true multi-hop reasoning. Second, our approach achieves notable improvements in reducing disconnected reasoning compared to various state-of-the-art methods. Third, our causal-effect approach is model-agnostic and can be used to reduce disconnected reasoning in many multi-hop QA architectures.
2 Related Work
Multi-hop question answering (QA) requires a model to retrieve the supporting facts needed to predict the answer. Many approaches and datasets have been proposed to train QA systems. For example, HotpotQA (Yang et al., 2018) is a widely used multi-hop QA dataset, which provides a fullwiki setting (Das et al., 2019; Nie et al., 2019; Qi et al., 2019; Chen et al., 2019; Li et al., 2021; Xiong et al., 2020) and a distractor setting (Min et al., 2019b; Nishida et al., 2019; Qiu et al., 2019; Jiang and Bansal, 2019; Trivedi et al., 2020).
In the fullwiki setting, a model first retrieves relevant facts from all Wikipedia articles and then completes the multi-hop QA task with the retrieved facts. The retrieval model is therefore crucial in this setting. For instance, SMRS (Nie et al., 2019) and DPR (Karpukhin et al., 2020) exploit the implicit importance of retrieving relevant information in the semantic space. Entity-centric (Das et al., 2019), CogQA (Ding et al., 2019), and Golden Retriever (Qi et al., 2019) explicitly use entities mentioned in, or reformulated from, the query keywords to retrieve the next-hop document. Furthermore, PathRetriever (Asai et al., 2019) and HopRetriever (Li et al., 2021) iteratively select documents to form a paragraph-level reasoning path using an RNN. MDPR (Xiong et al., 2020) repeatedly retrieves passages using only dense query vectors. However, these methods rarely address the QA model's disconnected reasoning problem.
In the distractor setting, ten paragraphs are given: two gold paragraphs and eight distractors. Many methods have been proposed to strengthen the model's multi-hop reasoning capability, using graph neural networks (Qiu et al., 2019; Fang et al., 2019; Shao et al., 2020), adversarial or counterfactual examples (Jiang and Bansal, 2019; Lee et al., 2021), the sufficiency of the supporting evidence (Trivedi et al., 2020), or pretrained language models (Zhao et al., 2020; Zaheer et al., 2020).
However, Min et al. (2019a) demonstrated that many compositional questions in HotpotQA can be answered with a single hop. This means that QA models can take shortcuts instead of performing multi-hop reasoning to produce the correct answer. To mitigate this issue, Jiang and Bansal (2019) added adversarial examples as hard distractors during training. Recently, Trivedi et al. (2020) proposed an approach, DiRe, to measure a model's disconnected reasoning behavior, and used supporting-sufficiency labels to reduce disconnected reasoning. Lee et al. (2021) selected supporting evidence according to each sentence's causal contribution to the predicted answer, which guarantees the explainability of the model's behavior. However, the original performance also drops when reducing disconnected reasoning.
Causal Inference. Recently, causal inference (Pearl and Mackenzie, 2018; Pearl, 2022) has been applied to many natural language processing tasks, showing promising results while providing strong interpretability and generalizability. Representative works include counterfactual interven-