To this end, a variety of Graph Neural Network
(GNN) approaches have been proposed in recent
years, which explore the use of graph structures
with entities as vertices and relations as edges to
address the Multihop QA task. These approaches model the relationship between the question, context, and potential answers in a more informed way than full token attention does.
In particular, Relational Graph Convolutional Networks (RGCN; Schlichtkrull et al., 2018) have been successfully applied in a number of Multihop QA models. RGCN introduces typed edges between nodes in the underlying graph, with a convolutional update step in which each node's representation depends on its neighbours and the type(s) of relation connecting them.
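Concretely, the RGCN update sums relation-specific linear transformations of each node's neighbour features, normalised per relation, plus a self-loop term. The following is a minimal NumPy sketch under our own naming conventions, not the original implementation:

```python
import numpy as np

def rgcn_layer(h, adj_by_rel, W_rel, W_self):
    """One RGCN update (Schlichtkrull et al., 2018): each node aggregates
    neighbour features per relation type, normalised by its number of
    neighbours under that relation, plus a self-loop transform.

    h          : (num_nodes, d_in) node features
    adj_by_rel : list of (num_nodes, num_nodes) 0/1 adjacency matrices,
                 one per relation type
    W_rel      : list of (d_in, d_out) weight matrices, one per relation
    W_self     : (d_in, d_out) self-loop weight matrix
    """
    out = h @ W_self  # self-connection
    for A, W in zip(adj_by_rel, W_rel):
        # c_{i,r}: neighbour count of node i under relation r (min 1 to avoid /0)
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
        out += (A / deg) @ (h @ W)  # normalised relation-specific messages
    return np.maximum(out, 0.0)  # ReLU nonlinearity
```

The per-relation normalisation constant here is the simple neighbour count used in the original paper; isolated nodes receive only the self-loop term.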
While the general architecture of RGCN is shared by many approaches, the remaining components of each framework (node types, embeddings, relations, pre- and post-RGCN layers) vary greatly. However, a principled analysis of the impact these factors have on model performance is still missing from the current literature. This paper aims to rectify this by shedding light on the efficacy of RGCN for multihop reasoning under various conditions.
The main contributions of this paper are as fol-
lows: (i) We present a direct comparison of two
strong, related, yet significantly different RGCN-
based architectures for the Multihop QA task; (ii)
we derive a new RGCN-based architecture combin-
ing features of the two prior ones; (iii) we present a
principled analysis of the impact on model perfor-
mance across three conditions: model architecture,
node types and relations, node embeddings.
To our knowledge, this is the first attempt at a
principled comparison of RGCN-based Multihop
QA approaches.
2 Related Work
A number of graph-based approaches to Multihop
QA have been proposed in recent years. To answer
the questions presented in the WikiHop dataset (Welbl et al., 2018), De Cao et al. (2019) propose to use Relational Graph Convolutional Networks (Schlichtkrull et al., 2018) to model relations between entities in the query, documents, and candidate choices. Path-based GCN, introduced by Tang et al. (2020b) for the same task, builds on these relations but adds relations over "reasoning entities", i.e., Named Entities that co-occur with those presented in the query and
candidate entity set. Furthermore, Tu et al. (2019) have used different types of nodes and edges in the graph, which led to high performance. They employed entity-, sentence-, and document-level nodes to represent the relevant background information, connecting entities on the basis of co-reference and all node types on the basis of co-occurrence. In a similar vein, in order to answer HotpotQA (Yang et al., 2018) questions, which include
answer span as well as support sentence prediction
tasks, Fang et al. (2020) introduced Hierarchical
Graph Networks, which establish a relational hier-
archy between graph nodes on entity, sentence, and
paragraph levels, and use Graph Attention Networks (Veličković et al., 2017) for information distribution through the graph. HopRetriever (Li et al., 2020) leverages Wikipedia hyperlinks to model hops
between articles via entities and their implicit rela-
tions introduced through the link.
In contrast, some research has questioned whether a multihop approach is even necessary to solve the tasks presented in recent multihop question answering datasets: Min et al. (2019) show that a single hop is sufficient to answer 67% of the questions in HotpotQA. What is
more, Tang et al. (2020a) study the ability of state-
of-the-art models in the task of Multihop QA to
answer subquestions that compose the main ques-
tion. They show that these models often fail to
answer the intermediate steps, and suggest that
they may not actually be performing the task in a
compositional manner. Furthermore, Groeneveld
et al. (2020) provide support for the claim that the
multihop tasks can be solved to a large extent with-
out an explicit encoding of the compositionality of
the question and all the relations between knowl-
edge sources in different support documents in Hot-
potQA. Their pipeline simply predicts the support sentences separately and then extracts the answer span from them, using transformer models to encode all inputs for classification. In a similar vein,
Shao et al. (2020) show that self-attention in trans-
formers performs on par with graph structure on
the HotpotQA task, providing further evidence that
this dataset does not require explicit modeling of
multiple hops for high performance.
3 QA Graphs and Architecture
WikiHop
A number of Multihop QA datasets have been released, with the two largest, WikiHop (Welbl et al., 2018) and HotpotQA (Yang et al., 2018), most actively used for research. The precise