ples of recent relevant strategies. Some more relevant work is available in (Anirudha et al., 2014a,c).
One intuitive line of work in machine comprehension uses commonsense knowledge alongside the comprehension text to generate answers. Commonsense knowledge increases the accuracy of machine comprehension systems; the challenge is to find a way to include this additional data and improve the system's performance. There are many possible sources of commonsense knowledge. Most commonly, script knowledge is used: sequences of events that describe typical human actions in everyday situations.
The work of (Lin et al., 2017a) shows how a multi-knowledge reasoning method that explores heterogeneous knowledge relationships can be powerful for commonsense MRC. The approach combines different kinds of knowledge: narrative knowledge, entity semantic knowledge, and sentiment-coherence knowledge. Using data mining techniques, they equip a model with cost-based inference rules as an encoding mechanism for knowledge, and then build a multi-knowledge reasoning model that selects which inference rule to apply for each context.
Another interesting approach comes from (Wang et al., 2017a), where the authors propose a Conditional Generative Adversarial Network (CGAN) to tackle the problem of insufficient data for reading comprehension tasks: additional synthetic sentences are produced by a generator conditioned on the given context, achieving state-of-the-art results.
The work by (Wang, 2018a) assesses how a Three-Way Attentive Network (TriAN) augmented with commonsense knowledge benefits multiple-choice reading comprehension. The combination of attention mechanisms has been shown to strongly improve reading comprehension performance. In addition, commonsense knowledge can help infer nontrivial implicit events within the comprehension passage.
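A minimal sketch of the word-level attention building block that such a three-way network applies between sequence pairs (passage-question, passage-answer, question-answer) is given below. This is not the authors' code; all tensor names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def seq_attention(seq_a, seq_b):
    """Re-represent each word of seq_a as an attention-weighted sum of seq_b.

    seq_a: (batch, len_a, dim), seq_b: (batch, len_b, dim) -> (batch, len_a, dim)
    """
    scores = torch.bmm(seq_a, seq_b.transpose(1, 2))  # (batch, len_a, len_b)
    weights = F.softmax(scores, dim=-1)                # attend over seq_b positions
    return torch.bmm(weights, seq_b)

# Illustrative usage: passage attends to question and answer, question attends to answer.
p, q, a = torch.randn(2, 30, 64), torch.randn(2, 10, 64), torch.randn(2, 5, 64)
p_q = seq_attention(p, q)   # question-aware passage representation
p_a = seq_attention(p, a)   # answer-aware passage representation
q_a = seq_attention(q, a)   # answer-aware question representation
```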
The work by (Lin et al., 2017b) focuses on reasoning with heterogeneous commonsense knowledge. They use three kinds of commonsense knowledge: causal relations; semantic relations such as co-reference and associative relations; and sentiment knowledge, i.e. sentiment coherence (positivity and negativity) between two elements. In human reasoning, not all inference rules are equally likely to be applied: the more reasonable an inference is, the more likely it is to be proposed. They therefore use attention to weigh the inferences based on the nature of the rule and the given context. Their attention mechanism models the possibility that an inference rule is applied during inference from a premise document to a hypothesis by considering the relatedness between an element and the knowledge category, as well as the relatedness between the two elements. They answer the comprehension task by summarizing over all valid inference rules, as sketched below.
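The following is an illustrative sketch, not Lin et al.'s implementation, of weighting heterogeneous inference rules with attention and summing over the valid rules; the category inventory and scoring functions are assumptions made for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RuleAttention(nn.Module):
    def __init__(self, dim, num_categories=3):  # e.g. causal / semantic / sentiment
        super().__init__()
        self.category_emb = nn.Embedding(num_categories, dim)
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, premise_elems, hypothesis_elems, categories, rule_evidence):
        # premise_elems, hypothesis_elems: (num_rules, dim) element representations
        # categories: (num_rules,) category id of each rule
        # rule_evidence: (num_rules,) contribution of each rule to the hypothesis
        pair_score = self.bilinear(premise_elems, hypothesis_elems).squeeze(-1)
        cat_score = (premise_elems * self.category_emb(categories)).sum(-1)
        weights = F.softmax(pair_score + cat_score, dim=0)  # possibility a rule is applied
        return (weights * rule_evidence).sum()              # summarize over valid rules
```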
Although Mixture of Experts models are widely used for NLP tasks, they are still underused for Machine Reading Comprehension. Recent work includes the language model of (Le et al., 2016), which introduces an LSTM-based mixture method for dynamically integrating a group of word prediction experts to obtain a conditional language model that excels simultaneously at several subtasks. Moreover, the work of (Xiong et al., 2017) includes a sparse mixture-of-experts layer for a question answering task, inherited from the earlier work of (Shazeer et al., 2017) on a Sparsely-Gated Mixture-of-Experts layer; a minimal sketch of such a layer is given below. The success of the aforementioned approaches motivates introducing Mixture of Experts deep learning for Machine Reading Comprehension.
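The sketch below shows a sparsely-gated mixture-of-experts layer in the spirit of Shazeer et al. (2017): a gating network selects the top-k experts per example and combines their outputs. The expert architecture, hyperparameters, and the dense per-example loop are simplifications assumed for illustration, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                       # x: (batch, dim)
        logits = self.gate(x)                   # (batch, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)  # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):              # only the top-k experts are evaluated
            for b in range(x.size(0)):
                e = topk_idx[b, slot].item()
                out[b] += weights[b, slot] * self.experts[e](x[b])
        return out

moe = SparseMoE(dim=64)
y = moe(torch.randn(8, 64))                     # (8, 64)
```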
3 Model description
We experiment with a mixture of experts model to tackle the task of commonsense machine comprehension. The model is inspired by an analysis of the errors made by the triple attention network of (Wang, 2018b), which achieves state-of-the-art results using a Three-Way Attentive Network (TriAN) and incorporates commonsense knowledge in the form of relational embeddings obtained from the ConceptNet knowledge graph, a large-scale graph of commonsense knowledge consisting of over 21 million edges and 8 million nodes.
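As a hedged sketch of how such relational embeddings could be built, the snippet below looks up the ConceptNet relation (if any) linking a passage word to any question or answer word and maps the relation type to a learned vector. The pre-extracted lookup table `conceptnet_relations`, the relation subset, and the embedding size are assumptions for illustration.

```python
import torch
import torch.nn as nn

RELATIONS = ["<none>", "RelatedTo", "IsA", "Causes", "UsedFor"]  # illustrative subset
rel2id = {r: i for i, r in enumerate(RELATIONS)}
rel_embedding = nn.Embedding(len(RELATIONS), 10)  # small learned relation vectors

# Hypothetical pre-extracted edges: (passage word, question/answer word) -> relation
conceptnet_relations = {("rain", "wet"): "Causes", ("umbrella", "rain"): "RelatedTo"}

def relation_features(passage_words, qa_words):
    ids = []
    for p in passage_words:
        rel = "<none>"
        for w in qa_words:
            rel = conceptnet_relations.get((p, w), rel)
        ids.append(rel2id[rel])
    return rel_embedding(torch.tensor(ids))      # (len(passage_words), 10)

print(relation_features(["rain", "fell"], ["why", "wet"]).shape)  # torch.Size([2, 10])
```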
A training example in the commonsense comprehension task consists of a passage (P), a question (Q), an answer (A), and a label y which is 0 or 1. P, Q, and A are all sequences of words. For a word P_i in the given passage, the input representation of P_i is the concatenation of several vectors: pre-trained GloVe embeddings, a part-of-speech (POS) embedding, a named-entity (NE) embedding, Con-