this task since they require the corresponding text,
which is missing in the KGC task.
To complete uninferable relations more accu-
rately, we draw inspiration from Open-domain
Question Answering (OpenQA) models (Guu et
al., 2020; Lewis et al., 2020) and propose a novel
KGC method based on information retrieval and
machine reading comprehension, named IR4KGC.
Specifically, the triple query (h, r, ?) is first
converted to a search query that carries its knowl-
edge semantics. Then, our model uses a pre-trained knowledge
retrieval module to retrieve documents that match
the search query and generates the final predictions
based on a generative PLM. In addition, the retrieved
documents provide additional interpretability.
Most existing OpenQA models are based
on Dense Retrieval (Karpukhin et al., 2020) or
BM25 (Robertson et al., 2009). These retrieval
modules handle natural language queries well,
but they struggle with the search queries in the
KGC task, which carry rich knowledge semantics.
To solve this problem, we construct a training cor-
pus for retrieval based on the idea of distant supervi-
sion and pre-train our knowledge retrieval module
on the KGC task. The module can thus better capture
the knowledge semantics contained in the
search query and return more relevant documents.
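As a concrete illustration of the distant-supervision idea, the sketch below pairs each known triple with documents that mention both its head and tail entities; this is our assumption of the general recipe, and the function name, data, and matching rule are invented for the example rather than taken from the paper:

```python
# Distant supervision for retrieval: a document mentioning both the head
# and the tail of a known triple (h, r, t) is treated as a positive
# retrieval target for the search query built from (h, r).
def build_retrieval_corpus(triples, docs):
    corpus = []
    for h, r, t in triples:
        positives = [d for d in docs if h in d and t in d]
        if positives:
            corpus.append({"query": f"{h} {r}", "positives": positives})
    return corpus

triples = [("Douglas Adams", "place of birth", "Cambridge")]
docs = [
    "Douglas Adams was born in Cambridge.",
    "Paris is the capital of France.",
]
corpus = build_retrieval_corpus(triples, docs)
```

Simple string containment stands in here for whatever entity-linking or matching heuristic an actual system would use.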
Experimental results on two KGC datasets show
that IR4KGC outperforms KGE models on uninfer-
able relations. In addition, combining IR4KGC
with a KGE model achieves the best performance
on all datasets.
2 Related Work
2.1 Knowledge Graph Completion
KGE models are the main components of knowl-
edge graph completion models. KGE models
can be divided into four main categories: (1)
translation-based models (Bordes et al., 2013; Sun
et al., 2019); (2) models based on tensor decompo-
sition (Nickel et al., 2011; Balažević et al., 2019);
(3) models based on neural networks (Socher et al.,
2013; Dettmers et al., 2018); (4) models that intro-
duce additional information (Lin et al., 2016; Wang
et al., 2021). As we introduced in Section 1, KGE
models struggle with uninferable relations.
Several PLM-based KGC models have been
proposed in recent years, most of which use a PLM
to determine the correctness of a given triple (Yao
et al., 2019; Lv et al., 2022) or to directly gener-
ate the predicted tail entities (Saxena et al., 2022).
The implicit knowledge stored in PLMs can help such
models complete uninferable relations. However,
these models still have drawbacks, since it is difficult
for a PLM to accurately memorize all the knowledge
in the world.
RE models can also complete knowledge from
text. However, RE aims to extract all the knowl-
edge expressed in a given text, which makes targeted
completion of a specific missing triple difficult.
Furthermore, the text required for RE is likewise
missing in our setting, making RE unsuit-
able for the KGC task in this paper.
2.2 Open-domain Question Answering
Open-domain Question Answering aims to answer
open-domain questions without a given context passage. Most of
the OpenQA models in recent years have adopted
the retrieving and reading pipeline (Chen et al.,
2017; Guu et al., 2020; Lewis et al., 2020). Specif-
ically, these models use retrieval modules such
as Dense Retrieval (Karpukhin et al., 2020) or
BM25 (Robertson et al., 2009) to retrieve rele-
vant documents, and then produce answers with
extraction- or generation-based methods. However,
these retrieval modules are hard to adapt to KGC
tasks and have low retrieval efficiency. In addition, there
are some OpenQA models based on knowledge-
guided retrieval (Min et al., 2019; Asai et al., 2019),
but they are restricted to KGs with Wikipedia links
and are thus difficult to adapt to most KGs.
3 Method
Given a triple query (h, r, ?), where h is the head
entity and r is the relation, we transform it into
a search query and retrieve relevant documents
using our retrieval module. After that, the con-
ditional generation module generates predicted an-
swers based on the documents. These two modules
are optimized jointly following Lewis et al. (2020).
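As a rough illustration of this retrieve-then-generate pipeline, the sketch below substitutes a toy lexical-overlap scorer for our trained knowledge retriever and a trivial echo function for the generative PLM; the corpus, query, and function names are invented for the example and are not the paper's implementation:

```python
from collections import Counter

# Toy corpus; the real system retrieves from a large document collection.
DOCS = [
    "Douglas Adams was born in Cambridge.",
    "Cambridge is a city in England.",
]

def retrieve(query, docs, k=1):
    # Lexical-overlap scoring stands in for the pre-trained knowledge retriever.
    q = Counter(query.lower().split())
    def score(d):
        return sum((q & Counter(d.lower().split())).values())
    return sorted(docs, key=score, reverse=True)[:k]

def generate(query, support):
    # Placeholder for the generative PLM reader: echo the top document.
    return support[0]

query = "Douglas Adams place of birth"   # search query built from (h, r, ?)
answer = generate(query, retrieve(query, DOCS))
```

The point of the structure, not the toy scorer, is what matters: the reader only ever conditions on the documents the retriever returns, which is what allows the two modules to be trained jointly.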
3.1 Knowledge-based Information Retrieval
Triple Query Transformation. For a triple query
tq = (h, r, ?), we have two functions to con-
vert it into a search query, denoted as FL and FLA.
FL(tq) = LABEL(h) ‖ LABEL(r), where LABEL(x)
is the label corresponding to x and ‖ denotes the
concatenation operation. FLA uses aliases to
increase the query diversity. Specifically,
FLA(tq) = TEXT(h) ‖ TEXT(r), where TEXT(x) has a
50% probability of being a label of x and a
50% probability of being a random alias of x.
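The two transformations can be sketched as follows; the LABELS and ALIASES tables, the entity and relation identifiers, and the use of a space for concatenation are illustrative assumptions rather than the paper's implementation:

```python
import random

# Hypothetical label/alias tables; in practice these come from the KG.
LABELS = {"Q42": "Douglas Adams", "P19": "place of birth"}
ALIASES = {"Q42": ["D. Adams"], "P19": ["born in", "birthplace"]}

def f_l(tq):
    # FL: concatenate the labels of the head entity and the relation.
    h, r, _ = tq
    return LABELS[h] + " " + LABELS[r]

def text(x, rng):
    # TEXT(x): 50% the label of x, 50% a random alias of x
    # (falls back to the label when x has no aliases).
    if rng.random() < 0.5 and ALIASES.get(x):
        return rng.choice(ALIASES[x])
    return LABELS[x]

def f_la(tq, rng):
    # FLA: same shape as FL, but sampled over labels and aliases.
    h, r, _ = tq
    return text(h, rng) + " " + text(r, rng)
```

FL is deterministic, so it always yields the same query for a triple; FLA injects the alias randomness that gives the retriever more lexically diverse training queries.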
Pre-training Method. Following the training ap-
proach of DPR (Karpukhin et al., 2020), we pre-