lenging problem.
We propose RESEL, a hierarchical Retrieve-and-Selection model for multi-modal and document-level SciIE. In RESEL, we pose the N-ary relation extraction problem as a question answering task over text and tables (Figure 1). RESEL then decomposes the challenging task into two simpler subtasks: (1) high-level component retrieval, which aims to locate the target paragraph/table where the final target entity resides, and (2) low-level entity extraction, which aims to select the target entity from the chosen component.
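The two-stage decomposition can be illustrated with a minimal sketch. It assumes each component (paragraph or table) is reduced to an id, its text, and a list of candidate entities. The names `Component`, `retrieve_and_select`, and `token_overlap` are hypothetical, the word-overlap scorers are toy placeholders rather than RESEL's learned models, and the document below is invented toy data. What the sketch shows is the control flow: the entity search in stage 2 is restricted to the single component chosen in stage 1.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Component:
    """A paragraph or table, reduced to an id, its text, and its candidate entities."""
    comp_id: str
    text: str
    entities: List[str] = field(default_factory=list)


def retrieve_and_select(
    query: str,
    components: List[Component],
    component_score: Callable[[str, Component], float],
    entity_score: Callable[[str, str, Component], float],
) -> Tuple[str, str]:
    # Stage 1 (high-level): choose the component most likely to hold the answer.
    best = max(components, key=lambda c: component_score(query, c))
    # Stage 2 (low-level): choose an entity, searching only inside that component.
    entity = max(best.entities, key=lambda e: entity_score(query, e, best))
    return best.comp_id, entity


def token_overlap(a: str, b: str) -> float:
    # Toy lexical scorer (placeholder for a learned model): count of shared lowercase tokens.
    return float(len(set(a.lower().split()) & set(b.lower().split())))


doc = [
    Component("para-1", "We train a BiLSTM tagger on CoNLL data", ["BiLSTM", "CoNLL"]),
    Component("table-2", "F1 results on CoNLL for BERT and BiLSTM", ["BERT 91.2", "BiLSTM 89.5"]),
]
result = retrieve_and_select(
    "F1 of BiLSTM on CoNLL",
    doc,
    component_score=lambda q, c: token_overlap(q, c.text),
    entity_score=lambda q, e, c: token_overlap(q, e),
)
print(result)  # ('table-2', 'BiLSTM 89.5')
```

The scorers are passed in as callables, so a trained retriever and extractor could be swapped in without changing the control flow.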
For high-level component (i.e., paragraph or table) retrieval, we design a feature set that combines the strengths of two classes of retrieval methods: (1) sparse retrieval (Aizawa, 2003; Robertson and Zaragoza, 2009), which represents queries and candidates as high-dimensional sparse vectors that encode lexical features; and (2) dense retrieval (Karpukhin et al., 2020), which represents queries and candidates with latent semantic embeddings. We design sparse and dense retrieval features for query-component pairs by augmenting BERT-based (Devlin et al., 2019) semantic similarities with entity-level semantic and lexical similarities. This allows an accurate high-level retriever to be trained with only a small amount of labeled data.
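A feature-based retriever of this kind can be sketched as follows, under several stated assumptions. BM25 supplies the sparse lexical feature. The short toy vectors stand in for BERT embeddings when computing dense cosine similarity. Jaccard overlap of entity sets supplies an entity-level lexical feature; entity-level semantic similarity is omitted for brevity. The features are combined by a logistic scorer whose weights (`WEIGHTS`, `BIAS`) are hypothetical hand-set values. In practice, such weights would be fit on the small labeled set. All helper names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 of one tokenized query against every tokenized component."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)  # non-negative IDF variant
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0


def component_features(q_tokens, q_vec, q_ents, comps) -> List[Dict[str, float]]:
    sparse = bm25_scores(q_tokens, [c["tokens"] for c in comps])
    return [
        {
            "bm25": s,                                     # sparse, lexical
            "dense_cos": cosine(q_vec, c["vec"]),          # dense, semantic (BERT stand-in)
            "entity_jaccard": jaccard(q_ents, c["ents"]),  # entity-level, lexical
        }
        for c, s in zip(comps, sparse)
    ]


# Hypothetical weights; in practice they would be learned from labeled query-component pairs.
WEIGHTS = {"bm25": 0.5, "dense_cos": 2.0, "entity_jaccard": 1.5}
BIAS = -2.0


def relevance(feat: Dict[str, float]) -> float:
    """Logistic score: estimated probability that the component contains the answer."""
    z = BIAS + sum(WEIGHTS[k] * v for k, v in feat.items())
    return 1.0 / (1.0 + math.exp(-z))


comps = [
    {"tokens": "we train a bilstm tagger on conll".split(),
     "vec": [0.9, 0.1, 0.0], "ents": {"BiLSTM", "CoNLL"}},
    {"tokens": "f1 of bilstm and bert on conll test set".split(),
     "vec": [0.2, 0.9, 0.3], "ents": {"BiLSTM", "BERT", "CoNLL", "F1"}},
]
feats = component_features("f1 of bilstm on conll".split(), [0.1, 1.0, 0.2],
                           {"BiLSTM", "CoNLL", "F1"}, comps)
probs = [relevance(f) for f in feats]
best = max(range(len(comps)), key=lambda i: probs[i])  # index of the top-ranked component
```

A linear model over a handful of similarity features has very few parameters. This is one plausible reason such a retriever can be trained from little labeled data, but that is our reading and not a claim from the source.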
The low-level entity extraction stage aims to infer N-ary entity relations from complex and noisy signals across paragraphs and tables. In this stage, we first build a cross-modal entity-correlation graph that encodes different entity-entity relations, such as co-occurrence, co-reference, and table structural relations. While most existing methods (Zheng et al., 2020; Zeng et al., 2020) use BERT embeddings as node representations, we find that BERT embeddings are limited in distinguishing adjacent table cells or similar entities. This issue becomes even more severe when the BERT embeddings are propagated over the graph. To address it, we design a new bag-of-neighbors (BON) representation, which computes the lexical and semantic similarities between each candidate entity and its 1-hop neighbors. We then feed the BON features into a graph attention network (GAT) to capture both neighboring semantics and structural correlations.
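The graph construction and BON features can be sketched as follows. The sketch rests on several assumptions that the source does not specify. Co-occurrence means sharing a sentence. Co-reference clusters are given as input. Table structure is captured as same-row and same-column edges. BON similarities are pooled by mean and max. `difflib.SequenceMatcher` serves as the lexical measure, and toy 2-d vectors stand in for BERT embeddings. All class and function names are hypothetical. The example shows that two adjacent cells in the same column receive different BON vectors because their row headers differ.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class Entity:
    eid: int
    text: str
    sent_id: Optional[int] = None           # sentence index, for paragraph mentions
    cell: Optional[Tuple[int, int]] = None  # (row, col), for table cells
    coref: Optional[int] = None             # co-reference cluster id, if known


def build_graph(ents: List[Entity]) -> Dict[int, Set[Tuple[int, str]]]:
    """Undirected multi-relational graph: node id -> {(neighbor id, relation)}."""
    adj: Dict[int, Set[Tuple[int, str]]] = {e.eid: set() for e in ents}
    for a, b in combinations(ents, 2):
        rels = []
        if a.sent_id is not None and a.sent_id == b.sent_id:
            rels.append("co-occurrence")
        if a.coref is not None and a.coref == b.coref:
            rels.append("co-reference")  # can link a text mention to a table cell
        if a.cell is not None and b.cell is not None:
            if a.cell[0] == b.cell[0]:
                rels.append("same-row")
            if a.cell[1] == b.cell[1]:
                rels.append("same-column")
        for r in rels:
            adj[a.eid].add((b.eid, r))
            adj[b.eid].add((a.eid, r))
    return adj


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def bon_features(eid, adj, by_id, emb) -> List[float]:
    """[mean lexical, max lexical, mean semantic, max semantic] similarity
    between an entity and its 1-hop neighbors (pooling choice is assumed)."""
    nbrs = sorted({n for n, _ in adj[eid]})
    if not nbrs:
        return [0.0, 0.0, 0.0, 0.0]
    me = by_id[eid].text.lower()
    lex = [SequenceMatcher(None, me, by_id[n].text.lower()).ratio() for n in nbrs]
    sem = [cosine(emb[eid], emb[n]) for n in nbrs]
    return [sum(lex) / len(lex), max(lex), sum(sem) / len(sem), max(sem)]


ents = [
    Entity(0, "BiLSTM", sent_id=0, coref=1),  # paragraph mention
    Entity(1, "CoNLL", sent_id=0),
    Entity(2, "BiLSTM", cell=(1, 0), coref=1),  # row header in a table
    Entity(3, "F1", cell=(0, 1)),               # column header
    Entity(4, "89.5", cell=(1, 1)),
    Entity(5, "BERT", cell=(2, 0)),
    Entity(6, "91.2", cell=(2, 1)),
]
emb = {0: [1.0, 0.0], 1: [0.6, 0.4], 2: [1.0, 0.1], 3: [0.5, 0.5],
       4: [0.9, 0.2], 5: [0.0, 1.0], 6: [0.3, 0.9]}  # toy stand-ins for BERT vectors
adj = build_graph(ents)
by_id = {e.eid: e for e in ents}
bon_a = bon_features(4, adj, by_id, emb)  # "89.5", row header BiLSTM
bon_b = bon_features(6, adj, by_id, emb)  # "91.2", row header BERT
```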
The GAT-learned features and the BERT-based embeddings are treated as two complementary views, which are co-trained with a consistency loss.
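The graph view and the consistency term can be sketched in NumPy. The sketch uses a single-head graph attention layer in the style of Veličković et al. (2018), applied to per-entity feature vectors such as BON features. The source does not state the exact form of the consistency loss. Here we assume a symmetric KL divergence between the two views' softmax distributions over candidate entities. The linear scoring head `w_head` and the random "BERT-view" logits are hypothetical placeholders. Each view's own supervised loss, which co-training would also use, is not shown.

```python
import numpy as np


def elu(x):
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))


def gat_layer(h, adj, W, a, slope=0.2):
    """Single-head graph attention layer.
    h: (n, f_in) node features, adj: (n, n) 0/1 adjacency,
    W: (f_in, f_out) projection, a: (2 * f_out,) attention vector."""
    z = h @ W
    f_out = z.shape[1]
    # e_ij = LeakyReLU(a^T [z_i || z_j]), split into a source part and a target part.
    e = (z @ a[:f_out])[:, None] + (z @ a[f_out:])[None, :]
    e = np.where(e > 0, e, slope * e)
    mask = (adj + np.eye(len(adj))) > 0          # neighbors plus a self-loop
    e = np.where(mask, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over each node's neighborhood
    return elu(alpha @ z)


def softmax(x):
    p = np.exp(x - x.max())
    return p / p.sum()


def consistency_loss(logits_graph, logits_bert, eps=1e-12):
    """Symmetric KL between the two views' distributions over candidate entities (assumed form)."""
    p, q = softmax(logits_graph), softmax(logits_bert)
    kl_pq = np.sum(p * (np.log(p + eps) - np.log(q + eps)))
    kl_qp = np.sum(q * (np.log(q + eps) - np.log(p + eps)))
    return 0.5 * (kl_pq + kl_qp)


rng = np.random.default_rng(0)
n, f_in, f_out = 5, 4, 3
h = rng.normal(size=(n, f_in))                   # e.g., 4-dim BON features per entity
adj = np.array([[0, 1, 0, 0, 1],
                [1, 0, 1, 0, 0],
                [0, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [1, 0, 0, 1, 0]], dtype=float)
W = rng.normal(size=(f_in, f_out))
a = rng.normal(size=2 * f_out)
h_graph = gat_layer(h, adj, W, a)                # graph view, shape (5, 3)
w_head = rng.normal(size=f_out)                  # hypothetical scoring head
logits_graph = h_graph @ w_head
logits_bert = rng.normal(size=n)                 # stand-in for BERT-view scores
loss = consistency_loss(logits_graph, logits_bert)
```

The loss is zero when both views assign the same distribution over candidates and grows as they disagree, so minimizing it pushes the structural and semantic views toward agreement.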
We summarize our key contributions as follows:
(1) We propose a hierarchical retrieve-and-select learning method that decomposes N-ary scientific relation extraction into two simpler subtasks; (2) for high-level component retrieval, we propose a simple but effective feature-based model that combines multi-level semantic and lexical features between queries and components; (3) for low-level entity extraction, we propose a multi-view architecture that fuses graph-based structural relations with BERT-based semantic information; and (4) extensive experiments on three datasets show the superiority of both the high-level and low-level modules in RESEL.
2 Related Work
Component Retrieval
For component retrieval, traditional sparse retrieval methods such as TF-IDF (Aizawa, 2003) and BM25 (Robertson and Zaragoza, 2009) focus on keyword-level matching but ignore entity semantics. More recently, pretrained language models have been used to represent queries and documents in a learned embedding space (Karpukhin et al., 2020), and they have been extended to handle tabular context (Herzig et al., 2021; Ma et al., 2022). However, these methods mainly focus on passage-level retrieval and do not capture fine-grained entity-level semantics well (Zhang et al., 2020; Su et al., 2021). This limitation makes them suboptimal for encoding the nuanced terms and descriptions found in scientific articles. In contrast, RESEL leverages both component- and entity-level semantic and lexical features, which help the model better understand the correlations between components and queries.
N-ary Relation Extraction
Many existing methods (Jia et al., 2019; Jain et al., 2020; Viswanathan et al., 2021) treat N-ary relation extraction as a binary classification problem and predict whether a given combination of N entities in the document is valid. However, the candidate space grows exponentially with N, and the performance of the binary classifiers can be strongly influenced by the number and quality of negative tuples.
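To make the growth concrete, suppose each of the N roles has k candidate entities. Enumerating every tuple then produces k^N candidates, and all but the correct tuple are negatives. The sketch below counts this for a hypothetical 4-ary relation with 10 candidates per role; the role and candidate names are invented placeholders. For contrast, it also counts the decisions needed when each role is filled separately, as in role-filler formulations, where the count is only the sum of the role sizes.

```python
import math
from itertools import product
from typing import List, Sequence


def count_candidate_tuples(candidates_per_role: Sequence[Sequence[str]]) -> int:
    """Number of N-ary tuples a tuple-level binary classifier would have to score."""
    return sum(1 for _ in product(*candidates_per_role))


# Hypothetical 4-ary relation with 10 candidate entities per role.
roles: List[List[str]] = [[f"role{r}_cand{i}" for i in range(10)] for r in range(4)]
n_tuples = count_candidate_tuples(roles)          # 10 ** 4 = 10,000
closed_form = math.prod(len(c) for c in roles)    # same count: product of role sizes
n_negatives = n_tuples - 1                        # if exactly one tuple is correct
per_role_decisions = sum(len(c) for c in roles)   # 40 if each role is filled separately
```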
Other methods (Du et al., 2021; Huang et al., 2021) formulate the problem as role-filler entity extraction and propose BERT-based generative models to extract the correct entity for each element of the N-ary relation. None of these methods considers N-ary relations across modalities. Lockard et al. (2020) leverage layout information to extract relations from web pages. However, layout information in scientific articles is less prominent