
R2F: A General Retrieval, Reading and Fusion Framework
for Document-level Natural Language Inference
Hao Wang§, Yixin Cao†∗, Yangguang Li‡, Zhen Huang§
Kun Wang‡, Jing Shao‡
§National University of Defense Technology
†Singapore Management University ‡SenseTime
∗Corresponding Author.
Abstract
Document-level natural language inference (DOCNLI) is a new and challenging task in natural language processing, aiming to judge the entailment relationship between a pair of hypothesis and premise documents. Current datasets and baselines largely follow sentence-level settings, but fail to address the issues raised by longer documents. In this paper, we establish a general solution, named the Retrieval, Reading and Fusion (R2F) framework, and a new setting, by analyzing the main challenges of DOCNLI: interpretability, long-range dependency, and cross-sentence inference. The basic idea of the framework is to simplify the document-level task into a set of sentence-level tasks, and to improve both performance and interpretability with the power of evidence. For each hypothesis sentence, the framework retrieves evidence sentences from the premise and reads them to estimate the sentence's credibility. The sentence-level results are then fused to judge the relationship between the documents. For the setting, we contribute complementary evidence and entailment label annotations on hypothesis sentences for interpretability study. Our experimental results show that the R2F framework obtains state-of-the-art performance and is robust to diverse evidence retrieval methods. Moreover, it gives more interpretable prediction results. Our model and code are released at https://github.com/phoenixsecularbird/R2F.
1 Introduction
Natural Language Inference (NLI) is the task of determining whether a hypothesis is entailed by a premise. While earlier works (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019; Nie et al., 2020) assume that both the hypothesis and the premise are single sentences, recent research pays more attention to the document-level task, namely Document-level NLI (DOCNLI) (Yin et al., 2021). The task enlarges the scope of NLI to judging the variability of semantic expression in many Natural Language Processing (NLP) tasks, e.g., alleviating exposure bias (Bengio et al., 2015; Arora et al., 2022) in text summarization (Sandhaus, 2008; Narayan et al., 2018), and recognizing human-manipulated news articles for automatic fake news detection (Jawahar et al., 2022; Huang et al., 2022).
Compared with sentence-level NLI, DOCNLI poses many new challenges, yet there are only a few datasets and models. In terms of datasets, Yin et al. (2021) reformat several mainstream NLP tasks, e.g., text summarization and question answering, and build the first large-scale dataset DOCNLI with over 1 million document pairs¹. However, the dataset does not provide evidence annotation about how the labels are inferred, i.e., which hypothesis sentences lead to semantic inconsistency, or which premise sentences help to decide the entailment relationship. As shown in Figure 1, although the sample is annotated as not entailment, most of the hypothesis sentences are actually entailed. By contrast, the incorrect detail "in 1989" in the third hypothesis sentence ultimately decides the entailment relationship between the documents. For each hypothesis sentence, only a few premise sentences are needed to serve as the exact evidence for judging its sentence-level entailment label.
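To make this per-sentence view concrete, the following minimal sketch walks through the retrieve-read-fuse idea in Python. Every name and heuristic in it, including the overlap-based retriever, the read callback, and the all-entailed fusion rule, is an illustrative assumption of ours rather than the released R2F implementation.

from typing import Callable, List

def retrieve_evidence(hyp: str, premise_sents: List[str], top_k: int = 3) -> List[str]:
    # Retrieval: rank premise sentences by word overlap with the hypothesis sentence (a stand-in scorer).
    hyp_words = set(hyp.lower().split())
    ranked = sorted(premise_sents,
                    key=lambda s: len(hyp_words & set(s.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def r2f_predict(premise_sents: List[str],
                hypothesis_sents: List[str],
                read: Callable[[str, str], str]) -> str:
    # Reading and fusion: label each hypothesis sentence against its own evidence, then aggregate.
    labels = []
    for hyp in hypothesis_sents:
        evidence = " ".join(retrieve_evidence(hyp, premise_sents))
        # `read` stands in for any sentence-level NLI model returning "entailment" or "not_entailment".
        labels.append(read(evidence, hyp))
    # Fusion (one simple choice): the pair is entailed only if every hypothesis sentence is entailed,
    # so a single inconsistent detail such as "in 1989" flips the document-level label.
    return "entailment" if all(label == "entailment" for label in labels) else "not_entailment"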
In this paper, we argue that evidence discovery is important and challenging for DOCNLI. Our pilot experiments in Sections 4.3 and 4.5 show that even randomly selected evidence can still contribute to comparable performance; thus, black-box models alone may not be so convincing. However, annotating evidence for evaluation is non-trivial: for each hypothesis sentence, on the one hand, it may refer to multiple premise sentences; on the other hand, there may be several evidence groups, where each group can independently support the label prediction. We highlight this as the interpretability challenge.
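As a purely hypothetical illustration of what such an annotation could look like, one hypothesis sentence might carry several independent evidence groups; the field names and indices below are ours and do not reflect the released annotation format.

# Hypothetical annotation for a single hypothesis sentence; indices point to premise sentences.
# Each inner list is one evidence group, and any single group alone suffices to decide the label.
annotation = {
    "hypothesis_sentence": 2,            # e.g., the sentence containing "in 1989" in Figure 1
    "label": "not_entailment",
    "evidence_groups": [[0, 4], [7]],    # two independent groups of premise sentence indices
}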
¹Yin et al. (2021) propose the task and annotate the dataset with the same name DOCNLI.