R2F: A General Retrieval, Reading and Fusion Framework
for Document-level Natural Language Inference
Hao Wang§, Yixin Cao†∗, Yangguang Li‡, Zhen Huang§,
Kun Wang‡, Jing Shao‡
§National University of Defense Technology
†Singapore Management University  ‡SenseTime
∗Corresponding Author.
Abstract
Document-level natural language inference (DOCNLI) is a new challenging task in natural language processing, aiming at judging the entailment relationship between a pair of hypothesis and premise documents. Current datasets and baselines largely follow sentence-level settings, but fail to address the issues raised by longer documents. In this paper, we establish a general solution, named the Retrieval, Reading and Fusion (R2F) framework, and a new setting, by analyzing the main challenges of DOCNLI: interpretability, long-range dependency, and cross-sentence inference. The basic idea of the framework is to simplify the document-level task into a set of sentence-level tasks, and to improve both performance and interpretability with the power of evidence. For each hypothesis sentence, the framework retrieves evidence sentences from the premise, and reads them to estimate the sentence's credibility. The sentence-level results are then fused to judge the relationship between the documents. For the setting, we contribute complementary evidence and entailment label annotations on hypothesis sentences for interpretability study. Our experimental results show that the R2F framework obtains state-of-the-art performance and is robust to diverse evidence retrieval methods. Moreover, it gives more interpretable prediction results. Our model and code are released at https://github.com/phoenixsecularbird/R2F.
1 Introduction
Natural Language Inference (NLI) is the task of determining whether a hypothesis is entailed by a premise. While earlier works (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019; Nie et al., 2020) assume that both hypothesis and premise are single sentences, recent research pays more attention to the document-level task, namely Document-level NLI (DOCNLI) (Yin et al., 2021). The task enlarges the scope of NLI to judging the variability of semantic expression for many Natural Language Processing (NLP) tasks, e.g., alleviating exposure bias (Bengio et al., 2015; Arora et al., 2022) for text summarization (Sandhaus, 2008; Narayan et al., 2018), and recognizing human-manipulated news articles for automatic fake news detection (Jawahar et al., 2022; Huang et al., 2022).
Compared with sentence-level NLI, DOCNLI poses many new challenges, while there are only a few datasets and models. In terms of datasets, Yin et al. (2021) reformat some mainstream NLP tasks, e.g., text summarization and question answering, and build the first large-scale dataset DOCNLI with over 1 million document pairs¹. However, the dataset does not provide evidence annotation about how the labels are inferred, i.e., which hypothesis sentences lead to semantic inconsistency, or which premise sentences help to decide the entailment relationship. As shown in Figure 1, although the sample is annotated as not entailment, most of the hypothesis sentences are actually entailed. By contrast, the detailed disinformation of "in 1989" in the third hypothesis sentence eventually decides the entailment relationship between the documents. For each hypothesis sentence, only several premise sentences are enough to serve as the exact evidence to judge its own sentence-level entailment label.

In this paper, we argue that evidence discovery is important and challenging for DOCNLI. Our pilot experiments in Sections 4.3 and 4.5 show that randomly selected evidences can still contribute to comparable performance. Thus, black-box models alone may not be so convincing. However, annotating evidence for evaluation is non-trivial. For each hypothesis sentence, on one hand, it may refer to multiple premise sentences. On the other hand, there may be several evidence groups, where each group can independently serve the label prediction. We highlight this as the interpretability challenge.
¹ Yin et al. (2021) propose the task and annotate the dataset with the same name, DOCNLI.
Label: not_entailment

Hypothesis: ① US cities along the Gulf of Mexico from Alabama to eastern Texas were on alert last night as Hurricane Andrew headed west after hitting southern Florida leaving at least eight dead, causing severe property damage, and leaving 1.2 million homes without electricity. ② Gusts of up to 165 mph were recorded. ③ It is the fiercest hurricane to hit the US in 1989. ④ As Andrew moved across the Gulf there was concern that it might hit New Orleans, which would be particularly susceptible to flooding, or smash into the concentrated offshore oil facilities. ⑤ President Bush authorized federal disaster assistance for the affected areas.

Premise: ① US CITIES along the Gulf of Mexico from Alabama to eastern Texas were on storm watch last night as Hurricane Andrew headed west after sweeping across southern Florida, causing at least eight deaths and severe property damage. ③ The hurricane was one of the fiercest in the US in decades and the first to hit Miami directly in a quarter of a century. ··· ④ However, Hurricane Andrew gathered fresh strength as it moved across the Gulf of Mexico and there was concern last night that it might head towards New Orleans, which is especially low lying and could suffer severe flood damage. ··· ④ It could threaten the large concentration of offshore oil production facilities in the Gulf of Mexico. ··· ② Andrew, the first Caribbean hurricane of the season, hit the eastern coast of Florida early yesterday, gusting up to 165mph. ··· ① The Florida Power and Light company said that about 1.2m of its customers, or 32 per cent, were without power. ··· ⑤ President Bush authorized federal disaster assistance for the affected areas and made plans for an inspection tour of the state. ···

Figure 1: A sample of the DOCNLI dataset. For each sample, only the entailment label between the documents is annotated. For display, we mark each hypothesis sentence and its corresponding premise sentences (namely the evidences, not annotated in the original dataset) with the same number and color. The sample is annotated as not entailment due to the disinformation of "in 1989" in the third hypothesis sentence. The premise is partly omitted.
In terms of models, current baselines (Yin et al., 2021; Zhong et al., 2020) still largely follow sentence-level NLI. They either concatenate the two documents for mutual information interaction for classification, or encode them separately for semantic matching with document-level representations. However, besides the interpretability issue, they leave the following challenges unexplored:

Long-range Dependency Modeling  The task requires handling a pair of long documents at the same time, where we observe that 29.81% of the samples in the DOCNLI dataset (Yin et al., 2021) contain more than 500 words², while 10.47% contain more than 1000 words. This will not only exceed the input limit of Pre-trained Language Models (PLMs), but also make it far more difficult to capture long-range dependency. Necessary information interaction between the hypothesis and some key premise sentences may not be guaranteed. Besides, most contexts are uninformative for entailment inference and will only serve as noise.

Cross-sentence Inference  To judge the relationship between the documents, all hypothesis sentences must be considered, where the detailed disinformation issue still remains unsolved. Besides, the verification of one hypothesis sentence may require combining multiple and distant premise sentences, different from the sentence-pair mode of sentence-level NLI. As shown in Figure 1, to process the first hypothesis sentence, both the first and sixth premise sentences (all in red fonts) need to be taken into account.

² A word may correspond to multiple tokens for PLMs.
In this paper, we establish a general solution, named the Retrieval, Reading and Fusion (R2F) framework, and a new setting for the task. The basic idea of the framework is to simplify the document-level classification task into a set of sentence-level tasks, and then improve both performance and interpretability with the power of evidence. Specifically, the framework splits the hypothesis document into sentences. Then, for each hypothesis sentence, it retrieves evidence sentences from the premise, and reads to estimate its credibility score upon the evidences. Finally, it fuses the sentence-level results and judges the entailment relationship between the two documents. For the setting, we contribute complementary fine-grained annotations for interpretability study. For each hypothesis sentence, we manually annotate an entailment label and several evidence groups, where each group is enough to independently infer the label. In summary, our contributions are as follows:

• We propose a Retrieval, Reading and Fusion framework as a general solution for the DOCNLI task.

• We contribute complementary evidence and entailment label annotations for each hypothesis sentence on a subset of the DOCNLI dataset for interpretability study.

• Our experimental results on the DOCNLI dataset indicate that the framework obtains state-of-the-art performance. Besides, it is robust to diverse retrieval methods. Moreover, the framework can give more interpretable prediction results.

Figure 2: The R2F framework. For each hypothesis sentence, the framework first retrieves evidence sentences from the premise, and then reads to estimate the credibility upon the evidences. Finally, it fuses the sentence-level results and judges the entailment relationship between the two documents. ŷ_i and ŷ_HP are the credibility scores of the i-th hypothesis sentence and the sample, while Evi_ij is the j-th evidence of the i-th hypothesis sentence.
2 R2F Framework

Our R2F framework aims at a general solution for the DOCNLI task with interpretability, i.e., it obtains the corresponding evidence and predicts an entailment label for each hypothesis sentence. As shown in Figure 2, the framework consists of three components, namely evidence retrieval, reading for credibility estimation, and credibility fusion. For efficiency, the retrieval component is an independent unit that provides evidence input for the other two components, which are optimized jointly.
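For concreteness, the following is a minimal sketch of how the three components compose. The names retrieve_fn, read_fn, and fuse_fn are hypothetical placeholders rather than names from the released code, since the reading and fusion modules are only specified later:

    # A hedged sketch of the R2F pipeline; retrieve_fn, read_fn and fuse_fn
    # are illustrative placeholders, not names from the released code.
    import nltk  # requires the "punkt" tokenizer data: nltk.download("punkt")

    def r2f_predict(hypothesis: str, premise: str,
                    retrieve_fn, read_fn, fuse_fn) -> str:
        # Split both documents into sentences, as the framework does with NLTK.
        hyp_sents = nltk.sent_tokenize(hypothesis)
        prem_sents = nltk.sent_tokenize(premise)
        # Estimate one credibility score per hypothesis sentence upon its
        # retrieved evidence.
        creds = [read_fn(h, retrieve_fn(h, prem_sents)) for h in hyp_sents]
        # Fuse the sentence-level results into the document-level label.
        return "entailment" if fuse_fn(creds) else "not_entailment"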
2.1 Task Formulation
Similar to previous sentence-level NLI tasks, for each sample in the DOCNLI task, given a hypothesis document H and a premise document P, it is requested to judge the entailment relationship R between the two documents. Here, R ∈ {"entailment", "not_entailment"} for the DOCNLI dataset, but the task need not be restricted to binary classification.
2.2 Evidence Retrieval
Given each sample, we split the hypothesis into sentences and retrieve evidence sentences from the premise. Formally, we split the hypothesis H and the premise P into single sentences {H1, H2, ..., Hm} and {P1, P2, ..., Pn} with the NLTK tool³. Here, m and n are the numbers of sentences.
For each hypothesis sentence Hi, we respectively utilize the following retrieval methods to calculate its relevance score with all premise sentences. Then, according to the scores, we retain the top K sentences as the corresponding evidence. The value of K is a trade-off between evidence recall and precision. A lower value pursues higher evidence precision, but may lead to missing evidence, while a higher value guarantees higher evidence recall, but may introduce too many uninformative sentences as noise. Moreover, to keep and utilize contextual information, for each hypothesis sentence, we reorder the evidence sentences according to their original order in the premise.
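As a concrete illustration, a minimal sketch of this top-K selection with re-ordering follows. Here, score_fn stands for any of the relevance scorers described next, and the default k = 5 is a hypothetical value, since the paper treats K as a tunable trade-off:

    # A minimal sketch of top-K evidence retrieval with re-ordering.
    # score_fn is a pluggable relevance scorer; k = 5 is a hypothetical
    # default, not a value specified by the paper.
    def retrieve_evidence(hyp_sent: str, premise_sents: list[str],
                          score_fn, k: int = 5) -> list[str]:
        # Rank premise sentences by relevance and keep the top-K indices.
        top = sorted(range(len(premise_sents)),
                     key=lambda i: score_fn(hyp_sent, premise_sents[i]),
                     reverse=True)[:k]
        # Restore the original premise order to keep contextual information.
        return [premise_sents[i] for i in sorted(top)]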
To calculate the relevance score, we take several sparse and dense retrieval methods into consideration:
ROUGE-1  Inspired by Mao et al. (2022) and Zhang et al. (2022), we adopt ROUGE-1 retrieval. For a pair of sentences, this sparse retrieval method focuses on the n-gram match of the pair and calculates the ROUGE-1 score as the relevance metric. We take it as the main retrieval method.
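The paper does not name a specific ROUGE implementation, so the following sketch, which uses the rouge-score package and ROUGE-1 F1 as the relevance metric, is one plausible instantiation:

    # One possible ROUGE-1 scorer (the exact implementation is an
    # assumption), using the rouge-score package.
    from rouge_score import rouge_scorer

    _rouge = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

    def rouge1_score(hyp_sent: str, premise_sent: str) -> float:
        # ROUGE-1 F1 measures unigram overlap between the two sentences.
        return _rouge.score(premise_sent, hyp_sent)["rouge1"].fmeasure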
BM25⁴  BM25 is one of the strongest sparse retrieval methods. We take all premise sentences as the corpus. For a pair of sentences, BM25 involves not only the pair itself but also the whole corpus, counting term frequency and inverse document frequency to obtain the relevance score.
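With the rank_bm25 implementation cited in footnote 4, the scoring step can be sketched as follows; the lower-cased whitespace tokenization is an assumption, as the paper does not specify a tokenizer:

    # Sketch of BM25 scoring with the rank_bm25 package (footnote 4);
    # the simple lower-cased whitespace tokenization is an assumption.
    from rank_bm25 import BM25Okapi

    def bm25_scores(hyp_sent: str, premise_sents: list[str]) -> list[float]:
        # All premise sentences form the corpus for the BM25 statistics.
        bm25 = BM25Okapi([s.lower().split() for s in premise_sents])
        # get_scores returns one relevance score per corpus sentence.
        return list(bm25.get_scores(hyp_sent.lower().split()))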
SimCSE⁵  Inspired by Gao et al. (2021), we utilize SimCSE (Gao et al., 2021), a strong sentence embedding model, as the dense retrieval method for semantic matching. For a pair of sentences, we take the cosine similarity of their sentence embeddings as the relevance score.
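A sketch of the dense scorer using a public Hugging Face SimCSE checkpoint is shown below; the specific checkpoint name is an assumption, as the paper only states that the unsupervised and supervised RoBERTa-base versions are used (footnote 5):

    # A sketch of SimCSE-based dense retrieval; the checkpoint name is an
    # assumption (the paper uses the unsupervised and supervised
    # RoBERTa-base versions, footnote 5).
    import torch
    from transformers import AutoModel, AutoTokenizer

    _tok = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-roberta-base")
    _enc = AutoModel.from_pretrained("princeton-nlp/sup-simcse-roberta-base")

    def simcse_score(hyp_sent: str, premise_sent: str) -> float:
        batch = _tok([hyp_sent, premise_sent], padding=True,
                     truncation=True, return_tensors="pt")
        with torch.no_grad():
            # Following the SimCSE usage example, take the pooler output
            # as the sentence embedding.
            emb = _enc(**batch).pooler_output
        # Cosine similarity of the two embeddings is the relevance score.
        return torch.cosine_similarity(emb[0], emb[1], dim=0).item()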
Besides the above retrieval methods, we also adopt another simple but effective strategy. If a hypothesis sentence is a substring of the premise,
³ https://www.nltk.org/
⁴ We adopt the implementation from https://github.com/dorianbrown/rank_bm25.
⁵ We adopt the unsupervised and supervised versions of RoBERTa-base.