R2F: A General Retrieval, Reading and Fusion Framework
for Document-level Natural Language Inference
Hao Wang§, Yixin Cao†∗, Yangguang Li‡, Zhen Huang§,
Kun Wang‡, Jing Shao‡
§National University of Defense Technology
†Singapore Management University  ‡SenseTime
∗Corresponding Author.
Abstract
Document-level natural language inference (DOCNLI) is a new challenging task in natural language processing, aiming at judging the entailment relationship between a pair of hypothesis and premise documents. Current datasets and baselines largely follow sentence-level settings, but fail to address the issues raised by longer documents. In this paper, we establish a general solution, named the Retrieval, Reading and Fusion (R2F) framework, and a new setting, by analyzing the main challenges of DOCNLI: interpretability, long-range dependency, and cross-sentence inference. The basic idea of the framework is to simplify the document-level task into a set of sentence-level tasks, and to improve both performance and interpretability with the power of evidence. For each hypothesis sentence, the framework retrieves evidence sentences from the premise, and reads them to estimate the sentence's credibility. The sentence-level results are then fused to judge the relationship between the documents. For the setting, we contribute complementary evidence and entailment label annotations on hypothesis sentences for interpretability study. Our experimental results show that the R2F framework obtains state-of-the-art performance and is robust to diverse evidence retrieval methods. Moreover, it gives more interpretable prediction results. Our model and code are released at https://github.com/phoenixsecularbird/R2F.
1 Introduction
Natural Language Inference (NLI) is the task of determining whether a hypothesis is entailed by a premise. While earlier works (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019; Nie et al., 2020) assume that both hypothesis and premise are single sentences, recent research pays more attention to the document-level task, namely Document-level NLI (DOCNLI) (Yin et al., 2021). The task enlarges the scope of NLI to judging the variability of semantic expression for many Natural Language Processing (NLP) tasks, e.g., alleviating exposure bias (Bengio et al., 2015; Arora et al., 2022) for text summarization (Sandhaus, 2008; Narayan et al., 2018), and recognizing human-manipulated news articles for automatic fake news detection (Jawahar et al., 2022; Huang et al., 2022).
Compared with sentence-level NLI, DOCNLI poses many new challenges, while there are only a few datasets and models. In terms of datasets, Yin et al. (2021) reformat some mainstream NLP tasks, e.g., text summarization and question answering, and build the first large-scale dataset DOCNLI with over 1 million document pairs¹. However, the dataset does not provide evidence annotation about how the labels are inferred, i.e., which hypothesis sentences lead to semantic inconsistency, or which premise sentences help to decide the entailment relationship. As shown in Figure 1, although the sample is annotated as not entailment, most of the hypothesis sentences are actually entailed. By contrast, the detailed disinformation of "in 1989" in the third hypothesis sentence eventually decides the entailment relationship between the documents. For each hypothesis sentence, only several premise sentences are enough to serve as the exact evidence to judge its own sentence-level entailment label.

In this paper, we argue that evidence discovery is important and challenging for DOCNLI. Our pilot experiments in Sections 4.3 and 4.5 show that randomly selected evidences can still contribute to comparable performance. Thus, black-box models alone may not be so convincing. However, annotating evidence for evaluation is non-trivial. For each hypothesis sentence, on one hand, it may refer to multiple premise sentences. On the other hand, there may be several evidence groups, where each group can independently serve the label prediction. We highlight this as the interpretability challenge.
¹ Yin et al. (2021) propose the task and annotate the dataset with the same name, DOCNLI.
Label: not_entailment

Hypothesis: ① US cities along the Gulf of Mexico from Alabama to eastern Texas were on alert last night as Hurricane Andrew headed west after hitting southern Florida leaving at least eight dead, causing severe property damage, and leaving 1.2 million homes without electricity. ② Gusts of up to 165 mph were recorded. ③ It is the fiercest hurricane to hit the US in 1989. ④ As Andrew moved across the Gulf there was concern that it might hit New Orleans, which would be particularly susceptible to flooding, or smash into the concentrated offshore oil facilities. ⑤ President Bush authorized federal disaster assistance for the affected areas.

Premise: ① US CITIES along the Gulf of Mexico from Alabama to eastern Texas were on storm watch last night as Hurricane Andrew headed west after sweeping across southern Florida, causing at least eight deaths and severe property damage. ③ The hurricane was one of the fiercest in the US in decades and the first to hit Miami directly in a quarter of a century. ··· ④ However, Hurricane Andrew gathered fresh strength as it moved across the Gulf of Mexico and there was concern last night that it might head towards New Orleans, which is especially low lying and could suffer severe flood damage. ··· ④ It could threaten the large concentration of offshore oil production facilities in the Gulf of Mexico. ··· ② Andrew, the first Caribbean hurricane of the season, hit the eastern coast of Florida early yesterday, gusting up to 165mph. ··· ① The Florida Power and Light company said that about 1.2m of its customers, or 32 per cent, were without power. ··· ⑤ President Bush authorized federal disaster assistance for the affected areas and made plans for an inspection tour of the state. ···

Figure 1: A sample of the DOCNLI dataset. For each sample, only the entailment label between the documents is annotated. For display, we mark each hypothesis sentence and its corresponding premise sentences (namely the evidences, not annotated in the original dataset) with the same number and color. The sample is annotated as not entailment due to the disinformation of "in 1989" in the third hypothesis sentence. The premise is partly omitted.
In terms of models, current baselines (Yin et al., 2021; Zhong et al., 2020) still largely follow sentence-level NLI. They either concatenate the two documents for mutual information interaction for classification, or encode them separately for semantic matching with document-level representations. However, besides the interpretability issue, they leave the following challenges unexplored:

Long-range Dependency Modeling  The task requires handling a pair of long documents at the same time, where we observe that 29.81% of the samples in the DOCNLI dataset (Yin et al., 2021) contain more than 500 words², while 10.47% contain more than 1000 words. This will not only exceed the input limit of Pre-trained Language Models (PLMs), but also make it far more difficult to capture long-range dependency. Necessary information interaction between the hypothesis and some key premise sentences may not be guaranteed. Besides, most contexts are uninformative for entailment inference and will only serve as noise.

Cross-sentence Inference  To judge the relationship between the documents, all hypothesis sentences must be considered, where the detailed disinformation issue still remains unsolved. Besides, the verification of one hypothesis sentence may require combining multiple and distant premise sentences, different from the sentence-pair mode of sentence-level NLI. As shown in Figure 1, to process the first hypothesis sentence, both the first and sixth premise sentences (all in red fonts) need to be taken into account.

² A word may correspond to multiple tokens for PLMs.
In this paper, we establish a general solution, named the Retrieval, Reading and Fusion (R2F) framework, and a new setting for the task. The basic idea of the framework is to simplify the document-level classification task into a set of sentence-level tasks, and then improve both performance and interpretability with the power of evidence. Specifically, the framework splits the hypothesis document into sentences. Then, for each hypothesis sentence, it retrieves evidence sentences from the premise, and reads to estimate its credibility score upon the evidences. Finally, it fuses the sentence-level results and judges the entailment relationship between the two documents. For the setting, we contribute complementary fine-grained annotations for interpretability study. For each hypothesis sentence, we manually annotate an entailment label and several evidence groups, where each group is enough to independently infer the label. In summary, our contributions are as follows:

• We propose a Retrieval, Reading and Fusion framework as a general solution for the DOCNLI task.

• We contribute complementary evidence and entailment label annotations for each hypothesis sentence on a subset of the DOCNLI dataset for interpretability study.

• Our experimental results on the DOCNLI dataset indicate that the framework obtains state-of-the-art performance. Besides, it is robust to diverse retrieval methods. Moreover, the framework can give more interpretable prediction results.

Figure 2: The R2F framework. For each hypothesis sentence, the framework first retrieves evidence sentences from the premise, and then reads to estimate the credibility upon the evidences. Finally, it fuses the sentence-level results and judges the entailment relationship between the two documents. ŷ_i and ŷ_HP are the credibility scores of the i-th hypothesis sentence and the sample, while Evi_ij is the j-th evidence of the i-th hypothesis sentence.
2 R2F Framework

Our R2F framework aims at a general solution for the DOCNLI task with interpretability, i.e., it obtains the corresponding evidence and predicts an entailment label for each hypothesis sentence. As shown in Figure 2, the framework consists of three components, namely evidence retrieval, reading for credibility estimation, and credibility fusion. For efficiency, the retrieval component is an independent unit that provides evidence input for the other two components, which are optimized jointly.
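For concreteness, the following is a minimal sketch of how the three components compose. The names retrieve_fn, read_fn, and fuse_fn are hypothetical placeholders rather than names from the released code, since the reading and fusion modules are only specified later:

    # A hedged sketch of the R2F pipeline; retrieve_fn, read_fn and fuse_fn
    # are illustrative placeholders, not names from the released code.
    import nltk  # requires the "punkt" tokenizer data: nltk.download("punkt")

    def r2f_predict(hypothesis: str, premise: str,
                    retrieve_fn, read_fn, fuse_fn) -> str:
        # Split both documents into sentences, as the framework does with NLTK.
        hyp_sents = nltk.sent_tokenize(hypothesis)
        prem_sents = nltk.sent_tokenize(premise)
        # Estimate one credibility score per hypothesis sentence upon its
        # retrieved evidence.
        creds = [read_fn(h, retrieve_fn(h, prem_sents)) for h in hyp_sents]
        # Fuse the sentence-level results into the document-level label.
        return "entailment" if fuse_fn(creds) else "not_entailment"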
2.1 Task Formulation
Similar to previous sentence-level NLI tasks, for each sample in the DOCNLI task, given a hypothesis document H and a premise document P, it is requested to judge the entailment relationship R between the two documents. Here, R ∈ {"entailment", "not_entailment"} for the DOCNLI dataset, but the task need not be restricted to binary classification.
2.2 Evidence Retrieval
Given each sample, we split the hypothesis into sentences and retrieve evidence sentences from the premise. Formally, we split the hypothesis H and the premise P into single sentences {H1, H2, ..., Hm} and {P1, P2, ..., Pn} with the NLTK tool³. Here, m and n are the numbers of sentences.
For each hypothesis sentence Hi, we respectively utilize the following retrieval methods to calculate its relevance score with all premise sentences. Then, according to the scores, we retain the top K sentences as the corresponding evidence. The value of K is a trade-off between evidence recall and precision. A lower value pursues higher evidence precision, but may lead to missing evidence, while a higher value guarantees higher evidence recall, but may introduce too many uninformative sentences as noise. Moreover, to keep and utilize contextual information, for each hypothesis sentence, we reorder the evidence sentences according to their original order in the premise.
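As a concrete illustration, a minimal sketch of this top-K selection with re-ordering follows. Here, score_fn stands for any of the relevance scorers described next, and the default k = 5 is a hypothetical value, since the paper treats K as a tunable trade-off:

    # A minimal sketch of top-K evidence retrieval with re-ordering.
    # score_fn is a pluggable relevance scorer; k = 5 is a hypothetical
    # default, not a value specified by the paper.
    def retrieve_evidence(hyp_sent: str, premise_sents: list[str],
                          score_fn, k: int = 5) -> list[str]:
        # Rank premise sentences by relevance and keep the top-K indices.
        top = sorted(range(len(premise_sents)),
                     key=lambda i: score_fn(hyp_sent, premise_sents[i]),
                     reverse=True)[:k]
        # Restore the original premise order to keep contextual information.
        return [premise_sents[i] for i in sorted(top)]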
To calculate the relevance score, we take several sparse and dense retrieval methods into consideration:
ROUGE-1  Inspired by Mao et al. (2022) and Zhang et al. (2022), we adopt ROUGE-1 retrieval. For a pair of sentences, this sparse retrieval method focuses on the n-gram match of the pair and calculates the ROUGE-1 score as the relevance metric. We take it as the main retrieval method.
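The paper does not name a specific ROUGE implementation, so the following sketch, which uses the rouge-score package and ROUGE-1 F1 as the relevance metric, is one plausible instantiation:

    # One possible ROUGE-1 scorer (the exact implementation is an
    # assumption), using the rouge-score package.
    from rouge_score import rouge_scorer

    _rouge = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

    def rouge1_score(hyp_sent: str, premise_sent: str) -> float:
        # ROUGE-1 F1 measures unigram overlap between the two sentences.
        return _rouge.score(premise_sent, hyp_sent)["rouge1"].fmeasure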
BM25⁴  BM25 is one of the strongest sparse retrieval methods. We take all premise sentences as the corpus. For a pair of sentences, BM25 involves not only the pair itself but also the whole corpus, counting term frequency and inverse document frequency to obtain the relevance score.
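With the rank_bm25 implementation cited in footnote 4, the scoring step can be sketched as follows; the lower-cased whitespace tokenization is an assumption, as the paper does not specify a tokenizer:

    # Sketch of BM25 scoring with the rank_bm25 package (footnote 4);
    # the simple lower-cased whitespace tokenization is an assumption.
    from rank_bm25 import BM25Okapi

    def bm25_scores(hyp_sent: str, premise_sents: list[str]) -> list[float]:
        # All premise sentences form the corpus for the BM25 statistics.
        bm25 = BM25Okapi([s.lower().split() for s in premise_sents])
        # get_scores returns one relevance score per corpus sentence.
        return list(bm25.get_scores(hyp_sent.lower().split()))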
SimCSE⁵  Inspired by Gao et al. (2021), we utilize SimCSE (Gao et al., 2021), a strong sentence embedding model, as the dense retrieval method for semantic matching. For a pair of sentences, we take the cosine similarity of their sentence embeddings as the relevance score.
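A sketch of the dense scorer using a public Hugging Face SimCSE checkpoint is shown below; the specific checkpoint name is an assumption, as the paper only states that the unsupervised and supervised RoBERTa-base versions are used (footnote 5):

    # A sketch of SimCSE-based dense retrieval; the checkpoint name is an
    # assumption (the paper uses the unsupervised and supervised
    # RoBERTa-base versions, footnote 5).
    import torch
    from transformers import AutoModel, AutoTokenizer

    _tok = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-roberta-base")
    _enc = AutoModel.from_pretrained("princeton-nlp/sup-simcse-roberta-base")

    def simcse_score(hyp_sent: str, premise_sent: str) -> float:
        batch = _tok([hyp_sent, premise_sent], padding=True,
                     truncation=True, return_tensors="pt")
        with torch.no_grad():
            # Following the SimCSE usage example, take the pooler output
            # as the sentence embedding.
            emb = _enc(**batch).pooler_output
        # Cosine similarity of the two embeddings is the relevance score.
        return torch.cosine_similarity(emb[0], emb[1], dim=0).item()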
Besides the above retrieval methods, we also adopt another simple but effective strategy. If a hypothesis sentence is a substring of the premise,
³ https://www.nltk.org/
⁴ We adopt the implementation from https://github.com/dorianbrown/rank_bm25.
⁵ We adopt the unsupervised and supervised versions of RoBERTa-base.