
called SCIFACT-OPEN, which requires models to verify claims against evidence from both the SCIFACT (Wadden et al., 2020) collection and an additional corpus of 500K scientific research abstracts. To avoid the burden of exhaustive annotation, we take inspiration from the pooling strategy (Sparck Jones and van Rijsbergen, 1975) popularized by the TREC competitions (Voorhees and Harman, 2005) and combine the predictions of several state-of-the-art scientific claim verification models: for each claim, abstracts that the models identify as likely to SUPPORT or REFUTE the claim are included as candidates for human annotation.
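Concretely, the pooling step amounts to taking, for each claim, the union of abstracts that at least one system predicts as evidence. The sketch below illustrates this under assumed interfaces; the `systems` objects and their `predict` method are hypothetical stand-ins for the participating models, not part of any released code.

```python
def pool_candidates(claims, systems):
    """Pooling sketch: for each claim, collect every abstract that at least
    one system predicts as evidence (SUPPORTS or REFUTES). The pooled
    abstracts are the candidates sent for human annotation."""
    pooled = {}
    for claim in claims:
        candidates = set()
        for system in systems:
            # `predict` is a hypothetical interface returning a dict of
            # {abstract_id: predicted_label} for abstracts the system flags.
            for abstract_id, label in system.predict(claim).items():
                if label in {"SUPPORTS", "REFUTES"}:
                    candidates.add(abstract_id)
        pooled[claim] = candidates
    return pooled
```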
Our main contributions and findings are as follows. (1) We introduce SCIFACT-OPEN, a new test collection for open-domain scientific claim verification, including 279 claims verified against evidence retrieved from a corpus of 500K abstracts. (2) We find that state-of-the-art models developed for SCIFACT perform substantially worse (at least 15 F1) in the open-domain setting, highlighting the need to improve upon the generalization capabilities of existing systems. (3) We identify and characterize new dataset phenomena that are likely to occur in real-world claim verification settings. These include mismatches between the specificity of a claim and a piece of evidence, and the presence of conflicting evidence (Fig. 1).
With SCIFACT-OPEN, we introduce a challenging new test set for scientific claim verification that more closely approximates how the task might be performed in real-world settings. This dataset will allow for further study of claim-evidence phenomena and model generalizability as encountered in open-domain scientific claim verification.
2 Background and Task Overview
We review the scientific claim verification task, and summarize the data collection process and modeling approaches for SCIFACT, which we build upon in this work. We elect to use the SCIFACT dataset as our starting point because of the diversity of claims in the dataset and the availability of a number of state-of-the-art models that can be used for pooled data collection. In the following, we refer to the original SCIFACT dataset as SCIFACT-ORIG.
2.1 Task definition
Given a claim c and a corpus of research abstracts A, the scientific claim verification task is to identify all abstracts in A which contain evidence relevant to c, and to predict a label y(c, a) ∈ {SUPPORTS, REFUTES} for each evidence abstract. All other abstracts are labeled y(c, a) = NEI (Not Enough Info). We will refer to a single (c, a) pair as a claim / abstract pair, or CAP. Any CAP where the abstract a provides evidence for the claim c (either SUPPORTS or REFUTES) will be called an evidentiary CAP, or ECAP. Models are evaluated on their precision, recall, and F1 in identifying and correctly labeling the evidence abstracts associated with each claim in the dataset (or equivalently, in identifying ECAPs).¹
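As a concrete illustration of this metric, the following sketch computes abstract-level, label-only precision, recall, and F1, assuming that gold and predicted ECAPs are given as dictionaries keyed by (claim_id, abstract_id); the data layout is our own assumption, not the official evaluation script.

```python
def ecap_precision_recall_f1(gold, pred):
    """Evaluation sketch: `gold` and `pred` map (claim_id, abstract_id)
    pairs to labels in {"SUPPORTS", "REFUTES"}; NEI pairs are simply absent.
    A predicted ECAP is correct only if the same pair appears in `gold`
    with the same label."""
    correct = sum(1 for cap, label in pred.items() if gold.get(cap) == label)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```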
2.2 SCIFACT-ORIG
Each claim in SCIFACT-ORIG was created by rewriting a citation sentence occurring in a scientific article, and verifying the claim against the abstracts of the cited articles. The resulting claims are diverse both in terms of their subject matter, which ranges from molecular biology to public health, and their level of specificity (see §3.3). Models are required to retrieve and label evidence from a small (roughly 5K abstract) corpus.
Models for SCIFACT-ORIG generally follow a two-stage approach to verify a given claim. First, a small collection of candidate abstracts is retrieved from the corpus using a retrieval technique like BM25 (Robertson and Zaragoza, 2009); then, a transformer-based language model (Devlin et al., 2019; Raffel et al., 2020) is trained to predict whether each retrieved document SUPPORTS, REFUTES, or contains no relevant evidence (NEI) with respect to the claim.
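The two-stage pipeline can be sketched as follows. We use the rank_bm25 package for the retrieval step purely for illustration, and leave the second-stage label predictor as a hypothetical `label_model.classify` call, since the exact transformer architectures vary across systems; none of this code is taken from the SCIFACT baselines.

```python
from rank_bm25 import BM25Okapi  # assumes the rank_bm25 package is installed

def verify_claim(claim, abstracts, label_model, k=20):
    """Two-stage sketch: (1) retrieve top-k abstracts with BM25,
    (2) label each candidate as SUPPORTS / REFUTES / NEI.

    `abstracts` maps abstract_id -> abstract text. `label_model.classify`
    is a hypothetical transformer-based classifier for (claim, abstract) pairs.
    """
    ids = list(abstracts)
    bm25 = BM25Okapi([abstracts[i].lower().split() for i in ids])
    scores = bm25.get_scores(claim.lower().split())
    top_k = sorted(zip(ids, scores), key=lambda pair: -pair[1])[:k]

    predictions = {}
    for abstract_id, _ in top_k:
        label = label_model.classify(claim, abstracts[abstract_id])
        if label != "NEI":
            predictions[abstract_id] = label  # evidentiary CAPs for this claim
    return predictions
```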
As we show in §4 and §5, a key determinant of system generalization is the negative sampling ratio. A negative sampling ratio of r indicates that the model is trained on r irrelevant CAPs for every relevant ECAP. Negative sampling has been shown to improve performance (particularly precision) on SCIFACT-ORIG (Li et al., 2021). See Appendix A.4 for additional details.
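To make the negative sampling ratio concrete, a training set with ratio r can be assembled roughly as follows; the data layout and helper names are assumptions for illustration, not the actual training code.

```python
import random

def build_training_caps(ecaps, corpus_ids, ratio, seed=0):
    """Negative sampling sketch: for every evidentiary CAP, add `ratio`
    irrelevant CAPs for the same claim, labeled NEI.

    `ecaps` is a list of (claim, abstract_id, label) triples with label in
    {"SUPPORTS", "REFUTES"}; `corpus_ids` is the set of all abstract ids.
    For simplicity, only the paired evidence abstract is excluded when
    sampling negatives."""
    rng = random.Random(seed)
    examples = []
    for claim, abstract_id, label in ecaps:
        examples.append((claim, abstract_id, label))
        negatives = rng.sample(sorted(corpus_ids - {abstract_id}), ratio)
        examples.extend((claim, neg_id, "NEI") for neg_id in negatives)
    return examples
```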
3 The SCIFACT-OPEN dataset
In this section, we describe the construction of
SCIFACT-OPEN. We report the performance of
claim verification models on SCIFACT-OPEN in §4,
and perform reliability checks on the results in §5.
¹The original SCIFACT task also requires the prediction of rationales justifying each label. Due to the expense of collecting rationale annotations, in this work we do not require rationales; we evaluate using the abstract-level label-only F1 metric described in Wadden et al. (2020).