SCIFACT-OPEN: Towards open-domain scientific claim verification
David Wadden† Kyle Lo‡ Bailey Kuehl‡ Arman Cohan‡
Iz Beltagy‡ Lucy Lu Wang†‡ Hannaneh Hajishirzi†‡
†University of Washington, Seattle, WA, USA
‡Allen Institute for Artificial Intelligence, Seattle, WA, USA
{dwadden,hannaneh}@cs.washington.edu,lucylw@uw.edu,
{kylel,baileyk,armanc,beltagy}@allenai.org
Abstract
While research on scientific claim verification has led to the development of powerful systems that appear to approach human performance, these approaches have yet to be tested in a realistic setting against large corpora of scientific literature. Moving to this open-domain evaluation setting, however, poses unique challenges; in particular, it is infeasible to exhaustively annotate all evidence documents. In this work, we present SCIFACT-OPEN, a new test collection designed to evaluate the performance of scientific claim verification systems on a corpus of 500K research abstracts. Drawing upon pooling techniques from information retrieval, we collect evidence for scientific claims by pooling and annotating the top predictions of four state-of-the-art scientific claim verification models. We find that systems developed on smaller corpora struggle to generalize to SCIFACT-OPEN, exhibiting performance drops of at least 15 F1. In addition, analysis of the evidence in SCIFACT-OPEN reveals interesting phenomena likely to appear when claim verification systems are deployed in practice, e.g., cases where the evidence supports only a special case of the claim. Our dataset is available at https://github.com/dwadden/scifact-open.
1 Introduction
The task of scientific claim verification (Wadden et al., 2020; Kotonya and Toni, 2020) aims to help system users assess the veracity of a scientific claim relative to a corpus of research literature. Most existing work and available datasets focus on verifying claims against a much more limited context: for instance, a single article or text snippet (Saakyan et al., 2021; Sarrouti et al., 2021; Kotonya and Toni, 2020) or a small, artificially-constructed collection of documents (Wadden et al., 2020). Current state-of-the-art models are able to achieve very strong performance on these datasets, in some cases approaching human agreement (Wadden et al., 2022).

[Figure 1: the 5K-abstract SCIFACT-ORIG corpus shown as a subset of the 500K-abstract SCIFACT-OPEN corpus, with an example claim and evidence.
Claim: Cancer risk is lower in individuals with a history of alcohol consumption.
Supports: Alcohol consumption was associated with a decreased risk of thyroid cancer.
Refutes: We found that the risk of cancer rises with increasing levels of alcohol consumption.]

Figure 1: SCIFACT-OPEN, a new test collection for scientific claim verification that expands beyond the 5K abstract retrieval setting in the original SCIFACT dataset (Wadden et al., 2020) to a corpus of 500K abstracts. Each claim in SCIFACT-OPEN is annotated with evidence that SUPPORTS or REFUTES the claim. In the example shown, the majority of evidence REFUTES the claim that alcohol consumption reduces cancer risk, although one abstract indicates that alcohol consumption may reduce thyroid cancer risk specifically.
This gives rise to the question of the scalability of scientific claim verification systems to realistic, open-domain settings that involve verifying claims against corpora containing hundreds of thousands of documents. In these cases, claim verification systems should assist users by identifying and categorizing all available documents that contain evidence supporting or refuting each claim (Fig. 1). However, evaluating system performance in this setting is difficult because exhaustive evidence annotation is infeasible, an issue analogous to evaluation challenges in information retrieval (IR).

In this paper, we construct a new test collection for open-domain scientific claim verification,
called SCIFACT-OPEN, which requires models to verify claims against evidence from both the SCIFACT (Wadden et al., 2020) collection and a corpus of 500K scientific research abstracts. To avoid the burden of exhaustive annotation, we take inspiration from the pooling strategy (Sparck Jones and van Rijsbergen, 1975) popularized by the TREC competitions (Voorhees and Harman, 2005) and combine the predictions of several state-of-the-art scientific claim verification models: for each claim, abstracts that the models identify as likely to SUPPORT or REFUTE the claim are included as candidates for human annotation.
Our main contributions and findings are as follows. (1) We introduce SCIFACT-OPEN, a new test collection for open-domain scientific claim verification, including 279 claims verified against evidence retrieved from a corpus of 500K abstracts. (2) We find that state-of-the-art models developed for SCIFACT perform substantially worse (at least 15 F1) in the open-domain setting, highlighting the need to improve upon the generalization capabilities of existing systems. (3) We identify and characterize new dataset phenomena that are likely to occur in real-world claim verification settings. These include mismatches between the specificity of a claim and a piece of evidence, and the presence of conflicting evidence (Fig. 1).

With SCIFACT-OPEN, we introduce a challenging new test set for scientific claim verification that more closely approximates how the task might be performed in real-world settings. This dataset will allow for further study of claim-evidence phenomena and model generalizability as encountered in open-domain scientific claim verification.
2 Background and Task Overview
We review the scientific claim verification task, and summarize the data collection process and modeling approaches for SCIFACT, which we build upon in this work. We elect to use the SCIFACT dataset as our starting point because of the diversity of claims in the dataset and the availability of a number of state-of-the-art models that can be used for pooled data collection. In the following, we refer to the original SCIFACT dataset as SCIFACT-ORIG.
2.1 Task definition
Given a claim c and a corpus of research abstracts A, the scientific claim verification task is to identify all abstracts in A which contain evidence relevant to c, and to predict a label y(c, a) ∈ {SUPPORTS, REFUTES} for each evidence abstract. All other abstracts are labeled y(c, a) = NEI (Not Enough Info). We will refer to a single (c, a) pair as a claim / abstract pair, or CAP. Any CAP where the abstract a provides evidence for the claim c (either SUPPORTS or REFUTES) will be called an evidentiary CAP, or ECAP. Models are evaluated on their precision, recall, and F1 in identifying and correctly labeling the evidence abstracts associated with each claim in the dataset (or equivalently, in identifying ECAPs).1
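To make this evaluation concrete, the following is a minimal sketch (not the authors' released scoring code) of abstract-level, label-only precision, recall, and F1: a predicted claim / abstract pair counts as a true positive only if it appears in the gold annotations with the same label.

```python
def label_only_f1(gold, predicted):
    """Abstract-level, label-only evaluation over ECAPs.

    `gold` and `predicted` map (claim_id, abstract_id) -> "SUPPORTS" or "REFUTES";
    any pair absent from a dict is implicitly NEI. A predicted pair is a true
    positive only if it appears in `gold` with the same label.
    """
    tp = sum(1 for cap, label in predicted.items() if gold.get(cap) == label)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: one correct SUPPORTS, one wrong label, one missed ECAP.
gold = {("c1", "a1"): "SUPPORTS", ("c1", "a2"): "REFUTES", ("c2", "a3"): "SUPPORTS"}
pred = {("c1", "a1"): "SUPPORTS", ("c1", "a2"): "SUPPORTS"}
print(label_only_f1(gold, pred))  # precision 0.5, recall 0.33, f1 0.4
```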
2.2 SCIFACT-ORIG
Each claim in SCIFACT-ORIG was created by rewriting a citation sentence occurring in a scientific article, and verifying the claim against the abstracts of the cited articles. The resulting claims are diverse both in terms of their subject matter (ranging from molecular biology to public health) and their level of specificity (see §3.3). Models are required to retrieve and label evidence from a small (roughly 5K abstract) corpus.

Models for SCIFACT-ORIG generally follow a two-stage approach to verify a given claim. First, a small collection of candidate abstracts is retrieved from the corpus using a retrieval technique like BM25 (Robertson and Zaragoza, 2009); then, a transformer-based language model (Devlin et al., 2019; Raffel et al., 2020) is trained to predict whether each retrieved document SUPPORTS, REFUTES, or contains no relevant evidence (NEI) with respect to the claim.
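The sketch below illustrates this general retrieve-then-label pipeline. The `retrieve` and `classify` callables are hypothetical placeholders standing in for BM25 (possibly with a re-ranker) and a transformer label head; they are not the API of any particular system discussed here.

```python
from typing import Callable, Dict, List

Label = str  # "SUPPORTS", "REFUTES", or "NEI"

def verify_claim(
    claim: str,
    corpus: Dict[str, str],                                     # abstract_id -> abstract text
    retrieve: Callable[[str, Dict[str, str], int], List[str]],  # top-k abstract ids
    classify: Callable[[str, str], Dict[Label, float]],         # softmax scores per label
    k: int = 20,
) -> Dict[str, Label]:
    """Two-stage claim verification: retrieve candidates, then label each one.

    Returns only abstracts labeled SUPPORTS or REFUTES; everything else is NEI.
    """
    predictions = {}
    for abstract_id in retrieve(claim, corpus, k):
        scores = classify(claim, corpus[abstract_id])
        label = max(scores, key=scores.get)
        if label != "NEI":
            predictions[abstract_id] = label
    return predictions

# Toy stand-ins so the sketch runs end to end; a real system would plug in
# BM25 (plus a neural re-ranker) for `retrieve` and a transformer for `classify`.
def toy_retrieve(claim, corpus, k):
    overlap = lambda text: len(set(claim.lower().split()) & set(text.lower().split()))
    return sorted(corpus, key=lambda aid: overlap(corpus[aid]), reverse=True)[:k]

def toy_classify(claim, abstract):
    # Pretend the model finds support whenever the abstract mentions "cancer".
    return ({"SUPPORTS": 0.6, "REFUTES": 0.1, "NEI": 0.3} if "cancer" in abstract.lower()
            else {"SUPPORTS": 0.1, "REFUTES": 0.1, "NEI": 0.8})

corpus = {"a1": "Alcohol consumption and thyroid cancer risk.", "a2": "Unrelated abstract."}
print(verify_claim("Alcohol consumption lowers cancer risk.", corpus, toy_retrieve, toy_classify))
```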
As we show in §4 and §5, a key determinant of system generalization is the negative sampling ratio. A negative sampling ratio of r indicates that the model is trained on r irrelevant CAPs for every relevant ECAP. Negative sampling has been shown to improve performance (particularly precision) on SCIFACT-ORIG (Li et al., 2021). See Appendix A.4 for additional details.
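As a hypothetical illustration of what a negative sampling ratio of r means for the training data, the sketch below pairs each ECAP with r randomly chosen non-evidence abstracts labeled NEI; individual systems may implement negative sampling differently.

```python
import random
from typing import List, Tuple

def build_training_caps(
    ecaps: List[Tuple[str, str, str]],  # (claim_id, abstract_id, label), label in {SUPPORTS, REFUTES}
    corpus_ids: List[str],              # all abstract ids available for training
    r: int,                             # negative sampling ratio
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Return the ECAPs plus r NEI-labeled negative CAPs per ECAP."""
    rng = random.Random(seed)
    evidence_by_claim = {}
    for claim_id, abstract_id, _ in ecaps:
        evidence_by_claim.setdefault(claim_id, set()).add(abstract_id)

    training_caps = list(ecaps)
    for claim_id, abstract_id, _ in ecaps:
        # Sample negatives from abstracts that are not evidence for this claim.
        candidates = [a for a in corpus_ids if a not in evidence_by_claim[claim_id]]
        for negative_id in rng.sample(candidates, min(r, len(candidates))):
            training_caps.append((claim_id, negative_id, "NEI"))
    return training_caps

# With r = 10, each evidentiary CAP contributes 10 irrelevant CAPs to training.
ecaps = [("c1", "a1", "SUPPORTS"), ("c2", "a7", "REFUTES")]
corpus_ids = [f"a{i}" for i in range(100)]
print(len(build_training_caps(ecaps, corpus_ids, r=10)))  # 2 ECAPs + 20 negatives = 22
```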
3 The SCIFACT-OPEN dataset
In this section, we describe the construction of
SCIFACT-OPEN. We report the performance of
claim verification models on SCIFACT-OPEN in §4,
and perform reliability checks on the results in §5.
1 The original SCIFACT task also requires the prediction of rationales justifying each label. Due to the expense of collecting rationale annotations, in this work we do not require rationales; we evaluate using the abstract-level label-only F1 metric described in Wadden et al. (2020).
[Figure 2: schematic of the pooling pipeline. From a corpus of 500K abstracts, for each of Claims 1 through n: (1) retrieve k abstracts per claim; (2) compute model confidence scores for each CAP; (3) rank CAPs by confidence score; (4) take the union of the top-d CAPs from each of the N systems; (5) human-annotate all CAPs in the pool.]

Figure 2: Pooling methodology used to collect evidence for SCIFACT-OPEN. We construct the pool by combining the d most-confident predictions of n different systems. A single CAP is represented as a colored box; the number in the box indicates a hypothetical confidence score. In this example, the annotation pool contains 3 CAPs from Claim 1, 2 for Claim 2, and 1 for Claim 3. Annotators found evidence for 4 / 6 of these CAPs.
Our goal is to construct a test collection which can be used to assess the performance of claim verification systems deployed on a large corpus of scientific literature. This requires a collection of claims, a corpus of abstracts against which to verify them, and evidence annotations with which to evaluate system predictions. We use the claims from the SCIFACT-ORIG test set as our claims for SCIFACT-OPEN.2 To obtain evidence annotations, we use all evidence from SCIFACT-ORIG as evidence in our new dataset and collect additional evidence from the SCIFACT-OPEN corpus.
For our corpus, we filter the S2ORC dataset (Lo et al., 2020) for all articles which (1) cover topics related to medicine or biology and (2) have at least one inbound and one outbound citation. From the roughly 6.5 million articles that pass these filters, we randomly sample 500K articles to form the corpus for SCIFACT-OPEN, making sure to include the 5K abstracts from SCIFACT-ORIG. We choose to limit the corpus to 500K abstracts to ensure that we can achieve sufficient annotation coverage of the available evidence. Additional details on corpus construction can be found in Appendix A.
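A rough sketch of this filtering-and-sampling step is shown below. The record field names ("paper_id", "fields_of_study", and the citation lists) are illustrative assumptions rather than the exact S2ORC schema, and the real pipeline (Appendix A) may differ in detail.

```python
import random

def sample_corpus(s2orc_records, target_size=500_000, required_ids=(), seed=0):
    """Filter S2ORC-style records and sample a fixed-size corpus.

    Keeps records tagged with medicine or biology that have at least one inbound
    and one outbound citation, then samples `target_size` of them while always
    retaining `required_ids` (e.g. the 5K SCIFACT-ORIG abstracts).
    NOTE: the field names used here are hypothetical, not the exact S2ORC schema.
    """
    eligible = [
        rec for rec in s2orc_records
        if {"Medicine", "Biology"} & set(rec.get("fields_of_study", []))
        and rec.get("inbound_citations") and rec.get("outbound_citations")
    ]
    required = set(required_ids)
    kept = [rec for rec in eligible if rec["paper_id"] in required]
    pool = [rec for rec in eligible if rec["paper_id"] not in required]
    rng = random.Random(seed)
    n_extra = max(0, target_size - len(kept))
    kept.extend(rng.sample(pool, min(n_extra, len(pool))))
    return kept
```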
Unlike SCIFACT-ORIG (which is skewed toward highly-cited articles from "high-impact" journals), we do not impose any additional quality filters on articles included in SCIFACT-OPEN; thus, our corpus captures the full diversity of information likely to be encountered when scientific fact-checking systems are deployed on real-world resources like S2ORC, arXiv,3 or PubMed Central.4
2 We remove 21 claims (out of 300 total) whose source citations lack important metadata; see Appendix A for details.
3 https://arxiv.org
4 https://www.ncbi.nlm.nih.gov/pmc
3.1 Pooling for evidence collection
To collect evidence from the SCIFACT-OPEN corpus, we adopt a pooling approach popularized by the TREC competitions: use a collection of state-of-the-art models to select CAPs for human annotation, and assume that all un-annotated CAPs have y(c, a) = NEI. We will examine the degree to which this assumption holds in §5.
Pooling approach. We annotate the d most-confident predicted CAPs from each of n claim verification systems. An overview of the process is shown in Fig. 2; we number the annotation steps below to match the figure.

We select the most confident predictions for a single model as follows. (1) For each claim in SCIFACT-OPEN, we use an information retrieval system consisting of BM25 followed by a neural re-ranker (Pradeep et al., 2021) to retrieve k abstracts from the SCIFACT-OPEN corpus. (2) For each CAP, we compute the softmax scores associated with the three possible output labels, denoted s(SUPPORTS), s(REFUTES), s(NEI). We use max(s(SUPPORTS), s(REFUTES)) as a measure of the model's confidence that the CAP contains evidence. (3) We rank all CAPs by model confidence, and add the d top-ranked predictions to the annotation pool. The final pool (4) is the union of the top-d CAPs identified by each system. Since some CAPs are identified by multiple systems, the size of the final annotation pool is less than n × d; we provide statistics in §3.2. Finally, (5) all CAPs in the pool are annotated for evidence and assigned a final label by an expert annotator, and the label is double-checked by a second annotator (see Appendix A for details).
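Steps (2) through (4) can be summarized in a short sketch, shown below under the simplifying assumption that each system's retrieved CAPs arrive with per-label softmax scores; this is an illustration of the procedure, not the authors' released code.

```python
from typing import Dict, List, Set, Tuple

CAP = Tuple[str, str]  # (claim_id, abstract_id)

def build_annotation_pool(
    system_predictions: List[Dict[CAP, Dict[str, float]]],  # one dict per system: CAP -> softmax scores
    d: int = 250,
) -> Set[CAP]:
    """Union of each system's d most-confident CAPs.

    Confidence is max(s(SUPPORTS), s(REFUTES)), as in steps (2)-(4) of Fig. 2.
    """
    pool: Set[CAP] = set()
    for predictions in system_predictions:
        confidence = {
            cap: max(scores["SUPPORTS"], scores["REFUTES"])
            for cap, scores in predictions.items()
        }
        top_d = sorted(confidence, key=confidence.get, reverse=True)[:d]
        pool.update(top_d)
    return pool

# Tiny worked example with two systems and d = 2: overlapping CAPs are pooled once.
sys1 = {("c1", "a1"): {"SUPPORTS": 0.9, "REFUTES": 0.0, "NEI": 0.1},
        ("c1", "a2"): {"SUPPORTS": 0.2, "REFUTES": 0.6, "NEI": 0.2},
        ("c2", "a3"): {"SUPPORTS": 0.1, "REFUTES": 0.1, "NEI": 0.8}}
sys2 = {("c1", "a1"): {"SUPPORTS": 0.8, "REFUTES": 0.1, "NEI": 0.1},
        ("c2", "a4"): {"SUPPORTS": 0.7, "REFUTES": 0.1, "NEI": 0.2}}
print(build_annotation_pool([sys1, sys2], d=2))  # {("c1","a1"), ("c1","a2"), ("c2","a4")}
```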
We choose to prioritize CAPs for annotation based on model confidence, rather than annotating a fixed number of CAPs per claim, in order to maximize the amount of evidence likely to be discovered during pooling. In §3.3, we confirm that our procedure identifies more evidence for claims that we would expect to be more extensively-studied.

Model            Source                  Negative sampling
Pooling and Eval
  VERT5ERINI     Pradeep et al. (2021)   0
  PARAGRAPHJOINT Li et al. (2021)        10
  MULTIVERS      Wadden et al. (2022)    20
  MULTIVERS10    Wadden et al. (2022)    10
Eval only
  ARSJOINT       Zhang et al. (2021)     12

Table 1: Models used for pooled data collection and evaluation (top), and for evaluation only (bottom). "Negative sampling" indicates the negative sampling ratio. MULTIVERS10 shares the same architecture as MULTIVERS, but trains on fewer negative samples.
Models and parameter settings. We set k = 50 for abstract retrieval. In practice, we found that the great majority of evidentiary abstracts were ranked among the top 20 retrievals for their respective claims (Appendix A.3), and thus using a larger k would serve mainly to increase the number of irrelevant results. We set d = 250; in §5.1, we show that this is sufficient to ensure that our dataset can be used for reliable model evaluation.

For our models, we utilized all state-of-the-art models developed for SCIFACT-ORIG for which modeling code and checkpoints were available (to our knowledge). We used n = 4 systems for pooled data collection. During evaluation, we included a fifth system, ARSJOINT, which became available after the dataset had been collected. Model names, source publications, and negative sampling ratios are listed in Table 1; see Appendix A for additional details.
3.2 Dataset statistics
We summarize key properties of SCIFACT-OPEN. Table 2a provides an overview of the claims, corpus, and evidence in the dataset. Table 2b shows the fraction of CAPs annotated during pooling which were judged to be ECAPs (i.e., to contain evidence). Overall, roughly a third of predicted CAPs were judged as relevant; this indicates that existing systems achieve relatively low precision when used in an open-domain setting. Relevance is somewhat higher (roughly 50%) for CAPs predicted by more than one system. The majority of CAPs are selected by a single system only, indicating high diversity in model predictions. As mentioned in §3.1, the total number of annotated CAPs is 732 (rather than 4 models × 250 CAPs / model = 1000) due to overlap in system predictions.

Claims   Corpus   ECAPs (SCIFACT-ORIG)   ECAPs (Pooling)   ECAPs (Total)
279      500K     209                    251               460

(a) Summary of the SCIFACT-OPEN dataset, including the number of claims, abstracts, and ECAPs (evidentiary claim / evidence pairs). ECAPs come from two sources: those from SCIFACT-ORIG, and those discovered via pooling.

Num. systems   Annotated   Evidence   % Evidence
1              528         154        29.2
2              150         71         47.3
3              44          20         45.5
4              10          6          60.0
All            732         251        34.3

(b) Relevance of CAPs annotated during the pooling process. The first row indicates that 528 CAPs were identified for pooling by one system only; of those CAPs, 154 were judged by annotators as containing evidence. The more systems identified a given CAP, the more likely it is to contain evidence.

         Total   Retrieved   Annotated
ECAPs    209     187 (89%)   171 (82%)

(c) Count of how many ECAPs from SCIFACT-ORIG would have been identified during pooled data collection. "Retrieved" indicates the number of ECAPs that would have been retrieved among the top k, and "Annotated" indicates the number that would further have been included in the annotation pool.

Table 2: Annotation results and dataset statistics for SCIFACT-OPEN.
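Statistics like those in Table 2b follow directly from the pooling output; the hypothetical sketch below tallies, for each count of proposing systems, how many pooled CAPs were annotated and how many were judged to contain evidence.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

CAP = Tuple[str, str]  # (claim_id, abstract_id)

def overlap_statistics(
    system_pools: List[Set[CAP]],  # the top-d CAPs contributed by each system
    judged_evidence: Set[CAP],     # CAPs annotators labeled SUPPORTS or REFUTES
) -> Dict[int, Tuple[int, int, float]]:
    """For each number of proposing systems: (annotated CAPs, ECAPs, % evidence)."""
    proposer_counts = Counter(cap for pool in system_pools for cap in pool)
    stats = {}
    for n_systems in sorted(set(proposer_counts.values())):
        caps = [cap for cap, n in proposer_counts.items() if n == n_systems]
        n_evidence = sum(1 for cap in caps if cap in judged_evidence)
        stats[n_systems] = (len(caps), n_evidence, 100.0 * n_evidence / len(caps))
    return stats
```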
Table 2c shows how many of the ECAPs from
SCIFACT-ORIG would have been annotated by our
pooling procedure. The fact that the great majority
of the original ECAPs would have been included
in the annotation pool suggests that our approach
achieves reasonable evidence coverage.
3.3 Evidence phenomena in SCIFACT-OPEN
We observe three properties of evidence in
SCIFACT-OPEN that have received less attention in
the study of scientific claim verification, and that
can inform future work on this task.
Unequal allocation of evidence. Fig. 3 shows the distribution of evidence amongst claims in SCIFACT-OPEN. We find that evidence is distributed unequally; half of all ECAPs are allocated to 34 highly-studied claims (12% of all claims in the dataset). We investigated the characteristics of