
called SCIFACT-OPEN, which requires models to verify claims against evidence from both the SCIFACT (Wadden et al., 2020) collection and an additional corpus of 500K scientific research abstracts. To avoid the burden of exhaustive annotation, we take inspiration from the pooling strategy (Sparck Jones and van Rijsbergen, 1975) popularized by the TREC competitions (Voorhees and Harman, 2005) and combine the predictions of several state-of-the-art scientific claim verification models: for each claim, abstracts that the models identify as likely to SUPPORT or REFUTE the claim are included as candidates for human annotation.
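Concretely, the pooling step amounts to taking, for each claim, the union of abstracts that at least one system predicts as evidence. The sketch below illustrates this under assumed interfaces; the `systems` objects and their `predict` method are hypothetical stand-ins for the participating models, not part of any released code.

```python
def pool_candidates(claims, systems):
    """Pooling sketch: for each claim, collect every abstract that at least
    one system predicts as evidence (SUPPORTS or REFUTES). The pooled
    abstracts are the candidates sent for human annotation."""
    pooled = {}
    for claim in claims:
        candidates = set()
        for system in systems:
            # `predict` is a hypothetical interface returning a dict of
            # {abstract_id: predicted_label} for abstracts the system flags.
            for abstract_id, label in system.predict(claim).items():
                if label in {"SUPPORTS", "REFUTES"}:
                    candidates.add(abstract_id)
        pooled[claim] = candidates
    return pooled
```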
Our main contributions and findings are as follows. (1) We introduce SCIFACT-OPEN, a new test collection for open-domain scientific claim verification, including 279 claims verified against evidence retrieved from a corpus of 500K abstracts. (2) We find that state-of-the-art models developed for SCIFACT perform substantially worse (at least 15 F1) in the open-domain setting, highlighting the need to improve upon the generalization capabilities of existing systems. (3) We identify and characterize new dataset phenomena that are likely to occur in real-world claim verification settings. These include mismatches between the specificity of a claim and a piece of evidence, and the presence of conflicting evidence (Fig. 1).
With SCIFACT-OPEN, we introduce a challenging new test set for scientific claim verification that more closely approximates how the task might be performed in real-world settings. This dataset will allow for further study of claim-evidence phenomena and model generalizability as encountered in open-domain scientific claim verification.
2 Background and Task Overview
We review the scientific claim verification task, and summarize the data collection process and modeling approaches for SCIFACT, which we build upon in this work. We elect to use the SCIFACT dataset as our starting point because of the diversity of claims in the dataset and the availability of a number of state-of-the-art models that can be used for pooled data collection. In the following, we refer to the original SCIFACT dataset as SCIFACT-ORIG.
2.1 Task definition
Given a claim c and a corpus of research abstracts A, the scientific claim verification task is to identify all abstracts in A which contain evidence relevant to c, and to predict a label y(c, a) ∈ {SUPPORTS, REFUTES} for each evidence abstract. All other abstracts are labeled y(c, a) = NEI (Not Enough Info). We will refer to a single (c, a) pair as a claim / abstract pair, or CAP. Any CAP where the abstract a provides evidence for the claim c (either SUPPORTS or REFUTES) will be called an evidentiary CAP, or ECAP. Models are evaluated on their precision, recall, and F1 in identifying and correctly labeling the evidence abstracts associated with each claim in the dataset (or equivalently, in identifying ECAPs).¹
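As a concrete illustration of this metric, the following sketch computes abstract-level, label-only precision, recall, and F1, assuming that gold and predicted ECAPs are given as dictionaries keyed by (claim_id, abstract_id); the data layout is our own assumption, not the official evaluation script.

```python
def ecap_precision_recall_f1(gold, pred):
    """Evaluation sketch: `gold` and `pred` map (claim_id, abstract_id)
    pairs to labels in {"SUPPORTS", "REFUTES"}; NEI pairs are simply absent.
    A predicted ECAP is correct only if the same pair appears in `gold`
    with the same label."""
    correct = sum(1 for cap, label in pred.items() if gold.get(cap) == label)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```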
2.2 SCIFACT-ORIG
Each claim in SCIFACT-ORIG was created by rewriting a citation sentence occurring in a scientific article, and verifying the claim against the abstracts of the cited articles. The resulting claims are diverse both in terms of their subject matter, which ranges from molecular biology to public health, and their level of specificity (see §3.3). Models are required to retrieve and label evidence from a small (roughly 5K abstract) corpus.
Models for SCIFACT-ORIG generally follow a two-stage approach to verify a given claim. First, a small collection of candidate abstracts is retrieved from the corpus using a retrieval technique like BM25 (Robertson and Zaragoza, 2009); then, a transformer-based language model (Devlin et al., 2019; Raffel et al., 2020) is trained to predict whether each retrieved document SUPPORTS, REFUTES, or contains no relevant evidence (NEI) with respect to the claim.
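The two-stage pipeline can be sketched as follows. We use the rank_bm25 package for the retrieval step purely for illustration, and leave the second-stage label predictor as a hypothetical `label_model.classify` call, since the exact transformer architectures vary across systems; none of this code is taken from the SCIFACT baselines.

```python
from rank_bm25 import BM25Okapi  # assumes the rank_bm25 package is installed

def verify_claim(claim, abstracts, label_model, k=20):
    """Two-stage sketch: (1) retrieve top-k abstracts with BM25,
    (2) label each candidate as SUPPORTS / REFUTES / NEI.

    `abstracts` maps abstract_id -> abstract text. `label_model.classify`
    is a hypothetical transformer-based classifier for (claim, abstract) pairs.
    """
    ids = list(abstracts)
    bm25 = BM25Okapi([abstracts[i].lower().split() for i in ids])
    scores = bm25.get_scores(claim.lower().split())
    top_k = sorted(zip(ids, scores), key=lambda pair: -pair[1])[:k]

    predictions = {}
    for abstract_id, _ in top_k:
        label = label_model.classify(claim, abstracts[abstract_id])
        if label != "NEI":
            predictions[abstract_id] = label  # evidentiary CAPs for this claim
    return predictions
```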
As we show in §4 and §5, a key determinant of system generalization is the negative sampling ratio. A negative sampling ratio of r indicates that the model is trained on r irrelevant CAPs for every relevant ECAP. Negative sampling has been shown to improve performance (particularly precision) on SCIFACT-ORIG (Li et al., 2021). See Appendix A.4 for additional details.
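To make the negative sampling ratio concrete, a training set with ratio r can be assembled roughly as follows; the data layout and helper names are assumptions for illustration, not the actual training code.

```python
import random

def build_training_caps(ecaps, corpus_ids, ratio, seed=0):
    """Negative sampling sketch: for every evidentiary CAP, add `ratio`
    irrelevant CAPs for the same claim, labeled NEI.

    `ecaps` is a list of (claim, abstract_id, label) triples with label in
    {"SUPPORTS", "REFUTES"}; `corpus_ids` is the set of all abstract ids.
    For simplicity, only the paired evidence abstract is excluded when
    sampling negatives."""
    rng = random.Random(seed)
    examples = []
    for claim, abstract_id, label in ecaps:
        examples.append((claim, abstract_id, label))
        negatives = rng.sample(sorted(corpus_ids - {abstract_id}), ratio)
        examples.extend((claim, neg_id, "NEI") for neg_id in negatives)
    return examples
```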
3 The SCIFACT-OPEN dataset
In this section, we describe the construction of
SCIFACT-OPEN. We report the performance of
claim verification models on SCIFACT-OPEN in §4,
and perform reliability checks on the results in §5.
¹The original SCIFACT task also requires the prediction of rationales justifying each label. Due to the expense of collecting rationale annotations, in this work we do not require rationales; we evaluate using the abstract-level label-only F1 metric described in Wadden et al. (2020).