However, manually creating a biomedical NLI dataset that focuses on mechanistic information is challenging. Table 1, which contains an actual example from our proposed dataset, highlights several difficulties. First, understanding biomedical mechanisms and the necessary experimental evidence that supports (or does not support) them requires tremendous expertise and effort (Kaushik et al., 2019). For example, the premise shown is considerably larger than the average premise in other open-domain NLI datasets such as SNLI (Bowman et al., 2015), and is packed with domain-specific information. Second, negative examples are seldom explicit in publications. Creating them manually risks introducing biases, simplistic information, and systematic omissions (Wu et al., 2021).
In this work, we introduce a novel semi-supervised procedure for the creation of biomedical NLI datasets that include mechanistic information. Our key contribution is automating the creation of negative examples that are informative without being simplistic. Intuitively, we achieve this by defining lexico-semantic constraints based on the mechanism structures in biomedical literature abstracts. Our dataset creation procedure is as follows:
(1) We extract positive entailment examples consisting of a premise and hypothesis from abstracts of PubMed publications. We focus on abstracts that contain an explicit conclusion sentence, which describes a biomedical interaction between two entities (a regulator and a regulated protein or chemical). This yields premises that are considerably larger than premises in other open-domain NLI datasets: between 3 and 15 sentences.
(2) We generate a wide range of negative examples by manipulating the structure of the underlying mechanisms, both with rules (e.g., flipping the roles of the entities in the interaction) and, more importantly, by imposing the perturbed conditions as logical constraints in a neuro-logic decoding system (Lu et al., 2021b). This battery of strategies produces a variety of negative examples that range in difficulty and thus provide an important framework for the evaluation of NLI methods. Both steps are illustrated with a simplified sketch below.
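To make the two steps concrete, the following is a minimal Python sketch, assuming sentence-split abstracts and known entity mentions. The function names, the conclusion-cue patterns, and the dictionary fields are illustrative, not the actual BioNLI pipeline; the constraint-based negatives, which rely on the constrained decoder of Lu et al. (2021b), are not reproduced here.

```python
import re

# Illustrative cue phrases for explicit conclusion sentences (an assumption,
# not the authors' actual filter).
CONCLUSION_CUES = re.compile(
    r"\b(we conclude|in conclusion|these (results|data) "
    r"(suggest|demonstrate|indicate))\b",
    re.IGNORECASE,
)

def extract_positive_example(sentences):
    """Step (1): if an abstract ends with an explicit conclusion sentence,
    use it as the hypothesis and the preceding sentences as the premise."""
    if not sentences or not CONCLUSION_CUES.search(sentences[-1]):
        return None
    premise = sentences[:-1]
    if not 3 <= len(premise) <= 15:  # keep the 3-15 sentence premise range
        return None
    return {"premise": " ".join(premise),
            "hypothesis": sentences[-1],
            "label": "entailment"}

def flip_entity_roles(hypothesis, regulator, regulated):
    """Step (2), rule-based variant: swap the regulator and the regulated
    entity in the hypothesis to produce an adversarial negative."""
    placeholder = "\x00"  # temporary token to avoid double substitution
    flipped = (hypothesis.replace(regulator, placeholder)
                         .replace(regulated, regulator)
                         .replace(placeholder, regulated))
    return {"hypothesis": flipped, "label": "not_entailment"}
```

The constraint-based strategies would instead feed the perturbed interaction to the neuro-logic decoder as logical constraints, yielding fluent but mechanistically incorrect conclusion sentences.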
We employ this procedure to create a new dataset for natural language inference (NLI) in the biomedical domain, called BioNLI. Table 1 shows an actual example from BioNLI. The dataset contains 13,489 positive entailment examples and 26,907 adversarial negative examples generated using nine different strategies. An evaluation of a sample of these negative examples by human biomedical experts indicated that 86% of these examples are indeed true negatives. We trained two state-of-the-art neural NLI classifiers on this dataset, and show that the overall F1 score remains relatively low, in the mid-70s, which indicates that this NLI task remains to be solved. Critically, we observe that the performance on the different classes of negative examples varies widely, from 97% accuracy on the simple negative examples that change the role of the entities in the hypothesis, to 55% (i.e., barely better than chance) on the negative examples generated using neuro-logic decoding. Further, given how the dataset is constructed, we can also test whether models produce consistent decisions on all adversarial negatives associated with a mechanism, giving deeper insight into model behavior (a simplified sketch of this consistency check is shown below). Thus, in addition to its importance in the biomedical field, we hope that this dataset will serve as a benchmark to test models' language understanding abilities.
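As a concrete illustration of the consistency evaluation mentioned above, the sketch below groups predictions by the mechanism they were derived from and counts a mechanism as consistent only if every associated example, positive and adversarial negative alike, is classified correctly. The `mechanism_id` field is an assumed bookkeeping key, not necessarily how BioNLI encodes this linkage.

```python
from collections import defaultdict

def mechanism_consistency(examples, predictions):
    """Fraction of mechanisms for which the model accepts the positive
    example AND rejects all adversarial negatives derived from it."""
    by_mechanism = defaultdict(list)
    for example, predicted in zip(examples, predictions):
        # mechanism_id is an assumed field linking each example to the
        # source mechanism it was generated from
        by_mechanism[example["mechanism_id"]].append(
            example["label"] == predicted
        )
    if not by_mechanism:
        return 0.0
    consistent = sum(all(flags) for flags in by_mechanism.values())
    return consistent / len(by_mechanism)
```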
2 Related Work
Previous work on NLI in scientific domains includes: medical question answering (Abacha and Demner-Fushman, 2016), entailment-based text exploration in health care (Adler et al., 2012), entailment recognition in medical texts (Abacha et al., 2015), textual inference in clinical trials (Shivade et al., 2015), NLI on medical history (Romanov and Shivade, 2018), and SciTail (Khot et al., 2018), which is created from multiple-choice science exams and web sentences. These datasets either have modest sizes (Abacha et al., 2015), target specific NLP problems such as coreference resolution or named entity extraction (Shivade et al., 2015), or make use of domain experts to generate inconsistent data, which is costly and labor-intensive. Additionally, they focus on sentence-to-sentence entailment tasks, where both the premise and the hypothesis are no longer than one sentence. Most importantly, none of these are directly aimed at inference on mechanisms in biomedical literature.
Our work is also related to NLI tasks that go beyond sentence-level entailments. For example, Yin et al. (2021) include premises longer than a sentence, but only use three simple rule-based methods to create negative samples. Yan et al. (2021) and Nie et al. (2019) use larger contexts as premises for the NLI task, but only on general-purpose domains such as news, fiction, and Wiki. On the other hand, the BioNLI dataset is an inference problem with large