
2) out-of-distribution examples.
We propose, implement, and analyze three dif-
ferent strategies for the generation and annotation
of text pairs. In the generation strategy, expert lin-
guists write original hypotheses given a premise.
In the rewrite strategy, expert linguists create con-
trastive and adversarial examples by rewriting and
re-annotating “generated” pairs. In the annota-
tion strategy, we first generate text pairs in a semi-
automated manner and then use crowd annota-
tors to determine the meaning relation. The fi-
nal INFERES corpus contains 8,055 gold-standard premise-hypothesis pairs. The core part of the corpus is expert-generated, and we make an additional effort to ensure the quality of the data and the linguistic diversity of the examples.
We provide two baselines for INFERES by fine-
tuning multilingual BERT and BETO (Spanish
BERT) transformer models. On the full dataset,
BETO obtains 72.8% accuracy, indicating that the
classification task is non-trivial. Both mBERT
and BETO perform poorly in the “hypothesis-only”
condition, indicating fewer annotation artifacts in
the corpus compared to prior work. Both systems generalize well across the different topics in INFERES, both “in-distribution” and “out-of-distribution”. We observe a substantial drop in performance on negation-based adversarial examples; however, the systems still outperform the majority-class and “hypothesis-only” baselines.
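To make the baseline setup concrete, the following is a minimal sketch of how such a fine-tuning run could look with the HuggingFace transformers library; it is not the exact training code used here. The BETO and mBERT checkpoint names are the public releases, while the dataset identifier, the column names (premise, hypothesis, label), the three-way label set, and the train/test split names are assumptions for illustration.

# A minimal sketch, not the authors' exact training code. The dataset id
# and the column names ("premise", "hypothesis", "label") are assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"   # BETO
# MODEL_NAME = "bert-base-multilingual-cased"          # mBERT baseline

dataset = load_dataset("venelink/inferes")             # hypothetical HF id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                           num_labels=3)

def encode(batch, hypothesis_only=False):
    # The "hypothesis-only" condition withholds the premise entirely,
    # so any success can only come from artifacts in the hypotheses.
    if hypothesis_only:
        return tokenizer(batch["hypothesis"], truncation=True,
                         padding="max_length", max_length=128)
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(encode, batched=True)
# Hypothesis-only run:
# encoded = dataset.map(encode, batched=True,
#                       fn_kwargs={"hypothesis_only": True})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="beto-inferes",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()

Re-running the same pipeline with the hypothesis-only encoding gives the artifact check discussed above, since the model never sees the premise.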
INFERES expands the scope of NLI research in Spanish, provides a new set of naturally occurring contrastive and adversarial examples, and facili-
tates the study of negation and coreference in the
context of NLI. As part of the corpus creation, we
also present and analyze three unique strategies
for creating examples. All our data and baseline
models are being released to the community1.
The rest of this article is organized as follows.
Section 2 discusses the related work. Section 3 formulates our objectives and introduces the different corpus-creation strategies. Section 4 describes the final corpus and its basic statistics. Section 5 presents the machine learning experimental setup and results. Section 6 is devoted to a discussion of the results and their implications. Finally, Section 7 concludes the article.
1 At https://github.com/venelink/inferes. InferES is also added as a HuggingFace dataset.
2 Related Work
The task of Recognizing Textual Entailment (RTE)
was proposed in Dagan et al. (2006) as a binary clas-
sification (“entailment” / “non-entailment”). The
RTE competition ran for seven editions (Bar Haim et al., 2006; Giampiccolo et al., 2007, 2008; Bentivogli et al., 2009, 2010, 2011). RTE was later re-
formulated as a three-way decision and ultimately
renamed Natural Language Inference in the SNLI
(Bowman et al., 2015) and the MNLI (Williams et al., 2018) corpora. Both the RTE and the NLI tasks form part of the Natural Language Understanding benchmarks GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). The NLU benchmarks attracted a lot of attention from the community, and by 2020 state-of-the-art systems reported human-level performance. Parrish
et al. (2021) proposed a “linguist-in-the-loop” cor-
pus creation to improve the quality of the data.
The “super-human” performance of NLI systems
has been questioned by a number of researchers.
Poliak et al. (2018) found that annotation artifacts
in the datasets enable the models to predict the
label by looking only at the hypothesis. McCoy et al. (2019) and Gururangan et al. (2018) demonstrated that state-of-the-art NLI systems often rely on heuristics and annotation artifacts.
Systematic approaches to evaluation propose dif-
ferent sets of stress-tests for NLI and NLU systems
(Kovatchev et al., 2018a; Naik et al., 2018; Wallace et al., 2019; Kovatchev et al., 2019; Ribeiro et al., 2020; Kovatchev et al., 2020). The attacks can be
inspired by linguistic phenomena or empirical use
cases. Systematic evaluations show that NLI and
other NLU systems often underperform on complex
linguistic phenomena such as conjunction (Saha et al., 2020), negation (Hossain et al., 2020), and coreference (Kovatchev et al., 2022). Researchers
also experimented with creating contrastive exam-
ples that differ only slightly from training examples,
but have a different label (Glockner et al., 2018; Kaushik et al., 2020; Gardner et al., 2020). Adversarially created datasets such as Adversarial NLI (Nie et al., 2020) and Dynabench NLI (Kiela et al., 2021) demonstrate that there is considerable room for improvement in NLI datasets and models.
Most of the available resources for NLI research
are in English. Conneau et al. (2018) present XNLI,
a multilingual dataset created by translating En-
glish NLI examples into other languages. The inter-
est in multilingual NLI has resulted in the creation