INFERES: A Natural Language Inference Corpus for Spanish Featuring Negation-Based Contrastive and Adversarial Examples
Venelin Kovatchev
School of Information
The University of Texas at Austin
venelin@utexas.edu
Mariona Taulé
Centre de Llenguatge i Computació
Institut de Recerca en Sistemes Complexos
Universitat de Barcelona
mtaule@ub.edu
Abstract

In this paper, we present INFERES - an original corpus for Natural Language Inference (NLI) in European Spanish. We propose, implement, and analyze a variety of corpus-creation strategies utilizing expert linguists and crowd workers. The objectives behind INFERES are to provide high-quality data and, at the same time, to facilitate the systematic evaluation of automated systems. Specifically, we focus on measuring and improving the performance of machine learning systems on negation-based adversarial examples and on their ability to generalize across out-of-distribution topics.

We train two transformer models on INFERES (8,055 gold examples) in a variety of scenarios. Our best model obtains 72.8% accuracy, leaving a lot of room for improvement. The “hypothesis-only” baseline performs only 2%-5% above the majority class, indicating far fewer annotation artifacts than prior work. We find that models trained on INFERES generalize very well across topics (both in- and out-of-distribution) and perform moderately well on negation-based adversarial examples.
1 Introduction
In the task of Natural Language Inference (NLI), an automated system has to determine the meaning relation that holds between two texts. The model has to make a three-way choice between entailment: a hypothesis (h) is true given a premise (p) (e.g., 1); contradiction: a hypothesis (h) is false given a premise (p) (e.g., 2); or neutral: the truth value of the hypothesis (h) cannot be determined solely based on the premise (p) (e.g., 3).

1. p) John goes to work every day with a car.
h) John has a job.

2. p) John goes to work every day with a car.
h) John takes the bus to go to work.

3. p) John goes to work every day with a car.
h) John has a Porsche.
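To make the three-way label scheme concrete, the examples above can be encoded as simple premise-hypothesis records. The following is a minimal illustrative sketch; the field names are our own and not the actual corpus schema.

```python
# Minimal sketch of the three-way NLI label scheme, using the examples
# above. Field names are illustrative, not the actual INFERES schema.
examples = [
    {"premise": "John goes to work every day with a car.",
     "hypothesis": "John has a job.",
     "label": "entailment"},      # h is true given p
    {"premise": "John goes to work every day with a car.",
     "hypothesis": "John takes the bus to go to work.",
     "label": "contradiction"},   # h is false given p
    {"premise": "John goes to work every day with a car.",
     "hypothesis": "John has a Porsche.",
     "label": "neutral"},         # truth of h cannot be determined from p
]
```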
NLI (formerly known as Recognizing Textual Entailment (RTE)) is one of the core tasks in the popular benchmarks for Natural Language Understanding, GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). Hundreds of machine learning systems compete on these benchmarks, advancing the state of the art in NLU.
One key limitation of NLI research is that most of the existing corpora are only for English. Limited research has been done on multilingual and non-English corpora (Peñas et al., 2006; Conneau et al., 2018; Amirkhani et al., 2020; Ham et al., 2020; Hu et al., 2020; Mahendra et al., 2021).
Another well-known issue with NLI is the quality of the existing datasets and the limitations of the models trained on them. On most NLI corpora, state-of-the-art transformer-based models can obtain quantitative results (Accuracy and F1) that equal or exceed human performance. Despite this high performance, researchers have identified numerous limitations and potential problems. Poliak et al. (2018) found that annotation artifacts in the datasets enable the models to predict the label by only looking at the hypothesis. NLI models are often prone to adversarial attacks (Williams et al., 2018) and may fail on instances that require specific linguistic capabilities (Hossain et al., 2020; Saha et al., 2020).
In this paper, we address both of these shortcomings in NLI research. We present INFERES - to the best of our knowledge, the first original NLI corpus for Spanish, not adapted from another language or task. We study prior work for strategies that can reduce annotation artifacts and increase the linguistic variety of the corpus, resulting in a dataset that is more challenging for automated systems to solve. We also design the corpus in a way that facilitates systematic evaluation of automated systems on: 1) negation-based adversarial examples;
2) out-of-distribution examples.
We propose, implement, and analyze three different strategies for the generation and annotation of text pairs. In the generation strategy, expert linguists write original hypotheses given a premise. In the rewrite strategy, expert linguists create contrastive and adversarial examples by rewriting and re-annotating “generated” pairs. In the annotation strategy, we first generate text pairs in a semi-automated manner and then use crowd annotators to determine the meaning relation. The final INFERES corpus contains 8,055 gold-standard premise-hypothesis pairs. The core part of the corpus is expert-generated, and we make an additional effort to ensure the quality of the data and the linguistic diversity of the examples.
We provide two baselines for INFERES by fine-tuning multilingual BERT (mBERT) and BETO (Spanish BERT) transformer models. On the full dataset, BETO obtains 72.8% accuracy, indicating that the classification task is non-trivial. Both mBERT and BETO perform poorly in the “hypothesis-only” condition, indicating fewer annotation artifacts in the corpus compared to prior work. Both systems generalize well across the different topics in INFERES, both “in-distribution” and “out-of-distribution”. We notice a substantial drop in performance when evaluating on negation-based adversarial examples; however, the systems still outperform the majority and “hypothesis-only” baselines.
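As a rough illustration of such a baseline, the sketch below fine-tunes BETO for three-way NLI with the HuggingFace transformers library. It is a minimal sketch, not our exact experimental setup (described in Section 5): the training rows are placeholders, the hyperparameters are library defaults, and the “hypothesis-only” condition is simulated by simply blanking the premise.

```python
# Minimal BETO fine-tuning sketch (not the paper's exact setup; the
# training rows below are placeholders for the real InferES split).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "dccuchile/bert-base-spanish-wwm-cased"  # BETO; swap in
# "bert-base-multilingual-cased" for the mBERT baseline.
LABEL2ID = {"entailment": 0, "contradiction": 1, "neutral": 2}
HYPOTHESIS_ONLY = False  # True: blank the premise to probe annotation artifacts

train_rows = [  # placeholder; the real InferES training pairs go here
    {"premise": "John goes to work every day with a car.",
     "hypothesis": "John has a job.", "label": "entailment"},
]

tokenizer = AutoTokenizer.from_pretrained(MODEL)

def encode(batch):
    # Encode (premise, hypothesis) as a sentence pair; in the
    # hypothesis-only probe the premise is replaced by an empty string.
    premises = ["" for _ in batch["premise"]] if HYPOTHESIS_ONLY else batch["premise"]
    enc = tokenizer(premises, batch["hypothesis"], truncation=True, max_length=128)
    enc["labels"] = [LABEL2ID[lab] for lab in batch["label"]]
    return enc

train_set = Dataset.from_list(train_rows).map(
    encode, batched=True, remove_columns=["premise", "hypothesis", "label"])

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="beto-inferes", num_train_epochs=3),
    train_dataset=train_set,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```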
INFERES expands the scope of NLI research in Spanish, provides a new set of naturally occurring contrastive and adversarial examples, and facilitates the study of negation and coreference in the context of NLI. As part of the corpus creation, we also present and analyze three unique strategies for creating examples. All our data and baseline models are released to the community.¹
The rest of this article is organized as follows. Section 2 discusses the related work. Section 3 formulates our objectives and introduces the different corpus-creation strategies. Section 4 describes the final corpus and basic statistics about it. Section 5 presents the machine learning experimental setup and results. Section 6 is devoted to a discussion of the results and their implications. Finally, Section 7 concludes the article.
¹ At https://github.com/venelink/inferes. InferES is also available as a HuggingFace dataset.
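For convenience, the HuggingFace copy of the corpus should be loadable with the datasets library. Note that the dataset identifier in the sketch below is an assumption based on the GitHub handle, not a confirmed id.

```python
# Hedged sketch: the dataset id "venelink/inferes" is an assumption
# derived from the GitHub handle, not a confirmed identifier.
from datasets import load_dataset

inferes = load_dataset("venelink/inferes")
print(inferes)              # available splits and their sizes
print(inferes["train"][0])  # one premise-hypothesis pair with its label
```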
2 Related Work
The task of Recognizing Textual Entailment (RTE) was proposed in Dagan et al. (2006) as a binary classification (“entailment” / “non-entailment”). The RTE competition ran for seven editions (Bar Haim et al., 2006; Giampiccolo et al., 2007, 2008; Bentivogli et al., 2009, 2010, 2011). RTE was later reformulated as a three-way decision and ultimately renamed Natural Language Inference in the SNLI (Bowman et al., 2015) and the MNLI (Williams et al., 2018) corpora. Both the RTE and the NLI tasks form part of the Natural Language Understanding benchmarks GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). The NLU benchmarks attracted a lot of attention from the community, and by 2020 the state-of-the-art systems reported human-level performance. Parrish et al. (2021) proposed a “linguist-in-the-loop” corpus creation process to improve the quality of the data.

The “super-human” performance of NLI systems has been questioned by a number of researchers. Poliak et al. (2018) found that annotation artifacts in the datasets enable the models to predict the label by only looking at the hypothesis. McCoy et al. (2019) and Gururangan et al. (2018) demonstrate that state-of-the-art NLI systems often rely on heuristics and annotation artifacts.
Systematic approaches to evaluation propose different sets of stress tests for NLI and NLU systems (Kovatchev et al., 2018a; Naik et al., 2018; Wallace et al., 2019; Kovatchev et al., 2019; Ribeiro et al., 2020; Kovatchev et al., 2020). The attacks can be inspired by linguistic phenomena or empirical use cases. Systematic evaluations show that NLI and other NLU systems often underperform on complex linguistic phenomena such as conjunction (Saha et al., 2020), negation (Hossain et al., 2020), and coreference (Kovatchev et al., 2022). Researchers have also experimented with creating contrastive examples that differ only slightly from training examples but have a different label (Glockner et al., 2018; Kaushik et al., 2020; Gardner et al., 2020). Adversarially created datasets such as Adversarial NLI (Nie et al., 2020) and Dynabench NLI (Kiela et al., 2021) demonstrate that there is a lot of room for improvement in NLI datasets and models.
Most of the available resources for NLI research are in English. Conneau et al. (2018) present XNLI, a multilingual dataset created by translating English NLI examples into other languages. The interest in multilingual NLI has resulted in the creation of some novel non-English resources, such as the Korean NLI corpus (Ham et al., 2020), the Chinese NLI corpus (Hu et al., 2020), the Persian NLI corpus (Amirkhani et al., 2020), the Indonesian NLI corpus (Mahendra et al., 2021), and the NLI corpus for indigenous languages of the Americas (Ebrahimi et al., 2022). For Spanish, the only available resources are the Spanish portion of XNLI and the SPARTE corpus for RTE (Peñas et al., 2006), which was adapted from Question Answering data.
3 Objectives and Corpus Creation
When creating INFERES, we experimented with different strategies for obtaining gold examples. To the best of our knowledge, this is the first time various annotation strategies are combined and compared in a single NLI corpus. We adopt three different approaches used in prior work: our generation strategy is similar to the original RTE and NLI corpus creation; our rewrite strategy is inspired by work on generating adversarial and contrastive examples; our annotation strategy scales well with data and allows us to compare expert- and crowd-created datasets. Our aim was to provide interesting and diverse examples that cover a large range of use cases and linguistic phenomena. We hope that INFERES can be used not only to train automated systems, but also to better understand the nature of inference. We formulated three main objectives:
O1: To create a native NLI dataset for the Spanish language. The existing resources are either an adaptation from a different task or a translation from English.

O2: To promote better data quality and corpus creation practices. We aim to create a more challenging dataset and simultaneously reduce the number of annotation artifacts.

O3: To facilitate research on negation and coreference in the context of NLI. More specifically, we focus on contrastive and adversarial examples.
3.1 Premise Extraction
In the first step of the process, we extracted a set of candidate premises. We decided to use single-sentence premises, similar to the SNLI and MNLI datasets. We defined two requirements for our premise sentences: 1) that they cover a range of different topics; and 2) that they be complex enough to entail or contradict multiple possible hypotheses.
Choice of topics
As a source for premises, we used the Spanish version of Wikipedia from October 2019. We chose six topics, covering five different domains: history, art, sports, technology, and politics. We also selected the topics in pairs, hypothesizing that this selection might facilitate the creation of contrastive examples, specifically in the context of coreference.

- Famous historical figures: Pablo Picasso (ES: Pablo Picasso) and Christopher Columbus (ES: Cristóbal Colón)
- Types of “games”: the Olympic Games (ES: Juegos Olímpicos) and video games (ES: videojuegos)
- Types of multinational “unions”: the European Union (ES: Unión Europea) and the Union of Soviet Socialist Republics (ES: Unión Soviética)
Extraction process
We extracted the main Wikipedia article for each topic and preprocessed it (sentence segmentation and tokenization) using spaCy (Honnibal and Montani, 2017). We split the text into paragraphs and discarded paragraphs that contained only one sentence or more than five sentences. Then, from each paragraph, we selected a single sentence, prioritizing sentences containing negation² where possible, otherwise selecting a sentence at random. We ensured that each selected sentence had a length between 15 and 45 tokens.
Post-processing
At the end of the extraction process, we had 471 candidate-premise sentences, as follows: 82 for Picasso, 60 for Columbus, 68 for the Olympic Games, 73 for video games, 107 for the EU, and 81 for the USSR. For each sentence, we also kept the corresponding paragraph, to enable experimental setups where we provide additional context to the machine learning models at train and test time. We also used the “context paragraphs” when generating “neutral” pairs. One of the authors manually inspected all 471 candidate-premise sentences. They manually resolved problems with sentence segmentation, removed URLs and internal Wikipedia document references, and explicitly resolved any coreferential and anaphoric ambiguities (i.e., replaced pronouns and coreferential entities with an unambiguous referent).
² To check for negation, we used a simple keyword-based search, using a list of the most common negative particles, adverbs, and verbs in Spanish. The list is available at https://github.com/venelink/inferes