
2) out-of-distribution examples.
We propose, implement, and analyze three dif-
ferent strategies for the generation and annotation
of text pairs. In the generation strategy, expert lin-
guists write original hypotheses given a premise.
In the rewrite strategy, expert linguists create con-
trastive and adversarial examples by rewriting and
re-annotating “generated” pairs. In the annota-
tion strategy, we first generate text pairs in a semi-
automated manner and then use crowd annota-
tors to determine the meaning relation. The fi-
nal INFERES corpus contains 8,055 gold-standard premise-hypothesis pairs. The core part of the corpus is expert-generated, and we make an additional effort to ensure the quality of the data and the linguistic diversity of the examples.
We provide two baselines for INFERES by fine-
tuning multilingual BERT and BETO (Spanish
BERT) transformer models. On the full dataset,
BETO obtains 72.8% accuracy, indicating that the
classification task is non-trivial. Both mBERT
and BETO perform poorly in the “hypothesis-only”
condition, indicating fewer annotation artifacts in
the corpus compared to prior work. Both systems generalize well across the different topics in INFERES, both “in-distribution” and “out-of-distribution”. We observe a substantial drop in performance on negation-based adversarial examples; however, the systems still outperform the majority-class and “hypothesis-only” baselines.
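To make the baseline setup concrete, the following is a minimal sketch of how such a fine-tuning run could look with the HuggingFace transformers library; it is not the exact training code used here. The BETO and mBERT checkpoint names are the public releases, while the dataset identifier, the column names (premise, hypothesis, label), the three-way label set, and the train/test split names are assumptions for illustration.

# A minimal sketch, not the authors' exact training code. The dataset id
# and the column names ("premise", "hypothesis", "label") are assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"   # BETO
# MODEL_NAME = "bert-base-multilingual-cased"          # mBERT baseline

dataset = load_dataset("venelink/inferes")             # hypothetical HF id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                           num_labels=3)

def encode(batch, hypothesis_only=False):
    # The "hypothesis-only" condition withholds the premise entirely,
    # so any success can only come from artifacts in the hypotheses.
    if hypothesis_only:
        return tokenizer(batch["hypothesis"], truncation=True,
                         padding="max_length", max_length=128)
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(encode, batched=True)
# Hypothesis-only run:
# encoded = dataset.map(encode, batched=True,
#                       fn_kwargs={"hypothesis_only": True})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="beto-inferes",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()

Re-running the same pipeline with the hypothesis-only encoding gives the artifact check discussed above, since the model never sees the premise.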
INFERES expands the scope of NLI research in Spanish, provides a new set of naturally occurring contrastive and adversarial examples, and facili-
tates the study of negation and coreference in the
context of NLI. As part of the corpus creation, we
also present and analyze three unique strategies
for creating examples. All our data and baseline
models are being released to the community1.
The rest of this article is organized as follows.
Section 2 discusses the related work. Section 3 formulates our objectives and introduces the different corpus-creation strategies. Section 4 describes the final corpus and its basic statistics. Section 5 presents the machine learning experimental setup and results. Section 6 is devoted to a discussion of the results and their implications. Finally, Section 7 concludes the article.
1 At https://github.com/venelink/inferes. InferES is also added as a HuggingFace dataset.
2 Related Work
The task of Recognizing Textual Entailment (RTE)
was proposed in Dagan et al. (2006) as a binary clas-
sification (“entailment” / “non-entailment”). The
RTE competition ran for seven editions (Bar Haim et al., 2006; Giampiccolo et al., 2007, 2008; Bentivogli et al., 2009, 2010, 2011). RTE was later re-
formulated as a three-way decision and ultimately
renamed Natural Language Inference in the SNLI
(Bowman et al., 2015) and the MNLI (Williams et al., 2018) corpora. Both the RTE and the NLI tasks form part of the Natural Language Understanding benchmarks GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). The NLU benchmarks attracted a lot of attention from the community, and by 2020 state-of-the-art systems reported human-level performance. Parrish
et al. (2021) proposed a “linguist-in-the-loop” cor-
pus creation to improve the quality of the data.
The “super-human” performance of NLI systems
has been questioned by a number of researchers.
Poliak et al. (2018) found that annotation artifacts
in the datasets enable the models to predict the
label by looking only at the hypothesis. McCoy et al. (2019) and Gururangan et al. (2018) demonstrated that state-of-the-art NLI systems often rely on heuristics and annotation artifacts.
Systematic approaches to evaluation propose dif-
ferent sets of stress-tests for NLI and NLU systems
(Kovatchev et al., 2018a; Naik et al., 2018; Wallace et al., 2019; Kovatchev et al., 2019; Ribeiro et al., 2020; Kovatchev et al., 2020). The attacks can be
inspired by linguistic phenomena or empirical use
cases. Systematic evaluations show that NLI and
other NLU systems often underperform on complex
linguistic phenomena such as conjunction (Saha et al., 2020), negation (Hossain et al., 2020), and coreference (Kovatchev et al., 2022). Researchers
also experimented with creating contrastive exam-
ples that differ only slightly from training examples,
but have a different label (Glockner et al., 2018; Kaushik et al., 2020; Gardner et al., 2020). Adversarially created datasets such as Adversarial NLI (Nie et al., 2020) and Dynabench NLI (Kiela et al., 2021) demonstrate that there is considerable room for improvement in NLI datasets and models.
Most of the available resources for NLI research
are in English. Conneau et al. (2018) present XNLI,
a multilingual dataset created by translating En-
glish NLI examples into other languages. The inter-
est in multilingual NLI has resulted in the creation