UU-Tax at SemEval-2022 Task 3: Improving the generalizability of
language models for taxonomy classification through data augmentation
Injy Sarhan1,2, Pablo Mosteiro1, and Marco Spruit1,3,4
1Department of Information and Computing Sciences, Utrecht University, The Netherlands.
2Arab Academy for Science, Technology, and Maritime Transport, Egypt.
3Department of Public Health and Primary Care, Leiden University Medical Center,
The Netherlands.
4Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands.
i.a.a.sarhan@uu.nl, p.mosteiro@uu.nl, m.r.spruit@lumc.nl
Abstract
This paper presents our strategy to address the
SemEval-2022 Task 3 PreTENS: Presupposed
Taxonomies Evaluating Neural Network Se-
mantics. The goal of the task is to identify if a
sentence is deemed acceptable or not, depend-
ing on the taxonomic relationship that holds
between a noun pair contained in the sentence.
For sub-task 1—binary classification—we pro-
pose an effective way to enhance the robust-
ness and the generalizability of language mod-
els for better classification on this downstream
task. We design a two-stage fine-tuning pro-
cedure on the ELECTRA language model us-
ing data augmentation techniques. Rigorous
experiments are carried out using multi-task
learning and data-enriched fine-tuning. Ex-
perimental results demonstrate that our pro-
posed model, UU-Tax, is indeed able to gen-
eralize well for our downstream task. For sub-
task 2—regression—we propose a simple clas-
sifier that trains on features obtained from Uni-
versal Sentence Encoder (USE). In addition
to describing the submitted systems, we dis-
cuss other experiments that employ pre-trained
language models and data augmentation tech-
niques. For both sub-tasks, we perform error
analysis to further understand the behaviour of
the proposed models. We achieved a global
F1-Binary score of 91.25% in sub-task 1 and a
Spearman's rho score of 0.221 in sub-task 2.
1 Introduction
Predicting the semantic relationship between words
in a sentence is essential for Natural Language
Processing (NLP) tasks. Deep neural language
models accomplish outstanding results in multiple
tasks involving semantics evaluation. The question
posed by the shared task Presupposed Taxonomies:
Evaluating Neural Network Semantics (PreTENS)
is whether neural models can detect the taxonomic
relationship between nouns, especially in scenarios
1 Our implementation of UU-Tax is publicly available at
https://github.com/IS5882/UU-TAX.
where the pattern and/or the set of nouns in the
sentence is previously unseen (Zamparelli et al.,
2022). Sub-task 1 is a simpler classification task,
while sub-task 2 is a more complex regression task.
Both sub-tasks involve datasets in English, French
and Italian. For each sub-task, teams are permitted
three submissions. For each submission, the score
is averaged over the three languages. The highest
score from the three submissions is reported.
We propose a series of models based on pre-
trained language models. We enhance the provided
datasets using state-of-the-art data augmentation
tools, and further increase the dataset size by em-
ploying translations. The aim of both steps is to
create slightly modified versions of the sentences,
such that the model can learn alternative forms of
nouns and patterns.
For the classification task (sub-task 1), we ob-
tained the 3rd place, with an F1-Binary score of
91.25% averaged over the three languages. For
the regression task (sub-task 2), we obtained the
5th place, with a Spearman's correlation coeffi-
cient ρ of 0.221 averaged over the three languages.
Sub-task 2 is markedly more difficult than sub-
task 1 due to sentences that can be ambiguous,
such as I like dogs, but not chihuahuas; some hu-
mans will judge this sentence as acceptable, while
some will not. We attempt to solve both tasks by
employing data augmentation techniques in order
to help the models understand variations in text.
Our main contributions are: (i) we devise a special
development-validation split to emulate the real sit-
uation in which the model must face new words
and patterns, and (ii) we combine various data aug-
mentation tools to allow the models to learn from
various versions of the training dataset.
In Section 2 we present the task details and some
of the related work that was done previously. In
Section 3 we motivate our choice of models. The
experiments we performed are described in Section 4.
Results and conclusions are presented in Sections 5
and 6.
arXiv:2210.03378v1 [cs.CL] 7 Oct 2022
2 Background
For the present task, we are provided with a list of
sentences following a set of patterns, all of which
have two slots for noun phrases. One such sentence
might be:
I don’t like beer, a special kind of drink.
The pattern corresponding to this sentence would
be: I don’t like [blank], a special kind of [blank].
Sentences are labeled according to whether the
taxonomic relation between the two nouns makes
sense. In sub-task 1, labels are binary; a sen-
tence such as that shown above has a label of 1,
while this sentence would have a label of 0: I like
huskies, and dogs too. In sub-task 2, labels
are continuous, ranging from 1 to 7; these scores
are based on a seven-point Likert scale, judged by
humans via crowdsourcing. The same dataset is
presented in English, Italian and French. For sub-
task 1, the training and test sets consist of 5 838
and 14 556 sentences, respectively; for sub-task 2,
the training and test sets consist of 524 and 1 009
sentences, respectively.
There are two challenges to this dataset: (i) The
test dataset is much bigger than the training dataset,
and (ii) There are unseen patterns and noun pairs in
the test set. The combination of these hampers the
ability of machine learning (ML) models trained
on the training set to generalize well to the test set.
Indeed, that is the aim of this task: to evaluate the
ability of language models to generalize to new
data when it comes to inferring taxonomies.
One way to conceptualize the PreTENS task is
to reformulate it as a taxonomy extraction task with
pattern classification and distributed word represen-
tations. For a given sentence, extract the noun pair
and the pattern from the sentence, and then deter-
mine if the taxonomic relation between the nouns
matches the relations allowed by the pattern. This
formulation is motivated by previous work in taxon-
omy construction that relied on various approaches
ranging from pattern-based methods and syntactic
features to word embeddings (Huang et al., 2019;
Luu et al., 2016; Roller et al., 2018). As promising
as this approach sounds for PreTENS, it involves
manual labeling of the noun-pair taxonomic rela-
tions in the training set, as we are not allowed to
use resources such as WordNet (Fellbaum, 1998)
or BabelNet (Navigli and Ponzetto, 2012).
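This reformulation can be illustrated with a small, dependency-free sketch: extract the noun pair by matching the sentence against a known template inventory. The `PATTERNS` list below is hypothetical, standing in for the task's actual set of patterns:

```python
import re

# Hypothetical template inventory; the real task defines its own pattern set.
PATTERNS = [
    "I don’t like {}, a special kind of {}",
    "I like {}, and {} too",
]

def extract_pair(sentence):
    """Return (template, noun1, noun2) if the sentence matches a known
    template, else None. Noun slots match words (optionally multi-word)."""
    for template in PATTERNS:
        # Escape the template literally, then turn each {} slot into a
        # lazy capture group for the noun filling that slot.
        regex = re.escape(template).replace(r"\{\}", r"(\w[\w ]*?)")
        m = re.fullmatch(regex, sentence)
        if m:
            return template, m.group(1), m.group(2)
    return None
```

With the noun pair and pattern in hand, the remaining (and harder) step would be deciding whether their taxonomic relation matches the pattern, which is exactly what is blocked by the ban on external resources.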
A different approach is to tackle PreTENS as
a cross-over task between extraction of lexico-
semantic relations and commonsense validation.
There have been SemEval tasks to extract and iden-
tify taxonomic relationships between given terms
(SemEval-2016 task 13) (Bordea et al., 2016), and
to validate sentences for commonsense (SemEval-
2020 task 4, sub-task A) (Wang et al., 2020). The
aim of the common-sense validation task is to iden-
tify which of two natural language statements with
similar wordings makes sense.
In the SemEval-2016 task 13, approaches re-
lated to extracting hypernym-hyponym relations
to construct a taxonomy involved both pattern-
based methods and distributional methods. TAXI
relied on extracting Hearst-style lexico-syntactic
patterns by first crawling domain-specific corpora
based on the terminology of the target domain and
later using substring matching to extract candidate
hypernym-hyponym relations (Panchenko et al.,
2016). Another team designed a semi-supervised
model based on the hypothesis that hypernyms may
be induced by adding a vector offset to the corre-
sponding hyponym word embedding (Pocostales,
2016).
Participants in the SemEval 2020 commonsense
validation task had an advantage over PreTENS
participants: they were allowed to integrate taxo-
nomic information from external resources such
as ConceptNet (Wang et al., 2020), which eased
the process of fine-tuning the language models on
the downstream task. As an example, the CN-
HIT-IT.NLP team (Zhang et al., 2020) and ECNU-
SenseMaker (Zhao et al., 2020) both used a variant
of K-BERT (Liu et al., 2020a) with additional data;
the former injects relevant triples from ConceptNet
into the language model, while the latter also uses
ConceptNet's unstructured text to pre-train the lan-
guage model. Other systems relied on ensemble
models consisting of different language models
such as RoBERTa and XLNet (Liu, 2020; Altiti
et al., 2020).
In Section 3 we outline the architectures cho-
sen to tackle the two sub-tasks of PreTENS. We
draw on previous work, as outlined above, and pro-
vide novel combinations of datasets and algorithms
to improve the performance of out-of-the-box lan-
guage models.
3 System Description
The systems we propose for both PreTENS sub-
tasks are based on language models. In sub-task 1
we use the ELECTRA (Efficiently Learning an
Encoder that Classifies Token Replacements Ac-
curately) transformer (Clark et al., 2020), while in
sub-task 2 we employ USE (Universal Sentence
Encoder) (Yang et al., 2020).
3.1 Sub-task 1: Classification
In the first sub-task—binary classification—we
were required to assign an acceptability label
for each sentence in the three languages English,
French and Italian. Of the 20 394 sentences that
were provided for sub-task 1, only 5 838 sentences
(28.61%) were available for training. This split
causes the model to be likely to encounter un-
known data formats at testing time. This is a piv-
otal challenge in PreTENS, as the robustness and
generalization of language models is an open chal-
lenge and cannot be guaranteed (Tu et al., 2020;
Ramesh Kashyap et al., 2021). In our experiments
we found that every language model we used
(BERT, RoBERTa, XLNet, and ELECTRA) failed
to generalize well to unseen datasets, even though
all of them are pre-trained on large amounts of
data. To address this challenge, we built our mod-
els based on data augmentation.
While designing our model, we split the pro-
vided training data into a development set (30%)
and a validation set (70%), to emulate the train-test
split sizes. We deliberately leave several patterns
out of the development set, including, for exam-
ple: I like [blank], and more specifically [blank].
We choose these so-called complex patterns be-
cause, during exploratory experiments, we found
that pre-trained models had trouble with them. For
example, out of the 820 instances of the aforemen-
tioned pattern in the training dataset, 750 instances
were misclassified by one of the early instances of
our model; this includes sentences where the noun
pair was included in other sentences in the training
data. We thus remove complex patterns from the
training data, to simulate a situation in which new
unseen and difficult patterns are found in the test
set.
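A minimal sketch of this pattern-aware split, assuming each row carries its sentence, pattern, and label; `pattern_split` and its arguments are illustrative names, not our exact pipeline:

```python
import random

def pattern_split(rows, holdout_patterns, dev_fraction=0.3, seed=42):
    """Split (sentence, pattern, label) rows into development and
    validation sets. Rows whose pattern is in `holdout_patterns` go
    entirely to the validation set, so the development set never sees
    those patterns -- emulating the unseen patterns in the test set."""
    held_out = [r for r in rows if r[1] in holdout_patterns]
    rest = [r for r in rows if r[1] not in holdout_patterns]
    random.Random(seed).shuffle(rest)
    n_dev = int(dev_fraction * len(rows))  # ~30% development, as in our setup
    return rest[:n_dev], rest[n_dev:] + held_out
```

The key design choice is splitting by pattern rather than by row: a random row-level split would leak every pattern into the development set and overstate generalization.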
Transformer language models like BERT (Devlin
et al., 2019) are pre-trained on two tasks:
Masked Language Modelling (MLM) and Next
Sentence Prediction (NSP). However, in subse-
quent models such as RoBERTa, training on NSP
was proven to be unnecessary; these models are
thus pre-trained solely on MLM. ELECTRA fur-
ther enhanced MLM performance while utiliz-
ing notably less computing resources for the pre-
training stage. The pre-training task in ELECTRA
is built on discovering replaced tokens in the input
sequence; to achieve this, ELECTRA deploys two
transformer models: a generator and a discrimina-
tor, where the generator is trained to substitute in-
put tokens with credible alternatives and a discrim-
inator to predict the presence or absence of substi-
tution. This setting is similar to Generative Adver-
sarial Networks (GANs) (Goodfellow et al.,2014),
with a key difference that the generator does not
attempt to trick the discriminator, making ELEC-
TRA non-adversarial. In ELECTRA, the gener-
ator parameters are only adjusted during the pre-
training phase. Fine-tuning on downstream tasks
only modifies the discriminator parameters (Clark
et al., 2020).
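The replaced-token-detection objective can be illustrated with a toy example. Here a random word substituter stands in for ELECTRA's generator (which in reality is a small masked language model proposing plausible tokens), and the targets are what the discriminator learns to predict:

```python
import random

def make_rtd_example(tokens, vocab, replace_prob=0.15, seed=0):
    """Build a replaced-token-detection training example, ELECTRA-style.

    Each token is replaced with probability `replace_prob` by a random
    word from `vocab` (a stand-in for the generator's proposals). The
    discriminator's per-token targets are 1 if the token was replaced,
    0 if it is the original."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice([w for w in vocab if w != tok]))
            targets.append(1)
        else:
            corrupted.append(tok)
            targets.append(0)
    return corrupted, targets
```

Because every position yields a 0/1 label (not just the ~15% masked positions of MLM), the discriminator receives a training signal from the full sequence, which is the source of ELECTRA's compute efficiency.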
Figure 1: Sub-task 1: The English version of the pro-
posed two-stage fine-tuning model (UU-Tax). In the
French version, the Italian and English data are trans-
lated to French, and the NLPAug tool is employed on
the provided French training set. Likewise in the Italian
version.
Multi-stage fine-tuning has proven its effective-
ness on the robustness and generalization of mod-
els (Kocijan et al., 2019; Li and Rudzicz, 2021).
We perform a two-stage fine-tuning; Figure 1 por-
trays our model work-flow. In the first stage, we
use the NLPAug tool (Ma, 2019) to generate new
sentences by making modifications to existing sen-
tences based on contextualized word embeddings.
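A dependency-free sketch of such contextual augmentation follows; the `propose` callback stands in for NLPAug's underlying language model, and this mock is illustrative rather than the tool's real API:

```python
import random

def augment(sentence, propose, action="substitute", seed=0):
    """Mimic NLPAug's 'Insertion' and 'Substitution' operations.

    `propose(left_context, right_context)` is a stand-in for a
    contextual language model that suggests the word best fitting the
    chosen position (NLPAug uses BERT-style embeddings for this)."""
    rng = random.Random(seed)
    words = sentence.split()
    if action == "substitute":
        # Replace one random word with the model's best alternative.
        i = rng.randrange(len(words))
        words[i] = propose(words[:i], words[i + 1:])
    elif action == "insert":
        # Pick a random position and insert the word fitting that context.
        i = rng.randrange(len(words) + 1)
        words.insert(i, propose(words[:i], words[i:]))
    return " ".join(words)
```

For example, `augment("I like dogs", proposer, action="insert")` yields a four-word variant of the sentence, giving the model a slightly modified surface form of the same pattern.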
There are several actions for the NLPAug tool; we
utilize the ‘Insertion’ and ‘Substitution’ operations.
The ‘Insertion’ operation picks a random position
in the sentence, and then inserts at that position the
word that best fits the local context. Meanwhile,
the ‘Substitution’ operation replaces a word in a
given sentence by the most appropriate alternative
for that word. In both operations, the word choice
is given by contextualized word embeddings, as
will be explained in Section 4.1. To avoid drifting