UU-Tax at SemEval-2022 Task 3: Improving the generalizability of
language models for taxonomy classification through data augmentation
Injy Sarhan1,2, Pablo Mosteiro1, and Marco Spruit1,3,4
1Department of Information and Computing Sciences, Utrecht University, The Netherlands.
2Arab Academy for Science, Technology, and Maritime Transport, Egypt.
3Department of Public Health and Primary Care, Leiden University Medical Center,
The Netherlands.
4Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands.
i.a.a.sarhan@uu.nl, p.mosteiro@uu.nl, m.r.spruit@lumc.nl
Abstract
This paper presents our strategy to address the
SemEval-2022 Task 3 PreTENS: Presupposed
Taxonomies Evaluating Neural Network Se-
mantics. The goal of the task is to identify if a
sentence is deemed acceptable or not, depend-
ing on the taxonomic relationship that holds
between a noun pair contained in the sentence.
For sub-task 1—binary classification—we pro-
pose an effective way to enhance the robust-
ness and the generalizability of language mod-
els for better classification on this downstream
task. We design a two-stage fine-tuning pro-
cedure on the ELECTRA language model us-
ing data augmentation techniques. Rigorous
experiments are carried out using multi-task
learning and data-enriched fine-tuning. Ex-
perimental results demonstrate that our pro-
posed model, UU-Tax, is indeed able to gen-
eralize well for our downstream task. For sub-
task 2—regression—we propose a simple clas-
sifier that trains on features obtained from Uni-
versal Sentence Encoder (USE). In addition
to describing the submitted systems, we dis-
cuss other experiments that employ pre-trained
language models and data augmentation tech-
niques. For both sub-tasks, we perform error
analysis to further understand the behaviour of
the proposed models. We achieved a global
F1-Binary score of 91.25% in sub-task 1 and a
Spearman's rho score of 0.221 in sub-task 2.
1 Introduction
Predicting the semantic relationship between words
in a sentence is essential for Natural Language
Processing (NLP) tasks. Deep neural language
models accomplish outstanding results in multiple
tasks involving semantics evaluation. The question
posed by the shared task Presupposed Taxonomies:
Evaluating Neural Network Semantics (PreTENS)
is whether neural models can detect the taxonomic
relationship between nouns, especially in scenarios
1 Our implementation of UU-Tax is publicly available at
https://github.com/IS5882/UU-TAX.
where the pattern and/or the set of nouns in the
sentence is previously unseen (Zamparelli et al.,
2022). Sub-task 1 is a simpler classification task,
while sub-task 2 is a more complex regression task.
Both sub-tasks involve datasets in English, French
and Italian. For each sub-task, teams are permitted
three submissions. For each submission, the score
is averaged over the three languages. The highest
score from the three submissions is reported.
We propose a series of models based on pre-
trained language models. We enhance the provided
datasets using state-of-the-art data augmentation
tools, and further increase the dataset size by em-
ploying translations. The aim of both steps is to
create slightly modified versions of the sentences,
such that the model can learn alternative forms of
nouns and patterns.
For the classification task (sub-task 1), we ob-
tained the 3rd place, with an F1-Binary score of
91.25% averaged over the three languages. For
the regression task (sub-task 2), we obtained the
5th place, with a Spearman's correlation coeffi-
cient ρ of 0.221 averaged over the three languages.
Sub-task 2 is markedly more difficult than sub-
task 1 due to sentences that can be ambiguous,
such as I like dogs, but not chihuahuas; some hu-
mans will judge this sentence as acceptable, while
some will not. We attempt to solve both tasks by
employing data augmentation techniques in order
to help the models understand variations in text.
Our main contributions are: (i) we devise a special
development-validation split to emulate the real sit-
uation in which the model must face new words
and patterns, and (ii) we combine various data aug-
mentation tools to allow the models to learn from
various versions of the training dataset.
In Section 2 we present the task details and some
of the related work that was done previously. In
Section 3 we motivate our choice of models. The
experiments we performed are described in Section 4.
Results and conclusions are presented in Sections 5
and 6.
arXiv:2210.03378v1 [cs.CL] 7 Oct 2022
2 Background
For the present task, we are provided with a list of
sentences following a set of patterns, all of which
have two slots for noun phrases. One such sentence
might be:
I don’t like beer, a special kind of drink.
The pattern corresponding to this sentence would
be: I don’t like [blank], a special kind of [blank].
Sentences are labeled according to whether the
taxonomic relation between the two nouns makes
sense. In sub-task 1, labels are binary; a sen-
tence such as that shown above has a label of 1,
while this sentence would have a label of 0: I like
huskies, and dogs too. In sub-task 2, labels
are continuous, ranging from 1 to 7; these scores
are based on a seven-point Likert scale, judged by
humans via crowdsourcing. The same dataset is
presented in English, Italian and French. For sub-
task 1, the training and test sets consist of 5 838
and 14 556 sentences, respectively; for sub-task 2,
the training and test sets consist of 524 and 1 009
sentences, respectively.
There are two challenges to this dataset: (i) The
test dataset is much bigger than the training dataset,
and (ii) There are unseen patterns and noun pairs in
the test set. The combination of these hampers the
ability of machine learning (ML) models trained
on the training set to generalize well to the test set.
Indeed, that is the aim of this task: to evaluate the
ability of language models to generalize to new
data when it comes to inferring taxonomies.
One way to conceptualize the PreTENS task is
to reformulate it as a taxonomy extraction task with
pattern classification and distributed word represen-
tations. For a given sentence, extract the noun pair
and the pattern from the sentence, and then deter-
mine if the taxonomic relation between the nouns
matches the relations allowed by the pattern. This
formulation is motivated by previous work in taxon-
omy construction that relied on various approaches
ranging from pattern-based methods and syntactic
features to word embeddings (Huang et al., 2019;
Luu et al., 2016; Roller et al., 2018). As promising
as this approach sounds for PreTENS, it involves
manual labeling of the noun-pair taxonomic rela-
tions in the training set, as we are not allowed to
use resources such as WordNet (Fellbaum, 1998)
or BabelNet (Navigli and Ponzetto, 2012).
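This reformulation can be illustrated with a small, dependency-free sketch: extract the noun pair by matching the sentence against a known template inventory. The `PATTERNS` list below is hypothetical, standing in for the task's actual set of patterns:

```python
import re

# Hypothetical template inventory; the real task defines its own pattern set.
PATTERNS = [
    "I don’t like {}, a special kind of {}",
    "I like {}, and {} too",
]

def extract_pair(sentence):
    """Return (template, noun1, noun2) if the sentence matches a known
    template, else None. Noun slots match words (optionally multi-word)."""
    for template in PATTERNS:
        # Escape the template literally, then turn each {} slot into a
        # lazy capture group for the noun filling that slot.
        regex = re.escape(template).replace(r"\{\}", r"(\w[\w ]*?)")
        m = re.fullmatch(regex, sentence)
        if m:
            return template, m.group(1), m.group(2)
    return None
```

With the noun pair and pattern in hand, the remaining (and harder) step would be deciding whether their taxonomic relation matches the pattern, which is exactly what is blocked by the ban on external resources.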
A different approach is to tackle PreTENS as
a cross-over task between extraction of lexico-
semantic relations and commonsense validation.
There have been SemEval tasks to extract and iden-
tify taxonomic relationships between given terms
(SemEval-2016 task 13) (Bordea et al., 2016), and
to validate sentences for commonsense (SemEval-
2020 task 4, sub-task A) (Wang et al., 2020). The
aim of the common-sense validation task is to iden-
tify which of two natural language statements with
similar wordings makes sense.
In the SemEval-2016 task 13, approaches re-
lated to extracting hypernym-hyponym relations
to construct a taxonomy involved both pattern-
based methods and distributional methods. TAXI
relied on extracting Hearst-style lexico-syntactic
patterns by first crawling domain-specific corpora
based on the terminology of the target domain and
later using substring matching to extract candidate
hypernym-hyponym relations (Panchenko et al.,
2016). Another team designed a semi-supervised
model based on the hypothesis that hypernyms may
be induced by adding a vector offset to the corre-
sponding hyponym word embedding (Pocostales,
2016).
Participants in the SemEval 2020 commonsense
validation task had an advantage over PreTENS
participants: they were allowed to integrate taxo-
nomic information from external resources such
as ConceptNet (Wang et al., 2020), which eased
the process of fine-tuning the language models on
the downstream task. As an example, the CN-
HIT-IT.NLP team (Zhang et al., 2020) and ECNU-
SenseMaker (Zhao et al., 2020) both used a variant
of K-BERT (Liu et al., 2020a) with additional data;
the former injects relevant triples from ConceptNet
into the language model, while the latter also uses
ConceptNet's unstructured text to pre-train the lan-
guage model. Other systems relied on ensemble
models consisting of different language models
such as RoBERTa and XLNet (Liu, 2020; Altiti
et al., 2020).
In Section 3 we outline the architectures cho-
sen to tackle the two sub-tasks of PreTENS. We
draw on previous work, as outlined above, and pro-
vide novel combinations of datasets and algorithms
to improve the performance of out-of-the-box lan-
guage models.
3 System Description
The systems we propose for both PreTENS sub-
tasks are based on language models. In sub-task 1
we use the ELECTRA (Efficiently Learning an
Encoder that Classifies Token Replacements Ac-
curately) transformer (Clark et al., 2020), while in
sub-task 2 we employ USE (Universal Sentence
Encoder) (Yang et al., 2020).
3.1 Sub-task 1: Classification
In the first sub-task—binary classification—we
were required to assign an acceptability label
for each sentence in the three languages English,
French and Italian. Of the 20 394 sentences that
were provided for sub-task 1, only 5 838 sentences
(28.61%) were available for training. This split
causes the model to be likely to encounter un-
known data formats at testing time. This is a piv-
otal challenge in PreTENS, as the robustness and
generalization of language models is an open chal-
lenge and cannot be guaranteed (Tu et al., 2020;
Ramesh Kashyap et al., 2021). In our experiments
we found that every language model we used
(BERT, RoBERTa, XLNet, and ELECTRA) failed
to generalize well to unseen datasets, even though
all of them are pre-trained on large amounts of
data. To address this challenge, we built our mod-
els based on data augmentation.
While designing our model, we split the pro-
vided training data into a development set (30%)
and a validation set (70%), to emulate the train-test
split sizes. We deliberately leave several patterns
out of the development set, including, for exam-
ple: I like [blank], and more specifically [blank].
We choose these so-called complex patterns be-
cause, during exploratory experiments, we found
that pre-trained models had trouble with them. For
example, out of the 820 instances of the aforemen-
tioned pattern in the training dataset, 750 instances
were misclassified by one of the early instances of
our model; this includes sentences where the noun
pair was included in other sentences in the training
data. We thus remove complex patterns from the
training data, to simulate a situation in which new
unseen and difficult patterns are found in the test
set.
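A minimal sketch of this pattern-aware split, assuming each row carries its sentence, pattern, and label; `pattern_split` and its arguments are illustrative names, not our exact pipeline:

```python
import random

def pattern_split(rows, holdout_patterns, dev_fraction=0.3, seed=42):
    """Split (sentence, pattern, label) rows into development and
    validation sets. Rows whose pattern is in `holdout_patterns` go
    entirely to the validation set, so the development set never sees
    those patterns -- emulating the unseen patterns in the test set."""
    held_out = [r for r in rows if r[1] in holdout_patterns]
    rest = [r for r in rows if r[1] not in holdout_patterns]
    random.Random(seed).shuffle(rest)
    n_dev = int(dev_fraction * len(rows))  # ~30% development, as in our setup
    return rest[:n_dev], rest[n_dev:] + held_out
```

The key design choice is splitting by pattern rather than by row: a random row-level split would leak every pattern into the development set and overstate generalization.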
Transformer language models like BERT (Devlin
et al., 2019) are pre-trained on two tasks:
Masked Language Modelling (MLM) and Next
Sentence Prediction (NSP). However, in subse-
quent models such as RoBERTa, training on NSP
was proven to be unnecessary; these models are
thus pre-trained solely on MLM. ELECTRA fur-
ther enhanced MLM performance while utiliz-
ing notably less computing resources for the pre-
training stage. The pre-training task in ELECTRA
is built on discovering replaced tokens in the input
sequence; to achieve this, ELECTRA deploys two
transformer models: a generator and a discrimina-
tor, where the generator is trained to substitute in-
put tokens with credible alternatives and a discrim-
inator to predict the presence or absence of substi-
tution. This setting is similar to Generative Adver-
sarial Networks (GANs) (Goodfellow et al.,2014),
with a key difference that the generator does not
attempt to trick the discriminator, making ELEC-
TRA non-adversarial. In ELECTRA, the gener-
ator parameters are only adjusted during the pre-
training phase. Fine-tuning on downstream tasks
only modifies the discriminator parameters (Clark
et al., 2020).
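The replaced-token-detection objective can be illustrated with a toy example. Here a random word substituter stands in for ELECTRA's generator (which in reality is a small masked language model proposing plausible tokens), and the targets are what the discriminator learns to predict:

```python
import random

def make_rtd_example(tokens, vocab, replace_prob=0.15, seed=0):
    """Build a replaced-token-detection training example, ELECTRA-style.

    Each token is replaced with probability `replace_prob` by a random
    word from `vocab` (a stand-in for the generator's proposals). The
    discriminator's per-token targets are 1 if the token was replaced,
    0 if it is the original."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice([w for w in vocab if w != tok]))
            targets.append(1)
        else:
            corrupted.append(tok)
            targets.append(0)
    return corrupted, targets
```

Because every position yields a 0/1 label (not just the ~15% masked positions of MLM), the discriminator receives a training signal from the full sequence, which is the source of ELECTRA's compute efficiency.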
Figure 1: Sub-task 1: The English version of the pro-
posed two-stage fine-tuning model (UU-Tax). In the
French version, the Italian and English data are trans-
lated to French, and the NLPAug tool is employed on
the provided French training set. Likewise in the Italian
version.
Multi-stage fine-tuning has proven its effective-
ness on the robustness and generalization of mod-
els (Kocijan et al., 2019; Li and Rudzicz, 2021).
We perform a two-stage fine-tuning; Figure 1 por-
trays our model work-flow. In the first stage, we
use the NLPAug tool (Ma, 2019) to generate new
sentences by making modifications to existing sen-
tences based on contextualized word embeddings.
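A dependency-free sketch of such contextual augmentation follows; the `propose` callback stands in for NLPAug's underlying language model, and this mock is illustrative rather than the tool's real API:

```python
import random

def augment(sentence, propose, action="substitute", seed=0):
    """Mimic NLPAug's 'Insertion' and 'Substitution' operations.

    `propose(left_context, right_context)` is a stand-in for a
    contextual language model that suggests the word best fitting the
    chosen position (NLPAug uses BERT-style embeddings for this)."""
    rng = random.Random(seed)
    words = sentence.split()
    if action == "substitute":
        # Replace one random word with the model's best alternative.
        i = rng.randrange(len(words))
        words[i] = propose(words[:i], words[i + 1:])
    elif action == "insert":
        # Pick a random position and insert the word fitting that context.
        i = rng.randrange(len(words) + 1)
        words.insert(i, propose(words[:i], words[i:]))
    return " ".join(words)
```

For example, `augment("I like dogs", proposer, action="insert")` yields a four-word variant of the sentence, giving the model a slightly modified surface form of the same pattern.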
There are several actions for the NLPAug tool; we
utilize the ‘Insertion’ and ‘Substitution’ operations.
The ‘Insertion’ operation picks a random position
in the sentence, and then inserts at that position the
word that best fits the local context. Meanwhile,
the ‘Substitution’ operation replaces a word in a
given sentence by the most appropriate alternative
for that word. In both operations, the word choice
is given by contextualized word embeddings, as
will be explained in Section 4.1. To avoid drifting