Model and Data Transfer for Cross-Lingual Sequence Labelling in
Zero-Resource Settings
Iker García-Ferrero Rodrigo Agerri German Rigau
HiTZ Basque Center for Language Technologies - Ixa NLP Group
University of the Basque Country UPV/EHU
{ iker.garciaf, rodrigo.agerri, german.rigau }@ehu.eus
Abstract
Zero-resource cross-lingual transfer approaches aim to apply supervised models from a source language to unlabelled target languages. In this paper we perform an in-depth study of the two main techniques employed so far for cross-lingual zero-resource sequence labelling, based either on data or model transfer. Although previous research has proposed translation and annotation projection (data-based cross-lingual transfer) as an effective technique for cross-lingual sequence labelling, in this paper we experimentally demonstrate that high-capacity multilingual language models applied in a zero-shot (model-based cross-lingual transfer) setting consistently outperform data-based cross-lingual transfer approaches. A detailed analysis of our results suggests that this might be due to important differences in language use. More specifically, machine translation often generates a textual signal which is different to what the models are exposed to when using gold standard data, which affects both the fine-tuning and evaluation processes. Our results also indicate that data-based cross-lingual transfer approaches remain a competitive option when high-capacity multilingual language models are not available.
1 Introduction
Sequence labelling is the task of assigning a label to each token in a given input sequence. Sequence labelling is a fundamental process in many downstream NLP tasks. Currently, most successful approaches for this task apply supervised deep neural networks (Lample et al., 2016; Akbik et al., 2018; Devlin et al., 2019; Conneau et al., 2020). However, as was the case for supervised statistical approaches (Agerri and Rigau, 2016), their performance still depends on the amount of manually annotated training data. Additionally, deep neural models still show a significant loss of performance when evaluated on out-of-domain data (Liu et al., 2021). To improve their performance, it would therefore be necessary to develop very costly manually annotated data for each language and domain of application. Thus, considering that for most of the languages in the world manually annotated corpora are simply nonexistent (Joshi et al., 2020), the task of developing sequence labelling models for languages and domain-specific tasks for which supervised data is not available remains a challenge of great interest. This task is known as zero-resource cross-lingual sequence labelling.

Figure 1: In the data-based transfer approach we translate the gold data and project its labels into the target language, and use the resulting silver data to train a model for the target language. In the model-based transfer approach we train a model with gold data in English and use it in a zero-shot setting on the target language.
Data-based cross-lingual transfer methods aim to automatically generate labelled data for a target language. Previous works on data-based transfer have proposed translation and annotation projection as an effective technique for zero-resource cross-lingual sequence labelling (Jain et al., 2019; Fei et al., 2020). In this setting, as illustrated in Figure 1, the idea is to translate gold-labelled text into the target language and then, using automatic word alignments, project the labels from the source into the target language. The result is an automatically generated dataset in the target language that can be used for training a sequence labelling model.
The emergence of multilingual language models (Devlin et al., 2019; Conneau et al., 2020) allows for model-based cross-lingual transfer. As Figure 1 illustrates, using labelled data in one source language (usually English), it is possible to fine-tune a pre-trained multilingual model that is directly used to make predictions in any of the languages included in the model. This is also known as zero-shot cross-lingual sequence labelling.
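As a minimal sketch (not the exact setup used in our experiments), the following snippet illustrates model-based zero-shot transfer with the Hugging Face transformers library, assuming a multilingual XLM-RoBERTa checkpoint that has already been fine-tuned for NER on English data; the checkpoint path is hypothetical.

```python
# Minimal sketch of model-based (zero-shot) cross-lingual transfer.
# Assumption: an XLM-RoBERTa model fine-tuned on English NER data already exists;
# the checkpoint path below is hypothetical.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/xlm-roberta-large-finetuned-en-ner",  # hypothetical checkpoint
    aggregation_strategy="simple",  # merge word pieces into entity spans
)

# No labelled data in the target language is needed: the same fine-tuned model
# is applied directly to, e.g., Spanish text.
print(ner("Gabriel García Márquez nació en Aracataca, Colombia."))
```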
In this work we present an in-depth study of both approaches using the latest advancements in machine translation, word aligners and multilingual language models. We focus on two sequence labelling tasks, namely, Named Entity Recognition (NER) and Opinion Target Extraction (OTE). In order to do so, we present a data-based cross-lingual transfer approach consisting of translating gold-labelled data between English and 7 other languages using state-of-the-art machine translation systems. Sequence labelling annotations are then automatically projected for every language pair. Additionally, we produced manual alignments for the 4 languages for which we had expert annotators. After translation and projection, for the data-transfer approach we fine-tune multilingual language models using the automatically generated datasets. We then compare the performance obtained for each of the target languages against the performance of the zero-shot cross-lingual method, which consists of fine-tuning the multilingual language models on the English gold data and generating the predictions in the required target languages.
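For illustration, the translation step of this pipeline can be sketched as follows, assuming an off-the-shelf English-to-Spanish MT model from the Hugging Face hub; the concrete MT systems used in our experiments may differ.

```python
# Minimal sketch of the translation step for data-based transfer.
# Assumption: an off-the-shelf English-to-Spanish MT model from the Hugging Face
# hub; this model choice is only illustrative.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

gold_sentences = ["Obama visited Paris last week."]
translations = [out["translation_text"] for out in translator(gold_sentences)]
print(translations)  # e.g. ['Obama visitó París la semana pasada.']
```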
The main contributions of our work are the following. First, we empirically establish the required conditions for each of these two approaches, data-transfer and zero-shot model-based, to outperform the other. In this sense, our experiments show that, contrary to what previous research suggested (Fei et al., 2020; Li et al., 2021), the zero-shot model-based approach obtains the best results when high-capacity multilingual models including the target language and domain are available. Second, when the performance of the multilingual language model is not optimal for the specific target language or domain (for example, when working on a text genre and domain for which available language models have not been trained), or when the hardware required to work with high-capacity language models is not easily accessible, data-transfer based on translate-and-project constitutes a competitive option. Third, we observe that machine translation often generates training and test data which is, due to important differences in language use, markedly different to the signal received when using gold standard data in the target language. These discrepancies seem to explain the larger error rate of the translate-and-project method with respect to the zero-shot technique. Finally, we create manually projected datasets for four languages and automatically projected datasets for seven languages. We use them to train and evaluate cross-lingual sequence labelling models. Additionally, they are also used to extrinsically evaluate machine translation and word alignment systems. These new datasets, together with the code to generate them, are publicly available to facilitate the reproducibility of our results and their use in future research.¹

¹ https://github.com/ikergarcia1996/Easy-Label-Projection and https://github.com/ikergarcia1996/Easy-Translate
2 Related work
2.1 Data-based cross-lingual transfer
Data-based cross-lingual transfer methods aim to automatically generate labelled data for a target language. Some of these methods exploit parallel data. Ehrmann et al. (2011) automatically annotate the English version of a multi-parallel corpus and project the annotations into all the other languages using statistical alignments of phrases. Wang and Manning (2014) project model expectations rather than labels, which facilitates the transfer of model uncertainty across languages. Ni et al. (2017) use a heuristic scheme that effectively selects good-quality projection-labeled data from noisy data. They also project word embeddings from a target language into a source language, so that the source-language sequence labelling system can be applied to the target language without re-training. Agerri et al. (2018) use parallel data from multiple languages as source to project the labelled data to a target language, showing that the combination of multiple sources improves the quality of the projections. Li et al. (2021) use the XLM-R model (Conneau et al., 2020) for labelling sequences in the source part of the parallel data and also for annotation projection.

Instead of relying on parallel data, Jain et al. (2019) and Fei et al. (2020) use machine translation to automatically translate the sentences of a gold-labelled dataset to the target languages. The translated data is then annotated by projecting the gold labels from the source dataset. For this purpose, Jain et al. (2019) first generate a list of projection candidates by orthographic and phonetic similarity. They choose the best matching candidate based on distributional statistics derived from the dataset. Fei et al. (2020) leverage the word alignment probabilities calculated with FastAlign (Dyer et al., 2013) and the POS tag distributions of the source and target words.

High-quality parallel data or machine translation systems are not always available. Thus, Xie et al. (2018) propose to find word translations based on bilingual word embeddings. Alternatively, Guo and Roth (2021) translate labelled data in a word-by-word manner with a dictionary. Then, they construct the target-language text from the source-language annotations with a constrained pretrained language model.
2.2 Model-based transfer
Language models trained on monolingual corpora in many languages (Devlin et al., 2019; Conneau et al., 2020) allow zero-shot cross-lingual model transfer. Task-specific data in one language is used to fine-tune the model for evaluation in another language (Pires et al., 2019). The zero-shot cross-lingual capability can be improved for the sequence labelling task using different techniques. The approaches of Wang et al. (2019) and Ouyang et al. (2021) use monolingual corpora to improve the alignment of the language representations within a multilingual model. Instead of using a single source model, Rahimi et al. (2019) propose to use many models from many source languages to improve the zero-shot transfer to a new language. They learn to infer which are the most reliable models in an unsupervised manner. Wu et al. (2020) take advantage of a Teacher-Student learning approach: NER models in the source languages are used as teachers to train a student model on unlabeled data in the target language. Bari et al. (2021) propose an unsupervised data augmentation framework to improve the cross-lingual adaptation of models using self-training. Hu et al. (2021) use the minimum risk training framework to overcome the gap between the source and the target languages/domains. They propose a unified learning algorithm based on expectation maximization.
Using low-capacity multilingual language models such as mBERT, Fei et al. (2020) find that their data-based cross-lingual transfer approach is superior to the zero-shot transfer method. However, using XLM-RoBERTa, a higher-capacity multilingual model, Li et al. (2021) obtain the best results for German and Chinese by applying the data-based cross-lingual transfer approach, while the zero-shot approach is best for Spanish and Dutch. We extend their research on zero-resource settings with two different sequence labelling tasks, seven languages and three multilingual models of different capacity. Our experiments and the error analysis carried out establish the required conditions under which the zero-shot and data-transfer approaches outperform each other.
3 Translation and projection method
Our data-based cross-lingual transfer method to perform cross-lingual sequence labelling is the following: we assume our source language to be English, for which we have train and development data. Furthermore, we also assume that the only gold-labelled data available for the target language is the evaluation set. In this setting, we automatically generate data for the target language by translating the gold-labelled English data. Then we project the gold labels from the source sentences to the translated sentences by leveraging automatic word alignments. Given a sentence $x = \langle x_1, \ldots, x_n \rangle$ with length $n$ in the source language and a translated sentence $y = \langle y_1, \ldots, y_m \rangle$ with length $m$ in the target language, we use a word aligner to find a set of pairs $A = \{\langle x_i, y_j \rangle : x_i \in x,\ y_j \in y\}$ where, for each word pair $\langle x_i, y_j \rangle$, $y_j$ is the lexical translation of $x_i$. Next, given a sequence $s = \langle x_a, \ldots, x_b \rangle \in x$ labelled with a category $C$, we will label the sequence $t = \langle y_c, \ldots, y_d \rangle \in y$ with category $C$ if $\forall y_j \in t\ \exists x_i \in s : \langle x_i, y_j \rangle \in A$. In other
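A minimal sketch of this projection step is shown below (illustrative Python, not our released code), assuming the word aligner returns alignments as a set of (source index, target index) pairs; for simplicity it labels the contiguous target span covering all tokens aligned to a source span.

```python
# Minimal, illustrative sketch of annotation projection with word alignments.
# Assumptions: BIO-encoded source labels and alignments given as
# (source_idx, target_idx) pairs from an automatic word aligner; names are hypothetical.
from typing import List, Set, Tuple


def project_labels(
    source_labels: List[str],            # BIO labels of the source (English) sentence
    target_length: int,                  # number of tokens in the translated sentence
    alignments: Set[Tuple[int, int]],    # the alignment set A
) -> List[str]:
    """Project labelled spans from the source sentence onto the translated sentence."""
    target_labels = ["O"] * target_length

    # Collect the labelled source spans (start, end_inclusive, category C).
    spans = []
    i = 0
    while i < len(source_labels):
        if source_labels[i].startswith("B-"):
            category = source_labels[i][2:]
            j = i
            while j + 1 < len(source_labels) and source_labels[j + 1] == f"I-{category}":
                j += 1
            spans.append((i, j, category))
            i = j + 1
        else:
            i += 1

    # Label the contiguous target span covering all tokens aligned to the source span.
    for start, end, category in spans:
        aligned = sorted({t for (s, t) in alignments if start <= s <= end})
        if not aligned:
            continue  # no alignment found: the span is dropped
        target_labels[aligned[0]] = f"B-{category}"
        for t in range(aligned[0] + 1, aligned[-1] + 1):
            target_labels[t] = f"I-{category}"

    return target_labels


# Toy usage: "Obama visited Paris" -> "Obama visitó París"
print(project_labels(["B-PER", "O", "B-LOC"], 3, {(0, 0), (1, 1), (2, 2)}))
# ['B-PER', 'O', 'B-LOC']
```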