Model and Data Transfer for Cross-Lingual Sequence Labelling in
Zero-Resource Settings
Iker García-Ferrero Rodrigo Agerri German Rigau
HiTZ Basque Center for Language Technologies - Ixa NLP Group
University of the Basque Country UPV/EHU
{ iker.garciaf, rodrigo.agerri, german.rigau }@ehu.eus
Abstract
Zero-resource cross-lingual transfer approaches aim to apply supervised models from a source language to unlabelled target languages. In this paper we perform an in-depth study of the two main techniques employed so far for cross-lingual zero-resource sequence labelling, based either on data or model transfer. Although previous research has proposed translation and annotation projection (data-based cross-lingual transfer) as an effective technique for cross-lingual sequence labelling, in this paper we experimentally demonstrate that high-capacity multilingual language models applied in a zero-shot (model-based cross-lingual transfer) setting consistently outperform data-based cross-lingual transfer approaches. A detailed analysis of our results suggests that this might be due to important differences in language use. More specifically, machine translation often generates a textual signal which is different to what the models are exposed to when using gold standard data, which affects both the fine-tuning and evaluation processes. Our results also indicate that data-based cross-lingual transfer approaches remain a competitive option when high-capacity multilingual language models are not available.
1 Introduction
Sequence labelling is the task of assigning a label to each token in a given input sequence. Sequence labelling is a fundamental process in many downstream NLP tasks. Currently, most successful approaches for this task apply supervised deep neural networks (Lample et al., 2016; Akbik et al., 2018; Devlin et al., 2019; Conneau et al., 2020). However, as was the case for supervised statistical approaches (Agerri and Rigau, 2016), their performance still depends on the amount of manually annotated training data. Additionally, deep neural models still show a significant loss of performance when evaluated on out-of-domain data (Liu et al., 2021). To improve their performance, it would therefore be necessary to develop very costly manually annotated data for each language and domain of application. Thus, considering that for most of the languages in the world manually annotated corpora are simply nonexistent (Joshi et al., 2020), the task of developing sequence labelling models for languages and domain-specific tasks for which supervised data is not available remains a challenge of great interest. This task is known as zero-resource cross-lingual sequence labelling.

Figure 1: In the data-based transfer approach we translate the gold data and project its labels into the target language, and use the resulting silver data to train a model for the target language. In the model-based transfer approach we train a model with gold data in English and use it in a zero-shot setting on the target language.
Data-based cross-lingual transfer methods aim to automatically generate labelled data for a target language. Previous works on data-based transfer have proposed translation and annotation projection as an effective technique for zero-resource cross-lingual sequence labelling (Jain et al., 2019; Fei et al., 2020). In this setting, as illustrated in Figure 1, the idea is to translate gold-labelled text into the target language and then, using automatic word alignments, project the labels from the source into the target language. The result is an automatically generated dataset in the target language that can be used for training a sequence labelling model.
The emergence of multilingual language models (Devlin et al., 2019; Conneau et al., 2020) allows for model-based cross-lingual transfer. As Figure 1 illustrates, using labelled data in one source language (usually English), it is possible to fine-tune a pre-trained multilingual model that is directly used to make predictions in any of the languages included in the model. This is also known as zero-shot cross-lingual sequence labelling.
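As a minimal sketch (not the exact setup used in our experiments), the following snippet illustrates model-based zero-shot transfer with the Hugging Face transformers library, assuming a multilingual XLM-RoBERTa checkpoint that has already been fine-tuned for NER on English data; the checkpoint path is hypothetical.

```python
# Minimal sketch of model-based (zero-shot) cross-lingual transfer.
# Assumption: an XLM-RoBERTa model fine-tuned on English NER data already exists;
# the checkpoint path below is hypothetical.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/xlm-roberta-large-finetuned-en-ner",  # hypothetical checkpoint
    aggregation_strategy="simple",  # merge word pieces into entity spans
)

# No labelled data in the target language is needed: the same fine-tuned model
# is applied directly to, e.g., Spanish text.
print(ner("Gabriel García Márquez nació en Aracataca, Colombia."))
```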
In this work we present an in-depth study of both approaches using the latest advancements in machine translation, word aligners and multilingual language models. We focus on two sequence labelling tasks, namely, Named Entity Recognition (NER) and Opinion Target Extraction (OTE). In order to do so, we present a data-based cross-lingual transfer approach consisting of translating gold-labelled data between English and 7 other languages using state-of-the-art machine translation systems. Sequence labelling annotations are then automatically projected for every language pair. Additionally, we produced manual alignments for the 4 languages for which we had expert annotators. After translation and projection, for the data-transfer approach we fine-tune multilingual language models using the automatically generated datasets. We then compare the performance obtained for each of the target languages against the performance of the zero-shot cross-lingual method, which consists of fine-tuning the multilingual language models on the English gold data and generating the predictions in the required target languages.
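For illustration, the translation step of this pipeline can be sketched as follows, assuming an off-the-shelf English-to-Spanish MT model from the Hugging Face hub; the concrete MT systems used in our experiments may differ.

```python
# Minimal sketch of the translation step for data-based transfer.
# Assumption: an off-the-shelf English-to-Spanish MT model from the Hugging Face
# hub; this model choice is only illustrative.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

gold_sentences = ["Obama visited Paris last week."]
translations = [out["translation_text"] for out in translator(gold_sentences)]
print(translations)  # e.g. ['Obama visitó París la semana pasada.']
```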
The main contributions of our work are the following. First, we empirically establish the required conditions for each of these two approaches, data-transfer and zero-shot model-based, to outperform the other. In this sense, our experiments show that, contrary to what previous research suggested (Fei et al., 2020; Li et al., 2021), the zero-shot model-based approach obtains the best results when high-capacity multilingual models including the target language and domain are available. Second, when the performance of the multilingual language model is not optimal for the specific target language or domain (for example, when working on a text genre and domain for which available language models have not been trained), or when the hardware required to work with high-capacity language models is not easily accessible, data-transfer based on translate-and-project constitutes a competitive option. Third, we observe that machine translation often generates training and test data which is, due to important differences in language use, markedly different to the signal received when using gold standard data in the target language. These discrepancies seem to explain the larger error rate of the translate-and-project method with respect to the zero-shot technique. Finally, we create manually projected datasets for four languages and automatically projected datasets for seven languages. We use them to train and evaluate cross-lingual sequence labelling models. Additionally, they are also used to extrinsically evaluate machine translation and word alignment systems. These new datasets, together with the code to generate them, are publicly available to facilitate the reproducibility of our results and their use in future research.¹

¹ https://github.com/ikergarcia1996/Easy-Label-Projection and https://github.com/ikergarcia1996/Easy-Translate
2 Related work
2.1 Data-based cross-lingual transfer
Data-based cross-lingual transfer methods aim to automatically generate labelled data for a target language. Some of these methods exploit parallel data. Ehrmann et al. (2011) automatically annotate the English version of a multi-parallel corpus and project the annotations into all the other languages using statistical alignments of phrases. Wang and Manning (2014) project model expectations rather than labels, which facilitates the transfer of model uncertainty across languages. Ni et al. (2017) use a heuristic scheme that effectively selects good-quality projection-labeled data from noisy data. They also project word embeddings from a target language into a source language, so that the source-language sequence labelling system can be applied to the target language without re-training. Agerri et al. (2018) use parallel data from multiple languages as source to project the labelled data to a target language, showing that the combination of multiple sources improves the quality of the projections. Li et al. (2021) use the XLM-R model (Conneau et al., 2020) for labelling sequences in the source part of the parallel data and also for annotation projection.

Instead of relying on parallel data, Jain et al. (2019) and Fei et al. (2020) use machine translation to automatically translate the sentences of a gold-labelled dataset to the target languages. The translated data is then annotated by projecting the gold labels from the source dataset. For this purpose, Jain et al. (2019) first generate a list of projection candidates by orthographic and phonetic similarity. They choose the best matching candidate based on distributional statistics derived from the dataset. Fei et al. (2020) leverage the word alignment probabilities calculated with FastAlign (Dyer et al., 2013) and the POS tag distributions of the source and target words.

High-quality parallel data or machine translation systems are not always available. Thus, Xie et al. (2018) propose to find word translations based on bilingual word embeddings. Alternatively, Guo and Roth (2021) translate labelled data in a word-by-word manner with a dictionary. Then, they construct the target-language text from the source-language annotations with a constrained pretrained language model.
2.2 Model-based transfer
Language models trained on monolingual corpora in many languages (Devlin et al., 2019; Conneau et al., 2020) allow zero-shot cross-lingual model transfer. Task-specific data in one language is used to fine-tune the model for evaluation in another language (Pires et al., 2019). The zero-shot cross-lingual capability can be improved for the sequence labelling task using different techniques. The approaches of Wang et al. (2019) and Ouyang et al. (2021) use monolingual corpora to improve the alignment of the language representations within a multilingual model. Instead of using a single source model, Rahimi et al. (2019) propose to use many models from many source languages to improve the zero-shot transfer to a new language. They learn to infer which are the most reliable models in an unsupervised manner. Wu et al. (2020) take advantage of a Teacher-Student learning approach: NER models in the source languages are used as teachers to train a student model on unlabeled data in the target language. Bari et al. (2021) propose an unsupervised data augmentation framework to improve the cross-lingual adaptation of models using self-training. Hu et al. (2021) use the minimum risk training framework to overcome the gap between the source and the target languages/domains. They propose a unified learning algorithm based on expectation maximization.
Using low-capacity multilingual language models such as mBERT, Fei et al. (2020) find that their data-based cross-lingual transfer approach is superior to the zero-shot transfer method. However, using XLM-RoBERTa, a higher-capacity multilingual model, Li et al. (2021) obtain the best results for German and Chinese by applying the data-based cross-lingual transfer approach, while the zero-shot approach is best for Spanish and Dutch. We extend their research on zero-resource settings with two different sequence labelling tasks, seven languages and three multilingual models of different capacity. Our experiments and the error analysis carried out establish the required conditions under which the zero-shot and data-transfer approaches outperform each other.
3 Translation and projection method
Our data-based cross-lingual transfer method to perform cross-lingual sequence labelling is the following: we assume our source language to be English, for which we have train and development data. Furthermore, we also assume that the only gold-labelled data available for the target language is the evaluation set. In this setting, we automatically generate data for the target language by translating the gold-labelled English data. Then we project the gold labels from the source sentences to the translated sentences by leveraging automatic word alignments. Given a sentence $x = \langle x_1, \ldots, x_n \rangle$ with length $n$ in the source language and a translated sentence $y = \langle y_1, \ldots, y_m \rangle$ with length $m$ in the target language, we use a word aligner to find a set of pairs $A = \{\langle x_i, y_j \rangle : x_i \in x,\ y_j \in y\}$ where, for each word pair $\langle x_i, y_j \rangle$, $y_j$ is the lexical translation of $x_i$. Next, given a sequence $s = \langle x_a, \ldots, x_b \rangle \in x$ labelled with a category $C$, we will label the sequence $t = \langle y_c, \ldots, y_d \rangle \in y$ with category $C$ if $\forall y_j \in t\ \exists x_i \in s : \langle x_i, y_j \rangle \in A$. In other
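A minimal sketch of this projection step is shown below (illustrative Python, not our released code), assuming the word aligner returns alignments as a set of (source index, target index) pairs; for simplicity it labels the contiguous target span covering all tokens aligned to a source span.

```python
# Minimal, illustrative sketch of annotation projection with word alignments.
# Assumptions: BIO-encoded source labels and alignments given as
# (source_idx, target_idx) pairs from an automatic word aligner; names are hypothetical.
from typing import List, Set, Tuple


def project_labels(
    source_labels: List[str],            # BIO labels of the source (English) sentence
    target_length: int,                  # number of tokens in the translated sentence
    alignments: Set[Tuple[int, int]],    # the alignment set A
) -> List[str]:
    """Project labelled spans from the source sentence onto the translated sentence."""
    target_labels = ["O"] * target_length

    # Collect the labelled source spans (start, end_inclusive, category C).
    spans = []
    i = 0
    while i < len(source_labels):
        if source_labels[i].startswith("B-"):
            category = source_labels[i][2:]
            j = i
            while j + 1 < len(source_labels) and source_labels[j + 1] == f"I-{category}":
                j += 1
            spans.append((i, j, category))
            i = j + 1
        else:
            i += 1

    # Label the contiguous target span covering all tokens aligned to the source span.
    for start, end, category in spans:
        aligned = sorted({t for (s, t) in alignments if start <= s <= end})
        if not aligned:
            continue  # no alignment found: the span is dropped
        target_labels[aligned[0]] = f"B-{category}"
        for t in range(aligned[0] + 1, aligned[-1] + 1):
            target_labels[t] = f"I-{category}"

    return target_labels


# Toy usage: "Obama visited Paris" -> "Obama visitó París"
print(project_labels(["B-PER", "O", "B-LOC"], 3, {(0, 0), (1, 1), (2, 2)}))
# ['B-PER', 'O', 'B-LOC']
```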