Graph-Based Multilingual Label Propagation for Low-Resource
Part-of-Speech Tagging
Ayyoob Imani*1, Silvia Severini*1, Masoud Jalili Sabet1, François Yvon2, Hinrich Schütze1
1Center for Information and Language Processing (CIS), LMU Munich, Germany
2Université Paris-Saclay, CNRS, LISN, France
{ayyoob, silvia, masoud}@cis.lmu.de, francois.yvon@limsi.fr
*Equal contribution.
Abstract
Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source languages to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.
1 Introduction
In many applications, Part-of-Speech (POS) tagging is an important part of the NLP pipeline. In recent years, high-accuracy POS taggers have been developed owing to advances in machine learning methods that combine pretraining on large unlabeled corpora and supervised fine-tuning on well-curated annotated datasets. This methodology only applies to a handful of high-resource (HR) languages for which the necessary training data exists, leaving behind the majority of low-resource (LR) languages. When training resources are scarce, an established method for training POS taggers is to automatically generate the training data via cross-lingual transfer (Yarowsky and Ngai, 2001; Fossum and Abney, 2005; Agić et al., 2016; Eskander et al., 2020).
[Figure 1: The sentence "actions speak louder than words" in English and its translations to Persian, German, and Turkish, aligned at the word level. The POS tags for high-resource English and German are known (e.g., actions/NOUN, speak/VERB, louder/ADV, than/PREP, words/NOUN); the tags of the Turkish and Persian words are unknown. We use a GNN to exploit this graph structure and compute POS tags for low-resource Persian and Turkish.]
Typically, POS annotations are projected through alignment links from the HR source side to the LR target side of a word-aligned parallel corpus.
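In its simplest form, such a projection can be sketched as follows; this is a toy illustration (not our method), which copies each source tag across an alignment link to the corresponding target word:

```python
# Toy sketch of classic single-pair annotation projection in the spirit of
# Yarowsky and Ngai (2001): copy each source word's POS tag to the target
# word(s) it is aligned to. Data and tags follow Figure 1.
def project_tags(src_tags, alignment, tgt_len):
    """src_tags: one POS tag per source word.
    alignment: list of (src_index, tgt_index) word-alignment links.
    Returns one tag (or None if unaligned) per target word."""
    tgt_tags = [None] * tgt_len
    for i, j in alignment:
        tgt_tags[j] = src_tags[i]
    return tgt_tags

# "actions speak louder than words" (English) -> "taten sagen mehr als worte"
src_tags = ["NOUN", "VERB", "ADV", "PREP", "NOUN"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]  # monotone 1:1 here
print(project_tags(src_tags, alignment, tgt_len=5))
# ['NOUN', 'VERB', 'ADV', 'PREP', 'NOUN']
```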
In this paper, we propose GLP (Graph Label Propagation), a novel method for transferring labels simultaneously from multiple high-resource source languages to multiple low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate POS labels associated with source language nodes to target language nodes, using a label propagation model that is formalized as a Graph Neural Network (GNN) (Scarselli et al., 2008). Nodes are represented by a diverse set of features that describe both linguistic properties and graph structural information. In a second step, we additionally employ self-learning to obtain reliable training instances in the target languages.
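To make the propagation model concrete, below is a minimal sketch, under assumptions, of a GNN of this kind: graph convolutions pass information along alignment edges, a transformer layer mixes node representations, and a linear head predicts a POS tag per node. This is not the released GLP architecture; the layer types (GCNConv), sizes, and depth are illustrative, and the sketch assumes PyTorch with torch_geometric installed.

```python
# Illustrative sketch of GNN-based label propagation with a transformer layer.
# Not the authors' exact architecture; dimensions and layer choices are assumed.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

NUM_POS_TAGS = 17  # size of the Universal Dependencies UPOS tag set

class LabelPropagationGNN(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.conv1 = GCNConv(feat_dim, hidden_dim)   # message passing over alignment edges
        self.mixer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4,
                                                batch_first=True)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, NUM_POS_TAGS)

    def forward(self, x, edge_index):
        # x: [num_nodes, feat_dim]; edge_index: [2, num_edges] alignment links
        h = self.conv1(x, edge_index).relu()
        h = self.mixer(h.unsqueeze(0)).squeeze(0)    # attend across all nodes of the graph
        h = self.conv2(h, edge_index).relu()
        return self.head(h)                          # POS logits per node

# Training would compute cross-entropy only on source-language nodes (whose
# tagger-predicted labels are known); target nodes receive the model's predictions.
```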
Our approach is based on multiparallel corpora, meaning that the translation of each sentence is available in more than two languages. We exploit the Parallel Bible Corpus (PBC) of Mayer and Cysouw (2014),¹ a multiparallel corpus covering more than 1000 languages, many of which are extremely low-resource, by which we mean that only a tiny amount of unlabeled data is available or that no language technologies exist for them at all (Joshi et al., 2020).
We evaluate our method on a diverse set of low-resource languages from multiple language families, including four languages not covered by pretrained language models (PLMs). We train POS tagging models for these languages and evaluate them against references from the Universal Dependencies corpus (Zeman et al., 2019). We compare the results of our method against multiple state-of-the-art (SOTA) cross-lingual unsupervised and semisupervised POS taggers employing different approaches like annotation projection and zero-shot transfer. Our experiments highlight the benefits of our new transfer and self-learning methods; crucially, they show that reasonably accurate POS taggers can be bootstrapped without any annotated data for a diverse set of low-resource languages, establishing a new SOTA for high-resource-to-low-resource cross-lingual POS transfer. We also assess the quality of the projected annotations with respect to "silver" references and perform an ablation study.
To summarize, our contributions are:²
• We formalize annotation projection as graph-based label propagation and introduce two new POS annotation projection models, GLP-B (GLP-Base) and GLP-SL (GLP-SelfLearning).
• We evaluate GLP-B and GLP-SL on 17 low-resource languages, including 4 languages not covered by large PLMs.
• By comparing our method with various supervised, semisupervised, and PLM-based approaches for POS tagging of low-resource languages, we establish a new SOTA for unsupervised POS tagging.
¹ We do not use PBC-specific features. Thus, our work is in principle applicable to any multiparallel corpus.
² Our code, data, and trained models are available at https://github.com/ayyoobimani/GLP-POS.
2 Related work
POS tagging
Part-of-Speech tagging aims to assign each word the proper syntactic tag in context (Manning and Schütze, 1999). For high-resource languages, for which large labeled training sets are available, high-accuracy POS tagging is achieved through supervised learning (Kondratyuk and Straka, 2019; Tsai et al., 2019).
Zero-shot transfer
In low-resource settings, one approach is to use cross-lingual transfer thanks to pretrained multilingual representations, thereby enabling zero-shot POS tagging. Kondratyuk and Straka (2019) analyze the few-shot and zero-shot performance of mBERT (Devlin et al., 2019) fine-tuning on POS tagging. We include this approach in our set of baselines below. Ebrahimi and Kann (2021) and Wang et al. (2022) analyze the zero-shot POS tagging performance of XLM-RoBERTa (Conneau et al., 2020) and propose complementary methods such as continued pretraining, vocabulary expansion, and adapter modules for better performance. We show that combining GLP with Wang et al. (2022)'s embeddings further improves our base performance.
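For reference, this zero-shot baseline amounts to fine-tuning a multilingual PLM as a token classifier on a high-resource language and applying it unchanged to a low-resource one. A hedged sketch using the standard Hugging Face transformers API follows; the label count and the omission of the fine-tuning loop are simplifications:

```python
# Sketch of zero-shot cross-lingual POS tagging with XLM-R token classification.
# Fine-tuning on high-resource data (e.g., English UD) is assumed and not shown.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=17  # one label per UPOS tag
)

# Zero-shot application to a sentence in an unseen target language:
words = "taten sagen mehr als worte".split()
inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # [1, num_subword_tokens, 17]
pred_ids = logits.argmax(dim=-1)          # predicted tag id per subword token
```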
Annotation projection
Annotation projection is another approach to annotating low-resource languages. Yarowsky and Ngai (2001) first proposed projecting annotation labels across languages, exploiting parallel corpora and word alignment. To reduce systematic transfer errors, Fossum and Abney (2005) extended this by projecting from multiple source languages. Agić et al. (2015a) and Agić et al. (2016) exploit multilingual transfer setups to bootstrap POS taggers for low-resource languages, starting from a parallel corpus and taggers and parsers for high-resource languages. Other works project labels by leveraging token- and type-level constraints (Täckström et al., 2013; Buys and Botha, 2016a; Eskander et al., 2020). The latter study notably proposes an unsupervised method for selecting training instances via cross-lingual projection and trains POS taggers exploiting contextualized word embeddings, affix embeddings, and hierarchical Brown clusters (Brown et al., 1992). This approach is also used as a baseline below.
Semi-supervised approaches have been proposed to mitigate the noise of projecting between languages. This can be achieved with auxiliary lexical resources (Täckström et al., 2013; Ganchev and Das, 2013; Wisniewski et al., 2014; Li et al., 2012) that guide unsupervised learning or act as an additional training signal (Plank and Agić, 2018). Other works combine manual and projected annotations (Garrette and Baldridge, 2013; Fang and Cohn, 2016). We outperform prior works without the use of additional resources such as dictionaries or annotations.
Graph Neural Networks
Many natural and real-world structures, such as physical systems, social networks and interactions, and molecular fingerprints, have a graph structure (Liu and Zhou, 2020). Graph neural networks have been used successfully to model them. Applications include social spammer detection (Wu et al., 2020), learning molecular fingerprints (Duvenaud et al., 2015), and human motion prediction (Li et al., 2020). Recently, GNNs have been adopted for NLP tasks such as text classification (Peng et al., 2018), sequence labeling (Zhang et al., 2018; Marcheggiani and Titov, 2017), neural machine translation (Bastings et al., 2017; Beck et al., 2018), and alignment link prediction (Imani et al., 2022). As far as we know, our work is the first to formalize the annotation projection problem as graph-based label propagation.
Multiparallel corpora
A multiparallel corpus provides the translations of a source text in more than two languages. A few such corpora (Agić and Vulić, 2019; Mayer and Cysouw, 2014; Tiedemann, 2012) provide sentence-level aligned text for hundreds or thousands of languages; for many of these languages, only a tiny amount of digitized content is available (Joshi et al., 2020). Although the amount of text found in existing multiparallel corpora is far less than in monolingual corpora, we believe that they can serve as cross-lingual bridges from which effective representations for low-resource languages can be derived. Highly multiparallel corpora have been used for expanding pretrained models to more languages (Ebrahimi and Kann, 2021; Wang et al., 2022), word alignment improvement and visualization (ImaniGooghari et al., 2021; Imani et al., 2022), embedding learning (Dufter et al., 2018), and annotation projection (Agić et al., 2015b; Severini et al., 2022).
3 Method
We now introduce our Graph Label Propagation (GLP) method, which formalizes the problem of annotation projection as graph-based label propagation. We first describe the graph structure, then the features associated with each node, and finally the architecture of our model.
[Figure 2: An example of how we represent the nodes of an alignment graph using features, for a part of the graph in Figure 1. Each node is described by a feature vector with fields such as its language, position in the sentence, XLM-R embedding, and degree.]
3.1 Problem formalization
The multilingual alignment graph (MAG) of a sentence is formalized as follows. Each sentence σ in our multiparallel corpus exists in a set L of languages.³ L contains both high-resource source languages (in Ls) and low-resource target languages (in Lt), with Ls ∪ Lt = L. Each word in these |L| versions of σ constitutes a node in our graph. We first automatically annotate the text in all the source languages using pre-existing taggers: these POS tags are node labels; they are known only for languages in Ls and unknown otherwise. We then use Eflomal (Östling and Tiedemann, 2016), an unsupervised word alignment tool, to compute alignment links for all |L|·(|L|−1)/2 language pairs (e.g., six pairs for the four languages of Figure 1): these links define the edges of our MAG. Figure 1 displays an example MAG for four languages, with English and German as sources and Turkish and Persian as targets. Note that both the word alignments and the node labels are noisy, since we do not use gold data but statistical methods to generate them.

³ |L| might be different for different sentences.
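As an illustration, a MAG for a single sentence could be assembled as follows. This is a sketch under assumptions, not the released pipeline: the alignment links are taken as given (in practice computed offline with Eflomal), and the data structures are simplified.

```python
# Sketch: build a multilingual alignment graph (MAG) for one sentence.
# versions: language -> tokenized sentence; alignments: precomputed word
# alignments for all |L|(|L|-1)/2 language pairs (here supplied by hand).
from itertools import combinations

def build_mag(versions, alignments):
    nodes = []   # (language, position, word), one node per word occurrence
    index = {}   # (language, position) -> node id
    for lang, words in versions.items():
        for pos, word in enumerate(words):
            index[(lang, pos)] = len(nodes)
            nodes.append((lang, pos, word))

    edges = []   # undirected alignment edges between node ids
    for lang1, lang2 in combinations(versions, 2):
        for i, j in alignments.get((lang1, lang2), []):
            edges.append((index[(lang1, i)], index[(lang2, j)]))
    return nodes, edges

versions = {
    "eng": "actions speak louder than words".split(),
    "deu": "taten sagen mehr als worte".split(),
}
# Toy alignment for the pair (eng, deu); real links come from the aligner.
alignments = {("eng", "deu"): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]}
nodes, edges = build_mag(versions, alignments)
```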
3.2 Features
To train graph neural networks, we represent each node using a set of features (Duong et al., 2019). Figure 2 gives a simple illustration of how nodes are represented using a feature vector; the graph in this figure is part of the original graph in Figure 1. Two types of features are considered: features that represent the inherent meaning of a node/word (word representation features) and features that describe the position of a node within the graph (graph structural features). Node representation features consist of XLM-R (Conneau et al., 2020) embeddings, the node's language, and its position within the sentence. Since XLM-R embeddings are not available for all languages, we alternatively …
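To illustrate how the two feature families could be combined into one node vector, here is a minimal sketch; the concrete feature set, encodings, and dimensions are assumptions rather than the paper's exact implementation, and the XLM-R embedding is taken as precomputed.

```python
# Sketch: assemble a node's feature vector from word-representation features
# (contextual embedding, language, position) and a graph-structural feature
# (node degree). All encodings here are illustrative choices.
import torch

def node_features(embedding: torch.Tensor,  # precomputed XLM-R vector, e.g. dim 768
                  lang_id: int, num_langs: int,
                  position: int, degree: int) -> torch.Tensor:
    lang_onehot = torch.zeros(num_langs)
    lang_onehot[lang_id] = 1.0
    structural = torch.tensor([float(position), float(degree)])
    return torch.cat([embedding, lang_onehot, structural])

# Example: the node for "sagen" (German, position 1) with degree 3
vec = node_features(torch.randn(768), lang_id=2, num_langs=4,
                    position=1, degree=3)
assert vec.shape == (768 + 4 + 2,)
```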