Graph-Based Multilingual Label Propagation for Low-Resource
Part-of-Speech Tagging
Ayyoob Imani*1, Silvia Severini*1, Masoud Jalili Sabet1, François Yvon2, Hinrich Schütze1
1Center for Information and Language Processing (CIS), LMU Munich, Germany
2Université Paris-Saclay, CNRS, LISN, France
{ayyoob, silvia, masoud}@cis.lmu.de, francois.yvon@limsi.fr
*Equal contribution.
Abstract
Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source languages to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.
1 Introduction
In many applications, Part-of-Speech (POS) tagging is an important part of the NLP pipeline. In recent years, high-accuracy POS taggers have been developed owing to advances in machine learning methods that combine pretraining on large unlabeled corpora and supervised fine-tuning on well-curated annotated datasets. This methodology only applies to a handful of high-resource (HR) languages for which the necessary training data exists, leaving behind the majority of low-resource (LR) languages. When training resources are scarce, an established method for training POS taggers is to automatically generate the training data via cross-lingual transfer (Yarowsky and Ngai, 2001; Fossum and Abney, 2005; Agić et al., 2016; Eskander et al., 2020).
[Figure 1: The sentence "actions speak louder than words" in English and its translations to Persian, German, and Turkish, aligned at the word level. The POS tags for high-resource English and German are known (e.g., actions/NOUN, speak/VERB, louder/ADV, than/PREP, words/NOUN); the tags of the Turkish and Persian words are unknown. We use a GNN to exploit this graph structure and compute POS tags for low-resource Persian and Turkish.]
Typically, POS annotations are projected through alignment links from the HR source side to the LR target side of a word-aligned parallel corpus.
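In its simplest form, such a projection can be sketched as follows; this is a toy illustration (not our method), which copies each source tag across an alignment link to the corresponding target word:

```python
# Toy sketch of classic single-pair annotation projection in the spirit of
# Yarowsky and Ngai (2001): copy each source word's POS tag to the target
# word(s) it is aligned to. Data and tags follow Figure 1.
def project_tags(src_tags, alignment, tgt_len):
    """src_tags: one POS tag per source word.
    alignment: list of (src_index, tgt_index) word-alignment links.
    Returns one tag (or None if unaligned) per target word."""
    tgt_tags = [None] * tgt_len
    for i, j in alignment:
        tgt_tags[j] = src_tags[i]
    return tgt_tags

# "actions speak louder than words" (English) -> "taten sagen mehr als worte"
src_tags = ["NOUN", "VERB", "ADV", "PREP", "NOUN"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]  # monotone 1:1 here
print(project_tags(src_tags, alignment, tgt_len=5))
# ['NOUN', 'VERB', 'ADV', 'PREP', 'NOUN']
```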
In this paper, we propose GLP (Graph Label Propagation), a novel method for transferring labels simultaneously from multiple high-resource source languages to multiple low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate POS labels associated with source language nodes to target language nodes, using a label propagation model that is formalized as a Graph Neural Network (GNN) (Scarselli et al., 2008). Nodes are represented by a diverse set of features that describe both linguistic properties and graph structural information. In a second step, we additionally employ self-learning to obtain reliable training instances in the target languages.
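To make the propagation model concrete, below is a minimal sketch, under assumptions, of a GNN of this kind: graph convolutions pass information along alignment edges, a transformer layer mixes node representations, and a linear head predicts a POS tag per node. This is not the released GLP architecture; the layer types (GCNConv), sizes, and depth are illustrative, and the sketch assumes PyTorch with torch_geometric installed.

```python
# Illustrative sketch of GNN-based label propagation with a transformer layer.
# Not the authors' exact architecture; dimensions and layer choices are assumed.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

NUM_POS_TAGS = 17  # size of the Universal Dependencies UPOS tag set

class LabelPropagationGNN(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.conv1 = GCNConv(feat_dim, hidden_dim)   # message passing over alignment edges
        self.mixer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4,
                                                batch_first=True)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, NUM_POS_TAGS)

    def forward(self, x, edge_index):
        # x: [num_nodes, feat_dim]; edge_index: [2, num_edges] alignment links
        h = self.conv1(x, edge_index).relu()
        h = self.mixer(h.unsqueeze(0)).squeeze(0)    # attend across all nodes of the graph
        h = self.conv2(h, edge_index).relu()
        return self.head(h)                          # POS logits per node

# Training would compute cross-entropy only on source-language nodes (whose
# tagger-predicted labels are known); target nodes receive the model's predictions.
```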
Our approach is based on multiparallel corpora, meaning that the translation of each sentence is available in more than two languages. We exploit the Parallel Bible Corpus (PBC) of Mayer and Cysouw (2014),¹ a multiparallel corpus covering more than 1000 languages, many of which are extremely low-resource, by which we mean that only a tiny amount of unlabeled data is available or that no language technologies exist for them at all (Joshi et al., 2020).
We evaluate our method on a diverse set of low-resource languages from multiple language families, including four languages not covered by pretrained language models (PLMs). We train POS tagging models for these languages and evaluate them against references from the Universal Dependencies corpus (Zeman et al., 2019). We compare the results of our method against multiple state-of-the-art (SOTA) cross-lingual unsupervised and semisupervised POS taggers employing different approaches like annotation projection and zero-shot transfer. Our experiments highlight the benefits of our new transfer and self-learning methods; crucially, they show that reasonably accurate POS taggers can be bootstrapped without any annotated data for a diverse set of low-resource languages, establishing a new SOTA for high-resource-to-low-resource cross-lingual POS transfer. We also assess the quality of the projected annotations with respect to "silver" references and perform an ablation study.
To summarize, our contributions are:²
• We formalize annotation projection as graph-based label propagation and introduce two new POS annotation projection models, GLP-B (GLP-Base) and GLP-SL (GLP-SelfLearning).
• We evaluate GLP-B and GLP-SL on 17 low-resource languages, including 4 languages not covered by large PLMs.
• By comparing our method with various supervised, semisupervised, and PLM-based approaches for POS tagging of low-resource languages, we establish a new SOTA for unsupervised POS tagging.
¹ We do not use PBC-specific features. Thus, our work is in principle applicable to any multiparallel corpus.
² Our code, data, and trained models are available at https://github.com/ayyoobimani/GLP-POS.
2 Related work
POS tagging
Part-of-Speech tagging aims to assign each word the proper syntactic tag in context (Manning and Schütze, 1999). For high-resource languages, for which large labeled training sets are available, high-accuracy POS tagging is achieved through supervised learning (Kondratyuk and Straka, 2019; Tsai et al., 2019).
Zero-shot transfer
In low-resource settings, one approach is to use cross-lingual transfer thanks to pretrained multilingual representations, thereby enabling zero-shot POS tagging. Kondratyuk and Straka (2019) analyze the few-shot and zero-shot performance of mBERT (Devlin et al., 2019) fine-tuning on POS tagging. We include this approach in our set of baselines below. Ebrahimi and Kann (2021) and Wang et al. (2022) analyze the zero-shot POS tagging performance of XLM-RoBERTa (Conneau et al., 2020) and propose complementary methods such as continued pretraining, vocabulary expansion, and adapter modules for better performance. We show that combining GLP with Wang et al. (2022)'s embeddings further improves our base performance.
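For reference, this zero-shot baseline amounts to fine-tuning a multilingual PLM as a token classifier on a high-resource language and applying it unchanged to a low-resource one. A hedged sketch using the standard Hugging Face transformers API follows; the label count and the omission of the fine-tuning loop are simplifications:

```python
# Sketch of zero-shot cross-lingual POS tagging with XLM-R token classification.
# Fine-tuning on high-resource data (e.g., English UD) is assumed and not shown.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=17  # one label per UPOS tag
)

# Zero-shot application to a sentence in an unseen target language:
words = "taten sagen mehr als worte".split()
inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # [1, num_subword_tokens, 17]
pred_ids = logits.argmax(dim=-1)          # predicted tag id per subword token
```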
Annotation projection
Annotation projection is another approach to annotating low-resource languages. Yarowsky and Ngai (2001) first proposed projecting annotation labels across languages, exploiting parallel corpora and word alignment. To reduce systematic transfer errors, Fossum and Abney (2005) extended this by projecting from multiple source languages. Agić et al. (2015a) and Agić et al. (2016) exploit multilingual transfer setups to bootstrap POS taggers for low-resource languages, starting from a parallel corpus and taggers and parsers for high-resource languages. Other works project labels by leveraging token- and type-level constraints (Täckström et al., 2013; Buys and Botha, 2016a; Eskander et al., 2020). The latter study notably proposes an unsupervised method for selecting training instances via cross-lingual projection and trains POS taggers exploiting contextualized word embeddings, affix embeddings, and hierarchical Brown clusters (Brown et al., 1992). This approach is also used as a baseline below.
Semi-supervised approaches have been proposed to mitigate the noise of projecting between languages. This can be achieved with auxiliary lexical resources (Täckström et al., 2013; Ganchev and Das, 2013; Wisniewski et al., 2014; Li et al., 2012) that guide unsupervised learning or act as an additional training signal (Plank and Agić, 2018). Other works combine manual and projected annotations (Garrette and Baldridge, 2013; Fang and Cohn, 2016). We outperform prior works without the use of additional resources such as dictionaries or annotations.
Graph Neural Networks
Many natural and real-world structures, such as physical systems, social networks and interactions, and molecular fingerprints, have a graph structure (Liu and Zhou, 2020). Graph neural networks have been used successfully to model them. Applications include social spammer detection (Wu et al., 2020), learning molecular fingerprints (Duvenaud et al., 2015), and human motion prediction (Li et al., 2020). Recently, GNNs have been adopted for NLP tasks such as text classification (Peng et al., 2018), sequence labeling (Zhang et al., 2018; Marcheggiani and Titov, 2017), neural machine translation (Bastings et al., 2017; Beck et al., 2018), and alignment link prediction (Imani et al., 2022). As far as we know, our work is the first to formalize the annotation projection problem as graph-based label propagation.
Multiparallel corpora
A multiparallel corpus provides the translations of a source text in more than two languages. A few such corpora (Agić and Vulić, 2019; Mayer and Cysouw, 2014; Tiedemann, 2012) provide sentence-level aligned text for hundreds or thousands of languages; for many of these languages, only a tiny amount of digitized content is available (Joshi et al., 2020). Although the amount of text found in existing multiparallel corpora is far less than in monolingual corpora, we believe that they can serve as cross-lingual bridges from which effective representations for low-resource languages can be derived. Highly multiparallel corpora have been used for expanding pretrained models to more languages (Ebrahimi and Kann, 2021; Wang et al., 2022), word alignment improvement and visualization (ImaniGooghari et al., 2021; Imani et al., 2022), embedding learning (Dufter et al., 2018), and annotation projection (Agić et al., 2015b; Severini et al., 2022).
3 Method
We now introduce our Graph Label Propagation (GLP) method, which formalizes the problem of annotation projection as graph-based label propagation. We first describe the graph structure, then the features associated with each node, and finally the architecture of our model.
[Figure 2: An example of how we represent the nodes of an alignment graph using features, for a part of the graph in Figure 1. Each node is described by a feature vector with fields such as its language, position in the sentence, XLM-R embedding, and degree.]
3.1 Problem formalization
The multilingual alignment graph (MAG) of a sentence is formalized as follows. Each sentence σ in our multiparallel corpus exists in a set L of languages.³ L contains both high-resource source languages (in Ls) and low-resource target languages (in Lt), with Ls ∪ Lt = L. Each word in these |L| versions of σ constitutes a node in our graph. We first automatically annotate the text in all the source languages using pre-existing taggers: these POS tags are node labels; they are known only for languages in Ls and unknown otherwise. We then use Eflomal (Östling and Tiedemann, 2016), an unsupervised word alignment tool, to compute alignment links for all |L|·(|L|−1)/2 language pairs (e.g., six pairs for the four languages of Figure 1): these links define the edges of our MAG. Figure 1 displays an example MAG for four languages, with English and German as sources and Turkish and Persian as targets. Note that both the word alignments and the node labels are noisy, since we do not use gold data but statistical methods to generate them.

³ |L| might be different for different sentences.
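As an illustration, a MAG for a single sentence could be assembled as follows. This is a sketch under assumptions, not the released pipeline: the alignment links are taken as given (in practice computed offline with Eflomal), and the data structures are simplified.

```python
# Sketch: build a multilingual alignment graph (MAG) for one sentence.
# versions: language -> tokenized sentence; alignments: precomputed word
# alignments for all |L|(|L|-1)/2 language pairs (here supplied by hand).
from itertools import combinations

def build_mag(versions, alignments):
    nodes = []   # (language, position, word), one node per word occurrence
    index = {}   # (language, position) -> node id
    for lang, words in versions.items():
        for pos, word in enumerate(words):
            index[(lang, pos)] = len(nodes)
            nodes.append((lang, pos, word))

    edges = []   # undirected alignment edges between node ids
    for lang1, lang2 in combinations(versions, 2):
        for i, j in alignments.get((lang1, lang2), []):
            edges.append((index[(lang1, i)], index[(lang2, j)]))
    return nodes, edges

versions = {
    "eng": "actions speak louder than words".split(),
    "deu": "taten sagen mehr als worte".split(),
}
# Toy alignment for the pair (eng, deu); real links come from the aligner.
alignments = {("eng", "deu"): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]}
nodes, edges = build_mag(versions, alignments)
```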
3.2 Features
To train graph neural networks, we represent each node using a set of features (Duong et al., 2019). Figure 2 gives a simple illustration of how nodes are represented using a feature vector; the graph in this figure is part of the original graph in Figure 1. Two types of features are considered: features that represent the inherent meaning of a node/word (word representation features) and features that describe the position of a node within the graph (graph structural features). Node representation features consist of XLM-R (Conneau et al., 2020) embeddings, the node's language, and its position within the sentence. Since XLM-R embeddings are not available for all languages, we alternatively …
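To illustrate how the two feature families could be combined into one node vector, here is a minimal sketch; the concrete feature set, encodings, and dimensions are assumptions rather than the paper's exact implementation, and the XLM-R embedding is taken as precomputed.

```python
# Sketch: assemble a node's feature vector from word-representation features
# (contextual embedding, language, position) and a graph-structural feature
# (node degree). All encodings here are illustrative choices.
import torch

def node_features(embedding: torch.Tensor,  # precomputed XLM-R vector, e.g. dim 768
                  lang_id: int, num_langs: int,
                  position: int, degree: int) -> torch.Tensor:
    lang_onehot = torch.zeros(num_langs)
    lang_onehot[lang_id] = 1.0
    structural = torch.tensor([float(position), float(degree)])
    return torch.cat([embedding, lang_onehot, structural])

# Example: the node for "sagen" (German, position 1) with degree 3
vec = node_features(torch.randn(768), lang_id=2, num_langs=4,
                    position=1, degree=3)
assert vec.shape == (768 + 4 + 2,)
```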