Annotating Norwegian Language Varieties on Twitter for Part-of-Speech
Petter Mæhlum¹, Andre Kåsen², Samia Touileb³ and Jeremy Barnes⁴
¹University of Oslo
²National Library of Norway
³University of Bergen
⁴University of the Basque Country
pettemae@ifi.uio.no, andre.kasen@nb.no,
samia.touileb@uib.no, jeremy.barnes@ehu.eus
Abstract
Norwegian Twitter data poses an interesting
challenge for Natural Language Processing
(NLP) tasks. These texts are difficult for models
trained on standardized text in one of the two
Norwegian written forms (Bokmål and Nynorsk), as
they contain both the typical variation of social
media text and a large amount of dialectal
variation. In this paper we present a novel
Norwegian Twitter dataset annotated with POS tags.
We show that models trained on Universal
Dependencies (UD) data perform worse when
evaluated against this dataset, and that models
trained on Bokmål generally perform better than
those trained on Nynorsk. We also see that
performance on dialectal tweets is comparable to
that on the written standards for some models.
Finally, we perform a detailed analysis of the
errors that models commonly make on this data.
1 Introduction
Norwegian Twitter data poses an interesting chal-
lenge for Natural Language Processing (NLP) tasks.
Not only do these data represent a set of noisy,
user-generated texts with the kinds of orthographic
variation common on social media, but a
considerable number of the tweets are also written
in dialectal Norwegian. These dialectal variants
are quite common and add another level of
difficulty for NLP models trained on clean data in
one of the two Norwegian written forms (Bokmål or
Nynorsk).
Barnes et al. (2021) compiled a dataset of tweets
classified according to whether they are written in
primarily Bokmål, Nynorsk, or a dialect of Nor-
wegian. We build upon this work by annotating
a subset for Part-of-Speech (POS). We investigate
how well available Norwegian POS tagging models,
which were trained on Bokmål and Nynorsk Universal
Dependencies data (Nivre et al., 2020), perform on
this Twitter dataset.
To this end, we use five POS models: three off-
the-shelf models, and two developed for the pur-
pose of this work. Each of these models was trained
on either a dataset of Bokmål or Nynorsk texts. We
explore the performance of each model in terms
of accuracy, and investigate which standardized
written form can be used as training data and yield
good results for non-standardized dialectal texts.
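As an illustration of the evaluation described above, the following is a minimal sketch of token-level accuracy computed separately for each language variety. The function name, the input format, and the toy tag sequences are assumptions made for this example; they are not taken from the dataset or from the models used in this work.

```python
from collections import defaultdict


def accuracy_by_variety(examples):
    """Return token-level POS accuracy per language variety.

    `examples` is an iterable of (variety, gold_tags, predicted_tags)
    triples, one per tweet. This input format is hypothetical.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for variety, gold, pred in examples:
        if len(gold) != len(pred):
            raise ValueError("gold and predicted tags must align token by token")
        for gold_tag, pred_tag in zip(gold, pred):
            total[variety] += 1
            correct[variety] += int(gold_tag == pred_tag)
    return {variety: correct[variety] / total[variety] for variety in total}


# Invented toy data: one fully correct tweet and one with a single error.
examples = [
    ("bokmål", ["VERB", "ADV", "ADV", "ADV", "PUNCT"],
               ["VERB", "ADV", "ADV", "ADV", "PUNCT"]),
    ("dialect", ["PRON", "VERB", "ADV", "PUNCT"],
                ["PRON", "VERB", "NOUN", "PUNCT"]),
]
scores = accuracy_by_variety(examples)
print(scores)
```

Reporting one score per variety, rather than a single pooled score, makes it possible to see whether a model trained on one written standard transfers to the others.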
The main contributions of this work are:
• we annotate a moderately sized Twitter dataset
  with POS labels and include metadata related
  to which language variety it belongs (Bokmål,
  Nynorsk, Dialect, or Mixed),
• we perform a detailed error analysis of common
  model errors specific to our Twitter data,
• we include our insights into the annotation
  process for POS tagging of non-standardized
  written forms,
• we release two spaCy models built on top of
  a Norwegian BERT model.
2 Background
Johannessen (1990) outlined a system for automatic
morphosyntactic analysis of Norwegian nouns in the
framework of Koskenniemi (1983). This was among the
first systems, if not the very first, to
automatically assign morphological information to
Norwegian texts. The first widely used tagger,
however, was developed within the
Taggerprosjektet¹ and came to be known as the
Oslo-Bergen Tagger² (OBT). Rather than continuing
and expanding the system of Johannessen (1990), OBT
was implemented in the framework of Karlsson
(1990). OBT was initially a rule-based Constraint
Grammar tagger for Norwegian Bokmål. Later, both
support for Norwegian Nynorsk and a statistical
disambiguation component were added (Johannessen
et al., 2012). One drawback of OBT, however, is
that it is made for written, edited text, and
therefore might not scale well to sources that are
not standardised.
¹The project ran from April 1996 to December 1998.
²https://github.com/noklesta/The-Oslo-Bergen-Tagger
arXiv:2210.06150v1 [cs.CL] 12 Oct 2022
Extending tagger coverage to spoken Norwegian
dialect transcription, on the other hand, was the
objective of both Nøklestad and Søfteland (2007)
and Kåsen et al. (2019). Both sampled data either
from the Norwegian part of the Nordic Dialect
Corpus (NDC, Johannessen et al. (2009)) or the
Language Infrastructure made Accessible (LIA)
Corpus.³
Annotations are found in the respective
treebanks of the corpora and are accounted for in
Øvrelid et al. (2018) and Kåsen et al. (2022).
Beyond Norwegian, there is a large amount of work
on the difficulty of processing noisy data from
social media (Xu et al., 2015), including the
difficulty of POS tagging on social media (Albogamy
and Ramsay, 2015), POS tagging with dialectal
variation (Jørgensen et al., 2015), and the
question of whether lexical normalization is
helpful (van der Goot et al., 2017). However, no
comparable studies currently exist for Norwegian.
3 Data
Resources for evaluating NLP pipeline tasks for
Norwegian are scarce. The only dataset avail-
able for standard NLP tasks such as POS tagging,
lemmatization, and parsing is the Norwegian De-
pendency Treebank (NDT, Solberg (2013), Solberg
et al. (2014)) that has been converted to the Uni-
versal Dependencies standard (Øvrelid and Hohle,
2016). There is, however, a notable exception when
it comes to transcribed spoken dialectal data, where
the LIA and NDC treebanks as mentioned above
are available with annotations for POS tags, mor-
phological features, lemmas, and dependency-style
syntax. Despite this, the transcribed texts in the
LIA and NDC corpora do not share the same
characteristics as the Twitter data. Twitter
contains spelling errors and emoji,⁴ along with
mentions and hashtags. We observe that although our
Twitter data contains some characteristics of
spoken Norwegian, such as subjectless sentences as
in (1), which is otherwise within the spelling
norms, the spelling conventions differ from those
of LIA and NDC, making it difficult to directly
compare the data.
³https://tekstlab.uio.no/LIA/korpus.html
⁴Emoji has recently gained some interest in the
linguistic literature (see
https://ling.auf.net/lingbuzz/005981)
(1) Kommer nok      hjem snart .
    Comes  probably home soon  .
    ‘(Unspecified) probably comes home soon.’
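The Twitter-specific material mentioned above (mentions, hashtags, emoji) affects how a tweet is split into tokens before tagging. The following is a minimal sketch, assuming a simple regular-expression tokenizer; it is not the tokenization used for this dataset, and the example tweet, user name, and hashtag are invented.

```python
import re

# Alternatives are tried left to right, so URLs, mentions, and hashtags
# are matched as single tokens before the generic word pattern applies.
TOKEN_RE = re.compile(
    r"https?://\S+"   # URLs
    r"|[@#]\w+"       # mentions (@user) and hashtags (#tag)
    r"|\w+"           # ordinary words, including letters such as å, ø, æ
    r"|[^\w\s]"       # any other single non-space symbol (punctuation, emoji)
)


def tokenize(tweet):
    """Split a tweet into tokens, keeping mentions and hashtags whole."""
    return TOKEN_RE.findall(tweet)


# Invented example built around sentence (1).
tokens = tokenize("@ola Kommer nok hjem snart 😊 #helg .")
print(tokens)
```

This sketch treats each symbol outside the word class as its own token, so emoji composed of several code points would be split, and hyphenated or apostrophized words are broken up; a production tokenizer would need to handle these cases.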
In LIA and NDC, all transcriptions are done
according to a Norwegian-based semi-phonetic
standard (Hagen et al., 2015), with strict marking
of vowel quantity, palatalization, retroflexion,
and more. We see that writers on Twitter do not
conform to any specific spelling norm when writing
in their own or another dialect. This means that
although not all traits of a dialect are faithfully
preserved, there is still much dialectal variation
in the Twitter data, as words that could have had a
common spelling are spelled according to the
author’s own preference. Phonetic differences in
particular are often not indicated on Twitter.
Because of this, we needed a separate dataset that
could be used to evaluate how various systems for
Norwegian POS tagging work on dialectal text as it
is found in real data from social media platforms.
We sampled a balanced subset of the dataset
introduced by Barnes et al. (2021), who developed
it to train a dialect classifier for Norwegian
tweets, with the aim of being able to further
investigate issues related to dealing with
dialectal data on Twitter.
This subset includes a selection of 38 tweets in
Bokmål, 31 tweets in Nynorsk, and 35 in dialects,
which together make up their full test set. We
acknowledge that the size of the dataset is small.
The POS-tagged dataset is subject to restrictions
because it contains personal information, but is
available upon request.
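To make the composition of the subset concrete, the short sketch below sums the three tweet counts stated above and computes each variety's share. Only those three counts come from the text; the variable names are illustrative.

```python
# Tweet counts per variety as stated in the text.
counts = {"Bokmål": 38, "Nynorsk": 31, "Dialect": 35}

total = sum(counts.values())

# Share of each variety in percent, rounded to one decimal place.
shares = {variety: round(100 * n / total, 1) for variety, n in counts.items()}

print(total)
print(shares)
```

The three varieties each account for roughly a third of the tweets, which is the sense in which the subset is balanced.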
3.1 Norwegian Dialects
Norwegian is considered to have four main dialect
groups based on four different traits. This has
been a controversial matter, and the four-way
divide essentially follows Christiansen (1954).
There are also recent proponents of a two-way
divide (Skjekkeland, 1997). The four-way divide
distinguishes a Northern, Middle, Western, and
Eastern group, whereas the two-way divide only
operates with a Western and an Eastern group.
However, these distinctions are based on traits
from the spoken language, and, as Mæhlum and
Røyneland (2012, p. 29) point out, there is a
discrepancy between how dialectologists and lay
people classify dialects.
What sort of dialectal traits Twitter users choose to