Annotating Norwegian Language Varieties on Twitter for Part-of-Speech

Petter Mæhlum1, Andre Kåsen2, Samia Touileb3 and Jeremy Barnes4

1 University of Oslo
2 National Library of Norway
3 University of Bergen
4 University of the Basque Country

pettemae@ifi.uio.no, andre.kasen@nb.no, samia.touileb@uib.no, jeremy.barnes@ehu.eus

Abstract

Norwegian Twitter data poses an interesting challenge for Natural Language Processing (NLP) tasks. These texts are difficult for models trained on standardized text in one of the two Norwegian written forms (Bokmål and Nynorsk), as they contain both the typical variation of social media text and a large amount of dialectal variety. In this paper we present a novel Norwegian Twitter dataset annotated with POS-tags. We show that models trained on Universal Dependencies (UD) data perform worse when evaluated against this dataset, and that models trained on Bokmål generally perform better than those trained on Nynorsk. We also see that performance on dialectal tweets is comparable to the written standards for some models. Finally, we perform a detailed analysis of the errors that models commonly make on this data.

1 Introduction

Norwegian Twitter data poses an interesting challenge for Natural Language Processing (NLP) tasks. Not only do these data represent a set of noisy, user-generated texts with the kinds of orthographic variation common on social media, but they also include a considerable number of tweets written in dialectal Norwegian. These dialectal variants are quite common and add another level of difficulty for NLP models trained on clean data in one of the two Norwegian written forms (Bokmål or Nynorsk).

Barnes et al. (2021) compiled a dataset of tweets classified according to whether they are written in primarily Bokmål, Nynorsk, or a dialect of Norwegian. We build upon this work by annotating a subset for Part-of-Speech (POS). We investigate how well available Norwegian POS tagging models, which were trained on Bokmål and Nynorsk Universal Dependencies data (Nivre et al., 2020), perform on this Twitter dataset.

To this end, we use five POS models: three off-the-shelf models, and two developed for the purpose of this work. Each of these models was trained on either a dataset of Bokmål or Nynorsk texts. We explore the performance of each model in terms of accuracy, and investigate which standardized written form can be used as training data to yield good results for non-standardized dialectal texts.
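As a minimal sketch of this kind of evaluation, token-level accuracy can be computed separately for each language variety. The toy tweets, their tags, and the `accuracy_by_variety` helper below are hypothetical placeholders and not data or code from this work.

```python
from collections import defaultdict


def accuracy_by_variety(examples):
    """Return token-level tagging accuracy for each language variety.

    `examples` is an iterable of (variety, gold_tags, predicted_tags) triples,
    where both tag sequences contain one tag per token.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for variety, gold, pred in examples:
        if len(gold) != len(pred):
            raise ValueError("gold and predicted tags must align token by token")
        total[variety] += len(gold)
        correct[variety] += sum(g == p for g, p in zip(gold, pred))
    # Varieties without any tokens are skipped to avoid division by zero.
    return {v: correct[v] / total[v] for v in total if total[v] > 0}


# Hypothetical toy input, not taken from the dataset.
toy = [
    ("bokmål", ["PRON", "VERB", "ADV", "PUNCT"], ["PRON", "VERB", "ADV", "PUNCT"]),
    ("dialect", ["PRON", "VERB", "ADV", "PUNCT"], ["NOUN", "VERB", "ADV", "PUNCT"]),
]
scores = accuracy_by_variety(toy)
```

Reporting one score per variety, rather than a single pooled number, is what makes it possible to compare how a model trained on one written standard fares on Bokmål, Nynorsk, and dialectal tweets.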

The main contributions of this work are:

• we annotate a moderately sized Twitter dataset with POS labels and include metadata related to which language variety it belongs (Bokmål, Nynorsk, Dialect, or Mixed),

• we perform a detailed error analysis of common model errors specific to our Twitter data,

• we include our insights into the annotation process for POS tagging of non-standardized written forms,

• we release two spaCy models built on top of a Norwegian BERT model.

2 Background

Johannessen (1990) outlined a system for automatic morphosyntactic analysis of Norwegian nouns in the framework of Koskenniemi (1983). This was among the first systems, if not the very first, to automatically assign morphological information to Norwegian texts. The first widely used tagger, however, was developed within the Taggerprosjektet1 and came to be known as the Oslo-Bergen Tagger2 (OBT). Rather than continuing and expanding the system of Johannessen (1990), OBT was implemented in the framework of Karlsson (1990). OBT was initially a rule-based Constraint Grammar tagger for Norwegian Bokmål.

1 The project ran from April 1996 to December 1998.

2 https://github.com/noklesta/The-Oslo-Bergen-Tagger

arXiv:2210.06150v1 [cs.CL] 12 Oct 2022

Later, both support for Norwegian Nynorsk and a statistical disambiguation component were added (Johannessen et al., 2012). One drawback of OBT, however, is that it is made for written, edited text, and therefore might not generalize well to sources that are not standardized.

Extending tagger coverage to transcriptions of spoken Norwegian dialects, on the other hand, was the objective of both Nøklestad and Søfteland (2007) and Kåsen et al. (2019). Both sampled data either from the Norwegian part of the Nordic Dialect Corpus (NDC, Johannessen et al. (2009)) or the Language Infrastructure made Accessible (LIA) Corpus.3 Annotations are found in the respective treebanks of the corpora and are accounted for in Øvrelid et al. (2018) and Kåsen et al. (2022).

Beyond Norwegian, there is a large body of work on the difficulty of processing noisy data from social media (Xu et al., 2015), including the difficulty of POS tagging on social media (Albogamy and Ramsay, 2015), POS tagging with dialectal variation (Jørgensen et al., 2015), and whether lexical normalization is helpful (van der Goot et al., 2017). However, no such studies currently exist for Norwegian.

3 Data

Resources for evaluating NLP pipeline tasks for Norwegian are scarce. The only dataset available for standard NLP tasks such as POS tagging, lemmatization, and parsing is the Norwegian Dependency Treebank (NDT, Solberg (2013), Solberg et al. (2014)), which has been converted to the Universal Dependencies standard (Øvrelid and Hohle, 2016). There is, however, a notable exception when it comes to transcribed spoken dialectal data, where the LIA and NDC treebanks mentioned above are available with annotations for POS tags, morphological features, lemmas, and dependency-style syntax. Despite this, the transcribed texts in the LIA and NDC corpora do not share the same characteristics as the Twitter data. Twitter contains spelling errors and emoji,4 along with mentions and hashtags. We observe that although our Twitter data contains some characteristics of spoken Norwegian, such as subjectless sentences as in (1), which is otherwise within the spelling norms, the spelling conventions differ from those of LIA and NDC, making it difficult to directly compare the data.

3 https://tekstlab.uio.no/LIA/korpus.html

4 Emoji has recently gained some interest in the linguistic literature (see https://ling.auf.net/lingbuzz/005981).

(1) Kommer nok      hjem snart .
    Comes  probably home soon  .
    ‘(Unspecified) probably comes home soon.’

In LIA and NDC, all transcriptions are done according to a Norwegian-based semi-phonetic standard (Hagen et al., 2015), with strict marking of vowel quantity, palatalization, retroflexion, and more. We see that writers on Twitter do not conform to any specific spelling norm when writing in their own or another dialect. This means that although not all traits of a dialect are faithfully preserved, there is still much dialectal variation in the Twitter data, as words that could have had a common spelling are spelled according to the author's own preference. Phonetic differences in particular are often not indicated on Twitter. Because of this, we needed a separate dataset that could be used to evaluate how various systems for Norwegian POS-tagging work on dialectal text as it is found in real data from social media platforms.
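One rough way to see why unstandardized spelling is hard for models trained on standardized text is to measure how many tokens fall outside a standard-language vocabulary. The sketch below is illustrative only: the tiny vocabulary, the two made-up sentences, and the `oov_rate` helper are assumptions and are not part of the dataset or the models used in this work.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of tokens (compared case-insensitively) not in `vocabulary`."""
    if not tokens:
        return 0.0
    unseen = [t for t in tokens if t.lower() not in vocabulary]
    return len(unseen) / len(tokens)


# Hypothetical vocabulary standing in for standardized Bokmål training text.
standard_vocab = {"jeg", "vet", "ikke", "hva", "du", "sier", "."}

# Made-up sentences: one in standard spelling, one with dialect-style spellings.
standard = ["Jeg", "vet", "ikke", "hva", "du", "sier", "."]
dialectal = ["Æ", "vet", "ikke", "ka", "du", "sir", "."]

standard_rate = oov_rate(standard, standard_vocab)
dialectal_rate = oov_rate(dialectal, standard_vocab)
```

In this toy case the dialect-style sentence has three of seven tokens outside the vocabulary while the standard one has none. A real analysis would use the training vocabulary of the tagger under study rather than a hand-written set.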

We sampled a balanced subset of the dataset that Barnes et al. (2021) introduced to develop a dialect classifier for Norwegian tweets, with the aim of further investigating issues related to dealing with dialectal data on Twitter. This subset comprises 38 tweets in Bokmål, 31 in Nynorsk, and 35 in dialects, which together make up their full test set. We acknowledge that the size of the dataset is small. The POS-tagged dataset is subject to restrictions because it contains personal information, but is available upon request.
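The composition of the sampled subset can be checked with a few lines of arithmetic. This is only a convenience sketch: the counts are those reported above, while the variable names are our own.

```python
# Tweet counts per variety, as reported for the sampled subset.
counts = {"Bokmål": 38, "Nynorsk": 31, "Dialect": 35}

total = sum(counts.values())
shares = {variety: n / total for variety, n in counts.items()}

for variety, share in shares.items():
    print(f"{variety}: {counts[variety]} tweets ({share:.1%})")
print(f"Total: {total} tweets")
```

The three varieties each account for roughly a third of the 104 tweets, which is the sense in which the subset is balanced.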

3.1 Norwegian Dialects

Norwegian is considered to have four main dialect groups based on four different traits. This has been a controversial matter, and the four-way divide essentially follows Christiansen (1954). There are also recent proponents of a two-way divide (Skjekkeland, 1997). The four-way distinction has a Northern, Middle, Western, and Eastern group, whereas the two-way divide only operates with a Western and Eastern group. These distinctions, however, are based on traits from the spoken language. And, as Mæhlum and Røyneland (2012, p. 29) point out, there is a discrepancy between how dialectologists and lay people classify dialects.

What sort of dialectal traits Twitter users choose to
