
Later, support for Norwegian Nynorsk and a
statistical disambiguation component were both
added (Johannessen et al., 2012). One drawback of
OBT, however, is that it was designed for written,
edited text and therefore might not transfer well
to sources that are not standardised.
Extending tagger coverage to transcriptions of
spoken Norwegian dialects, on the other hand, was
the objective of both Nøklestad and Søfteland
(2007) and Kåsen et al. (2019). Both sampled data
from either the Norwegian part of the Nordic
Dialect Corpus (NDC; Johannessen et al., 2009) or
the Language Infrastructure made Accessible (LIA)
Corpus.³ Annotations are found in the respective
treebanks of the corpora and are described in
Øvrelid et al. (2018) and Kåsen et al. (2022).
Beyond Norwegian, there is a large body of work on
the difficulty of processing noisy data from
social media (Xu et al., 2015), including the
difficulty of POS tagging on social media
(Albogamy and Ramsay, 2015), the effect of
dialectal variation (Jørgensen et al., 2015), and
whether lexical normalization is helpful (van der
Goot et al., 2017). However, no such studies
currently exist for Norwegian.
3 Data
Resources for evaluating NLP pipeline tasks for
Norwegian are scarce. The only dataset available
for standard NLP tasks such as POS tagging,
lemmatization, and parsing is the Norwegian
Dependency Treebank (NDT; Solberg, 2013; Solberg
et al., 2014), which has been converted to the
Universal Dependencies standard (Øvrelid and
Hohle, 2016). There is, however, a notable
exception for transcribed spoken dialectal data:
the LIA and NDC treebanks mentioned above are
available with annotations for POS tags,
morphological features, lemmas, and dependency-style
syntax. Despite this, the transcribed texts in the
LIA and NDC corpora do not share the same
characteristics as the Twitter data. Twitter
contains spelling errors and emoji,⁴ along with
mentions and hashtags. We observe that although
our Twitter data contains some characteristics of
spoken Norwegian, such as subjectless sentences as
in (1), which otherwise conforms to the spelling
norms, the
spelling conventions differ from those of LIA and
NDC, making it difficult to directly compare the
data.

³ https://tekstlab.uio.no/LIA/korpus.html
⁴ Emoji has recently gained some interest in the
linguistic literature (see
https://ling.auf.net/lingbuzz/005981).

(1) Kommer nok      hjem snart .
    Comes  probably home soon  .
    ‘(Unspecified) probably comes home soon.’
In LIA and NDC, all transcriptions follow a
Norwegian-based semi-phonetic standard (Hagen et
al., 2015), with strict marking of vowel quantity,
palatalization, retroflexion, and more. We see
that writers on Twitter, by contrast, do not
conform to any specific spelling norm when writing
in their own or another dialect. Although not all
traits of a dialect are faithfully preserved, this
still leads to considerable dialectal variation in
the Twitter data, as words that could have had a
common spelling are spelled according to each
author’s own preference. Phonetic differences in
particular are often not indicated on Twitter. We
therefore needed a separate dataset for evaluating
how various Norwegian POS-tagging systems perform
on dialectal text as it occurs in real data from
social media platforms.
We sampled a balanced subset of the dataset
introduced by Barnes et al. (2021), who developed
it to train a dialect classifier for Norwegian
tweets, with the aim of further investigating
issues related to dealing with dialectal data on
Twitter. This subset includes 38 tweets in Bokmål,
31 tweets in Nynorsk, and 35 in dialects, which
comprises their full test set. We acknowledge that
the dataset is small. The POS-tagged dataset is
subject to restrictions because it contains
personal information, but it is available upon
request.
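
To make the balance of the subset concrete, the
following minimal sketch tallies the three counts
reported above and prints each variety's share of
the total. The label strings and variable names are
illustrative assumptions and are not taken from the
released data.

```python
from collections import Counter

# Tweet counts per written variety, as reported in the text.
# The label strings are illustrative, not the dataset's own.
counts = Counter({"Bokmål": 38, "Nynorsk": 31, "Dialect": 35})

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.most_common():
    print(f"{label:8s} {n:3d} tweets  {shares[label]:.1%}")
print(f"{'Total':8s} {total:3d} tweets")
```

Each variety accounts for roughly a third of the
tweets, which is the sense in which the subset is
balanced.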
3.1 Norwegian Dialects
Norwegian is considered to have four main dialect
groups, based on four different traits. This
classification has been controversial, and the
four-way divide essentially follows Christiansen
(1954). There are also more recent proponents of a
two-way divide (Skjekkeland, 1997). The four-way
divide comprises a Northern, a Middle, a Western,
and an Eastern group, whereas the two-way divide
operates only with a Western and an Eastern group.
These distinctions, however, are based on traits
of the spoken language. Moreover, as Mæhlum and
Røyneland (2012, p. 29) point out, there is a
discrepancy between how dialectologists and lay
people classify dialects.
What sort of dialectal traits Twitter users choose to