makes it difficult to conduct research in the field of linguistics. There are 12 word classes in the Uzbek
language. A word can be polyfunctional, depending on how it is realized in the sentence and on the
semantic valence of the words in the n-gram [6]. The typical approach for most NLP applications that
use tagged corpora is to create a corpus through manual annotation and then train a machine
learning model on it [7]. To address the above-mentioned issues, we aim to create an open-source
tagged corpus in this research. The goal is to build a supervised tagger using the tagged corpus that
is being created. Typically, creating a supervised tagger requires a corpus pre-tagged with a tool [8].
The importance of the proposed work lies in the complex structure of the tagset we built and in the
tool for annotating texts to create a tagged corpus, which in turn will be used in upcoming work
on a tagger for Uzbek, to train a sequence-labeling language model.
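The annotate-then-train workflow described above can be illustrated with a minimal sketch. It is
not the tagger planned in this work: it is a most-frequent-tag baseline, and the toy sentences, the
tag names and the function names (train_baseline_tagger, tag) are hypothetical placeholders chosen
only for illustration.

```python
from collections import Counter, defaultdict

# Toy manually tagged corpus (hypothetical data; the tags are illustrative
# placeholders, not the tagset proposed in this paper).
TAGGED_SENTENCES = [
    [("men", "PRON"), ("kitob", "NOUN"), ("o'qidim", "VERB")],
    [("u", "PRON"), ("kitob", "NOUN"), ("oldi", "VERB")],
]


def train_baseline_tagger(tagged_sentences):
    """Learn, for each word form, the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag(model, words, default_tag="X"):
    """Tag each word with its most frequent tag; unseen words get default_tag."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]


model = train_baseline_tagger(TAGGED_SENTENCES)
tagged = tag(model, ["men", "kitob", "oldim"])
print(tagged)
```

A baseline of this kind ignores context, so it cannot resolve polyfunctional words; a
sequence-labeling model trained on the same corpus would be needed for that.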
2. Related work
Since a morphological and syntactic tagset and a tagged corpus are among the fundamental must-have
resources, and among the first steps in creating resources for a language, all well-resourced languages
can be said to have had their tagsets and tagged corpora developed at some point. Languages in use
differ from each other in their syntax, morphology and phonetics, but at the same time the majority of
them share a similar structure, which allows linguists to create multilingual resources and
tools. In an attempt to create a multilingual tagset that can be used by as many languages as
possible, Google Research proposed a universal POS tagset [9], which was obtained by mapping
similar features of 22 languages together. This universal POS tagset is now used by many languages
as the base of their tagset, which is then extended with more tags that encode language-specific
features. It is also used by the Universal Dependencies (UD) project [10], one of the fastest-growing
multilingual tagged NLP data platforms, which has data for over 130 languages.
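The idea of extending a universal base tagset with language-specific tags can be sketched as a simple
mapping. The sketch assumes that [9] refers to the commonly cited twelve-category universal POS
tagset; the fine-grained tag names (N_COMMON, V_AUX and so on) and the function to_universal are
hypothetical and do not come from any tagset discussed here.

```python
# The twelve coarse categories of the universal POS tagset (assumed to be
# the tagset presented in [9]).
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Hypothetical language-specific tags, each mapped to its universal base tag.
FINE_TO_UNIVERSAL = {
    "N_COMMON": "NOUN",
    "N_PROPER": "NOUN",
    "V_MAIN": "VERB",
    "V_AUX": "VERB",
    "ADJ_QUAL": "ADJ",
    "POSTP": "ADP",
}


def to_universal(fine_tag):
    """Collapse a fine-grained tag to its universal tag; unknown tags map to X."""
    return FINE_TO_UNIVERSAL.get(fine_tag, "X")


# Sanity check: every mapping target must be a valid universal tag.
assert set(FINE_TO_UNIVERSAL.values()) <= UNIVERSAL_TAGS

print([to_universal(t) for t in ["N_PROPER", "V_AUX", "UNKNOWN_TAG"]])
```

Keeping such a mapping alongside an extended tagset lets a corpus annotated with fine-grained tags
be compared with, or converted to, multilingual resources that use only the coarse categories.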
On the topic of similar work on the Uzbek language, the first work to present a list of morphological
tags and a morphological tagger [11] introduced a tool written in Prolog. However, it covered only
the main parts of speech in Uzbek text and lacked many of the tags needed to deal
with complex words.
In [12], the issue of tagging an Uzbek language corpus was considered. The authors proposed 14 POS
tags, that is, almost one tag per word class, although in Uzbek each word class is
divided into several types in terms of meaning and structure. In our approach, we took those issues
into consideration and created an expanded tagset through deeper analysis. The novel tagset allows us
to analyze the text in depth from a semantic point of view. In [13], the importance of rule-based and
stochastic tagging methods for the Uzbek language is discussed. The need for a tagged corpus for the
Uzbek language is indicated, and the occurrence of words in sentences with different functions is
described; however, the authors did not provide any morphological or syntactic tagset that could be
used for tagging.
Only a very limited amount of NLP work has been done on Uzbek. Important examples include
sentiment analysis datasets [14,15], cross-lingual word embeddings over closely related Turkic
languages [16], a stopwords dataset [17], a stemmer for Uzbek verbs [18], and a recent neural
Transformer-based (BERT) language model [19] trained on a large raw Uzbek text corpus. Although
a large number of published scientific works claim to have contributed to Uzbek NLP, the quality
of these works, be it a language resource or a tool, does not match their quantity. Works that claim
to have achieved something but do not provide open-source code or the data itself are referred to
as “zigglebottom” papers in a recent work on Uzbek [20].
Regarding related work on similar languages, there has been work on the Kazakh
language [21], in which syntactic and POS tags were developed to create a tagged corpus. The authors
produced 36 morphological tags and 9 syntactic tags and developed an annotated corpus consisting
of 613 511 words based on their tagset.
3. Proposed methods