Creating a morphological and syntactic tagged corpus for the Uzbek language

Maksud Sharipov 1, Jamolbek Mattiev 1, Jasur Sobirov 1, Rustam Baltayev 2

1 Urgench State University, Khamid Alimdjan 14, Urgench, 220100, Uzbekistan

2 Urgench Branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, Al-Khwarizmi str, Urgench, 220100, Uzbekistan

Abstract

The creation of tagged corpora has become one of the most important tasks in Natural Language Processing (NLP). For Uzbek, a low-resource language, there are not enough tagged corpora to build machine learning models. In this paper, we address that gap by developing a novel part-of-speech (POS) and syntactic tagset for building a morphologically and syntactically tagged corpus of the Uzbek language. We also describe and present a web-based application for carrying out the tagging. Using the developed annotation tool and software, we share our experience and the results of the first stage of creating the tagged corpus.

Keywords

Syntactic tags, morphological tags, language corpus, Uzbek language, natural language processing

1. Introduction

The field of Natural Language Processing (NLP) is developing rapidly and plays an important role in solving problems in the scientific, economic, and cultural spheres. NLP also serves areas such as business data analysis, web application development, corpus linguistics, computer science, and artificial intelligence. Most of the information available on the Internet is textual; therefore, extracting the necessary information from text through techniques such as morphological and syntactic analysis has become a main field of interest in NLP.

To date, language corpora exist for the most widely spoken languages. Among the earliest and best known are the Brown corpus [1], the International Corpus of English, and the British National Corpus [2]. Practical research in corpus linguistics is currently underway to create language corpora for various purposes. What makes a corpus useful for linguistic research is the tagged sub-corpora created within it [3].

Some research has been done on creating tagged corpora for the Uzbek language. For example, [4,5] describe the basic requirements and principles of linguistic annotation for text processing in the creation of an electronic corpus of the Uzbek language, and report the results of theoretical and practical research on morphological tagging and on building a morphological analyzer using finite-state transducer (FST) technology.

The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural Language Processing (ALTNLP), June 7-8, 2022, Koper, Slovenia
EMAIL: maqsbek72@gmail.com (M. Sharipov); jamolbek_1992@mail.ru (J. Mattiev)
ORCID: 0000-0002-2363-6533 (A. 1); 0000-0002-7614-118X3 (A. 2)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

The lack of language resources for Uzbek makes NLP problems difficult to solve. Solving them requires a morphologically and syntactically tagged corpus, and to date the absence of an open-source corpus of this kind for the Uzbek language has made linguistic research difficult. There are 12 word classes in the Uzbek language. A word can be polyfunctional, depending on how it is realized in the sentence and on the semantic valence of the words in the surrounding n-gram [6]. The typical approach for most NLP applications that use tagged corpora is to create a corpus through manual annotation and then train a machine learning model on it [7]. To address the issues above, this research aims to create an open-source tagged corpus. The goal is to build a supervised tagger using the tagged corpus that is being created. Typically, creating a supervised tagger requires a corpus that has been pre-tagged with a tool [8].

The importance of the proposed work lies in the complex structure of the tagset we built and in the tool for annotating texts to create a tagged corpus, which will in turn be used in upcoming work on a tagger for Uzbek to train a sequence labeling model.
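To make the "annotate manually, then train a supervised tagger" workflow concrete, the following is a minimal sketch of the simplest supervised baseline: a most-frequent-tag tagger learned from a handful of hand-tagged sentences. It is not the tagger planned in this work. The toy sentences, the coarse tag names, and the function names train_baseline and tag are assumptions made for illustration only and do not come from the corpus or tagset described here.

```python
from collections import Counter, defaultdict

# Toy hand-tagged sentences. Words and tags are illustrative assumptions,
# not samples from the corpus or tagset described in this paper.
TRAIN = [
    [("men", "PRON"), ("kitob", "NOUN"), ("o'qidim", "VERB")],
    [("u", "PRON"), ("kitob", "NOUN"), ("oldi", "VERB")],
    [("yangi", "ADJ"), ("kitob", "NOUN")],
]


def train_baseline(sentences):
    """Learn the most frequent tag for every word seen in the tagged data."""
    counts = defaultdict(Counter)   # word -> Counter of tags
    all_tags = Counter()            # overall tag frequencies
    for sentence in sentences:
        for word, tag_name in sentence:
            counts[word.lower()][tag_name] += 1
            all_tags[tag_name] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Unknown words fall back to the most frequent tag overall.
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag each word with its learned tag, or the default if unseen."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


lexicon, default_tag = train_baseline(TRAIN)
print(tag(["u", "yangi", "kitob", "oldi", "daftar"], lexicon, default_tag))
```

A baseline of this kind ignores context, so it cannot resolve the polyfunctional words mentioned above; that limitation is one reason sequence labeling models trained on a larger tagged corpus are the intended next step.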

2. Related work

A morphological and syntactic tagset and a tagged corpus are fundamental, must-have resources and among the first steps in building resources for a language, so all well-resourced languages can be said to have had their tagsets and tagged corpora developed at some point. Languages differ from one another in syntax, morphology, and phonetics, but at the same time most of them share a similar constructive structure, which allows linguists to create multilingual resources and tools. In an attempt to create a multilingual tagset usable by as many languages as possible, Google Research developed a universal POS tagset [9], obtained by mapping together similar features of 22 languages. Many languages now use this universal POS tagset as the base of their own tagset, extending it with additional tags that encode language-specific features. The universal POS tagset is also used by the Universal Dependencies (UD) project [10], one of the fastest-growing multilingual platforms for tagged NLP data, with data for over 130 languages.
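The idea of extending a universal base tagset with language-specific tags can be sketched as a mapping from fine-grained tags back to coarse ones. In the sketch below, the 12 coarse tags are those of the universal POS tagset [9]; the fine-grained tag names, the example tokens, and the helper to_universal are hypothetical and are not the tagset proposed in this paper.

```python
# The 12 coarse categories of the universal POS tagset [9].
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Hypothetical language-specific tags mapped to their universal base tag.
# These names are invented for illustration only.
FINE_TO_UNIVERSAL = {
    "NOUN_PROPER": "NOUN",
    "NOUN_COMMON": "NOUN",
    "VERB_AUX": "VERB",
    "NUM_ORDINAL": "NUM",
    "PUNCT": ".",
}

# Every fine-grained tag must map to a valid universal tag.
assert set(FINE_TO_UNIVERSAL.values()) <= UNIVERSAL_TAGS


def to_universal(fine_tag):
    """Return the universal tag for a fine-grained tag, or 'X' if unmapped."""
    return FINE_TO_UNIVERSAL.get(fine_tag, "X")


tagged = [("Urganch", "NOUN_PROPER"), ("shahri", "NOUN_COMMON"), (".", "PUNCT")]
coarse = [(word, to_universal(fine)) for word, fine in tagged]
print(coarse)
```

Keeping such a mapping alongside an extended tagset lets a corpus annotated with detailed tags still be compared with, or converted to, resources that use only the universal categories.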

For Uzbek, the first work to present a list of morphological tags together with a morphological tagger [11] introduced a tool written in Prolog. However, it covered only the main parts of speech in Uzbek text and lacked many of the tags needed to handle complex words.

In [12], the issue of tagging the Uzbek language corpus was considered. The authors proposed 14 POS tags, that is, roughly one tag per word class, although in Uzbek each word class is divided into several types in terms of meaning and structure. Our approach takes these issues into consideration and, through deeper analysis, produces an expanded tagset. The novel tagset allows us to analyze text in depth from a semantic point of view. In [13], the importance of rule-based and stochastic tagging methods for the Uzbek language is discussed. That work points out the need for a tagged corpus for Uzbek and describes how words occur in sentences with different functions; however, the authors did not provide any morphological or syntactic tagset that could be used for tagging.

Only a limited amount of NLP work has been done on Uzbek. Important contributions include sentiment analysis datasets [14,15], cross-lingual word embeddings over closely related Turkic languages [16], a stopwords dataset [17], a stemmer for Uzbek verbs [18], and a recent neural transformer-based (BERT) language model [19] trained on a large raw Uzbek text corpus. Although many published papers claim to have contributed to Uzbek NLP, the quality of the resulting language resources and tools falls well short of that volume. Papers that claim results without providing open-source code or the data itself are referred to as “zigglebottom” papers in a recent work on Uzbek [20].

Regarding related work on similar languages, for the Kazakh language [21] syntactic and POS tags were developed to create a tagged corpus. The authors produced 36 morphological tags and 9 syntactic tags and, based on their tagset, developed an annotated corpus consisting of 613 511 words.

3. Proposed methods
