Creating a morphological and syntactic tagged corpus for the
Uzbek language
Maksud Sharipov 1, Jamolbek Mattiev 1, Jasur Sobirov 1, Rustam Baltayev 2
1 Urgench State University, Khamid Alimdjan 14, Urgench, 220100, Uzbekistan
2 Urgench Branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi,
110, Al-Khwarizmi str., Urgench, 220100, Uzbekistan
Abstract
The creation of tagged corpora has become one of the most important tasks in
Natural Language Processing (NLP). There are not enough tagged corpora to build machine
learning models for the low-resource Uzbek language. In this paper, we try to fill that gap by
developing a novel part-of-speech (POS) and syntactic tagset for creating a morphologically
and syntactically tagged corpus of the Uzbek language. This work also includes a detailed
description and presentation of a web-based application for tagging. Based on the developed
annotation tool and software, we share the results of our experience from the first stage of
the tagged corpus creation.
Keywords
Syntactic tags, morphological tags, language corpus, Uzbek language, natural language
processing
1. Introduction
The field of Natural Language Processing (NLP) is developing rapidly and plays an
important role in solving problems in scientific, economic, and cultural fields. NLP also covers
areas such as business data analysis, web application development, corpus linguistics, computer
science, and artificial intelligence. The majority of the information available on the Internet is
textual; therefore, obtaining the necessary information from textual data through techniques
such as morphological and syntactic analysis is becoming a main field of interest in NLP.
To date, language corpora exist for most widely spoken languages; some of the earliest and
most popular are the Brown corpus [1], the International Corpus of English, and the British
National Corpus [2]. At present, practical research is underway in corpus linguistics to create
language corpora for various purposes. The usefulness of corpora for linguistic research is
ensured by the creation of tagged sub-corpora within these corpora [3].
Some research has been done on creating tagged corpora for the Uzbek language. For example,
[4,5] provide information on the basic requirements and principles of linguistic annotation for
text processing in the creation of an electronic corpus of the Uzbek language, as well as the
results of theoretical and practical research on morphological tagging and on morphological
analyzer construction using finite-state transducer (FST) technology.
Due to the lack of language resources for the Uzbek language, solving NLP problems is
difficult. To solve them, we need a morphologically and syntactically tagged corpus. To
date, the lack of an open-source morphologically and syntactically tagged corpus for the Uzbek
language makes it difficult to conduct research in the field of linguistics. There are 12 word
classes in the Uzbek language. A word can be polyfunctional depending on how it is realized in
the sentence and on the semantic valence of the words in the n-gram [6]. The typical approach
for most NLP applications using tagged corpora consists of creating a corpus through manual
annotation and then training a machine learning model [7]. To address the above-mentioned
issues, we aim to create an open-source tagged corpus in this research. The goal is to build a
supervised tagger using the tagged corpus that is being created. Typically, a corpus pre-tagged
with a tool is required to create supervised taggers [8].
The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural Language Processing
(ALTNLP), June 7-8, 2022, Koper, Slovenia
EMAIL: maqsbek72@gmail.com (M. Sharipov. 1); jamolbek_1992@mail.ru (J. Mattiev. 2);
ORCID: 0000-0002-2363-6533 (A. 1); 0000-0002-7614-118X3 (A. 2);
2022 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
The importance of the proposed work lies in the complex structure of the tagset and in the
tool for annotating given texts to create a tagged corpus, which in turn will be used in upcoming
work on a tagger tool for Uzbek to train a sequence labeling language model.
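The workflow described above (manual annotation first, then training a supervised model) can be illustrated with a minimal sketch. It is not the tagger or the tagset proposed in this paper: the sample sentences, the coarse tags, and the function names train_baseline and tag are assumptions made only for illustration. The sketch uses the simplest supervised baseline, which assigns each word the tag it received most often in the annotated data.

```python
from collections import Counter, defaultdict

# Toy hand-tagged sentences. The coarse tags are illustrative placeholders,
# NOT the tagset developed in this paper.
TAGGED_SENTENCES = [
    [("men", "PRON"), ("kitob", "NOUN"), ("o'qidim", "VERB")],
    [("u", "PRON"), ("maktabga", "NOUN"), ("bordi", "VERB")],
    [("yaxshi", "ADJ"), ("kitob", "NOUN")],
]


def train_baseline(sentences):
    """Learn, for every word form, the tag it was annotated with most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="X"):
    """Tag a tokenized sentence; unseen words receive the fallback tag."""
    return [(word, model.get(word.lower(), default)) for word in words]


model = train_baseline(TAGGED_SENTENCES)
result = tag(["Men", "yaxshi", "kitob", "oldim"], model)
print(result)
```

Such a lookup baseline cannot handle polyfunctional words or unseen inflected forms, which is one reason a larger tagged corpus and a sequence labeling model are needed.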
2. Related work
Since a morphological and syntactic tagset and a tagged corpus are among the fundamental
must-have resources, and among the first steps in creating resources for a language, all
well-resourced languages can be said to have had their tagsets and tagged corpora developed at
some point. Languages differ from each other in their syntax, morphology, and phonetics, but at
the same time the majority of them have a similar constructive structure, which allows linguists
to create multilingual resources and tools. In an attempt to create a multilingual tagset that can
be used by as many languages as possible, Google Research created a universal POS tagset [9],
which was obtained by mapping similar features of 22 languages together. This universal POS
tagset is now used by many languages as the base of their tagset, which is then extended with
more tags that encode language-specific features. It is also used by the Universal
Dependencies (UD) project [10], one of the fastest-growing multilingual tagged NLP data
platforms, which has data for over 130 languages.
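The idea of a universal base tagset extended by language-specific tags can be sketched as a simple mapping. The 12 coarse categories below are those of the universal POS tagset [9]; the fine-grained tag names on the left (N_COMMON, V_PARTICIPLE, and so on) and the helper to_universal are hypothetical and do not come from this paper or from any existing Uzbek tagset.

```python
# The 12 coarse categories of the universal POS tagset [9].
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Hypothetical language-specific tags, each mapped onto a universal category.
FINE_TO_UNIVERSAL = {
    "N_COMMON": "NOUN",
    "N_PROPER": "NOUN",
    "V_FINITE": "VERB",
    "V_PARTICIPLE": "VERB",
    "ADJ_QUAL": "ADJ",
    "POSTP": "ADP",
}


def to_universal(fine_tag):
    """Collapse a fine-grained tag to its universal category ('X' if unmapped)."""
    return FINE_TO_UNIVERSAL.get(fine_tag, "X")


# A sentence annotated with the hypothetical fine-grained tags.
fine_tagged = [("Toshkent", "N_PROPER"), ("katta", "ADJ_QUAL"), ("shahar", "N_COMMON")]
coarse_tagged = [(word, to_universal(t)) for word, t in fine_tagged]
print(coarse_tagged)
```

Keeping the mapping many-to-one means a corpus annotated with the richer tagset can still be compared with, or converted to, multilingual resources that use only the universal categories.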
Regarding similar work on the Uzbek language, the first work that presented a list of
morphological tags and a morphological tagger [11] presented a tool created in Prolog. However,
it covered only the main parts of speech in Uzbek text and was missing many tags needed to deal
with complex words.
In [12], the issue of tagging the Uzbek language corpus was considered. The authors proposed
14 POS tags, that is, almost one tag for each word class, whereas in the Uzbek language each word
class is divided into several types in terms of meaning and structure. In our approach, we took
those issues into consideration and created an expanded tagset through deeper analysis. The novel
tagset allows us to analyze the text in depth from a semantic point of view. In [13], the
importance of rule-based and stochastic tagging methods for the Uzbek language is discussed. The
need for a tagged corpus for the Uzbek language is indicated and the occurrence of words in
sentences with different functions is described; however, the authors did not provide any
morphological or syntactic tagset that can be used for tagging.
A very limited amount of NLP work has been done on Uzbek. Some of the important contributions
include sentiment analysis datasets [14,15], cross-lingual word embeddings over closely related
Turkic languages [16], a stopwords dataset [17], a stemmer for Uzbek verbs [18], and a recent
neural transformer-based (BERT) language model [19], which was trained on a large raw Uzbek text.
Although a large number of published scientific works claim to have contributed to Uzbek NLP,
the quality of these works, be it a language resource or a tool, does not match their number.
Works that claim to have done something but do not provide open-source code or the data itself
are referred to as “zigglebottom” papers in a recent work on Uzbek [20].
Regarding related work on similar languages, there has been work on the Kazakh
language [21], in which syntactic and POS tags were developed to create a tagged corpus. The
authors produced 36 morphological tags and 9 syntactic tags and developed an annotated corpus
consisting of 613 511 words based on their tagset.
3. Proposed methods