Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages

Gihan Weeraprameshwara*,1, Vihanga Jayawickrama*, Nisansa de Silva*, and Yudhanjaya Wijeratne**
*Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka
**LIRNEasia, Sri Lanka
1gihanravindu.17@cse.mrt.ac.lk
Abstract
In the process of numerically modeling natural languages, developing language embeddings is a vital step. However, it is challenging to develop functional embeddings for resource-poor languages such as Sinhala, for which sufficiently large corpora, effective language parsers, and other required resources are difficult to find. In such conditions, exploiting existing models to devise an efficacious embedding methodology for numerically representing text could be quite fruitful. This paper explores the effectiveness of several one-tiered and two-tiered embedding architectures in representing Sinhala text in the sentiment analysis domain. Our findings show that the two-tiered embedding architecture, where the lower tier consists of a word embedding and the upper tier consists of a sentence embedding, performs better than one-tiered word embeddings, achieving a maximum F1 score of 88.04% in contrast to the 83.76% achieved by word embedding models. Furthermore, embeddings in the hyperbolic space are also developed and compared with Euclidean embeddings in terms of performance. A sentiment data set consisting of Facebook posts and associated reactions has been used for this research. To effectively compare the performance of the different embedding systems, the same deep neural network structure has been trained on the sentiment data, with each embedding system used to encode the associated text.
1 Introduction
An effective numerical representation of textual content is crucial for natural language processing models to understand the underlying relational patterns among words and to discover patterns in natural languages. For resource-rich languages like English, numerous pre-trained models, as well as the materials required to develop an embedding system, are readily available. On the contrary, for resource-poor languages such as Sinhala, neither of those options can be easily found (de Silva, 2019). Even the data sets that are available for training often fail to meet adequate standards (Caswell et al., 2021). Thus, discovering a convenient methodology to develop embeddings for text would be a great step forward in the NLP domain for the Sinhala language.
Sinhala, also known as Sinhalese, is an Indo-Aryan language used within Sri Lanka (Kanduboda, 2011). The primary user base of this language is the Sinhalese ethnic group of the country. In total, 17 million people use Sinhala as their first language, while 2 million people use it as a second language (de Silva, 2019). Furthermore, Sinhala is structurally different from English, which uses a subject-verb-object (SVO) structure as opposed to the subject-object-verb (SOV) structure used by Sinhala, as shown in Figure 1; thus, most of the pre-trained embedding models for English may not be effective for Sinhala.
Figure 1: SVO grammar structure of English and SOV grammar structure of Sinhala
This study is therefore focused on discovering an effective embedding system for Sinhala text that provides reasonable results when used in training deep learning models. Sentiment analysis with Facebook data is utilized as the use case for the study.
Upon considering common forms of vector representations of textual content, bag of words, word embeddings, and sentence embeddings are three of the leading methodologies at present. Word embeddings have been observed to surpass the performance of bag of words for sufficiently large data sets (Rudkowsky et al., 2018), as bag of words often meets with various problems such as disregarding the grammatical structure of the text, a large vocabulary dimension, and sparse representation (Le and Mikolov, 2014; El-Din, 2016). Word embeddings can be used to tackle these challenges. Since word embeddings capture the similarities among sentiments ingrained in words and represent them in the vector space, they tend to increase the accuracy of classification models (Goldberg, 2016).
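
As an illustrative aside, lower-tier word embeddings of the kind compared here can be trained with off-the-shelf tooling. The following is a minimal sketch using gensim; the corpus stand-in and all hyperparameter values are placeholders, not the configuration used in this study.

    # Sketch: training lower-tier word embeddings on a tokenised corpus
    # with gensim. Values below are illustrative placeholders.
    from gensim.models import FastText, Word2Vec

    # Each post is assumed to be pre-tokenised into a list of words.
    corpus = [["සුබ", "උදෑසනක්"], ["good", "morning"]]  # toy stand-in

    # min_count=1 only so the toy corpus trains; a real corpus would use
    # a higher threshold to drop rare words.
    w2v = Word2Vec(corpus, vector_size=300, window=5, min_count=1, sg=1)
    ft = FastText(corpus, vector_size=300, window=5, min_count=1)

    vector = ft.wv["good"]  # fastText can also handle out-of-vocabulary
                            # words via character n-grams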
However, one of the major weaknesses of word embedding models is that they fail to capture syntax and polysemy, i.e. the presence of multiple possible meanings for a certain word or phrase (Mu et al., 2016). In order to overcome these obstacles, and also to achieve finer granularity in the embedding, sentence embeddings are used. The idea is to test common Euclidean-space word embedding techniques such as fastText (Bojanowski et al., 2017; Joulin et al., 2016), Word2vec (Mikolov et al., 2013), and GloVe (Pennington et al., 2014) in combination with sentence embedding techniques. The pooling methods (i.e. max pooling, min pooling, and average pooling) will be considered as the baseline methods for the test. More advanced models, such as the sequence-to-sequence (seq2seq) model (Sutskever et al., 2014) and the modified version of the sequence-to-sequence model introduced by the work of Cho et al. (2014), with GRU (Chung et al., 2014) and LSTM (Hochreiter and Schmidhuber, 1997) recurrent neural network units, will be tested against the pooling baselines. Furthermore, the addition of an attention mechanism (Vaswani et al., 2017) to the sequence-to-sequence model will also be tested.
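
For concreteness, the pooling baselines reduce a sentence to a single vector by element-wise operations over its word vectors. A minimal sketch, assuming the lower-tier vectors are already available as a matrix:

    # Pooling baselines: element-wise max / min / average over the word
    # vectors of a sentence, yielding one fixed-length sentence vector.
    import numpy as np

    def pool_sentence(word_vectors, mode="avg"):
        """word_vectors: (num_words, dim) array of lower-tier embeddings."""
        m = np.asarray(word_vectors)
        if mode == "max":
            return m.max(axis=0)
        if mode == "min":
            return m.min(axis=0)
        return m.mean(axis=0)  # default: average pooling

    words = np.random.rand(7, 300)  # 7 words, 300-dimensional embeddings
    sentence = pool_sentence(words, mode="max")  # shape: (300,)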
Most models created using word and sentence embeddings are based on the Euclidean space. Though this vector space is commonly used, it poses significant limitations when representing complex structures (Nickel and Kiela, 2017). Using the hyperbolic space provides a plausible solution for such instances. The hyperbolic space is a negatively-curved, non-Euclidean space. It is advantageous for embedding trees, as the circumference of a circle grows exponentially with its radius. The usage of hyperbolic embeddings is still a novel research area, as they were only introduced recently through the work of Nickel and Kiela (2017), Chamberlain et al. (2017), and Sala et al. (2018). The work of Lu et al. (2019, 2020) highlights the importance of using the hyperbolic space to improve the quality of embeddings in a practical context within the medical domain. However, research done on the applicability of hyperbolic embeddings in different arenas is highly limited. Thus, the full potential of the hyperbolic space is yet to be fully uncovered.
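
For reference, the distance function of the Poincaré ball model used in that line of work has a closed form. A minimal sketch, assuming points lie strictly inside the unit ball:

    # Poincaré ball distance, following Nickel and Kiela (2017):
    # d(u, v) = arccosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    import numpy as np

    def poincare_distance(u, v):
        u, v = np.asarray(u), np.asarray(v)
        sq_dist = np.sum((u - v) ** 2)
        denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
        return np.arccosh(1.0 + 2.0 * sq_dist / denom)

    # Distances blow up near the boundary of the ball, which is what
    # makes the space well suited to tree-like hierarchies.
    print(poincare_distance([0.1, 0.2], [0.5, -0.3]))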
Through this paper, we test the effectiveness of a set of two-tiered word representation models in which various word embeddings form the lower tier and sentence embeddings form the upper tier.
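
As a minimal sketch of this two-tiered idea (an illustration under assumed sizes, not the exact architecture evaluated later), a frozen pretrained word embedding can feed a GRU whose final hidden state serves as the sentence embedding:

    # Two-tiered representation sketch: frozen lower-tier word vectors
    # feeding a GRU; the final hidden state is the upper-tier sentence
    # embedding. All dimensions are illustrative placeholders.
    import torch
    import torch.nn as nn

    class TwoTierEncoder(nn.Module):
        def __init__(self, pretrained_vectors, sent_dim=256):
            super().__init__()
            # Lower tier: pretrained word vectors (e.g. fastText), frozen.
            self.word_emb = nn.Embedding.from_pretrained(
                pretrained_vectors, freeze=True)
            # Upper tier: recurrent sentence encoder.
            self.gru = nn.GRU(pretrained_vectors.size(1), sent_dim,
                              batch_first=True)

        def forward(self, token_ids):            # (batch, seq_len)
            words = self.word_emb(token_ids)     # (batch, seq_len, word_dim)
            _, h_n = self.gru(words)             # (1, batch, sent_dim)
            return h_n.squeeze(0)                # (batch, sent_dim)

    vectors = torch.randn(10_000, 300)           # stand-in vocabulary
    encoder = TwoTierEncoder(vectors)
    batch = torch.randint(0, 10_000, (4, 20))    # 4 sentences of 20 tokens
    print(encoder(batch).shape)                  # torch.Size([4, 256])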
2 Related Work
The sequence-to-sequence model introduced by the work of Sutskever et al. (2014) is vital to this research, as it is one of the core models in developing sentence embeddings. Though originally developed for translation purposes, the model has undergone multiple modifications depending on the context, such as description generation for images (Karpathy and Fei-Fei, 2015), phrase representation (Cho et al., 2014), attention models (Vaswani et al., 2017), and BERT models (Devlin et al., 2018), thus proving the potential it holds in the machine learning area.
The work of Nickel and Kiela (2017) introduces and explores the potential of hyperbolic embeddings using an n-dimensional Poincaré ball. That work compares hyperbolic and Euclidean embeddings for a complex latent data structure and comes to the conclusion that hyperbolic embedding surpasses Euclidean embedding in effectiveness. Inspired by these results, both Leimeister and Wilson (2018) and Dhingra et al. (2018) have extended the methodology introduced by Nickel and Kiela (2017). Leimeister and Wilson (2018) have developed a hyperbolic word embedding using the skip-gram negative sampling architecture taken from Word2vec. In lower embedding dimensions, the developed model performs better in comparison to its Euclidean counterpart. The work of Dhingra et al. (2018) uses re-parameterization to extend the Poincaré embedding, in order to learn the embedding of arbitrarily parameterized objects. The framework thus created is used to develop word and sentence embeddings. In our research, we follow in the footsteps of the above papers.
When considering the usage of hyperbolic embeddings in a practical context, the work of Lu et al. (2019, 2020) can be examined. The research by Lu et al. (2019) improves the state-of-the-art model used to predict ICU (intensive care unit) re-admissions and surpasses the accepted benchmark used to predict in-hospital mortality, using hyperbolic embeddings of Electronic Health Records, while the work of Lu et al. (2020) introduces a novel network embedding method capable of maintaining the consistency of node representations across two views of a network, thus emphasizing the capabilities of hyperbolic embeddings. To the best of our knowledge, hyperbolic embeddings have not been previously applied to Sinhala content. Therefore, this research may reveal novel insights regarding hyperbolic embeddings and their effectiveness in sentiment analysis.
In the research work of Senevirathne et al. (2020), the capsule-B model (Zhao et al., 2018) is crowned as the state-of-the-art model for Sinhala sentiment analysis. In that work, a set of deep learning models is tested for the ability to predict the sentiment of Sinhala news comments. The GRU (Chung et al., 2014) model with a CNN (Wang et al., 2016) layer, which is used for testing each embedding in this work, is taken from the aforementioned research. Furthermore, the work of Weeraprameshwara et al. (2022) has extended this idea and tested the same set of deep learning models, with the addition of the sentiment analysis models introduced in the work of Jayawickrama et al. (2021), using the Facebook data set employed in this research. According to their results, the 3-layer stacked BiLSTM model (Zhou et al., 2019) stands out as the state-of-the-art model.
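
To illustrate the kind of classifier used to test each embedding, the following is a plausible GRU-with-a-CNN-layer arrangement; it is a hedged stand-in, not the exact architecture of Senevirathne et al. (2020):

    # Illustrative GRU model with a CNN layer for sentiment classification
    # over embedded text; layer sizes are assumptions, not the published
    # configuration.
    import torch
    import torch.nn as nn

    class GRUCNNClassifier(nn.Module):
        def __init__(self, emb_dim=300, hidden=128, n_classes=2):
            super().__init__()
            self.gru = nn.GRU(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
            self.conv = nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1)
            self.fc = nn.Linear(64, n_classes)

        def forward(self, x):                   # x: (batch, seq, emb_dim)
            h, _ = self.gru(x)                  # (batch, seq, 2 * hidden)
            c = torch.relu(self.conv(h.transpose(1, 2)))
            pooled = c.max(dim=2).values        # global max pooling over time
            return self.fc(pooled)              # (batch, n_classes)

    model = GRUCNNClassifier()
    print(model(torch.randn(4, 20, 300)).shape)  # torch.Size([4, 2])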
3 Methodology
In order to test the feasibility of two-tiered word representation as a means of representing Sinhala text in the sentiment analysis domain, a series of experiments was conducted, as described in the following subsections.
3.1 Data Set
The data set used for this project is extracted from the work of Wijeratne and de Silva (2020), which contains 1,820,930 Facebook posts from 533 Facebook pages popular in Sri Lanka over the time window of 2010 to 2020. That work has produced two cleaned corpora and a set of stop words for the given context; the larger corpus consists of a total of 28 to 29 million words. The data set covers a wide range of subjects such as politics, media, and celebrities. Table 1 lists the reaction fields taken from the data set for the embedding development, model training, and testing phases.
Field Name    Total Count    Percentage (%)
Likes         312,282,979    93.58
Loves          10,637,722     3.19
Wow             1,633,255     0.49
Haha            5,377,815     1.61
Sad             2,611,908     0.78
Angry           1,158,182     0.35
Thankful           12,933     0.00

Table 1: The counts and percentages of the reactions in the Facebook data set
3.2 Preprocessing
Even though two preprocessed corpora were introduced through the work of Wijeratne and de Silva (2020), the raw data set was used for this research, with the objective of preprocessing it to suit our requirements. As such, numerical content, URLs, email addresses, hashtags, words in languages other than Sinhala and English, and excessive spaces were removed from the text. While the focus of this study is colloquial Sinhala, English is included in the data set because the two languages are often code-mixed in colloquial use; code-mixing of Sinhala with other languages is far less common. Furthermore, stop words were removed from the text as well, as recommended by Wijeratne and de Silva (2020). Posts left with no textual content after preprocessing, as well as posts with no reaction annotations, were also removed, as they yield no value in the annotation stage. The final preprocessed data set consists of Sinhala, English, and Sinhala-English code-mixed content, adding up to a total of 542,871 Facebook posts comprising 8,605,849 words.
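
The cleaning steps described above could be approximated with simple regular expressions, as in the sketch below; the exact rules applied in this study may differ.

    # Approximate preprocessing: strip URLs, e-mail addresses, hashtags,
    # numerals, characters outside the Sinhala (U+0D80-U+0DFF) and basic
    # Latin ranges, and excess whitespace. Note that this crude character
    # filter also drops zero-width joiners used in some Sinhala ligatures.
    import re

    def preprocess(text):
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
        text = re.sub(r"\S+@\S+\.\S+", " ", text)           # e-mail addresses
        text = re.sub(r"#\S+", " ", text)                   # hashtags
        text = re.sub(r"\d+", " ", text)                    # numerical content
        text = re.sub(r"[^\u0D80-\u0DFFa-zA-Z\s]", " ", text)
        return re.sub(r"\s+", " ", text).strip()            # excess spaces

    print(preprocess("Visit https://example.com #news ලංකා 2020!"))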
3.3 Annotation
Since the procedure followed in the model development is supervised learning, the data set needed to be annotated (Schapire and Freund, 2012). It is quite a considerable challenge to obtain sufficiently large annotated data sets for resource-poor languages.