Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages

Gihan Weeraprameshwara*,1, Vihanga Jayawickrama*, Nisansa de Silva*, and Yudhanjaya Wijeratne**
*Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka
**LIRNEasia, Sri Lanka
1gihanravindu.17@cse.mrt.ac.lk
Abstract
In the process of numerically modeling natural languages, developing language embeddings is a vital step. However, it is challenging to develop functional embeddings for resource-poor languages such as Sinhala, for which sufficiently large corpora, effective language parsers, and other required resources are difficult to find. In such conditions, exploiting existing models to devise an efficacious embedding methodology for numerically representing text could be quite fruitful. This paper explores the effectiveness of several one-tiered and two-tiered embedding architectures in representing Sinhala text in the sentiment analysis domain. Our findings show that the two-tiered embedding architecture, where the lower tier consists of a word embedding and the upper tier consists of a sentence embedding, performs better than one-tiered word embeddings, achieving a maximum F1 score of 88.04% in contrast to the 83.76% achieved by word embedding models. Furthermore, embeddings in the hyperbolic space are also developed and compared with Euclidean embeddings in terms of performance. A sentiment data set consisting of Facebook posts and associated reactions has been used for this research. To effectively compare the performance of the different embedding systems, the same deep neural network structure has been trained on the sentiment data, with each embedding system used to encode the associated text.
1 Introduction
An effective numerical representation of textual content is crucial for natural language processing models to understand the underlying relational patterns among words and to discover patterns in natural languages. For resource-rich languages like English, numerous pre-trained models, as well as the materials required to develop an embedding system, are readily available. On the contrary, for resource-poor languages such as Sinhala, neither of those options can be easily found (de Silva, 2019). Even the data sets that are available for training often fail to meet adequate standards (Caswell et al., 2021). Thus, discovering a convenient methodology to develop embeddings for text would be a great step forward in the NLP domain for the Sinhala language.
Sinhala, also known as Sinhalese, is an Indo-Aryan language used within Sri Lanka (Kanduboda, 2011). The primary user base of this language is the Sinhalese ethnic group of the country. In total, 17 million people use Sinhala as their first language, while 2 million people use it as a second language (de Silva, 2019). Furthermore, Sinhala is structurally different from English, which uses a subject-verb-object (SVO) structure as opposed to the subject-object-verb (SOV) structure used by Sinhala, as shown in Figure 1; thus, most of the pre-trained embedding models for English may not be effective for Sinhala.
Figure 1: SVO grammar structure of English and SOV grammar structure of Sinhala
This study is therefore focused on discovering an effective embedding system for Sinhala text that provides reasonable results when used in training deep learning models. Sentiment analysis with Facebook data is utilized as the use case for the study.
Upon considering common forms of vector representations of textual content, bag of words, word embeddings, and sentence embeddings are three of the leading methodologies at present. Word embeddings have been observed to surpass the performance of bag of words for sufficiently large data sets (Rudkowsky et al., 2018), as bag of words often meets with various problems such as disregarding the grammatical structure of the text, a large vocabulary dimension, and sparse representation (Le and Mikolov, 2014; El-Din, 2016). Word embeddings can be used to tackle these challenges. Since word embeddings capture the similarities among sentiments ingrained in words and represent them in the vector space, they tend to increase the accuracy of classification models (Goldberg, 2016).
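
As an illustrative aside, lower-tier word embeddings of the kind compared here can be trained with off-the-shelf tooling. The following is a minimal sketch using gensim; the corpus stand-in and all hyperparameter values are placeholders, not the configuration used in this study.

    # Sketch: training lower-tier word embeddings on a tokenised corpus
    # with gensim. Values below are illustrative placeholders.
    from gensim.models import FastText, Word2Vec

    # Each post is assumed to be pre-tokenised into a list of words.
    corpus = [["සුබ", "උදෑසනක්"], ["good", "morning"]]  # toy stand-in

    # min_count=1 only so the toy corpus trains; a real corpus would use
    # a higher threshold to drop rare words.
    w2v = Word2Vec(corpus, vector_size=300, window=5, min_count=1, sg=1)
    ft = FastText(corpus, vector_size=300, window=5, min_count=1)

    vector = ft.wv["good"]  # fastText can also handle out-of-vocabulary
                            # words via character n-grams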
However, one of the major weaknesses of word embedding models is that they fail to capture syntax and polysemy, i.e. the presence of multiple possible meanings for a certain word or phrase (Mu et al., 2016). In order to overcome these obstacles, and also to achieve finer granularity in the embedding, sentence embeddings are used. The idea is to test common Euclidean-space word embedding techniques such as fastText (Bojanowski et al., 2017; Joulin et al., 2016), Word2vec (Mikolov et al., 2013), and GloVe (Pennington et al., 2014) in combination with sentence embedding techniques. The pooling methods (i.e. max pooling, min pooling, and average pooling) will be considered as the baseline methods for the test. More advanced models, such as the sequence-to-sequence (seq2seq) model (Sutskever et al., 2014) and the modified version of the sequence-to-sequence model introduced by the work of Cho et al. (2014), with GRU (Chung et al., 2014) and LSTM (Hochreiter and Schmidhuber, 1997) recurrent neural network units, will be tested against the pooling baselines. Furthermore, the addition of an attention mechanism (Vaswani et al., 2017) to the sequence-to-sequence model will also be tested.
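
For concreteness, the pooling baselines reduce a sentence to a single vector by element-wise operations over its word vectors. A minimal sketch, assuming the lower-tier vectors are already available as a matrix:

    # Pooling baselines: element-wise max / min / average over the word
    # vectors of a sentence, yielding one fixed-length sentence vector.
    import numpy as np

    def pool_sentence(word_vectors, mode="avg"):
        """word_vectors: (num_words, dim) array of lower-tier embeddings."""
        m = np.asarray(word_vectors)
        if mode == "max":
            return m.max(axis=0)
        if mode == "min":
            return m.min(axis=0)
        return m.mean(axis=0)  # default: average pooling

    words = np.random.rand(7, 300)  # 7 words, 300-dimensional embeddings
    sentence = pool_sentence(words, mode="max")  # shape: (300,)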
Most models created using word and sentence embeddings are based on the Euclidean space. Though this vector space is commonly used, it poses significant limitations when representing complex structures (Nickel and Kiela, 2017). Using the hyperbolic space provides a plausible solution for such instances. The hyperbolic space is a negatively-curved, non-Euclidean space. It is advantageous for embedding trees, as the circumference of a circle grows exponentially with its radius. The usage of hyperbolic embeddings is still a novel research area, as they were only introduced recently through the work of Nickel and Kiela (2017), Chamberlain et al. (2017), and Sala et al. (2018). The work of Lu et al. (2019, 2020) highlights the importance of using the hyperbolic space to improve the quality of embeddings in a practical context within the medical domain. However, research done on the applicability of hyperbolic embeddings in different arenas is highly limited. Thus, the full potential of the hyperbolic space is yet to be fully uncovered.
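
For reference, the distance function of the Poincaré ball model used in that line of work has a closed form. A minimal sketch, assuming points lie strictly inside the unit ball:

    # Poincaré ball distance, following Nickel and Kiela (2017):
    # d(u, v) = arccosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    import numpy as np

    def poincare_distance(u, v):
        u, v = np.asarray(u), np.asarray(v)
        sq_dist = np.sum((u - v) ** 2)
        denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
        return np.arccosh(1.0 + 2.0 * sq_dist / denom)

    # Distances blow up near the boundary of the ball, which is what
    # makes the space well suited to tree-like hierarchies.
    print(poincare_distance([0.1, 0.2], [0.5, -0.3]))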
Through this paper, we test the effectiveness of a set of two-tiered word representation models in which various word embeddings form the lower tier and sentence embeddings form the upper tier.
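
As a minimal sketch of this two-tiered idea (an illustration under assumed sizes, not the exact architecture evaluated later), a frozen pretrained word embedding can feed a GRU whose final hidden state serves as the sentence embedding:

    # Two-tiered representation sketch: frozen lower-tier word vectors
    # feeding a GRU; the final hidden state is the upper-tier sentence
    # embedding. All dimensions are illustrative placeholders.
    import torch
    import torch.nn as nn

    class TwoTierEncoder(nn.Module):
        def __init__(self, pretrained_vectors, sent_dim=256):
            super().__init__()
            # Lower tier: pretrained word vectors (e.g. fastText), frozen.
            self.word_emb = nn.Embedding.from_pretrained(
                pretrained_vectors, freeze=True)
            # Upper tier: recurrent sentence encoder.
            self.gru = nn.GRU(pretrained_vectors.size(1), sent_dim,
                              batch_first=True)

        def forward(self, token_ids):            # (batch, seq_len)
            words = self.word_emb(token_ids)     # (batch, seq_len, word_dim)
            _, h_n = self.gru(words)             # (1, batch, sent_dim)
            return h_n.squeeze(0)                # (batch, sent_dim)

    vectors = torch.randn(10_000, 300)           # stand-in vocabulary
    encoder = TwoTierEncoder(vectors)
    batch = torch.randint(0, 10_000, (4, 20))    # 4 sentences of 20 tokens
    print(encoder(batch).shape)                  # torch.Size([4, 256])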
2 Related Work
The sequence-to-sequence model introduced by the work of Sutskever et al. (2014) is vital to this research, as it is one of the core models in developing sentence embeddings. Though originally developed for translation purposes, the model has undergone multiple modifications depending on the context, such as description generation for images (Karpathy and Fei-Fei, 2015), phrase representation (Cho et al., 2014), attention models (Vaswani et al., 2017), and BERT models (Devlin et al., 2018), thus proving the potential it holds in the machine learning area.
The work of Nickel and Kiela (2017) introduces and explores the potential of hyperbolic embeddings using an n-dimensional Poincaré ball. That work compares hyperbolic and Euclidean embeddings for a complex latent data structure and comes to the conclusion that hyperbolic embedding surpasses Euclidean embedding in effectiveness. Inspired by these results, both Leimeister and Wilson (2018) and Dhingra et al. (2018) have extended the methodology introduced by Nickel and Kiela (2017). Leimeister and Wilson (2018) have developed a hyperbolic word embedding using the skip-gram negative sampling architecture taken from Word2vec. In lower embedding dimensions, the developed model performs better in comparison to its Euclidean counterpart. The work of Dhingra et al. (2018) uses re-parameterization to extend the Poincaré embedding, in order to learn the embedding of arbitrarily parameterized objects. The framework thus created is used to develop word and sentence embeddings. In our research, we follow in the footsteps of the above papers.
When considering the usage of hyperbolic embeddings in a practical context, the work of Lu et al. (2019, 2020) can be examined. The research by Lu et al. (2019) improves the state-of-the-art model used to predict ICU (intensive care unit) re-admissions and surpasses the accepted benchmark used to predict in-hospital mortality, using hyperbolic embeddings of Electronic Health Records, while the work of Lu et al. (2020) introduces a novel network embedding method capable of maintaining the consistency of node representations across two views of a network, thus emphasizing the capabilities of hyperbolic embeddings. To the best of our knowledge, hyperbolic embeddings have not been previously applied to Sinhala content. Therefore, this research may reveal novel insights regarding hyperbolic embeddings and their effectiveness in sentiment analysis.
In the research work of Senevirathne et al. (2020), the capsule-B model (Zhao et al., 2018) is crowned as the state-of-the-art model for Sinhala sentiment analysis. In that work, a set of deep learning models is tested for the ability to predict the sentiment of Sinhala news comments. The GRU (Chung et al., 2014) model with a CNN (Wang et al., 2016) layer, which is used for testing each embedding in this work, is taken from the aforementioned research. Furthermore, the work of Weeraprameshwara et al. (2022) has extended this idea and tested the same set of deep learning models, with the addition of the sentiment analysis models introduced in the work of Jayawickrama et al. (2021), using the Facebook data set employed in this research. According to their results, the 3-layer stacked BiLSTM model (Zhou et al., 2019) stands out as the state-of-the-art model.
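
To illustrate the kind of classifier used to test each embedding, the following is a plausible GRU-with-a-CNN-layer arrangement; it is a hedged stand-in, not the exact architecture of Senevirathne et al. (2020):

    # Illustrative GRU model with a CNN layer for sentiment classification
    # over embedded text; layer sizes are assumptions, not the published
    # configuration.
    import torch
    import torch.nn as nn

    class GRUCNNClassifier(nn.Module):
        def __init__(self, emb_dim=300, hidden=128, n_classes=2):
            super().__init__()
            self.gru = nn.GRU(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
            self.conv = nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1)
            self.fc = nn.Linear(64, n_classes)

        def forward(self, x):                   # x: (batch, seq, emb_dim)
            h, _ = self.gru(x)                  # (batch, seq, 2 * hidden)
            c = torch.relu(self.conv(h.transpose(1, 2)))
            pooled = c.max(dim=2).values        # global max pooling over time
            return self.fc(pooled)              # (batch, n_classes)

    model = GRUCNNClassifier()
    print(model(torch.randn(4, 20, 300)).shape)  # torch.Size([4, 2])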
3 Methodology
In order to test the feasibility of two-tiered word representation as a means of representing Sinhala text in the sentiment analysis domain, a series of experiments was conducted, as described in the following subsections.
3.1 Data Set
The data set used for this project is extracted from the work of Wijeratne and de Silva (2020), which contains 1,820,930 Facebook posts from 533 Facebook pages popular in Sri Lanka over the time window of 2010 to 2020. That work has produced two cleaned corpora and a set of stop words for the given context; the larger corpus consists of a total of 28 to 29 million words. The data set covers a wide range of subjects such as politics, media, and celebrities. Table 1 lists the reaction fields taken from the data set for the embedding development, model training, and testing phases.
Field Name    Total Count    Percentage (%)
Likes         312,282,979    93.58
Loves          10,637,722     3.19
Wow             1,633,255     0.49
Haha            5,377,815     1.61
Sad             2,611,908     0.78
Angry           1,158,182     0.35
Thankful           12,933     0.00

Table 1: The counts and percentages of the reactions in the Facebook data set
3.2 Preprocessing
Even though two preprocessed corpora were introduced through the work of Wijeratne and de Silva (2020), the raw data set was used for this research, with the objective of preprocessing it to suit our requirements. As such, numerical content, URLs, email addresses, hashtags, words in languages other than Sinhala and English, and excessive spaces were removed from the text. While the focus of this study is colloquial Sinhala, English is included in the data set because the two languages are often code-mixed in colloquial use; code-mixing of Sinhala with other languages is far less common. Furthermore, stop words were removed from the text as well, as recommended by Wijeratne and de Silva (2020). Posts left with no textual content after preprocessing, as well as posts with no reaction annotations, were also removed, as they yield no value in the annotation stage. The final preprocessed data set consists of Sinhala, English, and Sinhala-English code-mixed content, adding up to a total of 542,871 Facebook posts comprising 8,605,849 words.
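
The cleaning steps described above could be approximated with simple regular expressions, as in the sketch below; the exact rules applied in this study may differ.

    # Approximate preprocessing: strip URLs, e-mail addresses, hashtags,
    # numerals, characters outside the Sinhala (U+0D80-U+0DFF) and basic
    # Latin ranges, and excess whitespace. Note that this crude character
    # filter also drops zero-width joiners used in some Sinhala ligatures.
    import re

    def preprocess(text):
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
        text = re.sub(r"\S+@\S+\.\S+", " ", text)           # e-mail addresses
        text = re.sub(r"#\S+", " ", text)                   # hashtags
        text = re.sub(r"\d+", " ", text)                    # numerical content
        text = re.sub(r"[^\u0D80-\u0DFFa-zA-Z\s]", " ", text)
        return re.sub(r"\s+", " ", text).strip()            # excess spaces

    print(preprocess("Visit https://example.com #news ලංකා 2020!"))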
3.3 Annotation
Since the procedure followed in the model development is supervised learning, the data set needed to be annotated (Schapire and Freund, 2012). It is quite a considerable challenge to obtain sufficiently large annotated data sets for resource-poor languages.