Improving Sentiment Analysis By Emotion Lexicon
Approach on Vietnamese Texts
An Long Doan1,2,* and Son T. Luu1,2,†
1University of Information Technology, Ho Chi Minh City, Vietnam
2Vietnam National University, Ho Chi Minh City, Vietnam
Email: *19521173@gm.uit.edu.vn, †sonlt@uit.edu.vn
Abstract—The sentiment analysis task has various applications
in practice. In the sentiment analysis task, words and phrases
that represent positive and negative emotions are important.
Finding out the words that represent the emotion from the text
can improve the performance of the classification models for the
sentiment analysis task. In this paper, we propose a methodology
that combines the emotion lexicon with the classification model
to enhance the accuracy of the models. Our experimental results
show that the emotion lexicon combined with the classification
model improves the performance of models.
Index Terms—sentiment analysis, emotion lexicons, text clas-
sification, machine learning, deep learning, transformers models
I. INTRODUCTION
The topic of sentiment analysis (SA) has attracted a lot of
academic interest and research, particularly in the development
of predictive models. SA has various applications in daily life
since it is a tool to monitor opinions from user-generated data
and assist decision-making [1]. The application of sentiment
analysis appeared in many fields, such as e-commerce, social
media, blogs, discussion forums, and education.
In SA tasks, words and phrases that express
negative and positive sentiments play essential roles [2].
According to [3], lexicon methods try to find the "prior
polarity" of a word, while machine learning
methods try to build a generic classifier from a
domain-specific labeled dataset in order to extract the
"contextual polarity" from the text. Both methodologies have
advantages and disadvantages, so the authors in [3] propose
an approach that combines the two to improve
the performance of the sentiment classifier. Following this idea,
in this paper we propose a methodology for integrating emotion
lexicons with machine learning models to enhance the
performance of classifiers for the sentiment analysis task
in the Vietnamese language.
Previous works on Vietnamese sentiment tasks have created
datasets for specific domains such as social networks,
education, and e-commerce, as well as an emotion lexicon. In this paper,
we use three datasets, UIT-VSMEC [4], UIT-VSFC
[5], and ViHSD [6], together with VnEmoLex [7] to investigate
the performance of our approach on the Vietnamese sentiment
analysis task. All three datasets are large-scale and
manually annotated by humans following a strict annotation
procedure, each for a specific domain: social media
(UIT-VSMEC), student feedback in education (UIT-VSFC),
and hate speech detection (ViHSD). VnEmoLex is
an emotion lexicon with eight emotion types that
contains 12,795 emotional words.
Our paper is structured as follows. Section II surveys several
current works on the Vietnamese sentiment analysis task.
Section III takes a brief look at the datasets used in our
experiments. Section IV describes our proposed methodology
for combining the lexicon features with the machine learning
classifier. Section V presents our experimental results and
the error analysis for the proposed method. Finally,
Section VI concludes our work and suggests future work.
II. RELATED WORKS
The sentiment analysis task can be categorized as a text
classification task. Various benchmark datasets have been created to
serve sentiment analysis in Vietnamese for different
domains, such as the VLSP 2018 [8] and UIT-ABSA [9]
datasets for aspect-based sentiment analysis on the restaurant and
hotel domains, the UIT-VSFC [5] dataset for sentiment analysis
on student feedback, the UIT-VSMEC [4] dataset for emotion
classification of user comments on social network sites, the
UIT-ViSFD [10] dataset for aspect-based sentiment analysis of
smartphone feedback, and the ViHSD [6] and VLSP 2019
HSD [11] datasets for hate speech detection on social media
texts. According to [12], hate speech detection and sentiment
analysis are related tasks because both address the negative
and positive sentiment conveyed in a message. We
choose UIT-VSMEC, UIT-VSFC, and ViHSD as the three
datasets for evaluating our proposed methodology.
Besides the annotated datasets, VnEmoLex [7] and
VietSentiWordNet [13] are two lexicons used for the sentiment
analysis task. VnEmoLex covers eight fundamental
emotions, namely joy, sadness, anger, fear, trust,
disgust, surprise, and anticipation, while VietSentiWordNet
covers only three polarity classes: positivity, negativity, and
neutrality. In this paper, we use the VnEmoLex lexicon because
it distinguishes more emotion categories than VietSentiWordNet.
Finally, for each dataset, there are several approaches
to constructing classification models that detect sentiment
from text. The Maximum Entropy model achieved the best
result on the UIT-VSFC dataset [5], the Text-CNN model
obtained the highest result on the UIT-VSMEC dataset [4], and
the BERT model gave the best result on the ViHSD dataset
[6]. Building on the current baseline models for each of the three datasets,
we propose our methodology, which combines the emotion
lexicon with the current classifier to boost performance.
arXiv:2210.02063v3 [cs.CL] 4 Dec 2022
III. VIETNAMESE SENTIMENT ANALYSIS DATASETS
The UIT-VSMEC is created for emotion detection on Viet-
namese social media text [4]. This corpus has a total of
seven levels of emotion as described in Table I. We use this
dataset as the benchmark dataset to evaluate the efficiency
of our proposed methodology. Besides the UIT-VSMEC, we
also analyze our results on the two remaining benchmark datasets,
UIT-VSFC and ViHSD, to emphasize the
effectiveness of our methodology. The UIT-VSFC is created
to analyze the feedback of students about education activity
[5]. This corpus has two tasks: the sentiment-based task for
detecting user emotion from the text about the education
activity and the topic-based task for classifying the categories
belonging to the teaching and learning activities such as
lecturer, facility, and curriculum [5]. In this paper, we use
the sentiment-based task for our experiments. The labels of
the UIT-VSFC are shown in Table I. Finally, the ViHSD is
a dataset created for the hate speech detection task on the
Vietnamese language [6]. This corpus also has three labels as
shown in Table I. All three datasets are manually annotated
by humans with a detailed and strict annotation procedure.
Fig. 1. Distribution of the length of the comments in the three Vietnamese
Sentiment Analysis datasets.
Table I gives a summary of the labels for the three aforementioned
datasets along with percentages and examples. The
distribution of comment lengths in the UIT-VSMEC, UIT-VSFC,
and ViHSD datasets is depicted in Figure 1. It can
be seen from Figure 1 that the average sentence lengths
of the three datasets are similar: 14.01 for
the UIT-VSMEC, 14.31 for the UIT-VSFC, and 11.51 for the
ViHSD. In addition, all three datasets are imbalanced in their
label distribution, according to Table I. For the UIT-VSMEC,
the labels are skewed to Enjoyment, Sadness, and Disgust. For
the UIT-VSFC, the labels are skewed mainly to Positive.
For the ViHSD dataset, the labels are skewed to the CLEAN
label. Additionally, according to the examples shown in Table I,
we found that sentences are frequently short because
users favor brevity in social media texts (apart
from purposeful cases like spam and storytelling). Along with
that, emojis and acronyms are frequently employed to speed
up typing.
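As a small worked illustration of the imbalance described above, the following sketch computes, for each dataset, the ratio between the share of the most frequent label and the share of the rarest one. The percentages are copied from Table I; the function name `imbalance_ratio` and the choice of this particular ratio are our own illustrative assumptions and are not a measure defined elsewhere in this paper.

```python
# Label percentages copied from Table I (values are in percent).
LABEL_PERCENTAGES = {
    "UIT-VSMEC": {
        "FEAR": 5.73, "SURPRISE": 4.36, "ANGER": 7.04, "ENJOYMENT": 28.08,
        "SADNESS": 17.07, "DISGUST": 19.32, "OTHER": 18.40,
    },
    "UIT-VSFC": {"POSITIVE": 49.38, "NEGATIVE": 4.02, "NEUTRAL": 46.60},
    "ViHSD": {"CLEAN": 82.70, "OFFENSIVE": 6.67, "HATE": 10.63},
}


def imbalance_ratio(percentages):
    """Share of the most frequent label divided by the share of the rarest.

    Illustrative helper (an assumption, not a metric from the paper):
    a value of 1.0 would mean perfectly balanced labels.
    """
    shares = list(percentages.values())
    return max(shares) / min(shares)


if __name__ == "__main__":
    for dataset, percentages in LABEL_PERCENTAGES.items():
        print(f"{dataset}: {imbalance_ratio(percentages):.2f}")
```

A ratio well above 1 for every dataset is consistent with the observation that all three label distributions are skewed.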
TABLE I
OVERVIEW STATISTICS OF THE THREE VIETNAMESE SENTIMENT
ANALYSIS DATASETS.
Dataset     Size     Average length   Label        Percentage (%)
UIT-VSMEC   6,927    14.01            FEAR         5.73
                                      SURPRISE     4.36
                                      ANGER        7.04
                                      ENJOYMENT    28.08
                                      SADNESS      17.07
                                      DISGUST      19.32
                                      OTHER        18.40
UIT-VSFC    16,175   14.31            POSITIVE     49.38
                                      NEGATIVE     4.02
                                      NEUTRAL      46.60
ViHSD       33,400   11.51            CLEAN        82.70
                                      OFFENSIVE    6.67
                                      HATE         10.63
In general, although the three datasets have different labels
because each was created for a specific domain task, their
texts share similar characteristics. Hence, we use these three
datasets as the benchmark for evaluating the performance of
our methodology.
IV. METHODOLOGY
The task of sentiment analysis is categorized as a text
classification task. Figure 2 briefly illustrates our methodology,
which includes pre-processing the text, combining the
emotion lexicon with the feature vectors, and fitting them to the
classification models.
Fig. 2. Experimental procedure.
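To make the idea of combining an emotion lexicon with feature vectors more concrete, the sketch below shows one possible way to do it: count lexicon matches per emotion and append those counts to a plain bag-of-words vector. This is only a minimal illustration under our own assumptions. The toy lexicon entries, the function names, and the use of simple concatenation are hypothetical and do not reproduce VnEmoLex or the exact procedure of Figure 2; only the eight emotion names come from the description of VnEmoLex.

```python
from collections import Counter

# The eight emotion types of VnEmoLex, in a fixed order for the feature vector.
EMOTIONS = ["joy", "sadness", "anger", "fear",
            "trust", "disgust", "surprise", "anticipation"]

# Hypothetical toy lexicon for illustration only; real VnEmoLex entries
# and their emotion labels are not reproduced here.
TOY_LEXICON = {
    "vui": {"joy"},
    "buồn": {"sadness"},
    "ghét": {"anger", "disgust"},
}


def lexicon_features(tokens, lexicon=TOY_LEXICON):
    """Count, per emotion, how many tokens of the text appear in the lexicon."""
    counts = Counter()
    for token in tokens:
        for emotion in lexicon.get(token, ()):
            counts[emotion] += 1
    return [counts[emotion] for emotion in EMOTIONS]


def bag_of_words(tokens, vocabulary):
    """Plain term-count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def combined_features(tokens, vocabulary, lexicon=TOY_LEXICON):
    """Concatenate text features with the eight lexicon-based emotion counts."""
    return bag_of_words(tokens, vocabulary) + lexicon_features(tokens, lexicon)


if __name__ == "__main__":
    vocabulary = ["hôm", "nay", "vui"]
    tokens = ["hôm", "nay", "vui", "vui"]
    print(combined_features(tokens, vocabulary))
```

The resulting vector can be passed to any classifier that accepts fixed-length numeric input; concatenation is used here only because it is the simplest combination strategy to show.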
A. Data pre-processing
In [14], the authors proposed seven techniques to pre-process
text based on the characteristics of Vietnamese social
media texts. We adapt those pre-processing techniques for
our experiments. Our pre-processing techniques are described
below:
1) Standardizing words: Misspelled words frequently appear
in social media datasets, for example: “ủaaa” should be
“ủa” (what?), or “đẹppp quáa” should be “đẹp quá” (so
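The word-standardizing step can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the implementation from [14]: it collapses runs of the same letter while ignoring Vietnamese tone marks, so that “quáa” is treated as a repeated “a”. The function names and the set of tone marks handled are our own choices, and rare Vietnamese words with genuinely doubled letters would be shortened incorrectly by this simple rule.

```python
import unicodedata

# Combining marks for the five Vietnamese tones (assumed sufficient here):
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def strip_tone(ch):
    """Remove only the tone mark, keeping vowel-quality marks (e.g. ư, ê, ă)."""
    decomposed = unicodedata.normalize("NFD", ch)
    kept = "".join(c for c in decomposed if c not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def collapse_elongation(text):
    """Collapse repeated letters, comparing letters without their tone marks.

    The first letter of each run is kept, so the tone mark typed on the
    first vowel survives. Digits, spaces, and punctuation are left untouched.
    """
    text = unicodedata.normalize("NFC", text)
    out = []
    prev_key = None
    for ch in text:
        key = strip_tone(ch).lower() if ch.isalpha() else None
        if key is not None and key == prev_key:
            continue  # same base letter as the previous one: drop the repeat
        out.append(ch)
        prev_key = key
    return "".join(out)


if __name__ == "__main__":
    for raw in ["ủaaa", "đẹppp quáa"]:
        print(raw, "->", collapse_elongation(raw))
```

Comparing letters without tone marks is a design choice made for this sketch: a plain "collapse identical characters" rule would leave “quáa” unchanged because “á” and “a” are different code points.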