
Improving Sentiment Analysis By Emotion Lexicon Approach on Vietnamese Texts

An Long Doan1,2,* and Son T. Luu1,2,†

1University of Information Technology, Ho Chi Minh City, Vietnam

2Vietnam National University, Ho Chi Minh City, Vietnam

Email: *19521173@gm.uit.edu.vn,†sonlt@uit.edu.vn

Abstract—The sentiment analysis task has various applications in practice. In this task, words and phrases that express positive and negative emotions are important. Identifying the words that express emotion in a text can improve the performance of classification models for sentiment analysis. In this paper, we propose a methodology that combines an emotion lexicon with a classification model to enhance the accuracy of the model. Our experimental results show that combining the emotion lexicon with the classification model improves model performance.

Index Terms—sentiment analysis, emotion lexicons, text classification, machine learning, deep learning, transformer models

I. INTRODUCTION

Sentiment analysis (SA) has attracted considerable academic interest and research, particularly in the development of predictive models. SA has various applications in daily life since it is a tool for monitoring opinions in user-generated data and assisting decision-making [1]. Sentiment analysis has been applied in many fields, such as e-commerce, social media, blogs, discussion forums, and education.

In SA tasks, words and phrases that express negative and positive sentiment play an essential role [2]. According to [3], lexicon-based methods try to find the "prior polarity" of a word, while machine learning methods build a generic classifier from a domain-specific labeled dataset in order to extract the "contextual polarity" from the text. Both methodologies have advantages and disadvantages, so the authors of [3] propose combining the two to improve the performance of the sentiment classifier. Following this idea, in this paper we propose a methodology that integrates emotion lexicons with machine learning models to enhance the performance of classifiers for sentiment analysis in the Vietnamese language.

Previous work on Vietnamese sentiment tasks has produced datasets for specific domains such as social networks, education, and e-commerce, as well as emotion lexicons. In this paper, we use three datasets, UIT-VSMEC [4], UIT-VSFC [5], and ViHSD [6], together with VnEmoLex [7] to investigate the performance of our approach on the Vietnamese sentiment analysis task. All three are large-scale datasets that were manually annotated by humans under a strict annotation procedure, each for a specific domain: social media (UIT-VSMEC), student feedback in education (UIT-VSFC), and hate speech detection (ViHSD). VnEmoLex is an emotion lexicon with eight emotion types that contains 12,795 emotional words.

Our paper is structured as follows. Section II surveys current work on the Vietnamese sentiment analysis task. Section III takes a brief look at the datasets used in our experiments. Section IV describes our proposed methodology for combining lexicon features with a machine learning classifier. Section V presents our experimental results and an error analysis of the proposed method. Finally, Section VI concludes our work and suggests future work.

II. RELATED WORKS

Sentiment analysis can be treated as a text classification task. Various benchmark datasets have been created for Vietnamese sentiment analysis in different domains, such as the VLSP 2018 [8] and UIT-ABSA [9] datasets for aspect-based sentiment analysis in the restaurant and hotel domains, the UIT-VSFC [5] dataset for sentiment analysis of student feedback, the UIT-VSMEC [4] dataset for emotion classification of user comments on social network sites, the UIT-ViSFD [10] dataset for aspect-based sentiment analysis of smartphone feedback, and the ViHSD [6] and VLSP 2019 HSD [11] datasets for hate speech detection on social media texts. According to [12], hate speech detection and sentiment analysis are related because both deal with the negative and positive sentiment expressed in a message. We choose UIT-VSMEC, UIT-VSFC, and ViHSD as the three datasets for evaluating our proposed methodology.

Besides the annotated datasets, VnEmoLex [7] and VietSentiWordNet [13] are two lexicons used for the sentiment analysis task. VnEmoLex covers eight basic emotions (joy, sadness, anger, fear, trust, disgust, surprise, and anticipation), while VietSentiWordNet covers only three polarities: positivity, negativity, and neutrality. In this paper, we use the VnEmoLex lexicon because it distinguishes more emotion categories than VietSentiWordNet.
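To make the role of an emotion lexicon concrete, the following minimal sketch maps the tokens of a comment to counts over the eight emotion categories listed above. The toy lexicon entries, the function name `emotion_counts`, and the whitespace tokenization are assumptions made for illustration; they are not taken from VnEmoLex or from the authors' implementation.

```python
from collections import Counter

# The eight emotion categories named in the text for VnEmoLex.
EMOTIONS = ["joy", "sadness", "anger", "fear",
            "trust", "disgust", "surprise", "anticipation"]

# Hypothetical toy entries for illustration only; NOT copied from VnEmoLex.
TOY_LEXICON = {
    "vui": {"joy"},        # happy
    "buồn": {"sadness"},   # sad
    "giận": {"anger"},     # angry
    "sợ": {"fear"},        # afraid
}

def emotion_counts(tokens, lexicon=TOY_LEXICON):
    """Return a fixed-length vector of per-emotion token counts."""
    counts = Counter()
    for token in tokens:
        # A word may carry several emotions; unknown words contribute nothing.
        for emotion in lexicon.get(token.lower(), ()):
            counts[emotion] += 1
    return [counts[e] for e in EMOTIONS]

# Naive whitespace tokenization, used here only to keep the sketch short.
tokens = "hôm nay tôi rất vui nhưng cũng hơi buồn".split()
print(dict(zip(EMOTIONS, emotion_counts(tokens))))
```

A lexicon with three polarity classes would yield a three-dimensional vector in the same way, which is one way to see why an eight-category lexicon can carry finer-grained information.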

Finally, for each dataset, several approaches have been used to construct classification models that detect sentiment from text. The Maximum Entropy model achieved the best result on the UIT-VSFC dataset [5], the Text-CNN model obtained the highest result on the UIT-VSMEC dataset [4], and the BERT model gave the best result on the ViHSD dataset [6]. Starting from the current baseline models on each of the three datasets, we propose our methodology, which combines the emotion lexicon with the current classifier to boost performance.

arXiv:2210.02063v3 [cs.CL] 4 Dec 2022

III. VIETNAMESE SENTIMENT ANALYSIS DATASETS

UIT-VSMEC was created for emotion detection on Vietnamese social media text [4]. This corpus has seven emotion labels, as described in Table I. We use this dataset as the benchmark dataset to evaluate the efficiency of our proposed methodology. Besides UIT-VSMEC, we also analyze our results on two further benchmark datasets, UIT-VSFC and ViHSD, to emphasize the effectiveness of our methodology. UIT-VSFC was created to analyze student feedback about educational activities [5]. This corpus has two tasks: the sentiment-based task, which detects user emotion from text about educational activities, and the topic-based task, which classifies feedback into categories of teaching and learning activities such as lecturer, facility, and curriculum [5]. In this paper, we use the sentiment-based task for our experiments. The labels of UIT-VSFC are shown in Table I. Finally, ViHSD is a dataset created for the hate speech detection task in Vietnamese [6]. This corpus also has three labels, as shown in Table I. All three datasets were manually annotated by humans with a detailed and strict annotation procedure.

Fig. 1. Distribution of the length of the comments in the three Vietnamese Sentiment Analysis datasets.

Table I gives a summary of the labels for the three aforementioned datasets along with percentages and examples. The distribution of comment lengths in the UIT-VSMEC, UIT-VSFC, and ViHSD datasets is depicted in Figure 1. Figure 1 shows that the average comment length is similar across the three datasets: 14.01 for UIT-VSMEC, 14.31 for UIT-VSFC, and 11.51 for ViHSD. In addition, all three datasets are imbalanced in their label distribution, according to Table I. For UIT-VSMEC, the labels are skewed toward Enjoyment, Sadness, and Disgust. For UIT-VSFC, the labels are skewed mainly toward Positive. For ViHSD, the labels are skewed toward CLEAN. Additionally, from the examples shown in Table I, we found that sentences are frequently short because users seek brevity when communicating in social media texts (apart from purposeful cases like spam and storytelling). Emojis and acronyms are also frequently used to speed up typing.

TABLE I
OVERVIEW STATISTICS OF THE THREE VIETNAMESE SENTIMENT ANALYSIS DATASETS.

Dataset      Size      Average length   Label        Percentage (%)
UIT-VSMEC    6,927     14.01            FEAR          5.73
                                        SURPRISE      4.36
                                        ANGER         7.04
                                        ENJOYMENT    28.08
                                        SADNESS      17.07
                                        DISGUST      19.32
                                        OTHER        18.40
UIT-VSFC     16,175    14.31            POSITIVE     49.38
                                        NEGATIVE      4.02
                                        NEUTRAL      46.60
ViHSD        33,400    11.51            CLEAN        82.70
                                        OFFENSIVE     6.67
                                        HATE         10.63

In general, although the three datasets have different labels because each was created for a specific domain task, their texts share the same characteristics. Hence, we use these three datasets as the benchmark for evaluating the performance of our methodology.
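The label imbalance reported in Table I can be checked with simple arithmetic. The sketch below copies the sizes and percentages from Table I and derives approximate per-label counts and a majority-to-minority ratio for each dataset. The helper names `approximate_counts` and `imbalance_ratio` are illustrative, and the counts are rounded estimates rather than figures reported by the dataset authors.

```python
# Sizes and label percentages copied from Table I.
DATASETS = {
    "UIT-VSMEC": (6927, {"FEAR": 5.73, "SURPRISE": 4.36, "ANGER": 7.04,
                         "ENJOYMENT": 28.08, "SADNESS": 17.07,
                         "DISGUST": 19.32, "OTHER": 18.40}),
    "UIT-VSFC": (16175, {"POSITIVE": 49.38, "NEGATIVE": 4.02,
                         "NEUTRAL": 46.60}),
    "ViHSD": (33400, {"CLEAN": 82.70, "OFFENSIVE": 6.67, "HATE": 10.63}),
}

def approximate_counts(size, percentages):
    """Estimate the number of comments per label (rounded)."""
    return {label: round(size * pct / 100)
            for label, pct in percentages.items()}

def imbalance_ratio(percentages):
    """Ratio between the most and the least frequent label."""
    return max(percentages.values()) / min(percentages.values())

for name, (size, percentages) in DATASETS.items():
    counts = approximate_counts(size, percentages)
    ratio = imbalance_ratio(percentages)
    print(f"{name}: {counts} (max/min ratio = {ratio:.1f})")
```

A ratio well above 1 for every dataset is consistent with the observation that all three label distributions are skewed.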

IV. METHODOLOGY

The task of sentiment analysis is categorized as a text classification task. Figure 2 briefly illustrates our methodology, which consists of pre-processing the text, combining the emotion lexicon with the feature vectors, and fitting the result to the classification models.

Fig. 2. Experimental procedure.
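As a rough illustration of the step that combines the emotion lexicon with the feature vectors, the sketch below appends per-emotion counts to a plain bag-of-words vector. Concatenation, the bag-of-words representation, the toy lexicon, and all function names are assumptions chosen to keep the example small; the text above does not specify how the authors perform the combination, so this should not be read as their implementation.

```python
from collections import Counter

EMOTIONS = ["joy", "sadness", "anger", "fear",
            "trust", "disgust", "surprise", "anticipation"]

# Hypothetical toy lexicon (word -> one emotion); NOT copied from VnEmoLex.
TOY_LEXICON = {"vui": "joy", "buồn": "sadness", "giận": "anger"}

def build_vocabulary(tokenized_texts):
    """Assign a column index to every distinct token in the corpus."""
    vocab = sorted({tok for toks in tokenized_texts for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}

def bag_of_words(tokens, vocab):
    """Count occurrences of each vocabulary token."""
    vector = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vector[vocab[tok]] += 1
    return vector

def lexicon_features(tokens, lexicon):
    """Count tokens per emotion category, ordered as EMOTIONS."""
    counts = Counter(lexicon[tok] for tok in tokens if tok in lexicon)
    return [counts[e] for e in EMOTIONS]

def combined_features(tokens, vocab, lexicon):
    """Assumed combination: text features followed by lexicon features."""
    return bag_of_words(tokens, vocab) + lexicon_features(tokens, lexicon)

corpus = [["phim", "này", "vui"], ["tôi", "buồn", "quá"]]
vocab = build_vocabulary(corpus)
for tokens in corpus:
    print(combined_features(tokens, vocab, TOY_LEXICON))
```

Each resulting vector has `len(vocab) + 8` dimensions and can be passed to any classifier that accepts fixed-length numeric features.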

A. Data pre-processing

In [14], the authors proposed seven techniques to pre-process text based on the characteristics of Vietnamese social media texts. We adapt those pre-processing techniques for our experiments, as described below:

1) Standardizing words: Misspelled words frequently appear

in social media datasets, for example: “ủaaa” should be

“ủa” (what?), or “đẹppp quáa” should be “đẹp quá” (so
