Improving Sentiment Analysis By Emotion Lexicon
Approach on Vietnamese Texts
An Long Doan1,2,* and Son T. Luu1,2,†
1University of Information Technology, Ho Chi Minh City, Vietnam
2Vietnam National University, Ho Chi Minh City, Vietnam
Email: *19521173@gm.uit.edu.vn,†sonlt@uit.edu.vn
Abstract—The sentiment analysis task has various applications
in practice. In this task, words and phrases that represent
positive and negative emotions are important. Identifying the
words that express emotion in a text can improve the
performance of classification models for sentiment analysis.
In this paper, we propose a methodology that combines an
emotion lexicon with a classification model to enhance the
accuracy of the model. Our experimental results show that
combining the emotion lexicon with the classification model
improves the performance of the models.
Index Terms—sentiment analysis, emotion lexicons, text
classification, machine learning, deep learning, transformer models
I. INTRODUCTION
The topic of sentiment analysis (SA) has attracted a lot of
academic interest and research, particularly in the development
of predictive models. SA has various applications in daily life
since it is a tool to monitor opinions from user-generated data
and assist decision-making [1]. Sentiment analysis has been
applied in many fields, such as e-commerce, social media,
blogs, discussion forums, and education.
In the SA tasks, words and phrases which represent the
negative and positive sentiments play essential roles [2].
According to [3], lexicon-based methods try to find the "prior
polarity" of a word, while machine learning methods try to
build a generic classifier from a domain-specific labeled
dataset in order to extract the "contextual polarity" from the
text. Each methodology has advantages and disadvantages, so
the authors in [3] propose an approach that combines the two
to improve the performance of the sentiment classifier.
Therefore, in this paper, we propose a methodology for
integrating emotion lexicons with machine learning models to
enhance the performance of classifiers for the sentiment
analysis task in the Vietnamese language.
Previous works on Vietnamese sentiment tasks have created
datasets for specific domains, such as social networks,
education, and e-commerce, as well as emotion lexicons. In this paper,
we use three datasets, including UIT-VSMEC [4], UIT-VSFC
[5], and ViHSD [6] with the VnEmoLex [7] to investigate
the performance of our approach on the Vietnamese sentiment
analysis task. All three are large-scale datasets that were
manually annotated under a strict annotation procedure, each
for a specific domain: social media (UIT-VSMEC), student
feedback in education (UIT-VSFC), and hate speech detection
(ViHSD). VnEmoLex is an emotion lexicon that covers eight
emotion types and contains 12,795 emotional words.
Our paper is structured as follows. Section II surveys several
recent works on the Vietnamese sentiment analysis task.
Section III briefly describes the datasets used in our
experiments. Section IV describes our proposed methodology
for combining lexicon features with the machine learning
classifier. Section V presents our experimental results and
an error analysis of the proposed method. Finally,
Section VI concludes our work and suggests future work.
II. RELATED WORKS
Sentiment analysis can be framed as a text classification
task. Various benchmark datasets have been created for
sentiment analysis in Vietnamese across different domains,
such as the VLSP 2018 [8] and UIT-ABSA [9] datasets for
aspect-based sentiment analysis on the restaurant and hotel
domains, the UIT-VSFC [5] dataset for sentiment analysis on
student feedback, the UIT-VSMEC [4] dataset for emotion
classification of user comments on social network sites, the
UIT-ViSFD [10] dataset for aspect-based sentiment analysis of
smartphone feedback, and the ViHSD [6] and VLSP 2019 HSD [11]
datasets for hate speech detection on social media texts.
According to [12], hate speech detection and sentiment
analysis are related tasks because both deal with the negative
and positive sentiment conveyed by a message. We choose
UIT-VSMEC, UIT-VSFC, and ViHSD as the three datasets for
evaluating our proposed methodology.
Besides the annotated datasets, VnEmoLex [7] and
VietSentiWordNet [13] are two lexicons used for the sentiment
analysis task. VnEmoLex covers eight fundamental emotions,
namely joy, sadness, anger, fear, trust, disgust, surprise,
and anticipation, while VietSentiWordNet contains only three
levels: positivity, negativity, and neutrality. In this paper,
we use the VnEmoLex lexicon because it distinguishes more
emotion types than VietSentiWordNet.
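To make the idea of an eight-category emotion lexicon concrete, the following is a minimal sketch of how such a lexicon could be turned into a count vector for a piece of text. It is an illustration only and is not the method proposed in this paper: the name `TOY_LEXICON`, its three entries, the function `emotion_counts`, and the whitespace tokenization are all assumptions made for this example, and the entries are not taken from VnEmoLex.

```python
from collections import Counter

# The eight emotion categories listed above, in a fixed order.
EMOTIONS = ["joy", "sadness", "anger", "fear",
            "trust", "disgust", "surprise", "anticipation"]

# Hypothetical toy entries for illustration; NOT taken from VnEmoLex.
TOY_LEXICON = {
    "vui": {"joy"},
    "buồn": {"sadness"},
    "giận": {"anger", "disgust"},
}

def emotion_counts(text, lexicon=TOY_LEXICON):
    """Count lexicon hits per emotion category in `text`.

    Assumes naive whitespace tokenization; a real Vietnamese pipeline
    would normally use a proper word segmenter.
    """
    counts = Counter()
    for token in text.lower().split():
        for emotion in lexicon.get(token, ()):
            counts[emotion] += 1
    # Return the counts in the fixed EMOTIONS order.
    return [counts[emotion] for emotion in EMOTIONS]

features = emotion_counts("hôm nay tôi rất vui nhưng cũng hơi buồn")
```

A vector of this kind could, for instance, be concatenated with other text features before classification; whether and how that is done here is described in Section IV, not in this sketch.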
Finally, for each dataset, several approaches have been used
to construct classification models that detect sentiment
from text. The maximum entropy model achieved the best
result on the UIT-VSFC dataset [5], the Text-CNN model
obtained the highest result on the UIT-VSMEC dataset [4], and
arXiv:2210.02063v3 [cs.CL] 4 Dec 2022