Machine and Deep Learning Methods with Manual and Automatic
Labelling for News Classification in Bangla Language
Istiak Ahmad 1, Fahad AlQurashi 1, and Rashid Mehmood 2,*
1Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University,
Jeddah 21589, Saudi Arabia
2High Performance Computing Center, King Abdulaziz University, Jeddah 21589, Saudi Arabia
*Corresponding author: RMehmood@kau.edu.sa
ABSTRACT
Research in Natural Language Processing (NLP) has become increasingly important due to applications
such as text classification, text mining, sentiment analysis, POS tagging, named entity recognition,
textual entailment, and many others. This paper introduces several machine and deep learning
methods with manual and automatic labelling for news classification in the Bangla language. We
implemented several machine learning (ML) and deep learning (DL) algorithms. The ML algorithms are
Logistic Regression (LR), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM),
Random Forest (RF), and K-Nearest Neighbour (KNN), used with Bag of Words (BoW), Term
Frequency-Inverse Document Frequency (TF-IDF), and Doc2Vec embedding models. The DL algorithms
are Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent
Unit (GRU), and Convolutional Neural Network (CNN), used with Word2Vec, GloVe, and FastText
word embedding models. We develop automatic labelling methods using Latent Dirichlet Allocation
(LDA) and investigate the performance of single-label and multi-label article classification methods.
To investigate performance, we developed from scratch Potrika, the largest and the most extensive
dataset for news classification in the Bangla language, comprising 185.51 million words and 12.57
million sentences contained in 664,880 news articles in eight distinct categories, curated from six
popular online news portals in Bangladesh for the period 2014-2020. GRU with FastText achieves the
highest accuracy, 91.83%, for manually labelled data. For the automatic labelling case, KNN with
Doc2Vec achieves the highest accuracy, 57.72% for single-label and 75% for multi-label data.
The methods developed in this paper are expected to advance research in Bangla and
other languages.
Keywords Natural Language Processing · news classification · Bangla language · word embedding · machine
learning · deep learning · automatic labelling · single label classification · multi-label classification
1 Introduction
The primary objective of text classification is to determine the class or sentiment of unseen texts. We can define
the problem as follows. Assume we have n texts, x = {x1, x2, ..., xn}, and each of them is assigned a category from a
set of categorical values l, where l = {l1, l2, ...}. The training dataset is used to fit a classification model,
which relates the features of a text to one of the class labels. The trained model can then predict the unknown class
of a new text. Typically, texts are not tagged; we have to tag them manually, which is the most time-consuming and
challenging task. Without tagged texts, it is difficult to develop a classification model. Text classification
has been applied successfully in many areas, such as sentiment analysis [1], information retrieval [2],
information filtering, knowledge management, document summarization [3], spam mail detection [4], recommender
systems, and many others.
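The problem definition above can be illustrated with a minimal sketch. It assumes toy English sentences as stand-ins for labelled news texts, whitespace tokenization, and Jaccard token overlap as the similarity measure; none of these choices come from the paper, and the function names are hypothetical. The classifier is a plain K-Nearest Neighbour vote, chosen only because it makes the mapping from labelled texts (x_i, l_i) to a predicted label easy to follow.

```python
from collections import Counter

def tokenize(text):
    # Whitespace tokenization; real Bangla text would need proper preprocessing.
    return set(text.lower().split())

def jaccard(a, b):
    # Overlap between two token sets, in [0, 1].
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_predict(train_texts, train_labels, query, k=3):
    # Rank training texts by similarity to the query and take a majority
    # vote among the k most similar ones.
    q = tokenize(query)
    ranked = sorted(
        zip(train_texts, train_labels),
        key=lambda pair: jaccard(tokenize(pair[0]), q),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy labelled texts (x_i, l_i); purely illustrative, not taken from Potrika.
train_texts = [
    "team wins the cricket match",
    "striker scores in football final",
    "captain praises cricket team",
    "central bank raises interest rates",
    "stock market falls as rates rise",
    "bank reports growth in exports",
]
train_labels = ["Sports", "Sports", "Sports", "Economy", "Economy", "Economy"]

print(knn_predict(train_texts, train_labels, "cricket team wins final"))
```

The training pairs play the role of the manually tagged texts; without them, the vote in `knn_predict` has nothing to draw on, which is the practical cost of missing labels described above.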
About 238 million people throughout the world speak Bangla natively or as a second language (2021) [5]. As a result,
the language has carved out a niche for itself in various information-exchange media. With its large number
of online newspapers, blogs, Wikipedia articles, eBooks, literary works, and so on, Bangla can be expected to become
an active area of NLP research in the near future. Each day, many events happen around the world, and some of those
events become trending discussion topics for a certain time. Most news media focus on presenting the
most popular events at any given time. Everyone wishes to follow the most influential and frequently discussed events among
arXiv:2210.10903v1 [cs.AI] 19 Oct 2022
Ahmad et al.
a large number of events happening around us at a specific time. Text analysis can automatically detect the most
current discussion topics and events with greater precision and speed. The research on text analysis.
This paper introduces several machine and deep learning methods with manual and automatic labelling for news
classification in the Bangla language. In the case of manual labelling, we implemented several machine learning (ML)
and deep learning (DL) algorithms. The ML algorithms are Logistic Regression (LR), Stochastic Gradient Descent (SGD),
Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbour (KNN), used with Bag of Words (BoW),
Term Frequency-Inverse Document Frequency (TF-IDF), and Doc2Vec embedding models. The DL algorithms are
Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Convolutional
Neural Network (CNN), used with Word2Vec, GloVe, and FastText word embedding models. To address the challenges
related to the arduous task of manual labelling, we develop automatic labelling methods using Latent Dirichlet
Allocation (LDA), an unsupervised topic modelling algorithm, and investigate the performance of single-label and
multi-label article classification methods.
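As a brief illustration of the TF-IDF representation named above, the following sketch computes the textbook weighting tf(t, d) × log(N / df(t)) in plain Python. It is not the implementation used in the paper; library implementations (for example in scikit-learn) apply smoothing and normalization that differ from this form, and the toy documents and the function name `tf_idf` are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            # Term frequency (relative count) times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted

# Toy tokenized documents; illustrative only.
docs = [
    "cricket team wins match".split(),
    "bank raises interest rates".split(),
    "cricket board raises ticket prices".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document receives weight zero under this formula, while a term confined to one document is weighted most heavily. Bag of Words corresponds to keeping only the raw counts in `counts`.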
We developed Potrika – the largest and the most extensive dataset for news classification in the Bangla language
– comprising 185.51 million words and 12.57 million sentences contained in 664,880 news articles, and used it to
investigate the proposed ML and DL methods [6]. Potrika is a single-label news article textual dataset in the Bangla
language curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq,
Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct
categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology).
GRU with FastText achieves the highest accuracy, 91.83%, for manually labelled data. For the automatic labelling
case, KNN with Doc2Vec achieves the highest accuracy, 57.72% for single-label and 75% for multi-label data.
The lower performance of classification based on automatic labelling is attributed to its use of ML algorithms only,
whereas the best performance on manually labelled data was obtained using a
DL algorithm. We will extend our work in the future to include DL methods for automatic labelling.
The NLP methods developed in this paper and the techniques for their extensive analyses are expected to advance
research in Bangla and other languages.
Hardware and Software: We use a Quadro RTX 6000 GPU, which has 4608 CUDA parallel-processing cores,
576 tensor cores, 72 RT cores, and 24 GB of GDDR6 memory. We use Python as the programming
language along with machine and deep learning libraries such as TensorFlow, Keras, scikit-learn, and Gensim. We use
the data visualization libraries Seaborn and Matplotlib to visualize the evaluation results.
The paper is structured as follows. Section 2 reviews the literature on text classification, covering
Bangla text classification (Sections 2.1 to 2.3) and text classification in other languages (Section 2.4). Section 3
presents our methodology, including the overview of the methodology and framework architecture (Section 3.1),
the dataset (Section 3.2), preprocessing (Section 3.3), feature extraction (Section 3.4), and word embedding
techniques (Section 3.5). Machine and deep learning methods for manual labelling are described in Sections 3.6
and 3.7. Section 3.8 explains the methodology for creating the automatically labelled dataset using the
unsupervised topic modeling method, and Section 3.9 discusses the methodology of multi-label news article
classification. The results of all proposed methods are discussed in Sections 4 and 5, which cover the manual
labeling results and the automatic single-label and multi-label news article classification results,
respectively. Section 6 discusses the findings of the paper. Finally, in Section 7,
we conclude with recommendations for further research.
2 Literature Review
To address text classification [7], several machine and deep learning-based approaches have been introduced. In this
section, we review how Bangla text has been classified using sentiment analysis, multi-domain, and topic modeling
methods. We also review several methods for classifying text in other languages.
2.1 Sentiment Classification
The core idea of sentiment analysis, or opinion mining, is to analyse whether a given text expresses a
positive, negative, or neutral meaning. For sentiment analysis in Bangla, TF-IDF was applied to a small dataset using
machine learning algorithms (see [8, 9, 10]). The word embedding method word2vec was proposed by [11] for
Bangla sentiment analysis based on Bangla comments. Bangla tweet data has also been used for sentiment analysis. For
example, Asimuzzaman et al. [12] used an adaptive neuro-fuzzy system for Bangla tweet classification. For sentiment
detection, Hasan et al. [13] proposed WordNet and SentiWordNet as tools, but the major limitation of this research was
that these tools were developed specifically for English. Tuhin et al. [14] predicted six individual emotions using ML
algorithms such as SVM and NB. Further, NB, DT, KNN, SVM, and K-means clustering were used by Rahman et
al. [15] to predict some basic emotions from text. In addition, mutual information-based feature selection methods
and the multi NB algorithm were proposed by Paul et al. [16] for predicting sentiment polarity. N-gram and SVM based
Bangla sentiment analysis was proposed by Taher et al. [17]. A popular English tool called VADER was used by Amin
et al. [18] to predict Bangla sentiment.
A deep learning-based algorithm, LSTM, was proposed by Hassan et al. [19] for sentiment analysis, where they
used a dataset of 10k Bangla and romanized Bangla texts (BRBT) with binary and categorical cross-entropy loss functions.
Furthermore, a CNN-based method was proposed by Alam et al. [20].
2.2 Multi-domain Text Classification
Alam et al. [21] presented a new dataset of Bengali news articles containing about 350K articles in five categories
(State, Economy, International, Entertainment, and Sports). In their dataset, 65% of the data is labelled as State, and
13.5%, 8.5%, 8%, and 5% are labelled as Sports, International, Entertainment, and Economy, respectively. They
applied machine learning algorithms with two text representation techniques, Word2Vec and TF-IDF. In another
study, the Word2vec embedding model was implemented with KNN and SVM classification algorithms by Ahmed et
al. [22] for news document classification. A classification technique using cosine similarity and Euclidean distance,
evaluated on a set of 1000 documents, was proposed by Dhar et al. [23]. They determine the β0 threshold using the 90th
percentile formula for both distance measures and calculate the score based on the distance. In another study, a
dimensionality reduction technique with TF-IDF (40% of TF) was developed by Dhar et al. [24], who used a total of
1960 Bangla text documents from five categories (Sports, Business, Science, Medical, and State) with 632,924 tokens
and applied machine learning algorithms. The classification algorithm LIBLINEAR achieved the highest accuracy. For
40 thousand news samples divided into 12 categories, Mojumder et al. [25] suggested DL algorithms such as BiLSTM,
CNN, and convolutional BiLSTM, with fastText as the word embedding technique. Bangla article classification based
on transformers was proposed by Alam et al. [26], who used multilingual transformer models to classify
Bangla text in several domains.
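The class shares reported for the dataset of Alam et al. [21] indicate a strong imbalance, which affects how accuracy figures on such data should be read. The short calculation below uses only the percentages quoted above. It assumes, for illustration, that a test split would follow the same distribution; the variable names are our own.

```python
# Category shares (in percent) as reported for the dataset of Alam et al. [21].
shares = {
    "State": 65.0,
    "Sports": 13.5,
    "International": 8.5,
    "Entertainment": 8.0,
    "Economy": 5.0,
}

total = sum(shares.values())  # 100.0, so the five shares cover the whole dataset

# Accuracy of a trivial classifier that always predicts the largest class,
# assuming the evaluation data follows the same distribution.
majority_class = max(shares, key=shares.get)
majority_baseline = shares[majority_class] / total  # 0.65

# Ratio between the largest and the smallest class.
imbalance_ratio = max(shares.values()) / min(shares.values())  # 13.0

print(majority_class, majority_baseline, imbalance_ratio)
```

Under that assumption, a model that reaches 65% accuracy on such data does no better than always predicting State, so per-class metrics are more informative than overall accuracy.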
2.3 Topic Modeling-based Text Classification
Little research has been done on classifying Bangla text using topic modeling. Helal and Mouhoub [27] find the key
topics in a Bangla news corpus using LDA with a bigram model and classify them by applying similarity measures.
They evaluated the proposed model using the LDA and Doc2Vec models and compared the similarity scores. They
point out that for some specific news articles, LDA performs better than the Doc2Vec model. Alam et al. [28]
also proposed an LDA-based topic modeling algorithm using 70k Bangla news articles. They detect five distinct news
article topics (National, Sports, International, Technology, and Economy) and another topic called ‘others’, which
covers articles outside these five topics.
Most of the above research applied machine learning algorithms to small datasets, and there has been
remarkably little work on deep learning algorithms for Bangla article classification because there is no comprehensive
dataset for Bangla article classification.
2.4 Text Classification for Other Languages
This section provides an overview of text classification methods for other languages. Shaw et al. [29] implemented
ML techniques including random forest, KNN, and logistic regression to classify news into five categories
(Entertainment, Business, Politics, Sports, and Technology) based on the BBC news dataset. Among these algorithms,
logistic regression performed best for all the categories. Another
study of three machine learning algorithms, namely SVM, neural network, and decision tree, was carried out by
Raychaudhuri et al. [30] for text classification. The authors used the UCI dataset on US congressional voting, which
consists of 16 features and 435 instances, split into 335 training, 50 testing, and 50 validation examples.
They varied the parameter C, which controls the training error. When C=1, SVM
performed better than the neural network, and when C=1000, the neural network performed better. The outcome also
revealed that a fully grown decision tree produced better results than a smaller decision tree.
Data augmentation is most popular in computer vision research, where it is used when the amount of data is small
or imbalanced. Recently, text data augmentation has been explored for small text datasets. Wei and Zou [31]
proposed this technique to improve text classification performance. They propose the following operations for
data augmentation: synonym replacement, random insertion of synonyms of a word, random swapping of two words'
positions in a sentence, and random removal of words in a sentence.
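The four augmentation operations listed above can be sketched as follows. This is an illustrative reading of the description, not the reference implementation of Wei and Zou [31]. The synonym table is a hypothetical hand-written dictionary standing in for a real lexical resource, and an explicit `random.Random` instance is passed in so that results are reproducible.

```python
import random

def synonym_replacement(words, synonyms, n, rng):
    # Replace up to n words that have an entry in the synonym table.
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out

def random_insertion(words, synonyms, n, rng):
    # n times: pick a word with a known synonym and insert that synonym
    # at a random position in the sentence.
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in synonyms]
        if not candidates:
            break
        word = rng.choice(candidates)
        out.insert(rng.randrange(len(out) + 1), rng.choice(synonyms[word]))
    return out

def random_swap(words, n, rng):
    # n times: swap the positions of two distinct, randomly chosen words.
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    # Drop each word with probability p, but never return an empty sentence
    # for non-empty input.
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
sentence = "the fast fox jumps".split()
synonyms = {"fast": ["quick", "rapid"]}  # hypothetical synonym table

print(synonym_replacement(sentence, synonyms, 1, rng))
print(random_insertion(sentence, synonyms, 2, rng))
print(random_swap(sentence, 1, rng))
print(random_deletion(sentence, 0.25, rng))
```

Each function returns a new list and leaves the input sentence unchanged, so several augmented variants can be generated from one original text.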
Recently, the attention mechanism has become an efficient approach for identifying the important information needed
to achieve excellent outcomes. Numerous studies have been carried out on attention mechanisms and architectures. For
text classification, several novel methods have also been proposed [32, 33, 34, 35]. An attention-based LSTM network was
proposed by Zhou et al. [32] to classify cross-lingual sentiments, where they used English and Chinese as the source
and target languages, respectively. A Convolutional-Recurrent Attention Network (CRAN) was proposed by Du et
al. [33]. Their architecture includes a text encoder using an RNN and an attention extractor using a CNN. The
experimental results show that the model effectively extracts the salient parts of sentences and improves
sentence classification performance. Liu et al. [34] proposed an attention-based convolutional layer and BiLSTM
architecture, where the attention mechanism provides a focus for the output of the hidden layers. The BiLSTM is used to
extract both preceding and following context, while the convolutional layer retrieves higher-level phrase
representations from the word embedding vectors. Their model achieves comparable results on all the benchmark datasets.
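The common idea behind these models, weighting hidden states so that the most informative positions dominate the sentence representation, can be sketched with generic dot-product attention. This is a simplified stand-in and not the specific architecture of any of [32, 33, 34]. The query vector, which would normally be learned, is a hand-picked assumption here.

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the maximum before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    # Score each time step by its dot product with the query vector,
    # normalize the scores into weights, and return the weighted sum.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context

# Three toy hidden states (one per token) and an assumed query vector.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
query = [1.0, 0.0]
weights, context = attention_pool(hidden_states, query)
print(weights, context)
```

The third hidden state aligns best with the query and therefore receives the largest weight, which is the sense in which attention provides a focus over the hidden layer outputs.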
State-of-the-art graph-based neural network methods for text classification have been gaining increasing attention
recently. A text graph convolutional network (TextGCN) was proposed by Yao et al. [36], which is notable for
working with a small training corpus. To train the TextGCN, a single text graph for the whole corpus was built from
word co-occurrence and word-document relations. A tensor graph convolutional network
(TensorGCN) was proposed by Liu et al. [37]. They build a text graph tensor based on semantic, syntactic,
and sequential contextual information. Two types of propagation learning are then performed on the text graph
tensor: intra-graph propagation, which aggregates information from neighboring nodes, and inter-graph propagation,
which tunes heterogeneous information between graphs.
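To make the word co-occurrence part of such a text graph concrete, the sketch below weights word-word edges by pointwise mutual information (PMI) computed over sliding windows and keeps only positive values, which is how we understand the TextGCN construction. The window size, the toy documents, and the function name `pmi_edges` are assumptions; word-document edges (weighted by TF-IDF in TextGCN) are omitted.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window_size=3):
    """Return {(word_a, word_b): PMI} for word pairs with positive PMI."""
    # Collect sliding windows; a document shorter than the window is one window.
    windows = []
    for doc in docs:
        if len(doc) <= window_size:
            windows.append(doc)
        else:
            windows.extend(
                doc[i:i + window_size] for i in range(len(doc) - window_size + 1)
            )
    n_windows = len(windows)

    word_count = Counter()  # number of windows containing each word
    pair_count = Counter()  # number of windows containing both words
    for window in windows:
        unique = sorted(set(window))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))

    edges = {}
    for (a, b), n_ab in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ) with probabilities over windows.
        pmi = math.log(n_ab * n_windows / (word_count[a] * word_count[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges

docs = [["a", "b"], ["a", "b"], ["c", "d"]]
print(pmi_edges(docs))
```

Pairs that co-occur no more often than chance would predict get a PMI at or below zero and produce no edge, which keeps the graph sparse.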
The capsule network is another state-of-the-art method for text classification that is related to CNNs. Several studies
based on the capsule network have been conducted [38, 39, 40]. In capsule networks, capsules are locally invariant
groups of neurons that learn to recognize the presence of visual entities and encode their characteristics into
vectors. A capsule also requires a nonlinear function called squashing, whereas neurons in a CNN act independently.
Equivariance and dynamic routing are the two most essential characteristics that distinguish capsule networks from
standard neural networks. Capsule-network-based text classification methods with dynamic and static routing were
proposed by Kim et al. [39]; static routing achieved higher accuracy than dynamic routing. Yang et al. [38] introduced
a cross-domain capsule network and illustrated transfer learning applications from single-label to multi-label
text classification and cross-domain sentiment classification. An attention mechanism-based capsule network system
called Deep Refinement was suggested by Jain et al. [40]. Their method achieved 96% accuracy for text
classification, in comparison with BiLSTM, SVM, and C-BiLSTM, on the Quora insincere question dataset.
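The squashing nonlinearity mentioned above is commonly written as v = (‖s‖² / (1 + ‖s‖²)) · (s / ‖s‖), which keeps the direction of a capsule's input vector s and maps its length into the interval [0, 1). The following sketch implements this standard form; the small `eps` term is our addition to avoid division by zero and is not part of the formula.

```python
import math

def squash(s, eps=1e-9):
    # Scale vector s so that its length lies in [0, 1) while its
    # direction is preserved: short vectors shrink towards zero,
    # long vectors approach unit length.
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]

v = squash([3.0, 4.0])
length = math.sqrt(sum(x * x for x in v))
print(v, length)
```

For the input [3, 4], whose length is 5, the output length is 25/26, so a long input vector maps to a length just below 1. That length can then be read as the probability that the entity represented by the capsule is present.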
Traditional text classification techniques use manually labelled datasets, whose creation is monotonous and
time-consuming. A few dataless text classification techniques, for example, the Laplacian seed word topic model
(LapSWTM) [41] and the seed-guided multi-label topic model (SMTM) [42], have recently been proposed to address this
challenge. Anantharaman et al. [43] proposed classifying long and short texts using non-negative matrix factorization,
LDA, and LSA (latent semantic analysis). LSA with TF-IDF was proposed by Neogi et al. [44] for text classification; to
increase the accuracy, they used entropy. A self-training, LDA-based semi-supervised text classification method was
proposed by Pavlinek et al. [45].
2.5 Research Gap, Novelty, and Contributions
Text datasets, often known as corpora, are used to study linguistic phenomena and tasks including text classification,
morphological structure, word sense disambiguation, language evolution over time, and spell checking. The quality and
size of the corpus have a big impact on the research output. A well-structured, comprehensive corpus can yield far
superior results. Compared with English, Bangla has been studied inadequately because of the scarcity
of Bangla corpora and the language's complicated grammatical structure. In this paper, our contributions are as follows:
We are the first to use a comprehensive Bangla newspaper article dataset called Potrika [6, 46] to classify eight
distinct news article classes, including Education, Entertainment, Sports, Politics, National, International,
Economy, and Science & Technology.
We implement both machine learning (ML) algorithms, including logistic regression, SGD, SVM, RF, and KNN,
and deep learning (DL) algorithms, including CNN, LSTM, BiLSTM, and GRU, for single-label news
article classification. We use BoW, TF-IDF, and Doc2Vec models with the ML algorithms.
For the DL algorithms, we apply word embedding models, namely word2vec, GloVe, and fastText, that were
trained on the Potrika dataset. These word embedding models are valuable not only for news article
classification but also for other NLP tasks such as text summarization, named entity recognition, Bangla automatic
word prediction, and question-answering systems. Further, we evaluate and scrutinise the results for
both cases.
Manual labeling is the most difficult and time-consuming task in creating classification datasets. In this
paper, we investigate the possibility of using a topic modeling algorithm to automatically label the news
article dataset and compare the performance of the automatically labelled dataset with that of the manually
labelled dataset. Additionally, we develop a multi-label dataset based on the automatically labelled
dataset and evaluate the performance of multi-label news article classification.
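One way to picture the step from topic modeling to labels is sketched below. It assumes that a topic model such as LDA has already produced a probability distribution over topics for each article and that each topic has been mapped to a category name. The argmax rule for single labels and the 0.25 threshold for multiple labels are illustrative assumptions, not the rules used in this paper, and the function names are hypothetical.

```python
def single_label(topic_probs, topic_names):
    # Assign the category of the most probable topic.
    best = max(range(len(topic_probs)), key=topic_probs.__getitem__)
    return topic_names[best]

def multi_label(topic_probs, topic_names, threshold=0.25):
    # Assign every category whose topic probability reaches the threshold;
    # fall back to the single best category if none does.
    labels = [name for p, name in zip(topic_probs, topic_names) if p >= threshold]
    return labels or [single_label(topic_probs, topic_names)]

# Hypothetical topic-to-category mapping and one article's topic distribution.
topic_names = ["National", "Sports", "Politics", "Economy"]
topic_probs = [0.05, 0.55, 0.30, 0.10]

print(single_label(topic_probs, topic_names))
print(multi_label(topic_probs, topic_names))
```

In this toy case the single label is Sports, while the multi-label rule also keeps Politics because its topic probability is above the threshold. This illustrates why an article that mixes themes may be better served by multiple labels.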
The NLP work proposed in this paper builds on our earlier NLP works applied to several sectors and multiple languages
including transportation [47, 48, 49], healthcare [50, 51], education [52, 53], and smart cities [54, 55, 56, 57]. We
expect that this paper will significantly increase the impact of our work particularly in the Bangla language.
3 Methodology and Design
In this section, we describe our methodology for the research presented in this paper. We begin with an overview of
our methodology in Section 3.1 followed by a description of the dataset in Section 3.2. Data preprocessing and feature
extraction are explained in Sections 3.3 and 3.4. Word embedding models are discussed in Section 3.5. Machine
learning and deep learning techniques are discussed in Sections 3.6 and 3.7, respectively. Subsequently, we explain our
methodology for automatically creating labels for the news items. The details of automatic labeling with single labels
are provided in Section 3.8 and the details of marking news items with multiple labels are given in Section 3.9.
Figure 1: System Process for Article Classification Overview
3.1 Methodology Overview
As mentioned earlier, the aim of the paper is to investigate the performance of machine and deep learning-based news
classification in the Bangla language using manual and automatic labeling of news documents. Towards this end, we
explore, firstly, the classification of manually labelled news data using machine and deep learning algorithms and,
secondly, the classification of automatically labelled news items using single label and multi-label approaches. The
automatic labeling is done using topic modeling. The overview and detailed architecture of our methodology are
depicted in Figure 1 and Figure 2 and its algorithmic flow is provided in Algorithm 1.
Algorithm 1 Master Algorithm
Input: Read trainDF, testDF, potrikaDF
Output: Evaluation of news article classification
1: cleanTrText, cleanTsText, cleanText ← preprocessing(trainDF, testDF, potrikaDF)
2: evaluation ← man_ML(cleanTrText, trainDF.class, cleanTsText, testDF.class)
3: w2vmodel, glovemodel, fasttextmodel ← getWordEmbeddingModels(cleanText)
4: evaluation ← man_DL(w2vmodel, glovemodel, fasttextmodel)
5: evaluation, autoLabelingDF ← auto_singleLabel(trainDF, testDF)
6: evaluation ← auto_multiLabel(autoLabelingDF)
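The data flow of Algorithm 1 can be mirrored in a short Python skeleton. The step names follow the algorithm, but their bodies are not specified in this section, so they are passed in as callables and replaced by hypothetical stubs here. Plain dictionaries stand in for the data frames.

```python
def run_pipeline(train_df, test_df, potrika_df, steps):
    """Run the six steps of Algorithm 1; `steps` maps step names to callables."""
    results = {}
    # Step 1: clean the training, test, and full-corpus texts.
    clean_tr, clean_ts, clean_all = steps["preprocessing"](train_df, test_df, potrika_df)
    # Step 2: ML classification on manually labelled data.
    results["manual_ml"] = steps["man_ML"](
        clean_tr, train_df["class"], clean_ts, test_df["class"]
    )
    # Steps 3 and 4: train word embeddings on the full corpus, then run DL models.
    w2v, glove, fasttext = steps["getWordEmbeddingModels"](clean_all)
    results["manual_dl"] = steps["man_DL"](w2v, glove, fasttext)
    # Step 5: automatic single-label classification; also yields the labelled data.
    results["auto_single"], auto_df = steps["auto_singleLabel"](train_df, test_df)
    # Step 6: multi-label classification built on the automatically labelled data.
    results["auto_multi"] = steps["auto_multiLabel"](auto_df)
    return results

# Hypothetical stubs that only show which values move between the steps.
stub_steps = {
    "preprocessing": lambda tr, ts, full: (tr["text"], ts["text"], full["text"]),
    "man_ML": lambda x_tr, y_tr, x_ts, y_ts: "ml-report",
    "getWordEmbeddingModels": lambda texts: ("w2v", "glove", "fasttext"),
    "man_DL": lambda w2v, glove, fasttext: "dl-report",
    "auto_singleLabel": lambda tr, ts: ("auto-single-report", {"text": tr["text"]}),
    "auto_multiLabel": lambda auto_df: "auto-multi-report",
}
toy = {"text": ["..."], "class": ["Sports"]}
results = run_pipeline(toy, toy, toy, stub_steps)
print(results)
```

Keeping the four evaluations under separate keys, rather than reusing one `evaluation` variable as the pseudocode does, makes it easier to compare the manual and automatic labelling results afterwards.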