Machine and Deep Learning Methods with Manual and Automatic Labelling for News Classification in Bangla Language

Istiak Ahmad 1, Fahad AlQurashi 1, and Rashid Mehmood 2,*

1 Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia

2 High Performance Computing Center, King Abdulaziz University, Jeddah 21589, Saudi Arabia

* Corresponding author: RMehmood@kau.edu.sa

ABSTRACT

Research in Natural Language Processing (NLP) has become increasingly important due to applications such as text classification, text mining, sentiment analysis, POS tagging, named entity recognition, textual entailment, and many others. This paper introduces several machine learning (ML) and deep learning (DL) methods with manual and automatic labelling for news classification in the Bangla language. The ML algorithms are Logistic Regression (LR), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbour (KNN), used with Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Doc2Vec embedding models. The DL algorithms are Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Convolutional Neural Network (CNN), used with Word2Vec, GloVe, and FastText word embedding models. We develop automatic labelling methods using Latent Dirichlet Allocation (LDA) and investigate the performance of single-label and multi-label article classification methods. To investigate performance, we developed from scratch Potrika, the largest and most extensive dataset for news classification in the Bangla language, comprising 185.51 million words and 12.57 million sentences contained in 664,880 news articles in eight distinct categories, curated from six popular online news portals in Bangladesh for the period 2014-2020. GRU with FastText achieves the highest accuracy, 91.83%, on manually labelled data. With automatic labelling, KNN with Doc2Vec achieves the highest accuracy: 57.72% for single-label data and 75% for multi-label data. The methods developed in this paper are expected to advance research in Bangla and other languages.

Keywords: Natural Language Processing · news classification · Bangla language · word embedding · machine learning · deep learning · automatic labelling · single-label classification · multi-label classification

1 Introduction

The primary objective of text classification is to determine the class or sentiment of previously unseen texts. The problem can be defined as follows. Assume we have n texts, x = {x1, x2, ..., xn}, each assigned a category from a set of labels l = {l1, l2, ...}. The training dataset is used to build a classification model that relates the features of a text to one of the class labels, and the trained model can then predict the class of an unseen text. Texts typically come untagged and must be labelled manually, which is the most time-consuming and challenging part of the task; without labelled texts, it is difficult to develop a classification model. Text classification has been applied successfully in many areas, such as sentiment analysis [1], information retrieval [2], information filtering, knowledge management, document summarization [3], spam mail detection [4], recommender systems, and many others.
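
The problem definition above can be restated compactly. This is only a restatement for illustration, not notation taken from the paper: the symbols y_i (the label of text x_i), f_θ (the trained model), and 𝒳 (the space of texts) are assumptions introduced here.

```latex
\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}, \qquad y_i \in l = \{l_1, l_2, \ldots\}, \qquad
f_\theta : \mathcal{X} \to l, \qquad \hat{y} = f_\theta(x_{\mathrm{new}})
```

Training fits the parameters θ on the labelled set 𝒟, and the fitted model assigns a predicted label ŷ to an unseen text.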

About 238 million people worldwide speak Bangla natively or as a second language (2021) [5]. As a result, the language has carved out a niche for itself in various information-exchange media. With its large number of online newspapers, blogs, Wikipedia articles, eBooks, literary works, and so on, Bangla is likely to become an active area of NLP research in the near future. Many events happen around the world each day, and some of them become trending discussion topics for a period of time. Most news media focus on presenting the most popular events at any given moment, and readers want to follow the most influential and frequently discussed events among the large number happening around them at a specific time. Text analysis can detect the most current discussion topics and events automatically, more precisely, and more quickly.

arXiv:2210.10903v1 [cs.AI] 19 Oct 2022

This paper introduces several machine learning (ML) and deep learning (DL) methods with manual and automatic labelling for news classification in the Bangla language. For manually labelled data, we implemented several ML and DL algorithms. The ML algorithms are Logistic Regression (LR), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbour (KNN), used with Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Doc2Vec embedding models. The DL algorithms are Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Convolutional Neural Network (CNN), used with Word2Vec, GloVe, and FastText word embedding models. To address the arduous task of manual labelling, we develop automatic labelling methods using Latent Dirichlet Allocation (LDA), an unsupervised topic modeling algorithm, and investigate the performance of single-label and multi-label article classification methods.
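
To make the BoW and TF-IDF representations mentioned above concrete, the following is a minimal sketch using only the Python standard library. It implements the textbook definition (term frequency times the logarithm of inverse document frequency), which is an assumption here: the paper's actual implementation is not shown in this part, and libraries such as scikit-learn apply smoothed variants. The function names and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Return raw term counts for one tokenised document."""
    return Counter(tokens)


def tf_idf(corpus):
    """Compute textbook TF-IDF weights.

    corpus: list of documents, each a list of tokens.
    Returns one dict per document mapping term -> weight.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus with placeholder tokens (not taken from Potrika).
corpus = [["sports", "match"], ["sports", "budget"]]
vectors = tf_idf(corpus)
```

A term that occurs in every document (here "sports") receives weight zero under this definition, which is why TF-IDF tends to emphasise category-specific vocabulary over common words.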

We developed Potrika, the largest and most extensive dataset for news classification in the Bangla language, comprising 185.51 million words and 12.57 million sentences contained in 664,880 news articles, and used it to investigate the proposed ML and DL methods [6]. Potrika is a single-label textual dataset of news articles in the Bangla language, curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology). GRU with FastText achieves the highest accuracy, 91.83%, on manually labelled data. With automatic labelling, KNN with Doc2Vec achieves the highest accuracy: 57.72% for single-label data and 75% for multi-label data. The lower performance of classification based on automatic labelling arises because it uses ML algorithms, whereas the best performance on manually labelled data was obtained with a DL algorithm. We will extend our work in the future to include DL methods for automatic labelling.

The NLP methods developed in this paper and the techniques for their extensive analyses are expected to advance research in Bangla and other languages.

Hardware and Software: We use a Quadro RTX 6000 GPU, which has 4608 CUDA parallel-processing cores, 576 tensor cores, and 72 RT cores. The GPU memory is 24 GB of GDDR6. We use Python as the programming language, along with machine and deep learning libraries such as TensorFlow, Keras, scikit-learn, and Gensim. We use the data visualization libraries Seaborn and Matplotlib to visualize the evaluation results.

The rest of the paper is structured as follows. Section 2 reviews the literature on text classification, covering Bangla text classification (Sections 2.1 to 2.3) and text classification in other languages (Section 2.4). Section 3 presents our methodology, including an overview of the methodology and framework architecture (Section 3.1), the dataset (Section 3.2), preprocessing (Section 3.3), feature extraction (Section 3.4), and word embedding techniques (Section 3.5). Machine and deep learning methods for manually labelled data are described in Sections 3.6 and 3.7. Section 3.8 explains how the automatically labelled dataset is created using unsupervised topic modeling, and Section 3.9 discusses the methodology for multi-label news article classification. The results are presented in Sections 4 and 5, which cover manual labelling and automatic single-label and multi-label classification, respectively. Section 6 discusses the findings. Finally, Section 7 concludes with recommendations for further research.

2 Literature Review

Several machine and deep learning-based approaches have been introduced to address text classification [7]. In this section, we review Bangla text classification in three areas: sentiment analysis, multi-domain classification, and topic modeling. We also review several methods for classifying text in other languages.

2.1 Sentiment Classification

The core idea of sentiment analysis, or opinion mining, is to determine whether a given text expresses a positive, negative, or neutral opinion. For sentiment analysis in Bangla, TF-IDF has been applied to small datasets using machine learning algorithms (see [8, 9, 10]). The word embedding method word2vec was proposed in [11] for Bangla sentiment analysis based on Bangla comments. Bangla tweet data has also been used for sentiment analysis. For example, Asimuzzaman et al. [12] used an adaptive neuro-fuzzy system for Bangla tweet classification. For sentiment detection, Hasan et al. [13] proposed WordNet and SentiWordNet as tools, but the major limitation of this research is that these tools were developed specifically for English. Tuhin et al. [14] predicted six individual emotions using ML algorithms such as SVM and Naive Bayes (NB). Further, NB, decision tree (DT), KNN, SVM, and K-means clustering were used by Rahman et al. [15] to predict some basic emotions from text. In addition, Paul et al. [16] proposed mutual information-based feature selection methods and the multi NB algorithm for predicting sentiment polarity. Taher et al. [17] proposed Bangla sentiment analysis based on N-grams and SVM. Amin et al. [18] proposed using VADER, a popular English sentiment tool, to predict Bangla sentiment.

Hassan et al. [19] proposed the deep learning-based LSTM algorithm for sentiment analysis, using a dataset of 10k Bangla and Romanized Bangla text (BRBT) samples with binary and categorical cross-entropy loss functions. Furthermore, a CNN-based method was proposed by Alam et al. [20].

2.2 Multi-domain Text Classification

Alam et al. [21] presented a new dataset of Bengali news articles containing about 350K articles in five categories (State, Economy, International, Entertainment, and Sports). In their dataset, 65% of the data is labelled as State, and 13.5%, 8.5%, 8%, and 5% are labelled as Sports, International, Entertainment, and Economy, respectively. They applied machine learning algorithms with two feature representations, Word2Vec and TF-IDF. In another study, Ahmed et al. [22] implemented the Word2Vec embedding model with KNN and SVM classification algorithms for news document classification. Dhar et al. [23] recommended a classification technique based on cosine similarity and Euclidean distance, evaluated on a set of 1000 documents. They determine the β0 threshold using the 90th percentile formula for both distance measures and calculate the score based on the distance. In other work, Dhar et al. [24] developed a dimensionality reduction technique with TF-IDF (40% of TF), using a total of 1960 Bangla text documents from five categories (Sports, Business, Science, Medical, and State) with 632,924 tokens, and applied machine learning algorithms; the LIBLINEAR classifier achieved the highest accuracy. For 40 thousand news samples divided into 12 categories, Mojumder et al. [25] suggested DL algorithms such as BiLSTM, CNN, and convolutional BiLSTM, with fastText as the word embedding technique. Alam et al. [26] proposed transformer-based Bangla article classification, using multilingual transformer models to classify Bangla text in several domains.
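
The two distance measures mentioned above for Dhar et al. [23], together with a percentile-based threshold, can be sketched as follows. This is an illustration only: the exact percentile formula and the way the β0 threshold is applied are not given in this paper, so the nearest-rank percentile below is an assumption, and all function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile for an integer p in 1..100 (assumed formula)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p*n/100
    return ordered[rank - 1]


# Example: a threshold taken as the 90th percentile of observed distances.
distances = [euclidean_distance([0.0, 0.0], [float(i), 0.0]) for i in range(1, 11)]
threshold = percentile_nearest_rank(distances, 90)
```

Cosine similarity ignores vector length and so is insensitive to document size, whereas Euclidean distance is not; this is one common reason for evaluating both.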

2.3 Topic Modeling-based Text Classification

Little research has been done on classifying Bangla text using topic modeling. Helal and Mouhoub [27] find the key topics in a Bangla news corpus using LDA with a bigram model and classify the articles by applying similarity measures. They evaluated the proposed approach using the LDA and Doc2Vec models and compared the similarity scores. They point out that, for some specific news articles, LDA performs better than Doc2Vec. Alam et al. [28] also proposed an LDA-based topic modeling algorithm using 70k Bangla news articles. They detect five distinct news article topics (National, Sports, International, Technology, and Economy) and a further topic called 'others', which covers articles outside these five topics.

Most of the research above applies machine learning algorithms to small datasets. There has been remarkably little work on deep learning algorithms for Bangla article classification because no comprehensive dataset exists for the task.

2.4 Text Classification for Other Languages

This section provides an overview of text classification methods for other languages. Shaw et al. [29] implemented ML techniques including random forest, KNN, and logistic regression to classify news into five categories (Entertainment, Business, Politics, Sports, and Technology) based on the BBC news dataset. Among these algorithms, logistic regression performed best for all the categories. Raychaudhuri et al. [30] studied three machine learning algorithms for text classification, namely SVM, neural network, and decision tree. The authors used the UCI dataset on US congressional voting, which consists of 16 features and 435 instances, split into 335 training examples, 50 testing examples, and 50 validation examples. They varied the parameter C, which controls the training error. When C=1, SVM performed better than the neural network, and when C=1000, the neural network performed better. The outcome also revealed that a fully grown decision tree produced better results than a smaller decision tree.

Data augmentation is most popular in computer vision research, where it is used when the amount of data is small or imbalanced. Recently, text data augmentation has also gained attention for small text datasets. Wei and Zou [31] proposed this technique to improve text classification performance. They propose the following augmentation operations: synonym replacement, random insertion of synonyms of a word, random swapping of the positions of two words in a sentence, and random removal of words from a sentence.
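
The four augmentation operations listed above can be sketched in a few lines. This is a minimal illustration, not the implementation of Wei and Zou [31]: the synonym dictionary is a hypothetical stand-in for a thesaurus, and the function names and parameters are assumptions.

```python
import random


def synonym_replacement(tokens, synonyms, n, rng):
    """Replace up to n tokens that have an entry in the synonym dictionary."""
    tokens = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        tokens[i] = rng.choice(synonyms[tokens[i]])
    return tokens


def random_insertion(tokens, synonyms, n, rng):
    """Insert n synonyms of randomly chosen tokens at random positions."""
    tokens = list(tokens)
    candidates = [t for t in tokens if t in synonyms]
    if not candidates:
        return tokens
    for _ in range(n):
        word = rng.choice(candidates)
        position = rng.randrange(len(tokens) + 1)
        tokens.insert(position, rng.choice(synonyms[word]))
    return tokens


def random_swap(tokens, n_swaps, rng):
    """Swap the positions of two randomly chosen tokens, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


# Hypothetical example; a seeded generator keeps runs reproducible.
rng = random.Random(0)
sentence = ["the", "team", "won", "the", "final", "match"]
synonyms = {"won": ["secured"], "match": ["game", "fixture"]}
augmented = random_swap(sentence, 1, rng)
```

Passing the random generator explicitly keeps each operation deterministic under a fixed seed, which helps when augmented datasets need to be regenerated.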

Recently, the attention mechanism has become an efficient approach for identifying the important information needed to achieve excellent outcomes. Numerous studies have been carried out on attention mechanisms and architectures, and several novel methods have been proposed for text classification [32, 33, 34, 35]. Zhou et al. [32] proposed an attention-based LSTM network to classify cross-lingual sentiment, using English and Chinese as the source and target languages, respectively. Du et al. [33] proposed a Convolutional-Recurrent Attention Network (CRAN), whose architecture includes a text encoder using an RNN and an attention extractor using a CNN. Their experimental results show that the model effectively extracts the salient parts of sentences and improves sentence classification performance. Liu et al. [34] proposed an architecture combining an attention-based convolutional layer with a BiLSTM, where the attention mechanism provides a focus for the output of the hidden layers. The BiLSTM extracts both the preceding and the following context, while the convolutional layer retrieves higher-level phrase representations from the word embedding vectors. Their model achieves comparable results on all the benchmark datasets.
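
The idea that attention "provides a focus" over hidden-layer outputs can be illustrated with a generic dot-product attention pooling step. This sketch is not the architecture of any of the cited works [32, 33, 34]; the query vector, function names, and toy values are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each hidden state by its dot-product score against a query.

    hidden_states: list of equal-length vectors, one per time step.
    query: vector of the same length (assumed to be learned in practice).
    Returns (weights, context), where context is the weighted sum of states.
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three time steps with two-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_pool(states, [1.0, 0.0])
```

The weights sum to one, so the context vector is a convex combination of the hidden states that leans towards the time steps scoring highest against the query.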

State-of-the-art graph-based neural network methods for text classification have been gaining increasing attention recently. Yao et al. [36] proposed a text graph convolutional network (TextGCN), which is notable for its performance with a small training corpus. A single text graph is built for the whole corpus from word co-occurrence and word-document relations, and the TextGCN is then learned on this graph. Liu et al. [37] proposed a tensor graph convolutional network (TensorGCN). They construct a text graph tensor based on semantic, syntactic, and sequential contextual information. Two types of propagation learning are then performed on the text graph tensor: intra-graph propagation, which aggregates information from neighbouring nodes, and inter-graph propagation, which harmonizes heterogeneous information between graphs.

The capsule network, which is closely related to CNNs, is another state-of-the-art method for text classification. Several studies based on capsule networks have been conducted [38, 39, 40]. In capsule networks, capsules are locally invariant groups that learn to recognize the presence of visual entities and encode their characteristics into vectors. Capsules also require a nonlinear function called squashing, whereas neurons in a CNN act independently. Equivariance and dynamic routing are the two most essential characteristics that distinguish capsule networks from standard neural networks. Kim et al. [39] proposed text classification methods based on a capsule network with dynamic and static routing; static routing achieved higher accuracy than dynamic routing. Yang et al. [38] introduced a cross-domain capsule network and illustrated transfer learning from single-label to multi-label text classification and cross-domain sentiment classification. Jain et al. [40] suggested an attention mechanism-based capsule network system called Deep Refinement. Their method achieved 96% accuracy on the Quora insincere questions dataset, in a comparison with BiLSTM, SVM, and C-BiLSTM.
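
The squashing nonlinearity mentioned above can be sketched as follows, using the form commonly given in the capsule network literature, v = (‖s‖² / (1 + ‖s‖²)) · s / ‖s‖. The cited works [38, 39, 40] may use variants; the function name and the small epsilon guarding against division by zero are assumptions of this sketch.

```python
import math


def squash(s, eps=1e-9):
    """Shrink a capsule's raw output vector to a length in [0, 1).

    Short vectors are pushed towards zero and long vectors towards unit
    length, while the direction of the vector is preserved.
    """
    squared_norm = sum(x * x for x in s)
    norm = math.sqrt(squared_norm)
    scale = squared_norm / (1.0 + squared_norm) / (norm + eps)
    return [scale * x for x in s]


# Example: a capsule output of length 5 is squashed to length close to 25/26.
v = squash([3.0, 4.0])
length = math.sqrt(sum(x * x for x in v))
```

Because the output length always lies below one, it can be read as the probability that the entity represented by the capsule is present.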

Traditional text classification techniques rely on manually labelled datasets, whose creation is tedious and time-consuming. A few dataless text classification techniques, for example, the Laplacian seed word topic model (LapSWTM) [41] and the seed-guided multi-label topic model (SMTM) [42], have recently been proposed to address this challenge. Anantharaman et al. [43] proposed classifying long and short texts using non-negative matrix factorization, LDA, and latent semantic analysis (LSA). Neogi et al. [44] proposed LSA with TF-IDF for text classification and used entropy to increase the accuracy. Pavlinek et al. [45] proposed a semi-supervised text classification method based on self-training LDA.

2.5 Research Gap, Novelty, and Contributions

Text datasets, often known as corpora, are used to study linguistic phenomena including text classification, morphological structure, word sense disambiguation, language evolution over time, and spell checking. The quality and size of the corpus have a large impact on the research output: a well-structured, comprehensive corpus can yield far superior results. Compared with English, Bangla has been studied inadequately because of the scarcity of Bangla corpora and the language's complicated grammatical structure. In this paper, our contributions are as follows:

• We are the first to use a comprehensive Bangla newspaper article dataset called Potrika [6, 46] to classify eight distinct news article classes: Education, Entertainment, Sports, Politics, National, International, Economy, and Science & Technology.

• We implement both machine learning (ML) algorithms, including logistic regression, SGD, SVM, RF, and KNN, and deep learning (DL) algorithms, including CNN, LSTM, BiLSTM, and GRU, for single-label news article classification. We use BoW, TF-IDF, and Doc2Vec representations for the ML algorithms. For the DL algorithms, we apply Word2Vec, GloVe, and FastText word embedding models trained on the Potrika dataset. These word embedding models are valuable not only for news article classification but also for other NLP tasks such as text summarization, named entity recognition, Bangla automatic word prediction, and question-answering systems. Further, we evaluate and scrutinise the results for both cases.

• Manual labelling is the most difficult and time-consuming task in building classification datasets. In this paper, we investigate the possibility of using a topic modeling algorithm to label the news article dataset automatically and compare the performance obtained with the automatically labelled dataset against that obtained with the manually labelled dataset. Additionally, we develop a multi-label dataset based on the automatically labelled dataset and evaluate the performance of multi-label news article classification.
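
As a rough illustration of how a topic model's output could be turned into single and multiple labels, the sketch below assumes each article already has a topic probability distribution (such as LDA produces) and that each topic has been mapped to a category name. Both the mapping and the probability threshold are assumptions made here; the paper's own procedure is described in Sections 3.8 and 3.9 and may differ.

```python
def single_label(topic_probs, topic_names):
    """Assign the category of the most probable topic."""
    best = max(range(len(topic_probs)), key=topic_probs.__getitem__)
    return topic_names[best]


def multi_label(topic_probs, topic_names, threshold=0.2):
    """Assign every category whose topic probability reaches the threshold.

    Falls back to the single most probable topic so that every article
    receives at least one label. The threshold value is an assumption.
    """
    labels = [name for p, name in zip(topic_probs, topic_names) if p >= threshold]
    return labels or [single_label(topic_probs, topic_names)]


# Hypothetical topic distribution for one article over three categories.
names = ["Sports", "Politics", "Economy"]
probs = [0.10, 0.65, 0.25]
primary = single_label(probs, names)
all_labels = multi_label(probs, names)
```

The single-label case keeps only the dominant topic, whereas the multi-label case retains secondary topics, which suits articles that mix subjects such as politics and the economy.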

The NLP work proposed in this paper builds on our earlier NLP works applied to several sectors and multiple languages, including transportation [47, 48, 49], healthcare [50, 51], education [52, 53], and smart cities [54, 55, 56, 57]. We expect that this paper will significantly increase the impact of our work, particularly in the Bangla language.

3 Methodology and Design

In this section, we describe the methodology for the research presented in this paper. We begin with an overview of our methodology in Section 3.1, followed by a description of the dataset in Section 3.2. Data preprocessing and feature extraction are explained in Sections 3.3 and 3.4. Word embedding models are discussed in Section 3.5. Machine learning and deep learning techniques are discussed in Sections 3.6 and 3.7, respectively. Subsequently, we explain our methodology for automatically creating labels for the news items. The details of automatic labelling with single labels are provided in Section 3.8, and the details of marking news items with multiple labels are given in Section 3.9.

Figure 1: System Process for Article Classification Overview

3.1 Methodology Overview

As mentioned earlier, the aim of this paper is to investigate the performance of machine and deep learning-based news classification in the Bangla language using manual and automatic labelling of news documents. To this end, we explore, first, the classification of manually labelled news data using machine and deep learning algorithms and, second, the classification of automatically labelled news items using single-label and multi-label approaches. The automatic labelling is done using topic modeling. The overview and detailed architecture of our methodology are depicted in Figure 1 and Figure 2, and its algorithmic flow is provided in Algorithm 1.

Algorithm 1 Master Algorithm
Input: Read trainDF, testDF, potrikaDF
Output: Evaluation of news article classification
1: cleanTrText, cleanTsText, cleanText ← preprocessing(trainDF, testDF, potrikaDF)
2: evaluation ← man_ML(cleanTrText, trainDF.class, cleanTsText, testDF.class)
3: w2vmodel, glovemodel, fasttextmodel ← getWordEmbeddingModels(cleanText)
4: evaluation ← man_DL(w2vmodel, glovemodel, fasttextmodel)
5: evaluation, autoLabelingDF ← auto_singleLabel(trainDF, testDF)
6: evaluation ← auto_multiLabel(autoLabelingDF)
