COVID-19-related Nepali Tweets Classification in a Low Resource Setting

Rabin Adhikari¹,², Safal Thapaliya¹,², Nirajan Basnet¹,², Samip Poudel¹,², Aman Shakya², Bishesh Khanal¹

¹ NepAl Applied Mathematics and Informatics Institute for research (NAAMII)

² Institute of Engineering, Pulchowk Campus, Tribhuvan University

Abstract

Billions of people across the globe have been using social media platforms in their local languages to voice their opinions about the various topics related to the COVID-19 pandemic. Several organizations, including the World Health Organization, have developed automated social media analysis tools that classify COVID-19-related tweets into various topics. However, these tools that help combat the pandemic are limited to very few languages, leaving several countries unable to benefit from them. While multi-lingual or low-resource language-specific tools are being developed, they still need to expand their coverage, such as for the Nepali language. In this paper, we identify the eight most common COVID-19 discussion topics among the Twitter community using the Nepali language, set up an online platform to automatically gather Nepali tweets containing COVID-19-related keywords, classify the tweets into the eight topics, and visualize the results over time in a web-based dashboard. We compare the performance of two state-of-the-art multi-lingual language models for Nepali tweet classification, one generic (mBERT) and the other specific to the Nepali language family (MuRIL). Our results show that the models' relative performance depends on the data size, with MuRIL doing better for a larger dataset. The annotated data, models, and the web-based dashboard are open-sourced at https://github.com/naamiinepal/covid-tweet-classification.

1 Introduction

The COVID-19 pandemic has caused a global rise in social media users who express their opinions and share information on various topics related to the pandemic. Public health organizations and relevant agencies could analyze social media data for early warning of potentially new virus variants based on discussion of symptoms, and for understanding the impact of various intervention measures, the efficacy of vaccination programs, etc. Social media data analysis can help develop strategies for combating the pandemic (Yigitcanlar et al., 2020) and improve the efficiency of the health industry (Scanfeld et al., 2010; Signorini et al., 2011; Harris et al., 2013; Paul and Dredze, 2014; Eichstaedt et al., 2015).

Several studies performed sentiment analysis of tweets to understand people's views towards the pandemic (Dubey, 2020; Jelodar et al., 2020; Samuel et al., 2020; Alamoodi et al., 2021). Since sentiment analysis provides only limited, coarse-level information, there has recently been interest in building tools for early warning and topic-level discourse analysis. Most notably, the World Health Organization (WHO) tracks internet discourse by examining global pandemic-related Twitter data and news using tools like COVID-19 News Map¹ and EARS². Although a significant fraction of the global population uses local languages on social media, most of these tools are limited to English or Anglo-European languages. For instance, WHO EARS works in only nine languages and has been piloted in 30 countries.

In recent years, there has been a growing interest in building multi-lingual language models, building low-resource language datasets, and exploring NLP methods with smaller language models and smaller data (Conneau et al., 2019; Wang et al., 2020; Ogueji et al., 2021). Nepali is a low-resource language with a significant gap in advances, data availability, and the development of NLP tools. While there has been some work on sentiment analysis in low-resource languages (Addawood et al., 2020; Hosseini et al., 2020), including Nepali (Sitaula et al., 2021; Shahi et al., 2022), to our knowledge there is no work on COVID-19 tweet topic classification for discourse analysis in the Nepali language.

¹ https://portal.who.int/eios-coronavirus-newsmap/

² https://www.who-ears.com/

Figure 1: The web app dashboard uses infographics, viz. bar graphs and line charts, to track the trend of various topics. We can filter the tweets based on time and topics to make analysis easier. Additionally, we have incorporated humans in the loop by developing an administrator interface to validate the predicted tweet labels from the model and proofread the validated ones.

In this work, we propose a new dataset, deep learning classification models based on multi-lingual language models, and an interactive dashboard for incremental learning and visualization of COVID-19 tweet topic classification in the Nepali language. Figure 1 shows a snapshot of our dashboard. In addition to visualizing topic classification in real time, the dashboard lets users manually verify the ML model's predictions, correct them to annotate more data, and retrain the model via the GUI for improvement as more data becomes available.
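The verify-and-correct step described above can be illustrated with a small sketch of multi-label decoding. It assumes the classifier emits one logit per topic and that each topic is decided independently with a sigmoid and a threshold. The topic names, the 0.5 threshold, and the function names are placeholders invented for this example, not the settings or code used in the dashboard.

```python
import math

# Placeholder topic names: the eight actual topics are not listed in this section.
TOPICS = [f"topic_{i}" for i in range(1, 9)]


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, threshold=0.5):
    """Return every topic whose independent probability reaches the threshold."""
    if len(logits) != len(TOPICS):
        raise ValueError("expected one logit per topic")
    return {t for t, z in zip(TOPICS, logits) if sigmoid(z) >= threshold}


def final_labels(predicted, corrected=None):
    """Prefer an annotator's corrected labels over the model's prediction."""
    return set(corrected) if corrected is not None else set(predicted)


# Hypothetical logits for a single tweet (illustrative values only).
logits = [2.0, -1.5, 0.3, -3.0, -0.2, 1.1, -2.2, -0.7]
predicted = decode_multilabel(logits)
print(sorted(predicted))
```

Because each topic is thresholded separately, a tweet can receive zero, one, or several topics, which is what distinguishes a multi-label setup from single-label classification. Labels confirmed or corrected through `final_labels` would then be the ones added to the training pool before retraining.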

The following are our contributions to the scientific community:

• We release a multi-annotator, multi-label Nepali Annotated Tweets with COVID-19 Topics Classification (NAT-CTC) dataset that contains 12,241 tweets in Devanagari script, manually tagged with eight simplified topics. We also provide inter-annotator agreement results on this dataset using four annotators labeling 400 typical tweets.

• We release our open-source web-based platform with a GUI for automatic keyword-based tweet collection, tweet pre-processing, topic classification, and visualization. This platform can be used for AI-assisted annotation and incremental learning, where human annotators correct the labels predicted by ML models and then retrain the models. We used this approach during dataset preparation as well.

• We show that the benefit of using a Nepali language family-specific model over generic multi-lingual language models may come only if a certain minimum amount of annotated data is available for the downstream task.
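To make the inter-annotator agreement mentioned in the first contribution concrete, here is a minimal sketch of Fleiss' kappa computed for a single topic treated as a binary label. This section does not say which agreement statistic the authors report, so the choice of Fleiss' kappa, the function name, and the toy ratings are all assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items each rated by the same number of annotators.

    `ratings` is a list of per-item lists, e.g. [[1, 1, 0, 1], ...], where each
    inner list holds one categorical rating per annotator.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    categories = sorted({c for item in ratings for c in item})
    counts = [Counter(item) for item in ratings]

    # Per-item agreement: fraction of annotator pairs that agree on the item.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    p_expected = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)

    if p_expected == 1.0:
        # Kappa is undefined when every rating is identical; this sketch
        # returns 1.0 by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (not from the dataset): 5 tweets, 4 annotators, 1 = topic applies.
toy = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 1], [0, 0, 0, 0], [1, 0, 1, 0]]
kappa = fleiss_kappa(toy)
print(round(kappa, 3))
```

For a multi-label dataset, one plausible approach is to compute such a score separately for each of the eight topics and report them side by side, since annotators may agree well on some topics and poorly on others.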
