COVID-19-related Nepali Tweets Classification in a Low Resource Setting

Rabin Adhikari1,2, Safal Thapaliya1,2, Nirajan Basnet1,2, Samip Poudel1,2
Aman Shakya2, Bishesh Khanal1
1NepAl Applied Mathematics and Informatics Institute for research (NAAMII)
2Institute of Engineering, Pulchowk Campus, Tribhuvan University
Abstract
Billions of people across the globe have been using social media
platforms in their local languages to voice their opinions on topics
related to the COVID-19 pandemic. Several organizations, including the
World Health Organization, have developed automated social media
analysis tools that classify COVID-19-related tweets into various
topics. However, these tools that help combat the pandemic are limited
to very few languages, leaving several countries unable to benefit from
them. While multi-lingual or low-resource language-specific tools are
being developed, they still need to expand their coverage, such as for
the Nepali language. In this paper, we identify the eight most common
COVID-19 discussion topics among the Twitter community using the Nepali
language, set up an online platform to automatically gather Nepali
tweets containing COVID-19-related keywords, classify the tweets into
the eight topics, and visualize the results over time in a web-based
dashboard. We compare the performance of two state-of-the-art
multi-lingual language models for Nepali tweet classification, one
generic (mBERT) and the other specific to the Nepali language family
(MuRIL). Our results show that the models' relative performance depends
on the data size, with MuRIL doing better for a larger dataset. The
annotated data, models, and the web-based dashboard are open-sourced at
https://github.com/naamiinepal/covid-tweet-classification.
1 Introduction
The COVID-19 pandemic has caused a global rise in social media users
who express their opinions and share information on various topics
related to the pandemic. Public health organizations and relevant
agencies could analyze social media data for early warning of
potentially new virus variants based on discussions of symptoms, and to
understand the impact of various intervention measures, the efficacy of
vaccination programs, etc. Social media data analysis can help develop
strategies for combating the pandemic (Yigitcanlar et al., 2020) and
improve the efficiency of the health industry (Scanfeld et al., 2010;
Signorini et al., 2011; Harris et al., 2013; Paul and Dredze, 2014;
Eichstaedt et al., 2015).
Several studies performed sentiment analysis of tweets to understand
people's views towards the pandemic (Dubey, 2020; Jelodar et al., 2020;
Samuel et al., 2020; Alamoodi et al., 2021). Since sentiment analysis
provides only coarse-level information, there has recently been
interest in building tools for early warning and topic-level discourse
analysis. Most notably, the World Health Organization (WHO) tracks
internet discourse by examining global pandemic-related Twitter data
and news using tools like COVID-19 News Map¹ and EARS². Although a
significant fraction of the global population uses local languages on
social media, most of these tools are limited to English or
Anglo-European languages. For instance, the WHO EARS works in only nine
languages, piloted in 30 countries.
In recent years, there has been a growing interest in building
multi-lingual language models, building low-resource language datasets,
and exploring NLP methods with smaller language models and smaller data
(Conneau et al., 2019; Wang et al., 2020; Ogueji et al., 2021). Nepali
is a low-resource language with a significant gap in advances, data
availability, and the development of NLP tools. While there has been
some work on sentiment analysis in low-resource languages (Addawood et
al., 2020; Hosseini et al., 2020), including Nepali (Sitaula et al.,
2021; Shahi et al., 2022), to our knowledge there is no work on
COVID-19 tweet topic classification for discourse analysis in the
Nepali language.
¹ https://portal.who.int/eios-coronavirus-newsmap/
² https://www.who-ears.com/
Figure 1: The web app dashboard uses infographics, viz. bar graphs and
line charts, to track the trend of various topics. We can filter the
tweets based on time and topics to make analysis easier. Additionally,
we have incorporated humans in the loop by developing an administrator
interface to validate the predicted tweet labels from the model and
proofread the validated ones.
In this work, we propose a new dataset, deep learning classification
models based on multi-lingual language models, and an interactive
dashboard for incremental learning and visualization of COVID-19 tweet
topic classification in the Nepali language. Figure 1 shows a snapshot
of our dashboard. In addition to visualizing topic classification in
real time, the dashboard lets users manually verify the ML model's
predictions, correct them to annotate more data, and retrain the model
via the GUI as more data becomes available.
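The keyword-based gathering of Nepali tweets mentioned in the abstract can be pictured as a simple filter. The following is a minimal sketch, not the platform's actual logic, which this section does not describe. The keyword list, the `is_candidate_tweet` name, and the 0.5 script-ratio threshold are all assumptions made for illustration.

```python
import re

# Matches any character in the Unicode Devanagari block (U+0900 to U+097F).
DEVANAGARI_CHAR = re.compile(r"[\u0900-\u097F]")

# Hypothetical keyword list; the keywords used by the authors are not given here.
KEYWORDS = ("कोभिड", "कोरोना")


def is_candidate_tweet(text, keywords=KEYWORDS, min_devanagari_ratio=0.5):
    """Return True if the text is mostly Devanagari and contains a keyword.

    The ratio threshold is an assumed heuristic for discarding tweets that
    are predominantly written in another script.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    devanagari = sum(1 for ch in chars if DEVANAGARI_CHAR.match(ch))
    if devanagari / len(chars) < min_devanagari_ratio:
        return False
    return any(keyword in text for keyword in keywords)


if __name__ == "__main__":
    samples = ["कोरोना खोप लगाइयो", "COVID vaccine today", "आज मौसम राम्रो छ"]
    for sample in samples:
        print(sample, is_candidate_tweet(sample))
```

A substring match like this is deliberately crude. A real collector would likely also normalize Unicode and handle spelling variants, which are left out of this sketch.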
The following are our contributions to the scientific community.
• We release a multi-annotator multi-label Nepali Annotated Tweets with
  COVID-19 Topics Classification (NAT-CTC) dataset that contains 12,241
  tweets in Devanagari script, manually tagged with eight simplified
  topics. We also provide inter-annotator agreement results on this
  dataset, computed from four annotators labeling 400 typical tweets.
• We release our open-source web-based platform with a GUI for
  automatic keyword-based tweet collection, tweet pre-processing, topic
  classification, and visualization. This platform can be used for
  AI-assisted annotation and incremental learning, where human
  annotators correct the labels predicted by ML models and then retrain
  the models. We used this approach during dataset preparation as well.
• We show that the benefit of using a Nepali language family-specific
  model over generic multi-lingual language models may appear only
  when a certain minimum amount of annotated data is available for the
  downstream task.
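This section does not say which statistic is used for the inter-annotator agreement among the four annotators. One common choice for more than two annotators is Fleiss' kappa, which for a multi-label dataset can be computed separately per topic by treating each topic as a binary decision. The sketch below assumes that setup; the function names and the toy votes are hypothetical and are not data from the paper.

```python
import math


def votes_to_counts(votes):
    """Convert per-tweet binary votes into [not_tagged, tagged] counts.

    votes[i] holds one 0/1 vote per annotator for a single topic on tweet i.
    """
    return [[len(v) - sum(v), sum(v)] for v in votes]


def fleiss_kappa(counts):
    """Fleiss' kappa, where counts[i][j] is the number of annotators who
    assigned item i to category j. Every item needs the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)

    if math.isclose(p_expected, 1.0):
        return float("nan")  # undefined when every rating is the same category
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy votes from four annotators on three tweets for one topic.
    toy_votes = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 1]]
    print(fleiss_kappa(votes_to_counts(toy_votes)))
```

Running this once per topic gives eight kappa values, which makes it easy to see which topics the annotators found ambiguous.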