
COVID-19-related Nepali Tweets Classification in a Low Resource Setting
Rabin Adhikari1,2, Safal Thapaliya1,2, Nirajan Basnet1,2, Samip Poudel1,2
Aman Shakya2, Bishesh Khanal1
1NepAl Applied Mathematics and Informatics Institute for research (NAAMII)
2Institute of Engineering, Pulchowk Campus, Tribhuvan University
Abstract
Billions of people across the globe have been
using social media platforms in their local lan-
guages to voice their opinions about the various
topics related to the COVID-19 pandemic. Sev-
eral organizations, including the World Health
Organization, have developed automated social
media analysis tools that classify COVID-19-related
tweets into various topics. However, these tools
that help combat the pandemic are limited to very
few languages, leaving several countries unable to
benefit from them.
While multi-lingual or low-resource
language-specific tools are being developed, their
coverage still needs to expand to languages such as
Nepali. In this paper, we identify the
eight most common COVID-19 discussion top-
ics among the Twitter community using the
Nepali language, set up an online platform to
automatically gather Nepali tweets containing
the COVID-19-related keywords, classify the
tweets into the eight topics, and visualize the
results over time in a web-based dashboard.
We compare the performance of two
state-of-the-art multi-lingual language models
for Nepali tweet classification, one generic
(mBERT) and the other specific to the Nepali
language family (MuRIL). Our results show that
the models’ relative performance depends on
the data size, with MuRIL doing better for a
larger dataset. The annotated data, models, and
the web-based dashboard are open-sourced at
https://github.com/naamiinepal/covid-tweet-classification.
1 Introduction
The COVID-19 pandemic has caused a global rise
in social media users who express their opinions
and share information on various topics related to
the pandemic. Public health organizations and relevant
agencies could analyze social media data for early
warning of potential new virus variants based on
discussions of symptoms, and for understanding
the impact of various intervention measures, the
efficacy of vaccination programs, etc. Social media
data analysis can help develop strategies for
combating the pandemic (Yigitcanlar et al., 2020)
and improve the efficiency of the health industry
(Scanfeld et al., 2010; Signorini et al., 2011; Harris
et al., 2013; Paul and Dredze, 2014; Eichstaedt
et al., 2015).
Several studies have performed sentiment analysis
of tweets to understand people’s views towards
the pandemic (Dubey, 2020; Jelodar et al., 2020;
Samuel et al., 2020; Alamoodi et al., 2021). Since
sentiment analysis provides only coarse-level
information, there has recently been interest in
building tools for early warning and topic-level
discourse analysis. Most notably, the World Health
Organization (WHO) tracks internet discourse by
examining global pandemic-related Twitter data
and news using tools like COVID-19 News Map¹
and EARS². Although a significant fraction of the
global population uses local languages on social
media, most of these tools are limited to English
or Anglo-European languages. For instance, the
WHO EARS supports only nine languages and has
been piloted in 30 countries.
In recent years, there has been a growing interest
in building multi-lingual language models, creating
low-resource language datasets, and exploring
NLP methods with smaller language models and
smaller data (Conneau et al., 2019; Wang et al.,
2020; Ogueji et al., 2021). Nepali is a low-resource
language with a significant gap in advances, data
availability, and the development of NLP tools.
While there has been some work on sentiment
analysis in low-resource languages (Addawood
et al., 2020; Hosseini et al., 2020), including
Nepali (Sitaula et al., 2021; Shahi et al., 2022),
to our knowledge there is no work on
COVID-19 tweet topic classification for discourse
¹ https://portal.who.int/eios-coronavirus-newsmap/
² https://www.who-ears.com/