
content in certain popular languages such as English, Spanish, etc. (Perrigo, 2019). So far, several investigations have been conducted to identify hate speech automatically, focusing mainly on the English language; therefore, an effort is required to detect and diminish such hateful content in low-resource languages.
With more than 210 million speakers, Bengali is the seventh most widely spoken language (https://www.berlitz.com/en-uy/blog/most-spoken-languages-world), with around 100 million Bengali speakers in Bangladesh and 85 million in India. Apart from Bangladesh and India, Bengali is spoken in many countries, including the United Kingdom, the United States, and the Middle East (https://www.britannica.com/topic/Bengali-language).
Also, a current trend on social media platforms is that, apart from actual Bengali, people tend to write Bengali using Latin script (English characters) and often mix in English phrases within the same conversation. This unique, informal communication dialect is called code-mixed Bengali or Roman Bengali. Code-mixing makes it easier for speakers to communicate with one another by providing a more comprehensive range of idioms and phrases. However, as emphasized by Chittaranjan et al. (2014), it has made the task of creating NLP tools more challenging.
Beyond these general challenges, the challenges specific to identifying hate speech in Roman Bengali include the absence of a hate speech dataset and the lack of benchmark models. Thus, there is a need to develop open, efficient datasets and models to detect hate speech in Bengali. Although a few studies have developed Bengali hate speech datasets, most of them were crawled from comments on Facebook pages, and all of them are in actual Bengali. Hence, there is a need to develop more benchmark datasets covering other popular platforms. To address these limitations, we make the following contributions in this study.
- First, we create a gold-standard dataset of 10K tweets, of which 5K are in actual Bengali and 5K in Roman Bengali.
- Second, we implement several baseline models to automatically identify hateful and offensive content in both actual and Roman Bengali tweets.
- Third, we explore several interlingual transfer mechanisms to boost the classification performance.
- Finally, we perform an in-depth error analysis by looking into a sample of test instances that the models misclassify.
2 Related Work
Over the past few years, research on automated hate speech detection has evolved tremendously. Earlier efforts in developing resources for hate speech detection focused mainly on the English language (Waseem and Hovy, 2016; Davidson et al., 2017; Founta et al., 2018). Recently, in an effort to create multilingual hate speech datasets, several shared task competitions have been organized (HASOC (Mandl et al., 2019), OffensEval (Zampieri et al., 2019), TRAC (Kumar et al., 2020), etc.), and multiple datasets in languages such as Hindi (Modha et al., 2021), Danish (Sigurbergsson and Derczynski, 2020), Greek (Pitenis et al., 2020), Turkish (Çöltekin, 2020), and Mexican Spanish (Aragón et al., 2019) have been made public. There is also some work on detecting hate speech in actual Bengali. Ishmam and Sharmin (2019) collected and annotated 5K comments from Facebook into six classes: inciteful, hate speech, religious hatred, communal attack, religious comments, and political comments. However, the dataset is not publicly available. Karim et al. (2021) provided a dataset of 8K hateful posts collected from multiple sources such as Facebook, news articles, and blogs. One problem with this dataset is that every comment belongs to one of the hate classes (personal, geopolitical, religious, or political), with no non-hateful examples, so it cannot be used to build hate speech detection models that screen out hate speech from benign content. Romim et al. (2021) curated a dataset of 30K comments, making it one of the most extensive datasets of hateful statements. The authors achieved 87.5% accuracy on their test set using an SVM model. However, none of these datasets consider Roman Bengali posts, a prevalent communication style on social media nowadays.
With regard to detection systems, earlier methods examined simple linguistic features such as character and word n-grams, POS tags, and tf-idf, combined with traditional classifiers such as LR, SVM, and Decision Trees (Davidson et al., 2017). With the development of larger datasets, researchers have shifted to data-hungry complex models such as deep learning (Pitsilis et al., 2018; Zhang et al.,