2 Related Work
Linguistic Descriptions
There are several linguistic
references describing various aspects of PAL
(Rice and Sa’id, 1979; Herzallah, 1990;
Hopkins, 1995; Elihai, 2004; Talmon, 2004;
Bassal, 2012; Cotter and Horesh, 2015). These mostly
target academics and language learners. We
consulted many of these resources as part of devel-
oping our annotation guidelines.
Dialectal Corpora
We can group DA corpora
based on the degree of richness in their annota-
tions. Some noteworthy examples of unannotated
or lightly annotated corpora of relevance include
the MADAR Corpus (Bouamor et al., 2018),
comprising 2K parallel sentences spread across 25
dialects of Arabic, including PAL (Jerusalem variety),
and the NADI corpus for nuanced dialect
identification (Abdul-Mageed et al., 2021). The Shami
Corpus (Abu Kwaik et al., 2018) includes 21K
PAL sentences, and the Parallel Arabic Dialect
Corpus (PADIC) contains 6.4K PAL sentences
(Meftouh et al., 2015). In the spirit of genre
diversification and wider coverage across dialects,
El-Haj (2020) introduced the Habibi Corpus for song
lyrics, which comprises songs from many Arab
countries including all Levantine Arab countries.
Public and freely available morphologically an-
notated corpora are scarce for DA and often do not
agree on annotation guidelines. A notable
annotated dataset for PAL is the Curras corpus (Jarrar
et al., 2016), a 56K-token morphologically
annotated corpus. Other annotated Levantine dialect
efforts include the Jordan Comprehensive
Contemporary Arabic Corpus (JCCA) (Sawalha et al., 2019),
the Jordanian and Syrian corpora by Alshargi et al.
(2019), and the Baladi corpus of Lebanese Arabic
(Al-Haff et al., 2022).
We consulted some of the public corpora as part
of the development of Maknuune. However, most
of the above datasets are based on web scrapes,
which limits the lemma coverage they can attain.
Dialectal Lexicons
Examples of machine-readable
DA lexicons include the 36K-lemma
lexicon used for the CALIMA EGY fully inflected
morphological analyzer (Habash et al., 2012),
based on the CALLHOME Egypt lexicon (Gadalla
et al., 1997), and the 51K-lemma Egyptian Arabic
Tharwa lexicon (Diab et al., 2014), which provides
some morphological annotations.
The Palestinian Colloquial Arabic Vocabulary
comprises 4.5K entries including expressions
(Younis and Aldrich, 2021), and the MADAR Lexicon
contains 2.7K entries dedicated to the Jerusalem
variety of PAL, including lemmas, phonological
transcriptions, and glosses in MSA, English and
French (Bouamor et al., 2018).
In addition to the above, there are a number of
dictionaries for Levantine Arabic variants, e.g.,
Elihai (2004) (9K entries and 17K phrases for PAL),
Halloun (2019) (for PAL), Freiha (1973) (ca. 5K
entries for Lebanese Arabic), and Stowasser and
Ani (2004) (15K entries for Syrian Arabic). These
resources include base lemma forms, occasional
plural forms, verb aspect inflections, and expres-
sions; however, none of them are available in a
machine-readable format, to the best of our knowl-
edge.
The lexicon presented in this work strives to be a
large-scale and open resource with rich entries cov-
ering phonology, morphology, and lexical expres-
sions, and with a wide-ranging coverage of PAL
sub-dialects. The lexicon may never be complete,
but by making it open to sharing and contribution,
we hope it will become central and useful to NLP
researchers and developers, as well as to linguists
working on Arabic and its dialects.
3 Linguistic Facts
In this section we present some general linguistic
facts about PAL and highlight specific challenging
phenomena that motivated many of our annotation
decisions.
3.1 Phonology and Orthography
Like all other DA varieties, and unlike MSA, PAL has no
standard orthography rules (Jarrar et al., 2016;
Habash et al., 2018). In practice, PAL is primarily
written in Arabic script, and to a lesser extent in
Arabizi-style romanization (Darwish, 2014). Some
of the variations in the written form reflect the
words’ phonology, morphology, and/or etymological
connections to MSA. Orthogonal to the orthography
challenge, and compounding it, PAL shows a high
degree of phonological variability across its
sub-dialects. We highlight some of these variations
below, noting that some also exist in other DA varieties.
Consonantal Variables
A number of PAL con-
sonants vary widely within sub-dialects. For exam-
ple, the voiceless velar stop /k/ is affricated to the