Maknuune: A Large Open Palestinian Arabic Lexicon
Shahd Dibas,† Christian Khairallah,‡ Nizar Habash‡
Omar Fayez Sadi,* Tariq Sairafy,* Karmel Sarabta,* Abrar Ardah*
†University of Oxford, ‡New York University Abu Dhabi
*University College of Educational Sciences - UNRWA
shahd.dibas@ling-phil.ox.ac.uk, christian.khairallah@nyu.edu, nizar.habash@nyu.edu
Abstract
We present Maknuune (مكنونة), a large open lexicon for the
Palestinian Arabic dialect. Maknuune has over 36K entries
from 17K lemmas, and 3.7K roots. All entries include
diacritized Arabic orthography, phonological transcription
and English glosses. Some entries are enriched with
additional information such as broken plurals and templatic
feminine forms, associated phrases and collocations, Standard
Arabic glosses, and examples or notes on grammar, usage, or
the location where the entry was collected.
1 Introduction
Arabic is a collective of historically related variants that
co-exist in a diglossic (Ferguson, 1959) relationship between
a Standard variant and geographically specific dialectal
variants. Standard Arabic (SA, العربية الفصحى) is typically
used to refer to everything from the older Classical Arabic
(CA) used in Quranic texts and pre-Islamic poetry all the way
to Modern SA (MSA), the official language of news and culture
in the Arab World. Dialectal Arabic (DA) is classified
geographically into regions such as Egyptian, Levantine,
Maghrebi, and Gulf. The dialects, which differ among
themselves and from SA, are the primary mode of spoken
communication, although they are increasingly dominant in
written form on social media. That said, DA has no official
prescriptive grammars or orthographic standards, unlike the
highly standardized and regulated MSA. In the realm of natural
language processing (NLP), MSA has relatively more annotated
and parallel resources than DA, although there are many
notable efforts to fill gaps in all Arabic variants (Alyafeai
et al., 2022).
In this paper, we focus on Palestinian Arabic (PAL), which is
part of the South Levantine Arabic dialect subgroup. PAL
consists of several sub-dialects in the region of Historic
Palestine that vary in terms of their phonology and lexical
choice (Jarrar et al., 2016). PAL, like all other DA, has been
historically influenced by many languages, specifically, in
its case, Syriac, Turkish, Persian, English and most recently
Modern Hebrew (Halloun, 2019), as well as other Arabic
dialects that came into contact with PAL after the Nakba.
While this research effort was originally motivated by the
need to document and preserve the cultural heritage and unique
identities of the various PAL sub-dialects, it has expanded to
cover PAL's ever-evolving nature as a living language, and
provides a resource to support research and development in
Arabic dialect NLP.
Concretely, we present Maknuune (مكنونة),1 a large open
lexicon for PAL, with over 36K entries from 17K lemmas, and
3.7K roots.2 All entries include diacritized Arabic
orthography and phonological transcription following Habash
et al. (2018), as well as English glosses. Important
inflectional variants are included for some lemmas, such as
broken plural and templatic feminine. About 10% of the entries
are phrases (multiword expressions) indexed by their primary
lemmas. About 67% of the entries include MSA glosses,
examples, and/or notes on grammar, usage, or the location
where the entry was collected. To our knowledge, Maknuune is
the largest open machine-readable dictionary for PAL.
Maknuune is publicly viewable and downloadable.3
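
For illustration only, the kinds of information listed above could be
modelled as a simple record. The following is a minimal Python sketch:
the class name LexiconEntry, its field names, the root notation, and the
helper group_by_lemma are assumptions made here and do not reflect the
actual schema of the released lexicon. The two sample rows reuse words
discussed later in this paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LexiconEntry:
    """Hypothetical lexicon row; field names are illustrative only."""
    root: str                        # consonantal root, e.g. "b.y.t" (assumed notation)
    lemma: str                       # lemma in HSB transliteration
    caphi: str                       # phonological transcription, space-separated
    gloss_en: str                    # English gloss
    gloss_msa: Optional[str] = None  # optional MSA gloss
    notes: Optional[str] = None      # optional grammar/usage/location note
    is_phrase: bool = False          # True for multiword expressions


def group_by_lemma(entries: List[LexiconEntry]) -> Dict[str, List[LexiconEntry]]:
    """Index entries by their lemma, so phrases sit next to base entries."""
    index: Dict[str, List[LexiconEntry]] = {}
    for entry in entries:
        index.setdefault(entry.lemma, []).append(entry)
    return index


entries = [
    LexiconEntry(root="b.y.t", lemma="bayt", caphi="b ee t", gloss_en="house"),
    LexiconEntry(root="k.w.m", lemma="kawm", caphi="k oo m", gloss_en="pile"),
]
index = group_by_lemma(entries)
```

Indexing by lemma mirrors the statement above that phrases are indexed
by their primary lemmas; any real loader would need to follow the
column layout of the downloadable files instead.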
We discuss some related work in Section 2, and
highlight some PAL linguistic facts that motivated
many of our design choices in Section 3. Section 4
presents our data collection process and annotation
guidelines. We present statistics for our lexicon
and evaluate its coverage in Section 5.
1 مكنونة /maknūne/ is a PAL farming term that refers to an
egg intentionally left behind in a specific location to encourage
the chicken to lay more eggs in that location. We hope that the
lexicon will encourage other researchers and citizen linguists
to contribute to it.
2 In this initial phase of Maknuune, we focus on the PAL
sub-dialects spoken in the West Bank, an area with dialectal
diversity across many dimensions such as lifestyle (urban,
rural, bedouin), religion, gender, and social class.
3 www.palestine-lexicon.org
arXiv:2210.12985v2 [cs.CL] 1 Dec 2022
2 Related Work
Linguistic Descriptions
There are several linguistic references describing various
aspects of PAL (Rice and Sa’id, 1979; Herzallah, 1990;
Hopkins, 1995; Elihai, 2004; Talmon, 2004; Bassal, 2012;
Cotter and Horesh, 2015). These mostly target academics and
language learners. We consulted many of these resources as
part of developing our annotation guidelines.
Dialectal Corpora
We can group DA corpora based on the degree of richness in
their annotations. Some noteworthy examples of unannotated or
lightly annotated corpora of relevance include the MADAR
Corpus (Bouamor et al., 2018), comprising 2K parallel
sentences spread across 25 dialects of Arabic, including PAL
(Jerusalem variety), and the NADI corpus for nuanced dialect
identification (Abdul-Mageed et al., 2021). The Shami Corpus
(Abu Kwaik et al., 2018) includes 21K PAL sentences, and the
Parallel Arabic Dialect Corpus (PADIC) contains 6.4K PAL
sentences (Meftouh et al., 2015). In the spirit of genre
diversification and wider coverage across dialects, El-Haj
(2020) introduced the Habibi Corpus for song lyrics, which
comprises songs from many Arab countries including all
Levantine Arab countries.
Public and freely available morphologically annotated corpora
are scarce for DA and often do not agree on annotation
guidelines. A notable annotated dataset for PAL is the Curras
corpus (Jarrar et al., 2016), a 56K-token morphologically
annotated corpus. Other annotated Levantine dialect efforts
include the Jordan Comprehensive Contemporary Arabic Corpus
(JCCA) (Sawalha et al., 2019), the Jordanian and Syrian
corpora by Alshargi et al. (2019), and the Baladi corpus of
Lebanese Arabic (Al-Haff et al., 2022).
We consulted some of the public corpora as part
of the development of Maknuune. However, most
of the above datasets are based on web scrapes,
which limits the amount of actual lemma coverage
that they could attain.
Dialectal Lexicons
Examples of machine-readable DA lexicons include the
36K-lemma lexicon used for the CALIMA EGY fully inflected
morphological analyzer (Habash et al., 2012), based on the
CALLHOME Egypt lexicon (Gadalla et al., 1997), and the
51K-lemma Egyptian Arabic Tharwa lexicon (Diab et al., 2014),
which provides some morphological annotations.
The Palestinian Colloquial Arabic Vocabulary comprises 4.5K
entries including expressions (Younis and Aldrich, 2021), and
the MADAR Lexicon contains 2.7K entries dedicated to the
Jerusalem variety of PAL, including lemmas, phonological
transcriptions, and glosses in MSA, English and French
(Bouamor et al., 2018).
In addition to the above, there are a number of dictionaries
for Levantine Arabic variants, e.g., Elihai (2004) (9K entries
and 17K phrases for PAL), Halloun (2019) (for PAL), Freiha
(1973) (ca. 5K entries for Lebanese Arabic), and Stowasser and
Ani (2004) (15K entries for Syrian Arabic). These resources
include base lemma forms, occasional plural forms, verb aspect
inflections, and expressions; however, none of them are
available in a machine-readable format, to the best of our
knowledge.
The lexicon presented in this work strives to be a
large-scale and open resource with rich entries cov-
ering phonology, morphology, and lexical expres-
sions, and with a wide-ranging coverage of PAL
sub-dialects. The lexicon may never be complete,
but by making it open to sharing and contribution,
we hope it will become central and useful to NLP
researchers and developers, as well as to linguists
working on Arabic and its dialects.
3 Linguistic Facts
In this section we present some general linguistic
facts about PAL and highlight specific challenging
phenomena that motivated many of our annotation
decisions.
3.1 Phonology and Orthography
Like all other DA, and unlike MSA, PAL has no standard
orthography rules (Jarrar et al., 2016; Habash et al., 2018).
In practice, PAL is primarily written in Arabic script, and to
a lesser extent in Arabizi-style romanization (Darwish, 2014).
Some of the variations in the written form reflect the words’
phonology, morphology, and/or etymological connections to MSA.
Orthogonal to the orthography challenge, and compounding it,
PAL shows a high degree of phonological variability across
its sub-dialects. We highlight some of these phenomena below,
noting that some also exist in other DA.
Consonantal Variables
A number of PAL consonants vary widely within sub-dialects.
For example, the voiceless velar stop /k/ is affricated to the
palatal /tsh/ in many PAL rural varieties (Herzallah, 1990),
e.g., كَيف kayf ‘how’ appears as /k ee f/ (urban) or
/tsh ee f/ (rural).4 Similarly, the MSA voiceless uvular stop
/q/ in the word قَلْب qal.b ‘heart’ is realized either as a
glottal stop /2 a l b/ in urban dialects, as a voiceless velar
stop /k a l b/ in rural dialects, or as a voiced velar stop
/g a l b/ in Bedouin dialects (Herzallah, 1990). It should be
noted that there are some exceptions that do not conform to
the above generalizations. For example, in Beit Fajjar,5 the
word قَهْوَة qah.waħ ‘coffee’, typically varying elsewhere as
/{2,q,g,k} a h w e/, is realized as /tsh h ee w a/. Moreover,
some words do not have varying pronunciations, such as
عْقَال ς.qaAl /3 g aa l/ ‘Egal headband’.
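
As a rough illustration of the /q/ generalization above, the following
Python sketch rewrites /q/ in a space-separated transcription according
to a coarse dialect type. The names Q_REALIZATION and realize_q are
assumptions introduced here, and the mapping deliberately ignores the
exceptions just mentioned (such as the Beit Fajjar form of ‘coffee’ or
words with a fixed pronunciation).

```python
# Coarse realization of MSA /q/ by dialect type, following the
# generalization in the text (exceptions are not modelled).
Q_REALIZATION = {"urban": "2", "rural": "k", "bedouin": "g"}


def realize_q(caphi: str, variety: str) -> str:
    """Replace every /q/ phone with its realization in the given variety."""
    target = Q_REALIZATION[variety]
    phones = caphi.split()
    return " ".join(target if phone == "q" else phone for phone in phones)


urban = realize_q("q a l b", "urban")      # "2 a l b"
rural = realize_q("q a l b", "rural")      # "k a l b"
bedouin = realize_q("q a l b", "bedouin")  # "g a l b"
```

A lexicon would still need to list attested forms per entry, since a
rule of this kind over-generates for lexical exceptions.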
Monophthongization
Some PAL diphthongs shift to different monophthongs in
different locations. For example, the /a y/ diphthong in
شَيخ šayx /sh a y kh/ ‘Sheikh’ often shifts to /ee/
(/sh ee kh/), but also to /ii/ (/sh ii kh/).6 Following the
CODA* guidelines for diacritizing DA (Habash et al., 2018), we
spell the /oo/ and /ee/ sounds using ـَو aw and ـَي ay
(without a sukun on the و w or ي y), respectively, e.g.,
كَوم kawm /k oo m/ ‘pile’ and بَيت bayt /b ee t/ ‘house’.
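
The shift described above can be sketched as a small rewrite over a
space-separated transcription. This is only an illustration under
stated assumptions: the function name monophthongize, the vowel
inventory VOWELS, and the condition that the glide must close the
syllable (so that a geminate glide as in /3 a y y i l/ is left alone)
are choices made here, not rules taken from the lexicon.

```python
# Assumed vowel inventory for the transcription used in this sketch.
VOWELS = {"a", "i", "u", "e", "o", "aa", "ii", "uu", "ee", "oo"}


def monophthongize(caphi: str, ay_target: str = "ee", aw_target: str = "oo") -> str:
    """Rewrite /a y/ and /a w/ as long vowels when the glide closes the syllable."""
    phones = caphi.split()
    out = []
    i = 0
    while i < len(phones):
        glide = phones[i + 1] if i + 1 < len(phones) else None
        following = phones[i + 2] if i + 2 < len(phones) else None
        # The glide closes the syllable if nothing follows it, or if a
        # different consonant follows (not a vowel, not the same glide).
        closed = following is None or (following not in VOWELS and following != glide)
        if phones[i] == "a" and glide in ("y", "w") and closed:
            out.append(ay_target if glide == "y" else aw_target)
            i += 2
        else:
            out.append(phones[i])
            i += 1
    return " ".join(out)


sheikh_ee = monophthongize("sh a y kh")                  # "sh ee kh"
sheikh_ii = monophthongize("sh a y kh", ay_target="ii")  # "sh ii kh"
pile = monophthongize("k a w m")                         # "k oo m"
```

The ay_target parameter reflects the observation that the same
diphthong surfaces as /ee/ in most places but as /ii/ in some.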
Metathesis
In some rural dialects in villages near Tulkarem, Jenin and
Ramallah, there are words with consonant pairs within a
syllable that appear in a different order than is the norm in
PAL, e.g., a word like كَهْرَبَا kah.rabaA /k a h r a b a/
‘electricity’ is realized as /k a r h a b a/.
Epenthesis
PAL exhibits systematic epenthesis of the /i/ or /u/ sounds,
producing paired word alternations such as /b a 3 d/ and
/b a 3 i d/ for بعد ‘still; after’, or /kh u b z/ and
/kh u b u z/ or /kh u b i z/ (in different sub-dialects) for
خبز ‘bread’. We opted to use the fully epenthesized forms in
the lexicon, i.e., بَعِد baςid, خُبُز xubuz, and خُبِز xubiz,
for the above-mentioned examples.
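
To make the alternation concrete, the sketch below derives an
epenthesized form from a transcription that ends in a consonant
cluster. It is a simplified illustration: the function name epenthesize
and the vowel set VOWELS are assumptions made here, only word-final
clusters are handled, and final geminates are left untouched as a
conservative guess.

```python
# Assumed vowel inventory for the transcription used in this sketch.
VOWELS = {"a", "i", "u", "e", "o", "aa", "ii", "uu", "ee", "oo"}


def epenthesize(caphi: str, vowel: str = "i") -> str:
    """Insert `vowel` inside a word-final cluster of two different consonants."""
    phones = caphi.split()
    if (
        len(phones) >= 2
        and phones[-1] not in VOWELS
        and phones[-2] not in VOWELS
        and phones[-1] != phones[-2]  # leave final geminates alone
    ):
        return " ".join(phones[:-1] + [vowel, phones[-1]])
    return caphi


still = epenthesize("b a 3 d")                   # "b a 3 i d"
bread_u = epenthesize("kh u b z", vowel="u")     # "kh u b u z"
bread_i = epenthesize("kh u b z", vowel="i")     # "kh u b i z"
```

Because the choice between /i/ and /u/ differs across sub-dialects, the
vowel is left as a parameter rather than predicted.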
4 Arabic orthographic transliteration is presented in the
HSB Scheme (italics) (Habash et al., 2007). Arabic script
orthography is presented in the CODA* scheme, and Arabic
phonology is presented in the CAPHI scheme (between /../)
(Habash et al., 2018).
5 A Palestinian town located 8 kilometers south of Bethlehem
in the West Bank.
6 In the Palestinian village of Ramadin, near Hebron in the
West Bank.
3.2 Morphology
Like other DA, PAL has a complex morphology em-
ploying templatic and concatenative morphemes,
and including a rich set of morphological features:
gender, number, person, state, aspect, in addition to
numerous clitics. We highlight some specific mor-
phological phenomena that we needed to handle.
Ta Marbuta
The so-called feminine singular suffix morpheme, or Ta Marbuta
(ة ħ), is a morpheme that can be used to mark feminine singular
nominals, but that also appears with masculine singular and
plural nominals. Morphophonemically, it has a number of forms
in PAL that vary contextually. First, in some PAL sub-dialects,
the Ta Marbuta is pronounced as /a/ when preceded by emphatic
consonants, velars, or pharyngeal fricatives, e.g., بَطَّة
baT~aħ /b a t. t. a/ ‘duck’; otherwise it is realized as /e/,
e.g., بِسِّة bis~iħ /b i s s e/ ‘cat’. In some northern PAL
dialects, the /e/ variant appears as /i/; and in some southern
PAL dialects, the distinction is gone and all Ta Marbutas are
pronounced /a/. Second, the Ta Marbuta turns into its
allomorph /i t/ in Idafa constructions, e.g., /b i s s i t/
‘the/a cat of’. Finally, for some active participle deverbal
nouns, the Ta Marbuta is realized as /aa/ or /ii t/ when
followed by a pronominal object clitic, e.g., كَاتْبَاه
kaAt.baAh /k aa t b aa (h)/ or كَاتْبِيْتُه kaAt.biy.tuh
/k aa t b ii t u (h)/ ‘she wrote it’.
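
The first two conditions above lend themselves to a small decision
function. The sketch below is an approximation under explicit
assumptions: the name ta_marbuta, the region labels "default", "north"
and "south", and in particular the set BACKING of phones treated as
emphatic, velar or pharyngeal are illustrative choices made here; the
participle forms with object clitics are not covered.

```python
# Assumed set of stem-final phones that trigger /a/: emphatics,
# velar stops, and pharyngeal fricatives. The exact membership is a
# guess for illustration, not a list taken from the lexicon.
BACKING = {"t.", "d.", "s.", "dh.", "k", "g", "3", "7"}


def ta_marbuta(stem_phones, construct: bool = False, region: str = "default") -> str:
    """Pick a Ta Marbuta allomorph for a stem given as a list of phones."""
    if construct:
        return "i t"   # Idafa (construct) allomorph
    if region == "south":
        return "a"     # distinction neutralized to /a/
    if stem_phones and stem_phones[-1] in BACKING:
        return "a"     # after emphatics, velars, pharyngeal fricatives
    return "i" if region == "north" else "e"


duck = ta_marbuta("b a t. t.".split())                   # "a"
cat = ta_marbuta("b i s s".split())                      # "e"
cat_of = ta_marbuta("b i s s".split(), construct=True)   # "i t"
```

Keeping the region as a parameter makes the northern /i/ and southern
/a/ patterns explicit without hard-coding any one sub-dialect.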
Complex Plural Forms
Besides the common use of broken plural (templatic plural) in
DA, we encountered cases of blocked plurals where a typical
sound plural or templatic plural is not generated because
another word form is used in its place (Aronoff, 1976). One
example, from Ramadin, is the plural form of the word عَيِّل
ςay~il /3 a y y i l/ ‘child [lit. dependent]’, which is
blocked by the word form ضْعُوف D.ςuwf /dh. 3 uu f/
‘children [lit. weaklings]’.
3.3 Syntax
Previous research on Arabic dialects reveals that
the syntactic differences between these dialects are
considered to be minor compared to the morphological
ones (Brustad, 2000). One particularly challenging
phenomenon we encountered is a class
of nouns used in adjectival constructions, but vi-
olating noun-adjective agreement rules, which in-
volve gender, number and rationality (Alkuhlani