ArabSign: A Multi-modality Dataset and
Benchmark for Continuous Arabic Sign Language
Recognition
Hamzah Luqman
Information and Computer Science Department, King Fahd University of Petroleum and Minerals
SDAIA-KFUPM Joint Research Center for Artificial Intelligence, Dhahran 31261, Saudi Arabia.
Email: hluqman@kfupm.edu.sa
Abstract—Sign language recognition has attracted the interest of researchers in recent years. While numerous approaches have been proposed for European and Asian sign language recognition, very limited attempts have been made to develop similar systems for Arabic sign language (ArSL). This can be attributed partly to the lack of a sentence-level dataset. In this paper, we aim to make a significant contribution by proposing ArabSign, a continuous ArSL dataset. The proposed dataset consists of 9,335 samples performed by 6 signers. The total duration of the recorded sentences is around 10 hours, and the average sentence length is 3.1 signs. The ArabSign dataset was recorded using a Kinect V2 camera that provides three types of information (color, depth, and skeleton joint points) recorded simultaneously for each sentence. In addition, we provide annotations of the dataset according to ArSL and Arabic language structures, which can help in studying the linguistic characteristics of ArSL. To benchmark this dataset, we propose an encoder-decoder model for continuous ArSL recognition. The model has been evaluated on the proposed dataset, and the results show that the encoder-decoder model outperformed the attention-based model, achieving an average word error rate (WER) of 0.50 compared with 0.62 for the attention mechanism. The data and code are available at https://github.com/Hamzah-Luqman/ArabSign
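For reference, the WER reported above is the standard word-level edit distance between the predicted and reference gloss sequences, normalized by the length of the reference. The following is a minimal sketch of how such a score can be computed; the function name and the example sequences are illustrative only and are not taken from the ArabSign evaluation code.

# Minimal word error rate (WER) computation via Levenshtein edit distance.
# Illustrative sketch only; not the evaluation script used for ArabSign.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example with hypothetical gloss sequences:
# word_error_rate("I GO SCHOOL TOMORROW", "I GO HOME") == 0.5

A WER of 0.50 therefore means that, on average, half of the reference words need to be edited to match the predicted sentence.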
I. INTRODUCTION
Hearing loss is a serious and growing problem worldwide. Nearly 2.5 billion people are projected to have some degree of hearing loss by 2050, and at least 700 million will require hearing rehabilitation [1]. Modern lifestyles and unsafe listening practices put over 1 billion young adults at risk of permanent hearing loss.
Sign language is the main communication language of hearing-impaired people. It is a complete and rich language whose grammar and structure differ from those of spoken languages. Sign language has its own lexicon, which is usually smaller than the vocabulary of spoken languages.
Sign language is not a universal language, and it does not depend on spoken languages [4]. Sign languages are "not mutually intelligible with each other", although some signs are similar. There are many sign languages that differ in their gestures, lexicon, and grammar. Most sign languages are tied to a country rather than to the spoken language of that country. Some countries share one spoken language but have different sign languages, such as British Sign Language (BSL) and American Sign Language (ASL).

Fig. 1: An illustrative example from the ArabSign dataset showing the three modalities provided for each sentence sample: (a) color, (b) depth, and (c) skeleton joint points.
Other popular sign languages include Chinese Sign Language (CSL), German Sign Language (DGS), Indian Sign Language (ISL), and Arabic Sign Language (ArSL). ArSL is one of the main languages used in Arab countries. It is currently the main language used in translating television programs such as news and interviews. ArSL has a dictionary consisting of 3,200 sign words published in two parts [5], [6].
Sign language is a non-verbal language that uses multi-modality data to express thoughts [15]. Manual and non-manual gestures are the two modalities used in sign language communication, and they are combined during signing in a way that complements each other. Manual gestures are the dominant element of sign languages: they employ body movements of the hands and head, and the majority of signs depend on them. The non-manual modality consists mainly of facial expressions performed simultaneously with manual gestures. Non-manual gestures are used to convey emotions and feelings, as well as linguistic properties such as grammatical structure, adjectival or adverbial content, and lexical distinction.
Translating sign language into spoken language is accomplished through sign language recognition (SLR) and translation [28]. Automatic SLR uses pattern recognition and computer vision to identify sign gestures and convert them into their equivalent words in the natural language [40]. Sign language translation uses natural language processing and linguistics to translate the recognized sign language
sentences into spoken-language sentences that conform to the structure and grammar of the target language. Far more research has been conducted on SLR than on translation, since translation depends on the output of sentence-level SLR.
Based on the type of the recognized signs, SLR systems can be categorized into isolated and continuous systems. Isolated sign recognition systems target individual sign words, while continuous sign language recognition (CSLR) systems target more than one sign performed continuously. Most of the techniques proposed for SLR during the last three decades have targeted isolated signs [15]. CSLR is still in its infancy compared with isolated SLR: the growth of CSLR studies is close to linear, compared with the exponential growth of isolated SLR studies [25]. One of the challenges associated with CSLR is the lack of movement epenthesis cues between the signs of a sentence and the lack of temporal information that can help in sign segmentation. In addition, the high variance between signs performed by different signers makes learning segmentation cues very difficult. Another challenge is the lack of datasets, which can be considered the main reason most researchers target isolated signs.
To our knowledge, there is no available vision-based annotated dataset of ArSL sentences that can be used for ArSL CSLR and translation. The datasets that have been proposed for Arabic CSLR were collected using glove sensors. However, sensor-based SLR requires signers to wear electronic sensor gloves while signing, which makes these sensors unsuitable for real-time applications. In addition, the sensors used for sign acquisition cannot capture the non-manual features of sign language. This motivated us to propose a continuous ArSL dataset that can be used for CSLR and translation. The main contributions of this research are as follows:
• Propose a continuous ArSL dataset (ArabSign). The proposed dataset was collected using a multi-modality Kinect V2 camera and is available in three modalities: color, depth, and skeleton joint points, as shown in Fig. 1. The dataset consists of 9,335 samples representing 50 ArSL sentences. Each sentence was performed by 6 signers and repeated several times by each signer.
• Provide annotations of the performed sentences according to the structures of ArSL and the Arabic language. This makes the dataset useful for studying the grammar and structure of ArSL and for developing machine translation systems between ArSL and natural languages.
• Propose an encoder-decoder model for benchmarking the proposed ArabSign dataset. The model has been trained on features extracted from the color frames of the sentences using different pre-trained models, and it has been compared with an attention-based model (see the illustrative sketch after this list).
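To give a concrete picture of this kind of benchmark setup, the following is a minimal sketch of an encoder-decoder pipeline for CSLR: per-frame features are extracted from the color stream with a pre-trained CNN, and a recurrent encoder-decoder maps the feature sequence to a gloss sequence. The backbone (ResNet-18), the use of GRUs, the hidden sizes, and the teacher-forcing decoder interface are illustrative assumptions only, not necessarily the configuration used in the ArabSign benchmark.

# Sketch of a sequence-to-sequence CSLR model: per-frame features from a
# pre-trained CNN are encoded with a GRU and decoded into gloss tokens.
# ResNet-18, GRUs, and the hidden sizes are illustrative choices, not
# necessarily the ArabSign benchmark configuration.
import torch
import torch.nn as nn
from torchvision import models

class FrameFeatureExtractor(nn.Module):
    """Pre-trained CNN applied frame by frame to the color stream."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)   # (B*T, 512)
        return feats.view(b, t, -1)                          # (B, T, 512)

class EncoderDecoder(nn.Module):
    """GRU encoder over frame features, GRU decoder emitting gloss tokens."""
    def __init__(self, vocab_size, feat_dim=512, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, target_tokens):
        # feats: (B, T, feat_dim); target_tokens: (B, L) shifted gloss ids
        _, state = self.encoder(feats)          # summary of the sentence video
        dec_in = self.embed(target_tokens)      # teacher forcing at training time
        dec_out, _ = self.decoder(dec_in, state)
        return self.out(dec_out)                # (B, L, vocab_size) logits

At inference time, the decoder would be run autoregressively from a start token rather than with teacher forcing, and the predicted gloss sequence would be scored against the reference annotation using WER.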
This paper is organized as follows: a literature review of
the available continuous sign language datasets is presented
in Section II. A detailed description of the proposed ArabSign
dataset is presented in Section III. Section IV describes the
experimental work that has been conducted to benchmark the
proposed dataset, and the conclusions are presented in Section
V.
II. LITERATURE REVIEW
Work on SLR dates back to the mid-1990s [25]. SLR systems at the sign level are more common than sentence-level systems, due to the availability of sign-level datasets and the similarity between this problem and gesture recognition [15]. In contrast, few approaches have been proposed for CSLR due to the challenges associated with recognizing sign language sentences. One of these challenges is the lack of annotated datasets.
Few datasets have been proposed for continuous sign language compared with isolated sign datasets. The majority of these datasets target ASL and DGS. Some other datasets are used by researchers for their own work; however, these datasets are either limited in size or unavailable to other researchers. The most commonly used continuous sign language datasets were proposed by a group at RWTH Aachen
University. This group proposed four datasets for contin-
uous ASL and DGS, namely RWTH-BOSTON-104 [13],
RWTH-BOSTON-400 [12], RWTH-PHOENIX-Weather [17],
and RWTH-PHOENIX-Weather-2014 [18]. RWTH-BOSTON-
104 [13] was recorded at Boston University and it consists
of 201 sentences of ASL performed by three signers. The
vocabulary size of this dataset is 168 sign words.
RWTH-BOSTON-400 [12] is an extension of RWTH-
BOSTON-104. It consists of 843 sentences with a vocabulary
size of 406 sign words performed by four signers. RWTH-
PHOENIX-Weather [17] includes weather forecasts collected from German television. Its sentences were performed by seven signers, and it consists of 1,980 DGS sentences with a vocabulary size of 911 sign words. This dataset was extended in RWTH-PHOENIX-Weather-2014 [18] to 6,861 sentences performed by nine signers. Both datasets were recorded in a controlled environment where signers wore dark T-shirts against a grey background. How2Sign [14] is a multi-view
ASL dataset consisting of around 35K samples performed by
11 signers for a duration of 79 hours.
SIGNUM [39] is a DGS dataset consisting of 780 sentences
performed by 25 signers. The SignsWorld Atlas [35] is an
ArSL dataset consisting of five sentences performed by four
signers. TheRuSLan [24] is a Russian sign language dataset consisting
of 164 sentences performed by 13 signers. Huang et al. [22]
proposed a CSL dataset consisting of 100 sentences with a
vocabulary size of 178 sign words performed by 50 signers.
Table I summarizes the available continuous sign language datasets; information missing from the table was not reported in the respective references.
Another challenge of CSLR is sign segmentation. This can be attributed to the lack of movement epenthesis cues between a sentence's signs and the lack of temporal information that can help in sign segmentation.