sentences into spoken languages to meet their structure and
grammar. SLR has received far more research attention than
translation, since translation depends on the output of
sentence-level SLR.
Based on the type of recognized signs, SLR systems
can be categorized into isolated and continuous systems.
Isolated sign recognition systems target individual sign words,
while continuous sign language recognition (CSLR) systems
target sequences of signs performed continuously. Most of
the techniques that have been proposed for SLR during the
last three decades have targeted isolated signs [15]. CSLR is
still in its infancy compared with isolated SLR: the number of
CSLR studies has grown roughly linearly, whereas isolated
SLR studies have grown exponentially [25]. One of the
challenges associated with CSLR is the lack of movement
epenthesis cues between the signs of a sentence and the lack of
temporal information that can aid sign segmentation.
In addition, the high variance between signs performed by
different signers makes learning segmentation cues very
difficult. Another challenge is the scarcity of datasets, which can
be considered the main reason most researchers target
isolated signs.
To our knowledge, no vision-based annotated sentence
dataset of ArSL is available for ArSL CSLR and
translation. The datasets that have been proposed for Arabic
CSLR were collected using glove sensors. However, sensor-
based SLR requires signers to wear electronic sensor
gloves while signing, which makes these systems unsuitable
for real-time applications. In addition, the sensors used
for sign acquisition cannot capture the non-manual features
(e.g., facial expressions) of sign language. This motivated us to propose a continuous
ArSL dataset that can be used for CSLR and translation. The
main contributions of this research are as follows:
• Propose a continuous ArSL dataset (ArabSign). The proposed
dataset was collected using a multi-modal Kinect
V2 camera and is available in three modalities:
color, depth, and joint points, as shown in Figure 1. The
proposed dataset consists of 9,335 samples representing
50 ArSL sentences. Each sentence was performed by six
signers and repeated several times by each signer.
• Provide the annotation of the performed sentences according
to the structure of ArSL and the Arabic language.
This makes the dataset useful for studying the grammar
and structure of ArSL and for developing machine translation
systems between ArSL and natural languages.
• Propose an encoder-decoder model for benchmarking
the proposed ArabSign dataset. The model was
trained on features extracted from the color frames of the
sentences using different pre-trained models, and it was
compared with an attention-based model. A simplified
sketch of this pipeline is shown after this list.
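The following is a minimal sketch of this kind of encoder-decoder pipeline over pre-extracted frame features, not the exact implementation used in our experiments: the feature dimension (2048, as produced by many pre-trained CNN backbones), the hidden size, the vocabulary size, and all variable names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the exact benchmark model):
# a GRU-based encoder-decoder mapping per-frame CNN features of a
# signed sentence to a sequence of word tokens.
import torch
import torch.nn as nn

FEAT_DIM, HID_DIM, VOCAB = 2048, 256, 100  # assumed sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, HID_DIM, batch_first=True)

    def forward(self, feats):           # feats: (B, T, FEAT_DIM)
        _, h = self.rnn(feats)          # h: (1, B, HID_DIM)
        return h

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID_DIM)
        self.rnn = nn.GRU(HID_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, VOCAB)

    def forward(self, tokens, h):       # tokens: (B, L) word ids
        x, h = self.rnn(self.emb(tokens), h)
        return self.out(x), h           # logits: (B, L, VOCAB)

# One teacher-forced training step on a dummy batch of features.
encoder, decoder = Encoder(), Decoder()
feats = torch.randn(4, 80, FEAT_DIM)       # 4 sentences, 80 frames each
target = torch.randint(0, VOCAB, (4, 6))   # 6-token target sentences
logits, _ = decoder(target[:, :-1], encoder(feats))
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), target[:, 1:].reshape(-1))
loss.backward()
```

At inference time, the decoder would be run autoregressively from a start token; the attention-based variant would additionally condition each decoding step on a weighted sum of the encoder's per-frame outputs rather than on the final hidden state alone.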
This paper is organized as follows: a literature review of
the available continuous sign language datasets is presented
in Section II. A detailed description of the proposed ArabSign
dataset is presented in Section III. Section IV describes the
experimental work that has been conducted to benchmark the
proposed dataset, and the conclusions are presented in Section
V.
II. LITERATURE REVIEW
Work on SLR dates back to the mid-1990s [25].
SLR systems at the sign level are more common
than sentence-level systems due
to the availability of datasets at the sign level and the similarity
between this problem and gesture recognition problems [15].
In contrast, few approaches have been proposed for CSLR due
to the challenges associated with recognizing sign language
sentences. One of these challenges is the lack of annotated
datasets.
Few datasets have been proposed for continuous SL compared
with isolated sign datasets, and the majority of them
target ASL and DGS. Several other datasets have been used
in individual studies; however, they are either limited in
size or not publicly available. The most
commonly used continuous sign language
guage datasets were proposed by a group at RWTH Aachen
University. This group proposed four datasets for contin-
uous ASL and DGS, namely RWTH-BOSTON-104 [13],
RWTH-BOSTON-400 [12], RWTH-PHOENIX-Weather [17],
and RWTH-PHOENIX-Weather-2014 [18]. RWTH-BOSTON-
104 [13] was recorded at Boston University and consists
of 201 ASL sentences performed by three signers, with a
vocabulary of 168 sign words.
RWTH-BOSTON-400 [12] is an extension of RWTH-
BOSTON-104. It consists of 843 sentences with a vocabulary
size of 406 sign words performed by four signers. RWTH-
PHOENIX-Weather [17] includes weather forecasts collected
from German television. It was performed by seven
signers and consists of 1,980 DGS sentences with a
vocabulary size of 911 sign words. This dataset was extended
in RWTH-PHOENIX-Weather-2014 [18] to 6,861 sentences
performed by nine signers. Both datasets were recorded in a
controlled environment, with signers wearing dark T-shirts
against a grey background. How2Sign [14] is a multi-view
ASL dataset consisting of around 35K samples performed by
11 signers, with a total duration of 79 hours.
SIGNUM [39] is a DGS dataset consisting of 780 sentences
performed by 25 signers. The SignsWorld Atlas [35] is an
ArSL dataset consisting of five sentences performed by four
signers. TheRuSLan [24] is a Russian SL dataset consisting
of 164 sentences performed by 13 signers. Huang et al. [22]
proposed a CSL dataset consisting of 100 sentences with a
vocabulary size of 178 sign words performed by 50 signers.
Table I summarizes the available continuous sign language
datasets. Missing entries in the table are not reported
in the corresponding references.
Another challenge of CSLR is sign segmentation. This can
be attributed to the lack of movement epenthesis cues between
a sentence's signs and the lack of temporal information that can