ArabSign: A Multi-modality Dataset and
Benchmark for Continuous Arabic Sign Language
Recognition
Hamzah Luqman
Information and Computer Science Department, King Fahd University of Petroleum and Minerals
SDAIA-KFUPM Joint Research Center for Artificial Intelligence, Dhahran 31261, Saudi Arabia.
Email: hluqman@kfupm.edu.sa
Abstract—Sign language recognition has attracted the interest of researchers in recent years. While numerous approaches have been proposed for European and Asian sign language recognition, very limited attempts have been made to develop similar systems for Arabic sign language (ArSL). This can be attributed partly to the lack of a sentence-level dataset. In this paper, we aim to make a significant contribution by proposing ArabSign, a continuous ArSL dataset. The proposed dataset consists of 9,335 samples performed by 6 signers. The total duration of the recorded sentences is around 10 hours, and the average sentence length is 3.1 signs. The ArabSign dataset was recorded using a Kinect V2 camera that provides three types of information (color, depth, and skeleton joint points) recorded simultaneously for each sentence. In addition, we provide annotations of the dataset according to ArSL and Arabic language structures, which can help in studying the linguistic characteristics of ArSL. To benchmark this dataset, we propose an encoder-decoder model for continuous ArSL recognition. The model has been evaluated on the proposed dataset, and the results show that the encoder-decoder model outperformed the attention-based model, achieving an average word error rate (WER) of 0.50 compared with 0.62 for the attention mechanism. The data and code are available at https://github.com/Hamzah-Luqman/ArabSign
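For reference, the WER reported above is the standard word-level edit distance between the predicted and reference gloss sequences, normalized by the length of the reference. The following is a minimal sketch of how such a score can be computed; the function name and the example sequences are illustrative only and are not taken from the ArabSign evaluation code.

# Minimal word error rate (WER) computation via Levenshtein edit distance.
# Illustrative sketch only; not the evaluation script used for ArabSign.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example with hypothetical gloss sequences:
# word_error_rate("I GO SCHOOL TOMORROW", "I GO HOME") == 0.5

A WER of 0.50 therefore means that, on average, half of the reference words need to be edited to match the predicted sentence.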
I. INTRODUCTION
Hearing loss is a serious and growing problem worldwide. Nearly 2.5 billion people are projected to have some degree of hearing loss by 2050, and at least 700 million will require hearing rehabilitation [1]. Modern lifestyles and unsafe listening practices put over 1 billion young adults at risk of permanent hearing loss.
Sign language is the main communication language of hearing-impaired people. It is a complete and rich language whose grammar and structure differ from those of spoken languages. Sign language has its own lexicon, which is usually smaller than the vocabulary of spoken languages.
Sign language is not a universal language, and it does not depend on spoken languages [4]. Sign languages are "not mutually intelligible with each other", although some signs are similar. There are many sign languages that differ in their gestures, lexicon, and grammar. Most sign languages are tied to a country rather than to the spoken language of that country. Some countries share one spoken language but have different sign languages, such as British Sign Language (BSL) and American Sign Language (ASL).

Fig. 1: An illustrative example from the ArabSign dataset showing the three modalities provided for each sentence sample: (a) color, (b) depth, and (c) skeleton joint points.
Other popular sign languages include Chinese Sign Language (CSL), German Sign Language (DGS), Indian Sign Language (ISL), and Arabic Sign Language (ArSL). ArSL is one of the main languages used in Arab countries. It is currently the main language used in translating television programs such as news and interviews. ArSL has a dictionary consisting of 3,200 sign words published in two parts [5], [6].
Sign language is a non-verbal language that uses multi-modality data to express thoughts [15]. Manual and non-manual gestures are the two modalities used in sign language communication, and they are combined during signing in a way that complements each other. Manual gestures are the dominant element of sign languages: they employ body movements of the hands and head, and the majority of signs depend on them. The non-manual modality consists mainly of facial expressions performed simultaneously with manual gestures. Non-manual gestures are used to convey emotions and feelings, as well as linguistic properties such as grammatical structure, adjectival or adverbial content, and lexical distinction.
Translating sign language into spoken language is accomplished through sign language recognition (SLR) and translation [28]. Automatic SLR uses pattern recognition and computer vision to identify sign gestures and convert them into their equivalent words in the natural language [40]. Sign language translation uses natural language processing and linguistics to translate the recognized sign language
sentences into spoken-language sentences that conform to the structure and grammar of the target language. Far more research has been conducted on SLR than on translation, since translation depends on the output of sentence-level SLR.
Based on the type of the recognized signs, SLR systems can be categorized into isolated and continuous systems. Isolated sign recognition systems target individual sign words, while continuous sign language recognition (CSLR) systems target more than one sign performed continuously. Most of the techniques proposed for SLR during the last three decades have targeted isolated signs [15]. CSLR is still in its infancy compared with isolated SLR: the growth of CSLR studies is close to linear, compared with the exponential growth of isolated SLR studies [25]. One of the challenges associated with CSLR is the lack of movement epenthesis cues between the signs of a sentence and the lack of temporal information that can help in sign segmentation. In addition, the high variance between signs performed by different signers makes learning segmentation cues very difficult. Another challenge is the lack of datasets, which can be considered the main reason most researchers target isolated signs.
To our knowledge, there is no available vision-based annotated dataset of ArSL sentences that can be used for ArSL CSLR and translation. The datasets that have been proposed for Arabic CSLR were collected using glove sensors. However, sensor-based SLR requires signers to wear electronic sensor gloves while signing, which makes these sensors unsuitable for real-time applications. In addition, the sensors used for sign acquisition cannot capture the non-manual features of sign language. This motivated us to propose a continuous ArSL dataset that can be used for CSLR and translation. The main contributions of this research are as follows:
• Propose a continuous ArSL dataset (ArabSign). The proposed dataset was collected using a multi-modality Kinect V2 camera and is available in three modalities: color, depth, and skeleton joint points, as shown in Fig. 1. The dataset consists of 9,335 samples representing 50 ArSL sentences. Each sentence was performed by 6 signers and repeated several times by each signer.
• Provide annotations of the performed sentences according to the structures of ArSL and the Arabic language. This makes the dataset useful for studying the grammar and structure of ArSL and for developing machine translation systems between ArSL and natural languages.
• Propose an encoder-decoder model for benchmarking the proposed ArabSign dataset. The model has been trained on features extracted from the color frames of the sentences using different pre-trained models, and it has been compared with an attention-based model (see the illustrative sketch after this list).
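To give a concrete picture of this kind of benchmark setup, the following is a minimal sketch of an encoder-decoder pipeline for CSLR: per-frame features are extracted from the color stream with a pre-trained CNN, and a recurrent encoder-decoder maps the feature sequence to a gloss sequence. The backbone (ResNet-18), the use of GRUs, the hidden sizes, and the teacher-forcing decoder interface are illustrative assumptions only, not necessarily the configuration used in the ArabSign benchmark.

# Sketch of a sequence-to-sequence CSLR model: per-frame features from a
# pre-trained CNN are encoded with a GRU and decoded into gloss tokens.
# ResNet-18, GRUs, and the hidden sizes are illustrative choices, not
# necessarily the ArabSign benchmark configuration.
import torch
import torch.nn as nn
from torchvision import models

class FrameFeatureExtractor(nn.Module):
    """Pre-trained CNN applied frame by frame to the color stream."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)   # (B*T, 512)
        return feats.view(b, t, -1)                          # (B, T, 512)

class EncoderDecoder(nn.Module):
    """GRU encoder over frame features, GRU decoder emitting gloss tokens."""
    def __init__(self, vocab_size, feat_dim=512, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, target_tokens):
        # feats: (B, T, feat_dim); target_tokens: (B, L) shifted gloss ids
        _, state = self.encoder(feats)          # summary of the sentence video
        dec_in = self.embed(target_tokens)      # teacher forcing at training time
        dec_out, _ = self.decoder(dec_in, state)
        return self.out(dec_out)                # (B, L, vocab_size) logits

At inference time, the decoder would be run autoregressively from a start token rather than with teacher forcing, and the predicted gloss sequence would be scored against the reference annotation using WER.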
This paper is organized as follows: a literature review of
the available continuous sign language datasets is presented
in Section II. A detailed description of the proposed ArabSign
dataset is presented in Section III. Section IV describes the
experimental work that has been conducted to benchmark the
proposed dataset, and the conclusions are presented in Section
V.
II. LITERATURE REVIEW
Work on SLR dates back to the mid-1990s [25]. SLR systems at the sign level are more common than sentence-level systems, due to the availability of sign-level datasets and the similarity between this problem and gesture recognition [15]. In contrast, few approaches have been proposed for CSLR due to the challenges associated with recognizing sign language sentences. One of these challenges is the lack of annotated datasets.
Few datasets have been proposed for continuous sign language compared with isolated sign datasets. The majority of these datasets target ASL and DGS. Some other datasets are used by researchers for their own work; however, these datasets are either limited in size or unavailable to other researchers. The most commonly used continuous sign language datasets were proposed by a group at RWTH Aachen
University. This group proposed four datasets for contin-
uous ASL and DGS, namely RWTH-BOSTON-104 [13],
RWTH-BOSTON-400 [12], RWTH-PHOENIX-Weather [17],
and RWTH-PHOENIX-Weather-2014 [18]. RWTH-BOSTON-
104 [13] was recorded at Boston University and it consists
of 201 sentences of ASL performed by three signers. The
vocabulary size of this dataset is 168 sign words.
RWTH-BOSTON-400 [12] is an extension of RWTH-
BOSTON-104. It consists of 843 sentences with a vocabulary
size of 406 sign words performed by four signers. RWTH-
PHOENIX-Weather [17] includes weather forecasts collected from German television. Its sentences were performed by seven signers, and it consists of 1,980 DGS sentences with a vocabulary size of 911 sign words. This dataset was extended in RWTH-PHOENIX-Weather-2014 [18] to 6,861 sentences performed by nine signers. Both datasets were recorded in a controlled environment where signers wore dark T-shirts against a grey background. How2Sign [14] is a multi-view
ASL dataset consisting of around 35K samples performed by
11 signers for a duration of 79 hours.
SIGNUM [39] is a DGS dataset consisting of 780 sentences
performed by 25 signers. The SignsWorld Atlas [35] is an
ArSL dataset consisting of five sentences performed by four
signers. TheRuSLan [24] is a Russian sign language dataset consisting
of 164 sentences performed by 13 signers. Huang et al. [22]
proposed a CSL dataset consisting of 100 sentences with a
vocabulary size of 178 sign words performed by 50 signers.
Table I summarizes the available continuous sign language datasets; information missing from the table was not reported in the respective references.
Another challenge of CSLR is sign segmentation. This can be attributed to the lack of movement epenthesis cues between a sentence's signs and the lack of temporal information that can help in sign segmentation.