

Hierarchical I3D for Sign Spotting

Ryan Wong, Necati Cihan Camgöz, and Richard Bowden

University of Surrey

{r.wong,n.camgoz,r.bowden}@surrey.ac.uk

Abstract. Most of the vision-based sign language research to date has focused on Isolated Sign Language Recognition (ISLR), where the objective is to predict a single sign class given a short video clip. Although there has been significant progress in ISLR, its real-life applications are limited. In this paper, we focus on the challenging task of Sign Spotting instead, where the goal is to simultaneously identify and localise signs in continuous co-articulated sign videos. To address the limitations of current ISLR-based models, we propose a hierarchical sign spotting approach which learns coarse-to-fine spatio-temporal sign features to take advantage of representations at various temporal levels and provide more precise sign localisation. Specifically, we develop the Hierarchical Sign I3D model (HS-I3D), which consists of a hierarchical network head that is attached to the existing spatio-temporal I3D model to exploit features at different layers of the network. We evaluate HS-I3D on the ChaLearn 2022 Sign Spotting Challenge - MSSL track and achieve a state-of-the-art 0.607 F1 score, which was the top-1 winning solution of the competition.

Keywords: Sign Language Recognition, Sign Spotting

1 Introduction

Sign Languages are visual languages that incorporate the motion of the hands, facial expression and body movement [3]. They are the primary form of communication amongst deaf communities, with most countries having their own sign languages with different dialects across different regions. Although there are many commonalities across sign languages in terms of linguistics and grammatical rules, each has a very different lexicon [24].

There has been increasing interest in computational sign language research. A popular research topic has been Isolated Sign Language Recognition (ISLR), where the goal is to identify which single sign is present in a short isolated sign video clip [15,17]. Although there are still challenges to solve, such as signer-independent recognition [25], the real-life applications of ISLR are limited.

In this work we focus on the closely related field of Sign Spotting, where the objective is to identify and localise instances of signs within a co-articulated continuous sign video. Sign spotting is beneficial for several real-life applications, such as Sign Content Retrieval, where spotting models are used to search through large unlabelled corpora to locate instances of signs.

arXiv:2210.00951v1 [cs.CV] 3 Oct 2022


Current sign spotting approaches can be categorized into two groups. The first is dictionary-based sign spotting, where, given an isolated sign, the objective is to identify and locate co-articulated instances of that sign in a continuous sign video. This usually involves one-shot or few-shot learning, where minimal annotated examples are available [26].

The second group of sign spotting approaches aligns more closely with ISLR, but instead of a sign video containing an isolated sign, the video segment is usually longer, with one or multiple instances of co-articulated signs that need to be identified from a set vocabulary. This involves multiple shot supervised learning, where there are multiple examples of a set vocabulary within a larger corpus of continuous sign videos.

Fig. 1. Concept of the Hierarchical Sign I3D model, which takes an input video sequence and predicts the localisation of signs at various temporal resolutions.

In this work we build upon the latter group of sign spotting approaches. To address the limitations of the previous approaches, we propose a novel hierarchical spatio-temporal network architecture, named the Hierarchical Sign I3D model (HS-I3D), which identifies coarse-to-fine temporal locations of signs in continuous sign videos, as shown in Fig. 1.

HS-I3D comprises a backbone and a head. Although our approach can be used with any spatio-temporal backbone, we have chosen I3D due to its success in related sign tasks [26], which enables the use of other pretrained SLR models. The additional hierarchical spatio-temporal network head has the ability to predict sign labels at the frame level for better estimation of the boundaries between signs.
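
To make the link between frame-level labels and sign boundaries concrete, the following is a minimal sketch of how per-frame class predictions could be collapsed into localised sign segments. It is an illustration only and not the HS-I3D post-processing: the function name `frames_to_segments`, the background label `-1` and the half-open `[start, end)` frame convention are all assumptions made for this example.

```python
from itertools import groupby

# Hypothetical label for frames that contain no sign from the vocabulary.
BACKGROUND = -1


def frames_to_segments(frame_labels, background=BACKGROUND):
    """Collapse per-frame class labels into (class_id, start, end) segments.

    `start` is inclusive and `end` is exclusive, both as frame indices.
    Consecutive frames with the same non-background label form one segment.
    """
    segments = []
    position = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        if label != background:
            segments.append((label, position, position + length))
        position += length
    return segments


# Nine frames: class 7 spans frames 2..4 and class 3 spans frames 6..7.
labels = [-1, -1, 7, 7, 7, -1, 3, 3, -1]
print(frames_to_segments(labels))  # [(7, 2, 5), (3, 6, 8)]
```

With labels available for every frame, a boundary can fall on any frame, whereas a coarser prediction could only place it at the resolution of its own temporal stride.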

The main contributions of this work can be summarised as:

1. We introduce a novel hierarchical spatio-temporal network head which can be attached to existing spatio-temporal sign models to learn the coarse-to-fine temporal locations of signs.
2. We demonstrate the importance of incorporating random sampling techniques during training and show the impact and trade-off it has between precision and recall.
3. Our architecture achieves state-of-the-art results on the 2022 ChaLearn Sign Spotting Challenge in the multiple shot supervised learning (MSSL) track.
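
Since the contributions are stated in terms of precision, recall and F1, a small sketch of how such scores can be computed for spotting output may help. This section does not describe the challenge's evaluation protocol, so the matching rule below (same class, temporal intersection-over-union at or above a single threshold of 0.5, greedy one-to-one matching) is an assumption chosen for illustration, and the names `temporal_iou` and `spotting_scores` are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two half-open intervals (start, end)."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def spotting_scores(predictions, ground_truth, iou_threshold=0.5):
    """Return (precision, recall, f1) for lists of (class_id, start, end).

    Assumed rule: a prediction is a true positive when it has the same class
    as a not-yet-matched ground-truth segment and their temporal IoU reaches
    the threshold. Each ground-truth segment can be matched at most once.
    """
    matched = set()
    true_positives = 0
    for p_class, p_start, p_end in predictions:
        best_index, best_iou = None, 0.0
        for index, (g_class, g_start, g_end) in enumerate(ground_truth):
            if index in matched or g_class != p_class:
                continue
            iou = temporal_iou((p_start, p_end), (g_start, g_end))
            if iou >= iou_threshold and iou > best_iou:
                best_index, best_iou = index, iou
        if best_index is not None:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


truth = [(7, 2, 5), (3, 6, 8)]
predicted = [(7, 2, 6), (3, 0, 2), (5, 6, 8)]
precision, recall, f1 = spotting_scores(predicted, truth)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.333 0.5 0.4
```

In this toy case only the first prediction matches, so emitting more predictions could raise recall at the cost of precision, which is the kind of trade-off the second contribution refers to.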


2 Related work

2.1 Sign Language Recognition

Over the last few decades, significant progress has been made towards Sign Language Recognition. Traditional feature-engineered approaches, such as hand shape and motion modeling techniques [2,6,9,11], have been replaced by data-driven machine learning approaches. These data-driven approaches require large annotated datasets; therefore, many ISLR datasets have been created, including but not limited to Turkish Sign Language (TID) [25], American Sign Language (ASL) [15,17], Chinese Sign Language [30] and British Sign Language (BSL) [1,7].

Most of the current ISLR approaches use either raw RGB videos or pose-based input. The pose-based input utilizes human pose estimators such as OpenPose [4] and MediaPipe [19], which distill the signer to a set of keypoints. By using keypoints, irrelevant appearance information is discarded, such as the background and a person's visual appearance. Various models have been developed to allow keypoint input. Pose→Sign [1] makes use of a 2D ResNet architecture [10] and found that the keypoint inputs underperformed compared to RGB-based approaches. More recently, Graph Convolutional Networks (GCNs) using human keypoints have achieved comparable results to RGB models [13].

RGB-based approaches, which use the raw frames as input, have been extended to spatio-temporal architectures, building on existing action recognition models, such as 3DCNNs [21] and more recently the I3D model [5]. Such architectures achieve strong classification performance on ISLR datasets [13,15,17].

As in other areas of computer vision, transfer learning has been shown to be effective in improving results of sign language recognition [1], which is especially important for transferring domain knowledge across different sign languages. Motivated by this, we use a pretrained I3D model as the backbone for our proposed Hierarchical Sign I3D model, which allows us to leverage models pretrained on larger-scale ISLR datasets.

2.2 Sign Spotting

While ISLR aims to identify an isolated sign in a given sequence, Sign Spotting requires identification of both the start and end of a sign instance from a set vocabulary within a continuous sign video. Early methods utilized hand-crafted features, such as thresholding-based approaches using Conditional Random Fields, to distinguish between signs in the vocabulary and non-sign patterns [28]. Another technique detected skin-coloured regions in frames and utilized temporal alignment techniques, such as Dynamic Time Warping [27]. Sequential Interval Patterns were also proposed, which used hierarchical trees to learn a strong classifier to spot signs [20].
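
For readers unfamiliar with the temporal alignment idea mentioned above, the following is a minimal sketch of textbook Dynamic Time Warping between two one-dimensional sequences. It is not the method of [27]: the absolute-difference cost, the scalar features and the function name `dtw_distance` are assumptions made to keep the example short.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) Dynamic Time Warping distance.

    Each element of `a` is aligned with one or more elements of `b` (and vice
    versa) in temporal order; the cost of aligning two elements is taken here
    to be their absolute difference.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays on the same element
                cost[i][j - 1],      # b advances, a stays on the same element
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# The same pattern performed at two speeds aligns with zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The property that matters for signing is that the distance tolerates differences in speed: a sign held for more frames in one video can still align closely with a faster instance of the same sign.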

One-shot approaches, which use sign dictionaries, have recently been explored, where given a set vocabulary from a dictionary, the objective is to locate these
