Hierarchical I3D for Sign Spotting
Ryan Wong, Necati Cihan Camgöz, and Richard Bowden
University of Surrey
{r.wong,n.camgoz,r.bowden}@surrey.ac.uk
Abstract. Most of the vision-based sign language research to date has focused
on Isolated Sign Language Recognition (ISLR), where the objective is to predict
a single sign class given a short video clip. Although there has been
significant progress in ISLR, its real-life applications are limited. In this
paper, we focus on the challenging task of Sign Spotting instead, where the
goal is to simultaneously identify and localise signs in continuous
co-articulated sign videos. To address the limitations of current ISLR-based
models, we propose a hierarchical sign spotting approach which learns
coarse-to-fine spatio-temporal sign features to take advantage of
representations at various temporal levels and provide more precise sign
localisation. Specifically, we develop the Hierarchical Sign I3D model
(HS-I3D), which consists of a hierarchical network head that is attached to the
existing spatio-temporal I3D model to exploit features at different layers of
the network. We evaluate HS-I3D on the MSSL track of the ChaLearn 2022 Sign
Spotting Challenge and achieve a state-of-the-art F1 score of 0.607, which made
it the top-1 winning solution of the competition.
Keywords: Sign Language Recognition, Sign Spotting
1 Introduction
Sign Languages are visual languages that incorporate the motion of the hands,
facial expression and body movement [3]. They are the primary form of
communication amongst deaf communities, and most countries have their own sign
languages, with dialects that vary across regions. Although there are many
commonalities across sign languages in terms of linguistics and grammatical
rules, each has a very different lexicon [24].
There has been increasing interest in computational sign language research.
A popular research topic has been Isolated Sign Language Recognition (ISLR),
where the goal is to identify which single sign is present in a short isolated sign
video clip [15,17]. Although there are still challenges to solve, such as signer-
independent recognition [25], the real-life applications of ISLR are limited.
In this work we focus on the closely related field of Sign Spotting, where
the objective is to identify and localise instances of signs within a co-articulated
continuous sign video. Sign spotting is beneficial for several real-life applications,
such as Sign Content Retrieval, where spotting models are used to search through
large unlabelled corpora to locate instances of signs.
arXiv:2210.00951v1 [cs.CV] 3 Oct 2022
Current sign spotting approaches can be categorized into two groups. The first
is dictionary-based sign spotting, where, given an isolated sign, the objective
is to identify and locate co-articulated instances of that sign in a continuous
sign video. This usually involves one-shot or few-shot learning, where minimal
annotated examples are available [26].
The second group of sign spotting approaches aligns more closely with ISLR, but
instead of a sign video containing an isolated sign, the video segment is
usually longer, with one or multiple instances of co-articulated signs that
need to be identified from a set vocabulary. This involves multiple shot
supervised learning, where there are multiple examples of a set vocabulary
within a larger corpus of continuous sign videos.
Fig. 1. Concept of the Hierarchical Sign I3D model, which takes an input video
sequence and predicts the localisation of signs at various temporal resolutions.
In this work we build upon the latter group of sign spotting approaches. To
address the limitations of the previous approaches, we propose a novel
hierarchical spatio-temporal network architecture, named the Hierarchical Sign
I3D model (HS-I3D), which identifies coarse-to-fine temporal locations of signs
in continuous sign videos, as shown in Fig. 1.
HS-I3D comprises a backbone and a head. Although our approach can be used with
any spatio-temporal backbone, we chose I3D due to its success in related sign
tasks [26], which enables the use of other pretrained SLR models. The
additional hierarchical spatio-temporal network head can predict sign labels at
the frame level, giving a better estimate of the boundaries between signs.
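To make the link between frame-level labels and sign boundaries concrete, the following is a minimal sketch of how a sequence of per-frame class predictions could be collapsed into spotted sign instances. It is an illustration only, not the post-processing used by HS-I3D: the function name `frames_to_segments`, the `BACKGROUND` label value and the inclusive end-frame convention are all assumptions made for this example.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical label for frames that contain no sign from the vocabulary.
BACKGROUND = -1


def frames_to_segments(frame_labels: List[int]) -> List[Tuple[int, int, int]]:
    """Collapse per-frame labels into (label, start_frame, end_frame) runs.

    Consecutive frames with the same label form one segment; the end frame
    is inclusive. Background runs are skipped.
    """
    segments = []
    position = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label != BACKGROUND:
            segments.append((label, position, position + length - 1))
        position += length
    return segments


if __name__ == "__main__":
    # Two signs (classes 7 and 3) separated by background frames.
    print(frames_to_segments([-1, -1, 7, 7, 7, -1, 3, 3, -1]))
    # [(7, 2, 4), (3, 6, 7)]
```

Each returned tuple pairs a sign class with its first and last frame, which is the kind of output a spotting system must produce: both the identity and the temporal location of each sign.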
The main contributions of this work can be summarised as:
1. We introduce a novel hierarchical spatio-temporal network head which can
be attached to existing spatio-temporal sign models to learn the
coarse-to-fine temporal locations of signs.
2. We demonstrate the importance of incorporating random sampling techniques
during training and show their impact on the trade-off between precision
and recall.
3. Our architecture achieves state-of-the-art results on the 2022 ChaLearn Sign
Spotting Challenge in the multiple shot supervised learning (MSSL) track.
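Since the results are reported as an F1 score and the second contribution concerns the trade-off between precision and recall, a short sketch of how the three quantities relate may be useful. It uses the standard definition of F1 as the harmonic mean of precision and recall. The rule that decides when a predicted sign counts as a true positive is not given in this section, so the counts below are hypothetical inputs, and the function name `f1_score` is chosen for this example only.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, which also avoids
    division by zero when nothing was predicted or nothing was annotated.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: a conservative model (higher precision, lower recall)
    # versus a permissive one (lower precision, higher recall).
    print(round(f1_score(6, 2, 4), 3))  # precision 0.75, recall 0.60
    print(round(f1_score(8, 8, 2), 3))  # precision 0.50, recall 0.80
```

Because F1 is a harmonic mean, it is pulled towards the lower of the two values, so gaining recall at a large cost in precision (or the reverse) does not necessarily raise the score.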
2 Related work
2.1 Sign Language Recognition
Over the last few decades, significant progress has been made towards Sign
Language Recognition. Traditional feature-engineered approaches, such as hand
shape and motion modeling techniques [2,6,9,11], have been replaced by
data-driven machine learning approaches. These data-driven approaches require
large annotated datasets; therefore, many ISLR datasets have been created,
including but not limited to Turkish Sign Language (TID) [25], American Sign
Language (ASL) [15,17], Chinese Sign Language [30] and British Sign Language
(BSL) [1,7].
Most of the current ISLR approaches use either raw RGB videos or pose-based
input. The pose-based input utilizes human pose estimators, such as OpenPose
[4] and MediaPipe [19], which distill the signer to a set of keypoints. By
using keypoints, irrelevant appearance information is discarded, such as the
background and a person's visual appearance. Various models have been developed
to accept keypoint input. PoseSign [1] makes use of a 2D ResNet architecture
[10], but keypoint inputs were found to underperform compared to RGB-based
approaches. More recently, Graph Convolutional Networks (GCNs) using human
keypoints have achieved comparable results to RGB models [13].
RGB-based approaches, which use the raw frames as input, have been extended to
spatio-temporal architectures, building on existing action recognition models,
such as 3DCNNs [21] and more recently the I3D model [5]. Such architectures
achieve strong classification performance on ISLR datasets [13,15,17]. As in
other areas of computer vision, transfer learning has been shown to be
effective in improving results of sign language recognition [1], which is
especially important for transferring domain knowledge across different sign
languages. Motivated by this, we use a pretrained I3D model as the backbone for
our proposed Hierarchical Sign I3D model, which allows us to leverage models
pretrained on larger-scale ISLR datasets.
2.2 Sign Spotting
While ISLR aims to identify an isolated sign in a given sequence, Sign Spotting
requires identification of both the start and end of a sign instance from a set
vocabulary within a continuous sign video. Early methods utilized hand-crafted
features, such as thresholding-based approaches using Conditional Random
Fields, to distinguish between signs in the vocabulary and non-sign patterns
[28]. Another technique detected skin-coloured regions in frames and utilized
temporal alignment techniques, such as Dynamic Time Warping [27]. Sequential
Interval Patterns were also proposed, which used hierarchical trees to learn a
strong classifier to spot signs [20].
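As background for the temporal alignment techniques mentioned above, the following is a minimal sketch of textbook Dynamic Time Warping between two one-dimensional feature sequences. It is not the method of [27]: the scalar features, the absolute-difference local cost and the function name `dtw_distance` are assumptions made to keep the example short.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the minimum cumulative alignment cost between sequences a and b.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j]. Each cell
    extends the best of three moves: advance in a, advance in b, or advance
    in both.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])  # local cost between two frames
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


if __name__ == "__main__":
    # The same pattern performed at two speeds aligns with zero cost.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The zero cost in the usage example shows why such alignment suits signing: the same sign produced at different speeds yields sequences of different lengths, which a frame-by-frame comparison would penalise but a warped alignment does not.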
One-shot approaches, which use sign dictionaries, have recently been explored,
where given a set vocabulary from a dictionary, the objective is to locate these