Changing the Representation: Examining Language Representation for
Neural Sign Language Production
Harry Walsh, Ben Saunders, Richard Bowden
University of Surrey
{harry.walsh, b.saunders, r.bowden}@surrey.ac.uk
Abstract
Neural Sign Language Production (SLP) aims to automatically translate from spoken language sentences to sign language videos. Historically, the SLP task has been broken into two steps: first, translating from a spoken language sentence to a gloss sequence, and second, producing a sign language video given a sequence of glosses. In this paper we apply Natural Language Processing techniques to the first step of the SLP pipeline. We use language models such as BERT and Word2Vec to create better sentence-level embeddings, and apply several tokenization techniques, demonstrating how these improve performance on the low resource translation task of Text to Gloss. We introduce Text to HamNoSys (T2H) translation, and show the advantages of using a phonetic representation for sign language translation rather than a sign-level gloss representation. Furthermore, we use HamNoSys to extract the hand shape of a sign and use this as additional supervision during training, further increasing performance on T2H. Assembling best practice, we achieve a BLEU-4 score of 26.99 on the MeineDGS dataset and 25.09 on PHOENIX14T, two new state-of-the-art baselines.
Keywords: Sign Language Translation (SLT), Natural Language Processing (NLP), Sign Language, Phonetic Representation
1. Introduction
Sign languages are the dominant form of communication for Deaf communities, with 430 million users worldwide (WHO, 2021). Sign languages are complex multichannel languages with their own grammatical structure and vocabulary (Stokoe, 1980). For many people, sign language is their primary language, and the written form of a spoken language is their secondary language.
Sign Language Production (SLP) aims to bridge the gap between hearing and Deaf communities by translating from spoken language sentences to sign language sequences. This problem has historically been broken into two steps: 1) translation from spoken language to gloss (a gloss being the written word associated with a sign) and 2) subsequent production of sign language sequences from a sequence of glosses, commonly using a graphical avatar (Elliott et al., 2008; Efthimiou et al., 2010; Efthimiou et al., 2009) or, more recently, a photo-realistic signer (Saunders et al., 2021a; Saunders et al., 2021b). In this paper, we improve the SLP pipeline by focusing on the Text to Gloss (T2G) translation task of step 1.
Modern deep learning is heavily dependent upon data. However, the creation of sign language datasets is both time consuming and costly, restricting their size to orders of magnitude smaller than their spoken language counterparts. State-of-the-art datasets such as RWTH-PHOENIX-Weather-2014T (PHOENIX14T) and the newer MeineDGS (mDGS) contain only 8,257 and 63,912 examples respectively (Koller et al., 2015; Hanke et al., 2020), compared to over 15 million examples for common spoken language datasets (Vrandečić and Krötzsch, 2014). Hence, sign languages can be considered low resource languages.
In this work, we take inspiration from NLP techniques to boost translation performance. We explore how language can be modeled using different tokenizers, more specifically Byte Pair Encoding (BPE), WordPiece, word-level and character-level tokenizers. We show that finding the correct tokenizer for the task helps simplify the translation problem.
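As an illustration of these options, the sketch below trains a BPE and a WordPiece tokenizer on a toy gloss corpus using the HuggingFace tokenizers library; the corpus, vocabulary size and special tokens are placeholder values, not the configuration used in our experiments.

```python
# Sketch: comparing BPE and WordPiece segmentation on a toy gloss corpus.
# Vocabulary size and special tokens are illustrative placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ["REGEN WIND NORD", "SONNE SUED WARM", "REGEN NORD KALT"]  # toy glosses

def train(model, trainer_cls):
    tok = Tokenizer(model)
    tok.pre_tokenizer = Whitespace()
    trainer = trainer_cls(vocab_size=100, special_tokens=["[UNK]", "[PAD]"])
    tok.train_from_iterator(corpus, trainer)
    return tok

bpe = train(BPE(unk_token="[UNK]"), BpeTrainer)
wp = train(WordPiece(unk_token="[UNK]"), WordPieceTrainer)

print(bpe.encode("REGEN NORD").tokens)  # subword segmentation under BPE
print(wp.encode("REGEN NORD").tokens)   # segmentation under WordPiece
```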
Furthermore, to help tackle our low resource language task, we explore using pre-trained language models such as BERT (Devlin et al., 2018) and Word2Vec (Mikolov et al., 2013b) to create improved sentence-level embeddings. We also fuse contextual information from the embedding to increase the amount of information available to the network. We show that using models trained on large corpora of data improves translation performance.
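As a sketch of this idea, the snippet below extracts frozen BERT features for a sentence and fuses them with trainable task embeddings by concatenation and projection; the model name, dimensions and concatenation-based fusion are assumptions for illustration, and our exact setup is given in Section 3.

```python
# Sketch: augmenting learned source embeddings with frozen BERT features.
# Model name and concatenate-then-project fusion are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

name = "bert-base-german-cased"  # German text, matching PHOENIX14T / mDGS
tok = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name)
bert.eval()  # frozen: used only as a feature extractor

inputs = tok("im norden regnet es", return_tensors="pt")
with torch.no_grad():
    ctx = bert(**inputs).last_hidden_state      # (1, T, 768) contextual features

learned = nn.Embedding(tok.vocab_size, 512)     # trainable task embeddings
emb = learned(inputs["input_ids"])              # (1, T, 512)

fuse = nn.Linear(768 + 512, 512)                # project fused features back down
fused = fuse(torch.cat([ctx, emb], dim=-1))     # (1, T, 512) encoder input
```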
Previously, the first step of the SLP pipeline used T2G translation. We explore using a phonetic representation based on the Hamburg Notation System (HamNoSys), which we define as Text to HamNoSys (T2H) translation. HamNoSys encodes signs using a set of symbols and can be viewed as a phonetic representation of sign language (Hanke, 2004). There are three main components when representing a sign in HamNoSys: a) its initial configuration, b) its hand shape and c) its action. An example of HamNoSys can be seen in Fig. 1 along with its gloss and text counterparts.
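Since HamNoSys makes the hand shape explicit, it can be extracted and used as an auxiliary training signal. Below is a minimal sketch of such multi-task supervision in PyTorch; the hand-shape inventory size, loss weight, and the assumption that hand shape is predicted per token from the decoder state are all illustrative, not the exact formulation of our method.

```python
# Sketch: adding an auxiliary hand-shape classification loss to T2H training.
# `decoder_states`, the class count and the weight `lam` are hypothetical.
import torch
import torch.nn as nn

num_handshapes = 12   # assumed size of the HamNoSys hand-shape inventory
d_model = 512
lam = 0.5             # auxiliary loss weight (a hyperparameter)

handshape_head = nn.Linear(d_model, num_handshapes)
ce = nn.CrossEntropyLoss()

def total_loss(logits, targets, decoder_states, handshape_targets):
    # Main sequence-to-sequence cross-entropy over HamNoSys tokens.
    main = ce(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Auxiliary loss: predict each sign's hand shape from the decoder state.
    aux = ce(handshape_head(decoder_states).view(-1, num_handshapes),
             handshape_targets.view(-1))
    return main + lam * aux
```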
We evaluate our SLP models on both the mDGS and PHOENIX14T datasets, showing state-of-the-art performance on the T2G (mDGS & PHOENIX14T) and T2H (mDGS) tasks.
Figure 1: An example for the word "running", Top: Text, Middle: Gloss (RUN), Bottom: the associated HamNoSys sequence. The HamNoSys is split into: a) its initial configuration, b) its hand shape and c) its action.
We achieve a BLEU-4 score of 26.99 on mDGS, a significant increase compared to the state-of-the-art score of 3.17 (Saunders et al., 2022).
The rest of this paper is structured as follows: in Section 2 we review the related work in the field; Section 3 presents our methodology; Section 4 shows quantitative and qualitative results; finally, we draw conclusions in Section 5 and suggest future work.
2. Related Work
Sign Language Recognition & Translation: Computational sign language research has been studied for over 30 years (Tamura and Kawasaki, 1988). Research started with isolated Sign Language Recognition (SLR), where individual signs were classified using CNNs (LeCun et al., 1998). Recently, the field has moved to the more challenging problem of Continuous Sign Language Recognition (CSLR), where a continuous sign language video needs to be segmented and then classified (Koller et al., 2015). Most modern approaches to SLR and CSLR rely on deep learning, but such approaches are data hungry and are therefore limited by the size of publicly available datasets.
The distinction between CSLR and Sign Language Translation (SLT) was stressed by Camgoz et al. (2018). SLT aims to translate a continuous sequence of signs to spoken language sentences (Sign to Text (S2T)) or vice versa (Text to Sign (T2S)), a challenging problem due to the changes in grammar and sequence ordering.
Sign Language Production (SLP): SLP focuses on T2S, the production of a continuous sign language sequence given a spoken language input sentence. Current state-of-the-art approaches to SLP use transformer-based architectures with attention (Stoll et al., 2018; Saunders et al., 2020). In this paper, we tackle the SLP task of neural sign language translation, defined as T2G or T2H translation.
HamNoSys has been used before for statistical SLP, with some success (Kaur and Kumar, 2014; Kaur and Kumar, 2016). However, the produced motion is robotic and not practical for real world applications. Note that these approaches first convert the HamNoSys to SiGML, an XML format of HamNoSys (Kaur and Kumar, 2016).
Neural Machine Translation (NMT): NMT aims to generate a target sequence given a source sequence using neural networks (Bahdanau et al., 2014), and is commonly used for spoken language translation. Initial approaches used recurrence to map a hidden state to an output sequence (Kalchbrenner and Blunsom, 2013), with limited performance. Encoder-decoder structures were later introduced that map an input sequence to an embedding space (Wu et al., 2016). To address the bottleneck problem, attention was introduced to measure the affinity between sections of the input and the embedding space, allowing the model to focus on specific context (Bahdanau et al., 2014). This was improved further with the introduction of the transformer (Vaswani et al., 2017), which used Multi-Headed Attention (MHA) to allow multiple projections of the learned attention. More recently, model sizes have grown, with the introduction of architectures such as GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2018).
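For reference, the scaled dot-product attention at the core of MHA can be written in a few lines. The sketch below follows the standard formulation of Vaswani et al. (2017); it is illustrative rather than code from any model discussed here.

```python
# Sketch: scaled dot-product attention as in Vaswani et al. (2017).
import math
import torch

def attention(q, k, v, mask=None):
    # Affinity between queries and keys, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # attention distribution
    return weights @ v                       # weighted sum of values

# Multi-head attention runs several such projections in parallel, e.g. via
# torch.nn.MultiheadAttention(embed_dim=512, num_heads=8).
```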
Different encoding/decoding schemes have been explored. BPE was first introduced in Sennrich et al. (2015) to create a set of tokens given a set vocabulary size. This is achieved by iteratively merging the most commonly occurring sequential characters. WordPiece, a tokenizer similar to BPE, was first introduced in Schuster and Nakajima (2012) and is commonly used when training language models such as BERT, DistilBERT and Electra. Finally, word- and character-level tokenizers break up a sentence based on white space and unique symbols respectively.
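To make the merge procedure concrete, the following pure-Python sketch performs a few BPE merges on the toy vocabulary from Sennrich et al. (2015); the number of merges is illustrative, and a real tokenizer would continue until a target vocabulary size is reached.

```python
# Sketch: BPE learning -- repeatedly merge the most frequent adjacent
# symbol pair. Toy corpus; real vocabularies use thousands of merges.
from collections import Counter

def get_pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Fuse the chosen pair into a single symbol everywhere it occurs.
    return {w.replace(" ".join(pair), "".join(pair)): f
            for w, f in vocab.items()}

# Words as space-separated characters, with corpus frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(5):  # 5 merges for illustration
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
print(vocab)  # words now segmented into learned subword units
```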
Natural Language Processing: NLP has many applications, for example Text Simplification, Text Classification and Speech Recognition. Recently, deep learning approaches have outperformed older statistical methods (Vaswani et al., 2017). A successful NLP model must understand the structure and context of language, learned via supervised or unsupervised methods. Pre-trained language models such as BERT (Devlin et al., 2018) have been used to boost performance on other NLP tasks, achieving state-of-the-art results (Clinchant et al., 2019; Zhu et al., 2020). Zhu et al. (2020) fused the embeddings of BERT into a traditional transformer architecture using attention, increasing translation performance by approximately 2 BLEU.
Other methods have used Word2Vec to model language, and this has been applied to many NLP tasks (Mikolov et al., 2013b). Word2Vec is designed to give meaning to a numerical representation of words, the central idea being that words with similar meanings should have a small Euclidean distance between their vector representations.
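As a toy illustration (using the gensim library; the corpus and hyperparameters are arbitrary), a Word2Vec model can be trained and queried for neighbouring words as follows:

```python
# Sketch: training a toy Word2Vec model and querying similar words.
# Uses the gensim 4.x API; corpus and hyperparameters are illustrative.
from gensim.models import Word2Vec

sentences = [
    ["heute", "regnet", "es", "im", "norden"],
    ["morgen", "scheint", "die", "sonne", "im", "sueden"],
    ["im", "norden", "bleibt", "es", "nass"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

vec = model.wv["norden"]                # 50-d vector for a word
print(model.wv.most_similar("norden"))  # neighbours by vector similarity
```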
In this paper, we take inspiration from these techniques to boost the performance of the low resource tasks of T2G and T2H sign language production.