Changing the Representation: Examining Language Representation for
Neural Sign Language Production
Harry Walsh, Ben Saunders, Richard Bowden
University of Surrey
{harry.walsh, b.saunders, r.bowden}@surrey.ac.uk
Abstract
Neural Sign Language Production (SLP) aims to automatically translate from spoken language sentences to sign language videos. Historically, the SLP task has been broken into two steps: first, translating from a spoken language sentence to a gloss sequence, and second, producing a sign language video given a sequence of glosses. In this paper we apply Natural Language Processing techniques to the first step of the SLP pipeline. We use language models such as BERT and Word2Vec to create better sentence-level embeddings, and apply several tokenization techniques, demonstrating how these improve performance on the low resource translation task of Text to Gloss. We introduce Text to HamNoSys (T2H) translation, and show the advantages of using a phonetic representation for sign language translation rather than a sign-level gloss representation. Furthermore, we use HamNoSys to extract the hand shape of a sign and use this as additional supervision during training, further increasing performance on T2H. Assembling best practice, we achieve a BLEU-4 score of 26.99 on the MeineDGS dataset and 25.09 on PHOENIX14T, two new state-of-the-art baselines.
Keywords: Sign Language Translation (SLT), Natural Language Processing (NLP), Sign Language, Phonetic Representation
1. Introduction
Sign languages are the dominant form of communication for Deaf communities, with 430 million users worldwide (WHO, 2021). Sign languages are complex multichannel languages with their own grammatical structure and vocabulary (Stokoe, 1980). For many people, sign language is their primary language, and the written form of a spoken language is their secondary language.
Sign Language Production (SLP) aims to bridge the gap between hearing and Deaf communities by translating from spoken language sentences to sign language sequences. This problem has historically been broken into two steps: 1) translation from spoken language to gloss (a gloss being the written word associated with a sign) and 2) subsequent production of sign language sequences from a sequence of glosses, commonly using a graphical avatar (Elliott et al., 2008; Efthimiou et al., 2010; Efthimiou et al., 2009) or, more recently, a photo-realistic signer (Saunders et al., 2021a; Saunders et al., 2021b). In this paper, we improve the SLP pipeline by focusing on the Text to Gloss (T2G) translation task of step 1.
Modern deep learning is heavily dependent upon data. However, the creation of sign language datasets is both time consuming and costly, restricting their size to orders of magnitude smaller than their spoken language counterparts. State-of-the-art datasets such as RWTH-PHOENIX-Weather-2014T (PHOENIX14T) and the newer MeineDGS (mDGS) contain only 8,257 and 63,912 examples respectively (Koller et al., 2015; Hanke et al., 2020), compared to over 15 million examples for common spoken language datasets (Vrandečić and Krötzsch, 2014). Hence, sign languages can be considered low resource languages.
In this work, we take inspiration from NLP techniques to boost translation performance. We explore how language can be modeled using different tokenizers, more specifically Byte Pair Encoding (BPE), WordPiece, word-level and character-level tokenizers. We show that finding the correct tokenizer for the task helps simplify the translation problem.
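As an illustration of these options, the sketch below trains a BPE and a WordPiece tokenizer on a toy gloss corpus using the HuggingFace tokenizers library; the corpus, vocabulary size and special tokens are placeholder values, not the configuration used in our experiments.

```python
# Sketch: comparing BPE and WordPiece segmentation on a toy gloss corpus.
# Vocabulary size and special tokens are illustrative placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ["REGEN WIND NORD", "SONNE SUED WARM", "REGEN NORD KALT"]  # toy glosses

def train(model, trainer_cls):
    tok = Tokenizer(model)
    tok.pre_tokenizer = Whitespace()
    trainer = trainer_cls(vocab_size=100, special_tokens=["[UNK]", "[PAD]"])
    tok.train_from_iterator(corpus, trainer)
    return tok

bpe = train(BPE(unk_token="[UNK]"), BpeTrainer)
wp = train(WordPiece(unk_token="[UNK]"), WordPieceTrainer)

print(bpe.encode("REGEN NORD").tokens)  # subword segmentation under BPE
print(wp.encode("REGEN NORD").tokens)   # segmentation under WordPiece
```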
Furthermore, to help tackle our low resource language task, we explore using pre-trained language models such as BERT (Devlin et al., 2018) and Word2Vec (Mikolov et al., 2013b) to create improved sentence-level embeddings. We also fuse contextual information from the embedding to increase the amount of information available to the network. We show that using models trained on large corpora of data improves translation performance.
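As a sketch of this idea, the snippet below extracts frozen BERT features for a sentence and fuses them with trainable task embeddings by concatenation and projection; the model name, dimensions and concatenation-based fusion are assumptions for illustration, and our exact setup is given in Section 3.

```python
# Sketch: augmenting learned source embeddings with frozen BERT features.
# Model name and concatenate-then-project fusion are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

name = "bert-base-german-cased"  # German text, matching PHOENIX14T / mDGS
tok = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name)
bert.eval()  # frozen: used only as a feature extractor

inputs = tok("im norden regnet es", return_tensors="pt")
with torch.no_grad():
    ctx = bert(**inputs).last_hidden_state      # (1, T, 768) contextual features

learned = nn.Embedding(tok.vocab_size, 512)     # trainable task embeddings
emb = learned(inputs["input_ids"])              # (1, T, 512)

fuse = nn.Linear(768 + 512, 512)                # project fused features back down
fused = fuse(torch.cat([ctx, emb], dim=-1))     # (1, T, 512) encoder input
```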
Previously, the first step of the SLP pipeline used T2G translation. We explore using a phonetic representation based on the Hamburg Notation System (HamNoSys), which we define as Text to HamNoSys (T2H) translation. HamNoSys encodes signs using a set of symbols and can be viewed as a phonetic representation of sign language (Hanke, 2004). There are three main components when representing a sign in HamNoSys: a) its initial configuration, b) its hand shape and c) its action. An example of HamNoSys can be seen in Fig. 1 along with its gloss and text counterparts.
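Since HamNoSys makes the hand shape explicit, it can be extracted and used as an auxiliary training signal. Below is a minimal sketch of such multi-task supervision in PyTorch; the hand-shape inventory size, loss weight, and the assumption that hand shape is predicted per token from the decoder state are all illustrative, not the exact formulation of our method.

```python
# Sketch: adding an auxiliary hand-shape classification loss to T2H training.
# `decoder_states`, the class count and the weight `lam` are hypothetical.
import torch
import torch.nn as nn

num_handshapes = 12   # assumed size of the HamNoSys hand-shape inventory
d_model = 512
lam = 0.5             # auxiliary loss weight (a hyperparameter)

handshape_head = nn.Linear(d_model, num_handshapes)
ce = nn.CrossEntropyLoss()

def total_loss(logits, targets, decoder_states, handshape_targets):
    # Main sequence-to-sequence cross-entropy over HamNoSys tokens.
    main = ce(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Auxiliary loss: predict each sign's hand shape from the decoder state.
    aux = ce(handshape_head(decoder_states).view(-1, num_handshapes),
             handshape_targets.view(-1))
    return main + lam * aux
```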
We evaluate our SLP models on both the mDGS and PHOENIX14T datasets, showing state-of-the-art performance on the T2G (mDGS & PHOENIX14T) and T2H (mDGS) tasks.
Figure 1: An example for the word "running", Top: Text, Middle: Gloss (RUN), Bottom: the associated HamNoSys sequence. The HamNoSys is split into: a) its initial configuration, b) its hand shape and c) its action.
We achieve a BLEU-4 score of 26.99 on mDGS, a significant increase compared to the state-of-the-art score of 3.17 (Saunders et al., 2022).
The rest of this paper is structured as follows: in Section 2 we review the related work in the field; Section 3 presents our methodology; Section 4 shows quantitative and qualitative results; finally, we draw conclusions in Section 5 and suggest future work.
2. Related Work
Sign Language Recognition & Translation: Computational sign language research has been studied for over 30 years (Tamura and Kawasaki, 1988). Research started with isolated Sign Language Recognition (SLR), where individual signs were classified using CNNs (LeCun et al., 1998). Recently, the field has moved to the more challenging problem of Continuous Sign Language Recognition (CSLR), where a continuous sign language video needs to be segmented and then classified (Koller et al., 2015). Most modern approaches to SLR and CSLR rely on deep learning, but such approaches are data hungry and are therefore limited by the size of publicly available datasets.
The distinction between CSLR and Sign Language Translation (SLT) was stressed by Camgoz et al. (2018). SLT aims to translate a continuous sequence of signs to spoken language sentences (Sign to Text (S2T)) or vice versa (Text to Sign (T2S)), a challenging problem due to the changes in grammar and sequence ordering.
Sign Language Production (SLP): SLP focuses on T2S, the production of a continuous sign language sequence given a spoken language input sentence. Current state-of-the-art approaches to SLP use transformer-based architectures with attention (Stoll et al., 2018; Saunders et al., 2020). In this paper, we tackle the SLP task of neural sign language translation, defined as T2G or T2H translation.
HamNoSys has been used before for statistical SLP, with some success (Kaur and Kumar, 2014; Kaur and Kumar, 2016). However, the produced motion is robotic and not practical for real world applications. Note that these approaches first convert the HamNoSys to SiGML, an XML format of HamNoSys (Kaur and Kumar, 2016).
Neural Machine Translation (NMT): NMT aims to generate a target sequence given a source sequence using neural networks (Bahdanau et al., 2014), and is commonly used for spoken language translation. Initial approaches used recurrence to map a hidden state to an output sequence (Kalchbrenner and Blunsom, 2013), with limited performance. Encoder-decoder structures were later introduced that map an input sequence to an embedding space (Wu et al., 2016). To address the bottleneck problem, attention was introduced to measure the affinity between sections of the input and the embedding space, allowing the model to focus on specific context (Bahdanau et al., 2014). This was improved further with the introduction of the transformer (Vaswani et al., 2017), which used Multi-Headed Attention (MHA) to allow multiple projections of the learned attention. More recently, model sizes have grown, with the introduction of architectures such as GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2018).
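For reference, the scaled dot-product attention at the core of MHA can be written in a few lines. The sketch below follows the standard formulation of Vaswani et al. (2017); it is illustrative rather than code from any model discussed here.

```python
# Sketch: scaled dot-product attention as in Vaswani et al. (2017).
import math
import torch

def attention(q, k, v, mask=None):
    # Affinity between queries and keys, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # attention distribution
    return weights @ v                       # weighted sum of values

# Multi-head attention runs several such projections in parallel, e.g. via
# torch.nn.MultiheadAttention(embed_dim=512, num_heads=8).
```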
Different encoding/decoding schemes have been explored. BPE was first introduced in Sennrich et al. (2015) to create a set of tokens given a set vocabulary size. This is achieved by iteratively merging the most commonly occurring sequential characters. WordPiece, a tokenizer similar to BPE, was first introduced in Schuster and Nakajima (2012) and is commonly used when training language models such as BERT, DistilBERT and Electra. Finally, word- and character-level tokenizers break up a sentence based on white space and unique symbols respectively.
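To make the merge procedure concrete, the following pure-Python sketch performs a few BPE merges on the toy vocabulary from Sennrich et al. (2015); the number of merges is illustrative, and a real tokenizer would continue until a target vocabulary size is reached.

```python
# Sketch: BPE learning -- repeatedly merge the most frequent adjacent
# symbol pair. Toy corpus; real vocabularies use thousands of merges.
from collections import Counter

def get_pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Fuse the chosen pair into a single symbol everywhere it occurs.
    return {w.replace(" ".join(pair), "".join(pair)): f
            for w, f in vocab.items()}

# Words as space-separated characters, with corpus frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(5):  # 5 merges for illustration
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
print(vocab)  # words now segmented into learned subword units
```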
Natural Language Processing: NLP has many applications, for example Text Simplification, Text Classification and Speech Recognition. Recently, deep learning approaches have outperformed older statistical methods (Vaswani et al., 2017). A successful NLP model must understand the structure and context of language, learned via supervised or unsupervised methods. Pre-trained language models such as BERT (Devlin et al., 2018) have been used to boost performance on other NLP tasks, achieving state-of-the-art results (Clinchant et al., 2019; Zhu et al., 2020). Zhu et al. (2020) fused the embeddings of BERT into a traditional transformer architecture using attention, increasing translation performance by approximately 2 BLEU.
Other methods have used Word2Vec to model language, and this has been applied to many NLP tasks (Mikolov et al., 2013b). Word2Vec is designed to give meaning to a numerical representation of words, the central idea being that words with similar meanings should have a small Euclidean distance between their vector representations.
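As a toy illustration (using the gensim library; the corpus and hyperparameters are arbitrary), a Word2Vec model can be trained and queried for neighbouring words as follows:

```python
# Sketch: training a toy Word2Vec model and querying similar words.
# Uses the gensim 4.x API; corpus and hyperparameters are illustrative.
from gensim.models import Word2Vec

sentences = [
    ["heute", "regnet", "es", "im", "norden"],
    ["morgen", "scheint", "die", "sonne", "im", "sueden"],
    ["im", "norden", "bleibt", "es", "nass"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

vec = model.wv["norden"]                # 50-d vector for a word
print(model.wv.most_similar("norden"))  # neighbours by vector similarity
```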
In this paper, we take inspiration from these techniques to boost the performance of the low resource tasks of T2G and T2H sign language production.