
Figure 1: The word "running", which would be 'glossed' as RUN, and the associated sequence of HamNoSys. Top: text; middle: gloss; bottom: HamNoSys. The HamNoSys is split into: a) its initial configuration, b) its hand shape, c) its action.
tasks. We achieve a BLEU-4 score of 26.99 on mDGS, a significant increase compared to the state-of-the-art score of 3.17 (Saunders et al., 2022).
The rest of this paper is structured as follows: Section 2 reviews the related work in the field, Section 3 presents our methodology, and Section 4 shows quantitative and qualitative results. Finally, we draw conclusions in Section 5 and suggest future work.
2. Related Work
Sign Language Recognition & Translation: Computational sign language research has been studied for over 30 years (Tamura and Kawasaki, 1988). Research started with isolated Sign Language Recognition (SLR), where individual signs were classified using CNNs (Lecun et al., 1998). Recently, the field has moved to the more challenging problem of Continuous Sign Language Recognition (CSLR), where a continuous sign language video needs to be segmented and then classified (Koller et al., 2015). Most modern approaches to SLR and CSLR rely on deep learning, but such approaches are data-hungry and therefore limited by the size of publicly available datasets.
The distinction between CSLR and Sign Language Translation (SLT) was stressed by Camgoz et al. (2018). SLT aims to translate a continuous sequence of signs to spoken language sentences (Sign to Text (S2T)) or vice versa (Text to Sign (T2S)), a challenging problem due to the changes in grammar and sequence ordering.
Sign Language Production (SLP): SLP focuses on T2S, the production of a continuous sign language sequence given a spoken language input sentence. Current state-of-the-art approaches to SLP use transformer-based architectures with attention (Stoll et al., 2018; Saunders et al., 2020). In this paper, we tackle the SLP task of neural sign language translation, defined as Text to Gloss (T2G) or Text to HamNoSys (T2H) translation.
HamNoSys has been used before for statistical SLP, with some success (Kaur and Kumar, 2014; Kaur and Kumar, 2016). However, the produced motion is robotic and not practical for real-world applications. Note that these approaches first convert the HamNoSys to SiGML, an XML representation of HamNoSys (Kaur and Kumar, 2016).
Neural Machine Translation (NMT): NMT aims to generate a target sequence given a source sequence using neural networks (Bahdanau et al., 2014) and is commonly used for spoken language translation. Initial approaches used recurrence to map a hidden state to an output sequence (Kalchbrenner and Blunsom, 2013), with limited performance. Encoder-decoder structures were later introduced, which map an input sequence to an embedding space (Wu et al., 2016). To address the resulting bottleneck problem, attention was introduced to measure the affinity between sections of the input and the embedding space, allowing the model to focus on specific context (Bahdanau et al., 2014). This was improved further with the introduction of the transformer (Vaswani et al., 2017), which uses Multi-Headed Attention (MHA) to allow multiple projections of the learned attention. More recently, model sizes have grown, with architectures such as GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2018) being introduced.
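For reference, attention computes softmax(QK^T / sqrt(d_k))V, weighting the values V by the affinity between queries Q and keys K. A minimal numpy sketch (the toy shapes and variable names are our own, for illustration only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight the values V by the affinity between queries Q and keys K."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over source positions
    return weights @ V

# toy example: 3 target positions attending over 4 source positions, dimension 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)           # shape (3, 8)
```

MHA runs several such attention operations in parallel over different learned linear projections of Q, K and V, and concatenates the results.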
Different encoding/decoding schemes have been explored. Byte Pair Encoding (BPE) was first introduced for NMT by Sennrich et al. (2015) to create a set of tokens given a fixed vocabulary size. This is achieved by iteratively merging the most frequently occurring pairs of adjacent symbols. WordPiece, a tokenizer similar to BPE, was first introduced in Schuster and Nakajima (2012) and is commonly used when training language models such as BERT, DistilBERT and Electra. Finally, word- and character-level tokenizers break up a sentence based on white space and individual symbols respectively.
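To make the merge procedure concrete, the following sketch of BPE vocabulary learning closely follows the reference implementation published in Sennrich et al. (2015); the toy corpus and number of merges are illustrative only:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# words represented as space-separated characters with an end-of-word marker
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10  # in practice chosen so the final vocabulary reaches a target size
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)
```

Each learned merge becomes a token; frequent subwords such as 'est</w>' emerge after a few iterations, which is why BPE copes well with rare and out-of-vocabulary words.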
Natural Language Processing (NLP): NLP has many applications, for example Text Simplification, Text Classification, and Speech Recognition. Recently, deep learning approaches have outperformed older statistical methods (Vaswani et al., 2017). A successful NLP model must understand the structure and context of language, learned via supervised or unsupervised methods. Pre-trained language models such as BERT (Devlin et al., 2018) have been used to boost performance in other NLP tasks, achieving state-of-the-art results (Clinchant et al., 2019; Zhu et al., 2020). Zhu et al. (2020) fused the embeddings of BERT into a traditional transformer architecture using attention, increasing translation performance by approximately 2 BLEU.
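As an illustration of this kind of fusion, the sketch below adds an extra attention sub-layer over frozen BERT features to a standard transformer encoder layer. This is our own simplified reading of the general idea, not the exact architecture of Zhu et al. (2020); the random tensor stands in for real BERT output:

```python
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    """Encoder layer with an extra attention over BERT features (simplified sketch)."""
    def __init__(self, d_model=512, d_bert=768, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # queries come from the encoder stream, keys/values from BERT features
        self.bert_attn = nn.MultiheadAttention(d_model, nhead, kdim=d_bert,
                                               vdim=d_bert, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(),
                                nn.Linear(2048, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, bert_mem):
        x = self.norms[0](x + self.self_attn(x, x, x)[0])
        x = self.norms[1](x + self.bert_attn(x, bert_mem, bert_mem)[0])  # fuse BERT context
        return self.norms[2](x + self.ff(x))

layer = BertFusedEncoderLayer()
x = torch.randn(2, 10, 512)         # encoder hidden states (batch, src_len, d_model)
bert_mem = torch.randn(2, 12, 768)  # stand-in for frozen BERT output features
out = layer(x, bert_mem)            # shape (2, 10, 512)
```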
Other methods have used Word2Vec to model language, which has been applied to many NLP tasks (Mikolov et al., 2013b). Word2Vec is designed to give meaning to a numerical representation of words, the central idea being that words with similar meanings should have a small Euclidean distance between their vector representations.
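For illustration, a minimal sketch of training Word2Vec embeddings and measuring the distance between two word vectors, using the gensim library (the toy corpus is ours; real models are trained on large corpora or loaded pre-trained):

```python
import numpy as np
from gensim.models import Word2Vec  # gensim >= 4.0 API (vector_size rather than size)

# toy corpus; real models are trained on billions of tokens
sentences = [["sign", "language", "production"],
             ["sign", "language", "translation"],
             ["neural", "machine", "translation"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, seed=0)

v1, v2 = model.wv["production"], model.wv["translation"]
print(np.linalg.norm(v1 - v2))  # Euclidean distance between the two word vectors
```

Words that appear in similar contexts receive nearby vectors, so distances like the one above can act as a proxy for semantic similarity.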
In this paper, we take inspiration from these techniques to boost the performance of the low-resource tasks of T2G and T2H sign language production.