Machine Translation between Spoken Languages and Signed Languages
Represented in SignWriting
Zifan Jiang
University of Zurich
jiang@cl.uzh.ch
Amit Moryossef
Bar-Ilan University
University of Zurich
amitmoryossef@gmail.com
Mathias Müller
University of Zurich
mmueller@cl.uzh.ch
Sarah Ebling
University of Zurich
ebling@cl.uzh.ch
Abstract
This paper presents work on novel machine translation (MT) systems between spoken and signed languages, where signed languages are represented in SignWriting, a sign language writing system. Our work¹ seeks to address the lack of out-of-the-box support for signed languages in current MT systems and is based on the SignBank dataset, which contains pairs of spoken language text and SignWriting content. We introduce novel methods to parse, factorize, decode, and evaluate SignWriting, leveraging ideas from neural factored MT. In a bilingual setup—translating from American Sign Language to (American) English—our method achieves over 30 BLEU, while in two multilingual setups—translating in both directions between spoken languages and signed languages—we achieve over 20 BLEU. We find that common MT techniques used to improve spoken language translation similarly affect the performance of sign language translation. These findings validate our use of an intermediate text representation for signed languages to include them in NLP research.
1 Introduction
Most current machine translation (MT) systems only support spoken language input and output (text or speech), which excludes around 200 different signed languages used by up to 70 million deaf people² worldwide from modern language technology. Since signed languages are also natural languages, Yin et al. (2021) call for including sign language processing (SLP) in natural language processing (NLP) research.
¹ Code and documentation available at https://github.com/J22Melody/signwriting-translation
² According to the World Federation of the Deaf: https://wfdeaf.org/our-work/

[Figure 1 screenshot: the Sign Translate demo application at https://sign.mt, translating an English Bible verse into SignWriting]
Figure 1: Demo application based on our models, translating from spoken languages to signed languages represented in SignWriting, then to human poses.

From a technical point of view, SLP brings novel challenges to NLP due to the visual-gestural modality of sign language and special linguistic features (e.g., the use of space, simultaneity, referencing), which requires both computer vision (CV) and NLP technologies. Crucially, the lack of a standardized or widely used written form for signed languages has hindered their inclusion in NLP research.

However, sign language writing systems do exist and are sporadically used (e.g., SignWriting (Sutton, 1990) and HamNoSys (Prillwitz and Zienert, 1990)). Therefore, we adopt the proposal of Yin et al. (2021) to formulate the sign language translation (SLT) task using a sign language writing system as an intermediate step (illustrated by Figure 1): given spoken language text, we propose to translate to sign language in a written form, then transform this intermediate result into a final video or pose output³—and vice versa. According to this multi-step view of SLT, in this work we study translation between signed languages in written form and spoken languages. We use SignWriting as the intermediate writing system.
SignWriting has many advantages, such as being universal (multilingual), comparatively easy to understand, extensively documented, and computer-supported. In addition, despite looking pictographic, it is a well-defined writing system. Every sign can be written as a sequence of symbols (box markers, graphemes, and punctuation marks) and their location on a 2-dimensional plane.

³ Note that the second step, animation of SignWriting into human poses or video, is not included in this research. In the demo application, spoken language text is translated directly into sign language poses, resulting in low-quality output.
To our knowledge, this work is the first to create automatic SLT systems that use SignWriting. Our main contributions are as follows: (a) we propose methods to parse (§3.3), factorize (§3.4), decode (§4.3), and evaluate (§4.3) SignWriting sequences; (b) we report experiments on multilingual machine translation systems between SignWriting and spoken language text (§4); (c) we demonstrate that common techniques for low-resource MT are beneficial for SignWriting translation systems (§5).
2 Background
2.1 Sign language processing (SLP)
SLP (Bragg et al., 2019; Yin et al., 2021; Moryossef and Goldberg, 2021) is an emerging subfield of both NLP and CV, which focuses on the automatic processing and analysis of sign language content. Prominent tasks include pose estimation from sign language videos (Cao et al., 2017, 2021; Güler et al., 2018), gloss transcription (Mesch and Wallin, 2012; Johnston and Beuzeville, 2016; Konrad et al., 2018), sign language detection (Borg and Camilleri, 2019; Moryossef et al., 2020), sign language identification (Gebre et al., 2013; Monteiro et al., 2016), and sign language segmentation (Bull et al., 2020; Farag and Brock, 2019; Santemiz et al., 2009).

In addition, tasks such as sign language recognition (Adaloglou et al., 2021), translation, and production involve transforming one sign language representation into another or from/to spoken language text, as shown in Figure 2⁴. We find that existing works cover gloss-to-text (Camgöz et al., 2018; Yin and Read, 2020) (where “text” denotes spoken language text), text-to-gloss (Zhao et al., 2000; Othman and Jemni, 2012), video-to-text (Camgöz et al., 2020b,a), pose-to-text (Ko et al., 2019), and text-to-pose (Saunders et al., 2020a,b,c; Zelinka and Kanis, 2020; Xiao et al., 2020).
2.2 Motivation
[Figure 2 diagram: graph with nodes Video and Pose (CV side) and Glosses, Writing System, and Text (NLP side)]
Figure 2: SLP tasks. Every edge on the left side represents a task in CV (language-agnostic). Every edge on the right side represents a task in NLP (language-specific). Every edge crossing both sides represents a task requiring a combination of CV and NLP. Figure taken from Moryossef and Goldberg (2021).

⁴ In this paper, we distinguish between a phonetic “writing system” (e.g., SignWriting) and “glosses” (lexical notation, marking the semantics of each sign with a distinct category).
⁵ Related work based on HamNoSys: Morrissey (2011); Sanaullah et al. (2021); Walsh et al. (2022).

Our work is the first to explore translation between spoken language text and sign language content represented in SignWriting⁵. We focus on a sign language writing system for the following reasons:
(a) currently an end-to-end (video-to-text/text-to-video) approach is not feasible: state-of-the-art systems either have a BLEU score lower than 1 (Müller et al., 2022a) or work only on a very narrow linguistic domain, e.g., Camgöz et al. (2020b,a) work on the RWTH-PHOENIX-Weather 2014T dataset, which covers only 1,231 unique signs from weather reports (fewer than we use, see Table 2); (b) a writing system is lower-dimensional than video (not all parts of a video are relevant in a linguistic sense), while still adequate to encode the information of signs; (c) written sign language is a closer fit to current MT pipelines than videos or poses; (d) a phonetic writing system is a more universal solution than glossing, since glosses are semantic and therefore language-specific, and are an inadequate representation of meaning (Müller et al., 2022b).
2.3 SignWriting, FSW, and SWU
SignWriting (Sutton, 1990) is a featural and visually iconic sign language writing system (introduced extensively in Appendix A). Previous work explored recognition (Stiehl et al., 2015) and animation (Bouzid and Jemni, 2013) of SignWriting.

SignWriting has two computerized specifications, Formal SignWriting in ASCII (FSW) and SignWriting in Unicode (SWU). SignWriting is two-dimensional, but FSW and SWU are written linearly, similar to spoken languages. Figure 3 gives an example of the relationship between SignWriting, FSW, and SWU⁶. We use FSW in our research instead of SWU to explore the potential of factorizing SignWriting symbols and utilizing the numerical values of their positions (§3.3, §3.4).

⁶ Online demonstration: https://slevinski.github.io/SuttonSignWriting/characters/index.html

Figure 3: “Hello world.” in FSW, SWU, and SignWriting graphics. In FSW/SWU, A/SWA and M/SWM are the box markers (acting as sign boundaries); S14c20 and S27106 (graphemes in SWU) are the symbols; 518 and 529 are the x, y positional numbers on a 2-dimensional plane that denote a symbol’s position within a sign box; S38800 (horizontal bold line in SWU) is the punctuation full stop symbol.
3 Data and method
The data source we use for this research is SignBank, the largest repository of SignPuddles⁷. A SignPuddle is a community-driven dictionary where users add parallel examples of SignWriting and spoken language text (not necessarily with corresponding videos and glosses). The puddles contain material from various signed languages and linguistic domains (e.g., general literature or the Bible) without a strict writing standard. We use the Sign Language Datasets (Moryossef and Müller, 2021) library to load SignBank as a TensorFlow dataset, as sketched below.
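As a brief illustration, here is a minimal sketch of loading SignBank through TFDS; the dataset name "sign_bank" is our assumption about how the library registers it, so consult the library's documentation:

```python
# Sketch: loading SignBank via the Sign Language Datasets library
# (Moryossef and Müller, 2021). The TFDS dataset name "sign_bank" is an
# assumption; the registered name may differ.
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the datasets with TFDS

signbank = tfds.load("sign_bank")
for example in signbank["train"].take(1):
    # Each example pairs SignWriting content with spoken language text.
    print(example)
```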
3.1 Data statistics
In SignBank, there are roughly 220k parallel samples from 141 puddles covering 76 language pairs, yet the distribution is unbalanced (full details in Appendix C). Relatively high-resource language pairs (over 10k samples) are listed in Table 1.

Notably, most of the puddles are dictionaries, which we consider less valuable than sentence pairs (instances of continuous signing) for a general MT system. If dictionaries are used as training data, we expect models to memorize word mappings rather than learn to generate sentences.

Therefore, we treat the four sentence-pair puddles (Table 2) of the relatively high-resource language pairs as primary data and the other dictionary puddles as auxiliary data. Note that even the high-resource language pairs of SignBank are low-resource compared to the datasets used in mature MT systems for spoken languages, where millions of parallel sentences are commonplace (Akhbardeh et al., 2021).
⁷ https://www.signbank.org/signpuddle/
3.2 Data preprocessing
We first perform general data cleaning to extract the main body of spoken language text and remove irrelevant parts such as HTML tags, as well as samples that are empty or too long (more than 100 words for a dictionary entry). We then learn a byte pair encoding (BPE) segmentation (Sennrich et al., 2016) on the cleaned spoken language text, using a vocabulary size of 2,000.
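The paper does not name its BPE implementation; the following is a minimal sketch using the subword-nmt toolkit that accompanies Sennrich et al. (2016), with illustrative file names:

```python
# Sketch: learn a BPE model with vocabulary size 2,000 on the cleaned text
# and apply it to a sentence. File names are illustrative, not the paper's.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

with open("spoken_train.txt") as infile, open("bpe.codes", "w") as outfile:
    learn_bpe(infile, outfile, num_symbols=2000)

with open("bpe.codes") as codes:
    bpe = BPE(codes)
print(bpe.process_line("Hello world."))  # e.g. "H@@ ello wor@@ ld ."
```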
Multilingual models
In our multilingual experiments (§4.2, §4.3), we learn a shared BPE model across all spoken languages.

Following Johnson et al. (2017), we add special tags at the beginning of source sequences to indicate the desired target language and the nature of the training data (sentence pair or dictionary). Three types of tags are designed to encode all necessary information: (a) spoken language code; (b) country code⁸; (c) dictionary vs. sentence pair. For example, an English sentence to be translated into American Sign Language is represented as follows:

<2en> <4us> <sent> Hello world.

⁸ The spoken language code plus the country code specifies a one-to-one mapping to a related signed language in our data.
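A minimal sketch of this tagging scheme follows; the helper name and the <dict> tag for dictionary entries are our assumptions (the paper only shows <sent>):

```python
# Sketch: prepend Johnson et al. (2017)-style tags to a source sequence.
# "tag_source" and "<dict>" are illustrative assumptions.
def tag_source(text: str, lang: str, country: str, is_dictionary: bool = False) -> str:
    nature = "<dict>" if is_dictionary else "<sent>"
    return f"<2{lang}> <4{country}> {nature} {text}"

# Reproduces the example above:
assert tag_source("Hello world.", "en", "us") == "<2en> <4us> <sent> Hello world."
```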
Data split
We shuffle the data and split it into
95%, 3%, and 2% for training, validation, and test
sets, respectively.
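For concreteness, a sketch of such a shuffled 95%/3%/2% split; the seed and helper are ours, and the released repository may implement this differently:

```python
# Sketch: shuffle and split samples into train/validation/test (95/3/2).
import random

def split_data(samples, seed=42):
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_dev = int(0.95 * n), int(0.03 * n)
    train = samples[:n_train]
    dev = samples[n_train:n_train + n_dev]
    test = samples[n_train + n_dev:]
    return train, dev, test
```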
language pair                                            #samples   #puddles
en-us (American English & American Sign Language)          43,698          7
pt-br (Brazilian Portuguese & Brazilian Sign Language)     42,454          3
de-de (Standard German & German Sign Language)             24,704          3
fr-ca (Canadian French & Quebec Sign Language)             11,189          3

Table 1: Relatively high-resource language pair statistics.

puddle name                          language pair   #samples   #signs   mean sequence len
Literature US                        en-us                700    9,922                  24
ASL Bible Books NLT                  en-us             11,667   51,485                  24
ASL Bible Books Shores Deaf Church   en-us              4,321   44,612                  31
Literatura Brasil                    pt-br              1,884   19,221                  13

Table 2: Primary sentence-pair puddles. Mean sequence length is measured as the mean number of words in the spoken language sentences.

3.3 FSW parsing

On the sign language side, an appropriate segmentation and tokenization strategy is needed for the FSW data. We parse an original FSW sequence (e.g., Figure 3) into several pieces (see the parsing sketch after this list):

- box markers: A, M, L, R, B;
- symbols: S1f010, S18720, etc.;
- positional numbers x and y: 515, 483, etc.;
- punctuation marks (special symbols without box markers): S38800, etc.
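A minimal regex-based sketch of this parsing step, reconstructed from the FSW notation shown in Figure 3 and the public SuttonSignWriting documentation; the paper's own tokenization is given in Listing 1 (Appendix C) and may differ in detail:

```python
# Sketch: split an FSW string into box markers, symbols, and x/y positions.
import re

SYMBOL = r"S[123][0-9a-f]{2}[0-5][0-9a-f]"   # e.g. S14c20
COORD = r"(\d{3})x(\d{3})"                   # e.g. 518x529

def parse_fsw(fsw: str):
    """Parse one FSW sequence into (kind, token, x, y) tuples."""
    # Drop the optional "A..." sorting prefix (symbols without coordinates).
    fsw = re.sub(rf"A(?:{SYMBOL})+", "", fsw)
    tokens = []
    for match in re.finditer(rf"([BLMR]){COORD}|({SYMBOL}){COORD}", fsw):
        box, bx, by, sym, x, y = match.groups()
        if box:  # box marker with the sign's placement coordinate
            tokens.append(("box", box, int(bx), int(by)))
        else:    # a symbol placed at (x, y) within the current sign box
            tokens.append(("symbol", sym, int(x), int(y)))
    return tokens

# The "hello" sign from Figure 3 (string as in the SuttonSignWriting docs):
print(parse_fsw("AS14c20S27106M518x529S14c20481x471S27106503x489"))
# [('box', 'M', 518, 529), ('symbol', 'S14c20', 481, 471), ('symbol', 'S27106', 503, 489)]
```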
We further factorize each symbol into several parts regarding its orientation (see Figure 7 in Appendix A for an explicit motivation of this step). For example, the symbol S1f010 is split into:

- symbol core: S1f0;
- column number (from 0 to 5): 1;
- row number (from 0 to hex F): 0.

For positional numbers, which have a large range (from 250 to 750) and are encoded discretely, we hypothesise that models might have difficulty understanding their relative order. Therefore, we further calculate two additional factors that denote a symbol’s relative position (based on the absolute numbers) within a sign: relative x and relative y, both ranging from 0 to #symbols - 1.
We provide a full example of the result of FSW parsing in Listing 1 in Appendix C.
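A sketch of the symbol factorization and relative-position computation just described; function names are ours, and the exact factor layout is the one shown in Listing 1:

```python
# Sketch: factorize a symbol and rank absolute coordinates within a sign.
def factorize_symbol(symbol: str):
    """Split e.g. 'S1f010' into core 'S1f0', column 1, row 0 (hex digit)."""
    core, column, row = symbol[:4], int(symbol[4]), int(symbol[5], 16)
    return core, column, row

def relative_positions(values):
    """Rank absolute x (or y) values within one sign: 0 .. #symbols - 1."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks

print(factorize_symbol("S1f010"))      # ('S1f0', 1, 0)
print(relative_positions([481, 503]))  # [0, 1]: relative x of the two
                                       # "hello" symbols from Figure 3
```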
3.4 Factored machine translation
We use a factored machine translation system (Koehn and Hoang, 2007; Garcia-Martinez et al., 2016) to encode or decode parsed FSW sequences. We argue that this architecture is suitable because concatenating all parsed FSW tokens results in sequences much longer than the maximum length of many Transformer models (e.g., 512).

From another perspective, the essential information units are the symbols. Nevertheless, the positional numbers are necessary to determine how symbols are assembled: the same symbols can be arranged differently in space to convey different meanings.

In our setup, we treat the symbols (including punctuation marks and box markers) as the primary source/target tokens and the rest as source/target factors that are strictly aligned with each source/target token (illustrated by Figure 4).

[Figure 4 content: the source token S1f010 carries the factors X = 515, Y = 483, relative X = 0, relative Y = 1, symbol core = S1f0, column = 1, row = 0; the target side is the English word “Hi”.]
Figure 4: Representation of translating an FSW symbol together with its factors to English.
Depending on the translation direction, factored FSW representations need to be encoded or decoded. For encoding (when FSW is the source), we embed each factor separately and then concatenate the factor embeddings.
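To make the encoding concrete, here is a minimal PyTorch sketch of per-factor embedding followed by concatenation; the class, dimensions, and vocabulary sizes are illustrative assumptions, not the paper's actual implementation:

```python
# Sketch: one embedding table per factor stream; per-token embeddings are
# concatenated, as described above. All sizes are illustrative.
import torch
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    def __init__(self, vocab_sizes, dims):
        super().__init__()
        # One embedding table per factor (symbol, x, y, relative x/y, ...).
        self.tables = nn.ModuleList(
            nn.Embedding(v, d) for v, d in zip(vocab_sizes, dims)
        )

    def forward(self, factors):
        # factors: list of (batch, seq_len) index tensors, one per factor,
        # strictly aligned with the primary token sequence.
        return torch.cat([emb(f) for emb, f in zip(self.tables, factors)], dim=-1)

# Example: primary symbol stream plus 4 positional factor streams.
embed = FactoredEmbedding(vocab_sizes=[700, 500, 500, 30, 30],
                          dims=[256, 32, 32, 16, 16])
factors = [torch.zeros(2, 10, dtype=torch.long) for _ in range(5)]
print(embed(factors).shape)  # torch.Size([2, 10, 352])
```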