Clean Text and Full-Body Transformer:
Microsoft’s Submission to the WMT22 Shared Task on
Sign Language Translation
*Subhadeep Dey, Abhilash Pal, Cyrine Chaabani, *Oscar Koller (*Equal Contribution)
Microsoft - Munich, Germany
{subde,t-apal,t-cchaabani,oskoller}@microsoft.com
Abstract
This paper describes Microsoft’s submission to the first shared task on sign language translation at WMT 2022, a public competition tackling sign language to spoken language translation for Swiss German Sign Language. The task is very challenging due to data scarcity and an unprecedented vocabulary size of more than 20k words on the target side. Moreover, the data is taken from real broadcast news, includes native signing and covers scenarios with long videos. Motivated by recent advances in action recognition, we incorporate full-body information by extracting features from a pre-trained I3D model and applying a standard transformer network. The accuracy of the system is further improved by careful data cleaning of the target text. We obtain BLEU scores of 0.6 and 0.78 on the test and dev sets respectively, the best scores among the participants of the shared task. The submission also ranks first in the human evaluation. The BLEU score is further improved to 1.08 on the dev set by applying features extracted from a lip-reading model.
1 Introduction
Sign languages are natural visual languages that are used by deaf and hard of hearing individuals to communicate in everyday life. Sign languages are actively being researched. However, there is a huge imbalance in the field of natural language and speech processing between oral and signed languages. Recently, a transition has begun that shifts sign language processing into the NLP mainstream (Yin et al., 2021). We embrace this development, which manifests itself (among others) in the creation of the first shared task on sign language translation as part of WMT 2022 (Müller et al., 2022). It is great to have real-world sign language data (Bragg et al., 2019; Yin et al., 2021) as the basis of this shared task, manifested in native signers’ content and an unprecedentedly large vocabulary. Nevertheless, this leads to a very challenging task with low performance numbers. When participating in the advances of sign language technologies, it is worth recapping that deaf people have much at stake, both to gain and to lose, from the applications that will be enabled here (Bragg et al., 2021). We aim to advance the field and its use cases in a positive way and present our findings in this system paper.

In the remainder of this work we first give a brief view of the relevant literature in Section 2, then present the employed data in Section 3. Subsequently, we describe our submission in Section 4 and additional experiments in Section 5, and we end with a summary in Section 6.
2 Related Work
In this section, we present a limited overview of related work in sign language translation. We focus this review on the translation direction from sign language to spoken language and dismiss approaches that target the opposite direction, i.e. sign language production.

Sign language translation research initially targeted translating written sign language glosses to spoken language text, hence no videos were involved. Related works were mainly based on phrase-based systems employing different sets of features (Stein et al., 2007, 2010; Schmidt et al., 2013). Then, neural machine translation revolutionized the field. The first research publications on neural sign language translation were based on LSTMs, either with full image input (Camgoz et al., 2018) or utilizing human keypoint estimation (Ko et al., 2019). Transformer models then replaced the recurrent architectures (Camgoz et al., 2020; Yin and Read, 2020; Yin, 2020). These models perform considerably better, but suffer from the basic drawback that the input sequences must be limited to a maximum length.
Previous work (Camgoz et al., 2018; Orbay and Akarun, 2020) has identified the need for strong tokenizers to produce compact representations of the incoming sign language video footage. Hence, a considerable body of publications targets creating tokenizer models, which are often trained on sign language recognition data sets (Koller et al., 2020, 2016; Zhou et al., 2022) or sign spotting data sets (Albanie et al., 2020; Varol et al., 2021; Belissen et al., 2019; Pfister et al., 2013).

There are several data sets relevant for sign language translation. Some of the most frequently encountered are RWTH-PHOENIX-Weather 2014T (Koller et al., 2015a; Camgoz et al., 2018) and CSL (Huang et al., 2018), which could also be considered a recognition data set. However, promising new data sets are appearing: OpenASL (Shi et al., 2022a), the SP-10 dataset (Yin et al., 2022), which mainly covers isolated translations, and How2Sign (Duarte et al., 2021).
3 Data
To train our system, we used the training data provided by the shared task organizers. The data can be considered real-life-authentic as it stems from broadcast news and comes from two different sources: FocusNews and SRF. FocusNews, henceforth FN, is an online TV channel covering deaf signers, with videos of about 5 minutes and variable frame rates of either 25, 30 or 50 fps. SRF represents public Swiss TV with content from daily news and weather forecasts, which is interpreted by hearing interpreters. These videos are recorded at 25 fps. All data therefore covers Swiss German Sign Language (DSGS). Our feature extractors are pretrained on BSL-1k (Albanie et al., 2020) and AV-HuBERT (Shi et al., 2022b). Additionally, we evaluate the effect of introducing a public sign language lexicon covering isolated signs (https://signsuisse.sgb-fss.ch/), which we refer to as Lex. It provides main hand shape annotations, one or multiple (mostly one) examples of the sign and an example of how the sign is used in a continuous sentence. We choose a subset that overlaps in vocabulary with either FocusNews or SRF. As part of the competition, independent dev and test sets are provided, which consist of 420 and 488 utterances respectively.
              SRF     FN    Lex   Total
Videos         29    197   1201    1427
Hours        15.6   19.1    0.9    35.6
Raw: no preprocessing
Vocabulary  18942  21490      -   34783
Singletons  12433  13624      -   22083
Clean: careful preprocessing
Vocabulary  13029  14555    821   22840
Singletons   7483   7923    591   12290

Table 1: Data statistics on data used for training. SRF and FN refer to SRF broadcast and FocusNews data, while Lex stands for a public sign language lexicon. Singletons are words that only occur a single time during training.

Table 1 shows the statistics of the training data. We see that there are about 35 hours of training data in total. In raw form, without any preprocessing, the data is case sensitive and contains punctuation and digits. In this raw form the vocabulary amounts to close to 35k different words on the target side (which is written German); 22k of these words occur only a single time in the training data (singletons). Through careful preprocessing, as described in Section 4.3, we can shrink the vocabulary to about 22k words and the singletons to about 12k.
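To make the effect of this cleaning concrete, the following Python sketch counts vocabulary size and singletons for a list of target sentences before and after a simple normalization. The normalize function below (lowercasing, stripping digits and punctuation) is only an illustrative assumption and does not reproduce the exact rules described in Section 4.3.

    import re
    from collections import Counter

    def normalize(sentence: str) -> str:
        """Hypothetical cleaning step: lowercase, drop digits and punctuation.
        The actual rules used in Section 4.3 may differ."""
        sentence = sentence.lower()
        sentence = re.sub(r"\d+", " ", sentence)      # remove digits
        sentence = re.sub(r"[^\w\s]", " ", sentence)  # remove punctuation (\w is Unicode-aware, so umlauts stay)
        return re.sub(r"\s+", " ", sentence).strip()

    def vocab_stats(sentences):
        """Return (vocabulary size, number of singleton words)."""
        counts = Counter(word for s in sentences for word in s.split())
        singletons = sum(1 for c in counts.values() if c == 1)
        return len(counts), singletons

    # Toy example with two written-German target sentences.
    raw = ["Heute gibt es 20 Grad, morgen Regen.", "Heute scheint die Sonne!"]
    print(vocab_stats(raw))                          # raw: case sensitive, punctuation attached to words
    print(vocab_stats([normalize(s) for s in raw]))  # cleaned: smaller vocabulary, fewer singletons

Applied to the full training targets with the actual cleaning rules, this kind of counting produces the vocabulary and singleton rows of Table 1.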
4 Submitted System
Sign languages convey information through the use of manual parameters (hand shape, orientation, location and movement) and non-manual parameters (lips, eyes, head, upper body). To capture most of the information from the signs, we opt for an RGB-based approach, neglecting the skeleton features provided by the shared task organizers. For the submitted system we rely on a pre-trained tokenizer for feature extraction and train a sequence-to-sequence model to produce sequences of whole words (no byte pair encoding). We further pre-process the sentences (the ground truths of the videos) to clean them. This step is crucial to push the model to focus more on the semantics of the data. Finally, in order to adhere to the expected output format for the submission, we convert the text back to display format using Microsoft’s speech service. This applies inverse text normalization, capitalization and punctuation to the output text to make it more readable. The details of the various components of the system are described in the next subsections.
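As a rough illustration of this setup, the PyTorch sketch below wires pre-extracted video features into a standard encoder-decoder transformer that predicts whole-word target indices. The feature dimensionality, model sizes, vocabulary size and learned positional embeddings are illustrative assumptions rather than our exact configuration, and feature extraction itself (Section 4.1) is assumed to have happened offline.

    import torch
    import torch.nn as nn

    class SignTranslationSketch(nn.Module):
        """Minimal sketch: a standard transformer over pre-extracted video features.

        Assumptions (not our exact setup): 1024-dim features, a whole-word target
        vocabulary of size vocab_size (no byte pair encoding), learned positions.
        """

        def __init__(self, feat_dim=1024, d_model=512, vocab_size=22840,
                     nhead=8, num_layers=3, max_len=1024):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, d_model)       # project video features to model dim
            self.tgt_embed = nn.Embedding(vocab_size, d_model)  # one embedding per whole word
            self.pos_embed = nn.Embedding(max_len, d_model)     # learned absolute positions
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=nhead,
                num_encoder_layers=num_layers, num_decoder_layers=num_layers,
                batch_first=True)
            self.out = nn.Linear(d_model, vocab_size)

        def _add_pos(self, x):
            positions = torch.arange(x.size(1), device=x.device)
            return x + self.pos_embed(positions)

        def forward(self, feats, tgt_tokens):
            # feats: (batch, frames, feat_dim) pre-extracted features
            # tgt_tokens: (batch, words) right-shifted target word indices
            src = self._add_pos(self.feat_proj(feats))
            tgt = self._add_pos(self.tgt_embed(tgt_tokens))
            # causal mask: each target position only attends to previous words
            n = tgt_tokens.size(1)
            causal = torch.triu(torch.full((n, n), float("-inf"), device=feats.device), diagonal=1)
            hidden = self.transformer(src, tgt, tgt_mask=causal)
            return self.out(hidden)  # (batch, words, vocab_size) logits

    # Toy usage: 2 clips, 64 feature frames each, 10 target words per sentence.
    model = SignTranslationSketch()
    feats = torch.randn(2, 64, 1024)
    tgt = torch.randint(0, 22840, (2, 10))
    print(model(feats, tgt).shape)  # torch.Size([2, 10, 22840])

During training such a model would be optimized with a cross-entropy loss over the whole-word logits; at inference, decoding proceeds autoregressively and the resulting normalized text is then converted back to display format as described above.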
4.1 Features
We use a pre-trained I3D (Carreira and Zisserman,
2017) model, based on inflated inceptions with 3D