
length. Previous work (Camgoz et al., 2018; Orbay and Akarun, 2020) has identified the need for strong tokenizers to produce compact representations of the incoming sign language video footage. Hence, a considerable body of publications targets creating tokenizer models that are often trained on sign language recognition data sets (Koller et al., 2020, 2016; Zhou et al., 2022) or sign spotting data sets (Albanie et al., 2020; Varol et al., 2021; Belissen et al., 2019; Pfister et al., 2013).
There are several data sets relevant to sign language translation. Some of the most frequently encountered are RWTH-PHOENIX-Weather 2014T (Koller et al., 2015a; Camgoz et al., 2018) and CSL (Huang et al., 2018) (which could also be considered a recognition data set). However, promising new data sets are appearing: OpenASL (Shi et al., 2022a), the SP-10 data set (Yin et al., 2022), which mainly covers isolated translations, and How2Sign (Duarte et al., 2021).
3 Data
To train our system, we used the training data provided by the shared task organizers. The data can be considered real-life-authentic, as it stems from broadcast news using two different sources: FocusNews and SRF. FocusNews, henceforth FN, is an online TV channel covering deaf signers, with videos of 5 minutes and variable sampling rates of either 25, 30 or 50 fps. SRF represents public Swiss TV, with content from daily news and weather forecasts that is interpreted by hearing interpreters. These videos are recorded at a sampling rate of 25 fps. All data therefore covers Swiss German Sign Language (DSGS). Our feature extractors are pretrained on BSL-1k (Albanie et al., 2020) and AV-HuBERT (Shi et al., 2022b).
Additionally, we evaluate the effect of introducing a public sign language lexicon covering isolated signs¹, which we refer to as Lex. It provides main hand shape annotations, one or more (mostly one) examples of the sign, and an example of how the sign is used in a continuous sentence. We choose a subset that overlaps in vocabulary with either FocusNews or SRF. As part of the competition, independent dev and test sets are provided, consisting of 420 and 488 utterances, respectively.
Table 1 shows the statistics of the training data. We see that there are about 35 hours of training data in total.

¹ https://signsuisse.sgb-fss.ch/
                        SRF      FN     Lex   Total
Videos                   29     197    1201    1427
Hours                  15.6    19.1     0.9    35.6

Raw: no preprocessing
Vocabulary            18942   21490       –   34783
Singletons            12433   13624       –   22083

Clean: careful preprocessing
Vocabulary            13029   14555     821   22840
Singletons             7483    7923     591   12290

Table 1: Data statistics on the data used for training. SRF and FN refer to SRF broadcast and FocusNews data, while Lex stands for a public sign language lexicon. Singletons are words that occur only a single time during training.
In its raw form, without any preprocessing, the data is case sensitive and contains punctuation and digits. In this raw form the vocabulary amounts to close to 35k different words on the target side (which is written German). About 22k of these words occur only a single time in the training data (singletons). Through careful preprocessing, as described in Section 4.3, we can shrink the vocabulary to about 22k words and the number of singletons to about 12k.
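To illustrate the effect of such cleaning on the target-side vocabulary, the following is a minimal sketch; the digit and punctuation handling shown here is an assumption for illustration, the actual steps are those described in Section 4.3.

```python
import re
from collections import Counter

def clean_sentence(text):
    """Lowercase, collapse digit runs and strip punctuation (illustrative only)."""
    text = text.lower()
    text = re.sub(r"\d+", "<num>", text)    # hypothetical digit placeholder
    text = re.sub(r"[^\w<>\s]", " ", text)  # drop punctuation marks
    return text.split()

def vocab_stats(sentences):
    """Return vocabulary size and number of singletons over cleaned sentences."""
    counts = Counter(tok for s in sentences for tok in clean_sentence(s))
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons

# Example: vocab, single = vocab_stats(training_sentences)
# In our data, careful preprocessing reduces ~35k raw word types to ~22k.
```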
4 Submitted System
Sign languages convey information through the use of manual parameters (hand shape, orientation, location and movement) and non-manual parameters (lips, eyes, head, upper body). To capture as much information from the signs as possible, we opt for an RGB-based approach, neglecting the tracked skeleton features provided by the shared task organizers. For the submitted system we rely on a pre-trained tokenizer for feature extraction and train a sequence-to-sequence model to produce sequences of whole words (no byte pair encoding). We further pre-process the sentences (the ground truths of the videos) to clean them. This step is crucial to push the model to focus more on the semantics of the data. Finally, in order to adhere to the expected output format for the submission, we convert the text back to display format using Microsoft's speech service, which applies inverse text normalization, capitalization and punctuation to the output text to make it more readable. The details of the various components of the system are described in the next subsections.
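The following sketch shows how these stages fit together at inference time; every function is a stub standing in for the corresponding component (I3D tokenizer, sequence-to-sequence model, display-format post-processing), not an actual implementation or API.

```python
def extract_i3d_features(video_frames):
    """Stub: pre-trained I3D tokenizer producing one feature vector per clip window."""
    return [[0.0] * 1024 for _ in range(max(1, len(video_frames) // 16))]

def translate_features(features):
    """Stub: sequence-to-sequence model emitting whole German words (no BPE)."""
    return ["platzhalter"] * max(1, len(features) // 4)

def restore_display_format(text):
    """Stub: inverse text normalization, capitalization and punctuation
    (done with Microsoft's speech service in the actual system)."""
    return text.capitalize() + "."

def translate_video(video_frames):
    features = extract_i3d_features(video_frames)    # 1. RGB feature extraction
    words = translate_features(features)             # 2. spoken-language words
    return restore_display_format(" ".join(words))   # 3. readable display format

# Example: hypothesis = translate_video(list_of_rgb_frames)
```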
4.1 Features
We use a pre-trained I3D (Carreira and Zisserman, 2017) model, based on inflated inceptions with 3D