
length. Previous work (Camgoz et al., 2018; Orbay and Akarun, 2020) has identified the need for strong tokenizers to produce compact representations of the incoming sign language video footage. Hence, a considerable body of publications targets creating tokenizer models that are often trained on sign language recognition data sets (Koller et al., 2020, 2016; Zhou et al., 2022) or sign spotting data sets (Albanie et al., 2020; Varol et al., 2021; Belissen et al., 2019; Pfister et al., 2013).
There are several data sets relevant to sign language translation. Some of the most frequently encountered are RWTH-PHOENIX-Weather 2014T (Koller et al., 2015a; Camgoz et al., 2018) and CSL (Huang et al., 2018) (which could also be considered a recognition data set). However, promising new data sets are appearing: OpenASL (Shi et al., 2022a), the SP-10 data set (Yin et al., 2022), which mainly covers isolated translations, and How2Sign (Duarte et al., 2021).
3 Data
To train our system, we used the training data provided by the shared task organizers. The data can be considered real-life-authentic, as it stems from broadcast news using two different sources: FocusNews and SRF. FocusNews, henceforth FN, is an online TV channel covering deaf signers, with videos of 5 minutes and variable sampling rates of either 25, 30 or 50 fps. SRF represents public Swiss TV, with content from daily news and weather forecasts that is interpreted by hearing interpreters. These videos are recorded at a sampling rate of 25 fps. All data therefore covers Swiss German Sign Language (DSGS). Our feature extractors are pretrained on BSL-1k (Albanie et al., 2020) and AV-HuBERT (Shi et al., 2022b).
Additionally, we evaluate the effect of introducing a public sign language lexicon covering isolated signs¹, which we refer to as Lex. It provides main hand shape annotations, one or more (mostly one) examples of the sign, and an example of how the sign is used in a continuous sentence. We choose a subset that overlaps in vocabulary with either FocusNews or SRF. As part of the competition, independent dev and test sets are provided, consisting of 420 and 488 utterances, respectively.
Table 1 shows the statistics of the training data. We see that there are about 35 hours of training data in total.

¹ https://signsuisse.sgb-fss.ch/
                        SRF      FN     Lex   Total
Videos                   29     197    1201    1427
Hours                  15.6    19.1     0.9    35.6

Raw: no preprocessing
Vocabulary            18942   21490       –   34783
Singletons            12433   13624       –   22083

Clean: careful preprocessing
Vocabulary            13029   14555     821   22840
Singletons             7483    7923     591   12290

Table 1: Data statistics on the data used for training. SRF and FN refer to SRF broadcast and FocusNews data, while Lex stands for a public sign language lexicon. Singletons are words that occur only a single time during training.
In its raw form, without any preprocessing, the data is case sensitive and contains punctuation and digits. In this raw form the vocabulary amounts to close to 35k different words on the target side (which is written German). About 22k of these words occur only a single time in the training data (singletons). Through careful preprocessing, as described in Section 4.3, we can shrink the vocabulary to about 22k words and the number of singletons to about 12k.
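To illustrate the effect of such cleaning on the target-side vocabulary, the following is a minimal sketch; the digit and punctuation handling shown here is an assumption for illustration, the actual steps are those described in Section 4.3.

```python
import re
from collections import Counter

def clean_sentence(text):
    """Lowercase, collapse digit runs and strip punctuation (illustrative only)."""
    text = text.lower()
    text = re.sub(r"\d+", "<num>", text)    # hypothetical digit placeholder
    text = re.sub(r"[^\w<>\s]", " ", text)  # drop punctuation marks
    return text.split()

def vocab_stats(sentences):
    """Return vocabulary size and number of singletons over cleaned sentences."""
    counts = Counter(tok for s in sentences for tok in clean_sentence(s))
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons

# Example: vocab, single = vocab_stats(training_sentences)
# In our data, careful preprocessing reduces ~35k raw word types to ~22k.
```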
4 Submitted System
Sign languages convey information through the use of manual parameters (hand shape, orientation, location and movement) and non-manual parameters (lips, eyes, head, upper body). To capture as much information from the signs as possible, we opt for an RGB-based approach, neglecting the tracked skeleton features provided by the shared task organizers. For the submitted system we rely on a pre-trained tokenizer for feature extraction and train a sequence-to-sequence model to produce sequences of whole words (no byte pair encoding). We further pre-process the sentences (the ground truths of the videos) to clean them. This step is crucial to push the model to focus more on the semantics of the data. Finally, in order to adhere to the expected output format for the submission, we convert the text back to display format using Microsoft's speech service, which applies inverse text normalization, capitalization and punctuation to the output text to make it more readable. The details of the various components of the system are described in the next subsections.
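The following sketch shows how these stages fit together at inference time; every function is a stub standing in for the corresponding component (I3D tokenizer, sequence-to-sequence model, display-format post-processing), not an actual implementation or API.

```python
def extract_i3d_features(video_frames):
    """Stub: pre-trained I3D tokenizer producing one feature vector per clip window."""
    return [[0.0] * 1024 for _ in range(max(1, len(video_frames) // 16))]

def translate_features(features):
    """Stub: sequence-to-sequence model emitting whole German words (no BPE)."""
    return ["platzhalter"] * max(1, len(features) // 4)

def restore_display_format(text):
    """Stub: inverse text normalization, capitalization and punctuation
    (done with Microsoft's speech service in the actual system)."""
    return text.capitalize() + "."

def translate_video(video_frames):
    features = extract_i3d_features(video_frames)    # 1. RGB feature extraction
    words = translate_features(features)             # 2. spoken-language words
    return restore_display_format(" ".join(words))   # 3. readable display format

# Example: hypothesis = translate_video(list_of_rgb_frames)
```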
4.1 Features
We use a pre-trained I3D (Carreira and Zisserman, 2017) model, based on inflated inceptions with 3D