
JoeyS2T: Minimalistic Speech-to-Text Modeling with JoeyNMT
Mayumi Ohta
Computational Linguistics
Heidelberg University, Germany
ohta@cl.uni-heidelberg.de
Julia Kreutzer
Google Research
jkreutzer@google.com
Stefan Riezler
Computational Linguistics & IWR
Heidelberg University, Germany
riezler@cl.uni-heidelberg.de
Abstract
JoeyS2T is a JoeyNMT (Kreutzer et al., 2019) extension for speech-to-text tasks such as automatic speech recognition and end-to-end speech translation. It inherits the core philosophy of JoeyNMT, a minimalist NMT toolkit built on PyTorch, seeking simplicity and accessibility. JoeyS2T's workflow is self-contained, covering data pre-processing, model training, prediction, and evaluation, and is seamlessly integrated into JoeyNMT's compact and simple code base. On top of JoeyNMT's state-of-the-art Transformer-based encoder-decoder architecture, JoeyS2T provides speech-oriented components such as convolutional layers, SpecAugment, CTC loss, and WER evaluation. Despite its simplicity compared to prior implementations, JoeyS2T performs competitively on English speech recognition and English-to-German speech translation benchmarks. The implementation is accompanied by a walk-through tutorial and is available at https://github.com/may-/joeys2t.
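As a rough illustration of this self-contained workflow, JoeyS2T is driven by a single YAML configuration per experiment, following JoeyNMT's command-line conventions; the config file name below is an illustrative assumption, not a file shipped with the toolkit:

    # train a speech-to-text model from a YAML config (illustrative config name)
    python -m joeynmt train configs/librispeech_asr.yaml
    # evaluate the trained model on the held-out test set (WER for ASR, BLEU for ST)
    python -m joeynmt test configs/librispeech_asr.yaml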
1 Introduction
End-to-end models have recently been shown to outperform complex pipelines of individually trained components in many NLP tasks. For example, in the area of automatic speech recognition (ASR) and speech translation (ST), the performance gap between end-to-end models and cascaded pipelines, where an acoustic model is followed by an HMM for ASR, or an ASR model is followed by a machine translation (MT) model for ST, seems to be closed (Sperber et al., 2019; Bentivogli et al., 2021). An end-to-end approach has several advantages over a pipeline approach: First, it mitigates error propagation through the pipeline. Second, its data requirements are simpler, since intermediate data interfaces to bridge components can be skipped. Furthermore, intermediate components such as phoneme dictionaries in ASR or transcriptions in ST require significant amounts of additional human expertise to build. For end-to-end models, the overall model architecture is simpler, consisting of a unified end-to-end neural network. Nonetheless, end-to-end components can be initialized from non-end-to-end data, e.g., in audio encoding layers (Xu et al., 2021) or text decoding layers (Li et al., 2021).
ASR and ST tasks usually have a higher entry barrier than MT, especially for novices with little experience in machine learning, but also for NLP researchers who have previously only worked on text rather than speech processing. This can also be seen in the population of the different tracks at NLP conferences. For example, the “Speech and Multimodality” track of ACL 2022 had only a third of the number of papers in the “Machine Translation and Multilinguality” track.1 However, thanks to the end-to-end paradigm, these tasks are now more accessible for students and entry-level practitioners without huge resources, and without experience in handling the different modules of a cascaded system or in speech processing. The increased adoption of Transformer architectures (Vaswani et al., 2017) in both text (Kalyan et al., 2021) and speech processing (Dong et al., 2018; Karita et al., 2019a,b) has further eased the transfer of knowledge between the two fields, in addition to making joint modeling easier and more unified.
Reviewing existing code bases for end-to-end ASR and ST, for example DeepSpeech (Hannun et al., 2014), ESPnet (Inaguma et al., 2020; Watanabe et al., 2020), fairseq S2T (Wang et al., 2020), NeurST (Zhao et al., 2021), and SpeechBrain (Ravanelli et al., 2021), it becomes apparent that the practical use of open-source toolkits still requires significant experience in navigating large-scale code, using complex data formats, pre-processing, neural text modeling, and speech processing.

1 https://public.tableau.com/views/ACL2022map/Dashboard1?:showVizHome=no