JoeyS2T: Minimalistic Speech-to-Text Modeling with JoeyNMT
Mayumi Ohta
Computational Linguistics
Heidelberg University, Germany
ohta@cl.uni-heidelberg.de
Julia Kreutzer
Google Research
jkreutzer@google.com
Stefan Riezler
Computational Linguistics & IWR
Heidelberg University, Germany
riezler@cl.uni-heidelberg.de
Abstract
JoeyS2T is a JoeyNMT (Kreutzer et al., 2019) extension for speech-to-text tasks such as automatic speech recognition and end-to-end speech translation. It inherits the core philosophy of JoeyNMT, a minimalist NMT toolkit built on PyTorch, seeking simplicity and accessibility. JoeyS2T's workflow is self-contained, covering data pre-processing, model training, prediction, and evaluation, and is seamlessly integrated into JoeyNMT's compact and simple code base. On top of JoeyNMT's state-of-the-art Transformer-based encoder-decoder architecture, JoeyS2T provides speech-oriented components such as convolutional layers, SpecAugment, CTC loss, and WER evaluation. Despite its simplicity compared to prior implementations, JoeyS2T performs competitively on English speech recognition and English-to-German speech translation benchmarks. The implementation is accompanied by a walk-through tutorial and available at https://github.com/may-/joeys2t.
1 Introduction
End-to-end models have recently been shown to outperform complex pipelines of individually trained components on many NLP tasks. For example, in automatic speech recognition (ASR) and speech translation (ST), the performance gap between end-to-end models and cascaded pipelines, where an acoustic model is followed by an HMM for ASR, or an ASR model is followed by a machine translation (MT) model for ST, appears to be closed (Sperber et al., 2019; Bentivogli et al., 2021). An end-to-end approach has several advantages over a pipeline approach: First, it mitigates error propagation through the pipeline. Second, its data requirements are simpler since intermediate data interfaces to bridge components can be skipped. Furthermore, intermediate components such as phoneme dictionaries in ASR or transcriptions in ST require significant amounts of additional human expertise to build. For end-to-end models, the overall model architecture is simpler, consisting of a unified end-to-end neural network. Nonetheless, end-to-end components can be initialized from non end-to-end data, e.g., in audio encoding layers (Xu et al., 2021) or text decoding layers (Li et al., 2021).
ASR and ST tasks usually have a higher entry barrier than MT, especially for novices with little experience in machine learning, but also for NLP researchers who have previously worked only on text and not on speech processing. This can also be seen in the population of the different tracks at NLP conferences: the "Speech and Multimodality" track of ACL 2022 had only a third of the number of papers in the "Machine Translation and Multilinguality" track (https://public.tableau.com/views/ACL2022map/Dashboard1?:showVizHome=no). However, thanks to the end-to-end paradigm, these tasks are now more accessible for students or entry-level practitioners without huge resources, and without experience in handling the different modules of a cascaded system or in speech processing. The increased adoption of Transformer architectures (Vaswani et al., 2017) in both text (Kalyan et al., 2021) and speech processing (Dong et al., 2018; Karita et al., 2019a,b) has further eased the transfer of knowledge between the two fields, in addition to making joint modeling easier and more unified.
Reviewing existing code bases for end-to-end ASR and ST (for example, DeepSpeech (Hannun et al., 2014), ESPnet (Inaguma et al., 2020; Watanabe et al., 2020), fairseq S2T (Wang et al., 2020), NeurST (Zhao et al., 2021), and SpeechBrain (Ravanelli et al., 2021)), it becomes apparent that the practical use of open-source toolkits still requires significant experience in navigating large-scale code, using complex data formats, pre-processing, neural text modeling, and speech processing in general. High code complexity and a lack of documentation are frustrating hurdles for novices. We propose JoeyS2T, a minimalist and accessible framework that directly targets novices and their needs, to help them get started with speech recognition and translation, to accelerate their learning process, and to make ASR and ST more accessible and transparent.
We hope that more accessible implementations will also have the trickle-down effect of making the research built on top of them more accessible and more linguistically and geographically diverse (Joshi et al., 2020). This effect has already been observed in the adoption of JoeyNMT for text MT for low-resource languages (∀ et al., 2020; Camgoz et al., 2020; Zhao et al., 2020; Zacarías Márquez and Meza Ruiz, 2021; Ranathunga et al., 2021; Mirzakhalov et al., 2021). Furthermore, speech technology has an even higher potential for language inclusivity (Black, 2019; Abraham et al., 2020; Zhang et al., 2022; Liu et al., 2022).
2 Speech-to-Text Modeling
Automatic speech recognition and translation require mapping a speech feature sequence $X = \{x_i \in \mathbb{R}^d\}$ to a text token sequence $Y = \{y_t \in \mathcal{V}\}$. The continuous speech signal in its raw wave form is pre-processed into a sequence of discrete frames that are each represented as $d$-dimensional speech feature vectors $x_i$, e.g., log Mel filterbanks at the $i$-th time frame. In contrast, a textual sequence is naturally composed of discrete symbols that can be broken down into units of different granularity, e.g., characters, sub-words, or words. These units then form a vocabulary, so in the above formulation $y_t$ is the $t$-th target token from the vocabulary $\mathcal{V}$. The goal of S2T modeling is then to find the most probable target token sequence $\hat{Y}$ from all possible vocabulary combinations $\mathcal{V}^*$:

$$\hat{Y} = \operatorname*{arg\,max}_{Y \in \mathcal{V}^*} p(Y \mid X). \tag{1}$$
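For illustration, the following is a minimal sketch (not JoeyS2T's actual pipeline; the file name and hyperparameters are placeholder assumptions, chosen for 16 kHz audio) of how a raw waveform can be turned into such a sequence of log Mel filterbank vectors using torchaudio:

```python
import torch
import torchaudio

# Load a raw waveform (hypothetical file); shape: (channels, samples).
waveform, sample_rate = torchaudio.load("example.wav")

# 80-dimensional Mel filterbanks; 400-sample windows with 160-sample
# shift correspond to 25ms / 10ms frames at a 16 kHz sampling rate.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
)(waveform)

# Log compression; result is one d=80-dimensional vector x_i per frame.
X = torch.log(mel + 1e-6).squeeze(0).transpose(0, 1)  # (frames, 80)
```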
2.1 Why End-to-End Modeling?
In conventional HMM modeling, the posterior probability $p(Y \mid X)$ from Eq. (1) is decomposed into three components by introducing the HMM state sequences $S = \{s_t\}$:

$$p(Y \mid X) \propto \underbrace{p(X \mid S)}_{\text{Acoustic Model}}\,\underbrace{p(S \mid Y)}_{\text{Lexical Model}}\,\underbrace{p(Y)}_{\text{Language Model}}. \tag{2}$$
The components correspond to an acoustic model $p(X \mid S)$, a lexical representation model $p(S \mid Y)$, and a language model $p(Y)$.

For practitioners, this means that three individual models need to be implemented, trained, and combined. This comes with a large overhead, since each of them requires dedicated linguistic resources and experience in training and tuning. Attention-based deep neural networks have reduced this burden significantly since they implicitly model all three components in a single neural network, mapping $X$ directly to $Y$ (Chorowski et al., 2015; Chan et al., 2016).
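To make the contrast concrete, the following sketch (illustrative only, with arbitrary dimensions; not JoeyS2T's actual model) shows how a single Transformer encoder-decoder maps speech features $X$ directly to token scores for $Y$, with no separate acoustic, lexical, or language model:

```python
import torch
import torch.nn as nn

d, vocab = 80, 1000                      # feature dim, vocabulary size
model = nn.Transformer(d_model=256, batch_first=True)
src_proj = nn.Linear(d, 256)             # speech frames -> model dimension
tgt_emb = nn.Embedding(vocab, 256)       # token ids -> model dimension
out_proj = nn.Linear(256, vocab)         # hidden states -> token logits

X = torch.randn(1, 200, d)               # 200 frames of d-dim features
y = torch.randint(0, vocab, (1, 20))     # 20 target token ids
causal = nn.Transformer.generate_square_subsequent_mask(20)
hidden = model(src_proj(X), tgt_emb(y), tgt_mask=causal)
logits = out_proj(hidden)                # scores for p(y_t | y_<t, X)
```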
2.2 Optimization
Most approaches to sequence-to-sequence learning tasks like MT use the cross-entropy (Xent) loss for optimization, and break the sequence prediction task down into a token-level objective. The posterior probability from above is modeled as the product of output token probabilities conditioned on the entire input sequence $X$ and the target prefix $y_{<t}$:

$$p_{\text{xent}}(Y \mid X) := \prod_t p(y_t \mid y_{<t}, X). \tag{3}$$
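As a sketch (tensor shapes and the smoothing value are illustrative assumptions, not JoeyS2T's configuration), this token-level objective is what PyTorch's built-in cross-entropy computes when decoder logits and reference tokens are flattened:

```python
import torch
import torch.nn.functional as F

batch, tgt_len, vocab = 2, 5, 100
logits = torch.randn(batch, tgt_len, vocab)          # decoder outputs
targets = torch.randint(0, vocab, (batch, tgt_len))  # reference tokens y_t

# Token-level negative log-likelihood, i.e. -log p_xent(Y|X) summed over
# tokens, with label smoothing (the value 0.1 is an arbitrary example).
xent_loss = F.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1), label_smoothing=0.1
)
```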
A popular alternative in ASR is to employ the Connectionist Temporal Classification (CTC) loss (Graves and Jaitly, 2014). CTC uses a Markov assumption to model the transition of states, similar to a conventional HMM:

$$p_{\text{ctc}}(Y \mid X) := \sum_{A} \prod_t p(a_t \mid X), \tag{4}$$

where $A$ denotes the set of valid alignments from $X$ to $Y$, $a_t \in A$ is one possible alignment at the $t$-th time step, and marginalizing the conditional probability $p(a_t \mid X)$ over all valid possible alignments yields the sequence-level probability.
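A minimal sketch of this loss with PyTorch's built-in CTC implementation (shapes and the blank index are illustrative assumptions): the frame-level posteriors play the role of $p(a_t \mid X)$, and the marginalization over alignments happens inside the loss.

```python
import torch
import torch.nn.functional as F

T, N, C, S = 50, 2, 32, 10   # frames, batch, vocab incl. blank, target length
log_probs = F.log_softmax(torch.randn(T, N, C), dim=-1)  # p(a_t | X) per frame
targets = torch.randint(1, C, (N, S))                    # index 0 = blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# Marginalizes over all valid alignments via dynamic programming (Eq. 4).
ctc_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```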
This CTC formulation is suitable for learning monotonic alignments between audio and text, and it can also handle very long sequences efficiently by dynamic programming on the state transition graph. The assumption of conditional independence between time steps is a potentially harmful simplification, which is compensated for by a token-level objective and by jointly minimizing cross-entropy and CTC loss (Hori et al., 2017; Watanabe et al., 2017). The final optimization objective in the JoeyS2T implementation is a logarithmic linear combination of the label-smoothed cross-entropy loss (Eq. 3) and the CTC loss (Eq. 4).
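Given the two partial losses from the sketches above, such a combination could look as follows (a hedged sketch: the weight name and its value are hypothetical, not JoeyS2T's actual hyperparameters; since both losses are negative log-likelihoods, their linear combination is log-linear in the probabilities):

```python
import torch

def joint_loss(xent_loss: torch.Tensor, ctc_loss: torch.Tensor,
               ctc_weight: float = 0.3) -> torch.Tensor:
    # Weighted combination of the label-smoothed cross-entropy (Eq. 3)
    # and CTC (Eq. 4) objectives; 0 <= ctc_weight <= 1.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * xent_loss
```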