JoeyS2T: Minimalistic Speech-to-Text Modeling with JoeyNMT
Mayumi Ohta
Computational Linguistics
Heidelberg University, Germany
ohta@cl.uni-heidelberg.de
Julia Kreutzer
Google Research
jkreutzer@google.com
Stefan Riezler
Computational Linguistics & IWR
Heidelberg University, Germany
riezler@cl.uni-heidelberg.de
Abstract
JoeyS2T is a JoeyNMT (Kreutzer et al., 2019) extension for speech-to-text tasks such as automatic speech recognition and end-to-end speech translation. It inherits the core philosophy of JoeyNMT, a minimalist NMT toolkit built on PyTorch, seeking simplicity and accessibility. JoeyS2T's workflow is self-contained, covering data pre-processing, model training, prediction, and evaluation, and is seamlessly integrated into JoeyNMT's compact and simple code base. On top of JoeyNMT's state-of-the-art Transformer-based encoder-decoder architecture, JoeyS2T provides speech-oriented components such as convolutional layers, SpecAugment, CTC loss, and WER evaluation. Despite its simplicity compared to prior implementations, JoeyS2T performs competitively on English speech recognition and English-to-German speech translation benchmarks. The implementation is accompanied by a walk-through tutorial and available at https://github.com/may-/joeys2t.
1 Introduction
End-to-end models have recently been shown to outperform complex pipelines of individually trained components on many NLP tasks. For example, in automatic speech recognition (ASR) and speech translation (ST), the performance gap between end-to-end models and cascaded pipelines, where an acoustic model is followed by an HMM for ASR, or an ASR model is followed by a machine translation (MT) model for ST, appears to be closed (Sperber et al., 2019; Bentivogli et al., 2021). An end-to-end approach has several advantages over a pipeline approach: First, it mitigates error propagation through the pipeline. Second, its data requirements are simpler since intermediate data interfaces to bridge components can be skipped. Furthermore, intermediate components such as phoneme dictionaries in ASR or transcriptions in ST require significant amounts of additional human expertise to build. For end-to-end models, the overall model architecture is simpler, consisting of a unified end-to-end neural network. Nonetheless, end-to-end components can be initialized from non end-to-end data, e.g., in audio encoding layers (Xu et al., 2021) or text decoding layers (Li et al., 2021).
ASR and ST tasks usually have a higher entry barrier than MT, especially for novices with little experience in machine learning, but also for NLP researchers who have previously worked only on text and not on speech processing. This can also be seen in the population of the different tracks at NLP conferences: the "Speech and Multimodality" track of ACL 2022 had only a third of the number of papers in the "Machine Translation and Multilinguality" track (https://public.tableau.com/views/ACL2022map/Dashboard1?:showVizHome=no). However, thanks to the end-to-end paradigm, these tasks are now more accessible for students or entry-level practitioners without huge resources, and without experience in handling the different modules of a cascaded system or in speech processing. The increased adoption of Transformer architectures (Vaswani et al., 2017) in both text (Kalyan et al., 2021) and speech processing (Dong et al., 2018; Karita et al., 2019a,b) has further eased the transfer of knowledge between the two fields, in addition to making joint modeling easier and more unified.
Reviewing existing code bases for end-to-end ASR and ST (for example, DeepSpeech (Hannun et al., 2014), ESPnet (Inaguma et al., 2020; Watanabe et al., 2020), fairseq S2T (Wang et al., 2020), NeurST (Zhao et al., 2021), and SpeechBrain (Ravanelli et al., 2021)), it becomes apparent that the practical use of open-source toolkits still requires significant experience in navigating large-scale code, using complex data formats, pre-processing, neural text modeling, and speech processing in general. High code complexity and a lack of documentation are frustrating hurdles for novices. We propose JoeyS2T, a minimalist and accessible framework that directly targets novices and their needs, to help them get started with speech recognition and translation, to accelerate their learning process, and to make ASR and ST more accessible and transparent.
We hope that more accessible implementations will also have the trickle-down effect of making the research built on top of them more accessible and more linguistically and geographically diverse (Joshi et al., 2020). This effect has already been observed in the adoption of JoeyNMT for text MT for low-resource languages (∀ et al., 2020; Camgoz et al., 2020; Zhao et al., 2020; Zacarías Márquez and Meza Ruiz, 2021; Ranathunga et al., 2021; Mirzakhalov et al., 2021). Furthermore, speech technology has an even higher potential for language inclusivity (Black, 2019; Abraham et al., 2020; Zhang et al., 2022; Liu et al., 2022).
2 Speech-to-Text Modeling
Automatic speech recognition and translation require mapping a speech feature sequence $X = \{x_i \in \mathbb{R}^d\}$ to a text token sequence $Y = \{y_t \in \mathcal{V}\}$. The continuous speech signal in its raw wave form is pre-processed into a sequence of discrete frames that are each represented as $d$-dimensional speech feature vectors $x_i$, e.g., log Mel filterbanks at the $i$-th time frame. In contrast, a textual sequence is naturally composed of discrete symbols that can be broken down into units of different granularity, e.g., characters, sub-words, or words. These units then form a vocabulary, so in the above formulation $y_t$ is the $t$-th target token from the vocabulary $\mathcal{V}$. The goal of S2T modeling is then to find the most probable target token sequence $\hat{Y}$ from all possible vocabulary combinations $\mathcal{V}^*$:

$$\hat{Y} = \operatorname*{arg\,max}_{Y \in \mathcal{V}^*} p(Y \mid X). \tag{1}$$
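For illustration, the following is a minimal sketch (not JoeyS2T's actual pipeline; the file name and hyperparameters are placeholder assumptions, chosen for 16 kHz audio) of how a raw waveform can be turned into such a sequence of log Mel filterbank vectors using torchaudio:

```python
import torch
import torchaudio

# Load a raw waveform (hypothetical file); shape: (channels, samples).
waveform, sample_rate = torchaudio.load("example.wav")

# 80-dimensional Mel filterbanks; 400-sample windows with 160-sample
# shift correspond to 25ms / 10ms frames at a 16 kHz sampling rate.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
)(waveform)

# Log compression; result is one d=80-dimensional vector x_i per frame.
X = torch.log(mel + 1e-6).squeeze(0).transpose(0, 1)  # (frames, 80)
```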
2.1 Why End-to-End Modeling?
In conventional HMM modeling, the posterior probability $p(Y \mid X)$ from Eq. (1) is decomposed into three components by introducing the HMM state sequences $S = \{s_t\}$:

$$p(Y \mid X) \propto \underbrace{p(X \mid S)}_{\text{Acoustic Model}}\,\underbrace{p(S \mid Y)}_{\text{Lexical Model}}\,\underbrace{p(Y)}_{\text{Language Model}}. \tag{2}$$
The components correspond to an acoustic model $p(X \mid S)$, a lexical representation model $p(S \mid Y)$, and a language model $p(Y)$.

For practitioners, this means that three individual models need to be implemented, trained, and combined. This comes with a large overhead, since each of them requires dedicated linguistic resources and experience in training and tuning. Attention-based deep neural networks have reduced this burden significantly since they implicitly model all three components in a single neural network, mapping $X$ directly to $Y$ (Chorowski et al., 2015; Chan et al., 2016).
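To make the contrast concrete, the following sketch (illustrative only, with arbitrary dimensions; not JoeyS2T's actual model) shows how a single Transformer encoder-decoder maps speech features $X$ directly to token scores for $Y$, with no separate acoustic, lexical, or language model:

```python
import torch
import torch.nn as nn

d, vocab = 80, 1000                      # feature dim, vocabulary size
model = nn.Transformer(d_model=256, batch_first=True)
src_proj = nn.Linear(d, 256)             # speech frames -> model dimension
tgt_emb = nn.Embedding(vocab, 256)       # token ids -> model dimension
out_proj = nn.Linear(256, vocab)         # hidden states -> token logits

X = torch.randn(1, 200, d)               # 200 frames of d-dim features
y = torch.randint(0, vocab, (1, 20))     # 20 target token ids
causal = nn.Transformer.generate_square_subsequent_mask(20)
hidden = model(src_proj(X), tgt_emb(y), tgt_mask=causal)
logits = out_proj(hidden)                # scores for p(y_t | y_<t, X)
```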
2.2 Optimization
Most approaches to sequence-to-sequence learning tasks like MT use the cross-entropy (Xent) loss for optimization, and break the sequence prediction task down into a token-level objective. The posterior probability from above is modeled as the product of output token probabilities conditioned on the entire input sequence $X$ and the target prefix $y_{<t}$:

$$p_{\text{xent}}(Y \mid X) := \prod_t p(y_t \mid y_{<t}, X). \tag{3}$$
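As a sketch (tensor shapes and the smoothing value are illustrative assumptions, not JoeyS2T's configuration), this token-level objective is what PyTorch's built-in cross-entropy computes when decoder logits and reference tokens are flattened:

```python
import torch
import torch.nn.functional as F

batch, tgt_len, vocab = 2, 5, 100
logits = torch.randn(batch, tgt_len, vocab)          # decoder outputs
targets = torch.randint(0, vocab, (batch, tgt_len))  # reference tokens y_t

# Token-level negative log-likelihood, i.e. -log p_xent(Y|X) summed over
# tokens, with label smoothing (the value 0.1 is an arbitrary example).
xent_loss = F.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1), label_smoothing=0.1
)
```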
A popular alternative in ASR is to employ the Connectionist Temporal Classification (CTC) loss (Graves and Jaitly, 2014). CTC uses a Markov assumption to model the transition of states, similar to a conventional HMM:

$$p_{\text{ctc}}(Y \mid X) := \sum_{A} \prod_t p(a_t \mid X), \tag{4}$$

where $A$ denotes the set of valid alignments from $X$ to $Y$, $a_t \in A$ is one possible alignment at the $t$-th time step, and marginalizing the conditional probability $p(a_t \mid X)$ over all valid possible alignments yields the sequence-level probability.
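A minimal sketch of this loss with PyTorch's built-in CTC implementation (shapes and the blank index are illustrative assumptions): the frame-level posteriors play the role of $p(a_t \mid X)$, and the marginalization over alignments happens inside the loss.

```python
import torch
import torch.nn.functional as F

T, N, C, S = 50, 2, 32, 10   # frames, batch, vocab incl. blank, target length
log_probs = F.log_softmax(torch.randn(T, N, C), dim=-1)  # p(a_t | X) per frame
targets = torch.randint(1, C, (N, S))                    # index 0 = blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# Marginalizes over all valid alignments via dynamic programming (Eq. 4).
ctc_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```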
This CTC formulation is suitable for learning monotonic alignments between audio and text, and it can also handle very long sequences efficiently by dynamic programming on the state transition graph. The assumption of conditional independence between time steps is a potentially harmful simplification, which is compensated for by a token-level objective and by jointly minimizing cross-entropy and CTC loss (Hori et al., 2017; Watanabe et al., 2017). The final optimization objective in the JoeyS2T implementation is a logarithmic linear combination of the label-smoothed cross-entropy loss (Eq. 3) and the CTC loss (Eq. 4).
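Given the two partial losses from the sketches above, such a combination could look as follows (a hedged sketch: the weight name and its value are hypothetical, not JoeyS2T's actual hyperparameters; since both losses are negative log-likelihoods, their linear combination is log-linear in the probabilities):

```python
import torch

def joint_loss(xent_loss: torch.Tensor, ctc_loss: torch.Tensor,
               ctc_weight: float = 0.3) -> torch.Tensor:
    # Weighted combination of the label-smoothed cross-entropy (Eq. 3)
    # and CTC (Eq. 4) objectives; 0 <= ctc_weight <= 1.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * xent_loss
```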