
[Fig. 1 diagram: Speech feeds a Causal Encoder followed by a Non-causal Encoder, which together with an RNN-T decoder form the Multilingual First-Pass Model. The Multilingual Deliberation Decoder (a Transformer decoder with two attention streams) attends to the encoder output e and the first-pass hypotheses yr (encoded as hb), producing attention contexts ce and cb and the second-pass output yd.]

Fig. 1. Multilingual deliberation based on a truly multilingual cascaded encoder model as the first pass.
multiple choices in designing a deliberation model in a multilingual setup: e.g., one can use a per-language deliberation decoder after a first-pass model, or a single deliberation decoder for all languages. The former is closer to monolingual deliberation and may need explicit language information at inference time, while the latter is truly multilingual and simpler for both training and inference. In this work, we investigate a truly multilingual deliberation model, i.e., a single deliberation decoder shared across multiple languages.
There are a few distinctions between multilingual deliberation and its monolingual counterpart: 1) the hypotheses from the first-pass model are in multiple languages, and it is unclear whether a single deliberation text encoder can model multilingual inputs; 2) a single deliberation decoder attends to both text and audio encodings in different languages; and 3) it is unclear whether deliberation also needs increased capacity in the multilingual scenario (similar to [10]) to achieve improvements comparable to those in monolingual scenarios.
We show the diagram of the proposed multilingual deliberation model in Fig. 1. To achieve a truly multilingual system, we use the language-agnostic multilingual model proposed in [11] as the first pass. The first-pass model does not require any explicit language information in either the encoder or the decoder. Note that [11] uses only left context in its audio encoders, whereas we use a non-causal cascaded encoder [23] to leverage the audio right context. Our cascaded encoder is also multilingual and does not explicitly use any language information. For all languages, we sample the first-pass decoder softmax [20] to generate multilingual hypothesis outputs.
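Concretely, sampling from the decoder softmax can be sketched as follows. This is a minimal illustration assuming per-step logits; the function name and temperature parameter are our assumptions, not the implementation of [20]:

import tensorflow as tf

# Hedged sketch: draw one wordpiece id per beam from the first-pass
# decoder's output distribution at the current step.
def sample_hypothesis_step(logits: tf.Tensor, temperature: float = 1.0) -> tf.Tensor:
    # logits: [batch, vocab_size] decoder outputs for the current step.
    return tf.random.categorical(logits / temperature, num_samples=1)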
To achieve a truly multilingual ASR system, our multilingual text encoder does not use any explicit language information and encodes text inputs of multiple languages using shared parameters. Typically, a bidirectional LSTM [14] or a conformer encoder [25] is used as the text encoder. Similarly, to achieve a truly multilingual rescorer, we use a single deliberation decoder based on transformer layers [24]. No explicit language information is used during rescoring. The multilingual deliberation decoder attends to both the non-causal cascaded encoder output (e) and the first-pass hypotheses (yr) generated from the non-causal encoder.
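To make the two-source attention concrete, below is a minimal Keras-style sketch of one decoder layer. The layer sizes and the summation used to fuse the two attention contexts (ce and cb in Fig. 1) are our assumptions, not the Lingvo implementation; concatenation followed by a projection is another common fusion choice.

import tensorflow as tf

class TwoSourceDecoderLayer(tf.keras.layers.Layer):
    """One transformer decoder layer with two cross-attention sources."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        head_dim = d_model // num_heads
        self.self_attn = tf.keras.layers.MultiHeadAttention(num_heads, head_dim)
        self.audio_attn = tf.keras.layers.MultiHeadAttention(num_heads, head_dim)
        self.text_attn = tf.keras.layers.MultiHeadAttention(num_heads, head_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.norm3 = tf.keras.layers.LayerNormalization()

    def call(self, y, e, h):
        # y: [B, U, d] target embeddings; e: [B, T, d] non-causal encoder
        # output; h: [B, U', d] text-encoded first-pass hypotheses (hb).
        y = self.norm1(y + self.self_attn(y, y, use_causal_mask=True))
        c_e = self.audio_attn(y, e)    # attention context over audio (ce)
        c_b = self.text_attn(y, h)     # attention context over hypotheses (cb)
        y = self.norm2(y + c_e + c_b)  # sum fusion of the two contexts
        return self.norm3(y + self.ffn(y))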
In an ASR task covering 9 languages, we use 16,384 wordpieces as the output vocabulary for both the first-pass multilingual model and the second-pass deliberation. The wordpiece model is generated on the text transcriptions of our training data by pooling all the languages together. We use a word count threshold of 20 to drop uncommon words when building the wordpiece model. For Chinese, due to the ambiguity of word segmentation, we simply use characters instead of words, while for Japanese we use MeCab [26] to segment the text into words. During training, we similarly pool all the data together and sample each batch according to the natural distribution. We freeze the first-pass model and optimize only the deliberation network, using a cross-entropy loss between the output of the network and the ground-truth transcripts.
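As an illustration of building the pooled wordpiece vocabulary, the sketch below uses the open-source HuggingFace tokenizers library as a stand-in for the internal wordpiece builder; the file path is hypothetical, and min_frequency only approximates the word-count threshold described above.

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# Chinese is pre-segmented into characters and Japanese into MeCab
# words before this step.
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=16_384,   # shared output vocabulary for all 9 languages
    min_frequency=20,    # approximates the word-count threshold of 20
    special_tokens=["[UNK]"],
)
# Transcripts of all languages are pooled into one training corpus.
tokenizer.train(["pooled_transcripts.txt"], trainer)
tokenizer.save("multilingual_wordpiece.json")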
Locale   Language          Counts (M)   Hours (K)
en-US    English (USA)        34.0        52.6
zh-TW    Chinese              16.6        22.0
fr-FR    French               16.5        23.7
de-DE    German               15.3        23.4
ja-JP    Japanese             14.9        20.2
es-US    Spanish (USA)        14.2        23.8
es-ES    Spanish (Spain)      12.9        20.1
it-IT    Italian              11.8        19.8
en-GB    English (UK)          6.0         8.6
Total                        142.2       214.2

Table 1. Training data for the 9 languages. Utterance counts are in millions (M) and durations are in thousands (K) of hours.
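For intuition, sampling batches according to the natural distribution (described above) could be sketched as below, with mixing weights proportional to the utterance counts in Table 1. The per-locale datasets here are toy stand-ins; real pipelines would yield (speech, transcript) examples.

import tensorflow as tf

# Utterance counts from Table 1 (millions), en-US through en-GB.
counts = [34.0, 16.6, 16.5, 15.3, 14.9, 14.2, 12.9, 11.8, 6.0]
weights = [c / sum(counts) for c in counts]  # natural distribution

# Toy stand-ins for the 9 per-locale tf.data pipelines.
per_language_datasets = [
    tf.data.Dataset.from_tensor_slices(tf.fill([100], i)).repeat()
    for i in range(9)
]
mixed = tf.data.Dataset.sample_from_datasets(per_language_datasets, weights)
batches = mixed.batch(4096)  # global batch size from Sec. 2.2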
2.2. Scaling Up Deliberation
[10] shows that scaling up the number of model parameters is an efficient way to increase modeling capacity for multilingual ASR. Inspired by [10], we scale up the deliberation architecture in different ways to increase its capacity for multilingual rescoring. As shown in Fig. 1, there are two major components in the deliberation decoder: 1) the multilingual text encoder, and 2) the multilingual two-source transformer decoder. In this work, we empirically study the effect of increasing the capacity of these two components. We experiment with increasing either the width or the depth of the text encoder for modeling the hypotheses in multiple languages. While [11] finds that increasing the encoder size of a transducer model is an effective way to add capacity, we instead scale the attention-based decoder up to 1B parameters and compare that to [11] with a large cascaded encoder. We also explore alternatives such as scaling up the non-causal cascaded encoder first and then adding deliberation, or scaling up the deliberation decoder alone. Training large-scale models is challenging, and we use Adafactor [27] and repeated layers [28] to reduce high-bandwidth memory use. Our models are trained in TensorFlow [29] using the Lingvo framework [28] on 4×4×8 Tensor Processing Unit (TPU) v4 slices [30] with a global batch size of 4,096, and optimized using synchronized stochastic gradient descent. We cap the gradient norm of each parameter at 5.0 to stabilize training.
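The optimization recipe can be sketched as follows. Here first_pass and deliberation are trivial Keras-style stand-ins for the Lingvo models, and the learning rate is assumed since the text gives no schedule; only the frozen first pass, the cross-entropy loss, and Adafactor with a 5.0 gradient-norm cap come from the description above.

import tensorflow as tf

vocab = 16_384
first_pass = tf.keras.Sequential([tf.keras.layers.Dense(256)])       # stand-in
deliberation = tf.keras.Sequential([tf.keras.layers.Dense(vocab)])   # stand-in
first_pass.trainable = False  # only the deliberation network is optimized

optimizer = tf.keras.optimizers.Adafactor(
    learning_rate=1e-3,  # assumed; the text gives no schedule
    clipnorm=5.0,        # cap each parameter's gradient norm at 5.0
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(speech, targets):
    enc = first_pass(speech, training=False)  # frozen first-pass encodings
    with tf.GradientTape() as tape:
        logits = deliberation(enc, training=True)
        loss = loss_fn(targets, logits)  # CE against ground-truth transcripts
    grads = tape.gradient(loss, deliberation.trainable_variables)
    optimizer.apply_gradients(zip(grads, deliberation.trainable_variables))
    return loss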
3. EXPERIMENTAL DETAILS
3.1. Data
Our multilingual training data consists of Voice Search speech from 9 languages in different locales, including English (USA), Chinese, French, German, Japanese, Spanish (USA), Spanish (Spain), Italian, and English (UK) (see Table 1 for more details). The data consist of around 142M utterances with a total duration of around 214K hours. The training data is anonymized and human-transcribed, and the per-language training data ranges from 6M to 34M utterances. For each language, we use a test set of utterances sampled from Voice Search traffic; the test sets range from 23.8K to 300K utterances. The test sets do not overlap with the training set and are likewise anonymized and human-transcribed.
3.2. Model Description
We use a baseline multilingual model similar to [11], which is language agnostic. The baseline model consists of a 12-layer causal conformer encoder and a 5-layer non-causal cascaded encoder. The