
UFO2: A UNIFIED PRE-TRAINING FRAMEWORK FOR ONLINE AND OFFLINE SPEECH
RECOGNITION
Li Fu, Siqi Li, Qingtao Li, Liping Deng, Fangzhu Li, Lu Fan, Meng Chen, Xiaodong He
JD AI Research, Beijing, China
ABSTRACT
In this paper, we propose a Unified pre-training Framework for Online and Offline (UFO2) Automatic Speech Recognition (ASR), which 1) simplifies the two separate training workflows for online and offline modes into one process, and 2) improves the Word Error Rate (WER) performance with limited utterance annotation. Specifically, we extend the conventional offline-mode Self-Supervised Learning (SSL)-based ASR approach to a unified manner, where model training is conditioned on both full-context and dynamic-chunked inputs. To enhance the pre-trained representation model, a stop-gradient operation is applied to decouple the online-mode objectives from the quantizer. Moreover, in both the pre-training and downstream fine-tuning stages, joint losses are proposed to train the unified model with full weight sharing across the two modes. Experimental results on the LibriSpeech dataset show that UFO2 outperforms the SSL-based baseline with 29.7% and 18.2% relative WER reductions in offline and online modes, respectively.
Index Terms—Automatic speech recognition, self-supervised
learning, online and offline unified model
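To illustrate the stop-gradient decoupling and joint losses summarized in the abstract, the following PyTorch-style sketch gives one possible reading of the description; the function names and loss form are our assumptions, not the authors' released code:

```python
import torch

def joint_pretrain_loss(c_off, c_on, q, contrastive_loss):
    """c_off / c_on: context features from the full-context / dynamic-chunked
    branches; q: quantized targets from the learnable quantizer."""
    loss_off = contrastive_loss(c_off, q)         # gradients reach the quantizer
    loss_on = contrastive_loss(c_on, q.detach())  # stop-gradient: they do not
    return loss_off + loss_on                     # joint loss, shared weights
```

Detaching the quantized targets in the online branch keeps the harder chunked objective from disturbing the quantizer, while a single set of encoder weights serves both modes.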
1. INTRODUCTION
In recent years, Self-Supervised Learning (SSL) has received much attention in the Automatic Speech Recognition (ASR) domain [1–7]. Generally, the SSL-based ASR approach first pre-trains a speech representation encoder on numerous unlabeled utterances via self-supervised strategies (e.g., masking, quantization, and contrastive learning [1]), and then fine-tunes the model on labeled data with ASR objectives. This paradigm has shown great potential for improving ASR performance with limited labeled speech, which is especially valuable when human-annotated utterances are expensive or scarce [8].
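For concreteness, here is a minimal PyTorch-style sketch of the contrastive objective used in such pre-training, in the spirit of Wav2vec2 [1]; the names, shapes, and sampling scheme are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, mask, negatives, temperature=0.1):
    """Wav2vec2-style contrastive objective over masked frames.

    context:   (B, T, D) encoder outputs c_t
    quantized: (B, T, D) quantized targets q_t
    mask:      (B, T) bool, True where the input frame was masked
    negatives: (K, B, T, D) distractor targets sampled from other frames
    """
    c = context[mask]                          # (N, D) masked positions only
    pos = quantized[mask].unsqueeze(0)         # (1, N, D) positive target
    neg = negatives[:, mask]                   # (K, N, D) distractors
    candidates = torch.cat([pos, neg], dim=0)  # (1 + K, N, D)

    # Cosine similarity of each masked context vector to its candidates,
    # then cross-entropy with the positive always at index 0 (InfoNCE).
    logits = F.cosine_similarity(c.unsqueeze(0), candidates, dim=-1) / temperature
    labels = torch.zeros(logits.size(1), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.t(), labels)
```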
Based on how tokens are emitted, ASR systems are typically categorized into two modes: 1) online mode (a.k.a. streaming), which is designed to emit each hypothesized word as quickly and accurately as possible while it is spoken [9–12], and 2) offline mode (a.k.a. non-streaming), which aims to accurately emit the complete hypotheses after processing a full utterance [13–15]. However, most existing SSL-based ASR methods perform the pre-training in an offline manner, i.e., each extracted representation is conditioned on full-context inputs [16]. For a downstream online ASR model, where no (or only limited) future context is permitted, accuracy may be hindered by the mode inconsistency between pre-training and fine-tuning [17]. One could instead pre-train the representation encoder in an online manner, but representation learning then carries a heavy burden because a large proportion of each utterance (i.e., the future context) is unavailable [18].
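To make the two conditioning regimes concrete, the sketch below (a hypothetical illustration, not the paper's code) builds the self-attention visibility mask for each mode: offline attention sees the full utterance, while chunked attention restricts each frame to its own chunk and earlier ones:

```python
import torch

def attention_mask(num_frames, chunk_size=None):
    """Boolean mask where mask[i, j] = True lets frame i attend to frame j.

    chunk_size=None -> offline mode: full-context attention.
    chunk_size=k    -> online/chunked mode: frame i attends only to frames
                       in its own chunk and in earlier chunks (no future
                       context beyond the current chunk boundary).
    """
    if chunk_size is None:
        return torch.ones(num_frames, num_frames, dtype=torch.bool)
    chunk_idx = torch.arange(num_frames) // chunk_size  # chunk id per frame
    return chunk_idx.unsqueeze(1) >= chunk_idx.unsqueeze(0)

# Offline: all True. Online with chunk_size=2: frames 0-1 see frames 0-1,
# frames 2-3 see frames 0-3, and so on.
full_mask = attention_mask(6)
chunked_mask = attention_mask(6, chunk_size=2)
```

In dynamic-chunk training [21], the chunk size is sampled randomly during training, so a single model can serve a range of latency requirements at inference.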
In contrast to offline-mode SSL, there are only a few works on pre-training for online ASR models. Chiu et al. [17] explored replacing the learnable quantizer in [1] with a Random-projection Quantizer (RQ), to make the quantized representation independent
of the recognition mode.

Fig. 1. Overview of the proposed UFO2: Encoder features $e$ are masked and fed to Conformer blocks to extract offline/online latent features ($h^{\text{off}}_*$/$h^{\text{on}}_*$) conditioned on full-context/dynamic-chunked inputs. Then context features $c^{\text{off}}$/$c^{\text{on}}$ and quantized features $q$/$\tilde{q}$ (w/ stop gradient) are used for the joint losses $L^{\text{off}}$ and $L^{\text{on}}$.

Although the RQ method was separately
evaluated on online and offline models with 0.6 billion parameters, as noted in [17], both the random strategy of the quantizer and smaller model sizes, which are more efficient for online ASR tasks, still need further investigation. Cao et al. [19] trained an SSL-based offline ASR model (the teacher) and then adopted knowledge distillation [20] to guide the fine-tuning of an online model (the student). Nevertheless, besides introducing additional offline-model optimization and distillation strategies into the SSL framework, the online model to be fine-tuned was still initialized from an offline-mode representation encoder. Moreover, in existing SSL-based works, the online and offline ASR systems were developed separately, incurring high costs in model development and training workflows for applications in different modes [21–23].
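Returning to the RQ approach above, a rough sketch of the idea follows (hypothetical names, and omitting details such as the feature normalization used in [17]): features are projected by a frozen random matrix and mapped to the index of the nearest entry in a frozen random codebook:

```python
import torch

class RandomProjectionQuantizer:
    """Frozen random projection + frozen random codebook (BEST-RQ style)."""

    def __init__(self, in_dim, code_dim, num_codes, seed=0):
        g = torch.Generator().manual_seed(seed)
        # Neither tensor is ever updated during training.
        self.proj = torch.randn(in_dim, code_dim, generator=g)
        self.codebook = torch.randn(num_codes, code_dim, generator=g)

    def __call__(self, features):
        """features: (B, T, in_dim) -> discrete target ids (B, T)."""
        z = features @ self.proj  # (B, T, code_dim)
        # Index of the nearest codebook entry per frame (Euclidean distance).
        dists = torch.cdist(z, self.codebook.expand(z.size(0), -1, -1))
        return dists.argmin(dim=-1)
```

Because neither the projection nor the codebook is learned, the discrete targets are identical regardless of whether the encoder consumes full-context or chunked inputs.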
In this paper, a novel Unified pre-training Framework is proposed to improve speech representation learning for downstream Online and Offline (UFO2) ASR tasks. In particular, UFO2 simplifies the training workflows by unifying the online and offline modes into a single model. As shown in Fig. 1, unlike the most representative SSL-based approach, Wav2vec2 [1], and its variants, e.g., Wav2vec2-Conformer [25], we extend the offline-mode approach to a unified one via four strategies on feature extraction and training objectives. 1) Dual-mode attention. To train a unified representation encoder, the full-context Multi-Headed Self-Attention (MHSA) in the Conformer block [24] is used to extract offline-mode features conditioned on the complete utterance. Simultaneously, the dynamic-chunked MHSA [21] is adopted to mimic different latency