ANCHORED SPEECH RECOGNITION WITH NEURAL TRANSDUCERS
Desh Raj1, Junteng Jia2, Jay Mahadeokar2, Chunyang Wu2, Niko Moritz2, Xiaohui Zhang2, Ozlem Kalinli2
1Center for Language and Speech Processing, Johns Hopkins University, USA, 2Meta AI, USA
ABSTRACT
Neural transducers have achieved human-level performance on standard speech recognition benchmarks. However, their performance degrades significantly in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio (SNR). Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., a wake-word) to recognize device-directed speech while ignoring interfering background speech. In this paper, we investigate anchored speech recognition to make neural transducers robust to background speech. We extract context information from the anchor segment with a tiny auxiliary network, and use encoder biasing and joiner gating to guide the transducer towards the target speech. Moreover, to improve the robustness of context-embedding extraction, we propose auxiliary training objectives that disentangle lexical content from speaking style. We evaluate our methods on synthetic LibriSpeech-based mixtures spanning several SNR and overlap conditions; averaged over all conditions, they reduce word error rate by 19.6% relative to a strong baseline.
Index Terms— RNN-T, background speech suppression, anchored speech recognition
1. INTRODUCTION
Neural transducers (using RNNs or transformers) [1] have become the dominant modeling technique in end-to-end on-device speech recognition [2–5], since they allow streaming transcription similar to CTC models [6–8], while still retaining conditional dependence, like attention-based encoder-decoders (AEDs) [9, 10]. Although they have shown state-of-the-art performance on several benchmarks [11], they still suffer from degradation caused by interference due to background speech and noise [12, 13]. Recent studies have used context audio for implicit speaker and environment adaptation of transducer models [14].
In this paper, we focus on the problem of suppressing background speech using explicit auxiliary information (often referred to as target speech extraction/recognition in the literature [15, 16]). Such auxiliary information is usually provided in the form of speaker embeddings (e.g., d-vectors in VoiceFilter-Lite [17, 18]) or enrollment utterances (e.g., SpeakerBeam [19, 20]). However, these strategies require the target speaker to be enrolled with the device, which may not always be feasible or desirable from a privacy perspective. In contrast, anchored speech recognition refers to a class of methods that use information from an anchor segment (such as a wake-word) to recognize device-directed speech. By relying only on the anchor segment and extracting the auxiliary information on-the-fly, these models bypass the need for a speaker enrollment stage. The idea was first proposed in the context of hybrid ASR systems [21] and later extended to AED models using a speaker encoder network to extract auxiliary information from the anchor segment [22].

(Work done during internship at Meta AI.)

Fig. 1: Overview of transducer-based anchored speech recognition. [Diagram: an auxiliary network extracts a context embedding from the anchor segment of the utterance "Hey Assistant, call John"; the embedding conditions the transducer (encoder, prediction network, joiner, softmax) through encoder biasing and joiner gating.]
We investigate anchored speech recognition to improve the performance of transducers in the presence of background speech. In particular, we add a tiny auxiliary network to extract context information from the anchor segment, and use it to bias the transducer towards the primary speaker. In order to disentangle speaking style from lexical content in the context embedding, we explore several auxiliary training objectives. We conduct controlled evaluations on LibriSpeech mixtures, where our models show relative word error rate (WER) improvements of 19.6%, on average, compared to an Emformer baseline trained with background augmentation.
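To make the two conditioning mechanisms concrete, the following is a minimal PyTorch sketch of one plausible realization, not the paper's implementation: the names (`AnchorConditioning`, `bias_proj`, `gate_proj`) and the exact placement of the bias and gate inside the network are our assumptions, and `ctx` stands for the context embedding produced by the tiny auxiliary network from the anchor-segment frames.

```python
# Hedged sketch of encoder biasing and joiner gating; all names are assumptions.
import torch
import torch.nn as nn

class AnchorConditioning(nn.Module):
    def __init__(self, ctx_dim: int, enc_dim: int, joiner_dim: int):
        super().__init__()
        self.bias_proj = nn.Linear(ctx_dim, enc_dim)     # for encoder biasing
        self.gate_proj = nn.Linear(ctx_dim, joiner_dim)  # for joiner gating

    def bias_encoder(self, enc_out: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # Add a projected context vector to every encoder frame.
        # enc_out: (B, T, enc_dim); ctx: (B, ctx_dim)
        return enc_out + self.bias_proj(ctx).unsqueeze(1)

    def gate_joiner(self, joiner_in: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # Scale the joiner input with an elementwise sigmoid gate in [0, 1],
        # letting the anchor embedding attenuate background-dominated frames.
        # joiner_in: (B, T, U, joiner_dim); ctx: (B, ctx_dim)
        gate = torch.sigmoid(self.gate_proj(ctx))        # (B, joiner_dim)
        return joiner_in * gate.unsqueeze(1).unsqueeze(1)
```

The intuition behind this split is that an additive bias shifts every encoder frame towards the anchor speaker's characteristics, while a multiplicative gate gives the joiner a way to suppress activations for frames dominated by background speech.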
2. ANCHORED SPEECH RECOGNITION
2.1. Preliminary: ASR with neural transducers
Given an utterance $\mathbf{x} = (x_1, \ldots, x_T)$, where $x_t \in \mathbb{R}^d$ denotes audio features, transducers model the conditional probability of the output sequence $\mathbf{y} = (y_1, \ldots, y_U)$, where $y_u \in \mathcal{Y}$ denotes output units such as graphemes or word-pieces. This is achieved by marginalizing over the set of all alignments $\mathbf{a} \in \bar{\mathcal{Y}}^*$, where $\bar{\mathcal{Y}} = \mathcal{Y} \cup \{\phi\}$ and $\phi$ is called the blank label. Formally,

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y})} P(\mathbf{a} \mid \mathbf{x}), \tag{1}$$

where $\mathcal{B}$ denotes the deterministic mapping that removes blank labels from an alignment.