ANCHORED SPEECH RECOGNITION WITH NEURAL TRANSDUCERS
Desh Raj1, Junteng Jia2, Jay Mahadeokar2, Chunyang Wu2, Niko Moritz2, Xiaohui Zhang2, Ozlem Kalinli2
1Center for Language and Speech Processing, Johns Hopkins University, USA, 2Meta AI, USA
ABSTRACT
Neural transducers have achieved human-level performance on standard speech recognition benchmarks. However, their performance degrades significantly in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio (SNR). Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., a wake-word) to recognize device-directed speech while ignoring interfering background speech. In this paper, we investigate anchored speech recognition to make neural transducers robust to background speech. We extract context information from the anchor segment with a tiny auxiliary network, and use encoder biasing and joiner gating to guide the transducer towards the target speech. Moreover, to improve the robustness of context-embedding extraction, we propose auxiliary training objectives that disentangle lexical content from speaking style. We evaluate our methods on synthetic LibriSpeech-based mixtures spanning several SNR and overlap conditions; averaged over all conditions, they reduce word error rate by 19.6% relative to a strong baseline.
Index Terms— RNN-T, background speech suppression, anchored speech recognition
1. INTRODUCTION
Neural transducers (using RNNs or transformers) [1] have become the dominant modeling technique in end-to-end on-device speech recognition [2–5], since they allow streaming transcription similar to CTC models [6–8], while still retaining conditional dependence, like attention-based encoder-decoders (AEDs) [9, 10]. Although they have shown state-of-the-art performance on several benchmarks [11], they still suffer from degradation caused by interference due to background speech and noise [12, 13]. Recent studies have used context audio for implicit speaker and environment adaptation of transducer models [14].
In this paper, we focus on the problem of suppressing background speech using explicit auxiliary information (often referred to as target speech extraction/recognition in the literature [15, 16]). Such auxiliary information is usually provided in the form of speaker embeddings (e.g., d-vectors in VoiceFilter-Lite [17, 18]) or enrollment utterances (e.g., SpeakerBeam [19, 20]). However, these strategies require the target speaker to be enrolled with the device, which may not always be feasible or desirable from a privacy perspective. In contrast, anchored speech recognition refers to a class of methods that use information from an anchor segment (such as a wake-word) to recognize device-directed speech. By relying only on the anchor segment and extracting the auxiliary information on-the-fly, these models bypass the need for a speaker enrollment stage. The idea was first proposed in the context of hybrid ASR systems [21] and later extended to AED models using a speaker encoder network to extract auxiliary information from the anchor segment [22].

(Work done during internship at Meta AI.)

Fig. 1: Overview of transducer-based anchored speech recognition. [Diagram: an auxiliary network extracts a context embedding from the anchor segment of the utterance "Hey Assistant, call John"; the embedding conditions the transducer (encoder, prediction network, joiner, softmax) through encoder biasing and joiner gating.]
We investigate anchored speech recognition to improve the performance of transducers in the presence of background speech. In particular, we add a tiny auxiliary network to extract context information from the anchor segment, and use it to bias the transducer towards the primary speaker. In order to disentangle speaking style from lexical content in the context embedding, we explore several auxiliary training objectives. We conduct controlled evaluations on LibriSpeech mixtures, where our models show relative word error rate (WER) improvements of 19.6%, on average, compared to an Emformer baseline trained with background augmentation.
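To make the two conditioning mechanisms concrete, the following is a minimal PyTorch sketch of one plausible realization, not the paper's implementation: the names (`AnchorConditioning`, `bias_proj`, `gate_proj`) and the exact placement of the bias and gate inside the network are our assumptions, and `ctx` stands for the context embedding produced by the tiny auxiliary network from the anchor-segment frames.

```python
# Hedged sketch of encoder biasing and joiner gating; all names are assumptions.
import torch
import torch.nn as nn

class AnchorConditioning(nn.Module):
    def __init__(self, ctx_dim: int, enc_dim: int, joiner_dim: int):
        super().__init__()
        self.bias_proj = nn.Linear(ctx_dim, enc_dim)     # for encoder biasing
        self.gate_proj = nn.Linear(ctx_dim, joiner_dim)  # for joiner gating

    def bias_encoder(self, enc_out: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # Add a projected context vector to every encoder frame.
        # enc_out: (B, T, enc_dim); ctx: (B, ctx_dim)
        return enc_out + self.bias_proj(ctx).unsqueeze(1)

    def gate_joiner(self, joiner_in: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # Scale the joiner input with an elementwise sigmoid gate in [0, 1],
        # letting the anchor embedding attenuate background-dominated frames.
        # joiner_in: (B, T, U, joiner_dim); ctx: (B, ctx_dim)
        gate = torch.sigmoid(self.gate_proj(ctx))        # (B, joiner_dim)
        return joiner_in * gate.unsqueeze(1).unsqueeze(1)
```

The intuition behind this split is that an additive bias shifts every encoder frame towards the anchor speaker's characteristics, while a multiplicative gate gives the joiner a way to suppress activations for frames dominated by background speech.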
2. ANCHORED SPEECH RECOGNITION
2.1. Preliminary: ASR with neural transducers
Given an utterance $\mathbf{x} = (x_1, \ldots, x_T)$, where $x_t \in \mathbb{R}^d$ denotes audio features, transducers model the conditional probability of the output sequence $\mathbf{y} = (y_1, \ldots, y_U)$, where $y_u \in \mathcal{Y}$ denotes output units such as graphemes or word-pieces. This is achieved by marginalizing over the set of all alignments $\mathbf{a} \in \bar{\mathcal{Y}}^*$, where $\bar{\mathcal{Y}} = \mathcal{Y} \cup \{\phi\}$ and $\phi$ is called the blank label. Formally,

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y})} P(\mathbf{a} \mid \mathbf{x}), \tag{1}$$

where $\mathcal{B}$ denotes the deterministic mapping that removes blank labels from an alignment.