
AUDIO-TO-INTENT USING ACOUSTIC-TEXTUAL SUBWORD REPRESENTATIONS
FROM END-TO-END ASR
Pranay Dighe, Prateeth Nayak, Oggi Rudovic, Erik Marchi, Xiaochuan Niu, Ahmed Tewfik
Apple
ABSTRACT
Accurate prediction of the user intent to interact with a voice assis-
tant (VA) on a device (e.g. a smartphone) is critical for achieving
naturalistic, engaging, and privacy-centric interactions with the VA.
To this end, we present a novel approach to predict the user inten-
tion (whether the user is speaking to the device or not) directly from
acoustic and textual information encoded in subword tokens, which are obtained via an end-to-end (E2E) ASR model. Modeling the subword tokens directly, rather than modeling phonemes and/or full words, has at least two advantages: (i) it provides a unique vocabulary representation in which each token carries a semantic meaning, in contrast to phoneme-level representations, and (ii) each subword token has a reusable “sub”-word acoustic pattern (which can be used to construct multiple full words), resulting in a vocabulary space that is much smaller than that of full words. To learn the subword representations for audio-to-intent classification, we extract: (i) acoustic
information from an E2E-ASR model, which provides frame-level
CTC posterior probabilities for the subword tokens, and (ii) tex-
tual information from a pretrained continuous bag-of-words model
capturing the semantic meaning of the subword tokens. The key
to our approach is that it combines acoustic subword-level posteriors with textual information, using positional encodings to account for multiple ASR hypotheses simultaneously. We show
that the proposed approach learns robust representations for audio-
to-intent classification and correctly mitigates 93.3% of unintended
user audio from invoking the VA at 99% true positive rate.
Index Terms—audio-to-intent, CTC posteriors, subword to-
kens, false trigger mitigation, end-to-end ASR
1. INTRODUCTION
In typical voice-assistant (VA) architectures on devices like smart-
phones, any input audio is first gated by a wake-word detection mod-
ule, which actively listens for a wake-word (e.g. “Hey Siri”, “Hey
Alexa”, “Okay Google”, and so on). It only allows audio anchored
with the wake-word to be processed by the downstream models. This
gating mechanism is often referred to as user intent classification (other names, such as false-trigger mitigation and device-directed-speech detection, are used interchangeably).
Prior work has mainly focused on keyword spotting and wake-word detection. These approaches typically rely on multi-stage neural-network-based processing of acoustic features to determine the presence of the wake-word [1, 2, 3, 4, 5]. Although the latest wake-word detectors are highly accurate, they can still mistake unintended speech as intended for the device. Such false alarms adversely affect user engagement and the overall experience, and raise privacy concerns. To mitigate this, some system architectures use ASR-based cues from the full context of the audio, in contrast to the wake-word detector, which focuses only on the hypothesized wake-word segment of the audio. ASR
lattice-based models have successfully been explored in this direction [6, 7, 8, 9, 10, 11, 12], showing that confusion in the ASR lattices provides a strong signal of falsely accepting unintended speech.

Fig. 1: Skeleton of our audio-to-intent approach
In this work, we propose a novel approach to user intent classification that detects whether a given speech utterance accepted by the wake-word detector is actually intended towards the VA or not. Unlike traditional intent classification approaches, which model the acoustic and textual spaces at the phoneme and word level, respectively [1]-[9], our audio-to-intent classification model learns robust acoustic and textual information at the subword-token level and provides improved classification accuracy compared to the baseline approaches.
In our approach, shown in Figure 1, an end-to-end ASR model
directly predicts the frame-level subword-token probabilities from
the audio. An acoustic module summarizes these subword-token posteriors into a sum-of-posteriors vector, obtained via the logsumexp operation. While we discard the frame-level granularity
of token probabilities, the sum-of-posteriors vector still captures
the acoustic content of the utterance as well as the uncertainty in
ASR when predicting the correct tokens. Such sum-of-posteriors
vectors were recently shown to be informative for training an ASR
model without knowing the order of the words [13]. The goal of
the acoustic module is to process acoustic information in the au-
dio, without explicitly modeling the semantics of the user’s speech.
For example, for a given query “what is deep learning”, the sum-
of-posteriors vectors would contain high probabilities for subword
tokens “what”, “ is”, “ deep”, “ learn”, “ing”, as well as non-zero
probabilities for other subword tokens that are (typically) confused
(e.g. the subword “ yearn” from the word “yearning”).
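As a concrete illustration, the following is a minimal sketch of how such a sum-of-posteriors vector could be computed from frame-level CTC log-posteriors, together with an entropy value reflecting ASR uncertainty. The use of PyTorch, the tensor shapes, the vocabulary size, and the softmax normalization before the entropy computation are assumptions made purely for illustration, not details prescribed by the approach described above.

```python
import torch


def sum_of_posteriors(ctc_log_probs: torch.Tensor) -> torch.Tensor:
    """Collapse frame-level CTC log-posteriors into an utterance-level vector.

    ctc_log_probs: tensor of shape (T, V) holding log-probabilities over the
    subword vocabulary (V tokens, including the CTC blank) for T output frames.
    Returns a (V,)-dimensional vector whose v-th entry is the log of the sum of
    the v-th token's posteriors across all frames (logsumexp over time).
    Frame order is discarded, but the acoustic content of the utterance and the
    ASR uncertainty about which tokens were spoken are retained.
    """
    return torch.logsumexp(ctc_log_probs, dim=0)


# Hypothetical sizes used only for illustration: 120 output frames, 2500 tokens.
log_probs = torch.randn(120, 2500).log_softmax(dim=-1)
sop = sum_of_posteriors(log_probs)  # shape: (2500,)

# Entropy of the normalized sum-of-posteriors distribution; a higher value
# indicates more confusion in the ASR about the underlying audio.
# (Normalizing with a softmax is an illustrative choice, not a prescribed step.)
probs = torch.softmax(sop, dim=-1)
entropy = -(probs * probs.log()).sum()
```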
To account for speech semantics, our architecture comprises a dedicated
textual module that processes a beam of top-N subword-tokens pre-
dicted at each output frame, and models the (contextual) semantic
information in the resulting subword-token sequences. Specifically,
each subword token is first represented by a pretrained continuous bag-of-words token-level embedding [14], which is further augmented with mean positional encodings that capture the sequential and multiple-hypotheses information from ASR. Finally,
we fuse the acoustic and textual representations learned from the
two submodules to train our audio-to-intent (A2I) classifier.
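To make the textual branch and the fusion step concrete, below is a minimal sketch under several illustrative assumptions: sinusoidal positional encodings, a frozen embedding table standing in for the pretrained continuous bag-of-words vectors, mean-pooling over frames and hypotheses, and a small feed-forward classification head. The class names, dimensions, and pooling choices are not prescribed by the description above; they only show one way the top-N subword tokens, their positional information, and the acoustic sum-of-posteriors vector could be combined into an audio-to-intent classifier.

```python
import torch
import torch.nn as nn


def sinusoidal_encoding(max_len: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encodings of shape (max_len, dim); dim must be even."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)              # (dim / 2,)
    angles = pos / torch.pow(torch.tensor(10000.0), idx / dim)      # (max_len, dim / 2)
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe


class TextualBranch(nn.Module):
    """Embeds the top-N subword tokens per output frame with pretrained (frozen)
    token embeddings, adds positional encodings, and mean-pools the result."""

    def __init__(self, cbow_weights: torch.Tensor, max_frames: int = 1000):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(cbow_weights, freeze=True)
        self.register_buffer("pe", sinusoidal_encoding(max_frames, cbow_weights.size(1)))

    def forward(self, topn_tokens: torch.Tensor) -> torch.Tensor:
        # topn_tokens: (T, N) token ids -- the top-N ASR hypotheses at each of T frames.
        T, N = topn_tokens.shape
        emb = self.embed(topn_tokens)                    # (T, N, D)
        pos = self.pe[:T].unsqueeze(1).expand(T, N, -1)  # same positional code for a frame's N hypotheses
        return (emb + pos).mean(dim=(0, 1))              # (D,) utterance-level textual summary


class AudioToIntentClassifier(nn.Module):
    """Fuses the acoustic sum-of-posteriors vector with the textual summary and
    predicts whether the utterance is intended for the voice assistant."""

    def __init__(self, vocab_size: int, emb_dim: int, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vocab_size + emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # intended vs. unintended
        )

    def forward(self, sop: torch.Tensor, text_summary: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([sop, text_summary], dim=-1))
```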
The main contributions of this work can be summarized as
follows. First, we propose a novel approach that combines mul-
tiple sources of information (acoustics and text) within a common subword-level representation learning framework. Second, we use sum-of-
posterior vectors that capture a multinomial probability distribution
over the subword token space. The entropy of the sum-of-posteriors
vectors encodes the uncertainty of ASR in decoding the underlying
audio. As we show in our experiments, this acoustic representa-