
AUDIO-TO-INTENT USING ACOUSTIC-TEXTUAL SUBWORD REPRESENTATIONS
FROM END-TO-END ASR
Pranay Dighe, Prateeth Nayak, Oggi Rudovic, Erik Marchi, Xiaochuan Niu, Ahmed Tewfik
Apple
ABSTRACT
Accurate prediction of the user intent to interact with a voice assis-
tant (VA) on a device (e.g. a smartphone) is critical for achieving
naturalistic, engaging, and privacy-centric interactions with the VA.
To this end, we present a novel approach to predict the user inten-
tion (whether the user is speaking to the device or not) directly from
acoustic and textual information encoded in subword tokens, which are obtained via an end-to-end (E2E) ASR model. Modeling the subword tokens directly, rather than modeling phonemes and/or full words, has at least two advantages: (i) it provides a unique vocabulary representation in which each token carries a semantic meaning, in contrast to phoneme-level representations, and (ii) each subword token has a reusable “sub”-word acoustic pattern (which can be used to construct multiple full words), resulting in a vocabulary space that is much smaller than that of full words. To learn the subword representations for audio-to-intent classification, we extract: (i) acoustic
information from an E2E-ASR model, which provides frame-level
CTC posterior probabilities for the subword tokens, and (ii) tex-
tual information from a pretrained continuous bag-of-words model
capturing the semantic meaning of the subword tokens. The key
to our approach is that it combines acoustic subword-level posteriors with textual information, using positional encodings to account for multiple ASR hypotheses simultaneously. We show
that the proposed approach learns robust representations for audio-
to-intent classification and correctly mitigates 93.3% of unintended
user audio from invoking the VA at 99% true positive rate.
Index Terms—audio-to-intent, CTC posteriors, subword to-
kens, false trigger mitigation, end-to-end ASR
1. INTRODUCTION
In typical voice-assistant (VA) architectures on devices like smart-
phones, any input audio is first gated by a wake-word detection mod-
ule, which actively listens for a wake-word (e.g. “Hey Siri”, “Hey
Alexa”, “Okay Google”, and so on). It only allows audio anchored
with the wake-word to be processed by the downstream models. This
gating mechanism is often referred to as user intent classification (other names, such as false-trigger mitigation and device-directed-speech detection, are used interchangeably).
Prior work has mainly focused on keyword spotting and wake-word detection. These approaches typically rely on multi-stage neural-network-based processing of acoustic features to determine the presence of the wake-word [1, 2, 3, 4, 5]. Although the latest wake-word detectors are highly accurate, they can still mistake unintended speech as intended for the device. Such false alarms adversely affect user engagement and the overall experience, and raise privacy concerns. To mitigate this, some system architectures use ASR-based cues from the full context of the audio, in contrast to the wake-word detector, which focuses only on the hypothesized wake-word segment of the audio. ASR
lattice-based models have successfully been explored in this direction [6, 7, 8, 9, 10, 11, 12], showing that confusion in the ASR lattices provides a strong signal of falsely accepting unintended speech.

Fig. 1: Skeleton of our audio-to-intent approach
In this work, we propose a novel approach to user intent classification that detects whether a given speech utterance accepted by the wake-word detector is actually intended towards the VA or not. Unlike traditional intent classification approaches, which model the acoustic and textual spaces at the phoneme and word level, respectively [1]-[9], our audio-to-intent classification model learns robust acoustic and textual information at the subword-token level and provides improved classification accuracy compared to the baseline approaches.
In our approach, shown in Figure 1, an end-to-end ASR model
directly predicts the frame-level subword-token probabilities from
the audio. An acoustic module summarizes these subword-token posteriors into a sum-of-posteriors vector, obtained via the logsumexp operation. While we discard the frame-level granularity
of token probabilities, the sum-of-posteriors vector still captures
the acoustic content of the utterance as well as the uncertainty in
ASR when predicting the correct tokens. Such sum-of-posteriors
vectors were recently shown to be informative for training an ASR
model without knowing the order of the words [13]. The goal of
the acoustic module is to process acoustic information in the au-
dio, without explicitly modeling the semantics of the user’s speech.
For example, for a given query “what is deep learning”, the sum-
of-posteriors vectors would contain high probabilities for subword
tokens “what”, “ is”, “ deep”, “ learn”, “ing”, as well as non-zero
probabilities for other subword tokens that are (typically) confused
(e.g. the subword “ yearn” from the word “yearning”).
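As a concrete illustration, the following is a minimal sketch of how such a sum-of-posteriors vector could be computed from frame-level CTC log-posteriors, together with an entropy value reflecting ASR uncertainty. The use of PyTorch, the tensor shapes, the vocabulary size, and the softmax normalization before the entropy computation are assumptions made purely for illustration, not details prescribed by the approach described above.

```python
import torch


def sum_of_posteriors(ctc_log_probs: torch.Tensor) -> torch.Tensor:
    """Collapse frame-level CTC log-posteriors into an utterance-level vector.

    ctc_log_probs: tensor of shape (T, V) holding log-probabilities over the
    subword vocabulary (V tokens, including the CTC blank) for T output frames.
    Returns a (V,)-dimensional vector whose v-th entry is the log of the sum of
    the v-th token's posteriors across all frames (logsumexp over time).
    Frame order is discarded, but the acoustic content of the utterance and the
    ASR uncertainty about which tokens were spoken are retained.
    """
    return torch.logsumexp(ctc_log_probs, dim=0)


# Hypothetical sizes used only for illustration: 120 output frames, 2500 tokens.
log_probs = torch.randn(120, 2500).log_softmax(dim=-1)
sop = sum_of_posteriors(log_probs)  # shape: (2500,)

# Entropy of the normalized sum-of-posteriors distribution; a higher value
# indicates more confusion in the ASR about the underlying audio.
# (Normalizing with a softmax is an illustrative choice, not a prescribed step.)
probs = torch.softmax(sop, dim=-1)
entropy = -(probs * probs.log()).sum()
```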
To account for speech semantics, our architecture comprises a dedicated
textual module that processes a beam of top-N subword-tokens pre-
dicted at each output frame, and models the (contextual) semantic
information in the resulting subword-token sequences. Specifically,
each subword token is first represented by a pretrained continuous bag-of-words token-level embedding [14], which is further augmented with mean positional encodings that capture the sequential and multiple-hypotheses information from ASR. Finally,
we fuse the acoustic and textual representations learned from the
two submodules to train our audio-to-intent (A2I) classifier.
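To make the textual branch and the fusion step concrete, below is a minimal sketch under several illustrative assumptions: sinusoidal positional encodings, a frozen embedding table standing in for the pretrained continuous bag-of-words vectors, mean-pooling over frames and hypotheses, and a small feed-forward classification head. The class names, dimensions, and pooling choices are not prescribed by the description above; they only show one way the top-N subword tokens, their positional information, and the acoustic sum-of-posteriors vector could be combined into an audio-to-intent classifier.

```python
import torch
import torch.nn as nn


def sinusoidal_encoding(max_len: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encodings of shape (max_len, dim); dim must be even."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)              # (dim / 2,)
    angles = pos / torch.pow(torch.tensor(10000.0), idx / dim)      # (max_len, dim / 2)
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe


class TextualBranch(nn.Module):
    """Embeds the top-N subword tokens per output frame with pretrained (frozen)
    token embeddings, adds positional encodings, and mean-pools the result."""

    def __init__(self, cbow_weights: torch.Tensor, max_frames: int = 1000):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(cbow_weights, freeze=True)
        self.register_buffer("pe", sinusoidal_encoding(max_frames, cbow_weights.size(1)))

    def forward(self, topn_tokens: torch.Tensor) -> torch.Tensor:
        # topn_tokens: (T, N) token ids -- the top-N ASR hypotheses at each of T frames.
        T, N = topn_tokens.shape
        emb = self.embed(topn_tokens)                    # (T, N, D)
        pos = self.pe[:T].unsqueeze(1).expand(T, N, -1)  # same positional code for a frame's N hypotheses
        return (emb + pos).mean(dim=(0, 1))              # (D,) utterance-level textual summary


class AudioToIntentClassifier(nn.Module):
    """Fuses the acoustic sum-of-posteriors vector with the textual summary and
    predicts whether the utterance is intended for the voice assistant."""

    def __init__(self, vocab_size: int, emb_dim: int, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vocab_size + emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # intended vs. unintended
        )

    def forward(self, sop: torch.Tensor, text_summary: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([sop, text_summary], dim=-1))
```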
The main contributions of this work can be summarized as
follows. First, we propose a novel approach that combines mul-
tiple sources of information (acoustics and text) within a common subword-level representation learning framework. Second, we use sum-of-
posterior vectors that capture a multinomial probability distribution
over the subword token space. The entropy of the sum-of-posteriors
vectors encodes the uncertainty of ASR in decoding the underlying
audio. As we show in our experiments, this acoustic representa-