Can Visual Context Improve Automatic Speech Recognition
for an Embodied Agent?
Pradip Pramanick and Chayan Sarkar
Robotics & Autonomous Systems
TCS Research, India
{pradip.pramanick,sarkar.chayan}@tcs.com
Abstract

The use of automatic speech recognition (ASR) systems is becoming omnipresent, ranging from personal assistants to chatbots, home, and industrial automation systems. Modern robots are also equipped with ASR capabilities for interacting with humans, as speech is the most natural interaction modality. However, ASR in robots faces additional challenges as compared to a personal assistant. Being an embodied agent, a robot must recognize the physical entities around it and therefore reliably recognize speech containing descriptions of such entities. However, current ASR systems are often unable to do so due to limitations in ASR training, such as generic datasets and open-vocabulary modeling. Moreover, adverse conditions during inference, such as noise and accented or far-field speech, make the transcription inaccurate. In this work, we present a method to incorporate a robot's visual information into an ASR system and improve the recognition of a spoken utterance containing a visible entity. Specifically, we propose a new decoder biasing technique to incorporate the visual context while ensuring that the ASR output does not degrade for incorrect context. We achieve a 59% relative reduction in WER from an unmodified ASR system.
1 Introduction
Spoken interaction with a robot not only increases its usability and acceptability but also provides a natural mode of interaction, even for a novice user. The recent development of deep learning-based end-to-end automatic speech recognition (ASR) systems has achieved very high accuracy (Li, 2021) compared to traditional ASR systems. As a result, we see a huge surge of speech-based interfaces for many systems, including robots. However, the accuracy of any state-of-the-art ASR is significantly impacted by the dialect of the speaker, the distance of the speaker from the microphone, ambient noise, etc., particularly for novel and low-frequency vocabularies. These factors are often predominant in many robotic applications. This not only results in poor translation accuracy but also impacts the instruction understanding and task execution capability of the robot.

Figure 1: A simple pipeline of speech interface for human-robot interaction. (In the depicted example, the acoustic model transcribes "take the pink pillow" as "take the bink illow"; the object detector yields "a black lamp on a table", "the sofa is black", "a pink pillow on the sofa"; and task planning fails.)

Figure 2: Robust Speech Interface (RoSI) for embodied agents with shallow fusion biasing using dynamic biasing vocabulary. (In the depicted example, dynamic context extraction builds the vocabulary [black, lamp, table, sofa, black, pink, pillow] from the same captions; shallow fusion biasing recovers "take the pink pillow"; and the plan MOVE_TO sofa, LOCALIZE pink-pillow, PICK_UP pink-pillow is produced.)
Figure 1 depicts a typical scenario where an agent¹ uses an ASR to translate audio input to text. Then, it detects the set of objects in its vicinity using an object detector. Finally, it matches the object mentioned in the command against the objects detected in the vicinity (grounding) to narrow down the target object before execution. If the audio translation is erroneous, the grounding can fail, which leads to failure in task execution. For example, in Figure 1, even though the user mentioned "pink pillow", the translation was "bink illow", which results in failure in task grounding.

¹In this article, we use robot and agent interchangeably.
There has been an increasing interest in contextual speech recognition, primarily applied to voice-based assistants (Williams et al., 2018; Pundak et al., 2018; Chen et al., 2019; He et al., 2019; Gourav et al., 2021; Le et al., 2021). However, incorporating visual context into a speech recognizer is usually modeled as a multi-modal speech recognition problem (Michelsanti et al., 2021), often simplified to lip-reading (Ghorbani et al., 2021). Attempts to utilize visual context in robotic agents also follow the same approach (Oneață and Cucu, 2021). Such models always require a pair of speech and visual input, which fails to tackle cases where the visual context is irrelevant to the speech.
In contrast, we consider the visual context as a source of dynamic prior knowledge. Thus, we bias the prediction of the speech recognizer to include information from the prior knowledge, provided some relevant information is found. There are two primary approaches to introducing bias in an ASR system, namely shallow and deep fusion. Shallow fusion based approaches perform rescoring of transcription hypotheses upon detection of biasing words during beam search (Williams et al., 2018; Kannan et al., 2018). Class-based language models have been proposed to utilize the prefix context of biasing words (Chen et al., 2019; Kang and Zhou, 2020). Zhao et al. (2019) further improved the shallow-fusion biasing model by introducing sub-word biasing and prefix-based activation. Gourav et al. (2021) propose 2-pass language model rescoring with sub-word biasing for more effective shallow fusion.
Deep-fusion biasing approaches use a pre-set biasing vocabulary to encode biasing phrases into embeddings that are applied using an attention-based decoder (Pundak et al., 2018). This is further improved by using adversarial examples (Alon et al., 2019), complex attention modeling (Chang et al., 2021; Sun et al., 2021), and prefix disambiguation (Han et al., 2021). These approaches can handle irrelevant and empty contexts but are less scalable when applied to subword units (Pundak et al., 2018). Furthermore, a static biasing vocabulary is unsuitable for some applications, including the one described in this paper. Recent works propose hybrid systems, applying both shallow and deep fusion to achieve state-of-the-art results (Le et al., 2021). Spelling correction models are also included for additional accuracy gains (Wang et al., 2021; Leng et al., 2021).
In this article, we propose a robust speech interface pipeline for embodied agents, called RoSI, that augments existing ASR systems (Figure 2). Using an object detector, a set of (natural language) phrases about the objects in the scene is generated. A biasing vocabulary is built from these generated captions on the fly, or it can be pre-computed whenever the robot moves to a new location. Our main contributions are twofold:

• We propose a new shallow fusion biasing algorithm that introduces a non-greedy pruning strategy to allow biasing at the word level using sub-word level information.

• We apply this biasing algorithm to develop a speech recognition system for a robot that uses the robot's visual context to improve the accuracy of the speech recognizer.
2 Background
We adopt connectionist temporal classification (CTC) (Graves et al., 2006) based modeling in the baseline ASR model in our experiments. A CTC based ASR model outputs a sequence of probability distributions over the target vocabulary (usually characters), $y = \{y_1, \ldots, y_T\}$, given an input speech signal $x = \{x_1, \ldots, x_L\}$ of length $L$, where $L > T$, thus computing,

$$P(y|x) = \prod_{i=1}^{T} P(y_i|x). \quad (1)$$
The output sequence with the maximum likelihood is usually approximated using a beam search (Hannun et al., 2014). During this beam search decoding, shallow-fusion biasing proposes rescoring an output sequence hypothesis containing one or more biasing words (Hall et al., 2015; Williams et al., 2018; Kannan et al., 2018). Assuming a list of biasing words/phrases is available before producing the transcription, a rescoring function provides a new score for the matching hypothesis that is either interpolated or used to boost the log probability of the output sequence hypothesis (Williams et al., 2018),

$$s(y) = \log P(y|x) + \lambda \log B(y), \quad (2)$$

where $B(y)$ provides a contextual biasing score of the partial transcription $y$ and $\lambda$ is a scaling factor.
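For illustration, here is a minimal sketch of this rescoring step in Python. The simple word-containment test and the toy biasing score are our own assumptions; the paper's actual $B(y)$ is defined later in its Section 4.2:

```python
import math

def rescore(hypothesis, log_p, bias_vocab, lam=2.0):
    """Shallow-fusion rescoring in the spirit of Eq. (2): boost the
    log probability of a hypothesis containing biasing words.
    `lam` plays the role of the scaling factor lambda."""
    matches = [w for w in bias_vocab if w in hypothesis.split()]
    if not matches:
        return log_p  # no biasing word found; score unchanged
    # Toy biasing score B(y): grows with the number of matched words.
    b = len(matches) + 1.0
    return log_p + lam * math.log(b)
```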
A major limitation of this approach is ineffective biasing due to the early pruning of hypotheses. To enable open-vocabulary speech recognition, ASR networks generally predict sub-word unit labels (e.g., characters) instead of directly predicting the word sequence. However, as the beam search keeps a fixed number of candidates in each time-step $i \le L$, lower-ranked hypotheses that contain incomplete biasing words are pruned by the beam search before the word boundary is reached.
To counter this, biasing at the sub-word units (grapheme/word-piece) by weight pushing has been proposed in (Pundak et al., 2018). Biasing at the grapheme level improves recognition accuracy over word-level biasing for speech containing bias terms, but the performance degrades severely for general speech where the bias context is irrelevant. Subsequently, two improvements have been proposed: (i) biasing at the word-piece level (Chen et al., 2019; Zhao et al., 2019; Gourav et al., 2021) and (ii) utilizing known prefixes of the biasing terms to perform contextual biasing (Zhao et al., 2019; Gourav et al., 2021). Although word-piece biasing shows less degradation than grapheme-level biasing on general speech, it performs worse than word-level biasing when using high biasing weights (Gourav et al., 2021). Also, prefix-context based biasing is ineffective when a biasing term is out of context. Moreover, a general problem with the weight-pushing based approaches to sub-word level biasing is that they require a static/class-specific biasing vocabulary to work, usually compiled as a weighted finite state transducer (WFST). They also require costly operations to be included with the primary WFST-based decoder. However, for a frequently changing biasing vocabulary, e.g., one that changes with the agent's movement, frequent re-compiling and merging of the WFST is inefficient.
Therefore, we propose an approach that retains the benefits of word-level biasing for general speech, while also preventing early pruning of partially matching hypotheses using a modified beam search algorithm. During beam search, our algorithm allocates a fixed portion of the beam width to be influenced by sub-word level look-ahead, which does not affect the intrinsic ranking of the other hypotheses. This property is not guaranteed in weight pushing (Chen et al., 2019; Gourav et al., 2021), which directly performs subword-level biasing. Moreover, we specifically target transcribing robotic instructions, which usually include descriptions of everyday objects. Thus the words in the biasing vocabulary are often present in a standard language model (LM), whereas existing biasing models focus on biasing out-of-vocabulary (OOV) terms such as person names. We also utilize this distinction by incorporating an n-gram language model to contextually scale the biasing score. We describe our shallow-fusion model in Section 4.2.
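To make the reserved-slot idea concrete, below is a simplified sketch of one beam expansion step. It is not the paper's exact algorithm: the `trie.has_prefix` method, the slot policy, and the omission of CTC blank/merge handling are all our simplifications:

```python
def beam_step(beams, frame_log_probs, alphabet, trie,
              beam_width=16, reserved=4):
    """One step of a beam search that reserves `reserved` slots for
    hypotheses whose trailing partial word is a prefix of a biasing
    word, so they survive until the word boundary is reached.
    `beams` is a list of (text, score); CTC blanks are ignored here."""
    candidates = []
    for text, score in beams:
        for ch, lp in zip(alphabet, frame_log_probs):
            candidates.append((text + ch, score + lp))
    # Rank all expansions by score; the ordinary hypotheses' relative
    # ranking is untouched by the biasing trie.
    candidates.sort(key=lambda c: c[1], reverse=True)
    top = candidates[:beam_width - reserved]
    # Fill the reserved slots with the best lower-ranked hypotheses
    # whose current partial word could still complete a biasing word.
    rest = [c for c in candidates[beam_width - reserved:]
            if trie.has_prefix(c[0].rsplit(" ", 1)[-1])]
    return top + rest[:reserved]
```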
3 System Overview
In this section, we present an overview of the embodied agent that executes natural language instructions, as depicted in Figure 3. Given a speech input, the agent also captures an image from its ego-view camera. The dynamic context extraction module extracts the visual context from the captured image before producing the transcription of the speech input. First, a dense image captioning model predicts several bounding boxes of interest in the image and generates a natural language description for each of them. Given the dense-captioned image, a bias target prediction model predicts a list of biasing words to be used in speech recognition.
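As a rough illustration of this stage, a biasing vocabulary could be derived from the captions with a simple content-word filter. This heuristic stand-in (including the stop-word list) is our own sketch, not the paper's learned bias target prediction model:

```python
STOP_WORDS = {"a", "an", "the", "is", "on", "in", "of", "and"}

def extract_bias_words(captions):
    """Collect candidate biasing words from dense captions by
    dropping common function words."""
    vocab = []
    for caption in captions:
        for word in caption.lower().split():
            if word not in STOP_WORDS and word not in vocab:
                vocab.append(word)
    return vocab

captions = ["a black lamp on a table", "the sofa is black",
            "a pink pillow on the sofa"]
print(extract_bias_words(captions))
# ['black', 'lamp', 'table', 'sofa', 'pink', 'pillow']
```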
The list of biasing words/phrases is compiled into a prefix tree (trie) that is used by the beam search decoder to prevent the pruning of partially matched hypotheses. The trie is dynamically re-created as the agent moves and captures a new image. The acoustic model processes the speech input to produce a sequence of probability distributions over a character vocabulary. We use Wav2Vec2 (Baevski et al., 2020) for acoustic modeling of the speech. This sequence is decoded into the transcription using a modified beam search decoding algorithm. During the beam search, the visual context, represented by the biasing trie, is used to produce a transcription that is likely to contain the word(s) from the visual context. We describe our biasing approach in detail in Section 4.2.
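For reference, a minimal sketch of obtaining the per-frame character distributions from a pretrained Wav2Vec2 model via the Hugging Face transformers library; the specific checkpoint name is our assumption, and the paper's exact model and decoder may differ:

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Hypothetical checkpoint; the paper does not specify one here.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def char_distributions(speech, sr=16000):
    """Return per-frame log-probabilities over the character
    vocabulary, to be consumed by the biased beam search decoder."""
    inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, frames, vocab)
    return torch.log_softmax(logits, dim=-1)[0]
```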
Given the transcribed instruction, the task understanding & planning module performs task type classification, argument extraction, and task planning. We use a conditional random field (CRF) based model as proposed in our earlier works (Pramanick et al., 2019, 2020). Specifically, the transcribed instruction is passed through a task-crf model that labels tokens in the transcription from a set of task types. Given the output of task-crf, an argument-crf model labels text spans in the instruction from a set of argument labels. This results in an annotated instruction such as,

$[\text{Take}]_{taking}$ $[\text{the pink pillow}]_{theme}$.
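As an illustration, the annotated instruction above can be expanded into the basic-action plan shown in Figure 2. The data structures and the plan-template lookup here are our own sketch, assuming grounding has already resolved the pillow's location to the sofa:

```python
annotation = {"task_type": "taking",
              "arguments": {"theme": "pink pillow"}}

# Hypothetical plan templates mapping a task type to basic actions.
PLANS = {"taking": ["MOVE_TO {location}",
                    "LOCALIZE {theme}",
                    "PICK_UP {theme}"]}

def plan(annotation, location):
    """Expand an annotated instruction into a sequence of basic actions."""
    theme = annotation["arguments"]["theme"].replace(" ", "-")
    return [step.format(theme=theme, location=location)
            for step in PLANS[annotation["task_type"]]]

print(plan(annotation, "sofa"))
# ['MOVE_TO sofa', 'LOCALIZE pink-pillow', 'PICK_UP pink-pillow']
```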
To perform a high-level task mentioned in the instruction (e.g., taking), the agent needs to perform a sequence of basic actions as produced by the task planner. The predicted task type is matched