Can Visual Context Improve Automatic Speech Recognition
for an Embodied Agent?
Pradip Pramanick and Chayan Sarkar
Robotics & Autonomous Systems
TCS Research, India
{pradip.pramanick,sarkar.chayan}@tcs.com
Abstract

The use of automatic speech recognition (ASR) systems is becoming omnipresent, ranging from personal assistants to chatbots, home, and industrial automation systems. Modern robots are also equipped with ASR capabilities for interacting with humans, as speech is the most natural interaction modality. However, ASR in robots faces additional challenges as compared to a personal assistant. Being an embodied agent, a robot must recognize the physical entities around it and therefore reliably recognize speech containing descriptions of such entities. However, current ASR systems are often unable to do so due to limitations in ASR training, such as generic datasets and open-vocabulary modeling. Moreover, adverse conditions during inference, such as noise and accented or far-field speech, make the transcription inaccurate. In this work, we present a method to incorporate a robot's visual information into an ASR system and improve the recognition of a spoken utterance containing a visible entity. Specifically, we propose a new decoder biasing technique to incorporate the visual context while ensuring that the ASR output does not degrade for incorrect context. We achieve a 59% relative reduction in WER from an unmodified ASR system.
1 Introduction
Spoken interaction with a robot not only increases its usability and acceptability but also provides a natural mode of interaction, even for a novice user. The recent development of deep learning-based end-to-end automatic speech recognition (ASR) systems has achieved very high accuracy (Li, 2021) compared to traditional ASR systems. As a result, we see a huge surge of speech-based interfaces for many systems, including robots. However, the accuracy of any state-of-the-art ASR is significantly impacted by the dialect of the speaker, the distance of the speaker from the microphone, ambient noise, etc., particularly for novel and low-frequency vocabularies. These factors are often predominant in many robotic applications. This not only results in poor translation accuracy but also impacts the instruction understanding and task execution capability of the robot.

Figure 1: A simple pipeline of speech interface for human-robot interaction. (In the depicted example, the acoustic model transcribes "take the pink pillow" as "take the bink illow"; the object detector yields "a black lamp on a table", "the sofa is black", "a pink pillow on the sofa"; and task planning fails.)

Figure 2: Robust Speech Interface (RoSI) for embodied agents with shallow fusion biasing using dynamic biasing vocabulary. (In the depicted example, dynamic context extraction builds the vocabulary [black, lamp, table, sofa, black, pink, pillow] from the same captions; shallow fusion biasing recovers "take the pink pillow"; and the plan MOVE_TO sofa, LOCALIZE pink-pillow, PICK_UP pink-pillow is produced.)
Figure 1 depicts a typical scenario where an agent¹ uses an ASR to translate audio input to text. Then, it detects the set of objects in its vicinity using an object detector. Finally, it matches the object mentioned in the command against the objects detected in the vicinity (grounding) to narrow down the target object before execution. If the audio translation is erroneous, the grounding can fail, which leads to failure in task execution. For example, in Figure 1, even though the user mentioned "pink pillow", the translation was "bink illow", which results in failure in task grounding.

¹In this article, we use robot and agent interchangeably.
There has been an increasing interest in contextual speech recognition, primarily applied to voice-based assistants (Williams et al., 2018; Pundak et al., 2018; Chen et al., 2019; He et al., 2019; Gourav et al., 2021; Le et al., 2021). However, incorporating visual context into a speech recognizer is usually modeled as a multi-modal speech recognition problem (Michelsanti et al., 2021), often simplified to lip-reading (Ghorbani et al., 2021). Attempts to utilize visual context in robotic agents also follow the same approach (Oneață and Cucu, 2021). Such models always require a pair of speech and visual input, which fails to tackle cases where the visual context is irrelevant to the speech.
In contrast, we consider the visual context as a source of dynamic prior knowledge. Thus, we bias the prediction of the speech recognizer to include information from the prior knowledge, provided some relevant information is found. There are two primary approaches to introducing bias in an ASR system, namely shallow and deep fusion. Shallow fusion based approaches perform rescoring of transcription hypotheses upon detection of biasing words during beam search (Williams et al., 2018; Kannan et al., 2018). Class-based language models have been proposed to utilize the prefix context of biasing words (Chen et al., 2019; Kang and Zhou, 2020). Zhao et al. (2019) further improved the shallow-fusion biasing model by introducing sub-word biasing and prefix-based activation. Gourav et al. (2021) propose 2-pass language model rescoring with sub-word biasing for more effective shallow fusion.
Deep-fusion biasing approaches use a pre-set biasing vocabulary to encode biasing phrases into embeddings that are applied using an attention-based decoder (Pundak et al., 2018). This is further improved by using adversarial examples (Alon et al., 2019), complex attention modeling (Chang et al., 2021; Sun et al., 2021), and prefix disambiguation (Han et al., 2021). These approaches can handle irrelevant and empty contexts but are less scalable when applied to subword units (Pundak et al., 2018). Furthermore, a static biasing vocabulary is unsuitable for some applications, including the one described in this paper. Recent works propose hybrid systems, applying both shallow and deep fusion to achieve state-of-the-art results (Le et al., 2021). Spelling correction models are also included for additional accuracy gains (Wang et al., 2021; Leng et al., 2021).
In this article, we propose a robust speech interface pipeline for embodied agents, called RoSI, that augments existing ASR systems (Figure 2). Using an object detector, a set of (natural language) phrases about the objects in the scene is generated. A biasing vocabulary is built from these generated captions on the fly, or it can be pre-computed whenever the robot moves to a new location. Our main contributions are twofold:

• We propose a new shallow fusion biasing algorithm that introduces a non-greedy pruning strategy to allow biasing at the word level using sub-word level information.

• We apply this biasing algorithm to develop a speech recognition system for a robot that uses the robot's visual context to improve the accuracy of the speech recognizer.
2 Background
We adopt connectionist temporal classification (CTC) (Graves et al., 2006) based modeling in the baseline ASR model in our experiments. A CTC based ASR model outputs a sequence of probability distributions over the target vocabulary (usually characters), $y = \{y_1, \ldots, y_T\}$, given an input speech signal $x = \{x_1, \ldots, x_L\}$ of length $L$, where $L > T$, thus computing,

$$P(y|x) = \prod_{i=1}^{T} P(y_i|x). \quad (1)$$
The output sequence with the maximum likelihood is usually approximated using a beam search (Hannun et al., 2014). During this beam search decoding, shallow-fusion biasing proposes rescoring an output sequence hypothesis containing one or more biasing words (Hall et al., 2015; Williams et al., 2018; Kannan et al., 2018). Assuming a list of biasing words/phrases is available before producing the transcription, a rescoring function provides a new score for the matching hypothesis that is either interpolated or used to boost the log probability of the output sequence hypothesis (Williams et al., 2018),

$$s(y) = \log P(y|x) + \lambda \log B(y), \quad (2)$$

where $B(y)$ provides a contextual biasing score of the partial transcription $y$ and $\lambda$ is a scaling factor.
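For illustration, here is a minimal sketch of this rescoring step in Python. The simple word-containment test and the toy biasing score are our own assumptions; the paper's actual $B(y)$ is defined later in its Section 4.2:

```python
import math

def rescore(hypothesis, log_p, bias_vocab, lam=2.0):
    """Shallow-fusion rescoring in the spirit of Eq. (2): boost the
    log probability of a hypothesis containing biasing words.
    `lam` plays the role of the scaling factor lambda."""
    matches = [w for w in bias_vocab if w in hypothesis.split()]
    if not matches:
        return log_p  # no biasing word found; score unchanged
    # Toy biasing score B(y): grows with the number of matched words.
    b = len(matches) + 1.0
    return log_p + lam * math.log(b)
```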
A major limitation of this approach is ineffective biasing due to the early pruning of hypotheses. To enable open-vocabulary speech recognition, ASR networks generally predict sub-word unit labels (e.g., characters) instead of directly predicting the word sequence. However, as the beam search keeps a fixed number of candidates in each time-step $i \le L$, lower-ranked hypotheses that contain incomplete biasing words are pruned by the beam search before the word boundary is reached.
To counter this, biasing at the sub-word units (grapheme/word-piece) by weight pushing has been proposed in (Pundak et al., 2018). Biasing at the grapheme level improves recognition accuracy over word-level biasing for speech containing bias terms, but the performance degrades severely for general speech where the bias context is irrelevant. Subsequently, two improvements have been proposed: (i) biasing at the word-piece level (Chen et al., 2019; Zhao et al., 2019; Gourav et al., 2021) and (ii) utilizing known prefixes of the biasing terms to perform contextual biasing (Zhao et al., 2019; Gourav et al., 2021). Although word-piece biasing shows less degradation than grapheme-level biasing on general speech, it performs worse than word-level biasing when using high biasing weights (Gourav et al., 2021). Also, prefix-context based biasing is ineffective when a biasing term is out of context. Moreover, a general problem with the weight-pushing based approaches to sub-word level biasing is that they require a static/class-specific biasing vocabulary to work, usually compiled as a weighted finite state transducer (WFST). They also require costly operations to be included with the primary WFST-based decoder. However, for a frequently changing biasing vocabulary, e.g., one that changes with the agent's movement, frequent re-compiling and merging of the WFST is inefficient.
Therefore, we propose an approach that retains the benefits of word-level biasing for general speech, while also preventing early pruning of partially matching hypotheses using a modified beam search algorithm. During beam search, our algorithm allocates a fixed portion of the beam width to be influenced by sub-word level look-ahead, which does not affect the intrinsic ranking of the other hypotheses. This property is not guaranteed in weight pushing (Chen et al., 2019; Gourav et al., 2021), which directly performs subword-level biasing. Moreover, we specifically target transcribing robotic instructions, which usually include descriptions of everyday objects. Thus the words in the biasing vocabulary are often present in a standard language model (LM), whereas existing biasing models focus on biasing out-of-vocabulary (OOV) terms such as person names. We also utilize this distinction by incorporating an n-gram language model to contextually scale the biasing score. We describe our shallow-fusion model in Section 4.2.
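To make the reserved-slot idea concrete, below is a simplified sketch of one beam expansion step. It is not the paper's exact algorithm: the `trie.has_prefix` method, the slot policy, and the omission of CTC blank/merge handling are all our simplifications:

```python
def beam_step(beams, frame_log_probs, alphabet, trie,
              beam_width=16, reserved=4):
    """One step of a beam search that reserves `reserved` slots for
    hypotheses whose trailing partial word is a prefix of a biasing
    word, so they survive until the word boundary is reached.
    `beams` is a list of (text, score); CTC blanks are ignored here."""
    candidates = []
    for text, score in beams:
        for ch, lp in zip(alphabet, frame_log_probs):
            candidates.append((text + ch, score + lp))
    # Rank all expansions by score; the ordinary hypotheses' relative
    # ranking is untouched by the biasing trie.
    candidates.sort(key=lambda c: c[1], reverse=True)
    top = candidates[:beam_width - reserved]
    # Fill the reserved slots with the best lower-ranked hypotheses
    # whose current partial word could still complete a biasing word.
    rest = [c for c in candidates[beam_width - reserved:]
            if trie.has_prefix(c[0].rsplit(" ", 1)[-1])]
    return top + rest[:reserved]
```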
3 System Overview
In this section, we present an overview of the embodied agent that executes natural language instructions, as depicted in Figure 3. Given a speech input, the agent also captures an image from its ego-view camera. The dynamic context extraction module extracts the visual context from the captured image before producing the transcription of the speech input. First, a dense image captioning model predicts several bounding boxes of interest in the image and generates a natural language description for each of them. Given the dense-captioned image, a bias target prediction model predicts a list of biasing words to be used in speech recognition.
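As a rough illustration of this stage, a biasing vocabulary could be derived from the captions with a simple content-word filter. This heuristic stand-in (including the stop-word list) is our own sketch, not the paper's learned bias target prediction model:

```python
STOP_WORDS = {"a", "an", "the", "is", "on", "in", "of", "and"}

def extract_bias_words(captions):
    """Collect candidate biasing words from dense captions by
    dropping common function words."""
    vocab = []
    for caption in captions:
        for word in caption.lower().split():
            if word not in STOP_WORDS and word not in vocab:
                vocab.append(word)
    return vocab

captions = ["a black lamp on a table", "the sofa is black",
            "a pink pillow on the sofa"]
print(extract_bias_words(captions))
# ['black', 'lamp', 'table', 'sofa', 'pink', 'pillow']
```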
The list of biasing words/phrases is compiled into a prefix tree (trie) that is used by the beam search decoder to prevent the pruning of partially matched hypotheses. The trie is dynamically re-created as the agent moves and captures a new image. The acoustic model processes the speech input to produce a sequence of probability distributions over a character vocabulary. We use Wav2Vec2 (Baevski et al., 2020) for acoustic modeling of the speech. This sequence is decoded into the transcription using a modified beam search decoding algorithm. During the beam search, the visual context, represented by the biasing trie, is used to produce a transcription that is likely to contain the word(s) from the visual context. We describe our biasing approach in detail in Section 4.2.
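For reference, a minimal sketch of obtaining the per-frame character distributions from a pretrained Wav2Vec2 model via the Hugging Face transformers library; the specific checkpoint name is our assumption, and the paper's exact model and decoder may differ:

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Hypothetical checkpoint; the paper does not specify one here.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def char_distributions(speech, sr=16000):
    """Return per-frame log-probabilities over the character
    vocabulary, to be consumed by the biased beam search decoder."""
    inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, frames, vocab)
    return torch.log_softmax(logits, dim=-1)[0]
```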
Given the transcribed instruction, the task understanding & planning module performs task type classification, argument extraction, and task planning. We use a conditional random field (CRF) based model as proposed in our earlier works (Pramanick et al., 2019, 2020). Specifically, the transcribed instruction is passed through a task-crf model that labels tokens in the transcription from a set of task types. Given the output of task-crf, an argument-crf model labels text spans in the instruction from a set of argument labels. This results in an annotated instruction such as,

$[\text{Take}]_{taking}$ $[\text{the pink pillow}]_{theme}$.
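As an illustration, the annotated instruction above can be expanded into the basic-action plan shown in Figure 2. The data structures and the plan-template lookup here are our own sketch, assuming grounding has already resolved the pillow's location to the sofa:

```python
annotation = {"task_type": "taking",
              "arguments": {"theme": "pink pillow"}}

# Hypothetical plan templates mapping a task type to basic actions.
PLANS = {"taking": ["MOVE_TO {location}",
                    "LOCALIZE {theme}",
                    "PICK_UP {theme}"]}

def plan(annotation, location):
    """Expand an annotated instruction into a sequence of basic actions."""
    theme = annotation["arguments"]["theme"].replace(" ", "-")
    return [step.format(theme=theme, location=location)
            for step in PLANS[annotation["task_type"]]]

print(plan(annotation, "sofa"))
# ['MOVE_TO sofa', 'LOCALIZE pink-pillow', 'PICK_UP pink-pillow']
```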
To perform a high-level task mentioned in the instruction (e.g., taking), the agent needs to perform a sequence of basic actions as produced by the task planner. The predicted task type is matched