keeps a fixed number of candidates in each time step, lower-ranked hypotheses that contain incomplete biasing words are pruned by the beam search before the word boundary is reached. For example, a hypothesis ending in the partial prefix "pil" of the biasing word "pillow" may be discarded before the complete word can be scored.
To counter this, biasing at sub-word units (grapheme/word-piece) via weight pushing has been proposed (Pundak et al., 2018). Biasing at the grapheme level yields better recognition accuracy than word-level biasing for speech containing bias terms, but performance degrades severely for general speech where the bias context is irrelevant. Subsequently, two improvements have been proposed: i) biasing at the word-piece level (Chen et al., 2019; Zhao et al., 2019; Gourav et al., 2021), and ii) utilizing known prefixes of the biasing terms to perform contextual biasing (Zhao et al., 2019; Gourav et al., 2021). Although word-piece biasing shows less degradation than grapheme-level biasing on general speech, it performs worse than word-level biasing when using high biasing weights (Gourav et al., 2021). Also, prefix-context based biasing is ineffective when a biasing term is out of context.
Moreover, a general problem with weight-pushing based approaches, as compared to sub-word level biasing, is that they require a static/class-specific biasing vocabulary to work, usually compiled as a weighted finite-state transducer (WFST), along with costly operations to integrate it with the primary WFST-based decoder. For a frequently changing biasing vocabulary, e.g., one that changes with the agent's movement, repeated re-compilation and merging of the WFST is inefficient.
Therefore, we propose an approach that retains the benefits of word-level biasing for general speech while preventing early pruning of partially matching hypotheses, using a modified beam search algorithm. During beam search, our algorithm allocates a fixed portion of the beam width to be influenced by sub-word level look-ahead, which does not affect the intrinsic ranking of the other hypotheses. This property is not guaranteed in weight-pushing (Chen et al., 2019; Gourav et al., 2021), which directly performs sub-word level biasing. Moreover, we specifically target transcribing robotic instructions, which usually include descriptions of everyday objects. Thus, the words in the biasing vocabulary are often present in a standard language model (LM), whereas existing biasing models focus on biasing out-of-vocabulary (OOV) terms such as person names. We utilize this distinction by incorporating an n-gram language model to contextually scale the biasing score. We describe our shallow-fusion model in Section 4.2.
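As a concrete illustration of this idea, below is a minimal sketch of a contextually scaled biasing bonus under shallow fusion; the `logprob` interface, the bigram context, and the scaling form are assumptions for illustration, not the exact formulation of Section 4.2.

```python
import math

def biasing_bonus(prev_words, last_word, bias_vocab, ngram_lm, base_weight=2.0):
    """Illustrative shallow-fusion bonus: reward a completed biasing word,
    scaled down when a standard n-gram LM already finds it likely in context."""
    if last_word not in bias_vocab:
        return 0.0
    # Hypothetical n-gram LM interface: log P(last_word | previous words).
    lm_logprob = ngram_lm.logprob(last_word, prev_words[-2:])
    # The more predictable the word already is under the general LM,
    # the smaller the extra boost it needs (scale lies in [0, 1)).
    scale = 1.0 - math.exp(lm_logprob)
    return base_weight * scale
```

The bonus would be added to a hypothesis's score whenever a biasing word completes at a word boundary, leaving hypotheses without biasing words unaffected.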
3 System Overview
In this section, we present an overview of the em-
bodied agent that executes natural language instruc-
tions, as depicted in Figure 3. Given a speech input,
the agent also captures an image from its ego-view
camera. The dynamic context extraction module
extracts the visual context from the captured image
before producing the transcription of the speech
input. First, a dense image captioning model predicts several bounding boxes of interest in the image and generates a natural language description for each of them. Given the dense-captioned image, a bias target prediction model predicts a list of biasing words to be used in speech recognition.
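Since the bias target predictor is a learned model, the sketch below only illustrates its interface, using a simple heuristic stand-in that collects content words from the generated captions; the stopword list and example captions are hypothetical.

```python
STOPWORDS = {"a", "an", "the", "is", "on", "of", "in", "with", "to", "and"}

def predict_bias_targets(captions):
    """Heuristic stand-in for the bias target prediction model: collect
    unique content words from the dense-caption strings."""
    targets = []
    for caption in captions:                  # e.g., "a pink pillow on the couch"
        for token in caption.lower().split():
            word = token.strip(".,")
            if word and word not in STOPWORDS and word not in targets:
                targets.append(word)
    return targets

# Example captions as produced by a dense image captioning model:
print(predict_bias_targets(["a pink pillow on the couch", "a wooden table"]))
# -> ['pink', 'pillow', 'couch', 'wooden', 'table']
```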
The list of biasing words/phrases is compiled
into a prefix tree (trie) that is used by the beam
search decoder to prevent the pruning of partially
matched hypotheses. The trie is dynamically re-created as the agent moves and captures a new image.
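A minimal sketch of compiling the biasing list into a character-level trie, with the prefix query the decoder needs; the nested-dict representation is illustrative.

```python
def build_trie(bias_words):
    """Compile biasing words into a nested-dict character trie;
    '$' marks the end of a complete biasing word."""
    root = {}
    for word in bias_words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def trie_lookup(trie, text):
    """Return (is_prefix, is_word) for the candidate string."""
    node = trie
    for ch in text:
        if ch not in node:
            return False, False
        node = node[ch]
    return True, "$" in node

trie = build_trie(["pillow", "pink"])  # rebuilt whenever a new image arrives
print(trie_lookup(trie, "pil"))        # (True, False): viable partial match
print(trie_lookup(trie, "pillow"))     # (True, True): biasing word completed
```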
The acoustic model processes the speech
input to produce a sequence of probability distri-
butions over a character vocabulary. We use Wav2Vec2 (Baevski et al., 2020) for acoustic modeling of the speech.
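As a sketch of this step, the per-frame character distributions can be obtained with the Hugging Face implementation of Wav2Vec2; the checkpoint name below is an example, as the paper does not specify one.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Example checkpoint with a character-level CTC vocabulary.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def char_distributions(waveform, sampling_rate=16000):
    """Map raw audio to a (frames x characters) probability matrix."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits   # (1, frames, vocab)
    return torch.softmax(logits, dim=-1).squeeze(0)  # per-frame char probs
```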
This sequence is decoded into
the transcription using a modified beam search de-
coding algorithm. During the beam search, the visual context, represented by the biasing trie, is used to produce a transcription that is likely to contain the word(s) from the visual context. We describe our biasing approach in detail in Section 4.2.
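To make the beam allocation idea concrete, here is a simplified sketch in which a fixed share of the beam is reserved for hypotheses whose current partial word is a live prefix in the biasing trie, so they survive until the word boundary; the `Hypothesis` fields, split ratio, and helper names are assumptions, not the exact algorithm of Section 4.2.

```python
from collections import namedtuple

Hypothesis = namedtuple("Hypothesis", ["text", "partial_word", "score"])

def prune_beam(hypotheses, beam_width, trie, reserved_fraction=0.25):
    """Keep the top hypotheses by score, but reserve part of the beam for
    candidates whose partial word is a prefix in the biasing trie."""
    reserved = max(1, int(beam_width * reserved_fraction))
    ranked = sorted(hypotheses, key=lambda h: h.score, reverse=True)
    # The unreserved slots keep their intrinsic ranking untouched.
    survivors = ranked[: beam_width - reserved]
    # Fill the reserved slots with the best trie-matching hypotheses that
    # would otherwise be pruned (trie_lookup from the earlier sketch).
    for h in ranked[beam_width - reserved:]:
        if len(survivors) >= beam_width:
            break
        if trie_lookup(trie, h.partial_word)[0]:
            survivors.append(h)
    # If too few trie matches exist, fall back to the normal ranking.
    for h in ranked[beam_width - reserved:]:
        if len(survivors) >= beam_width:
            break
        if h not in survivors:
            survivors.append(h)
    return survivors
```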
Given the transcribed instruction, the task under-
standing & planning module performs task type
classification, argument extraction, and task plan-
ning. We use conditional random field (CRF) based models, as proposed in our earlier works (Pramanick et al., 2019, 2020). Specifically, the transcribed
instruction is passed through a task-crf model that
labels tokens in the transcription from a set of task
types. Given the output of task-crf, an argument-crf
model labels text spans in the instruction from a
set of argument labels. This results in an annotated instruction such as:
[Take]_taking [the pink pillow]_theme.
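An illustrative trace of this two-stage labeling for the example above; the BIO tagging scheme and the exact tag strings are assumptions for illustration.

```python
tokens = ["Take", "the", "pink", "pillow"]

# Stage 1: the task-crf model labels tokens with task types (O = none).
task_tags = ["B-taking", "O", "O", "O"]

# Stage 2: given the task-crf output, the argument-crf model labels
# text spans with argument roles.
arg_tags = ["O", "B-theme", "I-theme", "I-theme"]

# Together these yield the annotated instruction:
#   [Take]_taking [the pink pillow]_theme
```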
To perform a high-level task mentioned in the in-
struction (e.g., taking), the agent needs to perform
a sequence of basic actions as produced by the
task planner. The predicted task type is matched