
This model is trained using a visual tagger to generate textual bag-of-words labels for training images. These labels are then used to train a keyword detection model [12]. While a written keyword arguably gives a stronger query signal than an image, we show that when combining our new attention mechanism with our new sampling scheme, we outperform the visual bag-of-words model. Further analysis shows that the distribution of keywords that our model is able to localise is much smaller than that of the visual bag-of-words model. We attribute this to image queries sometimes depicting more than one keyword. Through further analyses, we also show that the model's performance decreases when tasked with learning a larger set of keywords. We also present initial experiments where a real image tagger is used to produce positive and negative examples for our training scheme, highlighting additional challenges for future work.
2. NEW TASK: VISUALLY PROMPTED KEYWORD
LOCALISATION
The approach of directly training on image-speech pairs is motivated by children having access to image and speech signals when acquiring their native language [13–19]. To learn words, they can use the co-occurrences of spoken words with visual objects, and vice versa [20]. Eventually humans can establish if and where a word, depicted by its visual representation, is uttered, without ever requiring transcriptions. Drawing inspiration from humans, we introduce the new task of visually prompted keyword localisation (VPKL). This task is very similar to the task of textual keyword detection, where a model is given a written query keyword and asked to detect (and possibly locate) occurrences of the keyword in a search collection [9, 11, 12, 21–25]. Instead of a written keyword, in VPKL the query is an image of an object or concept.
Formally, VPKL involves both detection and localisation.
Detection is illustrated in Fig. 2. A model is given an image
query xvision, which depicts a keyword, and asked whether
the keyword occurs in an utterance xaudio. For localisation, if
the model detects the image query xvision in xaudio, the model
is prompted to identify where in xaudio the keyword occurs.
For example, in Fig. 1 the model is asked whether the mountain in xvision occurs in xaudio3. After the model has detected the keyword "mountain", it is prompted to identify where in xaudio3 it occurs. During VPKL, the model therefore has to first do detection and then localisation, i.e. detection is a task on its own, but localisation includes detection. To do VPKL, we need a multimodal model that can output whether a keyword occurs and at which frame detected keywords occur.

Fig. 2. The goal in visually prompted detection is to detect whether a given query keyword (given as an image) occurs anywhere within a spoken utterance.
3. APPROACH: MULTIMODAL LOCALISATION
MODELS
Our VPKL model outputs an overall score S ∈ [0, 1] indicating whether a keyword is present anywhere within an utterance. The keyword is detected if S is above a threshold α. Additionally, the model outputs a sequence of scores aaudio ∈ R^N, where N is the number of speech frames; each element aaudio,i indicates whether the detected keyword occurs at frame i. To do this, we need a multimodal model that can predict which frames in an utterance are most relevant to a given query image. As a starting point for our model, we use the deep audio-visual embedding network (DAVEnet) of [4]. We then adapt it by introducing a new sampling scheme and attention mechanism that encourage localisation.
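To make these outputs concrete, the following is a minimal sketch of how detection and localisation could be carried out at inference time. The function name, the default threshold value and the use of the highest-scoring frame for localisation are illustrative assumptions, not the exact procedure of our model.

import numpy as np

def detect_and_localise(S, a_audio, alpha=0.5):
    """Given an utterance-level score S in [0, 1] and per-frame scores
    a_audio of length N, decide whether the query keyword is present
    (detection) and, if so, return the frame at which it most likely
    occurs (localisation)."""
    detected = S > alpha                                   # detection: score above the threshold
    frame = int(np.argmax(a_audio)) if detected else None  # localisation: highest-scoring frame
    return detected, frame

# Toy example: a confident detection peaking at frame 2
detected, frame = detect_and_localise(0.82, np.array([0.1, 0.2, 0.9, 0.3, 0.1]))
print(detected, frame)  # True 2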
3.1. Starting point: deep audio-visual embedding network
DAVEnet [4] consists of a vision and an audio network that separately map an image and its entire spoken caption to single fixed-size embeddings in a common multimodal space. The goal is to get the embeddings of paired images and spoken captions to be more similar than the embeddings of mismatched images and captions. Our implementation of DAVEnet incorporates some of the extensions from [2].
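To make this training objective concrete, a margin-based ranking loss over matched and mismatched image-caption embeddings is one common way to realise it; the sketch below assumes dot-product similarity, a margin of 1.0 and one sampled negative image and caption per pair, which are illustrative choices rather than the exact objective used here.

import torch
import torch.nn.functional as F

def ranking_loss(e_vision, e_audio, e_vision_neg, e_audio_neg, margin=1.0):
    """Hinge loss that pushes the similarity of matched image-caption pairs
    above that of mismatched pairs by at least `margin`. All arguments are
    (batch, dim) embeddings in the shared multimodal space."""
    sim = lambda a, b: (a * b).sum(dim=1)   # dot-product similarity
    pos = sim(e_vision, e_audio)            # matched image and caption
    neg_v = sim(e_vision_neg, e_audio)      # mismatched image, same caption
    neg_a = sim(e_vision, e_audio_neg)      # same image, mismatched caption
    return (F.relu(margin - pos + neg_v) + F.relu(margin - pos + neg_a)).mean()

# Toy usage with random 512-dimensional embeddings for a batch of 8 pairs
loss = ranking_loss(*(torch.randn(8, 512) for _ in range(4)))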
Following [2], we extend the DAVEnet architecture to use ResNet50 [26] for the image network and, instead of learning fixed-size embeddings, we learn a sequence of embeddings evision ∈ R^M for each image and eaudio ∈ R^N for each caption. Here M is the number of pixels and N the number of frames. These embedding sequences are then used in a matchmap M ∈ R^{M×N}, which contains the dot product between each frame embedding in eaudio and each pixel embedding in evision. The idea is that high similarity values in M should indicate which speech frames and image pixels are related. In [2], the authors showed quantitatively that such matchmaps can indeed localise words and objects corresponding to the same concept.
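As an illustration of this computation, the sketch below forms a matchmap from an image embedding sequence and an audio embedding sequence and pools it into a single utterance-level similarity. The embedding dimension and the max-over-pixels, mean-over-frames pooling are illustrative assumptions (one of several poolings explored in prior audio-visual work), not necessarily the variant used here.

import torch

def matchmap(e_vision, e_audio):
    """Dot product between every pixel embedding (M, D) and every frame
    embedding (N, D), giving an (M, N) matchmap of local similarities."""
    return e_vision @ e_audio.T

def utterance_similarity(mm):
    """Pool a matchmap into a single score: max over pixels, then mean
    over frames (an illustrative pooling choice)."""
    return mm.max(dim=0).values.mean()

# Toy example: 49 image positions, 128 speech frames, 512-dim embeddings
e_vision = torch.randn(49, 512)
e_audio = torch.randn(128, 512)
S = utterance_similarity(matchmap(e_vision, e_audio))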
Another change we make from [4] and [2] is that, instead of using standard speech features as input, we use an acoustic network trained on external data. Concretely, we use a different network as the audio branch in our modified version of DAVEnet. This network consists of an acoustic network facoustic and a BiLSTM network fBiLSTM. For facoustic, we pretrain the contrastive predictive coding model