
This model is trained using a visual tagger to generate textual bag-of-words labels for training images. These labels are then used to train a keyword detection model [12]. While a written keyword arguably gives a stronger query signal than an image, we show that when combining our new attention mechanism with our new sampling scheme, we outperform the visual bag-of-words model. Further analysis shows that the distribution of keywords that our model is able to localise is much smaller than that of the visual bag-of-words model. We attribute this to image queries sometimes depicting more than one keyword. Through further analyses, we also show that the model's performance decreases when tasked with learning a larger set of keywords. We also present initial experiments where a real image tagger is used to produce positive and negative examples for our training scheme, highlighting additional challenges for future work.
2. NEW TASK: VISUALLY PROMPTED KEYWORD
LOCALISATION
The approach of directly training on image-speech pairs is motivated by children having access to image and speech signals when acquiring their native language [13–19]. To learn words, they can use the co-occurrences of spoken words with visual objects, and vice versa [20]. Eventually humans can establish if and where a word, depicted by its visual representation, is uttered, without ever requiring transcriptions. Drawing inspiration from humans, we introduce the new task of visually prompted keyword localisation (VPKL). This task is very similar to the task of textual keyword detection, where a model is given a written query keyword and asked to detect (and possibly locate) occurrences of the keyword in a search collection [9, 11, 12, 21–25]. Instead of a written keyword, in VPKL the query is an image of an object or concept.
Formally, VPKL involves both detection and localisation.
Detection is illustrated in Fig. 2. A model is given an image
query xvision, which depicts a keyword, and asked whether
the keyword occurs in an utterance xaudio. For localisation, if
the model detects the image query xvision in xaudio, the model
is prompted to identify where in xaudio the keyword occurs.
For example, in Fig. 1 the model is asked whether the mountain in xvision occurs in xaudio3. After the model has detected the keyword "mountain", it is prompted to identify where in xaudio3 it occurs. During VPKL, the model therefore has to first do detection and then localisation, i.e. detection is a task on its own, but localisation includes detection. To do VPKL, we need a multimodal model that can output whether a keyword occurs and at which frame detected keywords occur.

Fig. 2. The goal in visually prompted detection is to detect whether a given query keyword (given as an image) occurs anywhere within a spoken utterance.
3. APPROACH: MULTIMODAL LOCALISATION
MODELS
Our VPKL model outputs an overall score S ∈ [0, 1] indicating whether a keyword is present anywhere within an utterance. The keyword is detected if S is above a threshold α. Additionally, the model outputs a sequence of scores aaudio ∈ R^N, where N is the number of speech frames; each element aaudio,i indicates whether the detected keyword occurs at frame i. To do this, we need a multimodal model that can predict which frames in an utterance are most relevant to a given query image. As a starting point for our model, we use the deep audio-visual embedding network (DAVEnet) of [4]. We then adapt it by introducing a new sampling scheme and attention mechanism that encourage localisation.
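To make these outputs concrete, the following is a minimal sketch of how detection and localisation could be carried out at inference time. The function name, the default threshold value and the use of the highest-scoring frame for localisation are illustrative assumptions, not the exact procedure of our model.

import numpy as np

def detect_and_localise(S, a_audio, alpha=0.5):
    """Given an utterance-level score S in [0, 1] and per-frame scores
    a_audio of length N, decide whether the query keyword is present
    (detection) and, if so, return the frame at which it most likely
    occurs (localisation)."""
    detected = S > alpha                                   # detection: score above the threshold
    frame = int(np.argmax(a_audio)) if detected else None  # localisation: highest-scoring frame
    return detected, frame

# Toy example: a confident detection peaking at frame 2
detected, frame = detect_and_localise(0.82, np.array([0.1, 0.2, 0.9, 0.3, 0.1]))
print(detected, frame)  # True 2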
3.1. Starting point: deep audio-visual embedding network
DAVEnet [4] consists of a vision and an audio network that separately map an image and its entire spoken caption to single fixed-size embeddings in a common multimodal space. The goal is to get the embeddings of paired images and spoken captions to be more similar than the embeddings of mismatched images and captions. Our implementation of DAVEnet incorporates some of the extensions from [2].
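To make this training objective concrete, a margin-based ranking loss over matched and mismatched image-caption embeddings is one common way to realise it; the sketch below assumes dot-product similarity, a margin of 1.0 and one sampled negative image and caption per pair, which are illustrative choices rather than the exact objective used here.

import torch
import torch.nn.functional as F

def ranking_loss(e_vision, e_audio, e_vision_neg, e_audio_neg, margin=1.0):
    """Hinge loss that pushes the similarity of matched image-caption pairs
    above that of mismatched pairs by at least `margin`. All arguments are
    (batch, dim) embeddings in the shared multimodal space."""
    sim = lambda a, b: (a * b).sum(dim=1)   # dot-product similarity
    pos = sim(e_vision, e_audio)            # matched image and caption
    neg_v = sim(e_vision_neg, e_audio)      # mismatched image, same caption
    neg_a = sim(e_vision, e_audio_neg)      # same image, mismatched caption
    return (F.relu(margin - pos + neg_v) + F.relu(margin - pos + neg_a)).mean()

# Toy usage with random 512-dimensional embeddings for a batch of 8 pairs
loss = ranking_loss(*(torch.randn(8, 512) for _ in range(4)))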
Following [2], we extend the DAVEnet architecture to use ResNet50 [26] for the image network and, instead of learning fixed-size embeddings, we learn a sequence of embeddings evision ∈ R^M for each image and eaudio ∈ R^N for each caption. Here M is the number of pixels and N the number of frames. These embedding sequences are then used in a matchmap M ∈ R^{M×N}, which contains the dot product between each frame embedding in eaudio and each pixel embedding in evision. The idea is that high similarity values in M should indicate which speech frames and image pixels are related. In [2], the authors showed quantitatively that such matchmaps can indeed localise words and objects corresponding to the same concept.
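As an illustration of this computation, the sketch below forms a matchmap from an image embedding sequence and an audio embedding sequence and pools it into a single utterance-level similarity. The embedding dimension and the max-over-pixels, mean-over-frames pooling are illustrative assumptions (one of several poolings explored in prior audio-visual work), not necessarily the variant used here.

import torch

def matchmap(e_vision, e_audio):
    """Dot product between every pixel embedding (M, D) and every frame
    embedding (N, D), giving an (M, N) matchmap of local similarities."""
    return e_vision @ e_audio.T

def utterance_similarity(mm):
    """Pool a matchmap into a single score: max over pixels, then mean
    over frames (an illustrative pooling choice)."""
    return mm.max(dim=0).values.mean()

# Toy example: 49 image positions, 128 speech frames, 512-dim embeddings
e_vision = torch.randn(49, 512)
e_audio = torch.randn(128, 512)
S = utterance_similarity(matchmap(e_vision, e_audio))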
Another change we make from [4] and [2] is that, instead of using standard speech features as input, we use an acoustic network trained on external data. Concretely, we use a different network as the audio branch in our modified version of DAVEnet. This network consists of an acoustic network facoustic and a BiLSTM network fBiLSTM. For facoustic, we pretrain the contrastive predictive coding model