TOWARDS VISUALLY PROMPTED KEYWORD LOCALISATION FOR ZERO-RESOURCE
SPOKEN LANGUAGES
Leanne Nortje and Herman Kamper
MediaLab, Electrical & Electronic Engineering, Stellenbosch University, South Africa
ABSTRACT
Imagine being able to show a system a visual depiction of a keyword and finding spoken utterances that contain this keyword from a zero-resource speech corpus. We formalise this task and call it visually prompted keyword localisation (VPKL): given an image of a keyword, detect and predict where in an utterance the keyword occurs. To do VPKL, we propose a speech-vision model with a novel localising attention mechanism which we train with a new keyword sampling scheme. We show that these innovations give improvements in VPKL over an existing speech-vision model. We also compare to a visual bag-of-words (BoW) model where images are automatically tagged with visual labels and paired with unlabelled speech. Although this visual BoW can be queried directly with a written keyword (while ours takes image queries), our new model still outperforms the visual BoW in both detection and localisation, giving a 16% relative improvement in localisation F1.
Index Terms— Visually grounded speech models, keyword localisation, speech-image retrieval.
1. INTRODUCTION
How can we search a speech collection in a zero-resource language where it is impossible to obtain text transcriptions (e.g. unwritten languages)? One way in which recent research is addressing this problem is to use vision as a weak form of supervision: speech systems are built on images paired with unlabelled spoken captions—removing the need for text [1].

In this paper we specifically introduce the new visually prompted keyword localisation (VPKL) task: a model is given an image depicting a keyword—serving as an image query—and is prompted to detect whether the query occurs in a spoken utterance. If the keyword is detected, the model should also determine where in the utterance the keyword occurs. E.g. the model is shown an image of a mountain and asked whether it occurs in the spoken caption: “a hiker in a tent on a mountain”. The model should also say where in the utterance mountain occurs (if it is detected), as shown in Fig. 1. To do this, we need a multimodal model that can compare images or image regions to spoken utterances.
Leanne Nortje is funded through a DeepMind PhD scholarship.
In the last few years a range of speech-vision models have been proposed [1–8]. Most were developed for retrieving whole images given a whole spoken caption as query (or vice versa). Image-caption retrieval is different from VPKL—in the latter, the query is typically a depiction of an isolated object or concept and we want to detect and localise this query within an utterance (rather than retrieving a whole spoken caption). Nevertheless, with slight modification, we can use an image-caption retrieval model for VPKL. We show that this performs poorly, presumably because of the mismatch between the training objective and the test-time VPKL task.
As a result, we propose a novel localising attention mechanism and a new keyword sampling scheme. First, for the attention mechanism, we combine the idea of matchmaps [2] with a more explicit form of within-utterance attention [9, 10]. Second, for the sampling scheme, we can use a visual tagger to automatically tag training images with text labels of words likely occurring in the image. From these generated tags, we can sample positive and negative image-caption pairs which contain the same or different keywords. E.g. while originally we could have a spoken caption “hikers going up a mountain slope” paired only with a single image, we could now also pair this utterance with the spoken caption “a boy and his dad on a mountain”. This would encourage the model to not only focus on utterances as a whole, but also learn within-utterance distinctions between keywords. Note that in this paper we mainly consider an idealised case using the captions’ text transcriptions to sample positive and negative pairs (simulating an ideal visual tagger).
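To make this sampling scheme concrete, the sketch below draws an anchor-positive caption pair sharing a keyword, together with negative captions that do not contain it, from per-image tag sets. The function name, data structures and use of Python's random module are illustrative assumptions and not our actual training code.

```python
import random

def sample_pairs(keyword, tags, captions, num_negatives=1):
    """Sample positive/negative caption pairs for one query keyword.

    tags: dict mapping utterance ID -> set of labels for that utterance's image.
    captions: dict mapping utterance ID -> spoken caption features (or file paths).
    Returns an anchor caption, a positive caption, and a list of negatives,
    suitable for a contrastive objective. Illustrative sketch only.
    """
    positives = [uid for uid, t in tags.items() if keyword in t]
    negatives = [uid for uid, t in tags.items() if keyword not in t]

    # Anchor and positive: two different utterances whose tags contain the keyword.
    anchor, positive = random.sample(positives, 2)
    # Negatives: utterances whose tags do not contain the keyword.
    sampled_negatives = random.sample(negatives, num_negatives)

    return (captions[anchor], captions[positive],
            [captions[uid] for uid in sampled_negatives])
```

In the idealised setting described above, the tags would simply be the words in each caption's transcription; with a real visual tagger they would be the tagger's predicted labels.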
In this setting, we show that both innovations lead to improvements in VPKL over an image-caption retrieval model. We also compare to a visual bag-of-words model [9, 11, 12] which is queried with written keywords instead of images.
Fig. 1. The goal in visually prompted keyword localisation is
to locate a given query keyword (given as an image) within a
spoken utterance.
This model is trained using a visual tagger to generate textual bag-of-words labels for training images. These labels are then used to train a keyword detection model [12]. While a written keyword arguably gives a stronger query signal than an image, we show that when combining our new attention mechanism with our new sampling scheme, we outperform the visual bag-of-words. Further analysis shows that the range of keywords that our model is able to localise is much smaller than that of the visual bag-of-words model. We attribute this to image queries sometimes depicting more than one keyword. Through further analyses, we also show that the model’s performance decreases when tasked with learning a larger set of keywords. We also present initial experiments where a real image tagger is used to produce positive and negative examples for our training scheme—highlighting additional challenges for future work.
2. NEW TASK: VISUALLY PROMPTED KEYWORD LOCALISATION
The approach of directly training on image-speech pairs is motivated by children having access to image and speech signals when acquiring their native language [13–19]. To learn words, they can use the co-occurrences of spoken words with visual objects, and vice versa [20]. Eventually humans can establish if and where a word, depicted by its visual representation, is uttered—without ever requiring transcriptions. Drawing inspiration from humans, we introduce the new task of visually prompted keyword localisation (VPKL). This task is very similar to the task of textual keyword detection, where a model is given a written query keyword and asked to detect (and possibly locate) occurrences of the keyword in a search collection [9, 11, 12, 21–25]. Instead of a written keyword, in VPKL the query is an image of an object or concept.
Formally, VPKL involves both detection and localisation. Detection is illustrated in Fig. 2. A model is given an image query $\mathbf{x}_{\text{vision}}$, which depicts a keyword, and asked whether the keyword occurs in an utterance $\mathbf{x}_{\text{audio}}$. For localisation, if the model detects the image query $\mathbf{x}_{\text{vision}}$ in $\mathbf{x}_{\text{audio}}$, the model is prompted to identify where in $\mathbf{x}_{\text{audio}}$ the keyword occurs. E.g. in Fig. 1 the model is asked whether the mountain in $\mathbf{x}_{\text{vision}}$ occurs in $\mathbf{x}_{\text{audio}_3}$.
Fig. 2. The goal in visually prompted detection is to detect
whether a given query keyword (given as an image) occurs
anywhere within a spoken utterance.
After the model has detected the keyword “mountain”, it is prompted to identify where in $\mathbf{x}_{\text{audio}_3}$ it occurs. During VPKL, the model therefore has to first do detection and then localisation, i.e. detection is a task on its own, but localisation includes detection. To do VPKL, we need a multimodal model that can output whether a keyword occurs and at which frames detected keywords occur.
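Written compactly, using the overall score $S$, detection threshold $\alpha$ and per-frame scores $a_{\text{audio},i}$ that our model outputs (defined in Section 3), the two sub-tasks can be summarised as in the sketch below; the single-frame argmax for localisation is one simple formalisation and is meant to be illustrative.

```latex
% Detection: does the keyword depicted in x_vision occur anywhere in x_audio?
\hat{y} = \mathbb{1}\big[\, S(\mathbf{x}_{\text{vision}}, \mathbf{x}_{\text{audio}}) \geq \alpha \,\big]
% Localisation: if detected, predict the frame at which the keyword occurs.
\hat{t} = \operatorname*{argmax}_{i \in \{1, \ldots, N\}} a_{\text{audio}, i}
```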
3. APPROACH: MULTIMODAL LOCALISATION MODELS
Our VPKL model outputs an overall score $S \in [0, 1]$ indicating whether a keyword is present anywhere within an utterance. The keyword is detected if $S$ is above a threshold $\alpha$. Additionally, the model outputs a sequence of scores $\mathbf{a}_{\text{audio}} \in \mathbb{R}^{N}$, where $N$ is the number of speech frames; each element $a_{\text{audio},i}$ indicates whether the detected keyword occurs at frame $i$. To do this, we need a multimodal model that can predict which frames in an utterance are most relevant to a given query image. As starting point for our model, we use the deep audio-visual embedding network (DAVEnet) of [4]. We then adapt it by introducing a new sampling scheme and attention mechanism that encourages localisation.
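As a minimal sketch of how these outputs are used at test time, the snippet below thresholds the utterance-level score for detection and, if the keyword is detected, takes the highest-scoring frame as the predicted location. The function name and default threshold are placeholders; in practice $\alpha$ would be tuned on development data.

```python
import numpy as np

def vpkl_predict(S, a_audio, alpha=0.5):
    """Turn model outputs into VPKL predictions.

    S: overall score in [0, 1] that the image query's keyword occurs in the utterance.
    a_audio: array of shape (N,), per-frame scores for where the keyword occurs.
    alpha: detection threshold (placeholder value).
    Returns (detected, frame_index), with frame_index None if not detected.
    """
    detected = S >= alpha
    if not detected:
        return False, None
    # Localisation: the frame with the highest score is the predicted location.
    return True, int(np.argmax(a_audio))
```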
3.1. Starting point: deep audio-visual embedding network
The deep audio-visual embedding network [4] consists of a vision network and an audio network, which separately map an image and its entire spoken caption to single fixed-size embeddings in a common multimodal space. The goal is to get the embeddings of paired images and spoken captions to be more similar than the embeddings of mismatched images and captions. Our implementation of DAVEnet incorporates some of the extensions from [2].
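One common way to implement this objective is a margin-based ranking loss over matched and mismatched (impostor) pairs, in the spirit of [4]; the sketch below is illustrative and is not necessarily the exact loss used in our implementation.

```python
def margin_ranking_loss(sim_pos, sim_imp_image, sim_imp_audio, margin=1.0):
    """Margin-based ranking objective over matched and impostor pairs (sketch).

    sim_pos:        similarity between a matched image and its spoken caption.
    sim_imp_image:  similarity between the caption and an impostor (mismatched) image.
    sim_imp_audio:  similarity between the image and an impostor (mismatched) caption.
    The loss pushes matched pairs to score at least `margin` higher than mismatched ones.
    """
    loss_image = max(0.0, margin - sim_pos + sim_imp_image)
    loss_audio = max(0.0, margin - sim_pos + sim_imp_audio)
    return loss_image + loss_audio
```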
Following [2], we extend the DAVEnet architecture to use ResNet50 [26] for the image network and, instead of learning fixed-size embeddings, we learn a sequence of embeddings for each image, $\mathbf{e}_{\text{vision}} \in \mathbb{R}^{M \times D}$, and each caption, $\mathbf{e}_{\text{audio}} \in \mathbb{R}^{N \times D}$. Here $M$ is the number of image positions (pixels), $N$ the number of speech frames, and $D$ the embedding dimensionality. These embedding sequences are then used in a matchmap $\mathcal{M} \in \mathbb{R}^{M \times N}$, where each entry is the dot product between a frame embedding in $\mathbf{e}_{\text{audio}}$ and a pixel embedding in $\mathbf{e}_{\text{vision}}$. The idea is that high similarity in $\mathcal{M}$ should indicate those speech frames and image pixels that are related. In [2], the authors showed quantitatively that the matchmaps can indeed localise words and objects corresponding to the same concept.
Another change we make from [4] and [2] is that, instead of using standard speech features as input, we use an acoustic network trained on external data. Concretely, we use a different network as the audio branch in our modified version of DAVEnet. This network consists of an acoustic network $f_{\text{acoustic}}$ and a BiLSTM network $f_{\text{BiLSTM}}$. For $f_{\text{acoustic}}$, we pretrain the contrastive predictive coding model
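A rough sketch of this two-stage audio branch is given below: a pretrained acoustic encoder (standing in for $f_{\text{acoustic}}$, e.g. a contrastive predictive coding model) produces frame-level features that are then passed through a BiLSTM ($f_{\text{BiLSTM}}$). The class name, dimensions and use of PyTorch are illustrative assumptions rather than our actual implementation.

```python
import torch.nn as nn

class AudioBranch(nn.Module):
    """Sketch of an audio branch: pretrained acoustic encoder followed by a BiLSTM.

    `acoustic_encoder` stands in for a pretrained model that maps speech input to
    frame-level feature vectors; it is treated here as an opaque module.
    Dimensions are illustrative placeholders.
    """

    def __init__(self, acoustic_encoder, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.f_acoustic = acoustic_encoder  # pretrained on external data
        self.f_bilstm = nn.LSTM(
            input_size=feat_dim,
            hidden_size=hidden_dim,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, speech):
        # speech: (batch, time, ...) input accepted by the acoustic encoder.
        feats = self.f_acoustic(speech)    # (batch, N, feat_dim)
        e_audio, _ = self.f_bilstm(feats)  # (batch, N, 2 * hidden_dim)
        return e_audio                     # per-frame embeddings
```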