
SPEECHCLIP: INTEGRATING SPEECH
WITH PRE-TRAINED VISION AND LANGUAGE MODEL
Yi-Jen Shih1, Hsuan-Fu Wang1, Heng-Jui Chang1,2, Layne Berry3, Hung-yi Lee1, David Harwath3
1National Taiwan University
2MIT CSAIL
3The University of Texas at Austin
ABSTRACT
Data-driven speech processing models usually perform well
with a large amount of text supervision, but collecting tran-
scribed speech data is costly. Therefore, we propose Speech-
CLIP, a novel framework bridging speech and text through
images to enhance speech models without transcriptions.
We leverage state-of-the-art pre-trained HuBERT and CLIP,
aligning them via paired images and spoken captions with
minimal fine-tuning. SpeechCLIP outperforms prior state-
of-the-art on image-speech retrieval and performs zero-shot
speech-text retrieval without direct supervision from tran-
scriptions. Moreover, SpeechCLIP can directly retrieve se-
mantically related keywords from speech.
Index Terms—Visual grounding, vision and language,
self-supervised learning
1. INTRODUCTION
Conventional speech processing tasks like speech recog-
nition rely on transcribed speech data and usually require
large labeled datasets to perform well, but transcribing an
enormous amount of speech is expensive.
Therefore, recent studies exploit unlabeled speech to pre-
train models with self-supervised learning (SSL) [1]. Models
learn to predict pseudo targets generated from raw data in
SSL pre-training. Some typical speech SSL methods include
masked reconstruction [2–6], contrastive learning [7–11], clas-
sification [12–14], multi-task learning [15], and knowledge
distillation [16–18]. These methods succeed in a wide range
of speech processing problems [19–21].
Besides SSL methods focusing on a single modality, re-
searchers have proposed leveraging data from other modal-
ities to improve performance on a target modality. E.g., pairing
images with semantically related text or spoken captions
is a typical method since collecting parallel image-text or
image-speech data is fast and inexpensive [22]. Specifically,
paired image-text data can be obtained by crawling images
and captions from the internet. Paired image-speech data can
be collected by uttering text captions or describing images
in speech. This paper uses paired image-speech data and an
image-text pre-trained model to enhance speech SSL models.
Fig. 1: An overview of the proposed SpeechCLIP model, showing the Cascaded and Parallel SpeechCLIP variants, which connect a speech encoder to frozen CLIP image and text encoders and are trained with contrastive losses.
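To make the training signal in Fig. 1 concrete, the sketch below shows a CLIP-style symmetric contrastive (InfoNCE) loss between pooled speech embeddings and embeddings from a frozen CLIP image encoder. It is a minimal illustration, assuming the speech encoder output has already been pooled and projected to CLIP's embedding dimension; the function name, tensor shapes, and temperature value are illustrative rather than an exact recipe.

import torch
import torch.nn.functional as F

def speech_image_contrastive_loss(speech_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE loss between paired speech and image embeddings.

    speech_embs: (B, D) pooled speech-encoder outputs projected to CLIP's
                 embedding dimension (illustrative preprocessing).
    image_embs:  (B, D) outputs of a frozen CLIP image encoder.
    """
    # Cosine similarities between every speech-image pair in the batch.
    speech_embs = F.normalize(speech_embs, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    logits = speech_embs @ image_embs.t() / temperature  # (B, B)

    # Matching speech-image pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2i = F.cross_entropy(logits, targets)      # speech -> image
    loss_i2s = F.cross_entropy(logits.t(), targets)  # image -> speech
    return (loss_s2i + loss_i2s) / 2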
Much effort has been put into using paired images and spoken
captions to help speech processing [24]; such models are usually
called visually grounded speech (VGS) models. VGS mod-
els benefit many applications like speech recognition [25],
word discovery [26], speech generation [27], cross-modal
alignment [22, 28, 29], and multilingual spoken language
processing [30–33]. Most studies pre-train and evaluate
VGS models on image-speech retrieval, demonstrating their
ability to capture the correspondence between images and
speech [34, 35]. E.g., the recent Fast-Slow Transformer for
Visually Grounding Speech (FaST-VGS and FaST-VGS+)
succeeds in many speech processing tasks by utilizing trans-
formers and cross-modal attention mechanisms to perform
image-speech retrieval and semantic tasks [36, 37]. Moreover,
VGS models trained with retrieval objectives can extract se-
mantic and word-level information from speech [38], which
is difficult to achieve by training solely with speech [39].
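These retrieval benchmarks are typically reported as recall at K: each spoken caption queries the gallery of image embeddings (and vice versa), and a query counts as a hit if its paired item ranks among the top K. Below is a minimal sketch of this metric, assuming precomputed, unit-normalized embeddings in which the i-th query is paired with the i-th gallery item; the function name and tensor layout are illustrative.

import torch

def recall_at_k(query_embs, gallery_embs, k=5):
    """Recall@K for paired retrieval, where query i matches gallery item i.

    query_embs:   (N, D) e.g., speech embeddings (unit-normalized).
    gallery_embs: (N, D) e.g., image embeddings (unit-normalized).
    """
    sims = query_embs @ gallery_embs.t()           # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices            # top-K gallery indices per query
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    hits = (topk == targets).any(dim=-1).float()   # 1 if the paired item is in the top K
    return hits.mean().item()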
While many studies obtain semantic information from
speech without transcriptions, some degree of assistance from
text can still be helpful for certain tasks. E.g., recent unsu-
pervised ASR methods rely on nonparallel text data and a
pronunciation lexicon [40, 41]. To circumvent transcriptions
or lexicons, we propose to bridge speech and text domains