SPEECHCLIP: INTEGRATING SPEECH
WITH PRE-TRAINED VISION AND LANGUAGE MODEL
Yi-Jen Shih1, Hsuan-Fu Wang1, Heng-Jui Chang1,2, Layne Berry3, Hung-yi Lee1, David Harwath3
1National Taiwan University
2MIT CSAIL
3The University of Texas at Austin
ABSTRACT
Data-driven speech processing models usually perform well
with a large amount of text supervision, but collecting tran-
scribed speech data is costly. Therefore, we propose Speech-
CLIP, a novel framework bridging speech and text through
images to enhance speech models without transcriptions.
We leverage state-of-the-art pre-trained HuBERT and CLIP,
aligning them via paired images and spoken captions with
minimal fine-tuning. SpeechCLIP outperforms prior state-
of-the-art on image-speech retrieval and performs zero-shot
speech-text retrieval without direct supervision from tran-
scriptions. Moreover, SpeechCLIP can directly retrieve se-
mantically related keywords from speech.
Index Terms: Visual grounding, vision and language,
self-supervised learning
1. INTRODUCTION
Conventionally, speech processing tasks like speech recog-
nition need transcribed speech data for machine learning.
They usually require large labeled datasets to perform well,
but transcribing an enormous amount of speech is expensive.
Therefore, recent studies exploit unlabeled speech to pre-
train models with self-supervised learning (SSL) [1]. Models
learn to predict pseudo targets generated from raw data in
SSL pre-training. Some typical speech SSL methods include
masked reconstruction [2–6], contrastive learning [7–11], clas-
sification [12–14], multi-task learning [15], and knowledge
distillation [16–18]. These methods succeed in a wide range
of speech processing problems [19–21].
Besides SSL methods focusing on a single modality, researchers have proposed using data from other modalities to boost performance on a target modality. E.g., pairing
images with semantically related text or spoken captions
is a typical method since collecting parallel image-text or
image-speech data is fast and inexpensive [22]. Specifically,
paired image-text data can be obtained by crawling images
and captions from the internet. Paired image-speech data can
be collected by uttering text captions or describing images
in speech.

Fig. 1: An overview of the proposed SpeechCLIP model.

This paper uses paired image-speech data and an
image-text pre-trained model to enhance speech SSL models.
Much effort has been put into using paired images and spoken captions to help speech processing [24]; such models are usually called visually grounded speech (VGS) models. VGS models benefit many applications like speech recognition [25],
word discovery [26], speech generation [27], cross-modal
alignment [22, 28, 29], and multilingual spoken language
processing [30–33]. Most studies pre-train and evaluate
VGS models on image-speech retrieval, showing the capa-
bilities of capturing the correspondence between images and
speech [34, 35]. E.g., the recent Fast-Slow Transformer for
Visually Grounding Speech (FaST-VGS and FaST-VGS+)
succeeds in many speech processing tasks by utilizing trans-
formers and cross-modal attention mechanisms to perform
image-speech retrieval and semantic tasks [36,37]. Moreover,
VGS models trained with retrieval objectives can extract se-
mantic and word-level information from speech [38], which
is difficult to achieve by training solely with speech [39].
While many studies obtain semantic information from speech without transcriptions, some degree of assistance from text can still be helpful for certain tasks. E.g., recent unsu-
pervised ASR methods rely on nonparallel text data and a
pronunciation lexicon [40, 41]. To circumvent transcriptions
or lexicons, we propose to bridge speech and text domains
via images, i.e., taking advantage of paired image-speech and image-text data.

Fig. 2: An illustration of SpeechCLIP models. (a) A pre-trained HuBERT [12] extracts audio features. The features are concatenated with a learnable CLS token and fed into a transformer encoder layer to obtain a single vector representing the information of the entire sequence. The vector is then used to compute contrastive loss with the CLIP image encoder's output [23]. (b) Cascaded SpeechCLIP uses K CLS tokens to capture a small sequence of keywords from the audio signal. The keywords are batch-normalized and vector-quantized before passing to the CLIP text encoder. BN and VQ respectively denote batch normalization and vector quantization.

Thus, this paper introduces SpeechCLIP,
a novel framework to integrate speech SSL models with a
pre-trained vision and language model as depicted in Fig. 1.
We use Contrastive Language-Image Pre-training (CLIP),
a powerful model pre-trained to align parallel image-text
data [23]. Then, a speech encoder initialized by a pre-trained
speech SSL model is enhanced by aligning with CLIP using
paired image-speech data. By aligning a speech encoder’s
and CLIP’s image embedding spaces, the speech encoder
is implicitly aligned with CLIP’s text encoder, forcing it to
capture more textual content.
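To make the alignment objective concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss between a batch of speech embeddings and the corresponding CLIP image embeddings; the function name and the fixed logit scale are illustrative assumptions, not details taken from the released implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(speech_emb, image_emb, logit_scale=100.0):
    """Symmetric InfoNCE loss: the i-th speech clip and the i-th image form a positive pair."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = logit_scale * speech_emb @ image_emb.t()    # (batch, batch) scaled cosine similarities
    targets = torch.arange(speech_emb.size(0), device=speech_emb.device)
    loss_s2i = F.cross_entropy(logits, targets)          # retrieve the image given the speech
    loss_i2s = F.cross_entropy(logits.t(), targets)      # retrieve the speech given the image
    return 0.5 * (loss_s2i + loss_i2s)
```

In CLIP itself the logit scale is a learned temperature parameter; a fixed value is used here only to keep the sketch short.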
We propose two SpeechCLIP architectures: parallel and
cascaded. The parallel model is similar to WAV2CLIP [42].
However, our speech encoder uses a pre-trained speech SSL
model and focuses on capturing local and global spoken con-
tents. Meanwhile, WAV2CLIP extracts global features in
general audio for classification and retrieval. Furthermore,
AudioCLIP is an extension of WAV2CLIP since it is trained
with paired image, audio, and text data [43]. The cascaded
SpeechCLIP cascades CLIP’s text encoder on top of the
speech encoder, forcing the model to output subword embed-
dings. Eventually, the cascaded model captures spoken words
in speech signals.
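As a rough sketch of the cascaded branch in Fig. 2(b), the snippet below batch-normalizes K keyword vectors and quantizes each onto CLIP's subword-embedding table with a temperature-scaled softmax over cosine similarities before they would be fed to the CLIP text encoder. The softmax-based quantizer and the temperature value are assumptions made for illustration; the released code may implement the BN + VQ step differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeywordQuantizer(nn.Module):
    """Illustrative BN + VQ block: map K keyword vectors onto CLIP's subword-embedding
    space via a temperature-scaled softmax over cosine similarities (one plausible
    realization of the quantization step in Fig. 2(b))."""

    def __init__(self, dim, clip_token_embeddings, temperature=0.1):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        # Frozen codebook, e.g. CLIP's token_embedding.weight of shape (vocab, dim).
        self.register_buffer("codebook", clip_token_embeddings)
        self.temperature = temperature

    def forward(self, keywords):                               # keywords: (batch, K, dim)
        b, k, d = keywords.shape
        kw = self.bn(keywords.reshape(b * k, d)).reshape(b, k, d)
        sim = F.normalize(kw, dim=-1) @ F.normalize(self.codebook, dim=-1).t()
        probs = F.softmax(sim / self.temperature, dim=-1)      # (batch, K, vocab)
        return probs @ self.codebook                           # keywords snapped into CLIP's space
```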
In this paper, the proposed SpeechCLIP models achieve
state-of-the-art image-speech retrieval on two standard spo-
ken caption datasets with minimal fine-tuning. Moreover,
we demonstrate SpeechCLIP’s capability of performing zero-
shot speech-text retrieval and capturing keywords directly
from speech. We also make our code available on GitHub (https://github.com/atosystem/SpeechCLIP).
2. METHOD
2.1. Preliminaries
We briefly explain pre-trained models used in SpeechCLIP.
Contrastive Language-Image Pre-training (CLIP) [23].
CLIP uses contrastive learning to pre-train visual models
from natural language supervision on an enormous scale,
where the supervision comes from paired image-text data.
Comprising two encoders that process images and text separately, CLIP aims to align semantically similar images and
text captions. CLIP can easily transfer across various com-
puter vision tasks with little supervision.
Hidden-unit BERT (HuBERT) [12]. HuBERT is a speech
SSL method similar to masked language modeling, predicting labels generated by clustering acoustic features. HuBERT
comprises a CNN feature extractor followed by a transformer
encoder [44] and offers good initialization for many speech
processing tasks [19, 21].
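For readers who want to set up the frozen feature extractors themselves, here is a hedged sketch using OpenAI's clip package and HuggingFace transformers; the checkpoint names are common public ones and are not necessarily the exact configurations used in the paper.

```python
import torch
import clip                                   # https://github.com/openai/CLIP
from transformers import HubertModel

device = "cpu"                                # use "cuda" in practice

# Frozen CLIP supplies the image and sentence embeddings used as supervision.
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad = False

# Frozen HuBERT supplies audio features (all hidden layers are kept).
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").to(device).eval()
for p in hubert.parameters():
    p.requires_grad = False

with torch.no_grad():
    wav = torch.randn(1, 16000 * 5)                               # placeholder: 5 s of 16 kHz audio
    audio_layers = hubert(wav, output_hidden_states=True).hidden_states
    text_emb = clip_model.encode_text(clip.tokenize(["a dog runs on the beach"]))
```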
In SpeechCLIP, pre-trained CLIP and HuBERT models
are frozen and serve as feature extractors, as shown in Fig. 2.
The CLIP model extracts image and sentence embeddings to
supervise SpeechCLIP. Following SUPERB [19], HuBERT’s
CNN output and transformer encoder’s hidden representa-
tions are weighted and summed by a set of learnable weights.
The weights automatically assign importance to each hidden
layer to minimize the overall objective function. Only the newly added components (i.e., everything except HuBERT and CLIP) are learnable during training, which significantly reduces the computational cost and enables a larger batch size for contrastive
pre-training. In the following sections, we introduce two
SpeechCLIP architectures: parallel and cascaded.
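Before moving to the two architectures, here is a minimal sketch of the SUPERB-style learnable weighted sum over HuBERT's hidden layers described above; the module and variable names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Softmax-weighted sum over HuBERT's CNN output and transformer layers,
    with one learnable scalar weight per layer (SUPERB-style)."""

    def __init__(self, num_layers):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: sequence of tensors, each of shape (batch, time, dim)
        stacked = torch.stack(tuple(hidden_states), dim=0)    # (layers, batch, time, dim)
        weights = torch.softmax(self.layer_weights, dim=0)    # learned per-layer importance
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)

# e.g. pooled = LayerWeightedSum(len(audio_layers))(audio_layers)
```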