
SPEECHCLIP: INTEGRATING SPEECH
WITH PRE-TRAINED VISION AND LANGUAGE MODEL
Yi-Jen Shih1, Hsuan-Fu Wang1, Heng-Jui Chang1,2, Layne Berry3, Hung-yi Lee1, David Harwath3
1National Taiwan University
2MIT CSAIL
3The University of Texas at Austin
ABSTRACT
Data-driven speech processing models usually perform well
with a large amount of text supervision, but collecting tran-
scribed speech data is costly. Therefore, we propose Speech-
CLIP, a novel framework bridging speech and text through
images to enhance speech models without transcriptions.
We leverage state-of-the-art pre-trained HuBERT and CLIP,
aligning them via paired images and spoken captions with
minimal fine-tuning. SpeechCLIP outperforms prior state-
of-the-art on image-speech retrieval and performs zero-shot
speech-text retrieval without direct supervision from tran-
scriptions. Moreover, SpeechCLIP can directly retrieve se-
mantically related keywords from speech.
Index Terms—Visual grounding, vision and language,
self-supervised learning
1. INTRODUCTION
Conventional speech processing tasks like speech recog-
nition rely on transcribed speech data and usually require
large labeled datasets to perform well, but transcribing an
enormous amount of speech is expensive.
Therefore, recent studies exploit unlabeled speech to pre-
train models with self-supervised learning (SSL) [1]. Models
learn to predict pseudo targets generated from raw data in
SSL pre-training. Some typical speech SSL methods include
masked reconstruction [2–6], contrastive learning [7–11], clas-
sification [12–14], multi-task learning [15], and knowledge
distillation [16–18]. These methods succeed in a wide range
of speech processing problems [19–21].
Besides SSL methods focusing on a single modality, re-
searchers have proposed leveraging data from other modal-
ities to improve performance on a target modality. E.g., pairing
images with semantically related text or spoken captions
is a typical method since collecting parallel image-text or
image-speech data is fast and inexpensive [22]. Specifically,
paired image-text data can be obtained by crawling images
and captions from the internet. Paired image-speech data can
be collected by uttering text captions or describing images
in speech. This paper uses paired image-speech data and an
image-text pre-trained model to enhance speech SSL models.
Fig. 1: An overview of the proposed SpeechCLIP model, showing the Cascaded and Parallel SpeechCLIP variants, which connect a speech encoder to frozen CLIP image and text encoders and are trained with contrastive losses.
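To make the training signal in Fig. 1 concrete, the sketch below shows a CLIP-style symmetric contrastive (InfoNCE) loss between pooled speech embeddings and embeddings from a frozen CLIP image encoder. It is a minimal illustration, assuming the speech encoder output has already been pooled and projected to CLIP's embedding dimension; the function name, tensor shapes, and temperature value are illustrative rather than an exact recipe.

import torch
import torch.nn.functional as F

def speech_image_contrastive_loss(speech_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE loss between paired speech and image embeddings.

    speech_embs: (B, D) pooled speech-encoder outputs projected to CLIP's
                 embedding dimension (illustrative preprocessing).
    image_embs:  (B, D) outputs of a frozen CLIP image encoder.
    """
    # Cosine similarities between every speech-image pair in the batch.
    speech_embs = F.normalize(speech_embs, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    logits = speech_embs @ image_embs.t() / temperature  # (B, B)

    # Matching speech-image pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2i = F.cross_entropy(logits, targets)      # speech -> image
    loss_i2s = F.cross_entropy(logits.t(), targets)  # image -> speech
    return (loss_s2i + loss_i2s) / 2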
Much effort has been put into using paired images and spoken
captions to help speech processing [24]; such models are usually
called visually grounded speech (VGS) models. VGS mod-
els benefit many applications like speech recognition [25],
word discovery [26], speech generation [27], cross-modal
alignment [22, 28, 29], and multilingual spoken language
processing [30–33]. Most studies pre-train and evaluate
VGS models on image-speech retrieval, demonstrating their
ability to capture the correspondence between images and
speech [34, 35]. E.g., the recent Fast-Slow Transformer for
Visually Grounding Speech (FaST-VGS and FaST-VGS+)
succeeds in many speech processing tasks by utilizing trans-
formers and cross-modal attention mechanisms to perform
image-speech retrieval and semantic tasks [36, 37]. Moreover,
VGS models trained with retrieval objectives can extract se-
mantic and word-level information from speech [38], which
is difficult to achieve by training solely with speech [39].
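These retrieval benchmarks are typically reported as recall at K: each spoken caption queries the gallery of image embeddings (and vice versa), and a query counts as a hit if its paired item ranks among the top K. Below is a minimal sketch of this metric, assuming precomputed, unit-normalized embeddings in which the i-th query is paired with the i-th gallery item; the function name and tensor layout are illustrative.

import torch

def recall_at_k(query_embs, gallery_embs, k=5):
    """Recall@K for paired retrieval, where query i matches gallery item i.

    query_embs:   (N, D) e.g., speech embeddings (unit-normalized).
    gallery_embs: (N, D) e.g., image embeddings (unit-normalized).
    """
    sims = query_embs @ gallery_embs.t()           # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices            # top-K gallery indices per query
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    hits = (topk == targets).any(dim=-1).float()   # 1 if the paired item is in the top K
    return hits.mean().item()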
While many studies obtain semantic information from
speech without transcriptions, some degree of assistance from
text can still be helpful for certain tasks. E.g., recent unsu-
pervised ASR methods rely on nonparallel text data and a
pronunciation lexicon [40, 41]. To circumvent transcriptions
or lexicons, we propose to bridge speech and text domains