
with relatively few parameters (4-6 layer MLPs), substantially outperforming CLIP on zero-shot
classification, with significantly less time spent aligning on multimodal data. Our method is inspired
by Locked-image Tuning (LiT), which finetunes a text encoder to align with a frozen pretrained
image encoder on a large paired-data corpus (Zhai et al., 2022). Instead, we consider settings with
limited paired data, such as when the downstream task involves a distribution very different from the
pretraining task or when we simply do not have the time and/or compute resources to train on all
available pairs.
In this setting, we show how aligning pretrained encoders on a much smaller, carefully chosen dataset
can result in better performance at lower cost: our resulting model achieves 76.85% ImageNet zero-shot accuracy (compared with the 75.7% reported by LiT on public data) using 98% less training data and 98.5% less time on alignment. This suggests that collecting a small, high-quality dataset tailored
to a specific downstream task can be significantly more cost- and compute-effective than scraping
noisy data in bulk, in addition to providing better absolute performance. Further, we demonstrate
that this simple approach is competitive even when training data is abundant, matching LiT to within
1.5% in-distribution and 0.5% under distribution shift while training approximately 20% as many
parameters (Fig. 4).
Related Work.
The current state of the art in learning aligned multimodal representations is
Contrastive Language-Image Pretraining (CLIP), which was demonstrated to be feasible at unprecedented scale by Radford et al. (2021). Following this work, most advancements in this space have come primarily from further scaling of the training set (Jia et al., 2021), though a popular alternative
is to simultaneously train the unimodal encoders on both unimodal and paired multimodal data
(Geng et al., 2022). Zhai et al. (2022) demonstrate that using a frozen, pretrained image encoder
results in substantially higher zero-shot accuracy on downstream classification tasks by making use
of better visual representations. They train a large text encoder on the union of two public image-text
datasets (with a total sample size of ~25 million) to align with a large ImageNet-21k pretrained
Vision Transformer (Dosovitskiy et al., 2021). Though effective, this approach still incurs the cost of training the large text encoder and requires a massive amount of training data: their results are achieved by training for 60,000 iterations with a batch size of 16,384. LiT is also prone to overfitting when the training set
being used for alignment is not very large.
2 Method
To implement APE, we encode the paired data using separate pretrained unimodal image and text
encoders and leave the image encoding unchanged. The token encodings of the text sample are
passed through a small MLP (4-6 layers) and then average-pooled across the sequence (see Fig. 3 for a high-level diagram). This does not directly account for token order; we instead rely on the output of
the pretrained text encoder to include any relevant positional information.²
The resulting embeddings
are then normalized and used in the usual contrastive loss (Chen et al., 2020; Radford et al., 2021).
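As a concrete illustration, the sketch below shows one minimal way to implement the alignment head and the standard symmetric contrastive loss in PyTorch. The module name, hidden width, depth, and fixed temperature are illustrative assumptions rather than our exact configuration; in particular, the temperature may instead be learned as in CLIP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentMLP(nn.Module):
    """Small MLP applied token-wise to the frozen text encoder's outputs,
    followed by average pooling across the sequence (names/sizes illustrative)."""
    def __init__(self, dim_in, dim_out, hidden=1024, depth=4):
        super().__init__()
        layers, d = [], dim_in
        for _ in range(depth - 1):
            layers += [nn.Linear(d, hidden), nn.GELU()]
            d = hidden
        layers.append(nn.Linear(d, dim_out))
        self.mlp = nn.Sequential(*layers)

    def forward(self, token_embs):    # (batch, seq_len, dim_in) from the frozen text encoder
        x = self.mlp(token_embs)      # applied independently to every token
        return x.mean(dim=1)          # average pool across the sequence

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss on L2-normalized embeddings
    (Chen et al., 2020; Radford et al., 2021)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Since the image encoding is left unchanged, `img_emb` here is simply the frozen image encoder's output; only the alignment MLP receives gradients.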
The MLP contains 7.5-22.5% of the parameters in the entire text tower, which itself has slightly more parameters than the image encoder. Put differently, LiT trains about half as many parameters as CLIP, and APE trains less than a quarter as many parameters as LiT.
Note that the total number of parameters is greater in APE, as we are learning a small MLP on top
of the pretrained encoders—but APE is less likely to overfit to a small alignment dataset because
it is training a much smaller fraction of these parameters. It is also cheaper to train because it
avoids backpropagating through the large encoders, and some of the inputs can be pre-calculated to
avoid having to load the encoders into memory at all. We found that text augmentations made little difference to final performance, but image augmentations had a sizeable effect, so naively encoding all training data with the frozen encoder can result in sub-optimal downstream accuracy. Identifying
the maximum reusable computation for various data modalities is an important future direction to
investigate.
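As a rough sketch of the pre-calculation mentioned above: because the text encoder is frozen and text augmentations mattered little, its per-token outputs can be cached once and reused across epochs, while images are still augmented and re-encoded online. The `text_encoder` and loader below are hypothetical placeholders rather than part of our released code.

```python
import torch

@torch.no_grad()
def precompute_text_features(text_encoder, token_loader, device="cuda"):
    """One-off pass that caches the frozen text encoder's per-token embeddings,
    so the large text tower never needs to be loaded during alignment training.
    Assumes captions are padded/truncated to a fixed sequence length."""
    text_encoder.eval().to(device)
    cached = []
    for tokens in token_loader:                  # batches of tokenized captions
        feats = text_encoder(tokens.to(device))  # (batch, seq_len, dim), no gradients
        cached.append(feats.cpu())
    return torch.cat(cached)                     # e.g. torch.save(...) and stream from disk later

# Images are instead re-encoded each epoch so that augmentations (random crops,
# flips, ...) are applied before the frozen image encoder, which we found matters.
```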
Additional benefits of small alignment functions.
Because the underlying encoders are frozen,
it is easy to learn alignments for new downstream distributions or modalities (though we do not
² Surprisingly, we found that training an auxiliary transformer actually performed worse than a simple MLP. To test whether our method is making use of positional information in the text encoder output, we also tried directly learning a token embedding lookup table and average-pooling the results. It performs surprisingly well, but still much worse than APE (Fig. 4).
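For concreteness, a minimal sketch of this lookup-table baseline might look as follows (padding/masking details omitted; the class name and signature are illustrative):

```python
import torch.nn as nn

class LookupTableBaseline(nn.Module):
    """Ablation: learn token embeddings from scratch and average-pool them,
    bypassing the pretrained text encoder entirely (padding handling omitted)."""
    def __init__(self, vocab_size, dim_out):
        super().__init__()
        self.table = nn.Embedding(vocab_size, dim_out)

    def forward(self, token_ids):    # (batch, seq_len) integer token ids
        return self.table(token_ids).mean(dim=1)
```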