APE: Aligning Pretrained Encoders
to Quickly Learn Aligned
Multimodal Representations
Elan Rosenfeld*
Carnegie Mellon University
elan@cmu.edu
Preetum Nakkiran
Apple
Hadi Pouransari
Apple
Oncel Tuzel
Apple
Fartash Faghri
Apple
fartash@apple.com
*Work done while an intern at Apple.
arXiv:2210.03927v1 [cs.LG] 8 Oct 2022
Abstract
Recent advances in learning aligned multimodal representations have been primarily driven by training large neural networks on massive, noisy paired-modality
datasets. In this work, we ask whether it is possible to achieve similar results with
substantially less training time and data. We achieve this by taking advantage
of existing pretrained unimodal encoders and careful curation of alignment data
relevant to the downstream task of interest. We study a natural approach to aligning
existing encoders via small auxiliary functions, and we find that this method is
competitive with (or outperforms) state of the art in many settings while being less
prone to overfitting, less costly to train, and more robust to distribution shift. With
a properly chosen alignment distribution, our method surpasses prior state of the
art for ImageNet zero-shot classification on public data while using two orders of
magnitude less time and data and training 77% fewer parameters.
1 Introduction
How much modality-coupled data and compute is required to learn expressive, well-aligned multimodal representations? The latest advances in learning aligned representations have largely been
driven by the compilation of ever-growing collections of noisy paired data scraped from the web
(Radford et al., 2021; Jia et al., 2021). Trained on these massive multimodal datasets, new models achieve unparalleled performance on downstream tasks such as zero-shot classification, both in- and out-of-distribution (Radford et al., 2021; Hendrycks et al., 2019). Unfortunately, the cost of training
these large models continues to scale in tandem—when little paired data already exists, or when one
wants an aligned representation for a new setting, it is unclear how to avoid the time and expense
of collecting and training on such a large dataset. Moreover, though it is simple to scale up noisy
image-text pair scraping from the web, this is not necessarily the case for different modality couplings
(e.g., audio descriptions of body pose) or more specific applications such as classification for niche
downstream tasks.
In this work, we ask whether it is possible to leverage the power of pretrained unimodal encoders and
a carefully chosen multimodal distribution to learn better aligned image-text representations with
less training time and data. Our proposed approach, Aligning Pretrained Encoders (APE), results in
well aligned, high-quality representations which can be learned orders of magnitude faster. We show
that it is possible to align the representations of frozen pretrained encoders using simple functions
with relatively few parameters (4-6 layer MLPs), substantially outperforming CLIP on zero-shot
classification, with significantly less time spent aligning on multimodal data. Our method is inspired
by Locked-image Tuning (LiT), which finetunes a text encoder to align with a frozen pretrained
image encoder on a large paired-data corpus (Zhai et al., 2022). Instead, we consider settings with
limited paired data, such as when the downstream task involves a distribution very different from the
pretraining task or when we simply do not have the time and/or compute resources to train on all
available pairs.
In this setting, we show how aligning pretrained encoders on a much smaller, carefully chosen dataset
can result in better performance at less cost: our resulting model achieves 76.85% ImageNet zero-shot
accuracy—as compared with 75.7% reported by LiT on public data—using 98% less training data and 98.5% less time on alignment. This suggests that collecting a small, high-quality dataset tailored
to a specific downstream task can be significantly more cost- and compute-effective than scraping
noisy data in bulk, in addition to providing better absolute performance. Further, we demonstrate
that this simple approach is competitive even when training data is abundant, matching LiT to within
1.5% in-distribution and 0.5% under distribution shift while training approximately 20% as many
parameters (Fig. 4).
Related Work.
The current state of the art in learning aligned multimodal representations is
Contrastive Language-Image Pretraining (CLIP), which was demonstrated to be feasible at unprecedented scale by Radford et al. (2021). Following this work, most advancements in this space have been driven primarily by further scaling of the training set (Jia et al., 2021), though a popular alternative is to simultaneously train the unimodal encoders on both unimodal and paired multimodal data (Geng et al., 2022). Zhai et al. (2022) demonstrate that using a frozen, pretrained image encoder
results in substantially higher zero-shot accuracy on downstream classification tasks by making use
of better visual representations. They train a large text encoder on the union of two public image-text
datasets (with a total sample size of ~25 million) to align with a large ImageNet-21k pretrained
Vision Transformer (Dosovitskiy et al., 2021). Though effective, this approach still incurs the cost of training the large text encoder and relies on a massive amount of training data—their results are achieved by training for 60,000 iterations with a batch size of 16,384. LiT is also prone to overfitting when the training set
being used for alignment is not very large.
2 Method
To implement APE, we encode the paired data using separate pretrained unimodal image and text
encoders and leave the image encoding unchanged. The token encodings of the text sample are
passed through a small MLP (4-6 layers) and then average-pooled across the sequence (see Fig. 3 for a high-level diagram). This does not directly account for token order; we instead rely on the output of the pretrained text encoder to include any relevant positional information.² The resulting embeddings are then normalized and used in the usual contrastive loss (Chen et al., 2020; Radford et al., 2021).
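A minimal PyTorch sketch of this recipe is given below, assuming the frozen encoders provide an L2-normalized image embedding and per-token text features; the class and function names, hidden width, activation, and fixed temperature are illustrative choices rather than the paper's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextAlignmentHead(nn.Module):
        """Small MLP over frozen per-token text features, average-pooled into one
        embedding (a sketch; layer count and width are illustrative)."""
        def __init__(self, text_dim: int, embed_dim: int, hidden_dim: int = 2048, n_layers: int = 4):
            super().__init__()
            layers, d = [], text_dim
            for _ in range(n_layers - 1):
                layers += [nn.Linear(d, hidden_dim), nn.GELU()]
                d = hidden_dim
            layers.append(nn.Linear(d, embed_dim))
            self.mlp = nn.Sequential(*layers)

        def forward(self, token_feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
            # token_feats: (B, L, text_dim) from the frozen text encoder; mask: (B, L) valid-token mask
            h = self.mlp(token_feats)                                   # (B, L, embed_dim)
            m = mask.to(h.dtype).unsqueeze(-1)
            pooled = (h * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)   # average pool over valid tokens
            return F.normalize(pooled, dim=-1)

    def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss over a batch of paired, L2-normalized embeddings."""
        logits = img_emb @ txt_emb.t() / temperature                    # (B, B) cosine similarities
        targets = torch.arange(img_emb.size(0), device=img_emb.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

Only the head's parameters are updated during alignment; the pretrained encoders act purely as fixed feature extractors.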
The MLP contains 7.5-22.5% of the number of parameters in the entire text tower, which itself has slightly more parameters than the image encoder. More directly, LiT trains about half as many parameters as CLIP, and APE trains less than a quarter as many parameters as LiT.
Note that the total number of parameters is greater in APE, as we are learning a small MLP on top
of the pretrained encoders—but APE is less likely to overfit to a small alignment dataset because
it is training a much smaller fraction of these parameters. It is also cheaper to train because it
avoids backpropagating through the large encoders, and some of the inputs can be pre-calculated to
avoid having to load the encoders into memory at all. We found that text augmentations made little
difference to final performance, but image augmentations had a sizeable effect, so naively encoding
all training data with the frozen encoder can result in sub-optimal downstream accuracy. Identifying
the maximum reusable computation for various data modalities is an important future direction to
investigate.
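To make the pre-computation point concrete, the hypothetical helpers below cache the frozen text features once and re-encode only the freshly augmented images at each step, reusing the alignment head and loss sketched above; the data layout and encoder call signatures are assumptions, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def cache_text_features(text_encoder, text_loader, device="cuda"):
        """Run the frozen text encoder once over all captions and keep per-token
        features on CPU, so the text encoder need not be loaded again during
        alignment training (hypothetical helper)."""
        text_encoder.eval().to(device)
        feats, masks = [], []
        for tokens, mask in text_loader:                     # batches of tokenized captions
            feats.append(text_encoder(tokens.to(device)).cpu())
            masks.append(mask.cpu())
        return torch.cat(feats), torch.cat(masks)

    def alignment_step(image_encoder, head, optimizer, images_aug, text_feats, text_mask, device="cuda"):
        """One training step: augmented images are re-encoded without gradients,
        while cached text features feed the small trainable head."""
        with torch.no_grad():                                # the image encoder stays frozen
            img_emb = F.normalize(image_encoder(images_aug.to(device)), dim=-1)
        txt_emb = head(text_feats.to(device), text_mask.to(device))
        loss = clip_style_loss(img_emb, txt_emb)             # contrastive loss from the sketch above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()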
Additional benefits of small alignment functions.
Because the underlying encoders are frozen,
it is easy to learn alignments for new downstream distributions or modalities (though we do not
² Surprisingly, we found that training an auxiliary transformer actually performed worse than a simple MLP. To test that our method is making use of positional info in the text encoder output, we also tried directly learning a token embedding lookup table and average-pooling the results. It performs surprisingly well, but still much worse than APE (Fig. 4).
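Read literally, that baseline amounts to a learned bag-of-tokens text encoder; a rough sketch follows, with vocabulary size, dimensions, and names illustrative rather than taken from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BagOfTokensBaseline(nn.Module):
        """Learn one embedding per vocabulary token and average-pool over the caption,
        bypassing the pretrained text encoder entirely (sketch of the ablation)."""
        def __init__(self, vocab_size: int, embed_dim: int):
            super().__init__()
            self.table = nn.Embedding(vocab_size, embed_dim)

        def forward(self, token_ids: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
            # token_ids: (B, L) integer ids; mask: (B, L) valid-token mask
            e = self.table(token_ids)
            m = mask.to(e.dtype).unsqueeze(-1)
            pooled = (e * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
            return F.normalize(pooled, dim=-1)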