
with relatively few parameters (4-6 layer MLPs), substantially outperforming CLIP on zero-shot
classification, with significantly less time spent aligning on multimodal data. Our method is inspired
by Locked-image Tuning (LiT), which finetunes a text encoder to align with a frozen pretrained
image encoder on a large paired-data corpus (Zhai et al., 2022). Instead, we consider settings with
limited paired data, such as when the downstream task involves a distribution very different from the
pretraining task or when we simply do not have the time and/or compute resources to train on all
available pairs.
In this setting, we show how aligning pretrained encoders on a much smaller, carefully chosen dataset
can result in better performance at lower cost: our resulting model achieves 76.85% ImageNet zero-shot accuracy (compared with the 75.7% reported by LiT on public data) using 98% less training data and 98.5% less time on alignment. This suggests that collecting a small, high-quality dataset tailored
to a specific downstream task can be significantly more cost- and compute-effective than scraping
noisy data in bulk, in addition to providing better absolute performance. Further, we demonstrate
that this simple approach is competitive even when training data is abundant, matching LiT to within
1.5% in-distribution and 0.5% under distribution shift while training approximately 20% as many
parameters (Fig. 4).
Related Work.
The current state of the art in learning aligned multimodal representations is
Contrastive Language-Image Pretraining (CLIP), which was demonstrated to be feasible at unprecedented scale by Radford et al. (2021). Following this work, most advancements in this space have come primarily from further scaling of the training set (Jia et al., 2021), though a popular alternative
is to simultaneously train the unimodal encoders on both unimodal and paired multimodal data
(Geng et al., 2022). Zhai et al. (2022) demonstrate that using a frozen, pretrained image encoder
results in substantially higher zero-shot accuracy on downstream classification tasks by making use
of better visual representations. They train a large text encoder on the union of two public image-text
datasets (with a total sample size of ~25 million) to align with a large ImageNet-21k pretrained
Vision Transformer (Dosovitskiy et al., 2021). Though effective, this approach still incurs the cost of training the large text encoder and requires a massive amount of training data: their results are achieved by training for 60,000 iterations with a batch size of 16,384. LiT is also prone to overfitting when the training set
being used for alignment is not very large.
2 Method
To implement APE, we encode the paired data using separate pretrained unimodal image and text
encoders and leave the image encoding unchanged. The token encodings of the text sample are
passed through a small MLP (4-6 layers) and then average-pooled across the sequence (see Fig. 3 for a high-level diagram). This does not directly account for token order; we instead rely on the output of
the pretrained text encoder to include any relevant positional information.²
The resulting embeddings
are then normalized and used in the usual contrastive loss (Chen et al., 2020; Radford et al., 2021).
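As a concrete illustration, the sketch below shows one minimal way to implement the alignment head and the standard symmetric contrastive loss in PyTorch. The module name, hidden width, depth, and fixed temperature are illustrative assumptions rather than our exact configuration; in particular, the temperature may instead be learned as in CLIP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentMLP(nn.Module):
    """Small MLP applied token-wise to the frozen text encoder's outputs,
    followed by average pooling across the sequence (names/sizes illustrative)."""
    def __init__(self, dim_in, dim_out, hidden=1024, depth=4):
        super().__init__()
        layers, d = [], dim_in
        for _ in range(depth - 1):
            layers += [nn.Linear(d, hidden), nn.GELU()]
            d = hidden
        layers.append(nn.Linear(d, dim_out))
        self.mlp = nn.Sequential(*layers)

    def forward(self, token_embs):    # (batch, seq_len, dim_in) from the frozen text encoder
        x = self.mlp(token_embs)      # applied independently to every token
        return x.mean(dim=1)          # average pool across the sequence

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss on L2-normalized embeddings
    (Chen et al., 2020; Radford et al., 2021)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Since the image encoding is left unchanged, `img_emb` here is simply the frozen image encoder's output; only the alignment MLP receives gradients.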
The MLP contains 7.5-22.5% of the parameters in the entire text tower, which itself has slightly more parameters than the image encoder. Put differently, LiT trains about half as many parameters as CLIP, and APE trains less than a quarter as many parameters as LiT.
Note that the total number of parameters is greater in APE, as we are learning a small MLP on top
of the pretrained encoders—but APE is less likely to overfit to a small alignment dataset because
it is training a much smaller fraction of these parameters. It is also cheaper to train because it
avoids backpropagating through the large encoders, and some of the inputs can be pre-calculated to
avoid having to load the encoders into memory at all. We found that text augmentations made little difference to final performance, but image augmentations had a sizeable effect, so naively encoding all training data with the frozen encoder can result in sub-optimal downstream accuracy. Identifying
the maximum reusable computation for various data modalities is an important future direction to
investigate.
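As a rough sketch of the pre-calculation mentioned above: because the text encoder is frozen and text augmentations mattered little, its per-token outputs can be cached once and reused across epochs, while images are still augmented and re-encoded online. The `text_encoder` and loader below are hypothetical placeholders rather than part of our released code.

```python
import torch

@torch.no_grad()
def precompute_text_features(text_encoder, token_loader, device="cuda"):
    """One-off pass that caches the frozen text encoder's per-token embeddings,
    so the large text tower never needs to be loaded during alignment training.
    Assumes captions are padded/truncated to a fixed sequence length."""
    text_encoder.eval().to(device)
    cached = []
    for tokens in token_loader:                  # batches of tokenized captions
        feats = text_encoder(tokens.to(device))  # (batch, seq_len, dim), no gradients
        cached.append(feats.cpu())
    return torch.cat(cached)                     # e.g. torch.save(...) and stream from disk later

# Images are instead re-encoded each epoch so that augmentations (random crops,
# flips, ...) are applied before the frozen image encoder, which we found matters.
```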
Additional benefits of small alignment functions.
Because the underlying encoders are frozen,
it is easy to learn alignments for new downstream distributions or modalities (though we do not
² Surprisingly, we found that training an auxiliary transformer actually performed worse than a simple MLP. To test whether our method is making use of positional information in the text encoder output, we also tried directly learning a token embedding lookup table and average-pooling the results. It performs surprisingly well, but still much worse than APE (Fig. 4).
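For concreteness, a minimal sketch of this lookup-table baseline might look as follows (padding/masking details omitted; the class name and signature are illustrative):

```python
import torch.nn as nn

class LookupTableBaseline(nn.Module):
    """Ablation: learn token embeddings from scratch and average-pool them,
    bypassing the pretrained text encoder entirely (padding handling omitted)."""
    def __init__(self, vocab_size, dim_out):
        super().__init__()
        self.table = nn.Embedding(vocab_size, dim_out)

    def forward(self, token_ids):    # (batch, seq_len) integer token ids
        return self.table(token_ids).mean(dim=1)
```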