tion direction.
To address this issue, we propose CSLA, which bridges CLIP and StyleGAN through latent alignment. Specifically, we adopt mappers to map latents from CLIP-space to w-space and s-space. However, there is a knowledge distribution bias problem between CLIP and StyleGAN: the two models disagree about the average object (e.g., the average human face), which can result in failure cases of manipulation, as shown in Fig. 1.
To solve this problem and achieve a more precise mapping, instead of directly mapping the CLIP latent f_I to the w-latent w and the s-latent s, we map the manipulation residual ∆f_I to ∆w and ∆s. While the average latent is the obvious choice of manipulation center in w-space and s-space, the choice of manipulation center in CLIP-space greatly influences the manipulation result. We therefore discuss the effect of different manipulation center selections and propose a temporal relative manipulation center selection strategy for training.
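For clarity, a minimal PyTorch-style sketch of the residual mapping is given below. The module name and layer sizes are illustrative assumptions rather than the exact architecture; the point is only that the mapper consumes a CLIP-space residual relative to a manipulation center, and its output is applied around the average StyleGAN latent.

```python
import torch
import torch.nn as nn

class ResidualMapper(nn.Module):
    # Illustrative sketch: maps a CLIP-space residual to a w-space residual
    # instead of mapping the absolute latent. Layer sizes are assumptions.
    def __init__(self, clip_dim=512, w_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, w_dim),
        )

    def forward(self, f_img, f_center, w_avg):
        delta_f = f_img - f_center   # residual w.r.t. the CLIP-space manipulation center
        delta_w = self.net(delta_f)  # mapped residual in w-space
        return w_avg + delta_w       # applied around StyleGAN's average latent
```

In this sketch, f_center could, for example, be maintained as a running statistic of the CLIP features observed during training; this is one possible instantiation of a temporally relative center, not necessarily the exact update rule used.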
Even with the manipulation center and manipulation direction well aligned, the mapping from CLIP-space to s-space still exhibits biases. Inspired by style mixing, we design a learnable adaptive style mixing module that performs self-modulation in s-space, generating modulating signals to refine ∆s.
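The sketch below illustrates one way such a self-modulation module could be realized, assuming a simple channel-wise gating design; the gating layout and the style dimensionality s_dim are assumptions for illustration only.

```python
import torch.nn as nn

class AdaptiveStyleMixing(nn.Module):
    # Illustrative gating sketch: produces per-channel modulating signals
    # that rescale the mapped s-space residual delta_s (self-modulation).
    def __init__(self, s_dim):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(s_dim, s_dim),
            nn.Sigmoid(),  # modulation signal in (0, 1) per style channel
        )

    def forward(self, delta_s):
        return delta_s * self.gate(delta_s)  # refined s-space residual
```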
After training, the image encoder E_I of CLIP can be used for GAN inversion, and the text encoder E_T of CLIP can embed texts into CLIP-space as f_T. Since f_I and f_T share the same CLIP-space, f_T can also be mapped into w-space or s-space for text-conditioned image generation. Moreover, given two sentences, we can compute the manipulation direction from f_T_src and f_T_trg.
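In code, text-guided manipulation at inference reduces to encoding the two sentences with E_T and taking their difference in CLIP-space. The sketch below uses the public CLIP package; `mapper_s`, `alpha`, `s_inverted`, and the synthesis call are hypothetical placeholders for the trained components described above.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    tokens = clip.tokenize(["a face", "a smiling face"]).to(device)
    f_T_src, f_T_trg = model.encode_text(tokens).float()
    delta_f = f_T_trg - f_T_src   # manipulation direction in CLIP-space
    # delta_s = mapper_s(delta_f)                        # map into s-space (trained mapper)
    # edited = synthesis(s_inverted + alpha * delta_s)   # apply with strength alpha
```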
Qualitative and quantitative experimental results verify the effectiveness of our method. Our contributions are summarized as follows:
• We propose CSLA, a data-free training strategy for arbitrary text-driven manipulation that incurs no extra time cost at inference.
• We identify the knowledge distribution bias problem between CLIP and StyleGAN and address it by mapping latent residuals.
• We propose temporal relative consistency and adaptive style mixing to refine the mapped ∆s.
2. Related Work
2.1. Vision-Language Model
In recent years, vision-language pre-training [14, 26, 29, 36, 47, 48] has become an attractive research topic. As the most prominent study, CLIP [26] learns a cross-modal joint representation of text and image data by training on 400 million text-image pairs collected from the Internet. With its prior knowledge of text-image consistency, it achieves state-of-the-art zero-shot classification, and this prior knowledge also benefits other text-image cross-modal tasks. In this paper, we employ CLIP as an image encoder for latent alignment training.
2.2. Zero-Shot Text-to-Image Generation
Zero-shot text-to-image generation is a conditional generation task with reference text. To achieve it, DALL-E [28] trains a transformer [42] to autoregressively model text and image tokens as a single stream of data. Beyond zero-shot generation, CogView [12] further investigates its potential on diverse downstream tasks such as style learning and super-resolution, and CogView2 attempts faster generation via a hierarchical transformer architecture. Different from these transformer-based methods, DiffusionCLIP [20], GLIDE [23], and DALL-E 2 [27] introduce diffusion models [16, 37, 38] with CLIP [26] guidance and achieve better performance, and many recent works [9, 32, 33, 43] have further developed this research direction. Most of these works suffer from expensive training due to their billions of parameters. In contrast, our method, with only about 8M learnable parameters, achieves text-to-image generation with a much faster training process on a single 2080Ti.
2.3. StyleGAN-based Image Manipulation
StyleGAN has three editable latent spaces (w, w+, s) and high-fidelity generative ability, so several approaches achieve image manipulation in an inversion-then-editing manner. pSp [30] proposes a pyramid encoder to invert an image from RGB space to w+ space. Instead of learning a direct inversion from image to w+ latent, e4e [40] encodes an image into w and a group of residuals that together produce the w+ latent. ReStyle [4] iteratively inverts an image into w+ space by adding predicted w+ residuals and achieves state-of-the-art performance. To manipulate the inverted image latent, most existing non-CLIP methods mine manipulation directions by clustering, e.g., InterFaceGAN [35]. Our method instead derives the manipulation direction from two sentences rather than two sets of images.
2.4. Combination of CLIP and StyleGAN
StyleCLIP [24] is the most influential CLIP-driven StyleGAN manipulation approach, using CLIP to construct its objective. Following StyleCLIP, StyleMC [22] trains mappers to produce ∆s for single-mode StyleGAN manipulation, and StyleGAN-NADA [13] fine-tunes a generator