
Bridging CLIP and StyleGAN through Latent Alignment for Image Editing
Wanfeng Zheng
Beijing University of Posts and Telecommunications
zhengwanfeng@bupt.edu.cn
Qiang Li * (corresponding author)
Kuaishou Technology
liqiang03@kuaishou.com
Xiaoyan Guo
Kuaishou Technology
guoxiaoyan@kuaishou.com
Pengfei Wan
Kuaishou Technology
wanpengfei@kuaishou.com
Zhongyuan Wang
Kuaishou Technology
wangzhongyuan@kuaishou.com
Abstract
Text-driven image manipulation has been developing since the vision-language model (CLIP) was proposed. Previous work has adopted CLIP to design text-image consistency-based objectives to address this task. However, these methods require either test-time optimization or image feature cluster analysis to obtain a single-mode manipulation direction. In this paper, we manage to achieve inference-time optimization-free diverse manipulation direction mining by bridging CLIP and StyleGAN through Latent Alignment (CSLA). More specifically, our efforts consist of three parts: 1) a data-free training strategy to train latent mappers to bridge the latent spaces of CLIP and StyleGAN; 2) for more precise mapping, temporal relative consistency is proposed to address the knowledge distribution bias problem among different latent spaces; 3) to refine the mapped latent in s-space, adaptive style mixing is also proposed. With this mapping scheme, we can achieve GAN inversion, text-to-image generation and text-driven image manipulation. Qualitative and quantitative comparisons are made to demonstrate the effectiveness of our method.
1. Introduction
Generative adversarial networks [15] are designed for image synthesis. Among several GAN variants, StyleGAN [18,19] has shown superior properties on generative tasks such as human face generation and animal face generation, and its generative prior has been utilized in many downstream tasks [1,3,8,24,30]. StyleGAN-based image manipulation is one of the most important subtopics, which refers to editing a realistic or synthesized image through its w, w+ or s latent.
[Figure 1: columns show the source image (src) and the results of the ∆s mapper and the s mapper; prompt "a person".]
Figure 1. Demonstration of the knowledge bias problem. Images in the first row are generated from "a person", which logically represents the face knowledge distribution center in CLIP-space. It is obvious that the generated image of the ∆s mapper is closer to the face corresponding to StyleGAN's knowledge distribution center than that of the s mapper, which indicates the former is better aligned. Images in the second row are text-driven manipulations on age. The manipulation result of the s mapper looks like a man rather than the woman in the source image, which means the manipulation direction is perturbed by the distribution bias.

Due to the progress of the vision-language model CLIP
[26], image and text data can be embedded into one shared
latent space, which has enabled many text-image cross-
modal approaches. StyleCLIP [24] has proposed three different methods based on text-image consistency for text-driven image manipulation: 1) latent optimization edits an image by optimizing its latent feature in w+ space; 2) a latent mapper is trained for a certain manipulation direction and generates residuals for the w+ latent; 3) global direction mining explores manipulation directions in s space via a statistical strategy. These methods suffer from a lack of inference-time efficiency, because extra costs are inevitable for editing a certain image or mining a novel manipulation direction.
To address this issue, we propose our method, CSLA, which bridges CLIP and StyleGAN through latent alignment. Specifically, we adopt mappers to map latents from CLIP-space to w-space and s-space. However, there is a knowledge distribution bias problem between CLIP and StyleGAN: the two models have divergent knowledge about the average object (e.g., the average human face), which can result in bad manipulation cases as shown in Fig. 1.
To solve this problem and achieve more precise mapping, instead of directly mapping the CLIP latent f_I to the w latent w and the s latent s, we map the manipulation residuals ∆f_I to ∆w and ∆s. While it is natural to select the average latent as the manipulation center in w-space and s-space, the selection of the manipulation center in CLIP-space can greatly influence the manipulation result. Therefore, we discuss the effect of different manipulation center selections and propose a temporal relative manipulation center selection strategy for training.
Despite the alignment of the manipulation center and manipulation direction, mapping from CLIP-space to s-space still has biases. Inspired by style mixing, we design a learnable adaptive style mixing module to achieve self-modulation in s-space, which generates modulating signals to refine ∆s.
After training, the image encoder E_I of CLIP can be used for GAN inversion. The text encoder E_T of CLIP can embed texts into CLIP-space as f_T. Since the CLIP-space is shared by f_I and f_T, f_T can also be mapped into w-space or s-space for text-conditioned image generation. Besides, given two sentences, we can compute the manipulation direction with f_T_src and f_T_trg.
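As a rough illustration of this inference-time usage, the following PyTorch-style sketch derives an s-space manipulation direction from a source and a target sentence. The helpers text_to_delta_s and manipulation_direction, the mapper objects fc_w/fc_s_list, and the CLIP-space center f_base are hypothetical names for the components described above, and forming the direction as the difference of the two mapped residuals is our assumption rather than a formula stated in this excerpt.

```python
import torch
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)  # only the text encoder is used here


def text_to_delta_s(prompt, f_base, fc_w, fc_s_list):
    """Map one sentence to residual style codes (hypothetical helper)."""
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        f_t = clip_model.encode_text(tokens).float()  # text embedding f_T
    delta_f = f_t - f_base                            # residual w.r.t. the CLIP-space center
    delta_w = fc_w(delta_f)                           # CLIP residual -> w residual
    return [fc_s(delta_w) for fc_s in fc_s_list]      # per-layer s residuals


def manipulation_direction(src_prompt, trg_prompt, f_base, fc_w, fc_s_list):
    """Direction between two sentences, e.g. ("a person", "an old person")."""
    d_src = text_to_delta_s(src_prompt, f_base, fc_w, fc_s_list)
    d_trg = text_to_delta_s(trg_prompt, f_base, fc_w, fc_s_list)
    return [t - s for s, t in zip(d_src, d_trg)]      # assumed: difference of mapped residuals
```

The resulting per-layer offsets could then be scaled and added to the s codes of the image being edited, though the exact blending strategy is not specified in this excerpt.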
Experiments have produced qualitative and quantitative
results that can verify our method. Our contributions are
summarized as follows:
• A data-free training strategy named CSLA is proposed for arbitrary text-driven manipulation without extra time consumption at inference time.
• We have noticed the knowledge distribution bias problem between CLIP and StyleGAN and solved this problem by mapping the latent residual.
• We have proposed temporal relative consistency and adaptive style mixing to refine the mapped ∆s.
2. Related work
2.1. Vision-Language Model
In recent years, vision-language pre-training [14,26,29,36,47,48] has become an attractive research topic. As the most significant study, CLIP [26] has achieved cross-modal joint representation between text and image data by training on 400 million text-image pairs from the Internet. With its prior text-image consistency knowledge, it achieves SOTA performance on zero-shot classification, and its prior knowledge also promotes other text-image cross-modal tasks. In this paper, we employ CLIP as an image encoder for latent alignment training.
2.2. Zero-Shot Text-to-Image Generation
Zero-shot text-to-image generation is a conditional generation task with reference text. To achieve it, DALL-E [28] trains a transformer [42] to autoregressively model the text and image tokens as a single stream of data. Beyond zero-shot generation, CogView [12] has further investigated its potential on diverse downstream tasks such as style learning and super-resolution. CogView2 has made attempts at faster generation via a hierarchical transformer architecture. Different from those transformer-based methods, DiffusionCLIP [20], GLIDE [23] and DALL-E 2 [27] have introduced diffusion models [16,37,38] with CLIP [26] guidance and achieved better performance. Since then, many recent works [9,32,33,43] have greatly developed this research direction. Most of these works suffer from expensive training consumption due to their billions of parameters. Meanwhile, our method, with only about 8M learnable parameters, achieves text-to-image generation with a much faster training process than these methods on a single 2080Ti.
2.3. StyleGAN-based Image Manipulation
StyleGAN has three editable latent spaces (w, w+, s) and high-fidelity generative ability. Therefore, some approaches manage to achieve image manipulation through an inversion-editing manner. pSp [30] has proposed a pyramid encoder to invert an image from RGB space to w+ space. Instead of learning inversion from an image to the w+ latent, e4e [40] encodes an image into w and a group of residuals to produce the w+ latent. Besides, ReStyle [4] iteratively inverts an image into w+ space by adding the predicted w+ residual and achieves SOTA performance. To manipulate the inverted image latent, most existing non-CLIP manipulation direction mining methods work in a clustering manner, e.g., InterFaceGAN [35]. Our method manipulates images with two sentences instead of two sets of images.
2.4. Combination of CLIP and StyleGAN
StyleCLIP [24] is the most influential CLIP-driven StyleGAN manipulation approach, using CLIP to construct its objective. Following StyleCLIP, StyleMC [22] has trained mappers to produce ∆s for single-mode StyleGAN manipulation. And StyleGAN-NADA [13] finetunes a generator with text guidance.
[Figure 2 diagrams: (a) CLIP for loss; (b) CLIP as classifier; (c) CLIP as encoder.]
Figure 2. Demonstration of three major applications of CLIP. The red dotted line represents the data flow of backward propagation. The black dotted line means "inference-only". As shown in the figure, the first group of works adopts CLIP to construct their loss functions. Another group of works analyzes ∆w by zero-shot classification with CLIP. Besides, some works use CLIP as the image encoder and replace the image embedding with the text embedding at inference time to achieve language-free training and text-driven manipulation.
[Figure 3 diagram: a pre-trained StyleGAN2 generator G and the CLIP encoders {E_I, E_T} are fixed (E_T used at inference only); learnable FC mappers (fc ×8) map the residual CLIP feature to ∆w and ∆s_1, ..., ∆s_n with adaptive style mixing, supervised by loss_w and loss_s against the sampled w and s_1, ..., s_n.]
Figure 3. Demonstration of the training process of our method. Synthesized images with their corresponding latents in w-space and s-space are sampled by a pre-trained StyleGAN. Then the image is passed through the image encoder of CLIP, and the base feature in CLIP-space is subtracted from it. The residual CLIP feature is then mapped into ∆w and ∆s = {∆s_1, ∆s_2, ..., ∆s_n}, which enables the consistency objectives to train these alignment mappers. During inference, the image encoder of CLIP is replaced by the text encoder. Therefore, we can achieve text-driven manipulation direction mining.
Essence Transfer [5] has also adopted CLIP for its objective to achieve image-based essence transfer by optimizing latents of StyleGAN. Another work, Counterfactual [46], has focused on counterfactual manipulation by learning to produce manipulation residuals on w latents. Meanwhile, ContraCLIP [41] has made attempts to address the problem of discovering non-linear interpretable paths in the latent space of pre-trained GANs in a model-agnostic manner. Different from these works, CLIP2StyleGAN [2] has adopted CLIP as a classifier to tag a large set of real or synthesized images, so that it can better disentangle the w-space to achieve exact manipulation. Besides, some other works like HairCLIP [44] have adopted CLIP as their image encoder, and our method also utilizes CLIP in this way, as shown in Fig. 2. Among this kind of methods, AnyFace [39] is the most similar to ours, but it requires paired language-image training data and thus cannot support manipulation in arbitrary data domains. Meanwhile, our method combines CLIP and StyleGAN in an unsupervised and training-data-free manner so that we can achieve diverse text-driven image manipulation more efficiently.
Figure 4. Comparison of single-center reference (a, left) and multi-center reference (b, right). Rather than a single center, the multi-center scheme has more references, which leads to more precise latent orientation in w-space. Therefore, the multi-center training strategy results in preferable latent alignment.
A concurrent work, CLIP2Latent [25], has also developed latent mapping for text-to-image generation. There are several distinctions: 1) they adopt a diffusion model for latent mapping while ours uses an MLP, which is faster; 2) their method maps to the w latent while ours maps to the s latent for better disentangled manipulation; 3) we propose temporal relative consistency and adaptive style mixing to refine the mapping.
3. Method
In this section, the description of our method has been
divided into five parts: 1) the overall architecture of our
method; 2) the selection of manipulation center in CLIP-
space and our proposed temporal relative consistency; 3)
the adaptive style mixing strategy; 4) the objective of our
method; 5) how our method works at inference time.
3.1. Overall Architecture of Latent Alignment
The overall architecture of our method is shown in Fig. 3. Our major purpose is to train mappers to align latent spaces. To achieve this efficiently, we manage the training process through a latent distillation strategy. During training, we sample w and s latents with their corresponding RGB image I from a pre-trained StyleGAN:

$$w, s, I = G(z), \quad z \sim \mathcal{N}(0, 1). \tag{1}$$
Then the image is encoded into a CLIP latent to compute the residual latent ∆f_CLIP with respect to a preset CLIP-space manipulation center f_base, which is fed into fully connected (FC) mappers to compute ∆w and ∆s:

$$\Delta f_{CLIP} = E_I(I) - f_{base}, \qquad \Delta w = FC_w(\Delta f_{CLIP}), \qquad \Delta s = \{FC_{s_i}(\Delta w)\}, \; i \in \{1, 2, \dots, n\}. \tag{2}$$
Since ∆w_trg and ∆s_trg can be computed by subtracting the preset manipulation center latents w_base and s_base from the original w and s, our mappers can be trained by forcing ∆w and ∆s to fit ∆w_trg and ∆s_trg:

$$\Delta w_{trg} = w - w_{base}, \qquad \Delta s_{trg} = \{s_i - s_{base_i}\}, \; i \in \{1, 2, \dots, n\}. \tag{3}$$

[Figure 5 diagram: ∆w is passed through small MLPs and a sigmoid to produce a multiplicative and an additive term applied to each ∆s_i.]
Figure 5. The architecture of the proposed adaptive style mixing. The tiny network produces two modification elements to modify each ∆s latent.
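Section 3.3 is not part of this excerpt, so the following is only a minimal sketch of what Fig. 5 suggests: ∆w is fed through tiny MLP heads, a sigmoid produces a multiplicative gate, and a second head produces an additive term for each ∆s_i. The two-layer heads, layer widths, and per-code heads are assumptions, not the authors' released design.

```python
import torch
import torch.nn as nn


class AdaptiveStyleMixing(nn.Module):
    """Sketch of the adaptive style mixing module suggested by Fig. 5.

    For every style code delta_s_i it predicts a gate in (0, 1) and an offset from
    delta_w, i.e. refined_s_i = gate_i * delta_s_i + offset_i. Dimensions are placeholders.
    """

    def __init__(self, w_dim=512, s_dims=(512, 512, 256)):
        super().__init__()
        self.gate_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(w_dim, w_dim), nn.ReLU(), nn.Linear(w_dim, d))
            for d in s_dims
        )
        self.shift_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(w_dim, w_dim), nn.ReLU(), nn.Linear(w_dim, d))
            for d in s_dims
        )

    def forward(self, delta_w, delta_s_list):
        refined = []
        for gate_head, shift_head, delta_s in zip(self.gate_heads, self.shift_heads, delta_s_list):
            gate = torch.sigmoid(gate_head(delta_w))  # "*" branch in Fig. 5
            shift = shift_head(delta_w)               # "+" branch in Fig. 5
            refined.append(gate * delta_s + shift)
        return refined
```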
With this design, StyleGAN is forward-propagated only once per iteration, which saves about 50% of the cost compared with distillation in RGB space.
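To make the latent distillation step concrete, here is a schematic PyTorch training iteration for Eqs. (1)-(3). It assumes that G exposes mapping and synthesis stages and can return the per-layer style codes, that CLIP preprocessing is folded into E_I, and that the consistency losses loss_w and loss_s are simple L2 terms; none of these interface names or loss choices come from a released codebase.

```python
import torch
import torch.nn.functional as F


def training_step(G, E_I, fc_w, fc_s_list, f_base, w_base, s_base_list,
                  optimizer, batch_size=8, device="cuda"):
    """One latent-distillation iteration (sketch of Eqs. 1-3)."""
    with torch.no_grad():
        z = torch.randn(batch_size, 512, device=device)     # z ~ N(0, 1)
        w = G.mapping(z)                                     # hypothetical mapping call
        image, s_list = G.synthesis(w, return_styles=True)   # hypothetical: RGB image and s codes
        f_clip = E_I(image).float()                          # CLIP embedding (preprocessing assumed inside E_I)

    delta_f = f_clip - f_base                                # residual in CLIP-space
    delta_w = fc_w(delta_f)                                  # predicted w residual
    delta_s = [fc_s(delta_w) for fc_s in fc_s_list]          # predicted s residuals

    # Targets: residuals of the sampled latents w.r.t. the manipulation centers (Eq. 3).
    delta_w_trg = w - w_base
    delta_s_trg = [s - s_b for s, s_b in zip(s_list, s_base_list)]

    loss_w = F.mse_loss(delta_w, delta_w_trg)                # assumed L2 consistency loss
    loss_s = sum(F.mse_loss(p, t) for p, t in zip(delta_s, delta_s_trg))
    loss = loss_w + loss_s

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the generator and CLIP encoder run only under torch.no_grad(), reflecting the single StyleGAN forward pass per iteration mentioned above.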
The network architecture of our mappers is the same as the mapping network of StyleGAN, because its effectiveness has been proven and we consider extra architectural designs unnecessary for the mapper. We have also removed the pixel normalization to avoid losing the norm information of the latents.
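Reading "the same architecture as StyleGAN" as its 8-layer fully connected mapping network (the "fc ×8" annotation in Fig. 3) with the pixel normalization removed, a mapper could look like the sketch below; the activation choice and layer widths are assumptions.

```python
import torch.nn as nn


def make_mapper(in_dim=512, hidden_dim=512, out_dim=512, num_layers=8):
    """MLP mapper mirroring StyleGAN's mapping network, with pixel normalization removed."""
    dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [out_dim]
    layers = []
    for i in range(num_layers):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < num_layers - 1:
            layers.append(nn.LeakyReLU(0.2))  # StyleGAN-style leaky ReLU between layers
    return nn.Sequential(*layers)
```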
3.2. Temporal Relative Consistency
To address the knowledge distribution bias problem discussed in Sec. 1, we map latent residuals. Therefore, the selection of the manipulation centers f_base, w_base and s_base is crucial for latent alignment. It is reasonable to select w_avg and its corresponding s_avg for w-space and s-space; the only question is the choice for CLIP-space. We have made attempts with at least four selections: 1) the average center: the CLIP feature encoded from the image Î corresponding to the average latent ŵ of StyleGAN (a minimal sketch of this computation follows after this list); 2) the text center embedded from "a picture of [class]", where [class] is the StyleGAN's specialty; 3) an exponential moving average (EMA) center computed during training; 4) a learnable manipulation center.
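For the first option, a minimal sketch of how the average center could be computed is shown below. The attribute G.w_avg, the G.synthesis call, and the tensor-level CLIP preprocessing transform are placeholders for whatever interface the pre-trained generator and CLIP wrapper expose.

```python
import torch


@torch.no_grad()
def compute_average_center(G, clip_model, clip_preprocess_tensor, device="cuda"):
    """CLIP-space center f_base = E_I(I_hat), where I_hat is generated from the average latent."""
    w_avg = G.w_avg.to(device).unsqueeze(0)         # average w latent of StyleGAN (placeholder attribute)
    image = G.synthesis(w_avg)                      # hypothetical synthesis call -> RGB tensor
    image = clip_preprocess_tensor(image)           # resize/normalize for CLIP (placeholder transform)
    return clip_model.encode_image(image).float()   # the average center, used as f_base
```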
Having made a series of attempts, we have found that the average center outperforms the other three configurations. The text center lacks stability because it can be influenced by the original text description. It is mathematically explicable that the EMA center has the same performance as the average center. And the learnable center can be considered