tion direction.
To address this issue, we propose CSLA, which bridges CLIP and StyleGAN through latent alignment. Specifically, we adopt mappers to map latents from CLIP-space to w-space and s-space. However, there is a knowledge distribution bias problem between CLIP and StyleGAN: the two models disagree about the average object (e.g., the average human face), which can result in failure cases of manipulation, as shown in Fig. 1.
To solve this problem and achieve a more precise mapping, instead of directly mapping the CLIP latent f_I to the w-latent w and the s-latent s, we map the manipulation residual ∆f_I to ∆w and ∆s. While the average latent is the obvious choice of manipulation center in w-space and s-space, the choice of manipulation center in CLIP-space greatly influences the manipulation result. We therefore discuss the effect of different manipulation center selections and propose a temporal relative manipulation center selection strategy for training.
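For clarity, a minimal PyTorch-style sketch of the residual mapping is given below. The module name and layer sizes are illustrative assumptions rather than the exact architecture; the point is only that the mapper consumes a CLIP-space residual relative to a manipulation center, and its output is applied around the average StyleGAN latent.

```python
import torch
import torch.nn as nn

class ResidualMapper(nn.Module):
    # Illustrative sketch: maps a CLIP-space residual to a w-space residual
    # instead of mapping the absolute latent. Layer sizes are assumptions.
    def __init__(self, clip_dim=512, w_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, w_dim),
        )

    def forward(self, f_img, f_center, w_avg):
        delta_f = f_img - f_center   # residual w.r.t. the CLIP-space manipulation center
        delta_w = self.net(delta_f)  # mapped residual in w-space
        return w_avg + delta_w       # applied around StyleGAN's average latent
```

In this sketch, f_center could, for example, be maintained as a running statistic of the CLIP features observed during training; this is one possible instantiation of a temporally relative center, not necessarily the exact update rule used.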
Even with the manipulation center and manipulation direction well aligned, the mapping from CLIP-space to s-space still exhibits biases. Inspired by style mixing, we design a learnable adaptive style mixing module that performs self-modulation in s-space, generating modulating signals to refine ∆s.
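The sketch below illustrates one way such a self-modulation module could be realized, assuming a simple channel-wise gating design; the gating layout and the style dimensionality s_dim are assumptions for illustration only.

```python
import torch.nn as nn

class AdaptiveStyleMixing(nn.Module):
    # Illustrative gating sketch: produces per-channel modulating signals
    # that rescale the mapped s-space residual delta_s (self-modulation).
    def __init__(self, s_dim):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(s_dim, s_dim),
            nn.Sigmoid(),  # modulation signal in (0, 1) per style channel
        )

    def forward(self, delta_s):
        return delta_s * self.gate(delta_s)  # refined s-space residual
```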
After training, the image encoder E_I of CLIP can be used for GAN inversion, and the text encoder E_T of CLIP can embed texts into CLIP-space as f_T. Since f_I and f_T share the same CLIP-space, f_T can also be mapped into w-space or s-space for text-conditioned image generation. Moreover, given two sentences, we can compute the manipulation direction from f_T_src and f_T_trg.
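In code, text-guided manipulation at inference reduces to encoding the two sentences with E_T and taking their difference in CLIP-space. The sketch below uses the public CLIP package; `mapper_s`, `alpha`, `s_inverted`, and the synthesis call are hypothetical placeholders for the trained components described above.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    tokens = clip.tokenize(["a face", "a smiling face"]).to(device)
    f_T_src, f_T_trg = model.encode_text(tokens).float()
    delta_f = f_T_trg - f_T_src   # manipulation direction in CLIP-space
    # delta_s = mapper_s(delta_f)                        # map into s-space (trained mapper)
    # edited = synthesis(s_inverted + alpha * delta_s)   # apply with strength alpha
```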
Qualitative and quantitative experimental results verify the effectiveness of our method. Our contributions are summarized as follows:
• We propose CSLA, a data-free training strategy for arbitrary text-driven manipulation that incurs no extra time cost at inference.
• We identify the knowledge distribution bias problem between CLIP and StyleGAN and address it by mapping latent residuals.
• We propose temporal relative consistency and adaptive style mixing to refine the mapped ∆s.
2. Related Work
2.1. Vision-Language Model
In recent years, vision-language pre-training [14, 26, 29, 36, 47, 48] has become an attractive research topic. As the most prominent study, CLIP [26] learns a cross-modal joint representation of text and image data by training on 400 million text-image pairs collected from the Internet. With its prior knowledge of text-image consistency, it achieves state-of-the-art zero-shot classification, and this prior knowledge also benefits other text-image cross-modal tasks. In this paper, we employ CLIP as an image encoder for latent alignment training.
2.2. Zero-Shot Text-to-Image Generation
Zero-shot text-to-image generation is a conditional generation task with reference text. To achieve it, DALL-E [28] trains a transformer [42] to autoregressively model text and image tokens as a single stream of data. Beyond zero-shot generation, CogView [12] further investigates its potential on diverse downstream tasks such as style learning and super-resolution, and CogView2 attempts faster generation via a hierarchical transformer architecture. Different from these transformer-based methods, DiffusionCLIP [20], GLIDE [23], and DALL-E 2 [27] introduce diffusion models [16, 37, 38] with CLIP [26] guidance and achieve better performance, and many recent works [9, 32, 33, 43] have further developed this research direction. Most of these works suffer from expensive training due to their billions of parameters. In contrast, our method, with only about 8M learnable parameters, achieves text-to-image generation with a much faster training process on a single 2080Ti.
2.3. StyleGAN-based Image Manipulation
StyleGAN has three editable latent spaces (w, w+, s) and high-fidelity generative ability, so several approaches achieve image manipulation in an inversion-then-editing manner. pSp [30] proposes a pyramid encoder to invert an image from RGB space to w+ space. Instead of learning a direct inversion from image to w+ latent, e4e [40] encodes an image into w and a group of residuals that together produce the w+ latent. ReStyle [4] iteratively inverts an image into w+ space by adding predicted w+ residuals and achieves state-of-the-art performance. To manipulate the inverted image latent, most existing non-CLIP methods mine manipulation directions by clustering, e.g., InterFaceGAN [35]. Our method instead derives the manipulation direction from two sentences rather than two sets of images.
2.4. Combination of CLIP and StyleGAN
StyleCLIP [24] is the most influential CLIP-driven StyleGAN manipulation approach, using CLIP to construct its objective. Following StyleCLIP, StyleMC [22] trains mappers to produce ∆s for single-mode StyleGAN manipulation, and StyleGAN-NADA [13] fine-tunes a generator