Figure 2. ManiCLIP: our model is able to naturally edit multiple face attributes from natural language instructions, e.g. "The person is chubby, and has blond hair." and "The person has brown hair, and arched eyebrows." The top row shows the original images; rows 2-3 show the edited face images under the different text descriptions.
duced, e.g. the lipstick color in Figure 1 is overly bright, and 2) text-irrelevant attributes are also changed in the edited images, e.g. the skin color in Figure 1. To alleviate these issues, we introduce a new method, named ManiCLIP, that deploys a new decoupling training scheme and an entropy-constraint-based loss to generate natural edited images while minimizing changes to text-irrelevant attributes.
Our proposed method is two-pronged. First, we observe that the difficulty of attribute editing is heterogeneous across attributes. For example, lipstick is easy to manipulate: a few epochs of training already yield overly edited results, while other attributes require many more epochs. Under mixed training, since "hard" attributes are harder to synthesize, the model keeps optimizing the CLIP loss and, as a result, "easy" attributes become unnatural; for example, the lipstick in Figure 1 exhibits excessive editing. Hence we propose a decoupling training scheme, in which we use group sampling and edit only one kind of attribute in each instance. This allows the model to fit each kind of attribute individually, which alleviates the issue of unnatural editing results. Second, owing to the disentanglement property [10] of the StyleGAN2 latent space, only a small portion of the latent code dimensions affect a given attribute. However, StyleGAN2 latent codes have thousands of dimensions and thus considerable freedom during the editing process, so we propose an entropy loss to control the number of non-zero dimensions. Since a uniform distribution of latent code values yields maximum entropy, minimizing the entropy loss forces the model to produce more values close to zero, as visualized in Figure 5. During the training phase, we optimize the entropy loss and the CLIP loss simultaneously, so that text-irrelevant attributes are preserved while relevant attributes are edited.
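The sparsity effect of such an entropy loss can be illustrated with a minimal sketch. Here we assume one plausible instantiation, treating the normalized absolute offset values as a probability distribution (the paper's exact formulation may differ): sparse offsets have low entropy, while dense, near-uniform offsets have maximum entropy, so minimizing the entropy pushes most offset dimensions toward zero.

```python
import numpy as np

def entropy_loss(offsets, eps=1e-8):
    """Entropy of the normalized absolute latent-code offsets.

    Treat |offsets| / sum(|offsets|) as a probability distribution;
    a uniform distribution maximizes entropy, so minimizing this
    loss encourages most offset dimensions to shrink toward zero.
    """
    p = np.abs(offsets) / (np.sum(np.abs(offsets)) + eps)
    return -np.sum(p * np.log(p + eps))

# A sparse offset (few active dimensions) has lower entropy
# than a dense one distributing the same total magnitude.
sparse = np.array([1.0, 0.0, 0.0, 0.0])
dense = np.array([0.25, 0.25, 0.25, 0.25])
print(entropy_loss(sparse) < entropy_loss(dense))  # True
```

In training, this term would be added to the CLIP loss so that the gradient trades off editing strength against the number of latent dimensions allowed to move.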
Comprehensive experimental results demonstrate the effectiveness of our proposed method, which requires no test-time optimization. Given textual descriptions containing multiple attributes, our decoupling training scheme and entropy loss produce natural manipulated images with minimal text-irrelevant editing. We show the teaser in Figure 2. Our contributions can be summarized as:
• Adoption of a decoupling training scheme that suppresses excessive editing for the input text, which helps produce natural results.
• Application of an entropy loss to the StyleGAN2 latent code offsets, which regularizes the offset freedom to avoid unnecessary changes.
• Demonstration that our proposed ManiCLIP outperforms several state-of-the-art methods on the multi-attribute face editing task.
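The group sampling behind the decoupling scheme can be sketched as follows. This is a toy illustration under assumed attribute groups (the paper's actual grouping and text templates are not specified here): each training instance draws a single attribute group, so the model fits one kind of attribute at a time.

```python
import random

# Hypothetical attribute groups; ManiCLIP's actual grouping may differ.
ATTRIBUTE_GROUPS = {
    "hair": ["blond hair", "brown hair", "gray hair"],
    "makeup": ["lipstick", "heavy makeup"],
    "face shape": ["a chubby face", "an oval face"],
}

def sample_group_text(rng):
    """Group sampling: each training instance edits only one kind of
    attribute, so 'easy' groups are not over-optimized while 'hard'
    groups continue to drive the CLIP loss."""
    group = rng.choice(sorted(ATTRIBUTE_GROUPS))
    attribute = rng.choice(ATTRIBUTE_GROUPS[group])
    return group, f"The person has {attribute}."

group, text = sample_group_text(random.Random(0))
print(group, "->", text)
```

At test time, by contrast, a single description may mention attributes from several groups, since the model has learned each group's edit direction separately.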
2. Related work
2.1. Face editing
The challenge of face editing [24, 23, 21, 16, 8] is to change some attributes of the original face image while preserving all other, irrelevant attributes. Since the latent space W of StyleGAN2 is claimed to better reflect the disentangled semantics of the learned distribution [10, 29], previous methods [17, 30, 28, 7] adopt the StyleGAN2 [10] architecture and manipulate its latent codes, so that the generated images are edited toward the desired outputs.
Specifically, Talk-to-Edit [7] introduces a dialog system to iteratively edit images, adopting a pretrained attribute predictor to supervise the editing pipeline. Xu et al. [32] propose dual latent spaces on top of the original StyleGAN2 architecture, and claim that their cross-space interaction enables coordinated, complex editing. However, this method depends on InterFaceGAN [23], which requires individual operations over each sample.