ManiCLIP: Multi-Attribute Face Manipulation from Text
Hao Wang1  Guosheng Lin1†  Ana García del Molino2  Anran Wang2  Jiashi Feng2  Zhiqi Shen1†
1Nanyang Technological University  2ByteDance
†Corresponding authors
Abstract
In this paper we present a novel multi-attribute face manipulation method based on textual descriptions. Previous text-based image editing methods either require test-time optimization for each individual image or are restricted to single-attribute editing. Extending these methods to multi-attribute face image editing scenarios will introduce undesired excessive attribute change, e.g., text-relevant attributes are overly manipulated and text-irrelevant attributes are also changed. In order to address these challenges and achieve natural editing over multiple face attributes, we propose a new decoupling training scheme where we use group sampling to get text segments from the same attribute categories, instead of whole complex sentences. Further, to preserve other existing face attributes, we encourage the model to edit the latent code of each attribute separately via an entropy constraint. During the inference phase, our model is able to edit new face images without any test-time optimization, even from complex textual prompts. We show extensive experiments and analysis to demonstrate the efficacy of our method, which generates natural manipulated faces with minimal text-irrelevant attribute editing. Code and pre-trained model are available at https://github.com/hwang1996/ManiCLIP.
1. Introduction
Face editing [32,7,17] aims to manipulate given face images under certain instructions. In this paper, we are interested in multi-attribute face editing, where one can change several different facial attributes in one shot from the provided textual descriptions. Multi-attribute face editing is of great significance when users want to change the content of their photos, including makeup, hair styles and face shapes. This gives people the freedom to modify images and produce the results they desire. However, relevant studies on multi-attribute editing are still rare.
Figure 1. Comparison between excessive editing and natural editing, given the input image and the text "She wears lipstick. She has big lips." The excessive and natural editing results are generated by our baseline model and our method, respectively.
With the development of StyleGAN2 [9,10] and the CLIP model [18], several recent works explore leveraging them for image content editing. Specifically, StyleGAN2 has been demonstrated to learn disentangled latent codes [9], which are shown to carry semantic meanings corresponding to the generated images [30,27]. The CLIP model is pretrained on large-scale image-text datasets and can measure the similarity between given images and text by mapping them into a shared learned feature space. Therefore, to achieve automatic natural-language-based face editing, many works [17,30,28,15,32] design manipulation components that change the latent codes based on StyleGAN2 and the CLIP model.
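As a concrete illustration, the snippet below computes the kind of CLIP-based text-image similarity that such editing pipelines optimize, using the open-source OpenAI CLIP package. This is a minimal sketch: the image path and prompt are placeholders, and each method adds its own regularization terms on top of this alignment term.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: an edited face image and a target text description.
image = preprocess(Image.open("face.png")).unsqueeze(0).to(device)
text = clip.tokenize(["She wears lipstick."]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)

# Cosine similarity in the shared feature space; a CLIP loss is typically
# 1 - similarity, so lower values mean better text-image alignment.
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
clip_loss = 1.0 - (img_feat * txt_feat).sum()
```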
To be specific, TediGAN [30] uses a CLIP loss [18] to supervise the text-image alignment of the edited images. This method suffers from slow inference, as it requires individual alignment optimization for each image. StyleCLIP [17] proposes to train a mapper module that predicts proper offsets over the image latent codes to achieve the desired editing. However, it needs to train separate mappers for different text inputs. HairCLIP [28] follows the mapper design of StyleCLIP [17] and trains only one mapper for different hair-related text inputs. But the original HairCLIP [28] is only applicable to editing a single hair-related attribute, and the problem of multi-attribute editing remains unsolved.
Though it is straightforward to extend previous methods to the general multi-attribute face manipulation task, we observe that the resulting baseline model produces excessive editing, as shown in Figure 1.
Figure 2. ManiCLIP: our model is able to naturally edit multiple face attributes from natural language instructions (e.g., "The person is chubby, and has blond hair." and "The person has brown hair, and arched eyebrows."). The top row shows the original images; rows 2-3 show the edited face images under different text descriptions.
Excessive editing means that: 1) the text-relevant attributes are overly amplified and unnatural results are produced, e.g., the lipstick has an overly bright color in Figure 1; and 2) the text-irrelevant attributes are also changed in the edited images, e.g., the skin color in Figure 1. To alleviate these issues, we introduce a new method, named ManiCLIP, that deploys a new decoupling training scheme and an entropy-constraint-based loss design to generate natural edited images while minimizing text-irrelevant attribute changes.
Our proposed method is two-pronged. First, we observe that the difficulty of attribute editing is heterogeneous across attributes. For example, lipstick is easy to manipulate: a few training epochs already produce overly edited results, while editing other attributes may require many more epochs. When we do mixed training, since the "hard" attributes are harder to synthesize, the model keeps optimizing the CLIP loss and, as a result, the "easy" attributes become unnatural; for example, the lipstick in Figure 1 presents excessive editing. Hence we propose the decoupling training scheme, where we use group sampling and edit only one kind of attribute in each instance. This allows the model to fit each kind of attribute individually, which alleviates the issue of unnatural edited results.
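To illustrate the group sampling idea, the following Python sketch draws the training text for each instance from a single attribute category rather than composing a complex multi-category sentence. The category names and phrases are hypothetical placeholders, not the exact groups used by ManiCLIP.

```python
import random

# Hypothetical attribute groups for illustration only; the actual category
# definitions used by ManiCLIP are in the released code.
ATTRIBUTE_GROUPS = {
    "hair":   ["has blond hair", "has brown hair", "has straight hair"],
    "makeup": ["wears lipstick", "wears heavy makeup"],
    "shape":  ["is chubby", "has big lips", "has a big nose"],
}

def sample_group_text(rng: random.Random) -> str:
    """Group sampling: pick ONE attribute category per training instance and
    build a short text segment from that category only, instead of a whole
    complex sentence that mixes easy and hard attributes."""
    group = rng.choice(sorted(ATTRIBUTE_GROUPS))
    phrases = rng.sample(ATTRIBUTE_GROUPS[group], k=rng.randint(1, 2))
    return "The person " + " and ".join(phrases) + "."

rng = random.Random(0)
print(sample_group_text(rng))  # e.g. "The person has big lips and is chubby."
```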
Second, because of the disentanglement property [10] of the StyleGAN2 latent space, only a small portion of the latent code dimensions affects a given attribute. However, StyleGAN2 latent codes have thousands of dimensions, which leaves too much freedom during the editing process, hence we propose an entropy loss to control the number of non-zero dimensions. Since a uniform distribution of latent code values yields maximum entropy, minimizing the entropy loss forces the model to produce more values close to zero, as visualized in Figure 5. By optimizing the entropy loss and the CLIP loss simultaneously during training, text-irrelevant attributes are preserved while text-relevant attributes are edited.
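Below is a minimal sketch of one way such an entropy constraint could be written in PyTorch; the exact formulation and weighting used by ManiCLIP may differ. The absolute offsets predicted by the mapper are normalized into a distribution over latent dimensions, and minimizing the entropy of that distribution concentrates the edit on a few dimensions while pushing the rest toward zero.

```python
import torch

def offset_entropy_loss(delta_w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy constraint on latent-code offsets (illustrative, not the exact
    paper formulation). `delta_w` has shape (batch, num_dims)."""
    mag = delta_w.abs()
    p = mag / (mag.sum(dim=-1, keepdim=True) + eps)   # distribution over dimensions
    entropy = -(p * (p + eps).log()).sum(dim=-1)      # large when edits are spread out
    return entropy.mean()

# Example: offsets over a flattened StyleGAN2 W+ code (18 layers x 512 dims),
# minimized jointly with the CLIP loss during training.
delta_w = torch.randn(4, 18 * 512, requires_grad=True)
loss = offset_entropy_loss(delta_w)
loss.backward()
```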
Comprehensive experimental results demonstrate the effectiveness of our proposed method, which requires no test-time optimization. From textual descriptions containing multiple attributes, our decoupling training scheme and entropy loss generate natural manipulated images with minimal text-irrelevant editing. We show teaser results in Figure 2. Our contributions can be summarized as:
• A decoupling training scheme that minimizes excessive editing for the input text, which is beneficial to producing natural results.
• Application of an entropy loss to the StyleGAN2 latent code offsets, which regularizes the offset freedom to avoid unnecessary changes.
• We demonstrate that our proposed ManiCLIP outperforms several state-of-the-art methods on the multi-attribute face editing task.
2. Related work
2.1. Face editing
The challenge of face editing [24,23,21,16,8] is to change some of the attributes of the original face image while preserving all other irrelevant attributes. Since the latent space W of StyleGAN2 is claimed to better reflect the disentangled semantics of the learned distribution [10,29], previous methods [17,30,28,7] adopt the StyleGAN2 [10] architecture and manipulate its latent codes, such that the generated images are edited to match the desired outputs.
Specifically, Talk-to-Edit [7] introduces a dialog system to iteratively edit images, adopting a pretrained attribute predictor to supervise the editing pipeline. Xu et al. [32] propose dual latent spaces on top of the original StyleGAN2 architecture and claim that their proposed cross-space interaction allows cooperative complex editing. However, this method depends on InterFaceGAN [23], which requires individual operation over each sample.