Figure 2. ManiCLIP: our model is able to naturally edit multiple face attributes from natural language instructions, e.g. "The person is chubby, and has blond hair." and "The person has brown hair, and arched eyebrows." The top row shows the original images; rows 2-3 show the edited face images under the different text descriptions.
duced, e.g. the lipstick color in Figure 1 is overly bright, and 2) text-irrelevant attributes are also changed in the edited images, e.g. the skin color in Figure 1. To alleviate these issues, we introduce a new method, named ManiCLIP, that deploys a new decoupling training scheme and an entropy-constraint-based loss to generate natural edited images while minimizing changes to text-irrelevant attributes.
Our proposed method is two-pronged. First, we observe that the difficulty of attribute editing is heterogeneous across attributes. For example, lipstick is easy to manipulate: a few epochs of training already yield overly edited results, while other attributes require many more epochs. Under mixed training, since "hard" attributes are harder to synthesize, the model keeps optimizing the CLIP loss and, as a result, "easy" attributes become unnatural; for example, the lipstick in Figure 1 exhibits excessive editing. Hence we propose a decoupling training scheme, in which we use group sampling and edit only one kind of attribute in each instance. This allows the model to fit each kind of attribute individually, which alleviates the issue of unnatural editing results. Second, owing to the disentanglement property [10] of the StyleGAN2 latent space, only a small portion of the latent code dimensions affect a given attribute. However, StyleGAN2 latent codes have thousands of dimensions and thus considerable freedom during the editing process, so we propose an entropy loss to control the number of non-zero dimensions. Since a uniform distribution of latent code values yields maximum entropy, minimizing the entropy loss forces the model to produce more values close to zero, as visualized in Figure 5. During the training phase, we optimize the entropy loss and the CLIP loss simultaneously, so that text-irrelevant attributes are preserved while relevant attributes are edited.
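The sparsity effect of such an entropy loss can be illustrated with a minimal sketch. Here we assume one plausible instantiation, treating the normalized absolute offset values as a probability distribution (the paper's exact formulation may differ): sparse offsets have low entropy, while dense, near-uniform offsets have maximum entropy, so minimizing the entropy pushes most offset dimensions toward zero.

```python
import numpy as np

def entropy_loss(offsets, eps=1e-8):
    """Entropy of the normalized absolute latent-code offsets.

    Treat |offsets| / sum(|offsets|) as a probability distribution;
    a uniform distribution maximizes entropy, so minimizing this
    loss encourages most offset dimensions to shrink toward zero.
    """
    p = np.abs(offsets) / (np.sum(np.abs(offsets)) + eps)
    return -np.sum(p * np.log(p + eps))

# A sparse offset (few active dimensions) has lower entropy
# than a dense one distributing the same total magnitude.
sparse = np.array([1.0, 0.0, 0.0, 0.0])
dense = np.array([0.25, 0.25, 0.25, 0.25])
print(entropy_loss(sparse) < entropy_loss(dense))  # True
```

In training, this term would be added to the CLIP loss so that the gradient trades off editing strength against the number of latent dimensions allowed to move.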
Comprehensive experimental results demonstrate the effectiveness of our proposed method, which requires no test-time optimization. Given textual descriptions containing multiple attributes, our decoupling training scheme and entropy loss produce natural manipulated images with minimal text-irrelevant editing. We show the teaser in Figure 2. Our contributions can be summarized as:
• Adoption of a decoupling training scheme that suppresses excessive editing for the input text, which helps produce natural results.
• Application of an entropy loss to the StyleGAN2 latent code offsets, which regularizes the offset freedom to avoid unnecessary changes.
• Demonstration that our proposed ManiCLIP outperforms several state-of-the-art methods on the multi-attribute face editing task.
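The group sampling behind the decoupling scheme can be sketched as follows. This is a toy illustration under assumed attribute groups (the paper's actual grouping and text templates are not specified here): each training instance draws a single attribute group, so the model fits one kind of attribute at a time.

```python
import random

# Hypothetical attribute groups; ManiCLIP's actual grouping may differ.
ATTRIBUTE_GROUPS = {
    "hair": ["blond hair", "brown hair", "gray hair"],
    "makeup": ["lipstick", "heavy makeup"],
    "face shape": ["a chubby face", "an oval face"],
}

def sample_group_text(rng):
    """Group sampling: each training instance edits only one kind of
    attribute, so 'easy' groups are not over-optimized while 'hard'
    groups continue to drive the CLIP loss."""
    group = rng.choice(sorted(ATTRIBUTE_GROUPS))
    attribute = rng.choice(ATTRIBUTE_GROUPS[group])
    return group, f"The person has {attribute}."

group, text = sample_group_text(random.Random(0))
print(group, "->", text)
```

At test time, by contrast, a single description may mention attributes from several groups, since the model has learned each group's edit direction separately.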
2. Related work
2.1. Face editing
The challenge of face editing [24, 23, 21, 16, 8] is to change some attributes of the original face image while preserving all other, irrelevant attributes. Since the latent space W of StyleGAN2 is claimed to better reflect the disentangled semantics of the learned distribution [10, 29], previous methods [17, 30, 28, 7] adopt the StyleGAN2 [10] architecture and manipulate its latent codes, so that the generated images are edited toward the desired outputs.
Specifically, Talk-to-Edit [7] introduces a dialog system to iteratively edit images, adopting a pretrained attribute predictor to supervise the editing pipeline. Xu et al. [32] propose dual latent spaces on top of the original StyleGAN2 architecture, and claim that their cross-space interaction enables coordinated, complex editing. However, this method depends on InterFaceGAN [23], which requires individual operations over each sample.