and scalable editing system, we propose to employ an off-the-shelf diffusion model trained on a general semantic domain such as ImageNet, and to guide its editing with a domain-specific classifier. To the best of our knowledge, this is the first attempt to introduce diffusion models to the fashion domain, and to fashion attribute editing in particular.
This approach has clear advantages over the prior art for several reasons. First, training a classifier is generally much simpler and easier than training a generative model under limited data. As there is a clear shortage of well-annotated fashion images, we present an efficient finetuning strategy that empowers the classifier to reason about a wide range of fashion attributes at once with the help of the recently proposed attention-pooling technique [36], as sketched below.
Second, as the capacity to understand different fashion
attributes has been transferred to the classifier, our manipu-
lation framework can operate at a much greater scale, cov-
ering attributes like item category, neckline, fit, fabric, pat-
tern, sleeve length, collar and gender with a single model.
This is in clear contrast to previous works that support only a few editing operations with separate models. We later
show that training attribute-editing GANs with such a wide
set of attributes leads to total training collapse (Fig. 3). Last,
we can edit multiple attributes at once in an integrated man-
ner, since we use the guidance signal of a multi-attribute
classifier. This is particularly important for the fashion domain, as various attributes must be in harmony with one another
to yield an attractive output.
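As a rough illustration of this finetuning strategy, the sketch below shows one possible attention-pooling head on top of a pretrained ViT, in the spirit of [36]; the class name, attribute groups, and dimensions are hypothetical and not our exact implementation.

import torch
import torch.nn as nn

class AttentionPoolClassifier(nn.Module):
    """A learnable query attends over ViT patch tokens; the pooled feature
    feeds one linear classifier per attribute group (sketch, not our exact code)."""

    def __init__(self, embed_dim, classes_per_attr, num_heads=8):
        super().__init__()
        # classes_per_attr, e.g. {"category": 20, "neckline": 8, "sleeve_length": 5}
        self.query = nn.Parameter(0.02 * torch.randn(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(embed_dim, n) for name, n in classes_per_attr.items()}
        )

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) features from the pretrained ViT backbone
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, attn_weights = self.attn(q, patch_tokens, patch_tokens)
        pooled = pooled.squeeze(1)
        logits = {name: head(pooled) for name, head in self.heads.items()}
        # attn_weights (B, 1, N) doubles as a spatial relevance map over patches
        return logits, attn_weights

In such a setup only the pooling layer and the per-attribute linear heads need to be finetuned on fashion data; whether the ViT backbone stays frozen or is lightly tuned is an implementation choice.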
Our fashion attribute classifier adopts a pretrained ViT
backbone and a finetuned attention pooling layer in order
to best perform multi-attribute classification with a relatively small training dataset. We use the gradient signal of this
classifier to guide the diffusion process, as done in [3,7], so that we can alter more than one attribute at once. For local
editing of images, using a user-provided mask that explic-
itly marks the area to be edited is the most straightforward
and widely used approach [3]. Instead, we leverage the
natural capacity of our attribute classifier to attend to the
relevant spatial regions during classification, and use the at-
tention signal to suppress excessive modification of the original image. This frees users from having to designate a specific region of interest and simplifies the image
manipulation pipeline.
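The sketch below illustrates how one guided denoising step could combine the classifier gradient with the attention-derived mask; the DDIM-style update, the thresholding rule, and the interfaces (a classifier returning per-attribute logits plus patch attention, and eps_t from the diffusion U-Net) are simplifying assumptions rather than our verbatim implementation.

import torch
import torch.nn.functional as F

def guided_blended_step(x_t, x_orig, eps_t, alpha_bar_t, alpha_bar_prev,
                        classifier, targets, guidance_scale=3.0, attn_thresh=0.5):
    """One DDIM-style step with multi-attribute classifier guidance and
    attention-based blending (simplified sketch).
    targets: dict of attribute name -> target class indices, shape (B,)."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        # Classifier maps the (noisy) image to per-attribute logits and
        # attention-pooling weights over patch tokens.
        logits, attn = classifier(x_in)
        # Sum of target log-probabilities across all edited attributes.
        log_prob = sum(
            F.log_softmax(logits[name], dim=-1).gather(1, y.view(-1, 1)).sum()
            for name, y in targets.items()
        )
        grad = torch.autograd.grad(log_prob, x_in)[0]

    # Classifier guidance as in [7]: shift the predicted noise along the gradient.
    eps_hat = eps_t - (1 - alpha_bar_t) ** 0.5 * guidance_scale * grad

    # DDIM update: predict the clean image, then step to the previous noise level.
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_hat) / alpha_bar_t ** 0.5
    x_prev = alpha_bar_prev ** 0.5 * x0_pred + (1 - alpha_bar_prev) ** 0.5 * eps_hat

    # Turn the classifier's patch attention (B, 1, N) into a binary spatial mask.
    B, _, H, W = x_t.shape
    side = int(attn.shape[-1] ** 0.5)  # assumes a square patch grid
    mask = attn.view(B, 1, side, side)
    mask = F.interpolate(mask, size=(H, W), mode="bilinear", align_corners=False)
    mask = (mask / mask.amax(dim=(2, 3), keepdim=True) > attn_thresh).float()

    # Blend with a correspondingly noised copy of the original image, in the
    # spirit of blended diffusion [3], but with the attention mask in place of
    # a user-provided one.
    x_orig_prev = alpha_bar_prev ** 0.5 * x_orig + (1 - alpha_bar_prev) ** 0.5 * torch.randn_like(x_orig)
    return mask * x_prev + (1 - mask) * x_orig_prev

Looping this step over the diffusion timesteps yields the edited image, while the blending keeps regions outside the attended area close to the input.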
Our contributions can be summarized as follows:
• We introduce classifier-guided diffusion as a simple
yet generic and effective framework for fashion at-
tribute editing.
• We empirically present an efficient finetuning scheme
to adapt a pretrained ViT for a domain-specific multi-
attribute classification setting.
• We demonstrate the effectiveness of our framework
with thorough evaluations.
2. Related Works
Diffusion Models [32,11] are a family of generative
models that convert Gaussian noise into a natural im-
age through an iterative denoising process, which is typ-
ically modeled with a learnable neural network such as
U-Net [29]. They have gained attention from both the research community and the public with their state-of-the-art performance in likelihood estimation [11,22]
and sample quality [7,30,26]. Specifically, they have
demonstrated impressive results in conditional image syn-
thesis, such as class-conditional [7], image-conditional [20]
and text-conditional [30,26] settings. Conditioning a dif-
fusion model is typically done with either classifier guidance [7] or classifier-free guidance [12], and con-
ditional diffusion models have been shown to be capable of
learning an extremely rich latent space [21,26]. Recently,
a line of works [33,28] focuses on improving the sampling speed of diffusion models, by either altering the Marko-
vian noising process or embedding the diffusion steps into
a learned latent space. Another group [15,3,10] studies the
applications of diffusion models such as text-guided image
manipulation.
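For reference, classifier guidance [7] conditions the reverse process on a label $y$ by shifting the predicted Gaussian mean with the gradient of a separately trained, noise-aware classifier, scaled by a guidance weight $s$:

\[
\hat{\mu}_\theta(x_t, t \mid y) = \mu_\theta(x_t, t) + s \, \Sigma_\theta(x_t, t) \, \nabla_{x_t} \log p_\phi(y \mid x_t, t).
\]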
Image Manipulation or image editing has been a long-standing challenge in the computer vision community with
a wide range of practical applications [39,23,16,17]. Im-
age manipulation with deep models typically involves editing operations in the latent space to produce semantically meaningful and natural modifications. Hence, image manipulation
with Generative Adversarial Networks (GANs) poses the additional problem of GAN inversion [35,27], which is the initial step
to find a latent vector corresponding to the image that needs
to be altered. Recently, as diffusion models have risen as prominent alternatives, image manipulation using diffusion models has been widely studied. Blended-diffusion [3] uses
a user-provided mask and a textual prompt during the dif-
fusion process to blend the target and the existing back-
ground iteratively. A concurrent work to ours, prompt-to-prompt [10], captures the text cross-attention structure to en-
able purely prompt-based scene editing without any explicit
masks. In our work, we take blended-diffusion as the start-
ing point, and incorporate a domain-specific classifier and
its attention structure for mask-free multi-attribute fashion
image manipulation.
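Schematically (notation ours), at every denoising step blended-diffusion [3] forms the next latent as

\[
x_t \leftarrow m \odot x_t^{\mathrm{edit}} + (1 - m) \odot x_t^{\mathrm{orig}},
\]

where $m$ is the user-provided mask, $x_t^{\mathrm{edit}}$ is the latent produced by the guided step, and $x_t^{\mathrm{orig}}$ is the input image noised to level $t$; in our framework, $m$ is instead derived from the classifier's attention.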
Fashion Attribute Editing is a highly practical task that several prior works have studied. AMGAN [2] uses Class Activation Maps and two discriminators that identify both real/fake images and the attributes. Fashion-AttGAN [24] im-
proves upon AttGAN [9] to better fit the fashion domain, by
including additional optimization objectives. VPTNet [18]
aims to handle larger shape changes by posing attribute edit-
ing as a two-stage procedure of shape-then-appearance edit-
ing. These works all employ the GAN framework, and thus
the range of attributes supported for editing is relatively lim-