Leveraging Off-the-shelf Diffusion Model for Multi-attribute
Fashion Image Manipulation
Chaerin Kong1* DongHyeon Jeon2 Ohjoon Kwon2 Nojun Kwak1
1Seoul National University 2NAVER
veztylord@snu.ac.kr {donghyeon.jeon, ohjoon.kwon}@navercorp.com nojunk@snu.ac.kr
*Work done during internship at NAVER.
Abstract
Fashion attribute editing is a task that aims to convert the semantic attributes of a given fashion image while preserving the irrelevant regions. Previous works typically employ conditional GANs, where the generator explicitly learns the target attributes and directly executes the conversion. These approaches, however, are neither scalable nor generic, as they operate only with a few limited attributes and a separate generator is required for each dataset or attribute set. Inspired by the recent advancement of diffusion models, we explore classifier-guided diffusion that leverages an off-the-shelf diffusion model pretrained on general visual semantics such as Imagenet. To achieve a generic editing pipeline, we pose this as a multi-attribute image manipulation task, where the attributes range from item category, fabric and pattern to collar and neckline. We empirically show that conventional methods fail in our challenging setting, and study an efficient adaptation scheme that involves a recently introduced attention-pooling technique to obtain multi-attribute classifier guidance. Based on this, we present a mask-free fashion attribute editing framework that leverages the classifier logits and the cross-attention map for manipulation. We empirically demonstrate that our framework achieves convincing sample quality and attribute alignment.
1. Introduction
Denoising diffusion models [11,33,7,26,30] have recently gained great attention from the research community for their impressive synthesis quality, training stability and scalability. They have demonstrated promising performance across diverse tasks and benchmarks, spanning unconditional image synthesis [7], text-driven image generation [26,30,21], image manipulation [3,10] and video synthesis [13]. Nevertheless, studies on diffusion models are far from complete; unlike traditional generative model families such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), the true potential of diffusion models is yet to be fully revealed.
One of the reasons for the popularity of diffusion models is their natural capacity to incorporate conditioning information into the generative process. Conditional diffusion models typically rely on classifier guidance [7] or classifier-free guidance [12], where the former requires a separately trained classifier (independent of the diffusion model) while the latter involves condition-aware training of the diffusion model from the beginning. Classifier guidance, in particular, provides a means to leverage an off-the-shelf diffusion model trained on general visual semantics such as Imagenet [6], which can be particularly useful for domains with insufficient public data.1
1Classifiers are generally easier and more straightforward to train or finetune than generative models under limited data.
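Concretely, classifier guidance perturbs each reverse denoising step with the gradient of a noise-aware classifier's log-likelihood, shifting the predicted mean as $\hat{\mu} = \mu_\theta(x_t) + s\,\Sigma_\theta(x_t)\,\nabla_{x_t}\log p_\phi(y \mid x_t)$ [7]. Below is a minimal sketch of this mean shift; the `classifier(x, t)` interface and the function names are our illustrative assumptions, not code from the paper.

```python
import torch

def classifier_guided_mean(mu, sigma, x_t, t, y, classifier, scale):
    """Shift the reverse-step mean `mu` (with variance `sigma`) predicted by
    a frozen diffusion model along the gradient of log p(y | x_t), computed
    by a separately trained, noise-aware classifier [7]."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        # sum of log p(y_i | x_i) over the batch; the gradient decomposes per sample
        selected = log_probs[torch.arange(x_in.size(0)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]
    return mu + scale * sigma * grad
```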
The fashion domain, being at the heart of modern e-commerce, has huge practical upside but offers relatively little publicly available data due to privacy and proprietary issues. To make matters worse, its visual semantics are significantly distant from common generative benchmarks such as Imagenet or FFHQ [14], strongly demanding an adequate adaptation procedure. For these reasons, the field of fashion has been relatively underexplored in the deep learning community despite its industrial value.
Image manipulation is a generative task that aims to control the semantics of an input image while preserving the irrelevant details. Fashion image manipulation has great applications in fashion design, interactive online shopping, and personalized marketing. Thus, several works [2,24,18] have posed this task as a fashion attribute editing task, i.e., attribute-guided image manipulation, and delivered promising results. However, their real-world applications are significantly restricted as (1) they train a separate generative model (typically a GAN) for each fashion dataset, and (2) their editing operations are limited to a few predefined attributes such as color or sleeve length, as the generative model has to learn to convert these attributes during training.
As these limitations render them ill-suited for a generic and scalable editing system, we propose to employ an off-the-shelf diffusion model trained on a general semantic domain such as Imagenet, and to guide its editing with a domain-specific classifier. To the best of our knowledge, this is the first attempt to introduce diffusion models to the fashion domain, and to fashion attribute editing in particular.
This approach has clear advantages over the prior art for several reasons. First, training a classifier is generally much simpler and easier than training a generative model under limited data. As there is a clear shortage of well-annotated fashion images, we present an efficient finetuning strategy that empowers the classifier to reason about a wide range of fashion attributes at once with the help of the recently proposed attention-pooling technique [36].
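As a rough sketch of what such a classifier head can look like, the snippet below pools frozen ViT patch tokens with a single learnable query and attaches one linear head per attribute. All names, dimensions and the exact head layout are our assumptions for illustration, not the paper's reported architecture.

```python
import torch
import torch.nn as nn

class AttentionPoolingHead(nn.Module):
    """Illustrative sketch: one learnable query cross-attends over frozen ViT
    patch tokens, and per-attribute linear heads read out the pooled feature."""
    def __init__(self, dim, num_heads, attr_sizes):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # one classification head per fashion attribute; attr_sizes lists the
        # number of classes for each (e.g., category, fabric, pattern, ...)
        self.heads = nn.ModuleList([nn.Linear(dim, n) for n in attr_sizes])

    def forward(self, patch_tokens):  # (B, N, dim) from a pretrained ViT
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        pooled = pooled.squeeze(1)
        return [head(pooled) for head in self.heads]  # logits per attribute
```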
Second, as the capacity to understand different fashion attributes has been transferred to the classifier, our manipulation framework can operate at a much greater scale, covering attributes like item category, neckline, fit, fabric, pattern, sleeve length, collar and gender with a single model. This is in clear contrast to previous works that support only a few editing operations with separate models. We later show that training attribute-editing GANs with such a wide set of attributes leads to total training collapse (Fig. 3). Last, we can edit multiple attributes at once in an integrated manner, since we use the guidance signal of a multi-attribute classifier. This is particularly important for the fashion domain, as various attributes must be in harmony with one another to yield an attractive output.
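Under classifier guidance, editing several attributes at once amounts to summing the per-attribute log-likelihoods before taking a single gradient. A hedged sketch follows; the list-of-logits classifier interface mirrors the head above and is our assumption.

```python
import torch

def multi_attribute_grad(classifier, x_t, t, targets):
    """Guidance gradient for editing several attributes simultaneously:
    sum the log-likelihoods of every requested (attribute index -> target
    class tensor) pair in `targets`, then take one gradient w.r.t. x_t,
    so a single backward pass steers all selected attributes at once."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        logits = classifier(x_in, t)  # assumed: list of per-attribute logits
        batch = torch.arange(x_in.size(0))
        total = sum(
            torch.log_softmax(logits[k], dim=-1)[batch, y].sum()
            for k, y in targets.items())
        return torch.autograd.grad(total, x_in)[0]
```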
Our fashion attribute classifier adopts a pretrained ViT backbone and a finetuned attention pooling layer in order to best perform multi-attribute classification with a relatively small training dataset. We use the gradient signal of this classifier to guide the diffusion process as done in [3,7], and thus we can alter more than one attribute at once. For local editing of images, using a user-provided mask that explicitly marks the area to be edited is the most straightforward and widely used approach [3]. Instead, we leverage the natural capacity of our attribute classifier to attend to the relevant spatial regions during classification, and use the attention signal to suppress excess modification of the original image. This frees users from the obligation to designate a specific region of interest and simplifies the image manipulation pipeline.
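A minimal sketch of this attention-based masking, assuming the classifier exposes a spatial attention map over patch locations (shapes and the normalization scheme are illustrative; blended diffusion [3] uses an explicit user-provided mask in the same role):

```python
import torch
import torch.nn.functional as F

def attention_masked_blend(x_edit, x_orig_noised, attn_map):
    """Use the classifier's spatial attention map as a soft editing mask:
    regions the classifier attends to keep the edited content, while the
    rest is restored from the correspondingly noised original image."""
    # upsample the patch-level map (B, h, w) to image resolution
    mask = F.interpolate(attn_map.unsqueeze(1), size=x_edit.shape[-2:],
                         mode="bilinear", align_corners=False)
    # min-max normalize to [0, 1] so it acts as a soft blending weight
    mask = (mask - mask.amin()) / (mask.amax() - mask.amin() + 1e-8)
    return mask * x_edit + (1.0 - mask) * x_orig_noised
```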
In sum, our contributions can be summarized as follows:
• We introduce classifier-guided diffusion as a simple yet generic and effective framework for fashion attribute editing.
• We empirically present an efficient finetuning scheme to adapt a pretrained ViT to a domain-specific multi-attribute classification setting.
• We demonstrate the effectiveness of our framework with thorough evaluations.
2. Related Works
Diffusion Models [32,11] are a family of generative models that convert Gaussian noise into a natural image through an iterative denoising process, which is typically modeled with a learnable neural network such as a U-Net [29]. They have gained attention from both the research community and the public with their state-of-the-art performances in likelihood estimation [11,22] and sample quality [7,30,26]. Specifically, they have demonstrated impressive results in conditional image synthesis, such as class-conditional [7], image-conditional [20] and text-conditional [30,26] settings. Conditioning a diffusion model is typically done with either classifier guidance [7] or classifier-free guidance [12], and conditional diffusion models have been shown to be capable of learning an extremely rich latent space [21,26]. Recently, a line of works [33,28] has focused on improving the sampling speed of diffusion models, by either altering the Markovian noising process or embedding the diffusion steps into a learned latent space. Another group [15,3,10] studies applications of diffusion models such as text-guided image manipulation.
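For reference, the standard DDPM formulation [11] underlying this iterative denoising reads (standard notation, not equations reproduced from this paper):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big),$$

where $\beta_t$ is the noise schedule and $\mu_\theta, \Sigma_\theta$ are predicted by the denoising network.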
Image Manipulation, or image editing, has been a long-standing challenge in the computer vision community with a wide range of practical applications [39,23,16,17]. Image manipulation with deep models typically accompanies editing operations in the latent space to produce semantic and natural modifications. Hence, image manipulation with Generative Adversarial Networks (GANs) poses the new problem of GAN inversion [35,27], i.e., the initial step of finding a latent vector corresponding to the image that needs to be altered. Recently, as diffusion models have risen as prominent alternatives, image manipulation using diffusion models is being widely studied. Blended diffusion [3] uses a user-provided mask and a textual prompt during the diffusion process to iteratively blend the target and the existing background. A concurrent work of ours, prompt-to-prompt [10], captures the text cross-attention structure to enable purely prompt-based scene editing without any explicit masks. In our work, we take blended diffusion as the starting point, and incorporate a domain-specific classifier and its attention structure for mask-free multi-attribute fashion image manipulation.
Fashion Attribute Editing is a highly practical task that several works have studied. AMGAN [2] uses Class Activation Maps and two discriminators that identify both real/fake and the attributes. Fashion-AttGAN [24] improves upon AttGAN [9] to better fit the fashion domain by including additional optimization objectives. VPTNet [18] aims to handle larger shape changes by posing attribute editing as a two-stage procedure of shape-then-appearance editing. These works all employ the GAN framework, and thus the range of attributes supported for editing is relatively limited.