and scalable editing system, we propose to employ an off-the-shelf diffusion model trained on a general semantic domain such as ImageNet, and to guide its editing with a domain-specific classifier. To the best of our knowledge, this is the first attempt to introduce diffusion models to the fashion domain, and to fashion attribute editing in particular.
This approach has clear advantages over the prior art for several reasons. First, training a classifier is generally much simpler and easier than training a generative model under limited data. As there is a clear shortage of well-annotated fashion images, we present an efficient finetuning strategy that empowers the classifier to reason about a wide range of fashion attributes at once with the help of the recently proposed attention-pooling technique [36], as sketched below.
Second, as the capacity to understand different fashion
attributes has been transferred to the classifier, our manipu-
lation framework can operate at a much greater scale, cov-
ering attributes like item category, neckline, fit, fabric, pat-
tern, sleeve length, collar and gender with a single model.
This is in clear contrast to previous works that support only a few editing operations with separate models. We later
show that training attribute-editing GANs with such a wide
set of attributes leads to total training collapse (Fig. 3). Last,
we can edit multiple attributes at once in an integrated man-
ner, since we use the guidance signal of a multi-attribute
classifier. This is particularly important for the fashion domain, as various attributes must be in harmony with one another
to yield an attractive output.
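As a rough illustration of this finetuning strategy, the sketch below shows one possible attention-pooling head on top of a pretrained ViT, in the spirit of [36]; the class name, attribute groups, and dimensions are hypothetical and not our exact implementation.

import torch
import torch.nn as nn

class AttentionPoolClassifier(nn.Module):
    """A learnable query attends over ViT patch tokens; the pooled feature
    feeds one linear classifier per attribute group (sketch, not our exact code)."""

    def __init__(self, embed_dim, classes_per_attr, num_heads=8):
        super().__init__()
        # classes_per_attr, e.g. {"category": 20, "neckline": 8, "sleeve_length": 5}
        self.query = nn.Parameter(0.02 * torch.randn(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(embed_dim, n) for name, n in classes_per_attr.items()}
        )

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) features from the pretrained ViT backbone
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, attn_weights = self.attn(q, patch_tokens, patch_tokens)
        pooled = pooled.squeeze(1)
        logits = {name: head(pooled) for name, head in self.heads.items()}
        # attn_weights (B, 1, N) doubles as a spatial relevance map over patches
        return logits, attn_weights

In such a setup only the pooling layer and the per-attribute linear heads need to be finetuned on fashion data; whether the ViT backbone stays frozen or is lightly tuned is an implementation choice.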
Our fashion attribute classifier adopts a pretrained ViT
backbone and a finetuned attention pooling layer in order
to best perform multi-attribute classification with a relatively small training dataset. We use the gradient signal of this
classifier to guide the diffusion process, as done in [3,7], so that we can alter more than one attribute at once. For local
editing of images, using a user-provided mask that explic-
itly marks the area to be edited is the most straightforward
and widely used approach [3]. Instead, we leverage the
natural capacity of our attribute classifier to attend to the
relevant spatial regions during classification, and use the at-
tention signal to suppress excessive modification of the original image. This frees users from having to designate a specific region of interest and simplifies the image
manipulation pipeline.
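The sketch below illustrates how one guided denoising step could combine the classifier gradient with the attention-derived mask; the DDIM-style update, the thresholding rule, and the interfaces (a classifier returning per-attribute logits plus patch attention, and eps_t from the diffusion U-Net) are simplifying assumptions rather than our verbatim implementation.

import torch
import torch.nn.functional as F

def guided_blended_step(x_t, x_orig, eps_t, alpha_bar_t, alpha_bar_prev,
                        classifier, targets, guidance_scale=3.0, attn_thresh=0.5):
    """One DDIM-style step with multi-attribute classifier guidance and
    attention-based blending (simplified sketch).
    targets: dict of attribute name -> target class indices, shape (B,)."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        # Classifier maps the (noisy) image to per-attribute logits and
        # attention-pooling weights over patch tokens.
        logits, attn = classifier(x_in)
        # Sum of target log-probabilities across all edited attributes.
        log_prob = sum(
            F.log_softmax(logits[name], dim=-1).gather(1, y.view(-1, 1)).sum()
            for name, y in targets.items()
        )
        grad = torch.autograd.grad(log_prob, x_in)[0]

    # Classifier guidance as in [7]: shift the predicted noise along the gradient.
    eps_hat = eps_t - (1 - alpha_bar_t) ** 0.5 * guidance_scale * grad

    # DDIM update: predict the clean image, then step to the previous noise level.
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_hat) / alpha_bar_t ** 0.5
    x_prev = alpha_bar_prev ** 0.5 * x0_pred + (1 - alpha_bar_prev) ** 0.5 * eps_hat

    # Turn the classifier's patch attention (B, 1, N) into a binary spatial mask.
    B, _, H, W = x_t.shape
    side = int(attn.shape[-1] ** 0.5)  # assumes a square patch grid
    mask = attn.view(B, 1, side, side)
    mask = F.interpolate(mask, size=(H, W), mode="bilinear", align_corners=False)
    mask = (mask / mask.amax(dim=(2, 3), keepdim=True) > attn_thresh).float()

    # Blend with a correspondingly noised copy of the original image, in the
    # spirit of blended diffusion [3], but with the attention mask in place of
    # a user-provided one.
    x_orig_prev = alpha_bar_prev ** 0.5 * x_orig + (1 - alpha_bar_prev) ** 0.5 * torch.randn_like(x_orig)
    return mask * x_prev + (1 - mask) * x_orig_prev

Looping this step over the diffusion timesteps yields the edited image, while the blending keeps regions outside the attended area close to the input.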
Our contributions can be summarized as follows:
• We introduce classifier-guided diffusion as a simple
yet generic and effective framework for fashion at-
tribute editing.
• We empirically present an efficient finetuning scheme
to adapt a pretrained ViT for a domain-specific multi-
attribute classification setting.
• We demonstrate the effectiveness of our framework
with thorough evaluations.
2. Related Works
Diffusion Models [32,11] are a family of generative
models that convert Gaussian noise into a natural im-
age through an iterative denoising process, which is typ-
ically modeled with a learnable neural network such as
U-Net [29]. They have gained attention from both the research community and the public with their state-of-the-art performance in likelihood estimation [11,22]
and sample quality [7,30,26]. Specifically, they have
demonstrated impressive results in conditional image syn-
thesis, such as class-conditional [7], image-conditional [20]
and text-conditional [30,26] settings. Conditioning a dif-
fusion model is typically done with either classifier guidance [7] or classifier-free guidance [12], and con-
ditional diffusion models have been shown to be capable of
learning an extremely rich latent space [21,26]. Recently,
a line of works [33,28] focuses on improving the sampling speed of diffusion models, by either altering the Marko-
vian noising process or embedding the diffusion steps into
a learned latent space. Another group [15,3,10] studies the
applications of diffusion models such as text-guided image
manipulation.
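For reference, classifier guidance [7] conditions the reverse process on a label $y$ by shifting the predicted Gaussian mean with the gradient of a separately trained, noise-aware classifier, scaled by a guidance weight $s$:

\[
\hat{\mu}_\theta(x_t, t \mid y) = \mu_\theta(x_t, t) + s \, \Sigma_\theta(x_t, t) \, \nabla_{x_t} \log p_\phi(y \mid x_t, t).
\]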
Image Manipulation or image editing has been a long-standing challenge in the computer vision community with
a wide range of practical applications [39,23,16,17]. Im-
age manipulation with deep models typically involves editing operations in the latent space to produce semantically meaningful and natural modifications. Hence, image manipulation
with Generative Adversarial Networks (GANs) poses the additional problem of GAN inversion [35,27], which is the initial step
to find a latent vector corresponding to the image that needs
to be altered. Recently, as diffusion models have risen as prominent alternatives, image manipulation using diffusion models has been widely studied. Blended-diffusion [3] uses
a user-provided mask and a textual prompt during the dif-
fusion process to blend the target and the existing back-
ground iteratively. A concurrent work to ours, prompt-to-prompt [10], captures the text cross-attention structure to en-
able purely prompt-based scene editing without any explicit
masks. In our work, we take blended-diffusion as the start-
ing point, and incorporate a domain-specific classifier and
its attention structure for mask-free multi-attribute fashion
image manipulation.
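Schematically (notation ours), at every denoising step blended-diffusion [3] forms the next latent as

\[
x_t \leftarrow m \odot x_t^{\mathrm{edit}} + (1 - m) \odot x_t^{\mathrm{orig}},
\]

where $m$ is the user-provided mask, $x_t^{\mathrm{edit}}$ is the latent produced by the guided step, and $x_t^{\mathrm{orig}}$ is the input image noised to level $t$; in our framework, $m$ is instead derived from the classifier's attention.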
Fashion Attribute Editing is a highly practical task that several prior works have studied. AMGAN [2] uses Class Activation Maps and two discriminators that identify both real/fake images and the attributes. Fashion-AttGAN [24] im-
proves upon AttGAN [9] to better fit the fashion domain, by
including additional optimization objectives. VPTNet [18]
aims to handle larger shape changes by posing attribute edit-
ing as a two-stage procedure of shape-then-appearance edit-
ing. These works all employ the GAN framework, and thus
the range of attributes supported for editing is relatively lim-