FaD-VLP: Fashion Vision-and-Language Pre-training
towards Unified Retrieval and Captioning
Suvir Mirchandani
Stanford University
suvir@cs.stanford.edu
Licheng Yu
Meta AI
lichengyu@meta.com
Mengjiao Wang
Meta AI
mengjiaow@meta.com
Animesh Sinha
Meta AI
animeshsinha@meta.com
Wenwen Jiang
Meta AI
wenwenj@meta.com
Tao Xiang
Meta AI / University of Surrey
txiang@meta.com
Ning Zhang
Meta AI
ningzhang@meta.com
Abstract
Multimodal tasks in the fashion domain have significant potential for e-commerce, but involve challenging vision-and-language learning problems—e.g., retrieving a fashion item given a reference image plus text feedback from a user. Prior works on multimodal fashion tasks have either been limited by the data in individual benchmarks, or have leveraged generic vision-and-language pre-training but have not taken advantage of the characteristics of fashion data. Additionally, these works have mainly been restricted to multimodal understanding tasks. To address these gaps, we make two key contributions. First, we propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks. Second, we propose a flexible decoder-based model architecture capable of both fashion retrieval and captioning tasks. Together, our model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization.
1 Introduction
Artificial intelligence has taken the fashion industry by storm in recent years. Significant advances have been made in tasks like recommendation (McAuley et al., 2015; Deldjoo et al., 2022) and virtual try-on (Han et al., 2018; Yang et al., 2022). In addition to these primarily visual tasks, multimodal tasks are of particular interest in fashion for e-commerce applications: for example, text-to-image retrieval enables a shopper to identify a desired clothing item via a language query (Zhuge et al., 2021).
Figure 1: We present FaD-VLP, a flexible architecture and pre-training method that supports retrieval-based and captioning-based tasks in the fashion domain: cross-modal retrieval, image retrieval with text feedback, multimodal categorization, image captioning, and relative image captioning.
A key opportunity to enhance customers’ shopping experiences is in the development of interactive multimodal shopping assistants, whereby a user could converse with a system to identify a desired product (Yuan and Lam, 2021; Han et al., 2022). As in Figure 1, a smart assistant is expected to perform multiple diverse tasks, e.g., cross-modal retrieval, image retrieval with text feedback, multimodal categorization, image captioning, and relative image captioning. Among them, perhaps the most notable task in fashion is image retrieval with text feedback, where the goal is to retrieve a target image given a reference image coupled with a user’s language feedback (e.g., “show me a similar shirt in light blue with no print”) (Wu et al., 2021; Lee et al., 2021; Kim et al., 2021). In addition to retrieval-based tasks, a central capability of conversational shopping assistants is captioning-based tasks: describing items in detail (Yang et al., 2020) or the differences among them. However, existing works on image retrieval with text feedback have almost exclusively studied that task in isolation, focusing on specialized architectures and fusion methods, with data limited by particular benchmarks (Lee et al., 2021; Kim et al., 2021).
To train a model that can perform well on several fashion-specific multimodal use cases, we observe an opportunity in the vast availability of multimodal fashion data on e-commerce platforms. While vision-language pre-trained (VLP) models have been highly successful for the general domain (Lu et al., 2019; Li et al., 2020; Su et al., 2020), prior work has suggested that general VLP models are helpful but suboptimal for the fashion domain (Zhuge et al., 2021; Liu et al., 2021; Goenka et al., 2022). Fashion images represent a domain shift from the pre-training data (Liu et al., 2021), and fashion tasks often require fine-grained representations rather than coarse representations from general VLP models (Zhuge et al., 2021).
To this end, we propose a domain-specific fashion pre-training procedure that takes advantage of fashion image-text data from multiple fashion catalogues. Our approach is inspired by the way that users might shop, via comparisons: a user may first identify a product, express a desired change in language, and then look for a new product that better matches their preferences. Given that data in this triplet form—reference product, modification, target product—is not nearly as common as the paired image-text data, we propose a lightweight method for constructing weakly-supervised pseudo-triplet data from image-text pairs. Additionally, we propose a unified, decoder-based model architecture for both retrieval-based and captioning-based fashion tasks. Together, we refer to our architecture and pre-training approach as FaD-VLP: Fashion Decoder with Vision-and-Language Pre-training.
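To make the pseudo-triplet idea concrete, the sketch below shows one plausible way such triplets could be assembled from ordinary image-caption pairs: pair each catalogue item with a similar item and let the neighbour's caption stand in for the "modification" text. This is an illustrative assumption only, not the construction actually used by FaD-VLP (which is described in Section 3.2); the `Item` class, the embedding model, and the `build_pseudo_triplets` helper are hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Item:
    image_path: str
    caption: str
    embedding: np.ndarray  # any off-the-shelf image or text embedding


def build_pseudo_triplets(items, top_k=1):
    """Hypothetical sketch: pair each item with its nearest catalogue
    neighbour and use the neighbour's caption as weak "feedback" text.

    Returns (reference_image, modification_text, target_image) triplets.
    """
    embs = np.stack([it.embedding for it in items])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = embs @ embs.T                 # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)      # exclude self-matches

    triplets = []
    for i, ref in enumerate(items):
        for j in np.argsort(-sims[i])[:top_k]:
            tgt = items[j]
            # Weak supervision: the target's caption is a noisy surrogate
            # for the language feedback that would turn `ref` into `tgt`.
            triplets.append((ref.image_path, tgt.caption, tgt.image_path))
    return triplets
```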
To summarize, we make the following contributions. We propose a unified architecture for retrieval-based and captioning-based fashion tasks (Section 3.1) and a fashion pre-training framework, including two novel pre-training tasks based on weakly-supervised pseudo-triplets (Section 3.2). Our approach achieves competitive performance on seven downstream fashion tasks: image-to-text retrieval, text-to-image retrieval, image retrieval with text feedback, category recognition, subcategory recognition, image captioning, and relative image captioning (Sections 4 and 5.1). Finally, we conduct a thorough ablation study to analyze the effects of our pre-training procedure (Section 5.2).
2 Related Work
A substantial body of work has focused on using the Transformer architecture (Vaswani et al., 2017) in the context of vision-and-language pre-training (VLP) (Li et al., 2019; Su et al., 2020; Chen et al., 2020b; Radford et al., 2021a; Li et al., 2021a; Yu et al., 2022a). Recent works have begun to focus on the fashion domain (Gao et al., 2020; Zhuge et al., 2021; Zhu et al., 2021; Dong et al., 2021; Zhang et al., 2021; Goenka et al., 2022; Yu et al., 2022b). VLP works generally differ in their choice of model architecture and pre-training objectives.
Model Architecture. Most existing VLP models, especially in the fashion domain, use encoder-style modules for both image and text, focusing on multimodal understanding tasks (which do not involve generation—e.g., image-text retrieval, multimodal classification). There are two main classes of these models: (i) single-stream early fusion (Li et al., 2019; Su et al., 2020; Chen et al., 2020b; Li et al., 2020), and (ii) two-stream late fusion (Tan and Bansal, 2019; Lu et al., 2019; Jia et al., 2021; Radford et al., 2021a). The nature of the downstream tasks often influences the choice of the number of streams; e.g., image-text retrieval is most practical with late fusion architectures, which can have faster inference. In this work, we propose a flexible decoder-based model architecture, which combines the advantages of both early and late fusion mechanisms, and is capable of not only multimodal understanding tasks but also captioning tasks (e.g., image captioning and relative image captioning).
Pre-training Objectives. Several pre-training tasks have been effectively used for VLP. Some of the most popular include masked modeling or matching objectives for the different modalities (Li et al., 2019; Lu et al., 2019; Su et al., 2020; Chen et al., 2020b); others include cross-modal contrastive learning (Li et al., 2021a; Radford et al., 2021a; Li et al., 2021b), caption generation (Zhou et al., 2020; Wang et al., 2022), and object tagging (Li et al., 2020). Fashion data has some unique properties that could be leveraged during pre-training, partly to mitigate the domain shift which makes generic VLP less effective for fashion (Zhuge et al., 2021). For example, fashion captions contain more structured attributes, which naturally invites comparison when people choose items to shop for. Inspired by this, we propose using weakly-supervised triplet-based comparisons as the basis for additional pre-training tasks.
Figure 2: Our proposed FaD-VLP architecture consists of an image encoder, a text decoder, and a multimodal decoder, with three configurations conforming to various retrieval and captioning tasks: (a) Aligner / Captioner, (b) Relative Captioner, and (c) Fuser. Shared colors indicate shared parameters, curved arrows represent cross attention, and tokens with a bold border denote pooled representations.
3 Method
We introduce FaD-VLP, our architecture and pre-training method for fashion tasks. We first detail our architecture design (Figure 2), which unifies several retrieval and captioning settings. We then describe our pre-training approach.
3.1 Model Overview
To motivate our model architecture, we enumerate three desired properties:
i. Dual Image & Text Encoders. As referenced in Section 2, two-stream / dual-encoder architectures are more efficient for cross-modal retrieval than single-stream architectures. With dual encoders, candidate embeddings can be retrieved using a lightweight similarity function (e.g., dot product) with a particular query embedding (see the sketch after these properties).

ii. Dual Multimodal & Image Encoders. Key to our pre-training procedure is the alignment of multimodal representations with image representations. This setup is useful for the downstream task of image retrieval with text feedback: a target image is retrieved given an image with text feedback. We desire an architecture that is dual-stream with respect to a hybrid-modal input (image and text) and another image.

iii. Multimodal Decoder for Text Generation. For captioning tasks, we need to generate text given image input. Thus, we desire that the architecture contains a multimodal decoder.
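As a minimal illustration of why dual encoders make retrieval cheap (properties i and ii above), the sketch below precomputes candidate image embeddings once and scores a query embedding (from the text decoder, or from the multimodal decoder when text feedback accompanies an image) with a single matrix product. The function and variable names are ours for illustration and are not part of FaD-VLP.

```python
import torch


def rank_candidates(query_emb: torch.Tensor,
                    candidate_embs: torch.Tensor,
                    k: int = 10) -> torch.Tensor:
    """Return indices of the top-k candidate images for one query.

    `query_emb` can come from the text decoder (text-to-image retrieval)
    or from the multimodal decoder (image retrieval with text feedback);
    `candidate_embs` are image embeddings precomputed offline once.
    """
    query_emb = torch.nn.functional.normalize(query_emb, dim=-1)
    candidate_embs = torch.nn.functional.normalize(candidate_embs, dim=-1)
    scores = candidate_embs @ query_emb      # cosine similarity per candidate
    return scores.topk(k).indices


# Example with random embeddings (dimension 256 is an arbitrary choice).
gallery = torch.randn(10_000, 256)  # precomputed once for the whole catalogue
query = torch.randn(256)            # computed per user query at run time
top10 = rank_candidates(query, gallery)
```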
To satisfy (i) and (iii), prior work (Li et al., 2022) has used a mixture of unimodal encoders and encoder-decoders; more recently, Yu et al. (2022a) demonstrated the effectiveness of using a single decoder-based model: a decoder can be used for generation, but can also provide global representations given a whole sequence.
Building upon this result, our architecture is decoder-based, and consists of three modules: a visual encoder V, a text decoder T, and a multimodal decoder M. For V, we use a convolutional network. We obtain image token representations from the intermediate outputs of the convolutional network (i.e., the output of layers 3 and 4 in a ResNet-50, following Kim et al. (2021)). We obtain pooled representations from V using average pooling over the final feature map. We use a multi-layer transformer architecture for T and M. Each layer consists of a causal multi-headed self-attention module followed by a feed-forward network and layer normalization. For M, we also include a cross-attention layer between the image representation and the outputs of the causal self-attention. We extract pooled representations from T or M using the output corresponding to an [EOS] token (which has attended to all prior tokens).
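The PyTorch sketch below mirrors this module layout under stated assumptions: the hidden size, the class names (`VisualEncoder`, `DecoderLayer`), and the use of only the final ResNet stage for image tokens are our simplifications, not the paper's exact configuration. It shows a ResNet-50 backbone producing image tokens plus an average-pooled global feature, and a decoder layer with causal self-attention that optionally cross-attends to image tokens, as M does.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class VisualEncoder(nn.Module):
    """ResNet-50 backbone: image tokens from the last stage, plus a pooled feature.
    (The paper also uses layer-3 features; only layer 4 is kept here for brevity.)"""

    def __init__(self, dim=512):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(*list(backbone.children())[:-2])  # up to layer4
        self.proj = nn.Linear(2048, dim)

    def forward(self, images):                    # (B, 3, H, W)
        fmap = self.stem(images)                  # (B, 2048, h, w)
        tokens = self.proj(fmap.flatten(2).transpose(1, 2))  # (B, h*w, dim)
        pooled = tokens.mean(dim=1)               # average-pooled global feature
        return tokens, pooled


class DecoderLayer(nn.Module):
    """Causal self-attention + optional cross-attention (for M) + feed-forward."""

    def __init__(self, dim=512, heads=8, cross_attend=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = (nn.MultiheadAttention(dim, heads, batch_first=True)
                           if cross_attend else None)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim), nn.LayerNorm(dim),
                                     nn.LayerNorm(dim))

    def forward(self, x, image_tokens=None):      # x: (B, L, dim) text states
        L = x.size(1)
        causal = torch.ones(L, L, device=x.device).triu(1).bool()  # mask future tokens
        x = self.n1(x + self.self_attn(x, x, x, attn_mask=causal)[0])
        if self.cross_attn is not None and image_tokens is not None:
            x = self.n2(x + self.cross_attn(x, image_tokens, image_tokens)[0])
        return self.n3(x + self.ffn(x))
```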
Our architecture has the following modes:
(a) Aligner / Captioner. This mode can align cross-modal representations or caption an image. For alignment, we input a caption to T and an image to V, extracting the pooled representations. For captioning, we pass the outputs of T to M and condition M on the image via cross attention.
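As a rough sketch of how the two uses of this mode could be trained, the snippet below assumes a symmetric contrastive loss over the pooled T and V representations for alignment and standard next-token prediction for captioning. The figure labels these branches "Contrastive" and "Caption", but the exact loss formulations, temperature, and helper names here are our assumptions rather than the paper's stated objectives.

```python
import torch
import torch.nn.functional as F


def alignment_loss(text_pooled, image_pooled, temperature=0.07):
    """Symmetric InfoNCE-style loss over pooled T ([EOS]) and V (average-pooled)
    features of shape (B, D), matching each caption to its paired image."""
    t = F.normalize(text_pooled, dim=-1)
    v = F.normalize(image_pooled, dim=-1)
    logits = t @ v.T / temperature                       # (B, B) similarities
    targets = torch.arange(t.size(0), device=t.device)   # diagonal = positives
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2


def captioning_loss(token_logits, caption_ids, pad_id=0):
    """Autoregressive next-token prediction for M, which cross-attends to the
    image tokens produced by V while reading the outputs of T."""
    pred = token_logits[:, :-1].reshape(-1, token_logits.size(-1))
    tgt = caption_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, tgt, ignore_index=pad_id)
```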
(b) Relative Captioner. In this mode, we can input a text (e.g., a relative caption comparing two