FaD-VLP: Fashion Vision-and-Language Pre-training
towards Unified Retrieval and Captioning
Suvir Mirchandani
Stanford University
suvir@cs.stanford.edu
Licheng Yu
Meta AI
lichengyu@meta.com
Mengjiao Wang
Meta AI
mengjiaow@meta.com
Animesh Sinha
Meta AI
animeshsinha@meta.com
Wenwen Jiang
Meta AI
wenwenj@meta.com
Tao Xiang
Meta AI / University of Surrey
txiang@meta.com
Ning Zhang
Meta AI
ningzhang@meta.com
Abstract
Multimodal tasks in the fashion domain have significant potential for e-commerce, but involve challenging vision-and-language learning problems—e.g., retrieving a fashion item given a reference image plus text feedback from a user. Prior works on multimodal fashion tasks have either been limited by the data in individual benchmarks, or have leveraged generic vision-and-language pre-training but have not taken advantage of the characteristics of fashion data. Additionally, these works have mainly been restricted to multimodal understanding tasks. To address these gaps, we make two key contributions. First, we propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks. Second, we propose a flexible decoder-based model architecture capable of both fashion retrieval and captioning tasks. Together, our model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization.
1 Introduction
Artificial intelligence has taken the fashion industry by storm in recent years. Significant advances have been made in tasks like recommendation (McAuley et al., 2015; Deldjoo et al., 2022) and virtual try-on (Han et al., 2018; Yang et al., 2022). In addition to these primarily visual tasks, multimodal tasks are of particular interest in fashion for e-commerce applications: for example, text-to-image retrieval enables a shopper to identify a desired clothing item via a language query (Zhuge et al., 2021).
Figure 1: We present FaD-VLP, a flexible architecture and pre-training method that supports retrieval-based and captioning-based tasks in the fashion domain: cross-modal retrieval, image retrieval with text feedback, multimodal categorization, image captioning, and relative image captioning.
A key opportunity to enhance customers’ shopping experiences is in the development of interactive multimodal shopping assistants, whereby a user could converse with a system to identify a desired product (Yuan and Lam, 2021; Han et al., 2022). As in Figure 1, a smart assistant is expected to perform multiple diverse tasks, e.g., cross-modal retrieval, image retrieval with text feedback, multimodal categorization, image captioning, and relative image captioning. Among them, perhaps the most notable task in fashion is image retrieval with text feedback, where the goal is to retrieve a target image given a reference image coupled with a user’s language feedback (e.g., “show me a similar shirt in light blue with no print”) (Wu et al., 2021; Lee et al., 2021; Kim et al., 2021). In addition to retrieval-based tasks, a central capability of conversational shopping assistants is captioning-based tasks: describing items in detail (Yang et al., 2020) or the differences among them. However, existing works on image retrieval with text feedback have almost exclusively studied that task in isolation, focusing on specialized architectures and fusion methods, with data limited by particular benchmarks (Lee et al., 2021; Kim et al., 2021).
To train a model that can perform well on several fashion-specific multimodal use cases, we observe an opportunity in the vast availability of multimodal fashion data on e-commerce platforms. While vision-language pre-trained (VLP) models have been highly successful for the general domain (Lu et al., 2019; Li et al., 2020; Su et al., 2020), prior work has suggested that general VLP models are helpful but suboptimal for the fashion domain (Zhuge et al., 2021; Liu et al., 2021; Goenka et al., 2022). Fashion images represent a domain shift from the pre-training data (Liu et al., 2021), and fashion tasks often require fine-grained representations rather than coarse representations from general VLP models (Zhuge et al., 2021).
To this end, we propose a domain-specific fashion pre-training procedure that takes advantage of fashion image-text data from multiple fashion catalogues. Our approach is inspired by the way that users might shop, via comparisons: a user may first identify a product, express a desired change in language, and then look for a new product that better matches their preferences. Given that data in this triplet form—reference product, modification, target product—is not nearly as common as the paired image-text data, we propose a lightweight method for constructing weakly-supervised pseudo-triplet data from image-text pairs. Additionally, we propose a unified, decoder-based model architecture for both retrieval-based and captioning-based fashion tasks. Together, we refer to our architecture and pre-training approach as FaD-VLP: Fashion Decoder with Vision-and-Language Pre-training.
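To make the pseudo-triplet idea concrete, the sketch below shows one plausible way such triplets could be assembled from ordinary image-caption pairs: pair each catalogue item with a similar item and let the neighbour's caption stand in for the "modification" text. This is an illustrative assumption only, not the construction actually used by FaD-VLP (which is described in Section 3.2); the `Item` class, the embedding model, and the `build_pseudo_triplets` helper are hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Item:
    image_path: str
    caption: str
    embedding: np.ndarray  # any off-the-shelf image or text embedding


def build_pseudo_triplets(items, top_k=1):
    """Hypothetical sketch: pair each item with its nearest catalogue
    neighbour and use the neighbour's caption as weak "feedback" text.

    Returns (reference_image, modification_text, target_image) triplets.
    """
    embs = np.stack([it.embedding for it in items])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = embs @ embs.T                 # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)      # exclude self-matches

    triplets = []
    for i, ref in enumerate(items):
        for j in np.argsort(-sims[i])[:top_k]:
            tgt = items[j]
            # Weak supervision: the target's caption is a noisy surrogate
            # for the language feedback that would turn `ref` into `tgt`.
            triplets.append((ref.image_path, tgt.caption, tgt.image_path))
    return triplets
```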
To summarize, we make the following contributions. We propose a unified architecture for retrieval-based and captioning-based fashion tasks (Section 3.1) and a fashion pre-training framework, including two novel pre-training tasks based on weakly-supervised pseudo-triplets (Section 3.2). Our approach achieves competitive performance on seven downstream fashion tasks: image-to-text retrieval, text-to-image retrieval, image retrieval with text feedback, category recognition, subcategory recognition, image captioning, and relative image captioning (Sections 4 and 5.1). Finally, we conduct a thorough ablation study to analyze the effects of our pre-training procedure (Section 5.2).
2 Related Work
A substantial body of work has focused on using the Transformer architecture (Vaswani et al., 2017) in the context of vision-and-language pre-training (VLP) (Li et al., 2019; Su et al., 2020; Chen et al., 2020b; Radford et al., 2021a; Li et al., 2021a; Yu et al., 2022a). Recent works have begun to focus on the fashion domain (Gao et al., 2020; Zhuge et al., 2021; Zhu et al., 2021; Dong et al., 2021; Zhang et al., 2021; Goenka et al., 2022; Yu et al., 2022b). VLP works generally differ in their choice of model architecture and pre-training objectives.
Model Architecture. Most existing VLP models, especially in the fashion domain, use encoder-style modules for both image and text, focusing on multimodal understanding tasks (which do not involve generation—e.g., image-text retrieval, multimodal classification). There are two main classes of these models: (i) single-stream early fusion (Li et al., 2019; Su et al., 2020; Chen et al., 2020b; Li et al., 2020), and (ii) two-stream late fusion (Tan and Bansal, 2019; Lu et al., 2019; Jia et al., 2021; Radford et al., 2021a). The nature of the downstream tasks often influences the choice of the number of streams; e.g., image-text retrieval is most practical with late fusion architectures, which can have faster inference. In this work, we propose a flexible decoder-based model architecture, which combines the advantages of both early and late fusion mechanisms, and is capable of not only multimodal understanding tasks but also captioning tasks (e.g., image captioning and relative image captioning).
Pre-training Objectives. Several pre-training tasks have been effectively used for VLP. Some of the most popular include masked modeling or matching objectives for the different modalities (Li et al., 2019; Lu et al., 2019; Su et al., 2020; Chen et al., 2020b); others include cross-modal contrastive learning (Li et al., 2021a; Radford et al., 2021a; Li et al., 2021b), caption generation (Zhou et al., 2020; Wang et al., 2022), and object tagging (Li et al., 2020). Fashion data has some unique properties that could be leveraged during pre-training, partly to mitigate the domain shift which makes generic VLP less effective for fashion (Zhuge et al., 2021). For example, fashion captions contain more structured attributes, which naturally invites comparison when people choose items to shop for. Inspired by this, we propose using weakly-supervised triplet-based comparisons as the basis for additional pre-training tasks.
Figure 2: Our proposed FaD-VLP architecture consists of an image encoder, a text decoder, and a multimodal decoder, with three configurations conforming to various retrieval and captioning tasks: (a) Aligner / Captioner, (b) Relative Captioner, and (c) Fuser. Shared colors indicate shared parameters, curved arrows represent cross attention, and tokens with a bold border denote pooled representations.
3 Method
We introduce FaD-VLP, our architecture and pre-training method for fashion tasks. We first detail our architecture design (Figure 2), which unifies several retrieval and captioning settings. We then describe our pre-training approach.
3.1 Model Overview
To motivate our model architecture, we enumerate three desired properties:
i. Dual Image & Text Encoders. As referenced in Section 2, two-stream / dual-encoder architectures are more efficient for cross-modal retrieval than single-stream architectures. With dual encoders, candidate embeddings can be retrieved using a lightweight similarity function (e.g., dot product) with a particular query embedding (see the sketch after these properties).

ii. Dual Multimodal & Image Encoders. Key to our pre-training procedure is the alignment of multimodal representations with image representations. This setup is useful for the downstream task of image retrieval with text feedback: a target image is retrieved given an image with text feedback. We desire an architecture that is dual-stream with respect to a hybrid-modal input (image and text) and another image.

iii. Multimodal Decoder for Text Generation. For captioning tasks, we need to generate text given image input. Thus, we desire that the architecture contains a multimodal decoder.
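As a minimal illustration of why dual encoders make retrieval cheap (properties i and ii above), the sketch below precomputes candidate image embeddings once and scores a query embedding (from the text decoder, or from the multimodal decoder when text feedback accompanies an image) with a single matrix product. The function and variable names are ours for illustration and are not part of FaD-VLP.

```python
import torch


def rank_candidates(query_emb: torch.Tensor,
                    candidate_embs: torch.Tensor,
                    k: int = 10) -> torch.Tensor:
    """Return indices of the top-k candidate images for one query.

    `query_emb` can come from the text decoder (text-to-image retrieval)
    or from the multimodal decoder (image retrieval with text feedback);
    `candidate_embs` are image embeddings precomputed offline once.
    """
    query_emb = torch.nn.functional.normalize(query_emb, dim=-1)
    candidate_embs = torch.nn.functional.normalize(candidate_embs, dim=-1)
    scores = candidate_embs @ query_emb      # cosine similarity per candidate
    return scores.topk(k).indices


# Example with random embeddings (dimension 256 is an arbitrary choice).
gallery = torch.randn(10_000, 256)  # precomputed once for the whole catalogue
query = torch.randn(256)            # computed per user query at run time
top10 = rank_candidates(query, gallery)
```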
To satisfy (i) and (iii), prior work (Li et al., 2022) has used a mixture of unimodal encoders and encoder-decoders; more recently, Yu et al. (2022a) demonstrated the effectiveness of using a single decoder-based model: a decoder can be used for generation, but can also provide global representations given a whole sequence.
Building upon this result, our architecture is decoder-based, and consists of three modules: a visual encoder V, a text decoder T, and a multimodal decoder M. For V, we use a convolutional network. We obtain image token representations from the intermediate outputs of the convolutional network (i.e., the output of layers 3 and 4 in a ResNet-50, following Kim et al. (2021)). We obtain pooled representations from V using average pooling over the final feature map. We use a multi-layer transformer architecture for T and M. Each layer consists of a causal multi-headed self-attention module followed by a feed-forward network and layer normalization. For M, we also include a cross-attention layer between the image representation and the outputs of the causal self-attention. We extract pooled representations from T or M using the output corresponding to an [EOS] token (which has attended to all prior tokens).
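The PyTorch sketch below mirrors this module layout under stated assumptions: the hidden size, the class names (`VisualEncoder`, `DecoderLayer`), and the use of only the final ResNet stage for image tokens are our simplifications, not the paper's exact configuration. It shows a ResNet-50 backbone producing image tokens plus an average-pooled global feature, and a decoder layer with causal self-attention that optionally cross-attends to image tokens, as M does.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class VisualEncoder(nn.Module):
    """ResNet-50 backbone: image tokens from the last stage, plus a pooled feature.
    (The paper also uses layer-3 features; only layer 4 is kept here for brevity.)"""

    def __init__(self, dim=512):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(*list(backbone.children())[:-2])  # up to layer4
        self.proj = nn.Linear(2048, dim)

    def forward(self, images):                    # (B, 3, H, W)
        fmap = self.stem(images)                  # (B, 2048, h, w)
        tokens = self.proj(fmap.flatten(2).transpose(1, 2))  # (B, h*w, dim)
        pooled = tokens.mean(dim=1)               # average-pooled global feature
        return tokens, pooled


class DecoderLayer(nn.Module):
    """Causal self-attention + optional cross-attention (for M) + feed-forward."""

    def __init__(self, dim=512, heads=8, cross_attend=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = (nn.MultiheadAttention(dim, heads, batch_first=True)
                           if cross_attend else None)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim), nn.LayerNorm(dim),
                                     nn.LayerNorm(dim))

    def forward(self, x, image_tokens=None):      # x: (B, L, dim) text states
        L = x.size(1)
        causal = torch.ones(L, L, device=x.device).triu(1).bool()  # mask future tokens
        x = self.n1(x + self.self_attn(x, x, x, attn_mask=causal)[0])
        if self.cross_attn is not None and image_tokens is not None:
            x = self.n2(x + self.cross_attn(x, image_tokens, image_tokens)[0])
        return self.n3(x + self.ffn(x))
```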
Our architecture has the following modes:
(a) Aligner / Captioner. This mode can align cross-modal representations or caption an image. For alignment, we input a caption to T and an image to V, extracting the pooled representations. For captioning, we pass the outputs of T to M and condition M on the image via cross attention.
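As a rough sketch of how the two uses of this mode could be trained, the snippet below assumes a symmetric contrastive loss over the pooled T and V representations for alignment and standard next-token prediction for captioning. The figure labels these branches "Contrastive" and "Caption", but the exact loss formulations, temperature, and helper names here are our assumptions rather than the paper's stated objectives.

```python
import torch
import torch.nn.functional as F


def alignment_loss(text_pooled, image_pooled, temperature=0.07):
    """Symmetric InfoNCE-style loss over pooled T ([EOS]) and V (average-pooled)
    features of shape (B, D), matching each caption to its paired image."""
    t = F.normalize(text_pooled, dim=-1)
    v = F.normalize(image_pooled, dim=-1)
    logits = t @ v.T / temperature                       # (B, B) similarities
    targets = torch.arange(t.size(0), device=t.device)   # diagonal = positives
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2


def captioning_loss(token_logits, caption_ids, pad_id=0):
    """Autoregressive next-token prediction for M, which cross-attends to the
    image tokens produced by V while reading the outputs of T."""
    pred = token_logits[:, :-1].reshape(-1, token_logits.size(-1))
    tgt = caption_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, tgt, ignore_index=pad_id)
```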
(b) Relative Captioner. In this mode, we can input a text (e.g., a relative caption comparing two