Masked Vision-Language Transformer in Fashion
Ge-Peng Ji1, Mingcheng Zhuge1, Dehong Gao1, Deng-Ping Fan2, Christos Sakaridis2, and Luc Van Gool2
1International Core Business Unit, Alibaba Group, Hangzhou 310051, China.
2Computer Vision Lab, ETH Zürich, Zürich 8092, Switzerland.
Abstract
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture to replace BERT in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT easily generalizes to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at https://github.com/GewelsJI/MVLT.
Keywords: Vision-language, masked image reconstruction, transformer, fashion, e-commerce.
1 Introduction
The emergence of transformers is drawing enormous attention from the academic community, facilitating the advancement of computer vision (CV) [3,4] and natural language processing (NLP) [5,6]. Benefiting from the robustness of transformers, researchers have also contributed to the vision-language (VL) field [7–11] with zeal. To better utilize the pre-trained models in CV and NLP, existing general VL models are mainly based on the BERT model [12], adopt well-pretrained vision extractors [13,14], or both. However, general VL methods [15–17] still struggle when applied to the fashion domain in e-commerce because they suffer from two main issues:
Contributed equally. Corresponding author. This work was done while Ge-Peng Ji was a research intern at Alibaba Group.
Fig. 1 Different visual reconstruction tasks. Previous VL pre-training methods [1,2] utilize masked image modeling (top) with a random masking strategy (i.e., M padding replaces the raw vectors), which reconstructs pre-extracted visual semantics (i.e., probabilities) at the feature level. We introduce a generative task named masked image reconstruction (bottom), which directly reconstructs image patches at the pixel level.
a) Insufficient Granularity. Unlike general objects with complex backgrounds, focusing only on coarse-grained semantics is insufficient for a fashion product [18–20], as it leads the network to generate sub-optimal results. In contrast, a fashion-oriented framework requires more fine-grained representations, such as a suit with different materials (e.g., wool, linen, and cotton) or collars (e.g., band, camp, and Windsor). b) Poor Transferability. The pre-extracted visual features are not discriminative for fashion-oriented tasks, restricting the cross-modal representations.
To address the above issues, we present a novel VL framework, termed masked vision-language transformer (MVLT). Specifically, we introduce a generative task, masked image reconstruction (MIR), for the fashion-based VL framework. Compared to previous pre-training tasks, such as masked image modeling (a regression task) or masked image classification (a classification task), MIR enables the network to learn more fine-grained representations via pixel-level visual knowledge (see Fig. 1). Further, inspired by the pyramid vision transformer (PVT) [21], we adopt a pyramid architecture for our VL transformer. Together with the MIR task, these two improvements significantly enhance the ability to adapt to fashion-specific understanding and generative tasks, and allow training to be conducted in an end-to-end manner. As a result, MVLT can directly process raw multi-modal inputs in dense formats (i.e., linguistic tokens and visual patches) without extra pre-processing models (e.g., ResNet) [22,23]. Our main contributions are summarized as follows:
• We introduce a novel masked image reconstruction (MIR) task, which is the first truly pixel-level generative strategy utilized in VL pre-training.
• Based on the MIR task, we present an end-to-end VL framework, called MVLT, for the fashion domain, greatly promoting transferability to downstream tasks and large-scale web applications.
• Extensive experiments show that MVLT significantly outperforms state-of-the-art models on matching and generative tasks.
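To make the contrast between feature-level regression and the proposed pixel-level objective concrete, the sketch below shows one way a masked-patch reconstruction loss can be computed. It is a minimal illustration, not the paper's exact formulation (given in Sec. 3.2): the patch size, the toy masking ratio, the stand-in decoder output, and the choice of an L2 penalty averaged over masked patches are our assumptions.

```python
# Illustrative sketch of a pixel-level masked image reconstruction (MIR) loss.
# Patch size, masking ratio, and the L2 penalty are assumptions for clarity;
# the paper's exact objective is defined in its pre-training section.
import torch
import torch.nn.functional as F


def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Slice (B, 3, H, W) images into (B, N, P*P*3) flattened patches."""
    b, c, h, w = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)            # (B, 3, H/P, W/P, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()    # (B, H/P, W/P, 3, P, P)
    return patches.view(b, -1, c * p * p)                       # (B, N, 3*P*P)


def mir_loss(pred_patches: torch.Tensor,
             images: torch.Tensor,
             mask: torch.Tensor,
             patch_size: int = 16) -> torch.Tensor:
    """L2 reconstruction error, averaged over masked patches only."""
    target = patchify(images, patch_size)                                     # (B, N, 3*P*P)
    per_patch = F.mse_loss(pred_patches, target, reduction="none").mean(-1)   # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


# Toy usage: 2 images, 224x224, 16x16 patches -> N = 196 patches each.
images = torch.rand(2, 3, 224, 224)
pred = torch.rand(2, 196, 16 * 16 * 3)          # stand-in for a decoder's output
mask = (torch.rand(2, 196) < 0.25).float()      # ~25% of patches masked (assumed ratio)
print(mir_loss(pred, images, mask))
```

Because the target is the raw pixel content of the masked patches rather than pre-extracted ResNet features or class probabilities, no frozen visual extractor is required, which is what enables end-to-end training.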
2 Background
In recent years, BERT-based pre-training models
have been widely investigated in VL tasks. Many
previous attempts, such as LXMERT [24], VL-
BERT [25], and FashionBERT [1], were success-
ful in a wide range of downstream applications.
Experiments and discussions show that BERT is a
powerful method for learning multi-modal repre-
sentations, outperforming several previous CNN-
based [26] or LSTM-based [27,28] approaches.
Compared to previous studies, this paper aims to
develop a more efficient self-supervised objective
that can be easily implemented in pre-training
and provides better representations for real-world
applications. Thus, we review research on masked
learning strategies and end-to-end multi-modal
schemes that inspired us the most.
2.1 Masked Learning Strategies
Masked modeling is the key self-supervised task in BERT [12] and first demonstrated outstanding abilities in natural language processing. Because of its strength in language models, researchers have sought to replicate its utility in multi-modal and vision tasks. Most VL works [16,25,29] transfer masked modeling to visual tokens and use a regression task to reconstruct the token feature from a nonsense replacement, or a classification task to predict the token's attribute. To reduce the difficulty of learning, Kaleido-BERT [2] optimizes masked modeling by employing a Kaleido strategy that facilitates coherent learning of multi-grained semantics. Although this work indeed improves the performance of VL-related tasks in fashion, we argue that its token-patch pre-alignment scheme, which relies on auxiliary tools [30,31], is still complex and impedes application in practical settings. Another work [32] introduces the MLIM approach, which strengthens masked image modeling with an image reconstruction task and shares a similar idea with ours. However, our experiments show that requiring a model to reconstruct the entire image without any hint is too difficult. Recently, BEiT [33] and MAE [34] utilize BERT-style pre-training as part of the visual learner, and they discover that models are effective at learning semantics with such a scheme.
Fig. 2 Comparison of MVLT to cutting-edge fashion-oriented VL frameworks. FashionBERT (a) utilizes a language-based encoder (i.e., BERT) to extract VL representations from single-scale visual input (i.e., image patches). Kaleido-BERT (b) extends it with two upgrades: it adds five fixed-scale inputs (i.e., Kaleido patches) to acquire hierarchical visual features, and it designs Kaleido vision tasks to fully learn VL representations. However, the visual embedding of these models is frozen (i.e., without parameter updates); thus, the lack of domain-specific visual knowledge severely hinders their transferability. In contrast, our MVLT (c) adaptively learns hierarchical features by introducing masked vision tasks in an end-to-end framework, significantly boosting VL-related understanding and generation.
These two works strengthen our conviction that converting the original masked image modeling (i.e., a regression task) to a masked image reconstruction task is feasible. However, our primary goal is to design a generative pretext task that makes multi-modal modeling in VL pre-training easier while eliminating the need for prior knowledge. This is extremely helpful in our practical application setting with billion-scale data.
2.2 End-to-End Multi-Modal Schemes
Pixel-BERT [35] is the first method to consider end-to-end pre-training. It employs 2×2 max-pooling layers to reduce the spatial dimension of image features, so that each image is downsampled by a factor of 64. Although this work sets a precedent for end-to-end training, such a coarse and rigid method cannot work well in practical settings because it simply combines a ResNet [13] into joint pre-training, without considering the loss in speed and performance. Recently, VX2TEXT [36] proposed converting all modalities into the language space and then performing end-to-end pre-training using a relaxation scheme. Though it is exciting to translate all modalities into a unified latent space, it overlooks the fact that using data extracted by pre-trained models as input cannot be regarded as an end-to-end framework. Chronologically, ViLT [37] is the first method that truly investigates an end-to-end framework by replacing region- or grid-based features with patch-based projections. However, without further designs, it cannot obtain competitive performance, since it is just a vanilla extension of ViT [3]. Grid-VLP [38] is similar to ViLT, but it takes a further step by demonstrating that using a pre-trained CNN as the visual backbone can improve performance on downstream tasks. SOHO [39] takes the entire image as input and creates a visual dictionary to encode local regions. However, this method does not fit fashion-specific applications due to the lack of reliable alignment information. As a result, the visual dictionary may merely learn the location of the background or foreground rather than complex semantics. FashionVLP [40] uses a feedback strategy to achieve better retrieval performance. In practice, it uses well-pretrained knowledge extracted from ResNet and then models whole-image, cropped, and landmark representations; it also adopts Faster R-CNN as an object detector to extract RoI candidates. In addition, some works are designed for end-to-end pre-training [41–43], but they target specific tasks and are not directly applicable to our research.

Although existing methods employ different approaches to construct an end-to-end scheme, solutions that forgo pre-trained models (e.g., ResNet, BERT) and take raw data (i.e., text, image) as inputs remain under-explored and are urgently needed in multi-modal applications.
Remarks. As shown in Fig. 2, similar to the two existing fashion-based approaches, i.e., FashionBERT (a) and Kaleido-BERT (b), the proposed MVLT (c) is also a patch-based VL learner.
Fig. 3 Pipeline of our MVLT framework. The overall architecture consists of four stages, each containing language and visual embeddings and multiple transformer encoders ($\times M_k$). By introducing the masking strategy for three sub-tasks, i.e., masked image reconstruction (MIR), image-text matching (ITM), and masked language modeling (MLM), our MVLT can be trained in an end-to-end manner. More details can be found in Sec. 3.
It extends the pyramid vision transformer [21] to an architecture that adaptively extracts hierarchical representations for fashion cross-modal tasks. It is the first model to solve the end-to-end problem of VL pre-training in fashion, which allows us to simplify the implementation of MVLT in the fashion industry using a twin-tower architecture [44].
3 Masked Vision-Language Transformer
Our goal is to build an end-to-end VL framework for the fashion domain. The overall pipeline of our MVLT is depicted in Fig. 3. Like PVT, our architecture inherits the four-stage design and generates features at different scales. The two key components of the proposed architecture are the multi-modal encoder (Sec. 3.1) and the pre-training objectives (Sec. 3.2).
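Before detailing the encoder, the following high-level sketch illustrates the stage-wise fusion suggested by Fig. 3: at each stage the visual feature map is patch-embedded and downsampled, concatenated with the projected language tokens, processed by transformer encoder blocks, and split back into the two modalities. This is a structural sketch only; the class name ToyMultiModalEncoder is hypothetical, standard PyTorch encoder layers stand in for the PVT blocks with spatial-reduction attention (SRA), and all dimensions, depths, and projection choices are illustrative assumptions rather than the paper's configuration.

```python
# High-level structural sketch in the spirit of Fig. 3 (not the paper's modules).
# nn.TransformerEncoder stands in for PVT blocks with spatial-reduction attention;
# dimensions, depths, and the concat/divide fusion details are assumptions.
import torch
import torch.nn as nn


class ToyMultiModalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dims=(64, 128, 320, 512), depths=(2, 2, 2, 2)):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dims[0])
        # Strided convolutions downsample the visual map at every stage.
        self.patch_embed = nn.ModuleList([
            nn.Conv2d(3 if k == 0 else dims[k - 1], dims[k],
                      kernel_size=4 if k == 0 else 2,
                      stride=4 if k == 0 else 2) for k in range(4)])
        # Project language tokens to each stage's channel width.
        self.text_proj = nn.ModuleList(
            [nn.Identity()] + [nn.Linear(dims[k - 1], dims[k]) for k in range(1, 4)])
        self.stages = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dims[k], nhead=4, batch_first=True),
                num_layers=depths[k]) for k in range(4)])

    def forward(self, token_ids, image):
        t = self.text_embed(token_ids)                  # (B, L, D_1)
        v = image                                       # (B, 3, H, W)
        for k in range(4):
            v = self.patch_embed[k](v)                  # downsampled visual map
            b, d, h, w = v.shape
            v_seq = v.flatten(2).transpose(1, 2)        # (B, N_k, D_k)
            t = self.text_proj[k](t)                    # (B, L, D_k)
            x = torch.cat([t, v_seq], dim=1)            # joint VL sequence
            x = self.stages[k](x)                       # transformer encoder blocks
            t, v_seq = x[:, :t.size(1)], x[:, t.size(1):]   # divide back into modalities
            v = v_seq.transpose(1, 2).reshape(b, d, h, w)
        return t, v                                     # final language / vision features


# Toy usage with a small 64x64 image to keep the attention maps cheap.
model = ToyMultiModalEncoder()
t, v = model(torch.randint(0, 30522, (2, 16)), torch.rand(2, 3, 64, 64))
print(t.shape, v.shape)   # torch.Size([2, 16, 512]) torch.Size([2, 512, 2, 2])
```

The pyramid structure is what allows MVLT to expose hierarchical visual features to the three pre-training heads (MLM, ITM, MIR) without a frozen external extractor.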
3.1 Multi-Modal Encoder
As shown in Fig. 3, MVLT admits visual and verbal inputs. On the language side, we first tokenize the caption of a fashion product and use the special token [MASK] to randomly mask out caption tokens with the masking ratio¹ $r_l$. Following the masking procedure, we obtain a sequence of word tokens. Then, we insert a special [CLS] token at the head of this sequence. Besides, we pad the sequence to a unified length $L$ with the [PAD] token if it is shorter than 128 tokens. This procedure generates the language input ids $T \in \mathbb{R}^{L} = \langle t_1; \cdots; t_L \rangle$. On the vision side, we treat $I \in \mathbb{R}^{H \times W \times 3}$ as the visual input, where $H$ and $W$ denote the height and width of the given input. This input is sliced into multiple grid-like patches $V \in \mathbb{R}^{N \times P \times P \times 3} = \langle v_1; \cdots; v_N \rangle$, where $N = \frac{HW}{P^2}$ is the total number of patches and $P$ denotes the patch size. Similarly, the split patches are masked out with the mask ratio $r_v$. We provide more details about the above masking strategy for the language and vision parts in Sec. 3.2.
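The input preparation described above can be summarized with the following minimal sketch. The toy vocabulary, the whitespace tokenizer, and the concrete ratios are stand-ins for illustration (the text ratio follows BERT's 15% default per the footnote, while the visual ratio $r_v$ used here is an assumed value); MVLT's actual tokenizer and masking details follow Sec. 3.2.

```python
# Minimal sketch of the multi-modal input preparation described above.
# The toy vocabulary, whitespace tokenizer, and concrete ratios are
# illustrative assumptions, not MVLT's actual preprocessing.
import random
import torch

VOCAB = {"[PAD]": 0, "[CLS]": 1, "[MASK]": 2, "[UNK]": 3,
         "women's": 4, "sleeveless": 5, "long": 6, "dress": 7}
L = 128                  # unified sequence length
R_L, R_V = 0.15, 0.25    # text ratio (BERT default) / visual ratio (assumed)


def prepare_text(caption: str) -> torch.Tensor:
    """Tokenize, randomly mask with ratio r_l, prepend [CLS], and pad to length L."""
    ids = [VOCAB.get(w, VOCAB["[UNK]"]) for w in caption.lower().split()]
    ids = [VOCAB["[MASK]"] if random.random() < R_L else t for t in ids]
    ids = [VOCAB["[CLS]"]] + ids
    ids += [VOCAB["[PAD]"]] * (L - len(ids))       # pad up to the unified length
    return torch.tensor(ids[:L])                   # language input ids T of length L


def prepare_image(image: torch.Tensor, p: int = 16):
    """Slice a (3, H, W) image into N = HW / P^2 patches and mask a ratio r_v of them."""
    c, h, w = image.shape
    patches = image.unfold(1, p, p).unfold(2, p, p)                          # (3, H/P, W/P, P, P)
    patches = patches.permute(1, 2, 0, 3, 4).contiguous().view(-1, c, p, p)  # (N, 3, P, P)
    mask = torch.rand(patches.size(0)) < R_V       # True marks a masked patch
    patches[mask] = 0.0                            # zero out masked patches
    return patches, mask


tokens = prepare_text("Women's Sleeveless Long Dress")
patches, mask = prepare_image(torch.rand(3, 224, 224))
print(tokens.shape, patches.shape, int(mask.sum()))   # [128], (196, 3, 16, 16), #masked
```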
The above multi-modal inputs are embedded and fed into the subsequent four VL interaction stages (i.e., $k \in \{1, 2, 3, 4\}$). In the first stage, we generate the language and vision embeddings, $T_1$ and $V_1$ respectively, from the given inputs ($T$ and $V$). For the subsequent stages, we describe only the $k$-th stage for conciseness. As shown in the bottom part of Fig. 3, we first embed the language embedding $T_k \in \mathbb{R}^{L \times D_k}$ into
¹We follow the default setting in BERT [12].