Masked Vision-Language Transformer in Fashion
Ge-Peng Ji1, Mingcheng Zhuge1, Dehong Gao1, Deng-Ping Fan2, Christos Sakaridis2, and Luc Van Gool2
1International Core Business Unit, Alibaba Group, Hangzhou 310051, China.
2Computer Vision Lab, ETH Zürich, Zürich 8092, Switzerland.
Abstract
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture to replace BERT in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT easily generalizes to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at https://github.com/GewelsJI/MVLT.
Keywords: Vision-language, masked image reconstruction, transformer, fashion, e-commerce.
1 Introduction
The emergence of transformers is drawing enormous attention from the academic community, facilitating the advancement of computer vision (CV) [3,4] and natural language processing (NLP) [5,6]. Benefiting from the robustness of transformers, researchers have also contributed to the vision-language (VL) field [7–11] with zeal. To better utilize the pre-trained models in CV and NLP, existing general VL models are mainly based on the BERT model [12], adopt well-pretrained vision extractors [13,14], or both. However, general VL methods [15–17] still struggle when applied to the fashion domain in e-commerce because they suffer from two main issues:
Contributed equally. Corresponding author. This work was done while Ge-Peng Ji was a research intern at Alibaba Group.
Fig. 1 Different visual reconstruction tasks. Previous VL pre-training methods [1,2] utilize masked image modeling (top) with a random masking strategy (i.e., M padding replaces the raw vectors), which reconstructs pre-extracted visual semantics (i.e., probabilities) at the feature level. We introduce a generative task named masked image reconstruction (bottom), which directly reconstructs image patches at the pixel level.
a) Insufficient Granularity. Unlike general objects with complex backgrounds, focusing only on coarse-grained semantics is insufficient for a fashion product [18–20], as it leads the network to generate sub-optimal results. In contrast, a fashion-oriented framework requires more fine-grained representations, such as a suit with different materials (e.g., wool, linen, and cotton) or collars (e.g., band, camp, and Windsor). b) Poor Transferability. The pre-extracted visual features are not discriminative for fashion-oriented tasks, restricting the cross-modal representations.
To address the above issues, we present a novel VL framework, termed masked vision-language transformer (MVLT). Specifically, we introduce a generative task, masked image reconstruction (MIR), for the fashion-based VL framework. Compared to previous pre-training tasks, such as masked image modeling (a regression task) or masked image classification (a classification task), MIR enables the network to learn more fine-grained representations via pixel-level visual knowledge (see Fig. 1). Further, inspired by the pyramid vision transformer (PVT) [21], we adopt a pyramid architecture for our VL transformer. Together with the MIR task, these two improvements significantly enhance the ability to adapt to fashion-specific understanding and generative tasks, and allow training to be conducted in an end-to-end manner. As a result, MVLT can directly process raw multi-modal inputs in dense formats (i.e., linguistic tokens and visual patches) without extra pre-processing models (e.g., ResNet) [22,23]. Our main contributions are summarized as follows:
• We introduce a novel masked image reconstruction (MIR) task, which is the first truly pixel-level generative strategy utilized in VL pre-training.
• Based on the MIR task, we present an end-to-end VL framework, called MVLT, for the fashion domain, greatly promoting transferability to downstream tasks and large-scale web applications.
• Extensive experiments show that MVLT significantly outperforms state-of-the-art models on matching and generative tasks.
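To make the contrast between feature-level regression and the proposed pixel-level objective concrete, the sketch below shows one way a masked-patch reconstruction loss can be computed. It is a minimal illustration, not the paper's exact formulation (given in Sec. 3.2): the patch size, the toy masking ratio, the stand-in decoder output, and the choice of an L2 penalty averaged over masked patches are our assumptions.

```python
# Illustrative sketch of a pixel-level masked image reconstruction (MIR) loss.
# Patch size, masking ratio, and the L2 penalty are assumptions for clarity;
# the paper's exact objective is defined in its pre-training section.
import torch
import torch.nn.functional as F


def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Slice (B, 3, H, W) images into (B, N, P*P*3) flattened patches."""
    b, c, h, w = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)            # (B, 3, H/P, W/P, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()    # (B, H/P, W/P, 3, P, P)
    return patches.view(b, -1, c * p * p)                       # (B, N, 3*P*P)


def mir_loss(pred_patches: torch.Tensor,
             images: torch.Tensor,
             mask: torch.Tensor,
             patch_size: int = 16) -> torch.Tensor:
    """L2 reconstruction error, averaged over masked patches only."""
    target = patchify(images, patch_size)                                     # (B, N, 3*P*P)
    per_patch = F.mse_loss(pred_patches, target, reduction="none").mean(-1)   # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


# Toy usage: 2 images, 224x224, 16x16 patches -> N = 196 patches each.
images = torch.rand(2, 3, 224, 224)
pred = torch.rand(2, 196, 16 * 16 * 3)          # stand-in for a decoder's output
mask = (torch.rand(2, 196) < 0.25).float()      # ~25% of patches masked (assumed ratio)
print(mir_loss(pred, images, mask))
```

Because the target is the raw pixel content of the masked patches rather than pre-extracted ResNet features or class probabilities, no frozen visual extractor is required, which is what enables end-to-end training.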
2 Background
In recent years, BERT-based pre-training models
have been widely investigated in VL tasks. Many
previous attempts, such as LXMERT [24], VL-
BERT [25], and FashionBERT [1], were success-
ful in a wide range of downstream applications.
Experiments and discussions show that BERT is a
powerful method for learning multi-modal repre-
sentations, outperforming several previous CNN-
based [26] or LSTM-based [27,28] approaches.
Compared to previous studies, this paper aims to
develop a more efficient self-supervised objective
that can be easily implemented in pre-training
and provides better representations for real-world
applications. Thus, we review research on masked
learning strategies and end-to-end multi-modal
schemes that inspired us the most.
2.1 Masked Learning Strategies
Masked modeling is the key self-supervised task in BERT [12] and first demonstrated outstanding abilities in natural language processing. Because of its strength in language models, researchers have sought to replicate its utility in multi-modal and vision tasks. Most VL works [16,25,29] transfer masked modeling to visual tokens and use a regression task to reconstruct the token feature from a nonsense replacement, or a classification task to predict the token's attribute. To reduce the difficulty of learning, Kaleido-BERT [2] optimizes masked modeling by employing a Kaleido strategy that facilitates coherent learning of multi-grained semantics. Although this work indeed improves the performance of VL-related tasks in fashion, we argue that its token-patch pre-alignment scheme, which relies on auxiliary tools [30,31], is still complex and impedes application in practical settings. Another work [32] introduces the MLIM approach, which strengthens masked image modeling with an image reconstruction task and shares a similar idea with ours. However, our experiments show that requiring a model to reconstruct the entire image without any hint is too difficult. Recently, BEiT [33] and MAE [34] utilize BERT-style pre-training as part of the visual learner, and they discover that models are effective at learning semantics with such a scheme.
Fig. 2 Comparison of MVLT to cutting-edge fashion-oriented VL frameworks. FashionBERT (a) utilizes a language-based encoder (i.e., BERT) to extract VL representations from single-scale visual input (i.e., image patches). Kaleido-BERT (b) extends it with two upgrades: it adds five fixed-scale inputs (i.e., Kaleido patches) to acquire hierarchical visual features, and it designs Kaleido vision tasks to fully learn VL representations. However, the visual embedding of these models is frozen (i.e., without parameter updates); thus, the lack of domain-specific visual knowledge severely hinders their transferability. In contrast, our MVLT (c) adaptively learns hierarchical features by introducing masked vision tasks in an end-to-end framework, significantly boosting VL-related understanding and generation.
These two works strengthen our conviction that converting the original masked image modeling (i.e., a regression task) to a masked image reconstruction task is feasible. However, our primary goal is to design a generative pretext task that makes multi-modal modeling in VL pre-training easier while eliminating the need for prior knowledge. This is extremely helpful in our practical application setting with billion-scale data.
2.2 End-to-End Multi-Modal Schemes
Pixel-BERT [35] is the first method to consider end-to-end pre-training. It employs 2×2 max-pooling layers to reduce the spatial dimension of image features, so that each image is downsampled by a factor of 64. Although this work sets a precedent for end-to-end training, such a coarse and rigid method cannot work well in practical settings because it simply combines a ResNet [13] into joint pre-training, without considering the loss in speed and performance. Recently, VX2TEXT [36] proposed converting all modalities into the language space and then performing end-to-end pre-training using a relaxation scheme. Though it is exciting to translate all modalities into a unified latent space, it overlooks the fact that using data extracted by pre-trained models as input cannot be regarded as an end-to-end framework. Chronologically, ViLT [37] is the first method that truly investigates an end-to-end framework by replacing region- or grid-based features with patch-based projections. However, without further designs, it cannot obtain competitive performance, since it is just a vanilla extension of ViT [3]. Grid-VLP [38] is similar to ViLT, but it takes a further step by demonstrating that using a pre-trained CNN as the visual backbone can improve performance on downstream tasks. SOHO [39] takes the entire image as input and creates a visual dictionary to encode local regions. However, this method does not fit fashion-specific applications due to the lack of reliable alignment information. As a result, the visual dictionary may merely learn the location of the background or foreground rather than complex semantics. FashionVLP [40] uses a feedback strategy to achieve better retrieval performance. In practice, it uses well-pretrained knowledge extracted from ResNet and then models whole-image, cropped, and landmark representations; it also adopts Faster R-CNN as an object detector to extract RoI candidates. In addition, some works are designed for end-to-end pre-training [41–43], but they target specific tasks and are not directly applicable to our research.

Although existing methods employ different approaches to construct an end-to-end scheme, solutions that forgo pre-trained models (e.g., ResNet, BERT) and take raw data (i.e., text, image) as inputs remain under-explored and are urgently needed in multi-modal applications.
Remarks. As shown in Fig. 2, similar to the two existing fashion-based approaches, i.e., FashionBERT (a) and Kaleido-BERT (b), the proposed MVLT (c) is also a patch-based VL learner.
Fig. 3 Pipeline of our MVLT framework. The overall architecture consists of four stages, each containing language and visual embeddings and multiple transformer encoders ($\times M_k$). By introducing the masking strategy for three sub-tasks, i.e., masked image reconstruction (MIR), image-text matching (ITM), and masked language modeling (MLM), our MVLT can be trained in an end-to-end manner. More details can be found in Sec. 3.
It extends the pyramid vision transformer [21] to an architecture that adaptively extracts hierarchical representations for fashion cross-modal tasks. It is the first model to solve the end-to-end problem of VL pre-training in fashion, which allows us to simplify the implementation of MVLT in the fashion industry using a twin-tower architecture [44].
3 Masked Vision-Language Transformer
Our goal is to build an end-to-end VL framework for the fashion domain. The overall pipeline of our MVLT is depicted in Fig. 3. Like PVT, our architecture inherits the four-stage design and generates features at different scales. The two key components of the proposed architecture are the multi-modal encoder (Sec. 3.1) and the pre-training objectives (Sec. 3.2).
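Before detailing the encoder, the following high-level sketch illustrates the stage-wise fusion suggested by Fig. 3: at each stage the visual feature map is patch-embedded and downsampled, concatenated with the projected language tokens, processed by transformer encoder blocks, and split back into the two modalities. This is a structural sketch only; the class name ToyMultiModalEncoder is hypothetical, standard PyTorch encoder layers stand in for the PVT blocks with spatial-reduction attention (SRA), and all dimensions, depths, and projection choices are illustrative assumptions rather than the paper's configuration.

```python
# High-level structural sketch in the spirit of Fig. 3 (not the paper's modules).
# nn.TransformerEncoder stands in for PVT blocks with spatial-reduction attention;
# dimensions, depths, and the concat/divide fusion details are assumptions.
import torch
import torch.nn as nn


class ToyMultiModalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dims=(64, 128, 320, 512), depths=(2, 2, 2, 2)):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dims[0])
        # Strided convolutions downsample the visual map at every stage.
        self.patch_embed = nn.ModuleList([
            nn.Conv2d(3 if k == 0 else dims[k - 1], dims[k],
                      kernel_size=4 if k == 0 else 2,
                      stride=4 if k == 0 else 2) for k in range(4)])
        # Project language tokens to each stage's channel width.
        self.text_proj = nn.ModuleList(
            [nn.Identity()] + [nn.Linear(dims[k - 1], dims[k]) for k in range(1, 4)])
        self.stages = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dims[k], nhead=4, batch_first=True),
                num_layers=depths[k]) for k in range(4)])

    def forward(self, token_ids, image):
        t = self.text_embed(token_ids)                  # (B, L, D_1)
        v = image                                       # (B, 3, H, W)
        for k in range(4):
            v = self.patch_embed[k](v)                  # downsampled visual map
            b, d, h, w = v.shape
            v_seq = v.flatten(2).transpose(1, 2)        # (B, N_k, D_k)
            t = self.text_proj[k](t)                    # (B, L, D_k)
            x = torch.cat([t, v_seq], dim=1)            # joint VL sequence
            x = self.stages[k](x)                       # transformer encoder blocks
            t, v_seq = x[:, :t.size(1)], x[:, t.size(1):]   # divide back into modalities
            v = v_seq.transpose(1, 2).reshape(b, d, h, w)
        return t, v                                     # final language / vision features


# Toy usage with a small 64x64 image to keep the attention maps cheap.
model = ToyMultiModalEncoder()
t, v = model(torch.randint(0, 30522, (2, 16)), torch.rand(2, 3, 64, 64))
print(t.shape, v.shape)   # torch.Size([2, 16, 512]) torch.Size([2, 512, 2, 2])
```

The pyramid structure is what allows MVLT to expose hierarchical visual features to the three pre-training heads (MLM, ITM, MIR) without a frozen external extractor.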
3.1 Multi-Modal Encoder
As shown in Fig. 3, MVLT admits visual and verbal inputs. On the language side, we first tokenize the caption of a fashion product and use the special token [MASK] to randomly mask out caption tokens with the masking ratio¹ $r_l$. Following the masking procedure, we obtain a sequence of word tokens. Then, we insert a special [CLS] token at the head of this sequence. Besides, we pad the sequence to a unified length $L$ with the [PAD] token if it is shorter than 128 tokens. This procedure generates the language input ids $T \in \mathbb{R}^{L} = \langle t_1; \cdots; t_L \rangle$. On the vision side, we treat $I \in \mathbb{R}^{H \times W \times 3}$ as the visual input, where $H$ and $W$ denote the height and width of the given input. This input is sliced into multiple grid-like patches $V \in \mathbb{R}^{N \times P \times P \times 3} = \langle v_1; \cdots; v_N \rangle$, where $N = \frac{HW}{P^2}$ is the total number of patches and $P$ denotes the patch size. Similarly, the split patches are masked out with the mask ratio $r_v$. We provide more details about the above masking strategy for the language and vision parts in Sec. 3.2.
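The input preparation described above can be summarized with the following minimal sketch. The toy vocabulary, the whitespace tokenizer, and the concrete ratios are stand-ins for illustration (the text ratio follows BERT's 15% default per the footnote, while the visual ratio $r_v$ used here is an assumed value); MVLT's actual tokenizer and masking details follow Sec. 3.2.

```python
# Minimal sketch of the multi-modal input preparation described above.
# The toy vocabulary, whitespace tokenizer, and concrete ratios are
# illustrative assumptions, not MVLT's actual preprocessing.
import random
import torch

VOCAB = {"[PAD]": 0, "[CLS]": 1, "[MASK]": 2, "[UNK]": 3,
         "women's": 4, "sleeveless": 5, "long": 6, "dress": 7}
L = 128                  # unified sequence length
R_L, R_V = 0.15, 0.25    # text ratio (BERT default) / visual ratio (assumed)


def prepare_text(caption: str) -> torch.Tensor:
    """Tokenize, randomly mask with ratio r_l, prepend [CLS], and pad to length L."""
    ids = [VOCAB.get(w, VOCAB["[UNK]"]) for w in caption.lower().split()]
    ids = [VOCAB["[MASK]"] if random.random() < R_L else t for t in ids]
    ids = [VOCAB["[CLS]"]] + ids
    ids += [VOCAB["[PAD]"]] * (L - len(ids))       # pad up to the unified length
    return torch.tensor(ids[:L])                   # language input ids T of length L


def prepare_image(image: torch.Tensor, p: int = 16):
    """Slice a (3, H, W) image into N = HW / P^2 patches and mask a ratio r_v of them."""
    c, h, w = image.shape
    patches = image.unfold(1, p, p).unfold(2, p, p)                          # (3, H/P, W/P, P, P)
    patches = patches.permute(1, 2, 0, 3, 4).contiguous().view(-1, c, p, p)  # (N, 3, P, P)
    mask = torch.rand(patches.size(0)) < R_V       # True marks a masked patch
    patches[mask] = 0.0                            # zero out masked patches
    return patches, mask


tokens = prepare_text("Women's Sleeveless Long Dress")
patches, mask = prepare_image(torch.rand(3, 224, 224))
print(tokens.shape, patches.shape, int(mask.sum()))   # [128], (196, 3, 16, 16), #masked
```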
The above multi-modal inputs are embedded and fed into the subsequent four VL interaction stages (i.e., $k \in \{1, 2, 3, 4\}$). In the first stage, we generate the language and vision embeddings, $T_1$ and $V_1$ respectively, from the given inputs ($T$ and $V$). For the subsequent stages, we describe only the $k$-th stage for conciseness. As shown in the bottom part of Fig. 3, we first embed the language embedding $T_k \in \mathbb{R}^{L \times D_k}$ into
¹We follow the default setting in BERT [12].