a) Insufficient Granularity. Unlike general
objects with complex backgrounds, a fashion
product [18–20] cannot be adequately described
by coarse-grained semantics alone; focusing only
on them leads the network to sub-optimal results.
Instead, a fashion-oriented framework requires
more fine-grained representations, such as
distinguishing a suit by its material (e.g., wool,
linen, or cotton) or collar (e.g., band, camp, or
Windsor). b) Bad Transferability. The
pre-extracted visual features are not discriminative
for fashion-oriented tasks, which restricts the
learned cross-modal representations.
To address the above issues, we present a
novel VL framework, termed masked vision-
language transformer (MVLT). Specifically, we
introduce a generative task, masked image recon-
struction (MIR), for the fashion-based VL frame-
work. Compared to previous pre-training tasks,
such as masked image modeling (regression
task) or masked image classification (classifica-
tion task), MIR enables the network to learn
more fine-grained representations via pixel-level
visual knowledge (see Fig. 1). Further, inspired
by the pyramid vision transformer (PVT) [21], we
adopt a pyramid architecture for our VL
transformer. Together with the MIR task, these
two improvements significantly enhance the
model's ability to adapt to fashion-specific
understanding and generative tasks, and can be
applied in an end-to-end manner. As a result,
MVLT directly processes raw multi-modal inputs
in dense formats (i.e., linguistic tokens and visual
patches) without extra pre-processing models
(e.g., ResNet) [22,23].
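As a rough, self-contained sketch of what such a pixel-level objective looks like in code, the toy model below patchifies a raw image, replaces a random subset of patches with a mask token, and regresses the original pixels of only the masked patches. The class name ToyMIR, the layer sizes, and the masking ratio are illustrative assumptions and do not correspond to MVLT's actual pyramid, multi-modal design.

# Minimal sketch of a pixel-level masked image reconstruction objective.
# Hypothetical toy model for exposition; not the authors' implementation.
import torch
import torch.nn as nn

class ToyMIR(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, mask_ratio=0.25):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        # raw patches go straight into a linear embedding: no ResNet features
        self.embed = nn.Linear(3 * patch * patch, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=4)
        self.decoder = nn.Linear(dim, 3 * patch * patch)  # predict raw pixels

    def patchify(self, x):  # (B, 3, H, W) -> (B, N, 3*p*p)
        B, C, H, W = x.shape
        p = self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    def forward(self, imgs):
        patches = self.patchify(imgs)
        B, N, _ = patches.shape
        mask = torch.rand(B, N, device=imgs.device) < self.mask_ratio
        tokens = self.embed(patches)
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, -1), tokens)
        pred = self.decoder(self.encoder(tokens))
        # pixel-level loss, computed only on the masked patches
        return ((pred - patches) ** 2)[mask].mean()

loss = ToyMIR()(torch.randn(2, 3, 224, 224))

In the full framework, language tokens are encoded jointly with the visual patches, and the pyramid stages and positional embeddings (omitted here for brevity) also apply; the sketch only isolates the pixel-level reconstruction loss.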
Our main contributions are summarized as follows:
• We introduce a novel masked image
reconstruction (MIR) task, which is the first truly
pixel-level generative strategy used in VL
pre-training.
• Based on the MIR task, we present an
end-to-end VL framework, called MVLT, for the
fashion domain, which greatly improves
transferability to downstream tasks and
large-scale web applications.
• Extensive experiments show that MVLT
significantly outperforms state-of-the-art models
on matching and generative tasks.
2 Background
In recent years, BERT-based pre-training models
have been widely investigated in VL tasks. Many
previous attempts, such as LXMERT [24], VL-
BERT [25], and FashionBERT [1], were success-
ful in a wide range of downstream applications.
Experiments and discussions show that BERT is a
powerful method for learning multi-modal repre-
sentations, outperforming several previous CNN-
based [26] or LSTM-based [27,28] approaches.
Compared to previous studies, this paper aims to
develop a more efficient self-supervised objective
that can be easily implemented in pre-training
and provides better representations for real-world
applications. Thus, we review research on masked
learning strategies and end-to-end multi-modal
schemes that inspired us the most.
2.1 Masked Learning Strategies
Masked modeling is the key self-supervised task
in BERT [12], where it first demonstrated its
outstanding capability in natural language
processing. Because of this strength in language
models, researchers have sought to replicate its
success in multi-modal and vision tasks. Most VL
works [16,25,29] transfer masked modeling to
visual tokens, using either a regression task to
reconstruct the feature of a masked
(nonsense-replaced) token or a classification task
to predict the token's attribute.
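For concreteness, the toy snippet below contrasts these two conventional objectives on pre-extracted visual-token features; the head names, feature dimensions, and masking ratio are illustrative assumptions rather than the configuration of any cited model.

# Hedged sketch of the two conventional masked visual-modeling objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_attrs, B, N = 768, 1000, 2, 36
hidden = torch.randn(B, N, dim)          # transformer outputs at visual-token positions
target_feat = torch.randn(B, N, dim)     # stand-in for pre-extracted region features
target_attr = torch.randint(0, num_attrs, (B, N))
mask = torch.rand(B, N) < 0.15           # positions whose inputs were mask-replaced

# (a) regression variant: reconstruct the masked token's feature vector
reg_head = nn.Linear(dim, dim)
loss_reg = F.mse_loss(reg_head(hidden)[mask], target_feat[mask])

# (b) classification variant: predict the masked token's attribute/category
cls_head = nn.Linear(dim, num_attrs)
loss_cls = F.cross_entropy(cls_head(hidden)[mask], target_attr[mask])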
To reduce the learning difficulty, Kaleido-BERT
[2] optimizes masked modeling by employing a
Kaleido strategy that facilitates coherent learning
of multi-grained semantics. Although this work
indeed improves performance on fashion-related
VL tasks, we argue that its token-patch
pre-alignment scheme, which relies on auxiliary
tools [30,31], is still complex and impedes
application in practical settings.
Another work [32] introduces the MLIM approach,
which strengthens masked image modeling with an
image reconstruction task and shares a similar
idea with ours. However, our experiments show
that requiring the model to reconstruct the entire
image without any hint is too difficult. Recently,
BEiT [33] and MAE [34] utilize BERT-style
pre-training as part of the visual learner and
discover that models learn semantics effectively
under such a scheme. These
two works strengthen our conviction that convert-
ing the original masked image modeling (i.e., a