focusing on specialized architectures and fusion methods, with data limited to particular benchmarks (Lee et al., 2021; Kim et al., 2021).
To train a model that can perform well on several fashion-specific multimodal use cases, we observe an opportunity in the vast availability of multimodal fashion data on e-commerce platforms. While vision-language pre-trained (VLP) models have been highly successful for the general domain (Lu et al., 2019; Li et al., 2020; Su et al., 2020), prior work has suggested that general VLP models are helpful but suboptimal for the fashion domain (Zhuge et al., 2021; Liu et al., 2021; Goenka et al., 2022). Fashion images represent a domain shift from the pre-training data (Liu et al., 2021), and fashion tasks often require fine-grained representations rather than the coarse representations produced by general VLP models (Zhuge et al., 2021).
To this end, we propose a domain-specific fashion pre-training procedure that takes advantage of fashion image-text data from multiple fashion catalogues. Our approach is inspired by the way users shop via comparisons: a user may first identify a product, express a desired change in language, and then look for a new product that better matches their preferences. Since data in this triplet form (reference product, modification, target product) is far less common than paired image-text data, we propose a lightweight method for constructing weakly-supervised pseudo-triplet data from image-text pairs. Additionally, we propose a unified, decoder-based model architecture for both retrieval-based and captioning-based fashion tasks. Together, we refer to our architecture and pre-training approach as FaD-VLP: Fashion Decoder with Vision-and-Language Pre-training.
To summarize, we make the following contributions. We propose a unified architecture for retrieval-based and captioning-based fashion tasks (Section 3.1) and a fashion pre-training framework, including two novel pre-training tasks based on weakly-supervised pseudo-triplets (Section 3.2). Our approach achieves competitive performance on seven downstream fashion tasks: image-to-text retrieval, text-to-image retrieval, image retrieval with text feedback, category recognition, subcategory recognition, image captioning, and relative image captioning (Sections 4 and 5.1). Finally, we conduct a thorough ablation study to analyze the effects of our pre-training procedure (Section 5.2).
2 Related Work
A substantial body of work has focused on using the Transformer architecture (Vaswani et al., 2017) in the context of vision-and-language pre-training (VLP) (Li et al., 2019; Su et al., 2020; Chen et al., 2020b; Radford et al., 2021a; Li et al., 2021a; Yu et al., 2022a). Recent works have begun to focus on the fashion domain (Gao et al., 2020; Zhuge et al., 2021; Zhu et al., 2021; Dong et al., 2021; Zhang et al., 2021; Goenka et al., 2022; Yu et al., 2022b). VLP works generally differ in their choice of model architecture and pre-training objectives.
Model Architecture. Most existing VLP models, especially in the fashion domain, use encoder-style modules for both image and text, focusing on multimodal understanding tasks that do not involve generation (e.g., image-text retrieval, multimodal classification). There are two main classes of these models: (i) single-stream early fusion (Li et al., 2019; Su et al., 2020; Chen et al., 2020b; Li et al., 2020), and (ii) two-stream late fusion (Tan and Bansal, 2019; Lu et al., 2019; Jia et al., 2021; Radford et al., 2021a). The nature of the downstream tasks often influences the choice of the number of streams; e.g., image-text retrieval is most practical with late fusion architectures, which allow faster inference because candidate embeddings can be pre-computed. In this work, we propose a flexible decoder-based model architecture that combines the advantages of both early and late fusion mechanisms and supports not only multimodal understanding tasks but also captioning tasks (e.g., image captioning and relative image captioning).
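To make the early/late fusion distinction concrete, the minimal PyTorch sketch below contrasts the two styles on toy, already-embedded token tensors. It is an illustrative simplification rather than any of the cited models; the module names, dimensions, and pooling choices are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256  # shared embedding width (illustrative only)


class EarlyFusion(nn.Module):
    """Single-stream: image and text tokens are concatenated and encoded
    jointly, so every self-attention layer mixes the two modalities."""

    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.joint_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.match_head = nn.Linear(D, 1)  # scores one (image, text) pair

    def forward(self, img_tokens, txt_tokens):
        fused = self.joint_encoder(torch.cat([img_tokens, txt_tokens], dim=1))
        return self.match_head(fused.mean(dim=1)).squeeze(-1)


class LateFusion(nn.Module):
    """Two-stream: each modality gets its own encoder; the only interaction
    is a final similarity, so candidate embeddings can be indexed offline,
    which is what makes this style practical for large-scale retrieval."""

    def __init__(self):
        super().__init__()
        self.img_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2)
        self.txt_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2)

    def forward(self, img_tokens, txt_tokens):
        img = F.normalize(self.img_encoder(img_tokens).mean(dim=1), dim=-1)
        txt = F.normalize(self.txt_encoder(txt_tokens).mean(dim=1), dim=-1)
        return (img * txt).sum(dim=-1)  # cosine similarity per pair


# Toy usage: batch of 2, with 49 image patches and 16 text tokens per example.
img, txt = torch.randn(2, 49, D), torch.randn(2, 16, D)
print(EarlyFusion()(img, txt).shape, LateFusion()(img, txt).shape)  # torch.Size([2]) twice
```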
Pre-training Objectives. Several pre-training tasks have been effectively used for VLP. Some of the most popular include masked modeling or matching objectives for the different modalities (Li et al., 2019; Lu et al., 2019; Su et al., 2020; Chen et al., 2020b); others include cross-modal contrastive learning (Li et al., 2021a; Radford et al., 2021a; Li et al., 2021b), caption generation (Zhou et al., 2020; Wang et al., 2022), and object tagging (Li et al., 2020). Fashion data has some unique properties that could be leveraged during pre-training, partly to mitigate the domain shift that makes generic VLP less effective for fashion (Zhuge et al., 2021). For example, fashion captions contain more structured attributes, which naturally invites shoppers to compare items when choosing what to buy. Inspired by this, we propose using weak triplet-based comparisons as the basis for additional pre-training tasks.
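Section 3.2 details the actual construction. Purely as an illustration of the general idea, the sketch below shows one naive way pseudo-triplets (reference image, textual modification, target image) could be mined from plain image-text pairs: pair each item with a caption-similar neighbour and treat the neighbour's caption as a weak modification text. The Jaccard similarity, the pairing rule, and all function and field names here are our own assumptions, not the method proposed in this paper.

```python
def caption_overlap(a: str, b: str) -> float:
    """Jaccard overlap of caption tokens -- a crude stand-in for whatever
    similarity measure a real construction would use."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def build_pseudo_triplets(pairs, k=1, min_sim=0.2):
    """pairs: list of (image_id, caption) tuples from a fashion catalogue.
    For each reference item, pick up to k caption-similar but distinct items
    as pseudo targets, letting the target's caption act as the weak
    'modification' text of the triplet."""
    triplets = []
    for ref_img, ref_cap in pairs:
        candidates = sorted(
            ((caption_overlap(ref_cap, cap), img, cap)
             for img, cap in pairs if img != ref_img),
            reverse=True,
        )
        for sim, tgt_img, tgt_cap in candidates[:k]:
            if sim < min_sim:
                break  # remaining candidates are even less similar
            triplets.append({"reference": ref_img,
                             "modification": tgt_cap,
                             "target": tgt_img})
    return triplets


# Toy usage: the two dresses pair with each other; the boots have no
# sufficiently similar neighbour and yield no triplet.
catalogue = [("img1", "red floral midi dress"),
             ("img2", "blue floral midi dress"),
             ("img3", "black leather ankle boots")]
print(build_pseudo_triplets(catalogue))
```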