
Modeling (MRM), with variants such as Masked Object Classification (MOC) and Masked Region Feature Regression (MRFR). Image-Text Alignment (ITA), which classifies whether the V-L inputs are aligned, is used to learn V-L alignment at the sentence level. The optimal transport method [8] can be used to learn fine-grained alignment across image regions and words. Oscar [32] introduces object tags detected from the images as anchors [26] aligning word tokens and their visual groundings. VILLA [11] improves other V-L frameworks by adding adversarial perturbations to the V-L input spaces. More recent works have advanced VLP by, e.g., training with larger datasets [41, 23] and enriching the image tags [57], which can benefit frameworks such as Oscar. ALBEF [29] emphasizes cross-modal alignment in the early Transformer layers and learns from a momentum version of itself to improve learning on noisy data.
2.2. Weakly-supervised V-L Pre-training
Aiming to pre-train a V-L model that learns to align the V-L domains without image-text pairs, W-VLP seeks to save the substantial data collection effort. Hsu et al. [18] studied W-VLP in the context of medical imaging. Recently, Li et al. [31] proposed U-VB, which is trained without access to image-text pairs. It learns cross-domain alignment with object tags serving as anchors between the domains and treated as "fake" data paired with the images. However, U-VB's learning can be confined by those tags, which only amount to 1,600 object classes from Visual Genome (VG) [25] and bias the model toward learning strong associations between visual representations and those of a limited number of object tags [52].
We thereby introduce a novel Visual Vocabulary based Feature Hallucinator (WFH), which aims to alleviate such a bias by generating regional visual representations to be paired with textual descriptions, e.g. a caption of an image. WFH generates diverse representations that offer a bridging signal across the V-L domains. As a result, WFH greatly enhances U-VB on various V-L tasks.
2.3. Applications in Learning from Unpaired Data
Research interest in learning from unpaired data has grown across various applications. Along with the great advancement of Generative Adversarial Networks (GANs) [13], learning to translate images from one domain to another with different styles or artistic touches has been shown to be feasible without paired images [59]. Learning multi-modal representations for medical image analysis, e.g. organ segmentation, with unpaired CT and MRI scans has also improved segmentation accuracy compared to models learned from a single modality [47, 10]. Unsupervised machine translation [26] and unsupervised domain adaptation [34, 12, 46, 58] share similarity with W-VLP in that they both learn to transfer or align domains without access to paired data.
3. Our Proposed WFH Model
The proposed W-VLP model with Visual Vocabulary based Feature Hallucinator (WFH), sketched in Figure 3a, consists of a single-stream Transformer $T_\theta$ which takes multi-modal inputs and shares parameters, i.e. those associated with queries, keys, and values, across modalities. Two sets of inputs are separately fed into $T_\theta$. The first set $S_1 = \{(t_l, h_l)\}_{l=1}^{L}$ consists of $L$ text tokens $t_l$, each of which corresponds to a hallucinated visual representation $h_l$, which we introduce in a later section. Another set of inputs $S_2 = \{(r_b, o_b, a_b)\}_{b=1}^{B}$ consists of (1) $B = 36$ regions of interest $\{r_b\}_{b=1}^{B}$ generated from a pre-trained object detector $O$, along with the predicted object class probabilities given by $O$, and (2) the sampled object tag $o_b \sim P_b^{obj}$ and attribute tag $a_b \sim P_b^{attr}$, where $P_b^{obj}$ and $P_b^{attr}$ are the predicted probabilities over the object and attribute classes$^1$ obtained from $O$, respectively. $T_\theta$ adopts the same Transformer architecture as in U-VB.
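To make the two input streams concrete, the sketch below illustrates how $S_1$ and $S_2$ could be assembled before being fed separately into $T_\theta$. It is a minimal PyTorch sketch under our own assumptions: the helpers `detector` and `hallucinator`, their signatures, and the tensor shapes are hypothetical placeholders rather than the actual implementation.

```python
import torch

def build_inputs(caption_tokens, image, detector, hallucinator, visual_dictionary, B=36):
    """Illustrative construction of the two input sets fed separately into T_theta."""
    # S1: L embedded text tokens, each paired with a hallucinated visual feature h_l.
    t = caption_tokens                                    # (L, 768) embedded tokens
    h = hallucinator(t, visual_dictionary)                # (L, 2048) hallucinated features
    S1 = list(zip(t, h))

    # S2: B regions of interest from the pre-trained detector O, with sampled tags.
    regions, p_obj, p_attr = detector(image, num_regions=B)   # features and class probabilities
    obj_tags = torch.multinomial(p_obj, 1).squeeze(-1)        # o_b ~ P_b^obj
    attr_tags = torch.multinomial(p_attr, 1).squeeze(-1)      # a_b ~ P_b^attr
    S2 = list(zip(regions, obj_tags, attr_tags))
    return S1, S2
```

Note that, as described above, $S_1$ is derived from text alone (via WFH and the visual dictionary) and $S_2$ from the image alone (via the detector), so no image-text pairing is required to form either set.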
3.1. Model Architecture
This section formulates the V-L inputs, WFH, the pre-training objectives, and the losses. The differences from U-VB are emphasized.
3.1.1 V-L Inputs from the $S_1$ Set
Each language token $t_l$ from the token sequence $\{t_l\}_{l=1}^{L}$ is obtained by tokenizing a sentence and embedded as
$$t_l = T(T_{BERT}(t_l)) \in \mathbb{R}^{768}, \quad l = 1, \dots, L, \qquad (1)$$
where $T_{BERT}(\cdot)$ is the BERT embedding and $T$ is a linear embedding layer. Each hallucinated visual representation is generated from the proposed WFH $H_\phi$, i.e.
$$h_l = H_\phi(t_l \mid \{t_i\}_{i=1}^{L}, D) \in \mathbb{R}^{2048}, \quad l = 1, \dots, L, \qquad (2)$$
$$h'_l = f(h_l) = W_f h_l + b_f \in \mathbb{R}^{768}, \quad l = 1, \dots, L, \qquad (3)$$
where $D = \{d_c \in \mathbb{R}^{2048}\}_{c=1}^{C}$ is the pre-learned visual dictionary, and $H_\phi(\cdot)$ is the hallucinator function which we formally introduce later in Eqs. (8) and (9). $f$ is a linear projection function parameterized by learnable weights $W_f \in \mathbb{R}^{768 \times 2048}$ and biases $b_f \in \mathbb{R}^{768}$. $t_l$ and $h'_l$ are each added with the token positional embedding $p^W_l \in \mathbb{R}^{768}$, obtained by linearly transforming $l \in [1, \dots, L]$, the token's position in the sequence. $\phi$ denotes WFH's parameters.
3.1.2 V-L Inputs from the $S_2$ Set
We denote the regional visual representations as $\{v_b \in \mathbb{R}^{2048}\}_{b=1}^{B}$. Each $v_b$ is extracted from $r_b$ by $O$ and trans-
$^1$ The object and attribute classes refer to VG's object and attribute classes. Examples of object classes are dishwasher, cat, ocean, etc.; attributes include blank, metallic, talking, etc.