Learning by Hallucinating: Vision-Language Pre-training with Weak
Supervision
Tzu-Jui Julius Wang, Jorma Laaksonen
Aalto University, Finland
{tzu-jui.wang, jorma.laaksonen}@aalto.fi
Tomas Langer
Intuition Machines Inc.
tomas@intuitionmachines.com
Heikki Arponen
Systematic Alpha*
heikki.a.arponen@gmail.com
Tom E. Bishop
Glass Imaging*
tom@glass-imaging.com
Abstract
Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at learning cross-modal alignment with little or no paired data, such as aligned images and captions. Recent W-VLP methods, which pair visual features with object tags, help achieve performances comparable with some VLP models trained with aligned pairs in various V-L downstream tasks. This, however, is not the case in cross-modal retrieval (XMR). We argue that the learning of such a W-VLP model is curbed and biased by the object tags of limited semantics.
We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak supervision as a W-VLP model, not requiring images paired with captions. WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities.
Empirically, WFH consistently boosts the prior W-VLP works, e.g. U-VisualBERT (U-VB), over a variety of V-L tasks, i.e. XMR, Visual Question Answering, etc. Notably, benchmarked with recall@{1,5,10}, it consistently improves U-VB on image-to-text and text-to-image retrieval on two popular datasets, Flickr30K and MSCOCO. Meanwhile, it gains by at least 14.5% in cross-dataset generalization tests on these XMR tasks. Moreover, in other V-L downstream tasks considered, our WFH models are on par with models trained with paired V-L data, revealing the utility of unpaired data. These results demonstrate greater generalization of the proposed W-VLP model with WFH.
*Work done at Intuition Machines Inc.

Figure 1: Examples of different pre-training settings: (a) the fully-supervised setting is given image-caption pairs, whereas this work focuses on (b) the weakly-supervised setting, which learns on unpaired images and captions.

1. Introduction

Vision-language pre-training (VLP) has gained popularity as it shows great generalizability and transferability to many vision-language (V-L) downstream tasks.
Pre-training is usually done on webly-supervised datasets, which are collected semi-automatically through the Internet and are hence noisy, e.g. the images and captions can be of weak mutual relevance. Furthermore, these uncurated image-text pairs may contain a wide spectrum of inappropriate contents that lead to some daunting biases when used to train a model [3]. Despite being trained on noisy datasets, these VLP models are shown to excel at various V-L downstream tasks [1, 45, 36, 30, 40, 28, 44, 6, 32, 11, 53, 22, 57].
More recent works, such as CLIP [41] and ALIGN [23], enjoy greater downstream improvements by being pre-trained on even larger amounts of image-text pairs. These excellent prior works, on the one hand, offer a promising direction: a model properly pre-trained with a massive amount of data, which could be imperfectly labeled, generalizes far better than one trained from scratch on a small dataset. On the other hand, V-L research has been set on a data-hungry path towards ever larger data collection efforts. This development risks overshadowing the alternative path of trading off data efficiency against the generalization capability of V-L models.
Figure 2: The proposed W-VLP model with the Visual Vocabulary based Feature Hallucinator (WFH) at a glance. WFH is trained alongside to generate visual representations to pair with the textual counterparts. The components within the dotted frames distinguish us from the previous state-of-the-art W-VLP model, U-VisualBERT [31]. Please refer to Sec. 3.3 for the losses and their abbreviations.
Two different perspectives to enhance data efficiency have been suggested. The first adopts the self-knowledge distillation principle, which guides the learning with soft labels predicted by the exponentially-averaged self, i.e. the same model with its parameters updated by an exponential moving average [7, 15]. The second approach learns with limited access to paired images and texts [18, 31], thus largely reducing the effort of collecting a textual description for each image. This weakly-supervised setting makes VLP much more challenging, since the aim of VLP is to learn to align the V-L domains over paired data. Figure 1 illustrates the difference between the supervised and weakly-supervised settings.
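As a concrete illustration of the first perspective, the following minimal PyTorch sketch updates a "momentum" teacher as an exponential moving average of the student's parameters; it is not taken from the cited works, and the decay value and module choices are illustrative assumptions.

import copy

import torch


def update_ema(student: torch.nn.Module, teacher: torch.nn.Module,
               decay: float = 0.999) -> None:
    """Update the EMA ("momentum") teacher in place from the student weights."""
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)


# Usage: the teacher starts as a frozen copy of the student and is refreshed
# after every optimizer step; its soft predictions can then serve as
# distillation targets for the student.
student = torch.nn.Linear(16, 4)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
update_ema(student, teacher, decay=0.999)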
Weakly-supervised VLP (W-VLP), though a crucial step towards unleashing the potential of abundant web images and texts, is much less explored than supervised VLP (S-VLP) and has only been studied in some specific domains, e.g. medical imaging [10]. Interestingly, we find that recently proposed W-VLP models, e.g. the unsupervised VisualBERT (U-VB) [31], largely fall short on cross-modal retrieval (XMR) tasks, motivating us to improve a W-VLP model particularly on XMR. Concretely, our work enhances one of the pioneering W-VLP works, i.e. U-VB, by capitalizing more on pre-trained visual attribute and object detectors with a novel Visual Vocabulary based Feature Hallucinator (WFH). WFH, depicted in Figure 2, is trained similarly to a W-VLP model, without directly training on massive amounts of paired data. The central idea of WFH is to generate visual counterparts from textual representations with layers of Transformer encoders. The WFH-generated features are then paired with the originally unpaired texts.
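To make the central idea more concrete, the following is a minimal, hypothetical sketch of such a hallucinator, not the authors' released implementation: a small stack of Transformer encoder layers contextualizes the caption tokens, and a linear head maps each token to the dimensionality of the regional visual features. All layer counts and names are illustrative assumptions.

import torch
import torch.nn as nn

class FeatureHallucinatorSketch(nn.Module):
    """A rough sketch of the hallucinator idea described above: Transformer
    encoder layers contextualize the caption tokens, and a linear head lifts
    each token to the dimensionality of regional visual features (2048)."""

    def __init__(self, d_text: int = 768, d_visual: int = 2048,
                 n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_text, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_visual = nn.Linear(d_text, d_visual)

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (batch, L, d_text) -> hallucinations: (batch, L, d_visual)
        return self.to_visual(self.encoder(token_embs))

# A caption's token embeddings are "hallucinated" into visual features, which
# can then be paired with that (otherwise unpaired) caption during pre-training.
hallucinator = FeatureHallucinatorSketch()
tokens = torch.randn(4, 20, 768)              # 4 captions, 20 token embeddings each
visual_hallucinations = hallucinator(tokens)  # shape: (4, 20, 2048)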
It is worth clarifying that we do not claim the proposed model to be unsupervised (as is claimed for U-VB by its authors) but weakly-supervised. Both U-VB and our proposed model exploit knowledge from a pre-trained object detector for the follow-up unpaired training. Hence, they are exposed to some amount of paired information, e.g. image regions and their object/attribute classes. We hereby consider that both models are learned under weak supervision.
We summarize the contributions as follows: (1) We present a novel WFH that enables more interactions across modalities during pre-training. (2) We propose a W-VLP model that accommodates object tags, attribute tags and the WFH-generated features. (3) The proposed model consistently outperforms the state-of-the-art weakly-supervised baseline, U-VisualBERT (U-VB), on the XMR tasks (i.e. text-to-image retrieval, image-to-text retrieval, and cross-dataset generalization), Visual Question Answering (VQA), Referring Expression Comprehension (REC), and Visual Entailment (VE), on six datasets in total. (4) We provide studies on, e.g., the expressiveness of the word token embeddings and the behavior of the attention probabilities in the Transformer encoder, to better understand the inner workings of W-VLP models. The introduced WFH is simple but shown to be effective by these quantitative results.
2. Related Work
We introduce related work starting from the advancements in S-VLP methods, followed by the W-VLP methods. We then explore more applications, e.g. image translation [59], medical image segmentation [47, 10], unsupervised machine translation [26] and unsupervised domain adaptation [34, 12, 46, 58], which advocate the usefulness of the unpaired data.
2.1. Supervised V-L Pre-training
Most recently proposed VLP models adapt the Transformer [48, 9] for VLP, with differences in architectures and training objectives. The VLP model architectures can be categorized into single- and two-stream models. The single-stream models, such as VisualBERT [30], ImageBERT [40], Unicoder-VL [28], VL-BERT [44], UNITER [6], Oscar [32], and SOHO [19], adopt a unified Transformer sharing the parameters across modalities. The two-stream models, e.g. LXMERT [45] and ViLBERT [36], train a separate Transformer for each modality. These two separate Transformers cross-attend the representations from each layer of the other Transformer to learn cross-domain alignment through the attention mechanism. Though architecturally simpler with fewer parameters to optimize, single-stream models are strongly comparable to two-stream models.
The usual training objectives of the VLP models are Masked Language Modeling (MLM) and Masked Region Modeling (MRM), with variants such as Masked Object Classification (MOC) and Masked Region Feature Regression (MRFR). Image-Text Alignment (ITA), which classifies whether the V-L inputs are aligned, is used to learn V-L alignment at the sentence level. The optimal transport method [8] can be used to learn fine-grained alignment across image regions and words. Oscar [32] introduces object tags detected from the images as anchors [26] aligning word tokens and their visual groundings. VILLA [11] improves other V-L frameworks by adding adversarial perturbations to the V-L input spaces. More recent works have been advancing VLP by, e.g., training with larger datasets [41, 23] and enriching the image tags [57], which can benefit frameworks such as Oscar. ALBEF [29] emphasizes cross-modal alignment in the early Transformer layers and learns from its momentum self to improve learning on noisy data.
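For concreteness, a minimal sketch of two of the objectives mentioned above, MLM and ITA, is given below; the tensor shapes and function interfaces are assumptions for illustration rather than any particular model's API.

import torch
import torch.nn.functional as F

def mlm_loss(token_logits: torch.Tensor, target_ids: torch.Tensor,
             masked_positions: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the vocabulary, computed only at the masked positions.

    token_logits: (batch, L, vocab_size); target_ids: (batch, L);
    masked_positions: (batch, L) boolean mask marking the corrupted tokens.
    """
    logits = token_logits[masked_positions]   # (n_masked, vocab_size)
    targets = target_ids[masked_positions]    # (n_masked,)
    return F.cross_entropy(logits, targets)

def ita_loss(match_logits: torch.Tensor, is_aligned: torch.Tensor) -> torch.Tensor:
    """Binary classification of whether an image-text pair is aligned.

    match_logits: (batch,) raw scores; is_aligned: (batch,) float {0, 1} labels.
    """
    return F.binary_cross_entropy_with_logits(match_logits, is_aligned)

# Example: two token sequences of length 5 over a 30k vocabulary,
# with position 1 masked, plus one aligned and one misaligned pair.
logits = torch.randn(2, 5, 30000)
targets = torch.randint(0, 30000, (2, 5))
mask = torch.zeros(2, 5, dtype=torch.bool)
mask[:, 1] = True
total = mlm_loss(logits, targets, mask) + ita_loss(torch.randn(2), torch.tensor([1.0, 0.0]))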
2.2. Weakly-supervised V-L Pre-training
W-VLP aims to pre-train a V-L model which learns to align the V-L domains without image-text pairs, thereby saving substantial data collection effort. Hsu et al. [18] studied W-VLP in the context of medical imaging. Recently, Li et al. [31] proposed U-VB, which is trained without accessing image-text pairs. It learns cross-domain alignment with object tags serving as anchors between domains and treated as "fake" data paired with the images. However, U-VB's learning could be confined by those tags, which only amount to 1,600 object classes from Visual Genome (VG) [25] and bias the model towards learning strong associations between the visual representations and a limited number of object tags' representations [52].
We thereby introduce a novel Visual Vocabulary based
Feature Hallucinator (WFH), which aims to alleviate such
a bias by generating regional visual representations to be
paired with the textual description, e.g. a caption for an
image. WFH generates diverse representations to offer a
bridging signal across V-L domains. As a result, WFH
greatly enhances U-VB over various V-L tasks.
2.3. Applications in Learning from Unpaired Data
Research interest in learning from unpaired data has grown in various applications. Along with the great advancement in Generative Adversarial Networks (GANs) [13], learning to translate images from one domain to another with different styles or artistic touches has been shown feasible without paired images [59]. Learning multi-modal representations for medical image analysis, e.g. organ segmentation, with unpaired CT and MRI scan images has also shown improvement in the segmentation accuracy compared to the models learned via a single modality [47, 10]. Unsupervised machine translation [26] and unsupervised domain adaptation [34, 12, 46, 58] share similarity with W-VLP in that they both learn to transfer or align domains without having access to paired data.
3. Our Proposed WFH Model
The proposed W-VLP model with the Visual Vocabulary based Feature Hallucinator (WFH), sketched in Figure 3a, consists of a single-stream Transformer $T_\theta$ which takes multi-modal inputs and shares parameters, i.e. those associated with the queries, keys, and values, across modalities. Two sets of inputs are separately fed into $T_\theta$. The first set $S_1 = \{(t_l, h_l)\}_{l=1}^{L}$ consists of $L$ text tokens $t_l$, each of which corresponds to a hallucinated visual representation $h_l$, which we introduce in a later section. Another set of inputs $S_2 = \{(r_b, o_b, a_b)\}_{b=1}^{B}$ consists of (1) $B = 36$ regions of interest $\{r_b\}_{b=1}^{B}$ generated from a pre-trained object detector $O$, the predicted object class probabilities given by $O$, and (2) the sampled object tag $o_b \sim P^{obj}_b$ and attribute tag $a_b \sim P^{attr}_b$, where $P^{obj}_b$ and $P^{attr}_b$ are the predicted probabilities over the object and attribute classes$^1$ obtained from $O$, respectively. $T_\theta$ adopts the same Transformer architecture as in U-VB.
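To make the construction of $S_2$ concrete, the following minimal sketch samples one object and one attribute tag per region from the detector's predicted distributions; the function name, the detector interface, and the attribute-class count are illustrative assumptions rather than the authors' implementation.

import torch

def build_s2_inputs(region_feats: torch.Tensor,
                    obj_probs: torch.Tensor,
                    attr_probs: torch.Tensor):
    """Assemble (region feature, object tag, attribute tag) triples for S2.

    region_feats: (B, 2048) regional features from the detector O;
    obj_probs:    (B, n_obj_classes) predicted object-class probabilities P_b^obj;
    attr_probs:   (B, n_attr_classes) predicted attribute-class probabilities P_b^attr.
    Returns the features plus one sampled object and attribute class id per region.
    """
    obj_tags = torch.multinomial(obj_probs, num_samples=1).squeeze(1)    # o_b ~ P_b^obj
    attr_tags = torch.multinomial(attr_probs, num_samples=1).squeeze(1)  # a_b ~ P_b^attr
    return region_feats, obj_tags, attr_tags

# Example with B = 36 regions, 1,600 VG object classes and an assumed 400 attribute classes.
B = 36
feats = torch.randn(B, 2048)
obj_p = torch.softmax(torch.randn(B, 1600), dim=-1)
attr_p = torch.softmax(torch.randn(B, 400), dim=-1)
regions, obj_tags, attr_tags = build_s2_inputs(feats, obj_p, attr_p)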
3.1. Model Architecture
This section focuses on formulating V-L inputs, WFH,
the pre-training objectives and the losses. The differences
with U-VB are emphasized.
3.1.1 V-L Inputs from the $S_1$ Set
Each language token $t_l$ from the token sequence $\{t_l\}_{l=1}^{L}$ is obtained by tokenizing a sentence and embedded as

$t_l = T(T_{BERT}(t_l)) \in \mathbb{R}^{768}, \quad l = 1, \ldots, L, \qquad (1)$

where $T_{BERT}(\cdot)$ is BERT's embedding and $T$ is a linear embedding layer. Each hallucinated visual representation is generated from the proposed WFH $H_\phi$, i.e.

$h_l = H_\phi(t_l \mid \{t_i\}_{i=1}^{L}, D) \in \mathbb{R}^{2048}, \quad l = 1, \ldots, L, \qquad (2)$

$h'_l = f(h_l) = W_f h_l + b_f \in \mathbb{R}^{768}, \quad l = 1, \ldots, L, \qquad (3)$

where $D = \{d_c \in \mathbb{R}^{2048}\}_{c=1}^{C}$ is the pre-learned visual dictionary and $H_\phi(\cdot)$ is the hallucinator function which we will formally introduce later in Eqs. (8) and (9). $f$ is a linear projection function parameterized by learnable weights $W_f \in \mathbb{R}^{768 \times 2048}$ and biases $b_f \in \mathbb{R}^{768}$. $t_l$ and $h'_l$ are respectively added with the token positional embedding $p^W_l \in \mathbb{R}^{768}$, obtained by linearly transforming $l \in [1, \ldots, L]$, i.e. the token's position in the sequence. $\phi$ denotes WFH's parameters.
3.1.2 V-L Inputs from the $S_2$ Set
We denote the regional visual representations as $\{v_b \in \mathbb{R}^{2048}\}_{b=1}^{B}$. Each $v_b$ is extracted from $r_b$ by $O$ and trans-
$^1$We refer the object and attribute classes to VG's object and attribute classes. Examples of object classes are dishwasher, cat, ocean, etc.; attributes are blank, metallic, talking, etc.