
Modeling (MRM), with variants such as Masked Object Classification (MOC) and Masked Region Feature Regression (MRFR). Image-Text Alignment (ITA), which classifies whether the V-L inputs are aligned, is used to learn V-L alignment at the sentence level. The optimal transport method [8] can be used to learn fine-grained alignment across image regions and words. Oscar [32] introduces object tags detected from the images as anchors [26] aligning word tokens and their visual groundings. VILLA [11] improves other V-L frameworks by adding adversarial perturbations to the V-L input spaces. More recent works have advanced VLP by, e.g., training with larger datasets [41, 23] and enriching the image tags [57], which can benefit frameworks such as Oscar. ALBEF [29] emphasizes cross-modal alignment in the early Transformer layers and learns from a momentum version of itself to improve learning on noisy data.
2.2. Weakly-supervised V-L Pre-training
Aiming to pre-train a V-L model that learns to align the V-L domains without image-text pairs, W-VLP seeks to save the substantial data collection effort. Hsu et al. [18] studied W-VLP in the context of medical imaging. Recently, Li et al. [31] proposed U-VB, which is trained without access to image-text pairs. It learns cross-domain alignment with object tags serving as anchors between the domains and treated as "fake" data paired with the images. However, U-VB's learning can be confined by those tags, which only amount to 1,600 object classes from Visual Genome (VG) [25] and bias the model toward learning strong associations between visual representations and those of a limited number of object tags [52].
We thereby introduce a novel Visual Vocabulary based Feature Hallucinator (WFH), which aims to alleviate such a bias by generating regional visual representations to be paired with textual descriptions, e.g. a caption of an image. WFH generates diverse representations that offer a bridging signal across the V-L domains. As a result, WFH greatly enhances U-VB on various V-L tasks.
2.3. Applications in Learning from Unpaired Data
Research interest in learning from unpaired data has grown across various applications. Along with the great advancement of Generative Adversarial Networks (GANs) [13], learning to translate images from one domain to another with different styles or artistic touches has been shown to be feasible without paired images [59]. Learning multi-modal representations for medical image analysis, e.g. organ segmentation, with unpaired CT and MRI scans has also improved segmentation accuracy compared to models learned from a single modality [47, 10]. Unsupervised machine translation [26] and unsupervised domain adaptation [34, 12, 46, 58] share similarity with W-VLP in that they both learn to transfer or align domains without access to paired data.
3. Our Proposed WFH Model
The proposed W-VLP model with Visual Vocabulary based Feature Hallucinator (WFH), sketched in Figure 3a, consists of a single-stream Transformer $T_\theta$ which takes multi-modal inputs and shares parameters, i.e. those associated with queries, keys, and values, across modalities. Two sets of inputs are separately fed into $T_\theta$. The first set $S_1 = \{(t_l, h_l)\}_{l=1}^{L}$ consists of $L$ text tokens $t_l$, each of which corresponds to a hallucinated visual representation $h_l$, which we introduce in a later section. Another set of inputs $S_2 = \{(r_b, o_b, a_b)\}_{b=1}^{B}$ consists of (1) $B = 36$ regions of interest $\{r_b\}_{b=1}^{B}$ generated from a pre-trained object detector $O$, along with the predicted object class probabilities given by $O$, and (2) the sampled object tag $o_b \sim P_b^{obj}$ and attribute tag $a_b \sim P_b^{attr}$, where $P_b^{obj}$ and $P_b^{attr}$ are the predicted probabilities over the object and attribute classes$^1$ obtained from $O$, respectively. $T_\theta$ adopts the same Transformer architecture as in U-VB.
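To make the two input streams concrete, the sketch below illustrates how $S_1$ and $S_2$ could be assembled before being fed separately into $T_\theta$. It is a minimal PyTorch sketch under our own assumptions: the helpers `detector` and `hallucinator`, their signatures, and the tensor shapes are hypothetical placeholders rather than the actual implementation.

```python
import torch

def build_inputs(caption_tokens, image, detector, hallucinator, visual_dictionary, B=36):
    """Illustrative construction of the two input sets fed separately into T_theta."""
    # S1: L embedded text tokens, each paired with a hallucinated visual feature h_l.
    t = caption_tokens                                    # (L, 768) embedded tokens
    h = hallucinator(t, visual_dictionary)                # (L, 2048) hallucinated features
    S1 = list(zip(t, h))

    # S2: B regions of interest from the pre-trained detector O, with sampled tags.
    regions, p_obj, p_attr = detector(image, num_regions=B)   # features and class probabilities
    obj_tags = torch.multinomial(p_obj, 1).squeeze(-1)        # o_b ~ P_b^obj
    attr_tags = torch.multinomial(p_attr, 1).squeeze(-1)      # a_b ~ P_b^attr
    S2 = list(zip(regions, obj_tags, attr_tags))
    return S1, S2
```

Note that, as described above, $S_1$ is derived from text alone (via WFH and the visual dictionary) and $S_2$ from the image alone (via the detector), so no image-text pairing is required to form either set.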
3.1. Model Architecture
This section formulates the V-L inputs, WFH, the pre-training objectives, and the losses. The differences from U-VB are emphasized.
3.1.1 V-L Inputs from the $S_1$ Set
Each language token $t_l$ from the token sequence $\{t_l\}_{l=1}^{L}$ is obtained by tokenizing a sentence and embedded as
$$t_l = T(T_{BERT}(t_l)) \in \mathbb{R}^{768}, \quad l = 1, \dots, L, \qquad (1)$$
where $T_{BERT}(\cdot)$ is the BERT embedding and $T$ is a linear embedding layer. Each hallucinated visual representation is generated from the proposed WFH $H_\phi$, i.e.
$$h_l = H_\phi(t_l \mid \{t_i\}_{i=1}^{L}, D) \in \mathbb{R}^{2048}, \quad l = 1, \dots, L, \qquad (2)$$
$$h'_l = f(h_l) = W_f h_l + b_f \in \mathbb{R}^{768}, \quad l = 1, \dots, L, \qquad (3)$$
where $D = \{d_c \in \mathbb{R}^{2048}\}_{c=1}^{C}$ is the pre-learned visual dictionary, and $H_\phi(\cdot)$ is the hallucinator function which we formally introduce later in Eqs. (8) and (9). $f$ is a linear projection function parameterized by learnable weights $W_f \in \mathbb{R}^{768 \times 2048}$ and biases $b_f \in \mathbb{R}^{768}$. $t_l$ and $h'_l$ are each added with the token positional embedding $p^W_l \in \mathbb{R}^{768}$, obtained by linearly transforming $l \in [1, \dots, L]$, the token's position in the sequence. $\phi$ denotes WFH's parameters.
3.1.2 V-L Inputs from the $S_2$ Set
We denote the regional visual representations as $\{v_b \in \mathbb{R}^{2048}\}_{b=1}^{B}$. Each $v_b$ is extracted from $r_b$ by $O$ and trans-
$^1$ The object and attribute classes refer to VG's object and attribute classes. Examples of object classes are dishwasher, cat, ocean, etc.; attributes include blank, metallic, talking, etc.