
For example, the model retrieves the first image of Figure 1, captioned “White House during Christmas”, which it uses to produce the answer “wreaths and garlands”. Existing text-based retrieval-augmented models would struggle with such queries because, in many cases, the answer simply does not exist in text form and is therefore inaccessible to them. This, coupled with the abundance of multimodal knowledge that exists, leads to the conclusion that retrieval-augmented models should ultimately be developed to retrieve and reason over multiple modalities.
Figure 2: Model overview: the retrieve-and-predict process of MuRAG on downstream datasets.
In this paper, we are specifically interested in endowing pre-trained language models with a non-parametric multimodal memory containing images, text, or image-text pairs. To accomplish this, we first combine pre-trained T5 (Raffel et al., 2020) and ViT (Dosovitskiy et al., 2020) models into a backbone encoder (Figure 3), which encodes image-text pairs, image-only inputs, and text-only inputs into a shared multimodal representation. MuRAG uses this backbone encoder both to embed items into an external memory and to encode queries that retrieve multimodal knowledge from that memory. The retrieved knowledge then augments a language model to generate more visually grounded outputs.
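To make the backbone concrete, the following is a minimal PyTorch sketch of one way such an encoder could be assembled from off-the-shelf components: ViT patch features are projected into the T5 embedding space, prepended to the text token embeddings, and the joint sequence is passed through the T5 encoder. The class name, the mean-pooling choice, and the exact fusion scheme are our illustrative assumptions, not the paper's released implementation.

import torch
import torch.nn as nn

class MultimodalBackbone(nn.Module):
    """Illustrative sketch (not MuRAG's released code): fuse ViT patch
    features with T5 token embeddings and encode them jointly."""

    def __init__(self, vit, t5_encoder, d_model):
        super().__init__()
        self.vit = vit                # e.g. a HuggingFace ViTModel
        self.t5_encoder = t5_encoder  # e.g. a HuggingFace T5EncoderModel
        # Project ViT features into the T5 embedding dimension (d_model).
        self.proj = nn.Linear(vit.config.hidden_size, d_model)

    def encode(self, pixel_values=None, input_ids=None):
        parts = []
        if pixel_values is not None:   # image, or the image side of a pair
            patches = self.vit(pixel_values=pixel_values).last_hidden_state
            parts.append(self.proj(patches))      # visual "tokens"
        if input_ids is not None:      # text, or the caption side of a pair
            embed = self.t5_encoder.get_input_embeddings()
            parts.append(embed(input_ids))        # textual tokens
        seq = torch.cat(parts, dim=1)
        hidden = self.t5_encoder(inputs_embeds=seq).last_hidden_state
        pooled = hidden.mean(dim=1)    # single vector usable for retrieval
        return pooled, hidden

The same encode() call serves double duty: it produces the pooled vectors that populate the external memory and the query vectors that search it.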
We pre-train MuRAG on a mixture of image-text and text-only datasets, including LAION (Schuhmann et al., 2021), Conceptual Captions (Sharma et al., 2018), VQA (Antol et al., 2015), and Probably Asked Questions (PAQ) (Lewis et al., 2021). More specifically, we reformulate these datasets into a retrieve-and-predict format: the model's input is an image along with a text prompt; the model then retrieves from a memory containing captions and passages, which it uses to generate a target token sequence. The model is trained with both a contrastive and a generative loss, which together teach it to discriminate relevant from irrelevant memory entries and to incorporate the retrieved multimodal knowledge into generation.
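As a rough sketch of how the two objectives can be combined, consider the following, in which the other entries in the batch serve as contrastive negatives; the temperature and the weighting alpha are illustrative assumptions, not values from the paper.

import torch
import torch.nn.functional as F

def joint_loss(query_vecs, memory_vecs, gen_logits, target_ids,
               temperature=0.05, alpha=1.0):
    """Sketch of a joint contrastive + generative objective.
    query_vecs:  [B, d]    pooled query embeddings
    memory_vecs: [B, d]    pooled embedding of each query's gold entry;
                           the other in-batch entries act as negatives
    gen_logits:  [B, T, V] decoder logits over the vocabulary
    target_ids:  [B, T]    gold answer token ids
    temperature and alpha are assumptions, not the paper's values."""
    # Contrastive: each query should score its own memory entry highest.
    sims = query_vecs @ memory_vecs.t() / temperature   # [B, B]
    labels = torch.arange(sims.size(0), device=sims.device)
    contrastive = F.cross_entropy(sims, labels)
    # Generative: standard token-level cross-entropy on the answer.
    generative = F.cross_entropy(
        gen_logits.view(-1, gen_logits.size(-1)), target_ids.view(-1))
    return contrastive + alpha * generative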
Unlike the pre-training stage, during fine-tuning (Figure 2) the model's input is a question, and the memory contains a collection of captioned images and text snippets. We fine-tune MuRAG on the downstream datasets with contrastive and generative losses similar to pre-training. To avoid excessive computation cost, we develop a two-stage training pipeline: we first train with a small in-batch memory, and then with a statically encoded and indexed large global memory.
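Concretely, the second stage amounts to maximum inner product search over memory vectors that are encoded once and then frozen. The sketch below uses brute-force search for clarity; at the million-entry scale of a global memory, an approximate-nearest-neighbor library such as FAISS would replace the exact matrix product. Both function names are hypothetical, and encode() refers to the backbone sketch above.

import torch

@torch.no_grad()
def build_global_memory(encoder, memory_items):
    """Stage 2 (sketch): encode every memory entry once and stack the
    pooled vectors into a static index that stays fixed during training."""
    vecs = [encoder.encode(**item)[0] for item in memory_items]
    return torch.cat(vecs, dim=0)          # [N, d]

@torch.no_grad()
def retrieve_top_k(query_vec, memory_index, k=4):
    """Brute-force maximum inner product search over the static index."""
    scores = memory_index @ query_vec.squeeze(0)   # [N]
    return torch.topk(scores, k=k).indices         # ids of the top-k entries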
Our experiments show that MuRAG achieves state-of-the-art performance on two open multimodal QA datasets, both of which require retrieving images and text from a large corpus to answer factoid questions: WebQA (Chang et al., 2022) and MultimodalQA (Talmor et al., 2021). On both datasets, we outperform sophisticated baselines (Li et al., 2020; Radford et al., 2021; Zhang et al., 2021) by 10-20% accuracy under both the distractor setting (40+ candidates) and the full-wiki setting (1M candidates). We also perform a comprehensive ablation study of the different pre-training components to assess their contributions. These empirical results demonstrate the effectiveness of our proposed model at integrating multimodal knowledge into pre-trained generation models, and they pave the way toward unified retrieval-augmented frameworks.
2 Related Work
Retrieval Augmented Models
Retrieval-augmented models are hybrid models that combine a parameterized sequence model with a non-parametric memory, infusing world knowledge into existing language models. Among them, kNN-LM (Khandelwal et al., 2019) was first proposed to retrieve instances from a text training corpus to aid language modeling. Later, RETRO (Borgeaud et al., 2021) scaled the text corpus to trillions of tokens, enabling the model to achieve perplexity similar to GPT-3 (Brown et al., 2020) with 25x fewer parameters. Another family of models, including REALM (Guu et al., 2020), RAG (Lewis et al., 2020), and FiD (Izacard and Grave, 2021), integrates Wikipedia passages as a datastore to benefit downstream knowledge-intensive tasks (e.g., question answering). REALM
is an encoder-only model trained with masked lan-