MuRAG: Multimodal Retrieval-Augmented Generator
for Open Question Answering over Images and Text
Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, William W. Cohen
Google Research
{wenhuchen,hexiang,patverga,wcohen}@google.com
Abstract
While language models store a massive amount of world knowledge implicitly in their parameters, even very large models often fail to encode information about rare entities and events, while incurring huge computational costs. Recently, retrieval-augmented models, such as REALM, RAG, and RETRO, have incorporated world knowledge into language generation by leveraging an external non-parametric index and have demonstrated impressive performance with constrained model sizes. However, these methods are restricted to retrieving only textual knowledge, neglecting the ubiquitous amount of knowledge in other modalities like images, much of which contains information not covered by any text. To address this limitation, we propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language generation. MuRAG is pre-trained with a mixture of large-scale image-text and text-only corpora using a joint contrastive and generative loss. We perform experiments on two different datasets that require retrieving and reasoning over both images and text to answer a given query: WebQA and MultimodalQA. Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets and under both distractor and full-wiki settings.
1 Introduction
Pre-trained language models such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022) have been shown to capture a massive amount of world knowledge implicitly in their parameters. However, using such large models incurs an extremely high computation cost. As an alternative to a singular monolithic transformer, retrieval-augmented architectures like KNN-LM (Khandelwal et al., 2019), REALM (Guu et al., 2020), RAG (Lewis et al., 2020), FiD (Izacard and Grave, 2021), and RETRO (Borgeaud et al., 2021) have been proposed to decouple world knowledge from the model's parameters. More specifically, these models are trained to access an external memory to enhance the model's predictions. Such retrieval-augmented architectures have multiple beneficial properties including: decreased model size (Borgeaud et al., 2021), better attribution/explanation for model predictions (Lewis et al., 2020), and adaptability to new information without retraining (Verga et al., 2021). However, previous retrieval-augmented models are limited to memories that contain only text or structured data and hence cannot make use of the massive amount of multimodal knowledge available on the web, much of which contains information only available in non-text modalities.

Figure 1: Visual information-seeking queries: These queries are unanswerable with text-only retrieval and require retrieving and reasoning over images.
Figure 1 shows several information-seeking queries that require retrieving and reasoning over visual knowledge. Here, a user first poses a question such as "What can be found on the White House balconies at Christmas". The system then retrieves relevant items from its memory, for example, the first image of Figure 1 with the caption "White House during Christmas", which it uses to produce the answer "wreaths and garlands". Existing text retrieval-augmented models would struggle with such queries because, in many cases, they would simply not have access to the answer, as some knowledge does not exist in text form. That, coupled with the abundance of multimodal knowledge that exists, leads to the conclusion that retrieval-augmented models should ultimately be developed to retrieve and reason over multiple modalities.
Figure 2: Model Overview: the retrieval-and-predict process of MuRAG on downstream datasets.
In this paper, we are specifically interested in endowing pre-trained language models with a non-parametric multimodal memory containing images, text, or image-text pairs. To accomplish this, we first combine pre-trained T5 (Raffel et al., 2020) and ViT (Dosovitskiy et al., 2020) models to build a backbone encoder (Figure 3), which encodes image-text pairs, image-only, and text-only inputs into a multimodal representation. MuRAG uses the backbone encoder to embed items into an external memory as well as queries to retrieve multimodal knowledge from that memory. These retrievals then augment a language model to generate more visually-grounded outputs.
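As a rough sketch of how a shared encoder supports this retrieval, the snippet below embeds both memory entries and queries with one stand-in encoder and scores candidates by inner product; the `encode` function, the toy memory, and the embedding width are illustrative placeholders, not the paper's actual components.

```python
import numpy as np

def encode(item: str) -> np.ndarray:
    """Stand-in for the shared backbone encoder f_theta: in MuRAG the same
    encoder embeds queries and memory entries (captioned images, passages)
    into one vector space. Here we just hash the string to a unit vector."""
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

# A toy non-parametric memory of captioned images and text snippets.
memory = [
    "<img: white_house_xmas.jpg> White House during Christmas",
    "<img: white_house_summer.jpg> South Lawn in July",
    "Passage: The White House is the official residence of the U.S. president.",
]
memory_index = np.stack([encode(m) for m in memory])  # shape (num_items, D)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Score every memory entry against the query embedding by inner product
    and return the k highest-scoring entries."""
    scores = memory_index @ encode(query)             # shape (num_items,)
    top = np.argsort(-scores)[:k]
    return [memory[i] for i in top]

print(retrieve("What can be found on the White House balconies at Christmas?"))
```

The point the sketch mirrors is that one encoder produces the vectors on both sides of retrieval, so captioned images and text passages live in the same search space as the question.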
We pre-train MuRAG with a mixture of image-text and text-only datasets including LAION (Schuhmann et al., 2021), Conceptual Captions (Sharma et al., 2018), VQA (Antol et al., 2015), and Probably-Asked-Questions (PAQ) (Lewis et al., 2021). More specifically, we reformulate these datasets into a retrieve-and-predict format. Here, the model's input is an image along with a text prompt. The model then retrieves from a memory containing captions and passages, which it uses to generate a target token sequence. The model is trained with both a contrastive and a generative loss; this teaches the model to discriminate relevant from irrelevant memory entries, and guides the model to incorporate the multimodal knowledge into generation.
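As a minimal sketch of how such a joint objective can be written, the snippet below assumes a batch of query embeddings, the embeddings of their gold memory entries (the other entries in the batch act as negatives), and per-token decoder logits; the function names, the temperature, and the equal weighting of the two terms are assumptions for illustration rather than the paper's exact settings.

```python
import numpy as np

def contrastive_loss(query_emb: np.ndarray, memory_emb: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive (InfoNCE-style) loss: the i-th query should score
    its own memory entry higher than the other entries in the batch."""
    logits = query_emb @ memory_emb.T / temperature          # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))               # gold entries sit on the diagonal

def generative_loss(token_logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Token-level cross-entropy for generating the target sequence, conditioned
    (upstream of this function) on the query plus the retrieved memory entries."""
    logits = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, target_ids[..., None], axis=-1)
    return float(-picked.mean())

# Example with random tensors just to exercise the shapes; equal weighting is an assumption.
rng = np.random.default_rng(0)
q, m = rng.normal(size=(8, 128)), rng.normal(size=(8, 128))
logits, targets = rng.normal(size=(8, 12, 1000)), rng.integers(0, 1000, size=(8, 12))
total = contrastive_loss(q, m) + generative_loss(logits, targets)
```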
Unlike the pre-training stage, during fine-tuning (Figure 2) the model's input is a question, and the memory contains a collection of captioned images and text snippets. We fine-tune MuRAG on the downstream datasets with a contrastive and generative loss similar to pre-training. To avoid excessive computation cost, we develop a two-stage training pipeline that first trains with a small in-batch memory, and then with a statically encoded and indexed large global memory.
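The two stages differ mainly in where the retrieval candidates come from, which the sketch below contrasts; the class name, the brute-force scoring, and the frozen-snapshot encoding are stand-ins rather than the exact pipeline used in the paper.

```python
import numpy as np

def in_batch_candidates(batch_memory_embs: np.ndarray) -> np.ndarray:
    """Stage 1: the candidate pool is only the memory entries that appear in the
    current batch, so they can be re-encoded on the fly at every training step."""
    return batch_memory_embs                      # shape (batch_size, D)

class StaticGlobalMemory:
    """Stage 2: the whole corpus is encoded once with a fixed snapshot of the
    encoder and stored as a large index; retrieval is a lookup, not a re-encode."""

    def __init__(self, all_memory_embs: np.ndarray):
        self.index = all_memory_embs              # shape (N, D), N can reach millions

    def top_k(self, query_emb: np.ndarray, k: int = 4) -> np.ndarray:
        # Brute-force maximum inner product search; a production system would
        # typically swap in an approximate nearest-neighbor index here.
        scores = self.index @ query_emb
        return np.argsort(-scores)[:k]

# Toy usage: 1,000 pre-encoded memory entries, one 128-d query embedding.
rng = np.random.default_rng(0)
global_memory = StaticGlobalMemory(rng.normal(size=(1000, 128)))
print(global_memory.top_k(rng.normal(size=128)))
```

The first stage keeps every candidate freshly encoded at each step but sees only a handful of candidates, while the second stage trades that freshness for coverage of a memory on the order of a million entries.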
Our experiments show that MuRAG achieves state-of-the-art performance on two different open-multimodal-QA datasets, both of which require retrieving images and text from a large corpus to answer factoid questions: WebQA (Chang et al., 2022) and MultimodalQA (Talmor et al., 2021). On both datasets, we outperform sophisticated baselines (Li et al., 2020; Radford et al., 2021; Zhang et al., 2021) by 10-20% accuracy under both the distractor (from 40+ candidates) and full-wiki (from 1M candidates) settings. We also perform a comprehensive study to ablate different components of the pre-training to see their contributions. These empirical results demonstrate the effectiveness of our proposed models in integrating multimodal knowledge into pre-trained generation models and pave the way to unified retrieval-augmented frameworks.
2 Related Work
Retrieval Augmented Models
Retrieval augmented models are hybrid models containing both parameterized sequence models and a non-parametric memory, infusing world knowledge into existing language models. Among them, KNN-LM (Khandelwal et al., 2019) was first proposed to retrieve instances from a text training corpus to help language modeling. Later, RETRO (Borgeaud et al., 2021) was proposed to scale up the text corpus to trillions of tokens, enabling the model to achieve perplexity similar to GPT-3 (Brown et al., 2020) with 25x fewer model parameters. Another family of models, such as REALM (Guu et al., 2020), RAG (Lewis et al., 2020), and FiD (Izacard and Grave, 2021), integrates Wikipedia passages as a datastore to benefit downstream knowledge-intensive tasks (e.g., question answering). REALM is an encoder-only model trained with masked language modeling, while RAG and FiD adopt an encoder-decoder model with a generative language modeling objective. Compared to these, MuRAG is the first retrieval-augmented model capable of using knowledge presented in multiple modalities (i.e., visual and textual knowledge), whereas all prior methods are restricted to text-only knowledge.
Multimodal Transformers
Multimodal transformers have demonstrated strong performance in learning cross-modal representations that are generally beneficial on downstream vision-and-language tasks, such as image-text retrieval (Karpathy and Fei-Fei, 2015), image captioning (Chen et al., 2015), and VQA (Antol et al., 2015). These methods typically learn a joint transformer model on top of unimodal visual and textual backbones by fusing deep features from each modality. Early multimodal transformers (Lu et al., 2019; Chen et al., 2020; Li et al., 2020) usually learn a Transformer on pre-extracted unimodal features for contextualization, which makes it impossible to adapt those unimodal features to the target tasks. Recently, SimVLM (Wang et al., 2022) and CoCa (Yu et al., 2022) proposed end-to-end training of both deep multimodal transformers and unimodal featurization networks and demonstrated strong performance on both multimodal and unimodal downstream tasks. The multimodal memory encoder of MuRAG is broadly similar to SimVLM and CoCa, but has a different focus: encoding and retrieving multimodal knowledge (i.e., images and text) to augment language generation models.
Multimodal Question Answering
The problem of multimodal question answering has been extensively studied. VQA (Antol et al., 2015) was first proposed to answer questions from visual-only inputs. Later, OK-VQA (Marino et al., 2019) enlarged VQA's scope to annotate questions requiring both the image and implicit textual/commonsense knowledge to answer. More recently, MuMuQA (Reddy et al., 2021), ManyModalQA (Hannan et al., 2020), and MIMOQA (Singh et al., 2021) provide questions that require reasoning over images and explicitly provided text snippets. However, these datasets are restricted to dealing with given text and images without requiring any retrieval from the web: they are analogous to machine-reading approaches to QA from text, like SQuAD, rather than open-book QA. To study the more realistic open multimodal QA task, WebQA (Chang et al., 2022) and MultimodalQA (Talmor et al., 2021) have been proposed to evaluate answers to open queries that require retrieving and reasoning over a large-scale multimodal web corpus. Our model uses these datasets to study open-world multimodal question answering, obtaining state-of-the-art results.
3 Model
3.1 Backbone Encoder
Figure 3: Backbone encoder: ViT encodes image patches into a sequence of vectors $e_I$, while word embedding converts text tokens into another sequence of vectors $e_T$. These vectors are concatenated and encoded to form $f_\theta(e)$, which is fed to a decoder for text generation.
MuRAG is built on top of a simpler model we call a "backbone" model, which is pre-trained to encode image-text pairs such that they are suitable for both answer generation and retrieval. The backbone model's encoder is used as a component of the MuRAG model. The backbone model is built with a pre-trained visual Transformer (Dosovitskiy et al., 2020) and a T5 text Transformer (Raffel et al., 2020), and consists of a multimodal encoder $f_\theta$ and a decoder $g_\theta$. The encoder takes as input a sequence of image-text pairs, where either the image or the text component can be empty to accommodate text-only and image-only cases.
As depicted in Figure 3, the encoder can take a sequence of images and text. For image input, we first split each image into 16x16 patches and feed them to a ViT (Dosovitskiy et al., 2020) transformer to generate a sequence of visual embeddings denoted as $e_I \in \mathbb{R}^{L_i \times D}$, where $L_i$ is the length of the image tokens. For text input, we use word embedding to produce another sequence of textual embeddings $e_T \in \mathbb{R}^{L_t \times D}$. For $k$ images and $n$ text inputs, we concatenate all their embeddings in the input order as $e = [e_I^1; e_T^1; \cdots; e_I^k; e_T^n] \in \mathbb{R}^{(kL_i + nL_t) \times D}$, which is fed to another bi-directional transformer $f_\theta$ initialized from T5. We enable cross-attention
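To make these shapes concrete, the sketch below patchifies an image, projects the patches and the text tokens into a shared D-dimensional space, and concatenates them in input order; the single linear projections stand in for the actual ViT and T5 components, which are of course much deeper, and the dimensions are illustrative.

```python
import numpy as np

D = 512                                   # shared embedding width (illustrative)
rng = np.random.default_rng(0)
patch_proj = rng.normal(scale=0.02, size=(16 * 16 * 3, D))   # stand-in for the ViT
word_emb = rng.normal(scale=0.02, size=(32_000, D))          # stand-in for the T5 vocabulary

def embed_image(image: np.ndarray) -> np.ndarray:
    """Split an (H, W, 3) image into 16x16 patches and project each patch,
    giving e_I with shape (L_i, D) where L_i = (H/16) * (W/16)."""
    h, w, c = image.shape
    patches = image.reshape(h // 16, 16, w // 16, 16, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, 16 * 16 * c)
    return patches @ patch_proj

def embed_text(token_ids: np.ndarray) -> np.ndarray:
    """Look up word embeddings, giving e_T with shape (L_t, D)."""
    return word_emb[token_ids]

# One image-text pair: embeddings are concatenated in input order and would
# then be fed to the bi-directional encoder f_theta (initialized from T5).
image = rng.uniform(size=(224, 224, 3))
tokens = np.array([37, 512, 9, 4023])
e = np.concatenate([embed_image(image), embed_text(tokens)], axis=0)
print(e.shape)      # (196 + 4, 512), i.e. (L_i + L_t, D)
```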