
For example, the model retrieves the first image of Figure 1, captioned “White House during Christmas”, which it uses to produce the answer “wreaths and garlands”. Existing text-based retrieval-augmented models would struggle with such queries because, in many cases, the answer simply does not exist in text form and is therefore inaccessible to them. This, coupled with the abundance of multimodal knowledge that exists, leads to the conclusion that retrieval-augmented models should ultimately be developed to retrieve and reason over multiple modalities.
Figure 2: Model overview: the retrieve-and-predict process of MuRAG on downstream datasets.
In this paper, we are specifically interested in endowing pre-trained language models with a non-parametric multimodal memory containing images, text, or image-text pairs. To accomplish this, we first combine pre-trained T5 (Raffel et al., 2020) and ViT (Dosovitskiy et al., 2020) models into a backbone encoder (Figure 3), which encodes image-text pairs, image-only inputs, and text-only inputs into a shared multimodal representation. MuRAG uses this backbone encoder both to embed items into an external memory and to encode queries that retrieve multimodal knowledge from that memory. The retrieved knowledge then augments a language model to generate more visually grounded outputs.
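To make the backbone concrete, the following is a minimal PyTorch sketch of one way such an encoder could be assembled from off-the-shelf components: ViT patch features are projected into the T5 embedding space, prepended to the text token embeddings, and the joint sequence is passed through the T5 encoder. The class name, the mean-pooling choice, and the exact fusion scheme are our illustrative assumptions, not the paper's released implementation.

import torch
import torch.nn as nn

class MultimodalBackbone(nn.Module):
    """Illustrative sketch (not MuRAG's released code): fuse ViT patch
    features with T5 token embeddings and encode them jointly."""

    def __init__(self, vit, t5_encoder, d_model):
        super().__init__()
        self.vit = vit                # e.g. a HuggingFace ViTModel
        self.t5_encoder = t5_encoder  # e.g. a HuggingFace T5EncoderModel
        # Project ViT features into the T5 embedding dimension (d_model).
        self.proj = nn.Linear(vit.config.hidden_size, d_model)

    def encode(self, pixel_values=None, input_ids=None):
        parts = []
        if pixel_values is not None:   # image, or the image side of a pair
            patches = self.vit(pixel_values=pixel_values).last_hidden_state
            parts.append(self.proj(patches))      # visual "tokens"
        if input_ids is not None:      # text, or the caption side of a pair
            embed = self.t5_encoder.get_input_embeddings()
            parts.append(embed(input_ids))        # textual tokens
        seq = torch.cat(parts, dim=1)
        hidden = self.t5_encoder(inputs_embeds=seq).last_hidden_state
        pooled = hidden.mean(dim=1)    # single vector usable for retrieval
        return pooled, hidden

The same encode() call serves double duty: it produces the pooled vectors that populate the external memory and the query vectors that search it.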
We pre-train MuRAG on a mixture of image-text and text-only datasets, including LAION (Schuhmann et al., 2021), Conceptual Captions (Sharma et al., 2018), VQA (Antol et al., 2015), and Probably Asked Questions (PAQ) (Lewis et al., 2021). More specifically, we reformulate these datasets into a retrieve-and-predict format: the model's input is an image along with a text prompt; the model then retrieves from a memory containing captions and passages, which it uses to generate a target token sequence. The model is trained with both a contrastive and a generative loss, which together teach it to discriminate relevant from irrelevant memory entries and to incorporate the retrieved multimodal knowledge into generation.
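As a rough sketch of how the two objectives can be combined, consider the following, in which the other entries in the batch serve as contrastive negatives; the temperature and the weighting alpha are illustrative assumptions, not values from the paper.

import torch
import torch.nn.functional as F

def joint_loss(query_vecs, memory_vecs, gen_logits, target_ids,
               temperature=0.05, alpha=1.0):
    """Sketch of a joint contrastive + generative objective.
    query_vecs:  [B, d]    pooled query embeddings
    memory_vecs: [B, d]    pooled embedding of each query's gold entry;
                           the other in-batch entries act as negatives
    gen_logits:  [B, T, V] decoder logits over the vocabulary
    target_ids:  [B, T]    gold answer token ids
    temperature and alpha are assumptions, not the paper's values."""
    # Contrastive: each query should score its own memory entry highest.
    sims = query_vecs @ memory_vecs.t() / temperature   # [B, B]
    labels = torch.arange(sims.size(0), device=sims.device)
    contrastive = F.cross_entropy(sims, labels)
    # Generative: standard token-level cross-entropy on the answer.
    generative = F.cross_entropy(
        gen_logits.view(-1, gen_logits.size(-1)), target_ids.view(-1))
    return contrastive + alpha * generative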
Unlike the pre-training stage, during fine-tuning (Figure 2) the model's input is a question, and the memory contains a collection of captioned images and text snippets. We fine-tune MuRAG on the downstream datasets with contrastive and generative losses similar to pre-training. To avoid excessive computation cost, we develop a two-stage training pipeline: we first train with a small in-batch memory, and then with a statically encoded and indexed large global memory.
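Concretely, the second stage amounts to maximum inner product search over memory vectors that are encoded once and then frozen. The sketch below uses brute-force search for clarity; at the million-entry scale of a global memory, an approximate-nearest-neighbor library such as FAISS would replace the exact matrix product. Both function names are hypothetical, and encode() refers to the backbone sketch above.

import torch

@torch.no_grad()
def build_global_memory(encoder, memory_items):
    """Stage 2 (sketch): encode every memory entry once and stack the
    pooled vectors into a static index that stays fixed during training."""
    vecs = [encoder.encode(**item)[0] for item in memory_items]
    return torch.cat(vecs, dim=0)          # [N, d]

@torch.no_grad()
def retrieve_top_k(query_vec, memory_index, k=4):
    """Brute-force maximum inner product search over the static index."""
    scores = memory_index @ query_vec.squeeze(0)   # [N]
    return torch.topk(scores, k=k).indices         # ids of the top-k entries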
Our experiments show that MuRAG achieves state-of-the-art performance on two open multimodal QA datasets, both of which require retrieving images and text from a large corpus to answer factoid questions: WebQA (Chang et al., 2022) and MultimodalQA (Talmor et al., 2021). On both datasets, we outperform sophisticated baselines (Li et al., 2020; Radford et al., 2021; Zhang et al., 2021) by 10-20% accuracy under both the distractor setting (40+ candidates) and the full-wiki setting (1M candidates). We also perform a comprehensive ablation study of the different pre-training components to assess their contributions. These empirical results demonstrate the effectiveness of our proposed model at integrating multimodal knowledge into pre-trained generation models, and they pave the way toward unified retrieval-augmented frameworks.
2 Related Work
Retrieval Augmented Models
Retrieval-augmented models are hybrid models that combine a parameterized sequence model with a non-parametric memory, infusing world knowledge into existing language models. Among them, kNN-LM (Khandelwal et al., 2019) was first proposed to retrieve instances from a text training corpus to aid language modeling. Later, RETRO (Borgeaud et al., 2021) scaled the text corpus to trillions of tokens, enabling the model to achieve perplexity similar to GPT-3 (Brown et al., 2020) with 25x fewer parameters. Another family of models, including REALM (Guu et al., 2020), RAG (Lewis et al., 2020), and FiD (Izacard and Grave, 2021), integrates Wikipedia passages as a datastore to benefit downstream knowledge-intensive tasks (e.g., question answering). REALM
is an encoder-only model trained with masked lan-