
RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering
Victor Zhong∗†, Weijia Shi†, Wen-tau Yih†, and Luke Zettlemoyer†
University of Washington
†Meta AI
∗Corresponding author: vzhong@cs.washington.edu
Abstract
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA). RoMQA contains clusters of questions that are derived from related constraints mined from the Wikidata knowledge graph. RoMQA evaluates robustness of QA models to varying constraints by measuring worst-case performance within each question cluster. Compared to prior QA datasets, RoMQA has more human-written questions that require reasoning over more evidence text and have, on average, many more correct answers. In addition, human annotators rate RoMQA questions as more natural, or likely to be asked by people. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging: zero-shot and few-shot models perform similarly to naive baselines, while supervised retrieval methods perform well below gold evidence upper bounds. Moreover, existing models are not robust to variations in question constraints, but can be made more robust by tuning on clusters of related questions. Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
1 Introduction
A high-quality compositional question answering (QA) model should be robust to small variations in the underlying meaning of input questions. Consider the question “which pianists born in Paris play Western classical music?” To show robust understanding, a QA model should not only be able to correctly answer this direct question, but also a wide range of related queries that differ in only a few constraints (e.g., who was a pianist born in Paris?, who was a Western classical pianist, not born in Paris?). Prior compositional QA datasets do not evaluate the robustness of QA models to variations in question constraints.
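As an illustration, such a group of related queries can be viewed as different combinations, with optional negation, of a shared pool of constraints. The sketch below is a hypothetical representation we use for exposition, not RoMQA's actual data format:

```python
# Hypothetical sketch: related questions expressed as combinations of
# shared constraints. Field names and constraint strings are illustrative.
shared_constraints = [
    "occupation: pianist",
    "born in: Paris",
    "genre: Western classical music",
]

related_questions = [
    {"question": "Which pianists born in Paris play Western classical music?",
     "include": {"occupation: pianist", "born in: Paris", "genre: Western classical music"},
     "exclude": set()},
    {"question": "Who was a pianist born in Paris?",
     "include": {"occupation: pianist", "born in: Paris"},
     "exclude": set()},
    {"question": "Who was a Western classical pianist, not born in Paris?",
     "include": {"occupation: pianist", "genre: Western classical music"},
     "exclude": {"born in: Paris"}},
]
```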
We introduce RoMQA, a benchmark for Robust, Multi-evidence, multi-answer QA, which explicitly evaluates robustness to small question perturbations. Figure 1 shows examples from RoMQA. RoMQA differs from previous work in a number of ways.
Evaluates robustness to constraint variations.
RoMQA contains clusters of related questions that are used to measure robustness to varying implicit question constraints. For each cluster, we compute a robustness score that is the minimum score over the questions it contains. To perform well on the RoMQA robustness evaluation, a model must understand many different combinations of the implicit constraints that define the cluster, such as what it means to be a pianist, to be born in Paris, and to play Western classical music. To our knowledge, RoMQA is the first QA benchmark that evaluates this type of robustness.
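A minimal sketch of this cluster-level metric is shown below, assuming a per-question scoring function such as answer-set F1; the function names and the averaging of per-cluster minima across clusters are our assumptions for illustration, not the benchmark's reference implementation:

```python
from typing import Callable, List, Set


def set_f1(gold: Set[str], pred: Set[str]) -> float:
    """F1 between gold and predicted answer sets (one possible per-question score)."""
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def cluster_robustness(
    clusters: List[List[dict]],
    score_fn: Callable[[Set[str], Set[str]], float] = set_f1,
) -> float:
    """Worst-case score within each cluster, averaged over clusters.

    Each question dict is assumed to carry a gold answer set ("gold") and a
    model's predicted answer set ("pred").
    """
    cluster_minima = []
    for cluster in clusters:
        per_question = [score_fn(q["gold"], q["pred"]) for q in cluster]
        cluster_minima.append(min(per_question))  # robustness = worst case in the cluster
    return sum(cluster_minima) / len(cluster_minima)
```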
More complex questions.
Human questions often have many answers and cannot be answered from a single text. When compared to existing datasets, RoMQA questions have more answers (mean 108.6, median 11), cover more diverse topics, and require more pieces of evidence text (mean 41.6, median 24). RoMQA also contains entity-linked, relation-extracted text that provides provenance for the constraints, showing the questions are answerable with multi-evidence reasoning from the text corpus.
More natural human-written questions.
Compared to prior multi-answer compositional QA datasets, RoMQA provides an order of magnitude more human-written questions. Human evaluations show that these questions are more natural, as gauged by how likely a person is to ask the question. Qualitatively, RoMQA questions are less likely to contain overly precise constraints, unusual attribute comparisons, or overly large numbers of referential hops.
We evaluate state-of-the-art large language