RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer
Question Answering
Victor Zhong† , Weijia Shi† , Wen-tau Yih,and Luke Zettlemoyer†
University of Washington
Meta AI
Abstract
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA). RoMQA contains clusters of questions that are derived from related constraints mined from the Wikidata knowledge graph. RoMQA evaluates robustness of QA models to varying constraints by measuring worst-case performance within each question cluster. Compared to prior QA datasets, RoMQA has more human-written questions that require reasoning over more evidence text and have, on average, many more correct answers. In addition, human annotators rate RoMQA questions as more natural, or likely to be asked by people. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging: zero-shot and few-shot models perform similarly to naive baselines, while supervised retrieval methods perform well below gold evidence upper bounds. Moreover, existing models are not robust to variations in question constraints, but can be made more robust by tuning on clusters of related questions. Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
1 Introduction
A high-quality compositional question answering (QA) model should be robust to small variations in the underlying meaning of input questions. Consider the question "which pianists born in Paris play Western classical music?" To show robust understanding, a QA model should not only be able to correctly answer this direct question, but also a wide range of related queries that differ in only a few constraints (e.g. who was a pianist born in Paris?, who was a Western classical pianist, not born in Paris?). Prior compositional QA datasets do not evaluate the robustness of QA models to variations in question constraints.
We introduce RoMQA, a benchmark for Robust, Multi-evidence, multi-answer QA, which explicitly evaluates for robustness to small question perturbations. Figure 1 shows examples from RoMQA. RoMQA differs from previous work in a number of ways.
Evaluates robustness to constraint variations. RoMQA contains clusters of related questions that are used to measure robustness to varying implicit question constraints. For each cluster, we compute a robustness score that is the minimum score over the questions it contains. In order to perform well on RoMQA robustness evaluation, a model must be able to understand many different combinations of the implicit constraints that define the cluster, such as what it means to be a pianist, to be born in Paris, and to play Western classical music. To our knowledge, RoMQA is the first QA benchmark that evaluates this type of robustness.
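Concretely, the cluster robustness score described above amounts to taking the minimum per-question score within each cluster. The sketch below illustrates this computation, assuming per-question F1 scores have already been computed; the data layout (dicts keyed by question and cluster ids) is illustrative rather than RoMQA's actual evaluation code.

```python
from collections import defaultdict

def cluster_robustness_scores(per_question_f1, question_to_cluster):
    """Worst-case (minimum) F1 within each cluster of related questions."""
    by_cluster = defaultdict(list)
    for qid, f1 in per_question_f1.items():
        by_cluster[question_to_cluster[qid]].append(f1)
    return {cid: min(scores) for cid, scores in by_cluster.items()}

# Example: the model handles two related questions well but fails the third,
# so the cluster is scored by its weakest question.
f1_scores = {"q1": 0.90, "q2": 0.85, "q3": 0.40}
clusters = {"q1": "c1", "q2": "c1", "q3": "c1"}
print(cluster_robustness_scores(f1_scores, clusters))  # {'c1': 0.4}
```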
More complex questions. Human questions often have many answers and cannot be answered from a single text. When compared to existing datasets, RoMQA questions have more answers (mean 108.6, median 11), cover more diverse topics, and require more pieces of evidence text (mean 41.6, median 24). RoMQA also contains entity-linked, relation-extracted text that provides provenance for the constraints, showing the questions are answerable with multi-evidence reasoning from the text corpus.
More natural human-written questions. Compared to prior multi-answer compositional QA datasets, RoMQA provides an order of magnitude more human-written questions. Human evaluations show that these questions are more natural, as gauged by how likely a person is to ask the question. Qualitatively, RoMQA questions are less likely to contain overly precise constraints, unusual attribute comparisons, or overly large numbers of referential hops.
Question: Which pianists born in Paris play Western classical music?
Implicit constraints: + occupation pianist subj; + born_in Paris subj; + genre western_classical subj
Example evidence: "Lily Maisky (born July 28, 1987 in Paris) is a classical pianist." "Anne Queffélec (born 17 January 1948) is a French classical pianist, born in Paris."
Answers: Lily Maisky, Anne Queffélec

Question: Who was a pianist born in Paris?
Implicit constraints: + occupation pianist subj; + born_in Paris subj
Example evidence: "Claude Helffer (June 18, 1922 – October 27, 2004) was a French pianist noted particularly for his advocacy of 20th-century music… Helffer was born in Paris, and began piano lessons at the age of five and from the age of ten until the outbreak of World War II he studied with Robert Casadesus…"
Answers: Gilbert Amy, Claude Helffer

Question: Who was a Western classical music pianist, not born in Paris?
Implicit constraints: + genre western_classical subj; + occupation pianist subj; - born_in Paris subj
Example evidence: "David Fray (born 24 May 1981) is a French classical pianist… David Fray was born in Tarbes, near the Pyrenees." "André Watts (born June 20, 1946) is an American classical pianist and professor at the Jacobs School of Music of Indiana University… Born in Nuremberg, Germany, Watts is the son of a Hungarian mother…"
Answers: David Fray, André Watts

Figure 1: A cluster of related questions, implicit constraints, evidence text, and answers from RoMQA. Within a RoMQA cluster, related questions differ in implicit constraints. In addition to evaluating model performance across questions, RoMQA evaluates robustness to variations in question constraints by scoring worst-case performance among related questions.
We evaluate state-of-the-art large language models (LMs) on RoMQA in zero-shot prompting, few-shot in-context learning, and supervised learning settings. In the closed setting, where the model selects among 100 candidate entities, we find that zero-shot and few-shot LMs perform on par (e.g. 38.5 F1 by 8-shot OPT-175B; Zhang et al., 2022) with simple baselines such as predicting all candidate entities (33.5 F1). RoMQA also remains very challenging to state-of-the-art supervised methods, with the best retrieve-then-classify model achieving 63.8 F1 compared to a gold-evidence upper bound of 95.0 F1. The open setting, where no candidates are given, is even more challenging to existing methods: the state-of-the-art Instruct-GPT3 (text-davinci-002; Ouyang et al., 2022) obtains 12.6 Pr@10 (precision at 10) while supervised retrieve-then-generate obtains 58.6 Pr@10.
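For reference, the two metrics reported above, answer-set F1 in the closed setting and precision at 10 in the open setting, can be computed roughly as in the sketch below. This is a generic illustration of the metrics, not the official RoMQA scorer, and the edge-case conventions are assumptions.

```python
def answer_set_f1(predicted, gold):
    """F1 between a predicted answer set and the gold answer set."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def precision_at_k(ranked_predictions, gold, k=10):
    """Fraction of the top-k ranked predictions that are gold answers (Pr@10 when k=10)."""
    gold = set(gold)
    hits = sum(1 for answer in ranked_predictions[:k] if answer in gold)
    return hits / k

# Closed setting: predicting every candidate gives perfect recall but low precision.
print(answer_set_f1(predicted=["a", "b", "x", "y"], gold=["a", "b"]))  # 0.667
# Open setting: two of the top-10 generated answers are correct.
print(precision_at_k(["a", "x1", "b", "x2", "x3", "x4", "x5", "x6", "x7", "x8"],
                     gold=["a", "b", "c"]))  # 0.2
```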
Finally, no tested model is robust to variations in question constraints. The best performing retrieval method obtains a worst-case related-question test score of 37.9 F1 in the closed setting, a 25.9 F1 absolute drop compared to evaluating questions independently. Training on clusters of related questions, such as RoMQA clusters, improves model robustness over training on unrelated questions. However, the robustness gap remains large; closing this gap will likely require significant advances in natural language understanding. We open-source RoMQA at github.com/facebookresearch/romqa.
2 RoMQA
We describe RoMQA construction and how it differs from prior compositional QA datasets.
2.1 Dataset construction
RoMQA construction has three goals. First, we want a diverse selection of question topics. Second, these questions should require reasoning over multiple pieces of evidence. Third, we need to understand what implicit constraints the questions contain in order to evaluate robustness to varying constraints. At a high level, RoMQA construction involves 1) sampling constraints from knowledge base (KB) triples, 2) clustering related constraints, 3) sampling implicit constraints that form logical queries, and 4) annotating language questions.
Sampling constraints from a KB. We create RoMQA questions from Wikidata (Vrandečić and Krötzsch, 2014) that are answerable given entity-linked and relation-extracted text (Elsahar et al., 2018). Wikidata consists of subject-proposition-object triples such as Gilbert_Amy occupation pianist. We convert these triples into entity-relation constraints. For instance, the previous example is decomposed into constraints Gilbert_Amy occupation obj and pianist occupation subj.
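A minimal sketch of this triple-to-constraint decomposition is shown below; the tuple representation of a constraint (anchored entity, relation, and which slot the answer fills) is a hypothetical encoding chosen for illustration.

```python
def triple_to_constraints(subject, proposition, obj):
    """Decompose a subject-proposition-object triple into two entity-relation constraints.

    Each constraint anchors the relation and one entity, and is satisfied by whatever
    entity fills the remaining slot ("subj" or "obj").
    """
    # pianist occupation subj: satisfied by any subject whose occupation is pianist
    constraint_on_subject = (obj, proposition, "subj")
    # Gilbert_Amy occupation obj: satisfied by any object of Gilbert_Amy's occupation relation
    constraint_on_object = (subject, proposition, "obj")
    return constraint_on_subject, constraint_on_object

print(triple_to_constraints("Gilbert_Amy", "occupation", "pianist"))
# (('pianist', 'occupation', 'subj'), ('Gilbert_Amy', 'occupation', 'obj'))
```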
Clustering related constraints. A cluster of related constraints share at least two answer entities. For example, occupation pianist subj and place_of_birth Paris subj are in the same cluster because they share the same answers Gilbert_Amy and Claude_Helffer (Paris-born pianists). As Wikidata has a skewed proposition distribution, we resample cluster constraints with probability inversely proportional to their proposition frequency in the KB (Appendix A). This down-samples over-represented propositions such as country. We keep clusters with at least 3 constraints to be able to generate many related questions from each cluster. We discard clusters of potentially spuriously related constraints with a single shared answer. 10k clusters are randomly chosen for training and 15k clusters for evaluation.
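The sketch below illustrates the two ideas in this step, relatedness via shared answers and inverse-frequency resampling, under simplifying assumptions: proposition frequencies are counted over the given constraint list rather than the full KB, and the exact weighting scheme from Appendix A is replaced by a plain 1/frequency weight.

```python
import random
from collections import Counter

def related(answers_a, answers_b, min_shared=2):
    """Two constraints are considered related if they share at least `min_shared` answers."""
    return len(set(answers_a) & set(answers_b)) >= min_shared

def resample_constraints(constraints, num_samples, seed=0):
    """Resample constraints with probability inversely proportional to proposition frequency.

    Each constraint is an (entity, proposition, slot) tuple; over-represented propositions
    such as country are down-weighted.
    """
    freq = Counter(prop for _, prop, _ in constraints)
    weights = [1.0 / freq[prop] for _, prop, _ in constraints]
    return random.Random(seed).choices(constraints, weights=weights, k=num_samples)

# Paris-born pianists: the two constraints share two answers, so they are related.
print(related({"Gilbert_Amy", "Claude_Helffer"},
              {"Claude_Helffer", "Gilbert_Amy", "Lily_Maisky"}))  # True

constraints = [("pianist", "occupation", "subj"), ("Paris", "place_of_birth", "subj"),
               ("France", "country", "subj"), ("Germany", "country", "subj"),
               ("Spain", "country", "subj")]
print(resample_constraints(constraints, num_samples=3))
```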
RoMQA
A film composed by S. Thaman and produced by Ganesh Babu.
Who did not play for the Carolina Panthers but was a linebacker and was on the Chicago Bears?
Which members of the Royal Society received the Order of Australia, but were not employed by the University of Oxford?
Sub-orbital spaceflight that launched from Cape Canaveral Air Force Station Launch Complex 5. Launched by Mercury-Redstone Launch Vehicle
Who is an athlete who participated in diving, and was born in Stockholm?
HotpotQA
Are Random House Tower and 888 7th Avenue both used for real estate?
Which American singer and songwriter has a mezzo-soprano vocal range, Tim Armstrong or Tori Amos?
WFMT FM radio transmits from the second tallest building in the United States, which is located where?
Who was the recipient of a prize also given to a player for Chinese club Tianjin Quanjian?
Which of Tara Strong major voice role in animated series is an American animated television series based on the DC Comics fictional superhero team, the "Teen Titans"?
ComplexWebQuestions
What university has more than 15,835 undergraduates and is the university Derek Fisher attended?
Who influenced Whitman’s poetry who was the public speaker who spoke about the American Civil War?
What is the main train station called in the governmental jurisdiction where the government includes the position Mayor of San Francisco?
Which country that borders Russia has the smallest ISO?
What country that’s a neighbor of Russia is a governmental jurisdiction where Erik Asanbayev holds a governmental office?
QAMParI
Where did a Roman Catholic archbishop of San Francisco attend school?
At what institution did a Bishop of Derby receive their education?
For which movie did Mani Ratnam work on the script and serve as producer?
What Type VII C/41 and Type VII ships was in both the vessel classes of German?
Philip Kaufman was responsible for both writing and directing of which movie?
Table 1: Randomly sampled examples from RoMQA and other compositional QA datasets. Human evaluations show that people are more likely to ask RoMQA questions than those from other compositional QA datasets. Qualitatively, RoMQA questions exhibit fewer artifacts such as overly precise constraints (e.g. 15,835 undergraduates), overly numerous references (e.g. is an American animated... based on... the "Teen Titans"), and unusual attribute comparisons (e.g. smallest ISO).
Dataset        Train  Dev  Test  Human written  Multi answer  Gold evidence  Robustness evaluation
RoMQA (Ours)   11k    7k   11k   Yes            Yes           Yes            Yes
HotpotQA       90k    7k   7k    Yes            No            Yes            No
CWQ            28k    3k   3k    Yes            Yes           No             No
QAMParI        64k    1k   1k    Eval only      Yes           Yes            No
Table 2: Dataset size and question complexity.
Sampling constraints to form logical queries. We generate up to 5 logical queries using each cluster. For each logical query, we copy the cluster and remove constraints with probability 0.1 and negate with probability 0.1. We negate sparingly because initial trials showed that a large number of negative constraints resulted in unnatural questions. We further remove redundant constraints (e.g. American presidents born in the US), and uniformly subsample up to 4 constraints. This constitutes a logical query with multiple conjunctions and subtractions. For instance, the cluster {occupation pianist subj, born_in Paris subj} can form a logical query occupation pianist subj AND born_in Paris subj. We discard overly general queries with more than 5,000 answers.
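A rough sketch of this query-sampling step is given below. The drop and negate probabilities and the cap of four constraints come from the text; the data representation, and the omission of the redundancy and answer-count filters, are simplifications.

```python
import random

def sample_logical_query(cluster, rng, drop_prob=0.1, negate_prob=0.1, max_constraints=4):
    """Turn a cluster of constraints into one logical query of conjunctions and subtractions.

    Each constraint is dropped with probability 0.1, negated with probability 0.1, and
    at most four constraints are kept (uniform subsample via shuffle-then-truncate).
    The redundancy filter and the answer-count filter from the paper are omitted here.
    """
    query = []
    for constraint in cluster:
        if rng.random() < drop_prob:                        # remove the constraint
            continue
        sign = "-" if rng.random() < negate_prob else "+"   # negate sparingly
        query.append((sign, constraint))
    rng.shuffle(query)
    return query[:max_constraints]

rng = random.Random(0)
cluster = [("pianist", "occupation", "subj"),
           ("Paris", "born_in", "subj"),
           ("western_classical", "genre", "subj")]
for _ in range(2):  # the paper generates up to 5 logical queries per cluster
    print(sample_logical_query(cluster, rng))
```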
Creating natural language questions. Mechanical Turk crowd-workers annotate logical queries marked with Wikidata titles, descriptions, and aliases into questions. Appendix B Figure 11 shows the interface. Two more annotators verify each annotation to confirm that it matches the logical query. We keep only annotations with 100% agreement, resulting in 11% being discarded. After verification, we additionally discard clusters with fewer than 2 questions.
2.2 Dataset analyses and comparison
We compare RoMQA to prior compositional QA datasets: HotpotQA (Yang et al., 2018), ComplexWebQuestions (CWQ; Talmor and Berant, 2018), and QAMParI (Amouyal et al., 2022).