
RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering
Victor Zhong∗†, Weijia Shi†, Wen-tau Yih†, and Luke Zettlemoyer†
University of Washington
†Meta AI
∗Corresponding author: vzhong@cs.washington.edu
Abstract
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA). RoMQA contains clusters of questions that are derived from related constraints mined from the Wikidata knowledge graph. RoMQA evaluates robustness of QA models to varying constraints by measuring worst-case performance within each question cluster. Compared to prior QA datasets, RoMQA has more human-written questions that require reasoning over more evidence text and have, on average, many more correct answers. In addition, human annotators rate RoMQA questions as more natural, or likely to be asked by people. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging: zero-shot and few-shot models perform similarly to naive baselines, while supervised retrieval methods perform well below gold evidence upper bounds. Moreover, existing models are not robust to variations in question constraints, but can be made more robust by tuning on clusters of related questions. Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
1 Introduction
A high-quality compositional question answering (QA) model should be robust to small variations in the underlying meaning of input questions. Consider the question “which pianists born in Paris play Western classical music?” To show robust understanding, a QA model should not only be able to correctly answer this direct question, but also a wide range of related queries that differ in only a few constraints (e.g., who was a pianist born in Paris?, who was a Western classical pianist, not born in Paris?). Prior compositional QA datasets do not evaluate the robustness of QA models to variations in question constraints.
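As an illustration, such a group of related queries can be viewed as different combinations, with optional negation, of a shared pool of constraints. The sketch below is a hypothetical representation we use for exposition, not RoMQA's actual data format:

```python
# Hypothetical sketch: related questions expressed as combinations of
# shared constraints. Field names and constraint strings are illustrative.
shared_constraints = [
    "occupation: pianist",
    "born in: Paris",
    "genre: Western classical music",
]

related_questions = [
    {"question": "Which pianists born in Paris play Western classical music?",
     "include": {"occupation: pianist", "born in: Paris", "genre: Western classical music"},
     "exclude": set()},
    {"question": "Who was a pianist born in Paris?",
     "include": {"occupation: pianist", "born in: Paris"},
     "exclude": set()},
    {"question": "Who was a Western classical pianist, not born in Paris?",
     "include": {"occupation: pianist", "genre: Western classical music"},
     "exclude": {"born in: Paris"}},
]
```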
We introduce RoMQA, a benchmark for Robust, Multi-evidence, multi-answer QA, which explicitly evaluates robustness to small question perturbations. Figure 1 shows examples from RoMQA. RoMQA differs from previous work in a number of ways.
Evaluates robustness to constraint variations.
RoMQA contains clusters of related questions that are used to measure robustness to varying implicit question constraints. For each cluster, we compute a robustness score that is the minimum score over the questions it contains. To perform well on the RoMQA robustness evaluation, a model must understand many different combinations of the implicit constraints that define the cluster, such as what it means to be a pianist, to be born in Paris, and to play Western classical music. To our knowledge, RoMQA is the first QA benchmark that evaluates this type of robustness.
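A minimal sketch of this cluster-level metric is shown below, assuming a per-question scoring function such as answer-set F1; the function names and the averaging of per-cluster minima across clusters are our assumptions for illustration, not the benchmark's reference implementation:

```python
from typing import Callable, List, Set


def set_f1(gold: Set[str], pred: Set[str]) -> float:
    """F1 between gold and predicted answer sets (one possible per-question score)."""
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def cluster_robustness(
    clusters: List[List[dict]],
    score_fn: Callable[[Set[str], Set[str]], float] = set_f1,
) -> float:
    """Worst-case score within each cluster, averaged over clusters.

    Each question dict is assumed to carry a gold answer set ("gold") and a
    model's predicted answer set ("pred").
    """
    cluster_minima = []
    for cluster in clusters:
        per_question = [score_fn(q["gold"], q["pred"]) for q in cluster]
        cluster_minima.append(min(per_question))  # robustness = worst case in the cluster
    return sum(cluster_minima) / len(cluster_minima)
```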
More complex questions.
Human questions often have many answers and cannot be answered from a single text. When compared to existing datasets, RoMQA questions have more answers (mean 108.6, median 11), cover more diverse topics, and require more pieces of evidence text (mean 41.6, median 24). RoMQA also contains entity-linked, relation-extracted text that provides provenance for the constraints, showing the questions are answerable with multi-evidence reasoning from the text corpus.
More natural human-written questions.
Compared to prior multi-answer compositional QA datasets, RoMQA provides an order of magnitude more human-written questions. Human evaluations show that these questions are more natural, as gauged by how likely a person is to ask the question. Qualitatively, RoMQA questions are less likely to contain overly precise constraints, unusual attribute comparisons, or overly large numbers of referential hops.
We evaluate state-of-the-art large language