Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering
Priyanka Sen
Amazon Alexa AI
Cambridge, UK
sepriyan@amazon.com
Alham Fikri Aji
Amazon Alexa AI
Cambridge, UK
afaji@amazon.com
Amir Saffari
Amazon Alexa AI
Cambridge, UK
amsafari@amazon.com
Abstract
We introduce MINTAKA, a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. Mintaka is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers. We run baselines over Mintaka, the best of which achieves 38% hits@1 in English and 31% hits@1 multilingually, showing that existing models have room for improvement. We release Mintaka at https://github.com/amazon-research/mintaka.
1 Introduction
Question answering (QA) is the task of learning to predict answers to questions. Approaches to question answering include knowledge graph (KG) based methods, which use structured data to find the correct answer (Miller et al., 2016; Saxena et al., 2020); machine reading comprehension methods, which extract answers from input documents (Rajpurkar et al., 2016; Kwiatkowski et al., 2019); open-domain methods, which learn to retrieve relevant documents and extract or generate answers (Zhu et al., 2021); and closed-book methods, which use knowledge implicitly stored in model parameters to answer questions (Roberts et al., 2020).
With state-of-the-art techniques, QA models can achieve high performance on simple questions (Shi et al., 2020, 2021) that require a single fact lookup in either a knowledge graph or a text document (e.g., "Where was Natalie Portman born?"). However, not all questions are simple in real-world applications. We define complex questions (Lan et al., 2021) as questions that require an operation beyond a single fact lookup, such as multi-hop, comparative, or set intersection questions. For example, "What movie had a higher budget, Titanic or Men in Black?" requires looking up the budgets of both movies, comparing the values, and selecting the movie with the higher budget. Handling more complex questions remains an open problem.
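To make the extra reasoning in such a comparative question concrete, the sketch below separates the two steps: look up an attribute for each candidate entity, then compare. The get_budget helper and its figures are hypothetical stand-ins for a knowledge-graph or retrieval lookup, not part of Mintaka or any particular system.

```python
# Minimal sketch of answering a comparative question such as
# "What movie had a higher budget, Titanic or Men in Black?".
# `get_budget` is a hypothetical stand-in for a KG/retrieval lookup,
# and the budget figures are illustrative only.

def get_budget(film_title: str) -> int:
    """Hypothetical lookup returning a film's budget in USD."""
    illustrative_budgets = {"Titanic": 200_000_000, "Men in Black": 90_000_000}
    return illustrative_budgets[film_title]

def answer_comparative(film_a: str, film_b: str) -> str:
    # Step 1: look up the attribute for both entities.
    budget_a, budget_b = get_budget(film_a), get_budget(film_b)
    # Step 2: compare the values and return the entity with the higher one.
    return film_a if budget_a > budget_b else film_b

print(answer_comparative("Titanic", "Men in Black"))  # -> Titanic
```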
One of the challenges in measuring and improving QA performance on complex questions is a lack of datasets. Existing QA datasets tend to be either large but simple, such as SimpleQuestions (Bordes et al., 2015), or complex but small, such as ComplexQuestions (Bao et al., 2016) or QALD (Usbeck et al., 2018). Recently, several large and complex datasets have been released, including KQA Pro (Shi et al., 2020) and GrailQA (Gu et al., 2021). These datasets use automatically generated questions followed by human paraphrasing, which can result in less natural questions, such as "Is the WOEID of Tuscaloosa 14605?" (KQA Pro) or "1520.0 is the minimum width for which size rail gauge?" (GrailQA). This can lead to a mismatch between training data and real-world use cases of QA models.
In order to address these gaps, we release MINTAKA, a large, complex, naturally-elicited, and multilingual question answering dataset. Mintaka contains 20,000 question-answer pairs elicited in English from crowd workers. We link Mintaka to a knowledge graph by asking crowd workers to annotate the question and answer text with Wikidata IDs. Professional translators translated the 20,000 English questions into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish, creating a total dataset size of 180,000 questions.
In this paper, we present an overview of Mintaka in §3, explain how we built Mintaka in §4, and provide a statistical analysis of the dataset in §5, including a demographic analysis of our crowd workers in §5.3. Finally, in §6, we present results of existing baseline models on Mintaka, the best of which scores 38% hits@1. These results show that existing models have room for improvement.

Dataset | Samples | Text or KG | Complex | Natural | Languages
SQuAD | 150K | Wikipedia | ✗ | ✗ | 1
XQuAD | 2K | Wikipedia | ✗ | ✗ | 11
Natural Questions | 300K | Wikipedia | ✗ | ✓ | 1
HotpotQA | 100K | Wikipedia | ✓ | ✗ | 1
DROP | 100K | Wikipedia | ✓ | ✗ | 1
WebQuestionsSP | 5K | FreeBase | ✗ | ✓ | 1
ComplexQuestions | 2K | FreeBase | ✓ | ✓ | 1
ComplexWebQuestions | 35K | FreeBase | ✓ | ✗ | 1
LC-QuAD 2.0 | 30K | Wikidata, DBPedia | ✓ | ✗ | 1
GrailQA | 64K | Wikidata | ✓ | ✗ | 1
KQA Pro | 120K | Wikidata | ✓ | ✗ | 1
QALD | 400 | DBPedia | ✓ | ✓ | 11
Mintaka (ours) | 20K | Wikidata | ✓ | ✓ | 9

Table 1: Comparison of Mintaka to existing QA datasets
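The baselines in §6 are scored with hits@1, i.e., whether the single top-ranked prediction matches a gold answer. A minimal sketch of such a metric is shown below; the shape of the prediction and gold-answer structures is an assumption for illustration, not something the paper prescribes.

```python
# Minimal sketch of hits@1: the fraction of questions whose top-ranked
# prediction is among the gold answers. The input format (dicts keyed by
# question id) is assumed here for illustration.
from typing import Dict, List

def hits_at_1(predictions: Dict[str, List[str]],
              gold_answers: Dict[str, List[str]]) -> float:
    hits = 0
    for question_id, gold in gold_answers.items():
        ranked = predictions.get(question_id, [])
        # A hit if the single top-ranked prediction matches a gold answer.
        if ranked and ranked[0] in gold:
            hits += 1
    return hits / len(gold_answers)

# Toy usage: two questions, one answered correctly at rank 1.
preds = {"q1": ["Q937", "Q8023"], "q2": ["Q42"]}
gold = {"q1": ["Q937"], "q2": ["Q5"]}
print(hits_at_1(preds, gold))  # -> 0.5
```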
We publicly release the Mintaka dataset with our randomly split train (14,000 samples), dev (2,000 samples), and test (4,000 samples) sets at https://github.com/amazon-research/mintaka.
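A minimal sketch of loading one of the released splits follows; the file name (mintaka_train.json) and field names (question, complexityType) are assumptions based on a typical JSON release, so the repository's actual schema should be checked before relying on them.

```python
# Minimal sketch of loading a Mintaka split and grouping questions by
# complexity type. File name and field names are assumptions for
# illustration; consult the GitHub repository for the actual schema.
import json
from collections import Counter

with open("mintaka_train.json", encoding="utf-8") as f:
    train = json.load(f)

# Count how many training questions fall under each complexity type.
type_counts = Counter(sample["complexityType"] for sample in train)
print(type_counts.most_common())

# Inspect one sample question.
print(train[0]["question"])
```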
2 Related Work
Question answering has no shortage of datasets. Datasets for question answering with reading comprehension, such as SQuAD (Rajpurkar et al., 2016) or Natural Questions (Kwiatkowski et al., 2019), are often large, and some are even multilingual, such as XQuAD (Artetxe et al., 2019), MLQA (Lewis et al., 2019), or TyDi QA (Clark et al., 2020). These datasets, however, are not explicitly built to be complex, and the answer is usually found in a single passage of text.
HotpotQA (Yang et al., 2018) and MuSiQue (Trivedi et al., 2022) add complexity to reading comprehension by introducing multi-hop questions where the answer requires reasoning over two documents, but neither of these datasets naturally elicits its questions. HotpotQA pre-selects two Wikipedia passages and asks workers to write questions using both passages, and MuSiQue composes multi-hop questions from existing single-hop questions. DROP (Dua et al., 2019) is another complex reading comprehension dataset, including complex operations such as addition, counting, and sorting. Again, DROP asks crowd workers to write questions about a selected Wikipedia passage. DROP additionally introduces a constraint that workers need to write questions that can't be solved by an existing model.
Within knowledge graph-based question answering (KGQA), WebQuestionsSP (Berant et al., 2013; Yih et al., 2016) and ComplexQuestions (Bao et al., 2016) are more natural QA datasets. Both collected real user questions using search query logs or the Google Suggest API. The answers were annotated manually using FreeBase as a knowledge graph. WebQuestionsSP contains mostly simple questions, but ComplexQuestions is more complex, including multi-hop questions, temporal constraints, and aggregations. The main drawback of these datasets is size: WebQuestionsSP contains 5K QA pairs, while ComplexQuestions contains only 2K.
ComplexWebQuestions (Talmor and Berant, 2018) is a dataset based on WebQuestionsSP, which increases the size to 35K QA pairs and introduces more complex operations, including multi-hop, comparatives, and superlatives. However, ComplexWebQuestions loses some naturalness, as the dataset is built by automatically generating queries and questions and then asking crowd workers to paraphrase the generated questions.
Recently, several larger-scale complex KGQA datasets have been released. LC-QuAD 2.0 (Dubey et al., 2019) includes 30K questions, including multi-hop questions, and uses the more up-to-date Wikidata and DBpedia knowledge graphs. GrailQA (Gu et al., 2021) is even larger, at 64K questions based on FreeBase, with complex questions including multi-hop, count, and comparatives. KQA Pro (Shi et al., 2020) is larger still, with 120K questions based on Wikidata and complex questions including intersection and superlatives. All of these datasets trade naturalness for scale: to collect question-answer pairs, the authors generate queries from a knowledge graph, generate questions based on the queries, and then ask crowd workers to paraphrase the questions.
Finally, most datasets are only in English; multilingual and complex datasets are rare. QALD 2018 (Usbeck et al., 2018) is one multilingual and complex dataset, covering 11 languages and complex operations such as counts and comparatives; however, it includes only 400 questions.
By building Mintaka, we hope to address an important gap in existing datasets. Mintaka question-answer pairs are both complex and naturally-elicited from crowd workers, with no restrictions on what facts or articles the questions can be about. We also translate Mintaka into 8 languages, making it one of the first large-scale complex and multilingual question answering datasets. A comparison of Mintaka to existing datasets can be seen in Table 1.
3 Mintaka
Mintaka is a complex question answering dataset of 20,000 questions collected in English and translated into 8 languages, for a total of 180,000 questions. Mintaka contains question-answer pairs written by crowd workers and annotated with Wikidata entities in both the question and answer.
We collected questions in eight topics, which were chosen for being broadly appealing and suitable for writing complex questions: MOVIES, MUSIC, SPORTS, BOOKS, GEOGRAPHY, POLITICS, VIDEO GAMES, and HISTORY. Since we want Mintaka to be a complex question answering dataset, we explicitly collected questions in the following complexity types (all examples below are from the Mintaka dataset):
COUNT: questions where the answer requires counting. For example, Q: How many astronauts have been elected to Congress? A: 4

COMPARATIVE: questions that compare two objects on a given attribute (e.g., age, height). For example, Q: Is Mont Blanc taller than Mount Rainier? A: Yes

SUPERLATIVE: questions about the maximums or minimums of a given attribute. For example, Q: Who was the youngest tribute in the Hunger Games? A: Rue

ORDINAL: questions based on an object's position in an ordered list. For example, Q: Who was the last Ptolemaic ruler of Egypt? A: Cleopatra

MULTI-HOP: questions that require 2 or more steps (multiple hops) to answer. For example, Q: Who was the quarterback of the team that won Super Bowl 50? A: Peyton Manning

INTERSECTION: questions that have two or more conditions that the answer must fulfill. For example, Q: Which movie was directed by Denis Villeneuve and stars Timothee Chalamet? A: Dune

DIFFERENCE: questions with a condition that contains a negation. For example, Q: Which Mario Kart game did Yoshi not appear in? A: Mario Kart Live: Home Circuit

YES/NO: questions where the answer is Yes or No. For example, Q: Has Lady Gaga ever made a song with Ariana Grande? A: Yes

GENERIC: questions where the worker was only given the topic and no constraints on complexity. These tend to be simpler fact lookups, such as Q: Where was Michael Phelps born? A: Baltimore, Maryland
For each of the 8 topics, we collected 250 questions per complexity type and 500 generic questions, for a total of 2,500 questions per topic.
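That is, 8 non-generic complexity types × 250 questions plus 500 generic questions gives 2,500 questions per topic, and 8 topics × 2,500 questions gives the 20,000-question total.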
We also collected translations of the 20,000 English questions in 8 languages using professional translators. Since all questions were collected in English from U.S. workers, the questions may have a U.S. bias in terms of the entities (for example, U.S. politicians or books written in English). This is a choice we make, since it allows us to create a fully parallel dataset where models can be easily compared across languages. This choice was also made in previous QA datasets (Usbeck et al., 2018; Artetxe et al., 2019; Lewis et al., 2019).
4 Dataset Collection
To build our dataset, we used Amazon Mechanical Turk (MTurk) for three different tasks. All of our MTurk workers were located in the United States, and to ensure high quality, we required workers to have an approval rating of 98% and at least 5,000 approved tasks. Each of our tasks is explained in the sections below, and examples of the interfaces can be seen in Appendix A.
4.1 Question Elicitation
The first task was to elicit complex questions. To do this, we created tasks for each topic/complexity pair (e.g., Superlative Movie questions, Ordinal Sports questions, etc.). In each task, a worker was asked to write 5 questions and answers about the topic using the given complexity type. The questions and answers were written in free text fields. We had no restrictions on what sources workers could use to write their questions, so workers were not limited to writing questions based on a given article or facts. Workers were given explanations of the complexity type and examples in the instructions. The topics were left general, so within History, workers could write about Ancient Egypt as well as World War II.
For Count and Superlative answers, we additionally asked workers to provide a numerical value as part of the answer. For example, in Count questions, workers would provide both the answer as a number (e.g., 3) and the entities that make up that answer (e.g., Best Picture, Best Adapted Screenplay, and Best Film Editing). In Superlative questions, workers provided the answer (e.g., Missouri River) as well as the numerical value that makes the entity the maximum or minimum (e.g., 2,341 miles). Additionally, in Count questions, if a question had multiple answers, we asked workers to list a minimum of five. For example, for the question "How many cities have hosted a Summer Olympics?", a worker could give the numerical answer 23 but provide only five of the cities. For this reason, answers to questions with more than five entities are not guaranteed to be complete but instead provide a sample of the correct answer.
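To make this answer format concrete, a hypothetical record for the Summer Olympics example above might look like the following; the field names are illustrative only and do not reflect the released schema.

```python
# Hypothetical illustration of a Count answer as described above: a
# numerical answer plus a sample of the supporting entities.
# Field names are illustrative, not the released schema.
count_answer_example = {
    "question": "How many cities have hosted a Summer Olympics?",
    "numerical_answer": 23,
    # Workers listed at least five supporting entities, so this is a
    # sample of the correct answer rather than a complete enumeration.
    "answer_entities_sample": ["London", "Paris", "Tokyo", "Athens", "Los Angeles"],
}
```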
We paid $1.25 per task to write five questions. Workers were limited to completing one task per topic-complexity pair. After collection, we also surveyed the MTurk workers who completed our Question Elicitation task about their demographics. The results of this survey are discussed in §5.3.
4.2 Answer Entity Linking
Answers were collected in the previous task in natural language. In order to link the answers to a knowledge graph, we built an Answer Entity Linking task. We chose to link the answers to Wikidata, since it is a large and up-to-date public knowledge graph. Although we link to Wikidata, we don't guarantee that every question can be answered by Wikidata at the time of writing. It is possible that there are missing or incomplete facts that would prevent a KGQA system from reaching the answer entity in Wikidata given the question.
In this task, workers were shown a question-answer pair and asked to 1) highlight the entities in the answer, and 2) search for the entities on Wikidata and provide the correct URLs. We built a UI for MTurk workers where they could easily highlight entities, and the highlighted entities would automatically generate links to search Wikidata.
Each answer was annotated by two MTurk workers. For agreement, we required the two workers to identify the same entities and the same Wikidata URLs for all entities. If there was disagreement, we sent the question-answer pair to a third annotator. Question-answer pairs where the answer was a number or yes or no were excluded from answer entity linking. Overall, we annotated 20,996 answer entities and achieved 82% agreement after two annotators and 97% agreement after three annotators. The remaining 3% were verified by the authors.
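A minimal sketch of the agreement rule described above (identical entity spans with identical Wikidata links, otherwise escalate to a third annotator) might look like this; the annotation format and the placeholder QIDs are assumptions for illustration.

```python
# Minimal sketch of the two-annotator agreement rule described above.
# The annotation format (span text -> Wikidata URL) and the placeholder
# QIDs are assumptions for illustration.
from typing import Dict, Optional

Annotation = Dict[str, str]  # entity span text -> Wikidata URL

def annotations_agree(a: Annotation, b: Annotation) -> bool:
    # Agreement requires identical span sets and identical links per span.
    return a == b

def resolve(a: Annotation, b: Annotation,
            third: Optional[Annotation] = None) -> Optional[Annotation]:
    """Return the agreed annotation, otherwise defer to a third annotator."""
    return a if annotations_agree(a, b) else third

first = {"Dune": "https://www.wikidata.org/wiki/Q_EXAMPLE"}   # placeholder QID
second = {"Dune": "https://www.wikidata.org/wiki/Q_EXAMPLE"}  # placeholder QID
print(resolve(first, second))  # agreement -> annotation accepted
```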
We paid a base rate of $0.10 per task, which consisted of a single question-answer pair. If the answer had multiple entities, we paid a $0.05 bonus for every additional entity identified that was agreed upon by another annotator.
4.3 Question Entity Linking
An end-to-end question answering model can be trained using the question and answer alone (Oliya et al., 2021). However, to better evaluate end-to-end methods and train models requiring entities, we also created an MTurk task to link entities in the question text.
Linking entities in questions is more challenging than in answers. While answer texts are often short and contain a clear entity (e.g., "Joe Biden"), question texts can contain multiple possible entities. In the question "Who is the president of the United States?", a worker could select "United States", or "president" and "United States", or even "president of the United States". Since early test runs showed it would be difficult to get agreement on question entities, we modified the task so that workers only verified a span and linked the entity in Wikidata.
To identify spans in questions, we used spaCy's (Honnibal et al., 2020) en_core_web_trf model to identify named entities and noun chunks with cap-
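A minimal sketch of this span-identification step is shown below, using spaCy's en_core_web_trf pipeline to collect named entities and noun chunks as candidate spans; any further filtering of these candidates used in the paper is not reproduced here.

```python
# Minimal sketch of extracting candidate spans with spaCy's
# en_core_web_trf model: named entities plus noun chunks.
# Requires `pip install spacy` and
# `python -m spacy download en_core_web_trf`.
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Who was the quarterback of the team that won Super Bowl 50?")

# Named entities recognized by the transformer pipeline.
entity_spans = [(ent.text, ent.label_) for ent in doc.ents]

# Noun chunks as additional candidate spans (requires the parser,
# which is part of this pipeline).
noun_chunk_spans = [chunk.text for chunk in doc.noun_chunks]

print(entity_spans)      # e.g., [('Super Bowl 50', 'EVENT')]
print(noun_chunk_spans)  # e.g., ['Who', 'the quarterback', 'the team']
```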