Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering
Priyanka Sen
Amazon Alexa AI
Cambridge, UK
sepriyan@amazon.com
Alham Fikri Aji
Amazon Alexa AI
Cambridge, UK
afaji@amazon.com
Amir Saffari
Amazon Alexa AI
Cambridge, UK
amsafari@amazon.com
Abstract
We introduce MINTAKA, a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. Mintaka is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers. We run baselines over Mintaka, the best of which achieves 38% hits@1 in English and 31% hits@1 multilingually, showing that existing models have room for improvement. We release Mintaka at https://github.com/amazon-research/mintaka.
1 Introduction
Question answering (QA) is the task of learning to predict answers to questions. Approaches to question answering include knowledge graph (KG) based methods, which use structured data to find the correct answer (Miller et al., 2016; Saxena et al., 2020); machine reading comprehension methods, which extract answers from input documents (Rajpurkar et al., 2016; Kwiatkowski et al., 2019); open-domain methods, which learn to retrieve relevant documents and extract or generate answers (Zhu et al., 2021); and closed-book methods, which use knowledge implicitly stored in model parameters to answer questions (Roberts et al., 2020).
With state-of-the-art techniques, QA models can achieve high performance on simple questions (Shi et al., 2020, 2021) that require a single fact lookup in either a knowledge graph or a text document (e.g., "Where was Natalie Portman born?"). However, not all questions are simple in real-world applications. We define complex questions (Lan et al., 2021) as questions that require an operation beyond a single fact lookup, such as multi-hop, comparative, or set intersection questions. For example, "What movie had a higher budget, Titanic or Men in Black?" requires looking up the budgets of both movies, comparing the values, and selecting the movie with the higher budget. Handling more complex questions remains an open problem.
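To make the extra reasoning in such a comparative question concrete, the sketch below separates the two steps: look up an attribute for each candidate entity, then compare. The get_budget helper and its figures are hypothetical stand-ins for a knowledge-graph or retrieval lookup, not part of Mintaka or any particular system.

```python
# Minimal sketch of answering a comparative question such as
# "What movie had a higher budget, Titanic or Men in Black?".
# `get_budget` is a hypothetical stand-in for a KG/retrieval lookup,
# and the budget figures are illustrative only.

def get_budget(film_title: str) -> int:
    """Hypothetical lookup returning a film's budget in USD."""
    illustrative_budgets = {"Titanic": 200_000_000, "Men in Black": 90_000_000}
    return illustrative_budgets[film_title]

def answer_comparative(film_a: str, film_b: str) -> str:
    # Step 1: look up the attribute for both entities.
    budget_a, budget_b = get_budget(film_a), get_budget(film_b)
    # Step 2: compare the values and return the entity with the higher one.
    return film_a if budget_a > budget_b else film_b

print(answer_comparative("Titanic", "Men in Black"))  # -> Titanic
```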
One of the challenges in measuring and improving QA performance on complex questions is a lack of datasets. Existing QA datasets tend to be either large but simple, such as SimpleQuestions (Bordes et al., 2015), or complex but small, such as ComplexQuestions (Bao et al., 2016) or QALD (Usbeck et al., 2018). Recently, several large and complex datasets have been released, including KQA Pro (Shi et al., 2020) and GrailQA (Gu et al., 2021). These datasets use automatically generated questions followed by human paraphrasing, which can result in less natural questions, such as "Is the WOEID of Tuscaloosa 14605?" (KQA Pro) or "1520.0 is the minimum width for which size rail gauge?" (GrailQA). This can lead to a mismatch between training data and real-world use cases of QA models.
In order to address these gaps, we release MINTAKA, a large, complex, naturally-elicited, and multilingual question answering dataset. Mintaka contains 20,000 question-answer pairs elicited in English from crowd workers. We link Mintaka to a knowledge graph by asking crowd workers to annotate the question and answer text with Wikidata IDs. Professional translators translated the 20,000 English questions into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish, creating a total dataset size of 180,000 questions.
In this paper, we present an overview of Mintaka in §3, explain how we built Mintaka in §4, and provide a statistical analysis of the dataset in §5, including a demographic analysis of our crowd workers in §5.3. Finally, in §6, we present results of existing baseline models on Mintaka, the best of which scores 38% hits@1. These results show that existing models have room for improvement.

Dataset | Samples | Text or KG | Complex | Natural | Languages
SQuAD | 150K | Wikipedia | ✗ | ✗ | 1
XQuAD | 2K | Wikipedia | ✗ | ✗ | 11
Natural Questions | 300K | Wikipedia | ✗ | ✓ | 1
HotpotQA | 100K | Wikipedia | ✓ | ✗ | 1
DROP | 100K | Wikipedia | ✓ | ✗ | 1
WebQuestionsSP | 5K | FreeBase | ✗ | ✓ | 1
ComplexQuestions | 2K | FreeBase | ✓ | ✓ | 1
ComplexWebQuestions | 35K | FreeBase | ✓ | ✗ | 1
LC-QuAD 2.0 | 30K | Wikidata, DBPedia | ✓ | ✗ | 1
GrailQA | 64K | Wikidata | ✓ | ✗ | 1
KQA Pro | 120K | Wikidata | ✓ | ✗ | 1
QALD | 400 | DBPedia | ✓ | ✓ | 11
Mintaka (ours) | 20K | Wikidata | ✓ | ✓ | 9

Table 1: Comparison of Mintaka to existing QA datasets
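The baselines in §6 are scored with hits@1, i.e., whether the single top-ranked prediction matches a gold answer. A minimal sketch of such a metric is shown below; the shape of the prediction and gold-answer structures is an assumption for illustration, not something the paper prescribes.

```python
# Minimal sketch of hits@1: the fraction of questions whose top-ranked
# prediction is among the gold answers. The input format (dicts keyed by
# question id) is assumed here for illustration.
from typing import Dict, List

def hits_at_1(predictions: Dict[str, List[str]],
              gold_answers: Dict[str, List[str]]) -> float:
    hits = 0
    for question_id, gold in gold_answers.items():
        ranked = predictions.get(question_id, [])
        # A hit if the single top-ranked prediction matches a gold answer.
        if ranked and ranked[0] in gold:
            hits += 1
    return hits / len(gold_answers)

# Toy usage: two questions, one answered correctly at rank 1.
preds = {"q1": ["Q937", "Q8023"], "q2": ["Q42"]}
gold = {"q1": ["Q937"], "q2": ["Q5"]}
print(hits_at_1(preds, gold))  # -> 0.5
```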
We publicly release the Mintaka dataset with our randomly split train (14,000 samples), dev (2,000 samples), and test (4,000 samples) sets at https://github.com/amazon-research/mintaka.
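A minimal sketch of loading one of the released splits follows; the file name (mintaka_train.json) and field names (question, complexityType) are assumptions based on a typical JSON release, so the repository's actual schema should be checked before relying on them.

```python
# Minimal sketch of loading a Mintaka split and grouping questions by
# complexity type. File name and field names are assumptions for
# illustration; consult the GitHub repository for the actual schema.
import json
from collections import Counter

with open("mintaka_train.json", encoding="utf-8") as f:
    train = json.load(f)

# Count how many training questions fall under each complexity type.
type_counts = Counter(sample["complexityType"] for sample in train)
print(type_counts.most_common())

# Inspect one sample question.
print(train[0]["question"])
```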
2 Related Work
Question answering has no shortage of datasets. Datasets for question answering with reading comprehension, such as SQuAD (Rajpurkar et al., 2016) or Natural Questions (Kwiatkowski et al., 2019), are often large, and some are even multilingual, such as XQuAD (Artetxe et al., 2019), MLQA (Lewis et al., 2019), or TyDi QA (Clark et al., 2020). These datasets, however, are not explicitly built to be complex, and the answer is usually found in a single passage of text.
HotpotQA (Yang et al., 2018) and MuSiQue (Trivedi et al., 2022) add complexity to reading comprehension by introducing multi-hop questions where the answer requires reasoning over two documents, but neither of these datasets naturally elicits its questions. HotpotQA pre-selects two Wikipedia passages and asks workers to write questions using both passages, and MuSiQue composes multi-hop questions from existing single-hop questions. DROP (Dua et al., 2019) is another complex reading comprehension dataset, including complex operations such as addition, counting, and sorting. Again, DROP asks crowd workers to write questions about a selected Wikipedia passage. DROP additionally introduces a constraint that workers need to write questions that can't be solved by an existing model.
Within knowledge graph-based question answering (KGQA), WebQuestionsSP (Berant et al., 2013; Yih et al., 2016) and ComplexQuestions (Bao et al., 2016) are more natural QA datasets. Both collected real user questions using search query logs or the Google Suggest API. The answers were annotated manually using FreeBase as a knowledge graph. WebQuestionsSP contains mostly simple questions, but ComplexQuestions is more complex, including multi-hop questions, temporal constraints, and aggregations. The main drawback of these datasets is size: WebQuestionsSP contains 5K QA pairs, while ComplexQuestions contains only 2K.
ComplexWebQuestions (Talmor and Berant, 2018) is a dataset based on WebQuestionsSP, which increases the size to 35K QA pairs and introduces more complex operations, including multi-hop, comparatives, and superlatives. However, ComplexWebQuestions loses some naturalness, as the dataset is built by automatically generating queries and questions and then asking crowd workers to paraphrase the generated questions.
Recently, several larger-scale complex KGQA datasets have been released. LC-QuAD 2.0 (Dubey et al., 2019) includes 30K questions, including multi-hop questions, and uses the more up-to-date Wikidata and DBpedia knowledge graphs. GrailQA (Gu et al., 2021) is even larger, at 64K questions based on FreeBase, with complex questions including multi-hop, count, and comparatives. KQA Pro (Shi et al., 2020) is larger still, with 120K questions based on Wikidata and complex questions including intersection and superlatives. All of these datasets trade naturalness for scale: to collect question-answer pairs, the authors generate queries from a knowledge graph, generate questions based on the queries, and then ask crowd workers to paraphrase the questions.
Finally, most datasets are only in English; multilingual and complex datasets are rare. QALD 2018 (Usbeck et al., 2018) is one multilingual and complex dataset, covering 11 languages and complex operations such as counts and comparatives; however, it includes only 400 questions.
By building Mintaka, we hope to address an important gap in existing datasets. Mintaka question-answer pairs are both complex and naturally-elicited from crowd workers, with no restrictions on what facts or articles the questions can be about. We also translate Mintaka into 8 languages, making it one of the first large-scale complex and multilingual question answering datasets. A comparison of Mintaka to existing datasets can be seen in Table 1.
3 Mintaka
Mintaka is a complex question answering dataset of 20,000 questions collected in English and translated into 8 languages, for a total of 180,000 questions. Mintaka contains question-answer pairs written by crowd workers and annotated with Wikidata entities in both the question and answer.
We collected questions in eight topics, which were chosen for being broadly appealing and suitable for writing complex questions: MOVIES, MUSIC, SPORTS, BOOKS, GEOGRAPHY, POLITICS, VIDEO GAMES, and HISTORY. Since we want Mintaka to be a complex question answering dataset, we explicitly collected questions in the following complexity types (all examples below are from the Mintaka dataset):
COUNT: questions where the answer requires counting. For example, Q: How many astronauts have been elected to Congress? A: 4

COMPARATIVE: questions that compare two objects on a given attribute (e.g., age, height). For example, Q: Is Mont Blanc taller than Mount Rainier? A: Yes

SUPERLATIVE: questions about the maximums or minimums of a given attribute. For example, Q: Who was the youngest tribute in the Hunger Games? A: Rue

ORDINAL: questions based on an object's position in an ordered list. For example, Q: Who was the last Ptolemaic ruler of Egypt? A: Cleopatra

MULTI-HOP: questions that require 2 or more steps (multiple hops) to answer. For example, Q: Who was the quarterback of the team that won Super Bowl 50? A: Peyton Manning

INTERSECTION: questions that have two or more conditions that the answer must fulfill. For example, Q: Which movie was directed by Denis Villeneuve and stars Timothee Chalamet? A: Dune

DIFFERENCE: questions with a condition that contains a negation. For example, Q: Which Mario Kart game did Yoshi not appear in? A: Mario Kart Live: Home Circuit

YES/NO: questions where the answer is Yes or No. For example, Q: Has Lady Gaga ever made a song with Ariana Grande? A: Yes

GENERIC: questions where the worker was only given the topic and no constraints on complexity. These tend to be simpler fact lookups, such as Q: Where was Michael Phelps born? A: Baltimore, Maryland
For each of the 8 topics, we collected 250 questions per complexity type and 500 generic questions, for a total of 2,500 questions per topic.
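That is, 8 non-generic complexity types × 250 questions plus 500 generic questions gives 2,500 questions per topic, and 8 topics × 2,500 questions gives the 20,000-question total.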
We also collected translations of the 20,000 English questions in 8 languages using professional translators. Since all questions were collected in English from U.S. workers, the questions may have a U.S. bias in terms of the entities (for example, U.S. politicians or books written in English). This is a choice we make, since it allows us to create a fully parallel dataset where models can be easily compared across languages. This choice was also made in previous QA datasets (Usbeck et al., 2018; Artetxe et al., 2019; Lewis et al., 2019).
4 Dataset Collection
To build our dataset, we used Amazon Mechanical Turk (MTurk) for three different tasks. All of our MTurk workers were located in the United States, and to ensure high quality, we required workers to have an approval rating of 98% and at least 5,000 approved tasks. Each of our tasks is explained in the sections below, and examples of the interfaces can be seen in Appendix A.
4.1 Question Elicitation
The first task was to elicit complex questions. To do this, we created tasks for each topic/complexity pair (e.g., Superlative Movie questions, Ordinal Sports questions, etc.). In each task, a worker was asked to write 5 questions and answers about the topic using the given complexity type. The questions and answers were written in free text fields. We had no restrictions on what sources workers could use to write their questions, so workers were not limited to writing questions based on a given article or facts. Workers were given explanations of the complexity type and examples in the instructions. The topics were left general, so within History, workers could write about Ancient Egypt as well as World War II.
For Count and Superlative answers, we additionally asked workers to provide a numerical value as part of the answer. For example, in Count questions, workers would provide both the answer as a number (e.g., 3) and the entities that make up that answer (e.g., Best Picture, Best Adapted Screenplay, and Best Film Editing). In Superlative questions, workers provided the answer (e.g., Missouri River) as well as the numerical value that makes the entity the maximum or minimum (e.g., 2,341 miles). Additionally, in Count questions, if a question had multiple answers, we asked workers to list a minimum of five. For example, for the question "How many cities have hosted a Summer Olympics?", a worker could give the numerical answer 23 but provide only five of the cities. For this reason, answers to questions with more than five entities are not guaranteed to be complete but instead provide a sample of the correct answer.
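To make this answer format concrete, a hypothetical record for the Summer Olympics example above might look like the following; the field names are illustrative only and do not reflect the released schema.

```python
# Hypothetical illustration of a Count answer as described above: a
# numerical answer plus a sample of the supporting entities.
# Field names are illustrative, not the released schema.
count_answer_example = {
    "question": "How many cities have hosted a Summer Olympics?",
    "numerical_answer": 23,
    # Workers listed at least five supporting entities, so this is a
    # sample of the correct answer rather than a complete enumeration.
    "answer_entities_sample": ["London", "Paris", "Tokyo", "Athens", "Los Angeles"],
}
```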
We paid $1.25 per task to write five questions. Workers were limited to completing one task per topic-complexity pair. After collection, we also surveyed the MTurk workers who completed our Question Elicitation task about their demographics. The results of this survey are discussed in §5.3.
4.2 Answer Entity Linking
Answers were collected in the previous task in natural language. In order to link the answers to a knowledge graph, we built an Answer Entity Linking task. We chose to link the answers to Wikidata, since it is a large and up-to-date public knowledge graph. Although we link to Wikidata, we don't guarantee that every question can be answered by Wikidata at the time of writing. It is possible that there are missing or incomplete facts that would prevent a KGQA system from reaching the answer entity in Wikidata given the question.
In this task, workers were shown a question-answer pair and asked to 1) highlight the entities in the answer, and 2) search for the entities on Wikidata and provide the correct URLs. We built a UI for MTurk workers where they could easily highlight entities, and the highlighted entities would automatically generate links to search Wikidata.
Each answer was annotated by two MTurk workers. For agreement, we required the two workers to identify the same entities and the same Wikidata URLs for all entities. If there was disagreement, we sent the question-answer pair to a third annotator. Question-answer pairs where the answer was a number or yes or no were excluded from answer entity linking. Overall, we annotated 20,996 answer entities and achieved 82% agreement after two annotators and 97% agreement after three annotators. The remaining 3% were verified by the authors.
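A minimal sketch of the agreement rule described above (identical entity spans with identical Wikidata links, otherwise escalate to a third annotator) might look like this; the annotation format and the placeholder QIDs are assumptions for illustration.

```python
# Minimal sketch of the two-annotator agreement rule described above.
# The annotation format (span text -> Wikidata URL) and the placeholder
# QIDs are assumptions for illustration.
from typing import Dict, Optional

Annotation = Dict[str, str]  # entity span text -> Wikidata URL

def annotations_agree(a: Annotation, b: Annotation) -> bool:
    # Agreement requires identical span sets and identical links per span.
    return a == b

def resolve(a: Annotation, b: Annotation,
            third: Optional[Annotation] = None) -> Optional[Annotation]:
    """Return the agreed annotation, otherwise defer to a third annotator."""
    return a if annotations_agree(a, b) else third

first = {"Dune": "https://www.wikidata.org/wiki/Q_EXAMPLE"}   # placeholder QID
second = {"Dune": "https://www.wikidata.org/wiki/Q_EXAMPLE"}  # placeholder QID
print(resolve(first, second))  # agreement -> annotation accepted
```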
We paid a base rate of $0.10 per task, which consisted of a single question-answer pair. If the answer had multiple entities, we paid a $0.05 bonus for every additional entity identified that was agreed upon by another annotator.
4.3 Question Entity Linking
An end-to-end question answering model can be trained using the question and answer alone (Oliya et al., 2021). However, to better evaluate end-to-end methods and train models requiring entities, we also created an MTurk task to link entities in the question text.
Linking entities in questions is more challenging than in answers. While answer texts are often short and contain a clear entity (e.g., "Joe Biden"), question texts can contain multiple possible entities. In the question "Who is the president of the United States?", a worker could select "United States", or "president" and "United States", or even "president of the United States". Since early test runs showed it would be difficult to get agreement on question entities, we modified the task so that workers only verified a span and linked the entity in Wikidata.
To identify spans in questions, we used spaCy's (Honnibal et al., 2020) en_core_web_trf model to identify named entities and noun chunks with cap-
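A minimal sketch of this span-identification step is shown below, using spaCy's en_core_web_trf pipeline to collect named entities and noun chunks as candidate spans; any further filtering of these candidates used in the paper is not reproduced here.

```python
# Minimal sketch of extracting candidate spans with spaCy's
# en_core_web_trf model: named entities plus noun chunks.
# Requires `pip install spacy` and
# `python -m spacy download en_core_web_trf`.
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Who was the quarterback of the team that won Super Bowl 50?")

# Named entities recognized by the transformer pipeline.
entity_spans = [(ent.text, ent.label_) for ent in doc.ents]

# Noun chunks as additional candidate spans (requires the parser,
# which is part of this pipeline).
noun_chunk_spans = [chunk.text for chunk in doc.noun_chunks]

print(entity_spans)      # e.g., [('Super Bowl 50', 'EVENT')]
print(noun_chunk_spans)  # e.g., ['Who', 'the quarterback', 'the team']
```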