have an approval rating of 98% and at least 5,000
approved tasks. Each of our tasks is explained in
the sections below, and examples of the interfaces
can be seen in Appendix A.
4.1 Question Elicitation
The first task was to elicit complex questions. To do
this, we created tasks for each topic/complexity pair
(e.g., Superlative Movie questions, Ordinal Sports
questions, etc.). In each task, a worker was asked
to write 5 questions and answers about the topic
using the given complexity type. The questions and answers were written in free-text fields. We placed no restrictions on the sources workers could use, so they were not limited to writing questions based on a given article or set of facts.
Workers were given explanations of the complexity
type and examples in the instructions. The topics
were left general, so within History, workers could
write about Ancient Egypt as well as World War II.
For Count and Superlative answers, we addition-
ally asked workers to provide a numerical value as
part of the answer. For example, in Count ques-
tions, workers would both provide the answer as a
number (e.g., 3) as well as the entities that make
up that answer (e.g., Best Picture, Best Adapted
Screenplay, and Best Film Editing). In Superla-
tive questions, workers provided the answer (e.g.,
Missouri River) as well as the numerical value that
makes the entity the maximum or minimum (e.g.,
2,341 miles). Additionally, for Count questions with multiple answers, we asked workers to list at least five of them. For example, for the
question "How many cities have hosted a Summer
Olympics?", a worker could give the numerical an-
swer 23 but provide only five of the cities. For
this reason, answers to questions with more than
five entities are not guaranteed to be complete but
instead provide a sample of the correct answer.
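Purely as an illustration of this answer format, a Count and a Superlative record could be pictured as follows; the field names are our own and are not claimed to match the dataset's actual schema, while the values come from the examples above.

count_answer = {
    "numerical_answer": 3,              # the count itself
    "answer_entities": [                # entities that make up the count
        "Best Picture",
        "Best Adapted Screenplay",
        "Best Film Editing",
    ],
}

superlative_answer = {
    "answer_entity": "Missouri River",  # the maximum/minimum entity
    "numerical_value": "2,341 miles",   # value that makes it the superlative
}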
We paid $1.25 per task of five questions. Workers were limited to completing one task per topic/complexity pair. After collection, we also
surveyed the MTurk workers who completed our
Question Elicitation task about their demographics.
The results of this survey are discussed in §5.3.
4.2 Answer Entity Linking
In the previous task, answers were collected as natural language text. To link these answers to a knowledge graph, we built an Answer Entity Linking task. We chose to link the answers to Wikidata, since it is a large and up-to-date public knowledge graph. Although we link to Wikidata, we do not guarantee that every question can be answered by Wikidata at the time of writing; missing or incomplete facts may prevent a KGQA system from reaching the answer entity in Wikidata given the question.
In this task, workers were shown a question-
answer pair and asked to 1) highlight the entities
in the answer, and 2) search for the entities on
Wikidata and provide the correct URLs. We built a UI in which MTurk workers could easily highlight entities, and each highlighted entity automatically generated a link to search Wikidata.
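The task UI linked out to Wikidata's own search rather than calling any API. For readers who wish to reproduce a similar candidate lookup programmatically, a rough sketch using the public wbsearchentities endpoint (our own illustration, not part of the annotation pipeline) is:

import requests

def search_wikidata(entity_text, limit=5):
    # Return candidate Wikidata items (QID, label, description) for a highlighted span.
    params = {
        "action": "wbsearchentities",
        "search": entity_text,
        "language": "en",
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    resp = requests.get("https://www.wikidata.org/w/api.php", params=params)
    resp.raise_for_status()
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in resp.json().get("search", [])]

For example, search_wikidata("Missouri River") returns a ranked list of candidate items from which the correct QID can be selected.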
Each answer was annotated by two MTurk work-
ers. We counted two annotations as agreeing only if both workers identified the same entities and the same Wikidata URL for every entity. If there was disagreement, we sent the question-answer pair to a third annotator. Question-answer pairs whose answer was a number, yes, or no were excluded from answer entity linking. Overall, we annotated 20,996 answer
entities and achieved 82% agreement after two an-
notators and 97% agreement after three annotators.
The remaining 3% were verified by the authors.
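The agreement criterion above amounts to an exact match over (entity span, Wikidata URL) pairs. A minimal sketch of this check, in our own formulation:

def annotations_agree(annotation_a, annotation_b):
    # Each argument is the set of (entity_span, wikidata_url) pairs from one worker;
    # two annotations agree only if the sets match exactly.
    return set(annotation_a) == set(annotation_b)

If the two initial annotators disagree under this check, the pair is routed to a third annotator; the 82% and 97% figures above correspond to this check succeeding after two and three annotators, respectively.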
We paid a base rate of $0.10 per task, which
consisted of a single question-answer pair. If
the answer had multiple entities, we paid a $0.05
bonus for every additional entity identified that was
agreed upon by another annotator.
4.3 Question Entity Linking
An end-to-end question answering model can be
trained using the question and answer alone (Oliya et al., 2021). However, to better evaluate end-to-end methods and to train models that require entities,
we also created an MTurk task to link entities in
the question text.
Linking entities in questions is more challenging than linking entities in answers. While answer texts are often short
and contain a clear entity (e.g., "Joe Biden"), ques-
tion texts can contain multiple possible entities. In
the question "Who is the president of the United
States?", a worker could select "United States", or
"president" and "United States", or even "president
of the United States". Since early test runs showed
it would be difficult to get agreement on question
entities, we modified the task so workers only veri-
fied a span and linked the entity in Wikidata.
To identify spans in questions, we used spaCy's en_core_web_trf model (Honnibal et al., 2020) to
identify named entities and noun chunks with cap-