
to answer compositional questions about it.
We next narrow the compositionality gap by us-
ing what we call elicitive prompts. Compositional
questions require more computation and knowl-
edge retrieval than 1-hop ones; however, by us-
ing naive prompting (which expects the answer to
be output immediately after the question), we al-
ways give the model approximately the same num-
ber of steps to answer questions. We show that
elicitive prompting, such as chain of thought (Wei
et al., 2022b), lets the model “think things through”
before it outputs a final answer, which markedly
boosts performance. We then show that our self-ask
prompt, where the prompt has the LM decompose
complex questions into easier sub-questions that it
answers before answering the main question, can
improve performance even more.
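To make the contrast with naive prompting concrete, the sketch below shows the shape of a self-ask-style prompt: one worked exemplar that demonstrates the decomposition, followed by the test question that the model is asked to continue. The scaffolding strings and the helper function are illustrative placeholders rather than the exact prompt text used in our experiments.

# Illustrative self-ask exemplar (simplified; the exact scaffolding wording
# in our experiments may differ).
SELF_ASK_EXEMPLAR = """Question: Who was president of the U.S. when superconductivity was discovered?
Are follow up questions needed here: Yes.
Follow up: When was superconductivity discovered?
Intermediate answer: Superconductivity was discovered in 1911.
Follow up: Who was president of the U.S. in 1911?
Intermediate answer: William Howard Taft.
So the final answer is: William Howard Taft.
"""

def self_ask_prompt(question: str) -> str:
    # The model continues this prompt, emitting its own follow-up
    # sub-questions and intermediate answers before the final answer.
    return (SELF_ASK_EXEMPLAR
            + "\nQuestion: " + question
            + "\nAre follow up questions needed here:")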
Beyond CC, we also apply elicitive prompts to two existing, automatically generated datasets (2WikiMultiHopQA, Ho et al., 2020, and Musique, Trivedi et al., 2022) and to a third dataset of 125 questions, Bamboogle, which we constructed manually. Bamboogle consists of 2-hop questions written by the authors; every question is difficult enough that a popular internet search engine cannot answer it directly, but both supporting pieces of evidence can be found in Wikipedia (and hence were probably included in the pretraining set of any LM).
The two datasets we constructed—Bamboogle
and the previously mentioned Compositional
Celebrities—are complementary and serve differ-
ent research purposes. Bamboogle is a small, hand-crafted dataset that covers many different question types across diverse subject areas, each written in its own way, whereas CC (like Musique and 2WikiMultiHopQA) is a large, automatically generated dataset in which every question fits one of the 17 templates we made (i.e., much lower variation
than Bamboogle). Compositional Celebrities is de-
signed to estimate the compositionality gap on a
large set of questions, and Bamboogle is designed
to measure the extent to which a question answer-
ing system can answer varied compositional ques-
tions, albeit with less statistical power.
Finally, we show that the structure of self-ask
combines easily with an internet search engine to
further improve results on compositional questions.
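A minimal sketch of this combination follows, building on the prompt helper sketched earlier; lm() and search() are hypothetical callables standing in for a single LM completion call and for whatever search backend is used.

def self_ask_with_search(prompt: str, lm, search, max_steps: int = 8) -> str:
    # `prompt` is a self-ask exemplar followed by the test question (see the
    # earlier sketch); each follow-up sub-question the model emits is answered
    # by the search engine rather than by the model itself.
    for _ in range(max_steps):
        continuation = lm(prompt, stop=["Intermediate answer:"])
        prompt += continuation
        if "So the final answer is:" in continuation:
            return continuation.split("So the final answer is:")[-1].strip()
        if "Follow up:" in continuation:
            sub_question = continuation.split("Follow up:")[-1].strip()
            prompt += "\nIntermediate answer: " + search(sub_question) + "\n"
    return ""  # give up if no final answer is produced within max_steps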
In summary, we systematically reveal that al-
though LMs can sometimes compose two facts they
observed separately during pretraining, they fail to
do so in a large fraction of cases, even when they
demonstrate knowledge of the constituent facts in-
dividually. We call this ratio the compositionality
gap and show that it does not shrink with scale.
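Written out with set notation (introduced here only for exposition), over a set of 2-hop questions this ratio is

\[
\text{compositionality gap} = \frac{\bigl|\{\, q : \text{both sub-questions of } q \text{ answered correctly},\ q \text{ itself answered incorrectly} \,\}\bigr|}{\bigl|\{\, q : \text{both sub-questions of } q \text{ answered correctly} \,\}\bigr|}.
\]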
We show that elicitive prompts, such as chain of
thought and our self-ask prompting, narrow and
sometimes close this gap and improve the ability of
LMs to solve complex compositional questions. Fi-
nally, self-ask can be easily combined with a search
engine to further improve performance.
2 Systematically Measuring the
Compositionality Gap
As LMs grow in size, they contain increasing
amounts of knowledge about the world (Brown
et al., 2020; Srivastava et al., 2022). But how does
their compositional ability scale? We investigate
this with a new method that shows how to formally
quantify the compositional ability of an LM.
Our method is based on 2-hop questions that
are grammatical but unlikely to have been previ-
ously uttered, e.g., “What is the calling code of
the birthplace of Frida Kahlo?” We generate them
by crawling a list of celebrities and where/when
they were born. We then retrieve facts about each
birth country (capital, currency, calling code, ...)
and each birth year (winner of the Masters Tournament or the Nobel Prize in Literature that year, ...), and we generate 2-hop questions by combining pairs of facts. Appendix Table 3 shows an exam-
ple question from each of the 17 categories in our
Compositional Celebrities (CC) dataset; Appendix
Section A.2 elaborates on that dataset.
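A simplified sketch of this generation procedure is given below; the entity names, fact tables, and template wording are illustrative stand-ins for the crawled data rather than our exact pipeline.

# Pair a celebrity's birth country with facts about that country to form
# 2-hop questions (one of the 17 CC categories).
CELEBRITIES = {
    "Frida Kahlo": {"birth_country": "Mexico", "birth_year": 1907},
}
COUNTRY_FACTS = {
    "Mexico": {"calling code": "+52", "capital": "Mexico City",
               "currency": "Mexican peso"},
}

def generate_country_questions():
    questions = []
    for person, info in CELEBRITIES.items():
        for fact_name, answer in COUNTRY_FACTS[info["birth_country"]].items():
            question = f"What is the {fact_name} of the birthplace of {person}?"
            questions.append((question, answer))
    return questions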
CC, which we design to measure the compo-
sitionality gap, intentionally contains direct and
unambiguous questions, where (1) each fact has
likely appeared many times in the training dataset,
but (2) the combination of both facts is sufficiently
unnatural that it likely never appeared in the train-
ing set or on the internet at all.3
This question format has many advantages: al-
most all questions have a single correct answer,
they are easily decomposable into sub-questions
(letting us verify whether the LM knows the background facts), and for most questions the answer domain is very large (unlike yes/no or multiple-choice ques-
tions). Hence, the chances of randomly guessing
the correct answer are low.
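These properties also make the measurement itself simple; the sketch below computes the compositionality gap over such a dataset, with qa() a hypothetical call to the prompted LM and correctness reduced to exact match for clarity.

def compositionality_gap(examples, qa):
    # examples: (sub_q1, sub_a1, sub_q2, sub_a2, question, answer) tuples;
    # qa: hypothetical function returning the model's answer to a question.
    both_subs_correct, composed_wrong = 0, 0
    for sub_q1, sub_a1, sub_q2, sub_a2, question, answer in examples:
        if qa(sub_q1) == sub_a1 and qa(sub_q2) == sub_a2:
            both_subs_correct += 1
            if qa(question) != answer:
                composed_wrong += 1
    return composed_wrong / both_subs_correct if both_subs_correct else 0.0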
3 Our manual internet searches using keywords from these questions did not turn up any questions like those in CC, so we believe it is highly unlikely that they were in any LM's training data at the time the dataset was constructed.