and reasoning over these sub-questions in order to
arrive at a final answer (multi-hop reasoning) has
been a common goal (Khot et al., 2020; Min et al., 2019; Yang et al., 2018; Khashabi et al., 2018).
At the system level, investigating a system’s abil-
ity to generalize from question types seen during
training (e.g., “Who directed x?”) to new, unseen
instances of the same type (e.g., “Who directed In-
ception?”) has attracted increasing attention (Key-
sers et al.,2019). Further works have explored
both problems—multi-hop reasoning and composi-
tional generalization—through the lens of semantic
parsing (Wolfson et al., 2020; Shaw et al., 2021).
In contrast, we focus on a new schema of recom-
mendation tasks, where by design the decomposi-
tion required to perform the task is not transparent
from the question itself but is known a priori across
a variety of domains. This schema allows us to eval-
uate the effectiveness of a novel CFT approach in
two domains, and to compare it against the recent
chain of thought prompting approach.
2.3 Curriculum Learning (CL)
The seminal work in CL (Bengio et al., 2009) included a language modeling experiment in which
training data were ordered from most to least fre-
quent based on corpus statistics. Since then, many
works in NLP have explored different measures of
example difficulty, as simple as sequence length
for NLG (Rajeswar et al., 2017) and as complex
as estimates based on model performance (Sachan
and Xing, 2016; Xu et al., 2020). However, such
a focus on example difficulty has kept these works
distant from the “shaping hypothesis” that inspired
Bengio et al. (2009): the idea that a complex task
can be taught by breaking it into a sequence of
smaller steps of incremental complexity (Krueger
and Dayan, 2009). In this work, instead of incremental example difficulty, we explore a different
approach to incremental complexity based on orga-
nizing training data around component tasks.
To the best of our knowledge, the closest works
can be found in the domain of spatial navigation
instructions (Dan et al., 2021; Lake and Baroni, 2018), in which an LM starts with simple block-moving instructions and progresses to compositional ones. However, our work differs in the diversity of our component tasks, in the more extensive experimentation that ensues, and in the applicability of CFT to other similarly diverse domains.
3 Problem Definition
The recommendation task depicted in Figure 1 takes as input a set of items (set I) and a set of user preferences (set P), such that Recommend(P, I) outputs the item that best matches the user preferences. In its simplest form, we have a pair of items I = {i1, i2} and a single preference P = {p}, such that Recommend({p}, {i1, i2}). This form maps naturally to what we call a “decision template,” composed of two sentences: one with a preference (e.g., “You don’t like cold weather.”) and another with a sufficiently different pair of items (e.g., “Between London and Lisbon, you should visit” → Lisbon). We use the term “decision” because Recommend(P, I) can be considered an instance of a decision task where I represents the options and P expresses the criteria to be applied.
Breaking down Recommend({p}, {i1, i2}) into component tasks, the first task consists of comparing two items along a given attribute. This can be defined as Compare(a, o, {i1, i2}), which takes as input an attribute a (e.g., temperature), an order o (e.g., higher), and the two items, and then outputs the item that satisfies the comparison. We call this task a “factual comparison” (e.g., “Between London and Lisbon, the city with warmer weather is” → Lisbon), which is further decomposed into “factual statements” that simply enunciate the attribute value of an item (e.g., “The average temperature in Lisbon is” → 17.5°C).
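The decomposition above can be sketched in code. The following is a minimal illustration only: the attribute values and the assumption that a preference maps directly to an (attribute, order) pair are ours, not taken from the paper’s datasets.

```python
# Toy attribute table; values are illustrative assumptions.
CITY_TEMPS = {"London": 11.1, "Lisbon": 17.5}  # average temperature (°C)

def factual_statement(attribute_table, item):
    """Factual statement: the attribute value of a single item,
    e.g., "The average temperature in Lisbon is" -> 17.5."""
    return attribute_table[item]

def compare(attribute_table, order, items):
    """Compare(a, o, {i1, i2}): the item whose attribute value
    satisfies the requested order ("higher" or "lower")."""
    key = lambda item: factual_statement(attribute_table, item)
    return max(items, key=key) if order == "higher" else min(items, key=key)

def recommend(preference, items):
    """Recommend({p}, {i1, i2}), reduced to a factual comparison.
    The mapping from the preference to (attribute, order) is
    assumed known a priori, per the schema."""
    attribute_table, order = preference
    return compare(attribute_table, order, items)

# "You don't like cold weather." -> prefer the warmer city.
print(recommend((CITY_TEMPS, "higher"), ["London", "Lisbon"]))  # -> Lisbon
```

The point of the sketch is the reduction itself: the recommendation bottoms out in factual lookups, mirroring how the component tasks are organized for training.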
With that, a domain D can be formalized as D = (Ifull, A), where Ifull is the full set of items and A the set of attributes. Considering the world travel domain, for example, Ifull may represent a list of well-known cities and A = {temperature, population} the average temperature and total population, respectively. We instantiate this schema in our experiments in §5, but it can be used to generate new recommendation datasets or repurposed for other decision tasks.
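As a rough sketch of how the schema D = (Ifull, A) could be instantiated to generate examples, the snippet below enumerates item pairs per attribute and emits factual-comparison prompts; the city names, values, and prompt wording are illustrative assumptions, not the paper’s actual data.

```python
import itertools

# A toy domain D = (I_full, A); values are illustrative assumptions.
I_FULL = ["London", "Lisbon", "Oslo"]
A = {
    "temperature": {"London": 11.1, "Lisbon": 17.5, "Oslo": 6.3},
    "population": {"London": 8_800_000, "Lisbon": 545_000, "Oslo": 700_000},
}

def generate_templates(items, attributes):
    """Yield (prompt, item pair, answer) triples: one factual comparison
    per attribute and per pair of sufficiently different items."""
    for attr, values in attributes.items():
        for i1, i2 in itertools.combinations(items, 2):
            if values[i1] == values[i2]:
                continue  # skip pairs that are not sufficiently different
            answer = i1 if values[i1] > values[i2] else i2
            prompt = f"Between {i1} and {i2}, the one with higher {attr} is"
            yield prompt, (i1, i2), answer

for prompt, pair, answer in generate_templates(I_FULL, A):
    print(prompt, "->", answer)
```

Swapping in a different item list and attribute set is all that is needed to target a new domain, which is what makes the schema reusable across decision tasks.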
3.1 A Challenging Task for Pretrained LMs
Even state-of-the-art LMs such as GPT-3 (Brown et al., 2020) struggle at this recommendation task, as evidenced by the experiments fully described in §5. As shown in Table 1, the 175B-parameter DaVinci model in 8-shot mode produces the correct recommendation in 83% of test
cases in the world travel domain, but only 55% in
the local dining domain, which cannot be improved
with chain of thought prompting. As shown in