Learning to Perform Complex Tasks through
Compositional Fine-Tuning of Language Models
Victor S. Bursztyn1, David Demeter1, Doug Downey1,2, and Larry Birnbaum1
1Department of Computer Science, Northwestern University, Evanston, IL, USA
2Allen Institute for Artificial Intelligence, Seattle, WA, USA
{v-bursztyn,ddemeter}@u.northwestern.edu
{d-downey,l-birnbaum}@northwestern.edu
Abstract
How to usefully encode compositional task structure has long been a core challenge in AI. Recent work in chain of thought prompting has shown that for very large neural language models (LMs), explicitly demonstrating the inferential steps involved in a target task may improve performance over end-to-end learning that focuses on the target task alone. However, chain of thought prompting has significant limitations due to its dependency on huge pretrained LMs. In this work, we present compositional fine-tuning (CFT): an approach based on explicitly decomposing a target task into component tasks, and then fine-tuning smaller LMs on a curriculum of such component tasks. We apply CFT to recommendation tasks in two domains, world travel and local dining, as well as a previously studied inferential task (sports understanding). We show that CFT outperforms end-to-end learning even with equal amounts of data, and gets consistently better as more component tasks are modeled via fine-tuning. Compared with chain of thought prompting, CFT performs at least as well using LMs only 7.4% of the size, and is moreover applicable to task domains for which data are not available during pretraining.
1 Introduction
Philosophy, linguistics, and computer science have long debated how and whether to explicitly encode the compositionality of task structure in models of language understanding and generation (Fodor and Pylyshyn, 1988). The prevailing paradigm in today's NLP is end-to-end learning, in which the learning of compositional task structure is subsumed by the learning of a complex target task, with the support of increasingly powerful language models (LMs) (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020).
Recent work in compositionality in NLP has been mostly limited to semantic parsing and multi-hop reasoning for the purpose of Q&A (Shaw et al., 2021; Wolfson et al., 2020; Min et al., 2019). However, a series of recent works have proposed generating "chains of thought" as a means to expand an LM's ability to reason beyond a single forward pass (Wei et al., 2022; Zelikman et al., 2022; Nye et al., 2021). The success of chain of thought approaches suggests broader opportunities to study the use of compositional structure as a means to improve the learning of complex tasks, rather than as a byproduct of end-to-end learning.

Figure 1: Component tasks involved in a recommendation prompt (above) and in sports understanding (below). In compositional fine-tuning (CFT), component tasks shaded in light blue precede those in light purple.
Breaking down a complex task into sub-tasks is a ubiquitous construct in human problem-solving. In machine learning, it has inspired curriculum learning (CL) (Bengio et al., 2009), which hypothesizes that a model should start learning from easier concepts and progress to harder ones, as humans do. In this work, we explore the idea of CL through the lens of incremental task complexity, which is fundamentally different from prior work in NLP centered on incremental example difficulty (e.g., organizing training data by increasing sequence length or decreasing word frequency).
We propose compositional fine-tuning (CFT), a fine-tuning strategy in which sub-tasks are organized as components of a curriculum that progressively teaches a target task, as shown visually in Figure 1. CFT is novel in two ways: it is a CL approach in NLP that focuses on incremental task complexity instead of incremental example difficulty; and, unlike chain of thought prompting, CFT does not depend on huge, pretrained LMs, relying on smaller, fine-tuned LMs instead. This is advantageous because the largest LMs are hard to access and expensive, and their pretraining data, while vast, still fail to cover a wide range of domains.
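To make the curriculum structure concrete, the following minimal sketch (in Python, with placeholder names; it illustrates the idea rather than our exact training interface) orders fine-tuning over component-task datasets before the target task:

    # Minimal sketch of CFT: fine-tune one stage at a time, in order of
    # increasing task complexity. `fine_tune` is a placeholder for any
    # standard supervised LM fine-tuning routine over (prompt, completion)
    # pairs; `curriculum` is an ordered list of component-task datasets,
    # e.g., [factual_statements, factual_comparisons, decision_templates].
    def compositional_fine_tuning(model, curriculum, fine_tune):
        for component_task_data in curriculum:
            model = fine_tune(model, component_task_data)
        return model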
We focus on conversational recommendation, which is especially rich in complex tasks (Bursztyn et al., 2021). As shown in Figure 1, a relatively short recommendation prompt may comprise component tasks as diverse as understanding a user preference (related to pragmatics) and finding an item that correctly matches the semantics of such a preference. Despite this diversity in component tasks, recommendation tasks are still underexplored in the NLP community (Penha and Hauff, 2020; Malkiel et al., 2020; Wang et al., 2021).
We make the following contributions:

• We contribute a new schema for generating recommendation datasets, which we instantiate in two domains: world travel and local dining. By design, LMs are more likely to hold prior knowledge about world cities than about local restaurants, making our released dataset challenging to different degrees.

• We propose compositional fine-tuning (CFT): an approach based on decomposing a target task into component tasks, and then fine-tuning smaller LMs on a curriculum of such component tasks. We instantiate CFT in our recommendation tasks as well as the sports understanding task from Wei et al. (2022).

• We present experiments1 showing that CFT consistently outperforms end-to-end learning, with up to 32% gains in the local dining domain given equal amounts of training data. When compared to chain of thought prompting, we further find that CFT performs equally or better while requiring LMs only 7.4% of the size (as seen in Table 1).

1 Data and code fully available at: https://github.com/vbursztyn/compositional-fine-tuning
World travel (above):

Base Model   Method                                             Score on Decision Templates
DaVinci      8-Shot Prompting                                   0.83 ± 0.08
DaVinci      8-Shot Chain of Thought                            0.98 ± 0.02
Curie        8-Shot Chain of Thought                            0.50 ± 0.12
Curie        CFT on Factual Statements, Factual Comparisons,    0.95 ± 0.01
             and Decision Templates

Local dining (below):

Base Model   Method                                             Score on Decision Templates
DaVinci      8-Shot Prompting                                   0.54 ± 0.09
DaVinci      8-Shot Chain of Thought                            0.55 ± 0.07
Curie        8-Shot Chain of Thought                            0.50 ± 0.06
Curie        CFT on Factual Statements, Factual Comparisons,    0.74 ± 0.05
             and Decision Templates

Table 1: Comparison to chain of thought in the world travel domain (above) and local dining (below). CFT performs as well as chain of thought prompting for world cities and 35% better for local restaurants, with an LM only 7.4% of the size (13B vs 175B).
2 Related Work
2.1 Chain of Thought Approaches
Chain of thought approaches are the most recent stream of research connected to ours (Wei et al., 2022; Zelikman et al., 2022; Gu et al., 2021; Nye et al., 2021; Talmor et al., 2020; Rajani et al., 2019). Wei et al. (2022) recently proposed chain of thought prompting, the idea that very large LMs can do much better at "system 2 tasks" (tasks that require deeper reasoning skills, such as math problems or symbolic reasoning) if they are given examples in the prompt that explicitly describe the intermediate steps of the task. Although effective at improving accuracy, chain of thought prompting is still limited by its dependency on huge, pretrained LMs. In contrast, our CFT approach shows similar gains over end-to-end learning on our tasks, but in a setting with LMs that are more than an order of magnitude smaller.
Among these previous works, we highlight Talmor et al. (2020), who study the effect of factual knowledge injection on LM performance in tasks that involve chaining different facts. In our ablation studies in §5, we cover a configuration that is analogous to theirs and show improvements from having an additional component task.
2.2 Compositionality in Question Answering
Many recent works in the Q&A literature have strived to study compositionality at either the question or the system level. At the question level, learning to decompose a question into smaller questions and reasoning over these sub-questions in order to arrive at a final answer (multi-hop reasoning) has been a common goal (Khot et al., 2020; Min et al., 2019; Yang et al., 2018; Khashabi et al., 2018). At the system level, investigating a system's ability to generalize from question types seen during training (e.g., "Who directed x?") to new, unseen instances of the same type (e.g., "Who directed Inception?") has attracted increasing attention (Keysers et al., 2019). Further works have explored both problems, multi-hop reasoning and compositional generalization, through the lens of semantic parsing (Wolfson et al., 2020; Shaw et al., 2021).
In contrast, we focus on a new schema of recommendation tasks, where by design the decomposition required to perform the task is not transparent from the question itself but is known a priori across a variety of domains. This schema allows us to evaluate the effectiveness of a novel CFT approach in two domains, and to compare it against the recent chain of thought prompting approach.
2.3 Curriculum Learning (CL)
The seminal work in CL (Bengio et al., 2009) included a language modeling experiment in which training data were ordered from most to least frequent based on corpus statistics. Since then, many works in NLP have explored different measures of example difficulty, as simple as sequence length for NLG (Rajeswar et al., 2017) and as complex as estimates based on model performance (Sachan and Xing, 2016; Xu et al., 2020). However, this focus on example difficulty has kept these works distant from the "shaping hypothesis" that inspired Bengio et al. (2009): the idea that a complex task can be taught by breaking it into a sequence of smaller steps of incremental complexity (Krueger and Dayan, 2009). In this work, instead of incremental example difficulty, we explore a different approach to incremental complexity based on organizing training data around component tasks.
To the best of our knowledge, the closest works can be found in the domain of spatial navigation instructions (Dan et al., 2021; Lake and Baroni, 2018), in which an LM starts with simple block-moving instructions and progresses to compositional ones. However, our work differs in the diversity of our component tasks, in the more extensive experimentation that ensues, and in the applicability of CFT to other similarly diverse domains.
3 Problem Definition
The recommendation task depicted in Figure 1 takes as input a set of items (set I) and a set of user preferences (set P), such that Recommend(P, I) outputs the item that best matches the user preferences. In its simplest form, we have a pair of items I = {i1, i2} and a single preference P = {p}, such that Recommend({p}, {i1, i2}). This form maps naturally to what we call a "decision template," composed of two sentences: one with a preference (e.g., "You don't like cold weather.") and another with a sufficiently different pair of items (e.g., "Between London and Lisbon, you should visit" → Lisbon). We use the term "decision" because Recommend(P, I) can be considered an instance of a decision task where I represents the options and P expresses the criteria to be applied.
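As a minimal sketch (the helper name is hypothetical; the wording follows the example above), such a decision template can be rendered as a (prompt, completion) pair:

    # Sketch: render Recommend({p}, {i1, i2}) as a decision template.
    def decision_template(preference, item1, item2, answer):
        # e.g., ("You don't like cold weather. Between London and Lisbon,
        # you should visit", " Lisbon")
        prompt = f"{preference} Between {item1} and {item2}, you should visit"
        completion = f" {answer}"
        return prompt, completion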
Breaking down Recommend({p}, {i1, i2}) into component tasks, the first task consists of comparing two items along a given attribute. This can be defined as Compare(a, o, {i1, i2}), which takes as input an attribute a (e.g., temperature), an order o (e.g., higher), and the two items, and then outputs the item that satisfies the comparison. We call this task a "factual comparison" (e.g., "Between London and Lisbon, the city with warmer weather is" → Lisbon), which is further decomposed into "factual statements" that simply enunciate the attribute value of an item (e.g., "The average temperature in Lisbon is" → 17.5°C).
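The two component tasks can be templated analogously; the sketch below mirrors the examples above (helper names are again hypothetical):

    # Sketch: render the two component tasks as (prompt, completion) pairs.
    def factual_statement(item, attribute_phrase, value):
        # e.g., ("The average temperature in Lisbon is", " 17.5°C")
        return f"The {attribute_phrase} in {item} is", f" {value}"

    def factual_comparison(item1, item2, order_phrase, answer):
        # e.g., Compare(temperature, higher, {London, Lisbon}) -> Lisbon
        prompt = f"Between {item1} and {item2}, the city with {order_phrase} is"
        return prompt, f" {answer}"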
With that, a domain D can be formalized as D = (I_full, A), where I_full is the full set of items and A is the set of attributes. Considering the world travel domain, for example, I_full may represent a list of well-known cities and A = {temperature, population} the average temperature and total population, respectively. We instantiate this schema in our experiments in §5, but it can be used to generate new recommendation datasets or repurposed for other decision tasks.
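As a minimal sketch of how this schema can generate data (the attribute values below are illustrative placeholders, not drawn from our released dataset), a domain can be instantiated and its factual comparisons enumerated over all item pairs:

    # Sketch: instantiate a domain D = (I_full, A) for world travel and
    # enumerate factual comparisons Compare(a, o, {i1, i2}).
    from itertools import combinations

    I_full = {
        "Lisbon": {"temperature": 17.5, "population": 505_000},
        "London": {"temperature": 11.1, "population": 8_900_000},
    }
    A = ["temperature", "population"]

    def compare(attribute, order, item1, item2):
        # Return the item that satisfies the comparison along `attribute`.
        pick = max if order == "higher" else min
        return pick((item1, item2), key=lambda i: I_full[i][attribute])

    comparisons = [
        (a, o, i1, i2, compare(a, o, i1, i2))
        for a in A
        for o in ("higher", "lower")
        for i1, i2 in combinations(I_full, 2)
    ]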
3.1 A Challenging Task for Pretrained LMs
Even state-of-the-art LMs such as GPT-3 (Brown et al., 2020) struggle at this recommendation task, as evidenced by experiments fully described in §5. As shown in Table 1, the 175B-parameter DaVinci in 8-shot mode can accurately recommend 83% of test cases in the world travel domain, but only 55% in the local dining domain, a figure that chain of thought prompting fails to improve. As shown in