and reasoning over these sub-questions in order to
arrive at a final answer (multi-hop reasoning) has
been a common goal (Khot et al., 2020; Min et al., 2019; Yang et al., 2018; Khashabi et al., 2018).
At the system level, investigating a system’s abil-
ity to generalize from question types seen during
training (e.g., “Who directed x?”) to new, unseen
instances of the same type (e.g., “Who directed In-
ception?”) has attracted increasing attention (Key-
sers et al.,2019). Further works have explored
both problems—multi-hop reasoning and composi-
tional generalization—through the lens of semantic
parsing (Wolfson et al., 2020; Shaw et al., 2021).
In contrast, we focus on a new schema of recom-
mendation tasks, where by design the decomposi-
tion required to perform the task is not transparent
from the question itself but is known a priori across
a variety of domains. This schema allows us to eval-
uate the effectiveness of a novel CFT approach in
two domains, and to compare it against the recent
chain of thought prompting approach.
2.3 Curriculum Learning (CL)
The seminal work in CL (Bengio et al., 2009) included a language modeling experiment in which
training data were ordered from most to least fre-
quent based on corpus statistics. Since then, many
works in NLP have explored different measures of
example difficulty, as simple as sequence length
for NLG (Rajeswar et al., 2017) and as complex
as estimates based on model performance (Sachan
and Xing, 2016; Xu et al., 2020). However, such
a focus on example difficulty has kept these works
distant from the “shaping hypothesis” that inspired
Bengio et al. (2009): the idea that a complex task
can be taught by breaking it into a sequence of
smaller steps of incremental complexity (Krueger
and Dayan, 2009). In this work, instead of incremental example difficulty, we explore a different
approach to incremental complexity based on orga-
nizing training data around component tasks.
To the best of our knowledge, the closest works
can be found in the domain of spatial navigation
instructions (Dan et al., 2021; Lake and Baroni, 2018), in which an LM starts with simple block-moving instructions and progresses to compositional ones. However, our work differs in the diversity of our component tasks, in the more extensive experimentation that ensues, and in the applicability of CFT to other similarly diverse domains.
3 Problem Definition
The recommendation task depicted in Figure 1 takes as input a set of items (set I) and a set of user preferences (set P), such that Recommend(P, I) outputs the item that best matches the user preferences. In its simplest form, we have a pair of items I = {i1, i2} and a single preference P = {p}, such that Recommend({p}, {i1, i2}). This form maps naturally to what we call a “decision template,” composed of two sentences: one with a preference (e.g., “You don’t like cold weather.”) and another with a sufficiently different pair of items (e.g., “Between London and Lisbon, you should visit” → Lisbon). We use the term “decision” because Recommend(P, I) can be considered an instance of a decision task where I represents the options and P expresses the criteria to be applied.
Breaking down Recommend({p}, {i1, i2}) into component tasks, the first task consists of comparing two items along a given attribute. This can be defined as Compare(a, o, {i1, i2}), which takes as input an attribute a (e.g., temperature), an order o (e.g., higher), and the two items, and then outputs the item that satisfies the comparison. We call this task a “factual comparison” (e.g., “Between London and Lisbon, the city with warmer weather is” → Lisbon), which is further decomposed into “factual statements” that simply enunciate the attribute value of an item (e.g., “The average temperature in Lisbon is” → 17.5°C).
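The decomposition above can be sketched in code. The following is a minimal illustration only: the attribute values and the assumption that a preference maps directly to an (attribute, order) pair are ours, not taken from the paper’s datasets.

```python
# Toy attribute table; values are illustrative assumptions.
CITY_TEMPS = {"London": 11.1, "Lisbon": 17.5}  # average temperature (°C)

def factual_statement(attribute_table, item):
    """Factual statement: the attribute value of a single item,
    e.g., "The average temperature in Lisbon is" -> 17.5."""
    return attribute_table[item]

def compare(attribute_table, order, items):
    """Compare(a, o, {i1, i2}): the item whose attribute value
    satisfies the requested order ("higher" or "lower")."""
    key = lambda item: factual_statement(attribute_table, item)
    return max(items, key=key) if order == "higher" else min(items, key=key)

def recommend(preference, items):
    """Recommend({p}, {i1, i2}), reduced to a factual comparison.
    The mapping from the preference to (attribute, order) is
    assumed known a priori, per the schema."""
    attribute_table, order = preference
    return compare(attribute_table, order, items)

# "You don't like cold weather." -> prefer the warmer city.
print(recommend((CITY_TEMPS, "higher"), ["London", "Lisbon"]))  # -> Lisbon
```

The point of the sketch is the reduction itself: the recommendation bottoms out in factual lookups, mirroring how the component tasks are organized for training.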
With that, a domain D can be formalized as D = (Ifull, A), where Ifull is the full set of items and A the set of attributes. Considering the world travel domain, for example, Ifull may represent a list of well-known cities and A = {temperature, population} the average temperature and total population, respectively. We instantiate this schema in our experiments in §5, but it can be used to generate new recommendation datasets or repurposed for other decision tasks.
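As a rough sketch of how the schema D = (Ifull, A) could be instantiated to generate examples, the snippet below enumerates item pairs per attribute and emits factual-comparison prompts; the city names, values, and prompt wording are illustrative assumptions, not the paper’s actual data.

```python
import itertools

# A toy domain D = (I_full, A); values are illustrative assumptions.
I_FULL = ["London", "Lisbon", "Oslo"]
A = {
    "temperature": {"London": 11.1, "Lisbon": 17.5, "Oslo": 6.3},
    "population": {"London": 8_800_000, "Lisbon": 545_000, "Oslo": 700_000},
}

def generate_templates(items, attributes):
    """Yield (prompt, item pair, answer) triples: one factual comparison
    per attribute and per pair of sufficiently different items."""
    for attr, values in attributes.items():
        for i1, i2 in itertools.combinations(items, 2):
            if values[i1] == values[i2]:
                continue  # skip pairs that are not sufficiently different
            answer = i1 if values[i1] > values[i2] else i2
            prompt = f"Between {i1} and {i2}, the one with higher {attr} is"
            yield prompt, (i1, i2), answer

for prompt, pair, answer in generate_templates(I_FULL, A):
    print(prompt, "->", answer)
```

Swapping in a different item list and attribute set is all that is needed to target a new domain, which is what makes the schema reusable across decision tasks.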
3.1 A Challenging Task for Pretrained LMs
Even state-of-the-art LMs such as GPT-3 (Brown et al., 2020) struggle at this recommendation task, as evidenced by the experiments fully described in §5. As shown in Table 1, the 175B-parameter DaVinci model in 8-shot mode produces the correct recommendation in 83% of test
cases in the world travel domain, but only 55% in
the local dining domain, which cannot be improved
with chain of thought prompting. As shown in