Data-Efficiency with a Single GPU: An Exploration of Transfer Methods for Small Language Models
Alon Albalak1, Akshat Shrivastava2, Chinnadhurai Sankar2, Adithya Sagar2, Mike Ross2
1University of California, Santa Barbara 2Meta AI
alon_albalak@ucsb.edu
Abstract

Multi-task learning (MTL), instruction tuning, and prompting have recently been shown to improve the generalizability of large language models to new tasks. However, the benefits of such methods are less well-documented in smaller language models, with some studies finding contradictory results. In this work, we explore and isolate the effects of (i) model size, (ii) general purpose MTL, (iii) in-domain MTL, (iv) instruction tuning, and (v) few-shot fine-tuning for models with fewer than 500 million parameters. Our experiments in the zero-shot setting demonstrate that models gain a 31% relative improvement, on average, from general purpose MTL, with an additional 37.6% relative gain from in-domain MTL. Contrary to prior works on large models, we find that instruction tuning provides a modest 2% performance improvement for small models.
1 Introduction

Many recent works have demonstrated the benefits of prompting for large language models (see Liu et al. (2022) for an extensive survey). While prompts started as simple task identifiers (e.g., "topic" for topic classification), they have expanded to include answer templates, examples, and instructions (Raffel et al., 2020; Albalak et al., 2022; Wei et al., 2022a; Mishra et al., 2021; Ouyang et al., 2022). Studies on utilizing prompts have shown that as model sizes scale up, the generalization abilities of a model increase (Brown et al., 2020; Lester et al., 2021; Min et al., 2022). However, utilizing models at the hundred-billion parameter scale is not accessible for most researchers and practitioners. Furthermore, Wei et al. (2022b) show that trends for large language models do not hold for smaller language models. For this reason, it is crucial that we empirically find the trends that occur in smaller models rather than rely on studies of larger models.
Interestingly, some findings on instruction tuning have been contradictory across studies. For example, Wei et al. (2022a) find that models with fewer than 8B parameters see decreases in generalization when utilizing instructions, whereas Gupta et al. (2022) find consistent gains in models with 3B and fewer parameters. Complicating these results further, Gupta et al. (2022) only consider two situations: inputs that include both instructions and answer templates, or inputs that include neither.
Simultaneously with the emergence of prompting, the explicit multi-task learning (MTL) paradigm emerged, with works such as Muppet (Aghajanyan et al., 2021) or T0 (Sanh et al., 2022) and their variants. Explicit MTL has been demonstrated as a means of improving the downstream performance of pre-trained language models in data-constrained settings. In this work we consider two types of MTL: general purpose and in-domain. Specifically, general purpose MTL consists of training across a wide variety of tasks and domains, whereas in-domain MTL consists of training across a variety of tasks that all occur within a domain.
One limitation of many previous works on prompting and multi-task learning is that they focus on language models at the billion-parameter scale. For situations with latency and memory limitations, small models may be the only option. In this work, we study an example of such a domain: dialogue.
In this work we bridge the gap between previous studies by exploring the effects of a variety of factors on the zero- and few-shot generalizability of modestly sized language models (<500M parameters). Specifically, we run experiments to find the effects of: (i) model size, (ii) general purpose MTL, (iii) in-domain MTL, (iv) instruction tuning, and (v) fine-tuning with and without instructions. Additionally, we perform a linguistic analysis on instruction wording and analyze variations in performance across task instructions.
In this study, we find that (1) in-domain multi-task learning (MTL) gives the largest improvements to generalizability, up to 80% increased relative performance, and 37.6% on average across all models; (2) increasing model size alone has little effect on generalization, but when combined with in-domain MTL it leads to double the (already strong) performance improvement of in-domain MTL; (3) general purpose MTL can provide large gains (57% improvement) for downstream tasks which closely resemble the MTL tasks, but still provides modest gains (5%) even for tasks which are more dissimilar; and (4) instruction tuning during in-domain MTL provides modest gains of just over 2% in performance, regardless of model size.
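The percentages above and in the abstract are relative gains, measured against each model's baseline score on the same tasks. The exact formula is not spelled out in this excerpt, so the following is an assumed (standard) definition for a baseline score $s_{\text{base}}$ and a post-MTL score $s_{\text{new}}$:

```latex
\text{relative improvement} = \frac{s_{\text{new}} - s_{\text{base}}}{s_{\text{base}}} \times 100\%
```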
2 Experiments

Data
For this study we utilize 46 tasks from the InstructDial dataset (Gupta et al., 2022). Each task is converted into a sequence-to-sequence format with an answer template, allowing a single generative model to perform all tasks. Each task contains 3 to 10 instructions, with 4.4 instructions on average. For our zero-shot experiments, we use 3 splits of train/test tasks, where each split contains 40 training tasks and 6 test tasks. For our few-shot experiments, we use the first data split only. Tasks are divided into classification and generation, where classification tasks are evaluated by accuracy and generation tasks by ROUGE-L scores. The full list of tasks and information on train/test splits can be found in Section A of the Appendix.
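To make the sequence-to-sequence conversion concrete, the sketch below shows one way an InstructDial-style instance could be rendered with an answer template and, optionally, a sampled instruction prefix. The field names, special markers, and template wording are illustrative assumptions rather than the exact preprocessing used by the authors.

```python
import random

def format_example(example: dict, instructions: list, use_instruction: bool):
    """Render a task instance as a (source, target) pair for a seq2seq model.

    `example` is assumed to hold a dialogue context, a question slot, and a
    gold answer; the exact schema and markers here are hypothetical.
    """
    # Answer template: tells the model what kind of output is expected.
    source = f"[CONTEXT] {example['context']} [QUESTION] {example['question']} [ANSWER]"
    if use_instruction:
        # Each task has 3-10 instructions; one is sampled per example.
        source = f"[INSTRUCTION] {random.choice(instructions)} {source}"
    return source, example["answer"]

# Illustrative usage for a (hypothetical) intent classification task.
src, tgt = format_example(
    {"context": "user: I'd like to book a table for two.",
     "question": "What is the user's intent?",
     "answer": "restaurant_booking"},
    instructions=["Identify the intent expressed in the dialogue."],
    use_instruction=True,
)
```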
Models
In our experiments, we utilize three variants of the BART model (Lewis et al., 2020): BART-Base, BART-Large, and BART0++ (Lin et al., 2022). BART0++ is a BART-Large model that has been multi-task trained on PromptSource (Bach et al., 2022) in the same fashion as T0++ (Sanh et al., 2022). All pre-trained models were downloaded from the HuggingFace Transformers library.
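A minimal loading sketch for these checkpoints via the HuggingFace Transformers library is shown below. The BART-Base/Large hub identifiers are the standard Facebook releases; the BART0++ identifier is our assumption based on the release accompanying Lin et al. (2022) and may differ from the exact checkpoint used in the paper.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hub IDs: the BART0++ entry is an assumption and may need adjusting.
CHECKPOINTS = {
    "bart-base": "facebook/bart-base",    # 139M parameters
    "bart-large": "facebook/bart-large",  # 406M parameters
    "bart0++": "yuchenlin/BART0pp",       # BART-Large multi-task trained on PromptSource
}

def load(name: str):
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS[name])
    model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINTS[name])
    return tokenizer, model
```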
Figure 1: Average performance on 10 zero-shot classification tasks (top) and 8 zero-shot generation tasks (bottom), comparing pre-trained models (off the shelf) with models explicitly multi-task trained on in-domain data, with and without instructions.

Experimental Setup
To study the effects of (i) model size, (ii) general purpose MTL, (iii) in-domain MTL, and (iv) instruction tuning, we run a series of zero-shot experiments. In order to measure the effect of (i) model size, we compare performance between BART-Base (139M parameters) and BART-Large (406M parameters). To measure the effect of (ii) general purpose MTL, we compare performance between BART-Large and BART0++. To study the effect of (iii) in-domain MTL, we compare each model trained with in-domain MTL against an off-the-shelf version that is directly tested on each split. To measure the effect of (iv) in-domain MTL with instructions, we include instructions in addition to the answer template in the input sequences. All experiments were repeated with 3 random seeds and reported scores are means (further training details and hyperparameters are given in Section C of the Appendix).
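As a rough sketch of how the zero-shot evaluation could be implemented (our reading of the setup, not the authors' released code), classification tasks are scored by exact match of the generated answer against the gold label and generation tasks by ROUGE-L, here via the `rouge_score` package. It assumes a model and tokenizer loaded as in the earlier sketch and examples formatted as (source, target) pairs.

```python
import torch
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

@torch.no_grad()
def evaluate_zero_shot(model, tokenizer, examples, task_type: str) -> float:
    """examples: list of (source, target) pairs for a single held-out task."""
    model.eval()
    scores = []
    for source, target in examples:
        inputs = tokenizer(source, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_new_tokens=64)
        prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        if task_type == "classification":
            # Exact match between the generated string and the gold label.
            scores.append(float(prediction.strip() == target.strip()))
        else:
            # Generation tasks are scored with ROUGE-L F-measure.
            scores.append(scorer.score(target, prediction)["rougeL"].fmeasure)
    return sum(scores) / len(scores)
```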
In addition to the zero-shot experiments, we also consider the situation where we have small quantities of data for fine-tuning. We run experiments with 10/10, 50/50, and 100/100 training/validation samples. For the few-shot experiments, we study the effects of (i) model size, (ii) general purpose MTL, (iv) instruction tuning, and (v) fine-tuning with or without instructions.
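The few-shot splits can be drawn by subsampling each test task's labeled data with a fixed seed; the sketch below shows one plausible way to do this. The exact sampling procedure is not specified in this section, so treat it as an assumption.

```python
import random

# Hypothetical pool of labeled (source, target) pairs for one test task.
task_examples = [(f"source {i}", f"target {i}") for i in range(500)]

def sample_few_shot(examples, k, seed=0):
    """Draw k training and k validation examples (k = 10, 50, or 100)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    train, valid = shuffled[:k], shuffled[k:2 * k]
    assert len(valid) == k, "task needs at least 2k labeled examples"
    return train, valid

# One subsample per few-shot budget; the seed could be varied for repeated runs.
splits = {k: sample_few_shot(task_examples, k, seed=0) for k in (10, 50, 100)}
```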
3 Findings

Figure 1 shows the average zero-shot performance, divided into classification and generation tasks. Figure 2 shows the absolute scores for all models and methods on each of the 18 test tasks.
Effects of Model Size
When comparing off-the-shelf versions of BART-Base and BART-Large, we find nearly identical performance across classification tasks, and slightly better performance for BART-Base (11.2 vs. 10.2 ROUGE-L) on generation tasks. However, the benefits of model size are demonstrated once the models have been further trained using in-domain MTL (Figure 1). We find that with in-domain MTL the base model improves