Data-Efficiency with a Single GPU: An Exploration of Transfer Methods for Small Language Models
Alon Albalak1, Akshat Shrivastava2, Chinnadhurai Sankar2, Adithya Sagar2, Mike Ross2
1University of California, Santa Barbara 2Meta AI
alon_albalak@ucsb.edu
Abstract

Multi-task learning (MTL), instruction tuning, and prompting have recently been shown to improve the generalizability of large language models to new tasks. However, the benefits of such methods are less well-documented in smaller language models, with some studies finding contradictory results. In this work, we explore and isolate the effects of (i) model size, (ii) general purpose MTL, (iii) in-domain MTL, (iv) instruction tuning, and (v) few-shot fine-tuning for models with fewer than 500 million parameters. Our experiments in the zero-shot setting demonstrate that models gain a 31% relative improvement, on average, from general purpose MTL, with an additional 37.6% relative gain from in-domain MTL. Contrary to prior works on large models, we find that instruction tuning provides a modest 2% performance improvement for small models.
1 Introduction

Many recent works have demonstrated the benefits of prompting for large language models (see Liu et al. (2022) for an extensive survey). While prompts started as simple task identifiers (e.g., "topic" for topic classification), they have expanded to include answer templates, examples, and instructions (Raffel et al., 2020; Albalak et al., 2022; Wei et al., 2022a; Mishra et al., 2021; Ouyang et al., 2022). Studies on utilizing prompts have shown that as model sizes scale up, the generalization abilities of a model increase (Brown et al., 2020; Lester et al., 2021; Min et al., 2022). However, utilizing models at the hundred-billion parameter scale is not accessible for most researchers and practitioners. Furthermore, Wei et al. (2022b) show that trends for large language models do not hold for smaller language models. For this reason, it is crucial that we empirically find the trends that occur in smaller models rather than rely on studies of larger models.
Interestingly, some findings on instruction tuning have been contradictory across studies. For example, Wei et al. (2022a) find that models with fewer than 8B parameters see decreases in generalization when utilizing instructions, whereas Gupta et al. (2022) find consistent gains in models with 3B and fewer parameters. Complicating these results further, Gupta et al. (2022) only consider two situations: inputs that include both instructions and answer templates, or inputs that include neither.
Simultaneously with the emergence of prompting, the explicit multi-task learning (MTL) paradigm emerged, with works such as Muppet (Aghajanyan et al., 2021) or T0 (Sanh et al., 2022) and their variants. Explicit MTL has been demonstrated as a means of improving the downstream performance of pre-trained language models in data-constrained settings. In this work we consider two types of MTL: general purpose and in-domain. Specifically, general purpose MTL consists of training across a wide variety of tasks and domains, whereas in-domain MTL consists of training across a variety of tasks that all occur within a domain.
One limitation of many previous works on prompting and multi-task learning is that they focus on language models at the billion-parameter scale. For situations with latency and memory limitations, small models may be the only option. In this work, we study an example of such a domain: dialogue.
In this work we bridge the gap between previous studies by exploring the effects of a variety of factors on the zero- and few-shot generalizability of modestly sized language models (<500M parameters). Specifically, we run experiments to find the effects of: (i) model size, (ii) general purpose MTL, (iii) in-domain MTL, (iv) instruction tuning, and (v) fine-tuning with and without instructions. Additionally, we perform a linguistic analysis on instruction wording and analyze variations in performance across task instructions.
In this study, we find that (1) in-domain multi-task learning (MTL) gives the largest improvements to generalizability, up to 80% increased relative performance, and 37.6% on average across all models; (2) increasing model size alone has little effect on generalization, but when combined with in-domain MTL it leads to double the (already strong) performance improvement of in-domain MTL; (3) general purpose MTL can provide large gains (57% improvement) for downstream tasks which closely resemble the MTL tasks, but still provides modest gains (5%) even for tasks which are more dissimilar; and (4) instruction tuning during in-domain MTL provides modest gains of just over 2% in performance, regardless of model size.
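The percentages above and in the abstract are relative gains, measured against each model's baseline score on the same tasks. The exact formula is not spelled out in this excerpt, so the following is an assumed (standard) definition for a baseline score $s_{\text{base}}$ and a post-MTL score $s_{\text{new}}$:

```latex
\text{relative improvement} = \frac{s_{\text{new}} - s_{\text{base}}}{s_{\text{base}}} \times 100\%
```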
2 Experiments

Data
For this study we utilize 46 tasks from the InstructDial dataset (Gupta et al., 2022). Each task is converted into a sequence-to-sequence format with an answer template, allowing a single generative model to perform all tasks. Each task contains 3 to 10 instructions, with 4.4 instructions on average. For our zero-shot experiments, we use 3 splits of train/test tasks, where each split contains 40 training tasks and 6 test tasks. For our few-shot experiments, we use the first data split only. Tasks are divided into classification and generation, where classification tasks are evaluated by accuracy and generation tasks by ROUGE-L scores. The full list of tasks and information on train/test splits can be found in Section A of the Appendix.
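To make the sequence-to-sequence conversion concrete, the sketch below shows one way an InstructDial-style instance could be rendered with an answer template and, optionally, a sampled instruction prefix. The field names, special markers, and template wording are illustrative assumptions rather than the exact preprocessing used by the authors.

```python
import random

def format_example(example: dict, instructions: list, use_instruction: bool):
    """Render a task instance as a (source, target) pair for a seq2seq model.

    `example` is assumed to hold a dialogue context, a question slot, and a
    gold answer; the exact schema and markers here are hypothetical.
    """
    # Answer template: tells the model what kind of output is expected.
    source = f"[CONTEXT] {example['context']} [QUESTION] {example['question']} [ANSWER]"
    if use_instruction:
        # Each task has 3-10 instructions; one is sampled per example.
        source = f"[INSTRUCTION] {random.choice(instructions)} {source}"
    return source, example["answer"]

# Illustrative usage for a (hypothetical) intent classification task.
src, tgt = format_example(
    {"context": "user: I'd like to book a table for two.",
     "question": "What is the user's intent?",
     "answer": "restaurant_booking"},
    instructions=["Identify the intent expressed in the dialogue."],
    use_instruction=True,
)
```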
Models
In our experiments, we utilize three variants of the BART model (Lewis et al., 2020): BART-Base, BART-Large, and BART0++ (Lin et al., 2022). BART0++ is a BART-Large model that has been multi-task trained on PromptSource (Bach et al., 2022) in the same fashion as T0++ (Sanh et al., 2022). All pre-trained models were downloaded from the HuggingFace Transformers library.
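A minimal loading sketch for these checkpoints via the HuggingFace Transformers library is shown below. The BART-Base/Large hub identifiers are the standard Facebook releases; the BART0++ identifier is our assumption based on the release accompanying Lin et al. (2022) and may differ from the exact checkpoint used in the paper.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hub IDs: the BART0++ entry is an assumption and may need adjusting.
CHECKPOINTS = {
    "bart-base": "facebook/bart-base",    # 139M parameters
    "bart-large": "facebook/bart-large",  # 406M parameters
    "bart0++": "yuchenlin/BART0pp",       # BART-Large multi-task trained on PromptSource
}

def load(name: str):
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS[name])
    model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINTS[name])
    return tokenizer, model
```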
Figure 1: Average performance on 10 zero-shot classification tasks (top) and 8 zero-shot generation tasks (bottom), comparing pre-trained models (off the shelf) with models explicitly multi-task trained on in-domain data, with and without instructions.

Experimental Setup
To study the effects of (i) model size, (ii) general purpose MTL, (iii) in-domain MTL, and (iv) instruction tuning, we run a series of zero-shot experiments. In order to measure the effect of (i) model size, we compare performance between BART-Base (139M parameters) and BART-Large (406M parameters). To measure the effect of (ii) general purpose MTL, we compare performance between BART-Large and BART0++. To study the effect of (iii) in-domain MTL, we compare each model trained with in-domain MTL against an off-the-shelf version that is directly tested on each split. To measure the effect of (iv) in-domain MTL with instructions, we include instructions in addition to the answer template in the input sequences. All experiments were repeated with 3 random seeds and reported scores are means (further training details and hyperparameters are given in Section C of the Appendix).
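As a rough sketch of how the zero-shot evaluation could be implemented (our reading of the setup, not the authors' released code), classification tasks are scored by exact match of the generated answer against the gold label and generation tasks by ROUGE-L, here via the `rouge_score` package. It assumes a model and tokenizer loaded as in the earlier sketch and examples formatted as (source, target) pairs.

```python
import torch
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

@torch.no_grad()
def evaluate_zero_shot(model, tokenizer, examples, task_type: str) -> float:
    """examples: list of (source, target) pairs for a single held-out task."""
    model.eval()
    scores = []
    for source, target in examples:
        inputs = tokenizer(source, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_new_tokens=64)
        prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        if task_type == "classification":
            # Exact match between the generated string and the gold label.
            scores.append(float(prediction.strip() == target.strip()))
        else:
            # Generation tasks are scored with ROUGE-L F-measure.
            scores.append(scorer.score(target, prediction)["rougeL"].fmeasure)
    return sum(scores) / len(scores)
```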
In addition to the zero-shot experiments, we also consider the situation where we have small quantities of data for fine-tuning. We run experiments with 10/10, 50/50, and 100/100 training/validation samples. For the few-shot experiments, we study the effects of (i) model size, (ii) general purpose MTL, (iv) instruction tuning, and (v) fine-tuning with or without instructions.
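The few-shot splits can be drawn by subsampling each test task's labeled data with a fixed seed; the sketch below shows one plausible way to do this. The exact sampling procedure is not specified in this section, so treat it as an assumption.

```python
import random

# Hypothetical pool of labeled (source, target) pairs for one test task.
task_examples = [(f"source {i}", f"target {i}") for i in range(500)]

def sample_few_shot(examples, k, seed=0):
    """Draw k training and k validation examples (k = 10, 50, or 100)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    train, valid = shuffled[:k], shuffled[k:2 * k]
    assert len(valid) == k, "task needs at least 2k labeled examples"
    return train, valid

# One subsample per few-shot budget; the seed could be varied for repeated runs.
splits = {k: sample_few_shot(task_examples, k, seed=0) for k in (10, 50, 100)}
```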
3 Findings

Figure 1 shows the average zero-shot performance, divided into classification and generation tasks. Figure 2 shows the absolute scores for all models and methods on each of the 18 test tasks.
Effects of Model Size
When comparing off-the-shelf versions of BART-Base and BART-Large, we find nearly identical performance across classification tasks, and slightly better performance for BART-Base (11.2 vs. 10.2 ROUGE-L) on generation tasks. However, the benefits of model size are demonstrated once the models have been further trained using in-domain MTL (Figure 1). We find that with in-domain MTL the base model improves