Data-Efficiency with a Single GPU: An Exploration of Transfer Methods
for Small Language Models
Alon Albalak1, Akshat Shrivastava2, Chinnadhurai Sankar2, Adithya Sagar2, Mike Ross2
1University of California, Santa Barbara 2Meta AI
alon_albalak@ucsb.edu
Abstract
Multi-task learning (MTL), instruction tuning,
and prompting have recently been shown to
improve the generalizability of large language
models to new tasks. However, the benefits
of such methods are less well-documented in
smaller language models, with some studies
finding contradictory results. In this work, we
explore and isolate the effects of (i) model
size, (ii) general purpose MTL, (iii) in-domain
MTL, (iv) instruction tuning, and (v) few-
shot fine-tuning for models with fewer than
500 million parameters. Our experiments in
the zero-shot setting demonstrate that mod-
els gain 31% relative improvement, on aver-
age, from general purpose MTL, with an ad-
ditional 37.6% relative gain from in-domain
MTL. Contradictory to prior works on large
models, we find that instruction tuning pro-
vides a modest 2% performance improvement
for small models.
1 Introduction
Many recent works have demonstrated the benefits of prompting for large language models (see Liu et al. (2022) for an extensive survey). While prompts started as simple task identifiers (e.g., "topic" for topic classification), they have expanded to include answer templates, examples, and instructions (Raffel et al., 2020; Albalak et al., 2022; Wei et al., 2022a; Mishra et al., 2021; Ouyang et al., 2022). Studies on utilizing prompts have shown that as model sizes scale up, the generalization abilities of a model increase (Brown et al., 2020; Lester et al., 2021; Min et al., 2022). However, models at the hundred-billion-parameter scale are not accessible to most researchers and practitioners. Furthermore, Wei et al. (2022b) show that trends for large language models do not hold for smaller language models. For this reason, we must empirically identify the trends that occur in smaller models rather than rely on studies of larger models.
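To make the distinction between prompt formats concrete, the sketch below contrasts a bare task-identifier prompt with one that adds a natural-language instruction and an answer template. The wordings and the build_prompt helper are illustrative assumptions, not the exact formats used in our experiments.

```python
# Illustrative sketch of prompt composition; the wordings and this helper are
# hypothetical, not the exact formats used in the experiments.

def build_prompt(utterance: str, instruction: str = "", template: str = "") -> str:
    """Join an optional instruction, the input utterance, and an optional answer template."""
    parts = [instruction, utterance, template]
    return "\n".join(part for part in parts if part)

# A prompt as a simple task identifier (e.g., "topic" for topic classification).
print(build_prompt("topic: I'd like to book a table for two tonight."))

# The same input expanded with an instruction and an answer template.
print(build_prompt(
    "I'd like to book a table for two tonight.",
    instruction="Classify the topic of the user's utterance.",
    template="The topic is:",
))
```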
Interestingly, some findings on instruction tuning have been contradictory across studies. For example, Wei et al. (2022a) find that models with fewer than 8B parameters see decreases in generalization when utilizing instructions, whereas Gupta et al. (2022) find consistent gains in models with 3B and fewer parameters. Further complicating the comparison, Gupta et al. (2022) consider only two situations: inputs that include both instructions and answer templates, or neither.
Alongside prompting, the explicit multi-task learning (MTL) paradigm has emerged, with works such as Muppet (Aghajanyan et al., 2021), T0 (Sanh et al., 2022), and their variants. Explicit MTL has been demonstrated as a means of improving the downstream performance of pre-trained language models in data-constrained settings. In this work we consider two types of MTL: general purpose and in-domain. Specifically, general purpose MTL consists of training across a wide variety of tasks and domains, whereas in-domain MTL consists of training across a variety of tasks that all occur within a single domain.
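As a schematic illustration, general purpose MTL samples training batches from a broad mixture of tasks spanning many domains, while in-domain MTL restricts the mixture to tasks within a single domain (here, dialogue). The task and dataset names below are hypothetical placeholders, not the datasets used in our experiments.

```python
import random

# Hypothetical task pools; names are placeholders, not the paper's actual datasets.
GENERAL_PURPOSE_TASKS = {
    "summarization": ["news_summaries"],
    "question_answering": ["open_domain_qa"],
    "sentiment": ["product_reviews"],
}
IN_DOMAIN_TASKS = {  # every task is drawn from the dialogue domain
    "intent_classification": ["dialogue_intents"],
    "slot_filling": ["dialogue_slots"],
    "response_selection": ["dialogue_responses"],
}

def sample_task(task_pool: dict) -> str:
    """Uniformly sample the task that supplies the next multi-task training batch."""
    return random.choice(list(task_pool))

print(sample_task(GENERAL_PURPOSE_TASKS))  # mixes tasks across many domains
print(sample_task(IN_DOMAIN_TASKS))        # mixes tasks within one domain
```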
One limitation of many previous works on prompting and multi-task learning is that they focus on language models at the billion-parameter scale. In settings with latency and memory limitations, small models may be the only option. In this work, we study one domain where such constraints are common: dialogue.
In this work we bridge the gap between previous studies by exploring the effects of a variety of factors on the zero- and few-shot generalizability of modestly sized language models (<500M parameters). Specifically, we run experiments to find the effects of: (i) model size, (ii) general purpose MTL, (iii) in-domain MTL, (iv) instruction tuning, and (v) fine-tuning with and without instructions. Additionally, we perform a linguistic analysis of instruction wording and analyze variations in performance across task instructions.
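These factors can be read as a small configuration grid over which models are compared; the sketch below simply enumerates that grid. The factor values are placeholders, not the exact model sizes or training settings used in the experiments.

```python
from itertools import product

# Illustrative enumeration of the factors being isolated; all values are
# placeholders, not the exact model sizes or training settings used here.
factors = {
    "model_size": ["small", "base", "large"],      # (i)  all <500M parameters
    "general_purpose_mtl": [False, True],          # (ii)
    "in_domain_mtl": [False, True],                # (iii)
    "instruction_tuning": [False, True],           # (iv)
    "fine_tune_with_instructions": [False, True],  # (v)
}

configs = [dict(zip(factors, values)) for values in product(*factors.values())]
print(f"{len(configs)} candidate configurations")  # 3 * 2**4 = 48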
In this study, we find that (1) In-domain multi-