Boosting Natural Language Generation from Instructions
with Meta-Learning
Budhaditya Deb, Guoqing Zheng, Ahmed Hassan Awadallah
Microsoft Research
{budeb, zheng, hassanam}@microsoft.com
Abstract
Recent work has shown that language models (LMs) trained with multi-task instructional learning (MTIL) can solve diverse NLP tasks in zero- and few-shot settings with improved performance compared to prompt tuning. MTIL illustrates that LMs can extract and use information about the task from instructions beyond the surface patterns of the inputs and outputs. This suggests that meta-learning may further enhance the utilization of instructions for effective task transfer. In this paper we investigate whether meta-learning applied to MTIL can further improve generalization to unseen tasks in a zero-shot setting. Specifically, we propose to adapt meta-learning to MTIL in three directions: 1) Model Agnostic Meta Learning (MAML), 2) Hyper-Network (HNet) based adaptation to generate task-specific parameters conditioned on instructions, and 3) an approach combining HNet and MAML. Through extensive experiments on the large-scale Natural Instructions V2 dataset, we show that our proposed approaches significantly improve over strong baselines in zero-shot settings. In particular, meta-learning improves the effectiveness of instructions and is most impactful when the test tasks are strictly zero-shot (i.e., no similar tasks in the training set) and are "hard" for LMs, illustrating the potential of meta-learning for MTIL on out-of-distribution tasks.
1 Introduction
Given some basic instructions and a few demonstrations, humans are capable of conducting diverse tasks without any supervision. Can language models perform similarly on unseen tasks when trained with instructions? Specifically, can such an approach work on complex generation tasks with relatively small language models (LMs)?
Recent advances in large LMs have shown tremendous potential in diverse AI applications and have the capability to change the way model developers and users interact with intelligent systems. The representational power of such models allows diverse NLP tasks to be solved purely by appending prompts or demonstrations in context before a test input (Radford et al., 2019; Brown et al., 2020). This has led to the rise of prompt-based training (Liu et al., 2021), where even much smaller models, trained on a large set of tasks in a multi-task setting with prompts, can behave similarly (Schick and Schütze, 2021).
A natural extension of the prompt tuning concept involves adding instructions about the task along with the demonstrations. Instructions are more informative than prompts and help language models solve unseen tasks better. Instructions can take different forms, for example a short task-specific statement (e.g., "Provide a short summary for the following input") (Schick and Schütze, 2021b), or a natural language question ("How would you rephrase that in a few words?") (Sanh et al., 2022; Wei et al., 2022; Bach et al., 2022). However, for complex generation tasks, short instructions can be ambiguous and uninformative, and thus require large LMs that encode much richer prior knowledge.
In contrast, (Wang et al., 2022) define instructions in the Natural Instructions V2 (NIV2) dataset comprising detailed task descriptions, positive and negative examples, and explanations. Instructions in NIV2 are similar to annotation guidelines, and thus potentially more beneficial [1]. Using multi-task instructional learning (MTIL) on diverse tasks, (Wang et al., 2022) showed that even smaller models can be competitive with larger models on zero-shot generalization to unseen tasks.

[1] The instructions in NIV2 are in fact taken from the annotation guidelines for each of the tasks.

Results in (Wang et al., 2022) illustrated that LMs can extract useful information from instructions beyond the surface patterns available in the prompts for solving a task. This suggests that the learning-to-learn, or meta-learning, paradigm can further enhance the utilization of instructions by learning about tasks at a deeper level. In this paper, we investigate how smaller LMs can best benefit from natural instructions and whether meta-learning paradigms can further improve the zero-shot generalization ability of LMs in MTIL. Meta-learning has been shown to be effective in adapting knowledge with little supervision, but to the best of our knowledge it has not been adapted to MTIL in zero-shot settings.
Specifically, we explore two different meta-learning approaches. First, we propose to adapt Model Agnostic Meta Learning (MAML) (Finn et al., 2017), an optimization-based approach, to MTIL. Second, we explore hyper-network (HNet) (Ha et al., 2017) based MTIL, a black-box approach. HNet introduces an auxiliary LM which encodes instructions to produce task-specific parameters that are added to the main LM parameters to generate a task-specific LM at prediction time. In addition, we evaluate a third approach, HNet-MAML, which combines the two by training the HNet model with MAML.
We conduct extensive experiments specifically designed to test the generalization ability of LMs trained with instructions under different zero-shot conditions. We use two sets of training tasks from the NIV2 dataset: 1) all natural language tasks and 2) natural language generation tasks. We evaluate the models on two sets of held-out generation tasks conveying different levels of zero-shot generalization ability: 1) a weak generalization set with a random selection of generation tasks with potential overlap of categories with the training tasks, and 2) a strong generalization set (strict zero-shot conditions) using summarization and title generation tasks with no overlap in categories with the training tasks. We further investigate the task sets under difficulty levels of easy, medium, and hard based on their baseline ROUGE scores.
The main conclusion from our study is that under strict zero-shot conditions, meta-learning with instructions significantly improves performance. The improvements become more significant for the strong generalization task set and when the task difficulty level is hard (i.e., tasks where the LM struggles to generate correct outputs in a zero-shot setting). Moreover, meta-learning increases the effectiveness of instructions under all conditions. While both MAML and HNet models show improvements over the baselines, HNet (along with its MAML extension), by explicitly enforcing the use of instructions through task-specific conditioning of parameters, results in larger gains. In summary, the main contributions of the paper are two-fold. First, we adapt meta-learning approaches to MTIL. Second, we study their efficacy and show significant improvements under strict zero-shot conditions.
2 Related Work
Learning from instructions: An extension of basic prompt-based in-context learning is appending task-specific instructions to prompts. Several recent works, including FLAN (Wei et al., 2022), T0 (Sanh et al., 2022), and (Reif et al., 2021), train a large LM in a multi-task setting with instructions. InstructGPT (Ouyang et al., 2022) takes a slightly different approach by training the GPT-3 model (Brown et al., 2020) on a human-annotated dataset of demonstrations of desired user intents and using reinforcement learning to improve the model's ability to follow such instructions. Yet another direction, called pattern-exploiting training (PET) (Schick and Schütze, 2021a; Schick and Schütze, 2021), formulates instructions as cloze questions and shows that even small LMs can be good few-shot learners and work for language generation.
Meta-learning for language generation: Meta-learning has been applied in several language generation settings, such as (Lin and Lee, 2020) to induce persona in a chatbot, (Mi et al., 2019) for task-oriented dialog systems, (Gu et al., 2018) for low-resource machine translation, and (Chen and Shuai, 2021) for abstractive summarization in low-resource transfer learning, but these do not use instructions for zero-shot transfer. Our MTIL scenario is closely related to MetaICL (Min et al., 2022), which applies multi-task in-context learning in a k-shot setting for classification tasks, but differs in that MetaICL is a k-shot in-context scenario and does not use instructions or meta-learning optimization. While these works are related, to the best of our knowledge meta-learning has not been used to generalize to unseen generation tasks in zero-shot settings using instructions, and thus this paper provides several novel insights and approaches.
Hyper-Networks (HNet) in NLP applications: (Karimi Mahabadi et al., 2021) use HNets to train LMs in a multi-task setting with adapters, and (von Oswald et al., 2020) propose a continual learning framework with HNets conditioned on unique task IDs to reduce catastrophic forgetting. HNets have been used for input conditioning of a decoder in (Ivison and Peters, 2022), which produces a unique decoder for each input and is thus similar to our approach. However, these approaches are not strictly applicable to our zero-shot scenario or to general NLP tasks with task descriptions in natural language.
Language model editing: Our HNet-based approach builds on the architecture in (Cao et al., 2021), which uses it to edit factual knowledge in LMs. While the architecture is similar, we use the HNet to encode task-specific instructions and intend it for controlling task-level LM behavior, unlike the micro-behavior targeted in (Cao et al., 2021). Similar to ours and (Cao et al., 2021), Bayesian hyper-networks (Krueger et al., 2018) reduce the number of predicted parameters by constraining the HNet outputs to scale and shift parameters. (Sinitsin et al., 2020; Mitchell et al., 2022) propose meta-learning approaches for editing errors in a neural network, but these are not directly applicable to MTIL in a zero-shot setting.
MTIL: Finally, the work most closely related to this paper is the Tk-Instruct model from (Wang et al., 2022), which fine-tunes a T5 model (Raffel et al., 2020) with instructions and which we use as the baseline. We use the same dataset and training settings as Tk-Instruct but instead use the pretrained BART model (Lewis et al., 2020), as it is task-agnostic compared to T5 (T5 may not represent a true zero-shot setting). In addition, we enhance this model with meta-learning and consider significantly different training, evaluation, and model settings to test zero-shot generalization, resulting in unique contributions and conclusions orthogonal to the findings in (Wang et al., 2022).
3 Problem Setup
In this section we briefly outline the problem settings and baselines used in this paper.
3.1 Natural Instructions V2 Dataset
We use the Natural Instructions V2 (NIV2) dataset (Wang et al., 2022) [2] to investigate meta-learning approaches for instructional learning. NIV2 is a meta-dataset with over 1600 tasks.

[2] https://instructions.apps.allenai.org/

In NIV2, each task contains instructions and multiple training instances with inputs and outputs. The instructions consist of: 1) Categories (classification, summarization, etc.), 2) Short description (a short sentence about the task), 3) Long description (a detailed description of the task, similar to annotation guidelines), 4) Positive examples (inputs with correct outputs), 5) Negative examples (inputs with incorrect outputs), and 6) Explanations for the positive and negative examples.
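For concreteness, the sketch below shows what a single NIV2-style task record might look like in Python. The field names and values are illustrative assumptions on our part; the released JSON files should be consulted for the exact schema.

```python
# A minimal sketch of one NIV2-style task record. Field names here are
# illustrative; consult the dataset release for the exact JSON schema.
example_task = {
    "Categories": ["Summarization"],
    "Short Description": "Provide a short summary for the input.",
    "Definition": (
        "Given a news article, write a one-sentence summary that covers "
        "the main event, following the annotation guidelines."
    ),
    "Positive Examples": [
        {"input": "The city council approved ...", "output": "Council approves ..."},
    ],
    "Negative Examples": [
        {"input": "The city council approved ...", "output": "The weather was nice."},
    ],
    "Instances": [
        {"input": "Researchers announced ...", "output": "New study finds ..."},
    ],
}
```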
(Wang et al., 2022) train a pretrained T5 language model (Raffel et al., 2020) on input-output pairs with instructions appended before the input in a multi-task setting (Tk-Instruct). During testing, held-out unseen tasks are predicted by appending similar instructions to the test input. (Wang et al., 2022) provide detailed ablations and baseline comparisons with related models showing the impact of instructions. Following those results, we use only the task descriptions and positive examples in this study, as negative examples and explanations were not shown to make any positive contribution.
3.2 Baseline Model with Standard Training
Based on the results in (Wang et al., 2022), where Tk-Instruct was shown to comfortably beat the much larger T5, GPT-3, InstructGPT, and T0 models, we use the Tk-Instruct setting as our baseline, i.e., we train a pretrained encoder-decoder LM on multiple tasks with instructions. We also explored appending the instructions before the decoder sequence but did not find any improvements. However, we did observe that pre-pending a special prefix to the decoder (we use "[Output]:") improves overall prediction performance. We refer to this model as the standard training model.
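As a rough illustration of this setup, the sketch below serializes a task's instructions and one instance into encoder and decoder text. The exact separators and field names are our assumptions, not the paper's verbatim format; only the "[Output]:" decoder prefix is taken from the description above.

```python
def format_example(task, instance):
    """Build (encoder_text, decoder_text) for standard training.

    Hypothetical serialization: the task definition and positive examples
    are appended before the input, and the decoder target gets the special
    "[Output]:" prefix described above. Separators are illustrative.
    """
    parts = ["Definition: " + task["Definition"]]
    for ex in task.get("Positive Examples", []):
        parts.append("[Input]: " + ex["input"])
        parts.append("[Output]: " + ex["output"])
    parts.append("[Input]: " + instance["input"])
    encoder_text = "\n".join(parts)
    decoder_text = "[Output]: " + instance["output"]
    return encoder_text, decoder_text
```

The resulting text pair would then be tokenized and fed to the encoder-decoder LM as usual.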
For our base LM, we use the pretrained BART model (Lewis et al., 2020), as it is task-agnostic compared to T5 [3] and thus represents a stronger zero-shot setting. Interested readers should refer to (Wang et al., 2022) for detailed ablations specific to the NIV2 dataset and the T5 model.

[3] Publicly available T5 models are pretrained on a multi-task mixture of unsupervised and supervised tasks.
3.3 Evaluation Settings
We focus specifically on zero-shot generalization on generation tasks. While the general settings remain similar to (Wang et al., 2022), we consider some specific settings to illustrate the generalization capabilities of the models on different tasks.
For training, we use two sets of tasks: 1) all EN tasks in the NIV2 dataset and 2) generation tasks. For evaluation, we consider two sets of generation tasks with different zero-shot levels: 1) a weak generalization set using a random set of generation tasks with potential similarity to the training tasks, and 2) a strong generalization set using tasks from the summarization and title generation categories with no overlap with the training tasks. The list of evaluation tasks with short descriptions is provided in the appendix in Figures 11 and 12.
We further divide the evaluation tasks into difficulty levels of "easy", "medium", and "hard" based on the ROUGE scores of the baseline model (low scores indicate out-of-distribution and difficult tasks), to see to what extent meta-learning helps improve performance on out-of-distribution tasks.
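A minimal sketch of this bucketing, assuming per-task baseline ROUGE scores are available; the thresholds below are illustrative placeholders, not the paper's actual cut-offs.

```python
def split_by_difficulty(task_rouge, easy_thresh=40.0, hard_thresh=20.0):
    """Bucket tasks into easy/medium/hard by baseline ROUGE score.

    task_rouge: dict mapping task name -> baseline ROUGE score.
    Thresholds are hypothetical; the paper's cut-offs may differ.
    """
    buckets = {"easy": [], "medium": [], "hard": []}
    for task, score in task_rouge.items():
        if score >= easy_thresh:
            buckets["easy"].append(task)
        elif score >= hard_thresh:
            buckets["medium"].append(task)
        else:
            buckets["hard"].append(task)
    return buckets
```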
4 Meta-Learning with Instructions
Training on a large number of diverse tasks and testing on unseen tasks lends itself to the paradigm of learning-to-learn, or meta-learning, which has been successfully applied for generalization in both zero- and few-shot scenarios. Task metadata in the form of instructions can also provide discriminative information about the task process, in addition to the surface patterns of the input and output strings. We investigate whether meta-learning can aid such learning, and adapt three approaches to MTIL.
4.1 Standard Training + MAML
We adapt Model Agnostic Meta Learning (MAML) (Finn et al., 2017) to instructional learning of LMs as a way to generalize to unseen tasks by training on a large number of diverse tasks.
Standard training with MAML is described in Algorithm 1 in the appendix. At each training iteration, we sample two different sets of k tasks for the MAML meta-train and meta-test steps. We sample uniformly across tasks to maximize the diversity of tasks in each batch. The data format is the same as in standard training. Since we test under zero-shot conditions, we do not perform the test-time optimization typically employed in MAML.
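The sketch below shows one such meta-update in PyTorch under this sampling scheme. It is a simplified first-order approximation, not the paper's Algorithm 1: we assume a single inner gradient step, a first-order (FOMAML) outer update, and a user-supplied loss_fn that runs a forward pass over one task batch.

```python
import copy
import torch

def fomaml_step(model, meta_train_batches, meta_test_batches,
                loss_fn, meta_opt, inner_lr=1e-4):
    """One meta-update over k sampled tasks (first-order approximation).

    meta_train_batches / meta_test_batches: lists with one batch per
    sampled task. loss_fn(model, batch) runs a forward pass and returns
    the LM loss. The single inner step and hyper-parameters are
    illustrative; the paper's Algorithm 1 may differ.
    """
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for train_batch, test_batch in zip(meta_train_batches, meta_test_batches):
        fast = copy.deepcopy(model)  # task-specific "fast" weights
        # Inner (meta-train) step: one SGD update on the task batch.
        grads = torch.autograd.grad(loss_fn(fast, train_batch),
                                    fast.parameters())
        with torch.no_grad():
            for p, g in zip(fast.parameters(), grads):
                p -= inner_lr * g
        # Outer (meta-test) loss, evaluated with the adapted weights.
        test_grads = torch.autograd.grad(loss_fn(fast, test_batch),
                                         fast.parameters())
        for mg, g in zip(meta_grads, test_grads):
            mg += g / len(meta_train_batches)
    # First-order MAML: apply the meta-test gradients to the original model.
    meta_opt.zero_grad()
    for p, mg in zip(model.parameters(), meta_grads):
        p.grad = mg
    meta_opt.step()
```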
4.2 Standard Training + HNet
Neither standard nor MAML training explicitly enforces the use of instructions during decoding. The model can thus minimize the loss simply by ignoring the instruction part of the encoder input and attending only to the input and output texts. This can lead to sub-optimal use of the instructions.
Figure 1: Encoding instructions using a hyper-network. [Figure shows the instructions Iτ (task description, positive examples, and the test input x) encoded by the HNet-LM (BART); per-layer feed-forward heads FFN1, ..., FFNn (HNet-FF) map its hidden states H1, ..., Hn to offsets Δθ1τ, ..., Δθnτ, which are added to the Main-LM (BART) parameters as θτ = θ0 + Δθτ before decoding the output y.]
We propose a hyper-network (HNet) based architecture (Ha et al., 2017) to produce task-specific model parameters conditioned on the instructions. The HNet architecture consists of an auxiliary LM (the HNet-LM) alongside the Main-LM. The HNet-LM produces task-specific parameters for the Main-LM by conditioning on the instructions.
In particular, we adapt a specific type of HNet architecture from (Cao et al., 2021) that predicts delta-parameters for the Main-LM, which are then added to the Main-LM parameters to produce the task-specific LM. This preserves the parameters of the Main-LM, utilizing the shared generation capability of the LM while specializing in task-specific behavior. However, there are some specific differences based on our requirements for instructional training, which are described next.
4.2.1 The HNet Language Model (HNet-LM)
Since the input to the HNet model is text, we use a pretrained encoder-decoder LM (BART in this paper) to encode the instructions [4] and use the decoder's hidden states to condition the layer-specific parameters of the Main-LM.
In (Cao et al., 2021), the last hidden state of an LSTM is used to condition the parameters of the main model. To increase the effective bandwidth of the HNet while keeping the number of parameters the same, we use the last N hidden states (one for each of the N layers of the main LM). This simple trick allows the model to independently attend to and condition each layer on the input instructions while still keeping the model parameters the same.
The HNet-LM takes the instruction Iτ for a task τ and a sequence of decoder indexes dn as input and produces N hidden states hτ(n). The decoder index sequence we use is simply 1, ..., n. The decoder indexes provide different inputs to the decoder to influence the generation of distinct parameters for each layer of the main LM. This is not strictly re-

[4] In contrast, (Cao et al., 2021) used an untrained LSTM.
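A minimal PyTorch sketch of this conditioning scheme, under several assumptions of ours: the HNet-LM is a Hugging Face-style seq2seq model whose decoder accepts the index sequence 1..N as decoder_input_ids, and each of the N last decoder hidden states drives a per-layer feed-forward head that emits that layer's parameter offset. The full-size linear heads here are purely illustrative; in practice the offsets would be constrained (e.g., to scale-and-shift parameters, as in Krueger et al., 2018) to keep the heads small.

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Sketch of the HNet-LM conditioning described above.

    Assumes `hnet_lm` is a Hugging Face-style seq2seq model (e.g. BART)
    and that decoder position i of N conditions the offset for main-LM
    layer i. Dimensions and the offset parameterization are assumptions.
    """

    def __init__(self, hnet_lm, hidden_dim, layer_param_sizes):
        super().__init__()
        self.hnet_lm = hnet_lm
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, size) for size in layer_param_sizes
        )

    def forward(self, instruction_ids, instruction_mask):
        n_layers = len(self.heads)
        # Decoder index sequence 1..N: one position per Main-LM layer.
        dec_idx = torch.arange(1, n_layers + 1,
                               device=instruction_ids.device).unsqueeze(0)
        out = self.hnet_lm(
            input_ids=instruction_ids,
            attention_mask=instruction_mask,
            decoder_input_ids=dec_idx,
            output_hidden_states=True,
        )
        # Last decoder layer, first batch element: (N, hidden_dim).
        h = out.decoder_hidden_states[-1][0]
        # One flattened parameter offset per Main-LM layer.
        return [head(h[i]) for i, head in enumerate(self.heads)]
```

At prediction time, each offset would be reshaped and added to the corresponding Main-LM layer's parameters to obtain the task-specific model, θτ = θ0 + Δθτ.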