Boosting Natural Language Generation from Instructions
with Meta-Learning
Budhaditya Deb, Guoqing Zheng, Ahmed Hassan Awadallah
Microsoft Research
{budeb, zheng, hassanam}@microsoft.com
Abstract
Recent work has shown that language models (LMs) trained with multi-task instructional learning (MTIL) can solve diverse NLP tasks in zero- and few-shot settings with improved performance compared to prompt tuning. MTIL illustrates that LMs can extract and use information about the task from instructions beyond the surface patterns of the inputs and outputs. This suggests that meta-learning may further enhance the utilization of instructions for effective task transfer. In this paper we investigate whether meta-learning applied to MTIL can further improve generalization to unseen tasks in a zero-shot setting. Specifically, we propose to adapt meta-learning to MTIL in three directions: 1) Model Agnostic Meta Learning (MAML), 2) Hyper-Network (HNet) based adaptation to generate task-specific parameters conditioned on instructions, and 3) an approach combining HNet and MAML. Through extensive experiments on the large-scale Natural Instructions V2 dataset, we show that our proposed approaches significantly improve over strong baselines in zero-shot settings. In particular, meta-learning improves the effectiveness of instructions and is most impactful when the test tasks are strictly zero-shot (i.e., no similar tasks in the training set) and are "hard" for LMs, illustrating the potential of meta-learning for MTIL on out-of-distribution tasks.
1 Introduction
Given some basic instructions and a few demonstrations, humans are capable of conducting diverse tasks without any supervision. Can language models perform similarly on unseen tasks when trained with instructions? Specifically, can such an approach work on complex generation tasks with relatively small language models (LMs)?
Recent advances in large LMs have shown tremendous potential in diverse AI applications and have the capability to change the way model developers and users interact with intelligent systems. The representational power of such models allows diverse NLP tasks to be solved purely by appending prompts or demonstrations in context before a test input (Radford et al., 2019; Brown et al., 2020). This has led to the rise of prompt-based training (Liu et al., 2021), where even much smaller models, trained on a large set of tasks in a multi-task setting with prompts, can behave similarly (Schick and Schütze, 2021).
A natural extension of the prompt tuning concept involves adding instructions about the task along with the demonstrations. Instructions are more informative than prompts and help language models solve unseen tasks better. Instructions can take different forms, for example a short task-specific statement (e.g., "Provide a short summary for the following input") (Schick and Schütze, 2021b), or a natural language question ("How would you rephrase that in a few words?") (Sanh et al., 2022; Wei et al., 2022; Bach et al., 2022). However, for complex generation tasks, short instructions can be ambiguous and uninformative, and thus require large LMs that encode much richer prior knowledge.
In contrast, (Wang et al., 2022) define instructions in the Natural Instructions V2 (NIV2) dataset comprising detailed task descriptions, positive and negative examples, and explanations. Instructions in NIV2 are similar to annotation guidelines, and thus potentially more beneficial [1]. Using multi-task instructional learning (MTIL) on diverse tasks, (Wang et al., 2022) showed that even smaller models can be competitive with larger models on zero-shot generalization to unseen tasks.

[1] The instructions in NIV2 are in fact taken from the annotation guidelines for each of the tasks.

Results in (Wang et al., 2022) illustrated that LMs can extract useful information from instructions beyond the surface patterns available in the prompts for solving a task. This suggests that the learning-to-learn, or meta-learning, paradigm can further enhance the utilization of instructions by learning about tasks at a deeper level. In this paper, we investigate how smaller LMs can best benefit from natural instructions and whether meta-learning paradigms can further improve the zero-shot generalization ability of LMs in MTIL. Meta-learning has been shown to be effective in adapting knowledge with little supervision, but to the best of our knowledge it has not been adapted to MTIL in zero-shot settings.
Specifically, we explore two different meta-learning approaches. First, we propose to adapt Model Agnostic Meta Learning (MAML) (Finn et al., 2017), an optimization-based approach, to MTIL. Second, we explore hyper-network (HNet) (Ha et al., 2017) based MTIL, a black-box approach. HNet introduces an auxiliary LM which encodes instructions to produce task-specific parameters that are added to the main LM parameters to generate a task-specific LM at prediction time. In addition, we evaluate a third approach, HNet-MAML, which combines the two by training the HNet model with MAML.
We conduct extensive experiments specifically designed to test the generalization ability of LMs trained with instructions under different zero-shot conditions. We use two sets of training tasks from the NIV2 dataset: 1) all natural language tasks and 2) natural language generation tasks. We evaluate the models on two sets of held-out generation tasks conveying different levels of zero-shot generalization ability: 1) a weak generalization set with a random selection of generation tasks with potential overlap of categories with the training tasks, and 2) a strong generalization set (strict zero-shot conditions) using summarization and title generation tasks with no overlap in categories with the training tasks. We further investigate the task sets under difficulty levels of easy, medium, and hard based on their baseline ROUGE scores.
The main conclusion from our study is that under strict zero-shot conditions, meta-learning with instructions significantly improves performance. The improvements become more significant for the strong generalization task set and when the task difficulty level is hard (i.e., tasks where the LM struggles to generate correct outputs in a zero-shot setting). Moreover, meta-learning increases the effectiveness of instructions under all conditions. While both MAML and HNet models show improvements over the baselines, HNet (along with its MAML extension), by explicitly enforcing the use of instructions through task-specific conditioning of parameters, results in larger gains. In summary, the main contributions of the paper are two-fold. First, we adapt meta-learning approaches to MTIL. Second, we study their efficacy and show significant improvements under strict zero-shot conditions.
2 Related Work
Learning from instructions: An extension of basic prompt-based in-context learning is appending task-specific instructions to prompts. Several recent works, including FLAN (Wei et al., 2022), T0 (Sanh et al., 2022), and (Reif et al., 2021), train a large LM in a multi-task setting with instructions. InstructGPT (Ouyang et al., 2022) takes a slightly different approach by training the GPT-3 model (Brown et al., 2020) on a human-annotated dataset of demonstrations of desired user intents and using reinforcement learning to improve the model's ability to follow such instructions. Yet another direction, called pattern-exploiting training (PET) (Schick and Schütze, 2021a; Schick and Schütze, 2021), formulates instructions as cloze questions and shows that even small LMs can be good few-shot learners and work for language generation.
Meta-learning for language generation: Meta-learning has been applied in several language generation settings, such as (Lin and Lee, 2020) to induce persona in a chatbot, (Mi et al., 2019) for task-oriented dialog systems, (Gu et al., 2018) for low-resource machine translation, and (Chen and Shuai, 2021) for abstractive summarization in low-resource transfer learning, but these do not use instructions for zero-shot transfer. Our MTIL scenario is closely related to MetaICL (Min et al., 2022), which applies multi-task in-context learning in a k-shot setting for classification tasks, but differs in that MetaICL is a k-shot in-context scenario and does not use instructions or meta-learning optimization. While these works are related, to the best of our knowledge meta-learning has not been used to generalize to unseen generation tasks in zero-shot settings using instructions, and thus this paper provides several novel insights and approaches.
Hyper-Networks (HNet) in NLP applications: (Karimi Mahabadi et al., 2021) use HNets to train LMs in a multi-task setting with adapters, and (von Oswald et al., 2020) propose a continual learning framework with HNets conditioned on unique task IDs to reduce catastrophic forgetting. HNets have been used for input conditioning of a decoder in (Ivison and Peters, 2022), which produces a unique decoder for each input and is thus similar to our approach. However, these approaches are not strictly applicable to our zero-shot scenario or to general NLP tasks with task descriptions in natural language.
Language model editing: Our HNet-based approach builds on the architecture in (Cao et al., 2021), which uses it to edit factual knowledge in LMs. While the architecture is similar, we use the HNet to encode task-specific instructions and intend it for controlling task-level LM behavior, unlike the micro-behavior targeted in (Cao et al., 2021). Similar to ours and (Cao et al., 2021), Bayesian hyper-networks (Krueger et al., 2018) reduce the number of predicted parameters by constraining the HNet outputs to scale and shift parameters. (Sinitsin et al., 2020; Mitchell et al., 2022) propose meta-learning approaches for editing errors in a neural network, but these are not directly applicable to MTIL in a zero-shot setting.
MTIL: Finally, the work most closely related to this paper is the Tk-Instruct model from (Wang et al., 2022), which fine-tunes a T5 model (Raffel et al., 2020) with instructions and which we use as the baseline. We use the same dataset and training settings as Tk-Instruct but instead use the pretrained BART model (Lewis et al., 2020), as it is task-agnostic compared to T5 (T5 may not represent a true zero-shot setting). In addition, we enhance this model with meta-learning and consider significantly different training, evaluation, and model settings to test zero-shot generalization, resulting in unique contributions and conclusions orthogonal to the findings in (Wang et al., 2022).
3 Problem Setup
In this section we briefly outline the problem settings and baselines used in this paper.
3.1 Natural Instructions V2 Dataset
We use the Natural Instructions V2 (NIV2) dataset (Wang et al., 2022) [2] to investigate meta-learning approaches for instructional learning. NIV2 is a meta-dataset with over 1600 tasks.

[2] https://instructions.apps.allenai.org/

In NIV2, each task contains instructions and multiple training instances with inputs and outputs. The instructions consist of: 1) Categories (classification, summarization, etc.), 2) Short description (a short sentence about the task), 3) Long description (a detailed description of the task, similar to annotation guidelines), 4) Positive examples (inputs with correct outputs), 5) Negative examples (inputs with incorrect outputs), and 6) Explanations for the positive and negative examples.
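For concreteness, the sketch below shows what a single NIV2-style task record might look like in Python. The field names and values are illustrative assumptions on our part; the released JSON files should be consulted for the exact schema.

```python
# A minimal sketch of one NIV2-style task record. Field names here are
# illustrative; consult the dataset release for the exact JSON schema.
example_task = {
    "Categories": ["Summarization"],
    "Short Description": "Provide a short summary for the input.",
    "Definition": (
        "Given a news article, write a one-sentence summary that covers "
        "the main event, following the annotation guidelines."
    ),
    "Positive Examples": [
        {"input": "The city council approved ...", "output": "Council approves ..."},
    ],
    "Negative Examples": [
        {"input": "The city council approved ...", "output": "The weather was nice."},
    ],
    "Instances": [
        {"input": "Researchers announced ...", "output": "New study finds ..."},
    ],
}
```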
(Wang et al., 2022) train a pretrained T5 language model (Raffel et al., 2020) on input-output pairs with instructions appended before the input in a multi-task setting (Tk-Instruct). During testing, held-out unseen tasks are predicted by appending similar instructions to the test input. (Wang et al., 2022) provide detailed ablations and baseline comparisons with related models showing the impact of instructions. Following those results, we use only the task descriptions and positive examples in this study, as negative examples and explanations were not shown to make any positive contribution.
3.2 Baseline Model with Standard Training
Based on the results in (Wang et al., 2022), where Tk-Instruct was shown to comfortably beat the much larger T5, GPT-3, InstructGPT, and T0 models, we use the Tk-Instruct setting as our baseline, i.e., we train a pretrained encoder-decoder LM on multiple tasks with instructions. We also explored appending the instructions before the decoder sequence but did not find any improvements. However, we did observe that pre-pending a special prefix to the decoder (we use "[Output]:") improves overall prediction performance. We refer to this model as the standard training model.
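As a rough illustration of this setup, the sketch below serializes a task's instructions and one instance into encoder and decoder text. The exact separators and field names are our assumptions, not the paper's verbatim format; only the "[Output]:" decoder prefix is taken from the description above.

```python
def format_example(task, instance):
    """Build (encoder_text, decoder_text) for standard training.

    Hypothetical serialization: the task definition and positive examples
    are appended before the input, and the decoder target gets the special
    "[Output]:" prefix described above. Separators are illustrative.
    """
    parts = ["Definition: " + task["Definition"]]
    for ex in task.get("Positive Examples", []):
        parts.append("[Input]: " + ex["input"])
        parts.append("[Output]: " + ex["output"])
    parts.append("[Input]: " + instance["input"])
    encoder_text = "\n".join(parts)
    decoder_text = "[Output]: " + instance["output"]
    return encoder_text, decoder_text
```

The resulting text pair would then be tokenized and fed to the encoder-decoder LM as usual.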
For our base LM, we use the pretrained BART model (Lewis et al., 2020), as it is task-agnostic compared to T5 [3] and thus represents a stronger zero-shot setting. Interested readers should refer to (Wang et al., 2022) for detailed ablations specific to the NIV2 dataset and the T5 model.

[3] Publicly available T5 models are pretrained on a multi-task mixture of unsupervised and supervised tasks.
3.3 Evaluation Settings
We focus specifically on zero-shot generalization on generation tasks. While the general settings remain similar to (Wang et al., 2022), we consider some specific settings to illustrate the generalization capabilities of the models on different tasks.
For training, we use two sets of tasks: 1) all EN tasks in the NIV2 dataset and 2) generation tasks. For evaluation, we consider two sets of generation tasks with different zero-shot levels: 1) a weak generalization set using a random set of generation tasks with potential similarity to the training tasks, and 2) a strong generalization set using tasks from the summarization and title generation categories with no overlap with the training tasks. The list of evaluation tasks with short descriptions is provided in the appendix in Figures 11 and 12.
We further divide the evaluation tasks into difficulty levels of "easy", "medium", and "hard" based on the ROUGE scores of the baseline model (low scores indicate out-of-distribution and difficult tasks), to see to what extent meta-learning helps improve performance on out-of-distribution tasks.
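A minimal sketch of this bucketing, assuming per-task baseline ROUGE scores are available; the thresholds below are illustrative placeholders, not the paper's actual cut-offs.

```python
def split_by_difficulty(task_rouge, easy_thresh=40.0, hard_thresh=20.0):
    """Bucket tasks into easy/medium/hard by baseline ROUGE score.

    task_rouge: dict mapping task name -> baseline ROUGE score.
    Thresholds are hypothetical; the paper's cut-offs may differ.
    """
    buckets = {"easy": [], "medium": [], "hard": []}
    for task, score in task_rouge.items():
        if score >= easy_thresh:
            buckets["easy"].append(task)
        elif score >= hard_thresh:
            buckets["medium"].append(task)
        else:
            buckets["hard"].append(task)
    return buckets
```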
4 Meta-Learning with Instructions
Training on a large number of diverse tasks and testing on unseen tasks lends itself to the paradigm of learning-to-learn, or meta-learning, which has been successfully applied for generalization in both zero- and few-shot scenarios. Task metadata in the form of instructions can also provide discriminative information about the task process, in addition to the surface patterns of the input and output strings. We investigate whether meta-learning can aid such learning, and adapt three approaches to MTIL.
4.1 Standard Training + MAML
We adapt Model Agnostic Meta Learning (MAML) (Finn et al., 2017) to instructional learning of LMs as a way to generalize to unseen tasks by training on a large number of diverse tasks.
Standard training with MAML is described in Algorithm 1 in the appendix. At each training iteration, we sample two different sets of k tasks for the MAML meta-train and meta-test steps. We sample uniformly across tasks to maximize the diversity of tasks in each batch. The data format is the same as in standard training. Since we test under zero-shot conditions, we do not perform the test-time optimization typically employed in MAML.
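The sketch below shows one such meta-update in PyTorch under this sampling scheme. It is a simplified first-order approximation, not the paper's Algorithm 1: we assume a single inner gradient step, a first-order (FOMAML) outer update, and a user-supplied loss_fn that runs a forward pass over one task batch.

```python
import copy
import torch

def fomaml_step(model, meta_train_batches, meta_test_batches,
                loss_fn, meta_opt, inner_lr=1e-4):
    """One meta-update over k sampled tasks (first-order approximation).

    meta_train_batches / meta_test_batches: lists with one batch per
    sampled task. loss_fn(model, batch) runs a forward pass and returns
    the LM loss. The single inner step and hyper-parameters are
    illustrative; the paper's Algorithm 1 may differ.
    """
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for train_batch, test_batch in zip(meta_train_batches, meta_test_batches):
        fast = copy.deepcopy(model)  # task-specific "fast" weights
        # Inner (meta-train) step: one SGD update on the task batch.
        grads = torch.autograd.grad(loss_fn(fast, train_batch),
                                    fast.parameters())
        with torch.no_grad():
            for p, g in zip(fast.parameters(), grads):
                p -= inner_lr * g
        # Outer (meta-test) loss, evaluated with the adapted weights.
        test_grads = torch.autograd.grad(loss_fn(fast, test_batch),
                                         fast.parameters())
        for mg, g in zip(meta_grads, test_grads):
            mg += g / len(meta_train_batches)
    # First-order MAML: apply the meta-test gradients to the original model.
    meta_opt.zero_grad()
    for p, mg in zip(model.parameters(), meta_grads):
        p.grad = mg
    meta_opt.step()
```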
4.2 Standard Training + HNet
Neither standard nor MAML training explicitly enforces the use of instructions during decoding. The model can thus minimize the loss simply by ignoring the instruction part of the encoder input and attending only to the input and output texts. This can lead to sub-optimal use of the instructions.
Figure 1: Encoding instructions using a hyper-network. [Figure shows the instructions Iτ (task description, positive examples, and the test input x) encoded by the HNet-LM (BART); per-layer feed-forward heads FFN1, ..., FFNn (HNet-FF) map its hidden states H1, ..., Hn to offsets Δθ1τ, ..., Δθnτ, which are added to the Main-LM (BART) parameters as θτ = θ0 + Δθτ before decoding the output y.]
We propose a hyper-network (HNet) based architecture (Ha et al., 2017) to produce task-specific model parameters conditioned on the instructions. The HNet architecture consists of an auxiliary LM (the HNet-LM) alongside the Main-LM. The HNet-LM produces task-specific parameters for the Main-LM by conditioning on the instructions.
In particular, we adapt a specific type of HNet architecture from (Cao et al., 2021) that predicts delta-parameters for the Main-LM, which are then added to the Main-LM parameters to produce the task-specific LM. This preserves the parameters of the Main-LM, utilizing the shared generation capability of the LM while specializing in task-specific behavior. However, there are some specific differences based on our requirements for instructional training, which are described next.
4.2.1 The HNet Language Model (HNet-LM)
Since the input to the HNet model is text, we use a pretrained encoder-decoder LM (BART in this paper) to encode the instructions [4] and use the decoder's hidden states to condition the layer-specific parameters of the Main-LM.
In (Cao et al., 2021), the last hidden state of an LSTM is used to condition the parameters of the main model. To increase the effective bandwidth of the HNet while keeping the number of parameters the same, we use the last N hidden states (one for each of the N layers of the main LM). This simple trick allows the model to independently attend to and condition each layer on the input instructions while still keeping the model parameters the same.
The HNet-LM takes the instruction Iτ for a task τ and a sequence of decoder indexes dn as input and produces N hidden states hτ(n). The decoder index sequence we use is simply 1, ..., n. The decoder indexes provide different inputs to the decoder to influence the generation of distinct parameters for each layer of the main LM. This is not strictly re-

[4] In contrast, (Cao et al., 2021) used an untrained LSTM.
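A minimal PyTorch sketch of this conditioning scheme, under several assumptions of ours: the HNet-LM is a Hugging Face-style seq2seq model whose decoder accepts the index sequence 1..N as decoder_input_ids, and each of the N last decoder hidden states drives a per-layer feed-forward head that emits that layer's parameter offset. The full-size linear heads here are purely illustrative; in practice the offsets would be constrained (e.g., to scale-and-shift parameters, as in Krueger et al., 2018) to keep the heads small.

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Sketch of the HNet-LM conditioning described above.

    Assumes `hnet_lm` is a Hugging Face-style seq2seq model (e.g. BART)
    and that decoder position i of N conditions the offset for main-LM
    layer i. Dimensions and the offset parameterization are assumptions.
    """

    def __init__(self, hnet_lm, hidden_dim, layer_param_sizes):
        super().__init__()
        self.hnet_lm = hnet_lm
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, size) for size in layer_param_sizes
        )

    def forward(self, instruction_ids, instruction_mask):
        n_layers = len(self.heads)
        # Decoder index sequence 1..N: one position per Main-LM layer.
        dec_idx = torch.arange(1, n_layers + 1,
                               device=instruction_ids.device).unsqueeze(0)
        out = self.hnet_lm(
            input_ids=instruction_ids,
            attention_mask=instruction_mask,
            decoder_input_ids=dec_idx,
            output_hidden_states=True,
        )
        # Last decoder layer, first batch element: (N, hidden_dim).
        h = out.decoder_hidden_states[-1][0]
        # One flattened parameter offset per Main-LM layer.
        return [head(h[i]) for i, head in enumerate(self.heads)]
```

At prediction time, each offset would be reshaped and added to the corresponding Main-LM layer's parameters to obtain the task-specific model, θτ = θ0 + Δθτ.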