Continual Training of Language Models for Few-Shot Learning
Zixuan Ke1, Haowei Lin2, Yijia Shao2, Hu Xu1, Lei Shu1 and Bing Liu1
1Department of Computer Science, University of Illinois at Chicago
2Wangxuan Institute of Computer Technology, Peking University
1{zke4,hxu48,liub}@uic.edu
2{linhaowei, shaoyj}@pku.edu.cn
Abstract
Recent work on applying large language models (LMs) achieves impressive performance in many NLP applications. Adapting or post-training an LM using an unlabeled domain corpus can produce even better performance for end-tasks in the domain. This paper proposes the problem of continually extending an LM by incrementally post-training it with a sequence of unlabeled domain corpora to expand its knowledge without forgetting its previous skills. The goal is to improve few-shot end-task learning in these domains. The resulting system is called CPT (Continual Post-Training), which, to our knowledge, is the first continual post-training system. Experimental results verify its effectiveness.1
1 Introduction
Recent work has shown that large LMs have the ability to perform few-shot (or even zero-shot) learning well (Brown et al., 2020b; Rae et al., 2021; Smith et al., 2022). Post-training (a.k.a. domain-adaptive pre-training or pre-finetuning) an LM with a large unlabeled domain corpus before end-task fine-tuning in the domain achieves better results (Xu et al., 2019; Gururangan et al., 2020a) than directly fine-tuning the LM. This paper goes a step further and studies the problem of improving an LM's ability to handle new and ever-emerging domains. For this, one needs to continually post-train the LM with a sequence of domains. A key issue associated with this problem is catastrophic forgetting (CF).2 This paper thus investigates how to continually extend the LM's knowledge without suffering from CF. From a broader perspective, since training a large LM from scratch is extremely expensive and computation-intensive, incrementally updating the LM with the latest language data, reflecting the ever-changing development of the language itself, social events, and the knowledge from different fields, is becoming more and more critical. As humans are very effective at incremental learning, if we can imitate this human capability with little or no forgetting, we will push AI research forward significantly.

Lei Shu is now at Google Research (leishu@google.com).
1 https://github.com/UIC-Liu-Lab/CPT
2 CF means that learning a new task/domain may need to modify the existing network, which degrades the performance of previous tasks/domains (McCloskey and Cohen, 1989).
The proposed system, called CPT, is a continual learning (CL) system for post-training. Starting from a pre-trained LM (e.g., RoBERTa (Liu et al., 2019b)), it incrementally post-trains the LM with a sequence of domains using their unlabeled corpora. Once a task (a domain in our case)3 is trained, its data is no longer accessible. At any time, the resulting continually post-trained LM can be used by end-tasks in the trained domains. This is in the task-incremental learning (TIL) setting of CL, where the task id (domain id in our case) is provided when the learned model of a task needs to be used later (the use of the domain id is discussed in Sec. 2.1).4 This paper proposes an effective approach called CPT and focuses on the challenging and practical scenario of few-shot end-task learning after post-training on a sequence of domains.
Continual post-training is different from conventional CL (Chen and Liu, 2018). The key difference is that in conventional CL, each task is an end-task, but in our case the end-task involves fine-tuning the continually post-trained LM (called p-LM). This causes major forgetting, which we call the catastrophic butterfly effect (CBE), and which does not happen in conventional CL. Our proposed system, CPT, can solve both CF and CBE based on a novel hard masking mechanism (Sec. 2.2) and can achieve no forgetting. As shown in Sec. 3.3, existing CL systems, when naively applied, cannot effectively prevent CF (even though some existing techniques have shown almost perfect CF prevention ability in conventional CL).

Experiments in 4 domains and their corresponding end-tasks demonstrate the effectiveness of the proposed CPT system.

3 We will use the term domain in this paper to be consistent with the post-training literature.
4 CL has two other settings: class-incremental learning and domain-incremental learning (van de Ven and Tolias, 2019).
Related Work. Overcoming CF is a major goal of CL (Chen and Liu, 2018). There are many existing approaches, e.g., regularization-based approaches (Kirkpatrick et al., 2016; Seff et al., 2017), replay-based approaches (Rebuffi et al., 2017; Lopez-Paz and Ranzato, 2017) and parameter-isolation based approaches (Serrà et al., 2018; Fernando et al., 2017). Our CPT is based on parameter isolation and uses masks in continual post-training.

Recently, CL has drawn attention in NLP. It has been used for slot filling (Shen et al., 2019), language learning (Li et al., 2019), sentence embedding (Liu et al., 2019a), translation (Khayrallah et al., 2018), cross-lingual modeling (Liu et al., 2020b), question answering (Greco et al., 2019) and text classification (Ke et al., 2021a,b; Sun et al., 2020; Huang et al., 2021; Chuang et al., 2020; Mehta et al., 2021; Madotto et al., 2020). However, none of them tries to improve an LM.

CPT is closely related to ELLE (Qin et al., 2022), which does continual pre-training. The key difference is that ELLE starts from random initialization, while our CPT starts from a pre-trained LM. We tried to adapt ELLE for continual post-training by learning from a pre-trained RoBERTa, but it failed to converge. This also indicates that it is non-trivial to do well in our setting. Readers can refer to Appendix A for full coverage of the related work.
2 Proposed CPT System
CPT continually post-trains RoBERTa (Liu et al., 2019b). This is achieved by two continual learning plug-in (called CL-plugin) modules inserted into each transformer layer of RoBERTa. The CL-plugin is inspired by adapters in (Houlsby et al., 2019). While adapters can isolate different tasks, one needs to allocate a new adapter for each task and no knowledge can be shared among different tasks' adapters. The CL-plugin, however, is a CL system that learns a sequence of tasks with adapters shared by all domains. Figure 1 gives the CPT architecture with two CL-plugins added to RoBERTa.
Figure 1: Architecture of CPT, which has two CL-plugins inserted in the transformer layers of RoBERTa in a parallel manner (FFN: feed-forward network). (A) CPT for continual post-training. It uses a masked language model (MLM) head for unsupervised post-training of the CL-plugins only. (B) CPT for individual fine-tuning. CPT is evaluated by the corresponding individual end-task performance of all post-trained tasks. Each CL-plugin has numbers and colors indicating its masks and is illustrated in Appendix B.

Sequential vs. Parallel CL-plugin. Instead of following the original sequential adapter (Houlsby et al., 2019), the CL-plugin adopts the parallel adapter idea in (He et al., 2021). The difference is that the former inserts an adapter after the FFN/attention layer, while the latter inserts it before the FFN/attention layer (see Fig. 1). We choose the parallel version as it performs better (see Sec. 3.3).
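To make the parallel placement concrete, the following minimal PyTorch sketch (not the authors' released code; the module names, the post-LayerNorm residual pattern, and the omission of attention masks and dropout are simplifying assumptions) wires two plugin modules into one transformer layer so that each reads the same input as the attention/FFN sub-layer and its output is added to that sub-layer's output:

```python
import torch.nn as nn

class LayerWithParallelPlugins(nn.Module):
    """One transformer layer with two CL-plugins added in parallel: each plugin
    sees the same input as the sub-layer it accompanies, and its output is
    summed with that sub-layer's output before the residual/LayerNorm."""

    def __init__(self, attn, ffn, attn_plugin, ffn_plugin, hidden_size=768):
        super().__init__()
        self.attn, self.ffn = attn, ffn                              # frozen RoBERTa modules
        self.attn_plugin, self.ffn_plugin = attn_plugin, ffn_plugin  # trainable CL-plugins
        self.ln1 = nn.LayerNorm(hidden_size)
        self.ln2 = nn.LayerNorm(hidden_size)

    def forward(self, x, task_id):
        # plugin placed in parallel with the attention sub-layer
        h = self.ln1(x + self.attn(x) + self.attn_plugin(x, task_id))
        # plugin placed in parallel with the feed-forward (FFN) sub-layer
        return self.ln2(h + self.ffn(h) + self.ffn_plugin(h, task_id))
```

With this wiring, freezing `attn` and `ffn` while training only the plugins matches the post-training regime described next.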
In post-training, only the two CL-plugins are trained; the components of the original pre-trained RoBERTa are fixed. In end-task fine-tuning, all components are trainable. A CL-plugin is a two-layer fully connected network with a task mask mechanism. It takes two inputs: (1) the hidden states $h^{(t)}$ from the feed-forward layer in a transformer layer, and (2) the task ID $t$ needed by task-incremental learning (TIL). Inside a CL-plugin, task masks (TMs), which indicate task-specific neurons, are used to deal with CF. Since the TMs are differentiable, the whole CPT can be trained end-to-end.
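A minimal sketch of such a CL-plugin, assuming a RoBERTa hidden size of 768 and an illustrative bottleneck of 128 units (the mask computation anticipates Eqs. 1 and 2 of Sec. 2.1; names such as `CLPlugin` and `task_embed` are ours, not the paper's):

```python
import torch
import torch.nn as nn

class CLPlugin(nn.Module):
    """Illustrative CL-plugin: a two-layer bottleneck network whose hidden
    units are gated by a differentiable per-task mask (Sec. 2.1)."""

    def __init__(self, hidden_size=768, bottleneck=128, num_tasks=4):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # first fully connected layer
        self.up = nn.Linear(bottleneck, hidden_size)     # second fully connected layer
        # one trainable task-ID embedding e_l^(t) per task/domain
        self.task_embed = nn.Embedding(num_tasks, bottleneck)

    def forward(self, hidden_states, task_id, tau=1.0):
        k = torch.relu(self.down(hidden_states))             # hidden units k_l^(t)
        m = torch.sigmoid(self.task_embed(task_id) / tau)    # task mask m_l^(t), Eq. (1)
        return self.up(k * m)                                # element-wise gating, Eq. (2)
```

Here `task_id` is expected as a LongTensor index so that `nn.Embedding` can return the task embedding $e_l^{(t)}$.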
2.1 Task Masks (TMs)
In each layer of a CL-plugin, task masks are used to protect those neurons that are important for previous tasks to overcome CF. The masks basically forbid gradient updates to those neurons during backpropagation when learning a new task. Note that a task is also a domain in our case.

Learning a new task/domain consists of two main steps: (1) apply the mask in each layer for each old task to block off the gradient flow and protect the model for the old task, and (2) learn domain $t$ and its masks for future use. We present (2) first.
Learning Task Masks for Overcoming CF. In learning each task $t$, a mask (a "soft" binary mask) $m_l^{(t)}$ is trained for the task at each layer $l$ of the CL-plugin, indicating the neurons that are important for the task. We borrow the hard attention idea in (Serrà et al., 2018) and leverage the task ID embedding to train the mask. For a task ID $t$, its embedding $e_l^{(t)}$ consists of differentiable deterministic parameters that can be learned together with the other parts of the network. To generate the task mask $m_l^{(t)}$ from $e_l^{(t)}$, Sigmoid is used as a pseudo-gate (mask) function. $m_l^{(t)}$ is computed with

$$ m_l^{(t)} = \sigma\!\left(e_l^{(t)} / \tau\right), \quad (1) $$

where $\tau$ is a temperature variable, linearly annealed from 1 to $\tau_{\min}$ (a small positive value).

In the forward pass, given the output $k_l^{(t)}$ of each layer $l$, we element-wise multiply it by the mask $m_l^{(t)}$,

$$ o_l^{(t)} = k_l^{(t)} \otimes m_l^{(t)}. \quad (2) $$

The masked output $o_l^{(t)}$ of the last layer in the CL-plugin is fed to the next layer of RoBERTa with a skip-connection. After learning task $t$, the final $m_l^{(t)}$ is saved and added to the set $\{m_l^{(t)}\}$.
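The annealing schedule and mask bookkeeping could look like the hypothetical training-loop fragment below (it reuses the illustrative `CLPlugin` sketched earlier; `mlm_loss` stands in for the masked-language-model objective and is not a real API):

```python
import torch

def post_train_domain(plugin, task_id, dataloader, optimizer,
                      tau_max=1.0, tau_min=0.01):
    """Hypothetical post-training loop for one domain with linear temperature
    annealing; `plugin` is the illustrative CLPlugin module sketched above."""
    t = torch.tensor(task_id)
    num_steps = len(dataloader)
    for step, batch in enumerate(dataloader):
        # linearly anneal tau from tau_max (=1) down to tau_min over the domain
        tau = tau_max - (tau_max - tau_min) * step / max(num_steps - 1, 1)
        loss = mlm_loss(batch, plugin, t, tau)  # placeholder MLM objective (not a real API)
        optimizer.zero_grad()
        loss.backward()                         # old-task masks condition gradients here (Eq. 3)
        optimizer.step()
    # after the domain is learned, save its (near-binary) mask for future protection
    with torch.no_grad():
        return torch.sigmoid(plugin.task_embed(t) / tau_min)
```

The returned mask is appended to the set $\{m_l^{(t)}\}$ and used, as described next, to protect the neurons it selects.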
Applying Task Masks. Before learning a new task $t$, we first accumulate and set the masks $m_l^{(i_{prev})}$ on the neurons in each layer $l$ for all old tasks $i_{prev}$, so that in backpropagation the gradient $g_l^{(t)}$ for task $t$ will not flow to these neurons. Since $m_l^{(i_{prev})}$ is pseudo-binary, we use max-pooling to achieve the accumulation and condition the gradient:

$$ g_l^{\prime(t)} = g_l^{(t)} \otimes \left(1 - \text{MaxPool}\!\left(\{m_l^{(i_{prev})}\}\right)\right). \quad (3) $$

Those gradients corresponding to the 1 entries in $\text{MaxPool}(\{m_l^{(i_{prev})}\})$ are set to 0 (to block off gradient flow) while the others remain unchanged. In this way, neurons used by old tasks are protected. Note that we expand (copy) the accumulated mask vector to match the dimensions of $g_l^{(t)}$.
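A minimal sketch of this gradient conditioning, applied to the plugin's hidden units when post-training a new domain (the `register_hook` approach and the function name are our illustrative choices; a real implementation may instead condition the weight gradients directly):

```python
import torch

def apply_old_task_protection(k, prev_masks):
    """Sketch of Eq. (3): stop the backward gradient of the plugin's hidden
    units from flowing into neurons claimed by previous tasks.

    k          -- activations k_l^(t) of one CL-plugin layer (requires grad)
    prev_masks -- list of saved masks m_l^(i_prev), one tensor per old task
    """
    if not prev_masks:
        return k
    # accumulate the old masks with an element-wise max over tasks ("MaxPool")
    accumulated = torch.stack(prev_masks).max(dim=0).values
    # g'_l = g_l * (1 - accumulated); with the hard masks of Sec. 2.2 the factor
    # is exactly 0 for protected neurons rather than a very small number
    k.register_hook(lambda grad: grad * (1.0 - accumulated))
    return k
```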
2.2 Catastrophic Butterfly Effect in Fine-tuning
To perform an end-task in a post-trained domain, we fine-tune the mask-protected model of the domain, which is indicated by the task/domain id. The fine-tuning uses the corresponding domain neurons for the specific end-task by setting $\tau = \tau_{\min}$ and conditioning the output via Eq. 2. With the masks, there should be no forgetting in continual post-training, and the end-task fine-tuning performance should be similar to post-training each domain separately. However, we found that this is not the case.5

Our investigation found that the problem is due to the pseudo-gate function in Eq. 1. No matter how small $\tau$ is, Eq. 1 can only give us a mask that is almost (but not exactly) 0 or 1. This causes the following: (1) during post-training, the gradients for used neurons in Eq. 3 are not exactly 0 but a very small number; (2) during fine-tuning, we cannot make use of the corresponding neurons for the specific end-task by simply setting $\tau = \tau_{\min}$. The small change to the neurons of old domains during post-training caused by (1) is negligible in conventional CL, because in conventional CL we evaluate the model using test sets and no weight updates are involved. However, in CPT, the end-task needs to fine-tune the continually post-trained LM model (p-LM), which involves weight updating. A small change to the p-LM during continual post-training can result in a different initialization for end-task fine-tuning and give totally different fine-tuning results. We call this the butterfly effect, inspired by the term indicating that a small state change in nature (e.g., the flap of a butterfly's wings in Brazil) can result in large differences in a later state (e.g., a tornado in Texas).
We propose a simple method to solve this, i.e., adding a threshold $\theta$ to $m_l^{(t)}$ to make it a hard binary mask,

$$ m_l^{(t)} = \begin{cases} 1, & m_l^{(t)} > \theta, \\ 0, & m_l^{(t)} \le \theta. \end{cases} \quad (4) $$

We then apply it to Eq. 3 in gradient manipulation and to Eq. 2 in end-task fine-tuning. $\theta$ can be easily set (we use 0.5) since Eq. 1 already gives a pseudo-binary mask. Note that this has almost no effect on post-training as it is used to block the backward gradient flow.
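A minimal sketch of this hard-mask fix, with the threshold set to 0.5 as in the text (the function name is illustrative):

```python
import torch

def harden_mask(soft_mask, theta=0.5):
    """Eq. (4): binarize the pseudo-binary sigmoid mask with threshold theta.
    The hard mask is then used in Eq. (3) during post-training and in Eq. (2)
    during end-task fine-tuning, so gating is exactly 0 or 1."""
    return (soft_mask > theta).float()

# example: a saved soft mask for one CL-plugin layer
soft = torch.sigmoid(torch.randn(128) / 0.01)   # values very close to 0 or 1
hard = harden_mask(soft)                        # exact 0/1 entries
```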
5 For example, fine-tuning an end-task of restaurant sentiment classification achieves a macro-F1 (MF1) of 0.64 right after post-training on the restaurant domain, but its fine-tuning MF1 drops to 0.44 after post-training on three more domains.