Continual Training of Language Models for Few-Shot Learning
Zixuan Ke1, Haowei Lin2, Yijia Shao2, Hu Xu1, Lei Shu1 and Bing Liu1
1Department of Computer Science, University of Illinois at Chicago
2Wangxuan Institute of Computer Technology, Peking University
1{zke4,hxu48,liub}@uic.edu
2{linhaowei, shaoyj}@pku.edu.cn
Abstract
Recent work on applying large language models (LMs) achieves impressive performance in many NLP applications. Adapting or post-training an LM using an unlabeled domain corpus can produce even better performance for end-tasks in the domain. This paper proposes the problem of continually extending an LM by incrementally post-training it with a sequence of unlabeled domain corpora to expand its knowledge without forgetting its previous skills. The goal is to improve few-shot end-task learning in these domains. The resulting system is called CPT (Continual Post-Training), which, to our knowledge, is the first continual post-training system. Experimental results verify its effectiveness.1
1 Introduction
Recent work has shown that large LMs have the ability to perform few-shot (or even zero-shot) learning well (Brown et al., 2020b; Rae et al., 2021; Smith et al., 2022). Post-training (a.k.a. domain-adaptive pre-training or pre-finetuning) an LM with a large unlabeled domain corpus before end-task fine-tuning in the domain achieves better results (Xu et al., 2019; Gururangan et al., 2020a) than directly fine-tuning the LM. This paper goes a step further and studies the problem of improving an LM's ability to handle new and ever-emerging domains. For this, one needs to continually post-train the LM with a sequence of domains. A key issue associated with this problem is catastrophic forgetting (CF).2 This paper thus investigates how to continually extend the LM's knowledge without suffering from CF. From a broader perspective, since training a large LM from scratch is extremely expensive and computation-intensive, incrementally updating the LM with the latest language data, reflecting the ever-changing development of the language itself, social events, and the knowledge from different fields, is becoming more and more critical. As humans are very effective at incremental learning, if we can imitate this human capability with little or no forgetting, we will push AI research forward significantly.

Lei Shu is now at Google Research (leishu@google.com).
1 https://github.com/UIC-Liu-Lab/CPT
2 CF means that learning a new task/domain may need to modify the existing network, which degrades the performance of previous tasks/domains (McCloskey and Cohen, 1989).
The proposed system, called CPT, is a continual learning (CL) system for post-training. Starting from a pre-trained LM (e.g., RoBERTa (Liu et al., 2019b)), it incrementally post-trains the LM with a sequence of domains using their unlabeled corpora. Once a task (a domain in our case)3 is trained, its data is no longer accessible. At any time, the resulting continually post-trained LM can be used by end-tasks in the trained domains. This is in the task-incremental learning (TIL) setting of CL, where the task id (domain id in our case) is provided when the learned model of a task needs to be used later (the use of the domain id is discussed in Sec. 2.1).4 This paper proposes an effective approach called CPT and focuses on the challenging and practical scenario of few-shot end-task learning after post-training on a sequence of domains.
Continual post-training is different from conventional CL (Chen and Liu, 2018). The key difference is that in conventional CL, each task is an end-task, but in our case the end-task involves fine-tuning the continually post-trained LM (called p-LM). This causes major forgetting, which we call the catastrophic butterfly effect (CBE), and which does not happen in conventional CL. Our proposed system, CPT, can solve both CF and CBE based on a novel hard masking mechanism (Sec. 2.2) and can achieve no forgetting. As shown in Sec. 3.3, existing CL systems, when naively applied, cannot effectively prevent CF (even though some existing techniques have shown almost perfect CF prevention ability in conventional CL).

Experiments in 4 domains and their corresponding end-tasks demonstrate the effectiveness of the proposed CPT system.

3 We will use the term domain in this paper to be consistent with the post-training literature.
4 CL has two other settings: class-incremental learning and domain-incremental learning (van de Ven and Tolias, 2019).
Related Work. Overcoming CF is a major goal of CL (Chen and Liu, 2018). There are many existing approaches, e.g., regularization-based approaches (Kirkpatrick et al., 2016; Seff et al., 2017), replay-based approaches (Rebuffi et al., 2017; Lopez-Paz and Ranzato, 2017) and parameter-isolation based approaches (Serrà et al., 2018; Fernando et al., 2017). Our CPT is based on parameter isolation and uses masks in continual post-training.

Recently, CL has drawn attention in NLP. It has been used for slot filling (Shen et al., 2019), language learning (Li et al., 2019), sentence embedding (Liu et al., 2019a), translation (Khayrallah et al., 2018), cross-lingual modeling (Liu et al., 2020b), question answering (Greco et al., 2019) and text classification (Ke et al., 2021a,b; Sun et al., 2020; Huang et al., 2021; Chuang et al., 2020; Mehta et al., 2021; Madotto et al., 2020). However, none of them tries to improve an LM.

CPT is closely related to ELLE (Qin et al., 2022), which does continual pre-training. The key difference is that ELLE starts from random initialization, while our CPT starts from a pre-trained LM. We tried to adapt ELLE for continual post-training by learning from a pre-trained RoBERTa, but it failed to converge. This also indicates that it is non-trivial to do well in our setting. Readers can refer to Appendix A for full coverage of the related work.
2 Proposed CPT System
CPT continually post-trains RoBERTa (Liu et al., 2019b). This is achieved by two continual learning plug-in (called CL-plugin) modules inserted into each transformer layer of RoBERTa. The CL-plugin is inspired by adapters in (Houlsby et al., 2019). While adapters can isolate different tasks, one needs to allocate a new adapter for each task and no knowledge can be shared among different tasks' adapters. The CL-plugin, however, is a CL system that learns a sequence of tasks with adapters shared by all domains. Figure 1 gives the CPT architecture with two CL-plugins added to RoBERTa.
Figure 1: Architecture of CPT, which has two CL-plugins inserted in the transformer layers of RoBERTa in a parallel manner (FFN: feed-forward network). (A) CPT for continual post-training. It uses a masked language model (MLM) head for unsupervised post-training of the CL-plugins only. (B) CPT for individual fine-tuning. CPT is evaluated by the corresponding individual end-task performance of all post-trained tasks. Each CL-plugin has numbers and colors indicating its masks and is illustrated in Appendix B.

Sequential vs. Parallel CL-plugin. Instead of following the original sequential adapter (Houlsby et al., 2019), the CL-plugin adopts the parallel adapter idea in (He et al., 2021). The difference is that the former inserts an adapter after the FFN/attention layer, while the latter inserts it before the FFN/attention layer (see Fig. 1). We choose the parallel version as it performs better (see Sec. 3.3).
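To make the parallel placement concrete, the following minimal PyTorch sketch (not the authors' released code; the module names, the post-LayerNorm residual pattern, and the omission of attention masks and dropout are simplifying assumptions) wires two plugin modules into one transformer layer so that each reads the same input as the attention/FFN sub-layer and its output is added to that sub-layer's output:

```python
import torch.nn as nn

class LayerWithParallelPlugins(nn.Module):
    """One transformer layer with two CL-plugins added in parallel: each plugin
    sees the same input as the sub-layer it accompanies, and its output is
    summed with that sub-layer's output before the residual/LayerNorm."""

    def __init__(self, attn, ffn, attn_plugin, ffn_plugin, hidden_size=768):
        super().__init__()
        self.attn, self.ffn = attn, ffn                              # frozen RoBERTa modules
        self.attn_plugin, self.ffn_plugin = attn_plugin, ffn_plugin  # trainable CL-plugins
        self.ln1 = nn.LayerNorm(hidden_size)
        self.ln2 = nn.LayerNorm(hidden_size)

    def forward(self, x, task_id):
        # plugin placed in parallel with the attention sub-layer
        h = self.ln1(x + self.attn(x) + self.attn_plugin(x, task_id))
        # plugin placed in parallel with the feed-forward (FFN) sub-layer
        return self.ln2(h + self.ffn(h) + self.ffn_plugin(h, task_id))
```

With this wiring, freezing `attn` and `ffn` while training only the plugins matches the post-training regime described next.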
In post-training, only the two CL-plugins are trained; the components of the original pre-trained RoBERTa are fixed. In end-task fine-tuning, all components are trainable. A CL-plugin is a two-layer fully connected network with a task mask mechanism. It takes two inputs: (1) the hidden states $h^{(t)}$ from the feed-forward layer in a transformer layer, and (2) the task ID $t$ needed by task-incremental learning (TIL). Inside a CL-plugin, task masks (TMs), which indicate task-specific neurons, are used to deal with CF. Since the TMs are differentiable, the whole CPT can be trained end-to-end.
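A minimal sketch of such a CL-plugin, assuming a RoBERTa hidden size of 768 and an illustrative bottleneck of 128 units (the mask computation anticipates Eqs. 1 and 2 of Sec. 2.1; names such as `CLPlugin` and `task_embed` are ours, not the paper's):

```python
import torch
import torch.nn as nn

class CLPlugin(nn.Module):
    """Illustrative CL-plugin: a two-layer bottleneck network whose hidden
    units are gated by a differentiable per-task mask (Sec. 2.1)."""

    def __init__(self, hidden_size=768, bottleneck=128, num_tasks=4):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # first fully connected layer
        self.up = nn.Linear(bottleneck, hidden_size)     # second fully connected layer
        # one trainable task-ID embedding e_l^(t) per task/domain
        self.task_embed = nn.Embedding(num_tasks, bottleneck)

    def forward(self, hidden_states, task_id, tau=1.0):
        k = torch.relu(self.down(hidden_states))             # hidden units k_l^(t)
        m = torch.sigmoid(self.task_embed(task_id) / tau)    # task mask m_l^(t), Eq. (1)
        return self.up(k * m)                                # element-wise gating, Eq. (2)
```

Here `task_id` is expected as a LongTensor index so that `nn.Embedding` can return the task embedding $e_l^{(t)}$.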
2.1 Task Masks (TMs)
In each layer of a CL-plugin, task masks are used to protect those neurons that are important for previous tasks to overcome CF. The masks basically forbid gradient updates to those neurons during backpropagation when learning a new task. Note that a task is also a domain in our case.

Learning a new task/domain consists of two main steps: (1) apply the mask in each layer for each old task to block off the gradient flow and protect the model for the old task, and (2) learn domain $t$ and its masks for future use. We present (2) first.
Learning Task Masks for Overcoming CF. In learning each task $t$, a mask (a "soft" binary mask) $m_l^{(t)}$ is trained for the task at each layer $l$ of the CL-plugin, indicating the neurons that are important for the task. We borrow the hard attention idea in (Serrà et al., 2018) and leverage the task ID embedding to train the mask. For a task ID $t$, its embedding $e_l^{(t)}$ consists of differentiable deterministic parameters that can be learned together with the other parts of the network. To generate the task mask $m_l^{(t)}$ from $e_l^{(t)}$, Sigmoid is used as a pseudo-gate (mask) function. $m_l^{(t)}$ is computed with

$$ m_l^{(t)} = \sigma\!\left(e_l^{(t)} / \tau\right), \quad (1) $$

where $\tau$ is a temperature variable, linearly annealed from 1 to $\tau_{\min}$ (a small positive value).

In the forward pass, given the output $k_l^{(t)}$ of each layer $l$, we element-wise multiply it by the mask $m_l^{(t)}$,

$$ o_l^{(t)} = k_l^{(t)} \otimes m_l^{(t)}. \quad (2) $$

The masked output $o_l^{(t)}$ of the last layer in the CL-plugin is fed to the next layer of RoBERTa with a skip-connection. After learning task $t$, the final $m_l^{(t)}$ is saved and added to the set $\{m_l^{(t)}\}$.
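The annealing schedule and mask bookkeeping could look like the hypothetical training-loop fragment below (it reuses the illustrative `CLPlugin` sketched earlier; `mlm_loss` stands in for the masked-language-model objective and is not a real API):

```python
import torch

def post_train_domain(plugin, task_id, dataloader, optimizer,
                      tau_max=1.0, tau_min=0.01):
    """Hypothetical post-training loop for one domain with linear temperature
    annealing; `plugin` is the illustrative CLPlugin module sketched above."""
    t = torch.tensor(task_id)
    num_steps = len(dataloader)
    for step, batch in enumerate(dataloader):
        # linearly anneal tau from tau_max (=1) down to tau_min over the domain
        tau = tau_max - (tau_max - tau_min) * step / max(num_steps - 1, 1)
        loss = mlm_loss(batch, plugin, t, tau)  # placeholder MLM objective (not a real API)
        optimizer.zero_grad()
        loss.backward()                         # old-task masks condition gradients here (Eq. 3)
        optimizer.step()
    # after the domain is learned, save its (near-binary) mask for future protection
    with torch.no_grad():
        return torch.sigmoid(plugin.task_embed(t) / tau_min)
```

The returned mask is appended to the set $\{m_l^{(t)}\}$ and used, as described next, to protect the neurons it selects.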
Applying Task Masks. Before learning a new task $t$, we first accumulate and set the masks $m_l^{(i_{prev})}$ on the neurons in each layer $l$ for all old tasks $i_{prev}$, so that in backpropagation the gradient $g_l^{(t)}$ for task $t$ will not flow to these neurons. Since $m_l^{(i_{prev})}$ is pseudo-binary, we use max-pooling to achieve the accumulation and condition the gradient:

$$ g_l^{\prime(t)} = g_l^{(t)} \otimes \left(1 - \text{MaxPool}\!\left(\{m_l^{(i_{prev})}\}\right)\right). \quad (3) $$

Those gradients corresponding to the 1 entries in $\text{MaxPool}(\{m_l^{(i_{prev})}\})$ are set to 0 (to block off gradient flow) while the others remain unchanged. In this way, neurons used by old tasks are protected. Note that we expand (copy) the accumulated mask vector to match the dimensions of $g_l^{(t)}$.
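A minimal sketch of this gradient conditioning, applied to the plugin's hidden units when post-training a new domain (the `register_hook` approach and the function name are our illustrative choices; a real implementation may instead condition the weight gradients directly):

```python
import torch

def apply_old_task_protection(k, prev_masks):
    """Sketch of Eq. (3): stop the backward gradient of the plugin's hidden
    units from flowing into neurons claimed by previous tasks.

    k          -- activations k_l^(t) of one CL-plugin layer (requires grad)
    prev_masks -- list of saved masks m_l^(i_prev), one tensor per old task
    """
    if not prev_masks:
        return k
    # accumulate the old masks with an element-wise max over tasks ("MaxPool")
    accumulated = torch.stack(prev_masks).max(dim=0).values
    # g'_l = g_l * (1 - accumulated); with the hard masks of Sec. 2.2 the factor
    # is exactly 0 for protected neurons rather than a very small number
    k.register_hook(lambda grad: grad * (1.0 - accumulated))
    return k
```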
2.2 Catastrophic Butterfly Effect in Fine-tuning
To perform an end-task in a post-trained domain, we fine-tune the mask-protected model of the domain, which is indicated by the task/domain id. The fine-tuning uses the corresponding domain neurons for the specific end-task by setting $\tau = \tau_{\min}$ and conditioning the output via Eq. 2. With the masks, there should be no forgetting in continual post-training, and the end-task fine-tuning performance should be similar to post-training each domain separately. However, we found that this is not the case.5

Our investigation found that the problem is due to the pseudo-gate function in Eq. 1. No matter how small $\tau$ is, Eq. 1 can only give us a mask that is almost (but not exactly) 0 or 1. This causes the following: (1) during post-training, the gradients for used neurons in Eq. 3 are not exactly 0 but a very small number; (2) during fine-tuning, we cannot make use of the corresponding neurons for the specific end-task by simply setting $\tau = \tau_{\min}$. The small change to the neurons of old domains during post-training caused by (1) is negligible in conventional CL, because in conventional CL we evaluate the model using test sets and no weight updates are involved. However, in CPT, the end-task needs to fine-tune the continually post-trained LM model (p-LM), which involves weight updating. A small change to the p-LM during continual post-training can result in a different initialization for end-task fine-tuning and give totally different fine-tuning results. We call this the butterfly effect, inspired by the term indicating that a small state change in nature (e.g., the flap of a butterfly's wings in Brazil) can result in large differences in a later state (e.g., a tornado in Texas).
We propose a simple method to solve this, i.e., adding a threshold $\theta$ to $m_l^{(t)}$ to make it a hard binary mask,

$$ m_l^{(t)} = \begin{cases} 1, & m_l^{(t)} > \theta, \\ 0, & m_l^{(t)} \le \theta. \end{cases} \quad (4) $$

We then apply it to Eq. 3 in gradient manipulation and to Eq. 2 in end-task fine-tuning. $\theta$ can be easily set (we use 0.5) since Eq. 1 already gives a pseudo-binary mask. Note that this has almost no effect on post-training as it is used to block the backward gradient flow.
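A minimal sketch of this hard-mask fix, with the threshold set to 0.5 as in the text (the function name is illustrative):

```python
import torch

def harden_mask(soft_mask, theta=0.5):
    """Eq. (4): binarize the pseudo-binary sigmoid mask with threshold theta.
    The hard mask is then used in Eq. (3) during post-training and in Eq. (2)
    during end-task fine-tuning, so gating is exactly 0 or 1."""
    return (soft_mask > theta).float()

# example: a saved soft mask for one CL-plugin layer
soft = torch.sigmoid(torch.randn(128) / 0.01)   # values very close to 0 or 1
hard = harden_mask(soft)                        # exact 0/1 entries
```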
5 For example, fine-tuning an end-task of restaurant sentiment classification achieves a macro-F1 (MF1) of 0.64 right after post-training on the restaurant domain, but its fine-tuning MF1 drops to 0.44 after post-training on three more domains.