Continual Training of Language Models for Few-Shot Learning
Zixuan Ke1, Haowei Lin2, Yijia Shao2, Hu Xu1, Lei Shu1∗ and Bing Liu1
1Department of Computer Science, University of Illinois at Chicago
2Wangxuan Institute of Computer Technology, Peking University
1{zke4,hxu48,liub}@uic.edu
2{linhaowei, shaoyj}@pku.edu.cn
Abstract
Recent work on applying large language models (LMs) has achieved impressive performance in many NLP applications. Adapting or post-training an LM using an unlabeled domain corpus can produce even better performance for end-tasks in the domain. This paper proposes the problem of continually extending an LM by incrementally post-training it with a sequence of unlabeled domain corpora to expand its knowledge without forgetting its previous skills. The goal is to improve few-shot end-task learning in these domains. The resulting system is called CPT (Continual Post-Training), which, to our knowledge, is the first continual post-training system. Experimental results verify its effectiveness.¹
1 Introduction
Recent work has shown that large LMs have the ability to perform few-shot (or even zero-shot) learning well (Brown et al., 2020b; Rae et al., 2021; Smith et al., 2022). Post-training (a.k.a. domain-adaptive pre-training or pre-finetuning) an LM with a large unlabeled domain corpus before end-task fine-tuning in the domain achieves better results (Xu et al., 2019; Gururangan et al., 2020a) than directly fine-tuning the LM. This paper goes a step further to study the problem of improving an LM’s ability to handle new and ever-emerging domains. For this, one needs to continually post-train the LM with a sequence of domains. A key issue associated with this problem is catastrophic forgetting (CF).² This paper thus investigates how to continually extend the LM’s knowledge without suffering from CF. From a broader perspective, since training a large LM from scratch is extremely expensive and computationally intensive, incrementally updating the LM with the latest language data, which reflects the ever-changing development of the language itself, social events, and knowledge from different fields, is becoming increasingly critical. As humans are very effective at incremental learning, if we can imitate this human capability with little or no forgetting, we will push AI research forward significantly.
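To make the post-training-then-fine-tuning pipeline described above concrete, the sketch below shows one way to post-train RoBERTa with masked language modeling on a single unlabeled domain corpus and then fine-tune the result on that domain's end-task, using the Hugging Face Transformers library. It is only an illustration under assumed settings, not the paper's implementation: the corpus file, the number of labels, the hyperparameters, and the helper name post_train_mlm are all hypothetical.

```python
# A minimal sketch (not the paper's code) of domain post-training followed by
# end-task fine-tuning. The corpus path, label count, and hyperparameters are
# hypothetical placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def post_train_mlm(model, corpus_file, out_dir):
    """Post-train `model` with masked language modeling on one unlabeled domain corpus."""
    raw = load_dataset("text", data_files={"train": corpus_file})["train"]
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    args = TrainingArguments(output_dir=out_dir, num_train_epochs=1,
                             per_device_train_batch_size=16,
                             save_strategy="no", report_to="none")
    Trainer(model=model, args=args, train_dataset=tokenized,
            data_collator=collator).train()
    return model

# 1) Post-train on the domain's unlabeled corpus (e.g., restaurant reviews).
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")
mlm = post_train_mlm(mlm, "restaurant_corpus.txt", "post_trained_restaurant")
mlm.save_pretrained("post_trained_restaurant")

# 2) Fine-tune the post-trained LM on the domain's labeled end-task (here a 3-class
#    classifier as a placeholder), rather than fine-tuning roberta-base directly.
clf = AutoModelForSequenceClassification.from_pretrained(
    "post_trained_restaurant", num_labels=3)
# ... build the (few-shot) labeled dataset and train clf with another Trainer(...).train()
```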
The proposed system, called CPT, is a continual learning (CL) system for post-training. Starting from a pre-trained LM (e.g., RoBERTa (Liu et al., 2019b)), it incrementally post-trains the LM with a sequence of domains using their unlabeled corpora. Once a task (a domain in our case)³ is trained, its data is no longer accessible. At any time, the resulting continually post-trained LM can be used by end-tasks in the trained domains. This is in the task-incremental learning (TIL) setting of CL, where the task id (domain id in our case) is provided when the learned model of a task needs to be used later (the use of the domain id is discussed in Sec. 2.1).⁴ This paper proposes an effective approach called CPT and focuses on the challenging and practical scenario of few-shot end-task learning after post-training on a sequence of domains.
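The sketch below illustrates this sequential protocol only, under the same assumptions as the earlier sketch (it reuses the hypothetical post_train_mlm helper, placeholder corpus files, and placeholder label counts). It is not CPT itself: without the hard-masking mechanism of Sec. 2.2, such a naive loop still suffers from CF, and in CPT the provided domain id would select the corresponding hard mask inside a single p-LM rather than simply loading the final checkpoint.

```python
# Minimal sketch (not CPT itself) of sequentially post-training one LM over a
# sequence of domains; reuses the hypothetical post_train_mlm helper defined in
# the earlier sketch. Without CPT's hard-masking mechanism, this loop forgets.
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification

domain_corpora = [                       # hypothetical unlabeled domain corpora
    ("restaurant", "restaurant_corpus.txt"),
    ("acl_papers", "acl_corpus.txt"),
    ("ai_wiki", "ai_corpus.txt"),
]

p_lm = AutoModelForMaskedLM.from_pretrained("roberta-base")
for domain_id, corpus_file in domain_corpora:
    # Post-train on the current domain; its corpus is assumed inaccessible afterwards.
    p_lm = post_train_mlm(p_lm, corpus_file, out_dir=f"p_lm_{domain_id}")
p_lm.save_pretrained("p_lm_final")

# Few-shot end-task use in any trained domain: fine-tune a classifier built on the
# p-LM with a handful of labeled examples. In CPT, the given domain id would also
# select the hard mask that protects that domain's knowledge (Sec. 2.2).
clf = AutoModelForSequenceClassification.from_pretrained("p_lm_final", num_labels=2)
# ... fine-tune clf on the domain's few-shot labeled data with Trainer(...).train()
```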
Continual post-training is different from conventional CL (Chen and Liu, 2018). The key difference is that in conventional CL, each task is an end-task, but in our case the end-task involves fine-tuning the continually post-trained LM (called the p-LM). This causes major forgetting, which we call the catastrophic butterfly effect (CBE) and which does not happen in conventional CL. Our proposed system, CPT, can solve both CF and CBE based on a novel hard masking mechanism (Sec. 2.2) and can achieve no forgetting. As shown in Sec. 3.3, naively ap-
∗ Now at Google Research, leishu@google.com
¹ https://github.com/UIC-Liu-Lab/CPT
² CF means that learning a new task/domain may need to modify the existing network, which degrades the performance of previous tasks/domains (McCloskey and Cohen, 1989).
³ We will use the term domain in this paper to be consistent with the post-training literature.
⁴ CL has two other settings: class-incremental learning and domain-incremental learning (van de Ven and Tolias, 2019).