How Relevant is Selective Memory Population in Lifelong Language Learning?
Vladimir Araujo1,2, Helena Balabin1, Julio Hurtado3, Alvaro Soto2, Marie-Francine Moens1
1KU Leuven, 2Pontificia Universidad Católica de Chile, 3University of Pisa
vgaraujo@uc.cl, helena.balabin@kuleuven.be, julio.hurtado@di.unipi.it, asoto@ing.puc.cl, sien.moens@kuleuven.be
Abstract
Lifelong language learning seeks to have models continuously learn multiple tasks in a sequential order without suffering from catastrophic forgetting. State-of-the-art approaches rely on sparse experience replay as the primary mechanism to prevent forgetting. Experience replay usually adopts sampling methods to populate the memory; however, the effect of the chosen sampling strategy on model performance has not yet been studied. In this paper, we investigate how relevant selective memory population is in the lifelong learning process for text classification and question-answering tasks. We find that methods that randomly store a uniform number of samples from the entire data stream lead to high performance, especially for low memory sizes, which is consistent with findings from computer vision studies.
1 Introduction
While humans learn throughout their lifetime, current deep learning models are restricted to a bounded environment in which the input distribution is fixed. When such models learn new tasks sequentially, they suffer from catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990) because the input distribution changes.
Several methods have been proposed to address catastrophic forgetting, mainly for computer vision (CV) (Delange et al., 2021) and a few for natural language processing (NLP) (Biesialska et al., 2020). In both fields, one of the prominent approaches is experience replay with episodic memory (Hayes et al., 2021), which stores previously seen training examples and later uses them to perform gradient updates while training on new tasks.
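As an illustration, the following is a minimal sketch of sparse experience replay with an episodic memory. The buffer class, the replay interval, and the train_step interface are hypothetical placeholders for exposition, not the implementation evaluated in this paper.

import random

class EpisodicMemory:
    """Fixed-capacity buffer of previously seen training examples."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.examples = []

    def write(self, example):
        # Simplest possible population strategy: keep examples until the buffer is full.
        if len(self.examples) < self.capacity:
            self.examples.append(example)

    def sample(self, batch_size):
        return random.sample(self.examples, min(batch_size, len(self.examples)))

def train_with_sparse_replay(model, stream, memory, replay_interval=100, replay_batch_size=32):
    # Regular training on the incoming task stream, with an extra gradient
    # update on stored examples every `replay_interval` steps (sparse replay).
    for step, batch in enumerate(stream, start=1):
        model.train_step(batch)          # placeholder training interface
        for example in batch:
            memory.write(example)        # populate the episodic memory
        if step % replay_interval == 0 and memory.examples:
            model.train_step(memory.sample(replay_batch_size))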
In the experience replay approach, random sampling is the de facto method for memory population, as it has shown good results in CV (Chaudhry et al., 2019; Wu et al., 2019; Hayes et al., 2020). In contrast, other works have shown that memory selection is relevant for deep reinforcement learning (Isele and Cosgun, 2018), image classification (Chaudhry et al., 2018; Sun et al., 2022), and analogical reasoning (Hayes and Kanan, 2021). However, no previous work has explored NLP tasks, which raises the question of whether memory selection is necessary for lifelong language learning.
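To make the random strategy concrete, the sketch below shows classic reservoir sampling, which keeps every example seen so far in memory with equal probability without knowing the stream length in advance. The function name and write interface are illustrative and not taken from the cited works.

import random

def reservoir_write(memory, example, capacity, num_seen, rng=random):
    # After processing `num_seen` stream examples, each one is retained in
    # memory with equal probability capacity / num_seen.
    if len(memory) < capacity:
        memory.append(example)
    else:
        slot = rng.randint(0, num_seen - 1)  # uniform over all examples seen so far
        if slot < capacity:
            memory[slot] = example

# Usage: memory = []
# for i, ex in enumerate(stream, start=1):
#     reservoir_write(memory, ex, capacity=100, num_seen=i)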
In this paper, we adopt and evaluate seven memory population methods in a lifelong language learning setup with sparse experience replay. We conduct experiments on text classification and question answering tasks. We find that, for text classification, methods that populate the memory with random samples from the global data distribution provide the best results in both high- and low-memory regimes. Conversely, for question answering, a method that yields a balanced memory composition per task performs better.
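For intuition, the sketch below contrasts the two strategies in simplified form: sampling uniformly from the pooled data of all tasks versus allotting each task an equal share of the memory budget. Both functions are illustrative simplifications and do not correspond exactly to the methods evaluated here.

import random

def global_random_memory(all_examples, capacity, rng=random):
    # Uniform random sample from the pooled data stream: tasks with more
    # data naturally occupy more memory slots.
    return rng.sample(all_examples, min(capacity, len(all_examples)))

def balanced_memory(task_examples, capacity, rng=random):
    # Equal share of the memory budget per task, each filled uniformly at random.
    per_task = capacity // len(task_examples)
    memory = []
    for examples in task_examples.values():
        memory.extend(rng.sample(examples, min(per_task, len(examples))))
    return memory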
2 Related Work
Lifelong Learning in NLP.
Rather than training a language model on a fixed dataset, lifelong (continual) language learning setups consist of a stream of tasks (e.g., text classification). In this setup, a model aims to retain the most relevant information to prevent catastrophic forgetting. Existing approaches for NLP include purely replay-based methods (d'Autume et al., 2019; Han et al., 2020; Araujo et al., 2022), meta-learning-based methods (Wang et al., 2020; Holla et al., 2020), and generative replay-based methods (Sun et al., 2020a,b).
Memory Selection in Lifelong Learning.
Several strategies have been proposed to store and select the most relevant training examples in memory. Early work has shown that reservoir sampling prevents catastrophic forgetting in lifelong reinforcement learning (Isele and Cosgun, 2018) and supervised learning (Chaudhry et al., 2019) with limited memory. More recent works have explored criteria-based selection methods, showing that maximum-