How Relevant is Selective Memory Population in
Lifelong Language Learning?
Vladimir Araujo1,2, Helena Balabin1, Julio Hurtado3, Alvaro Soto2, Marie-Francine Moens1
1KU Leuven, 2Pontificia Universidad Católica de Chile, 3University of Pisa
vgaraujo@uc.cl,helena.balabin@kuleuven.be,
julio.hurtado@di.unipi.it,asoto@ing.puc.cl,sien.moens@kuleuven.be
Abstract
Lifelong language learning seeks to have mod-
els continuously learn multiple tasks in a se-
quential order without suffering from catas-
trophic forgetting. State-of-the-art approaches
rely on sparse experience replay as the primary mechanism to prevent forgetting. Experi-
ence replay usually adopts sampling methods
for the memory population; however, the ef-
fect of the chosen sampling strategy on model
performance has not yet been studied. In this
paper, we investigate how relevant the selec-
tive memory population is in the lifelong learn-
ing process of text classification and question-
answering tasks. We found that methods that randomly store a uniform number of samples from the entire data stream lead to high performance, especially at low memory sizes, which is consistent with findings from computer vision studies.
1 Introduction
While humans learn throughout their lifetime,
current deep learning models are restricted to a
bounded environment, where the input distribu-
tion is fixed. When those models are sequentially
learning new tasks, they suffer from catastrophic
forgetting (McCloskey and Cohen,1989;Ratcliff,
1990) because the input distribution changes.
Several methods have been proposed to address
catastrophic forgetting, mainly for computer vision
(CV) (Delange et al., 2021) and a few others for nat-
ural language processing (NLP) (Biesialska et al.,
2020). In both, one of the prominent approaches
is experience replay with episodic memory (Hayes
et al.,2021), which aims to store previously seen
training examples and later use them to perform
gradient updates while training on new tasks.
In the experience replay approach, random sam-
pling is the de facto method for the memory popula-
tion, as it has shown good results in CV (Chaudhry
et al.,2019;Wu et al.,2019;Hayes et al.,2020).
In contrast, other works have shown that memory
selection is relevant for deep reinforcement learn-
ing (Isele and Cosgun,2018), image classification
(Chaudhry et al.,2018;Sun et al.,2022), and ana-
logical reasoning (Hayes and Kanan,2021). How-
ever, no previous work has explored NLP tasks,
which raises the question of whether memory se-
lection is necessary for lifelong language learning.
In this paper, we adopt and evaluate seven mem-
ory population methods under a lifelong language
learning setup with sparse experience replay. We
conducted experiments with text classification and
question answering tasks. We find that, for text classification, methods that populate memory with a random sample from the global data distribution provide the best results in both high and low memory
regimes. Conversely, for the question answering
task, a method that provides a balanced memory
composition per task performs better.
2 Related Work
Lifelong Learning in NLP.
Rather than training
a language model on a fixed dataset, lifelong (con-
tinual) language learning setups consist of a stream
of tasks (e.g., text classification). In this setup, a
model aims to retain the most relevant informa-
tion to prevent catastrophic forgetting. Existing
approaches for NLP include purely replay-based
methods (d'Autume et al., 2019; Han et al., 2020; Araujo et al., 2022), meta-learning-based methods (Wang et al., 2020; Holla et al., 2020) and generative replay-based methods (Sun et al., 2020a,b).
Memory Selection in Lifelong Learning.
Sev-
eral strategies have been proposed to store and se-
lect the most relevant training examples in memory.
Early work has shown that reservoir sampling pre-
vents catastrophic forgetting in lifelong reinforce-
ment learning (Isele and Cosgun,2018) and super-
vised learning (Chaudhry et al.,2019) with limited
memory. More recent works have explored criteria-
based selection methods, showing that maximum-
loss examples are helpful for analogical reason-
ing (Hayes and Kanan,2021) and gradient-based
(Aljundi et al.,2019) or information-theoretic (Sun
et al.,2022) selection for image classification.
3 Lifelong Language Learning Setup
We consider the lifelong language learning setting
proposed by d'Autume et al. (2019), in which a
model learns multiple tasks in sequential order
from a stream of training examples [1]. In this setup, each example is only allowed to be viewed once.
This setup adopts sparse experience replay, which performs an additional gradient update on examples retrieved from memory at a fixed interval during training. We leverage this method, as d'Autume et al. (2019) have shown that a sparse replay rate of 1% relative to newly learned examples is sufficient for lifelong language learning.
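As an illustration, the following minimal sketch shows how sparse experience replay can be interleaved with training on the stream. All names are our own; the replay interval of 100 steps is an assumption that, with a replay batch the same size as a training batch, roughly corresponds to the 1% replay rate, and memory is assumed to be a list of (input, label) tensor pairs.

import random
import torch

def train_on_stream(model, optimizer, loss_fn, stream, memory,
                    replay_interval=100, replay_batch_size=32):
    for step, (inputs, labels) in enumerate(stream, start=1):
        # Regular update on the incoming batch (each example is seen once).
        optimizer.zero_grad()
        loss_fn(model(inputs), labels).backward()
        optimizer.step()

        # Sparse experience replay: an extra gradient update from memory.
        if memory and step % replay_interval == 0:
            replay = random.sample(memory, min(replay_batch_size, len(memory)))
            replay_inputs = torch.stack([x for x, _ in replay])
            replay_labels = torch.stack([y for _, y in replay])
            optimizer.zero_grad()
            loss_fn(model(replay_inputs), replay_labels).backward()
            optimizer.step()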
This setting also includes local adaptation
(Sprechmann et al.,2018), which is a process that
retrieves the K nearest neighbor examples from mem-
ory to update model parameters used to predict a
particular test example. However, recent works
have tried to reduce its use (Wang et al.,2020) or
even avoid it (Holla et al.,2020) because it signifi-
cantly slows down the inference speed. We do not
use this mechanism in our main experimentation
because our goal is to analyze the effect of selective
memory on the generalization of the model. Nev-
ertheless, Section 6 briefly shows how the resulting memory composition influences local adaptation.
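For reference, a rough sketch of such K-nearest-neighbor local adaptation is given below. The retrieval by feature distance, the hyperparameters, and all names are illustrative assumptions rather than the authors' implementation; memory_examples is assumed to hold (input, label) tensor pairs with matching rows in memory_features.

import copy
import torch
import torch.nn.functional as F

def locally_adapt(model, test_feature, memory_features, memory_examples,
                  k=32, steps=5, lr=1e-3):
    # Retrieve the K stored examples closest to the test example's feature.
    dists = torch.cdist(test_feature.unsqueeze(0), memory_features).squeeze(0)
    ids = torch.topk(dists, min(k, len(memory_examples)), largest=False).indices

    # Fine-tune a temporary copy of the model on those neighbors only;
    # the copy predicts this particular test example and is then discarded.
    adapted = copy.deepcopy(model)
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        for i in ids.tolist():
            inputs, label = memory_examples[i]
            optimizer.zero_grad()
            F.cross_entropy(adapted(inputs), label).backward()
            optimizer.step()
    return adapted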
4 Selective Episodic Memory
For the previously described lifelong learning setup,
we extend a replay model (see Section 5) with the
following seven memory population methods:
Naive Random.
A basic method for memory
population. It samples a percentage of the elements of each task. In our experiments, this percentage equals the memory capacity expressed as a fraction of the data stream, and we sample the elements on the fly from the current batch.
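A minimal sketch of this update, with an assumed 1% storage fraction and illustrative names, could look as follows (memory and batch are plain lists of examples):

import random

def naive_random_update(memory, batch, store_fraction=0.01):
    # Keep a fixed percentage of each incoming batch, chosen uniformly at random.
    n_store = max(1, int(round(len(batch) * store_fraction)))
    memory.extend(random.sample(batch, n_store))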
Reservoir.
A reservoir (Vitter,1985) allows sam-
pling elements from a stream without knowing how
many elements to expect. It samples each element
with probability M/N, where N is the number of elements observed so far and M is the memory size.
This way, it acts randomly to maintain a uniform
sample from the already seen stream.
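The standard reservoir update (Vitter, 1985) is sketched below with illustrative names; n_seen counts the stream examples observed so far, including the current one.

import random

def reservoir_update(memory, example, n_seen, memory_size):
    if len(memory) < memory_size:
        memory.append(example)
    else:
        # The new example replaces a stored one with probability M / N.
        j = random.randrange(n_seen)
        if j < memory_size:
            memory[j] = example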
[1] We use an available implementation of this setup: https://github.com/vgaraujov/LLL-NLP
Ring Buffer.
Similar to Lopez-Paz and Ranzato
(2017), this method allocates M/C elements for each class C of the task in memory. The strategy is a FIFO buffer, so the memory is always filled with the latest task observations. If the total number of classes is unknown, the value of M is gradually reduced as new tasks are observed.
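A sketch of such a per-class ring buffer is given below (illustrative names); for simplicity it assumes the number of classes is known in advance, so each class receives a fixed M/C quota.

from collections import defaultdict, deque

class RingBufferMemory:
    def __init__(self, memory_size, num_classes):
        # M / C first-in-first-out slots per class.
        slots = max(1, memory_size // num_classes)
        self.buffers = defaultdict(lambda: deque(maxlen=slots))

    def update(self, example, label):
        # Once a class buffer is full, its oldest example is evicted.
        self.buffers[label].append(example)

    def examples(self):
        return [x for buffer in self.buffers.values() for x in buffer]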
Surprise.
Unexpected events have been shown
to influence episodic memory in humans (Cheng
and Frank,2008). One way to measure surprise is
by computing the entropy of the output distribution
of an input batch. Analogous to Isele and Cosgun
(2018), we use the time difference between the
current entropy value and that of the previous batch
to sample high-surprise elements.
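A rough sketch of this criterion is shown below; the entropy computation follows the description above, while the threshold and the decision to store the whole batch are our own illustrative assumptions.

import torch.nn.functional as F

def batch_entropy(logits):
    # Mean entropy of the model's output distribution over the batch.
    probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean().item()

def surprise_update(memory, batch, logits, prev_entropy, threshold=0.1):
    entropy = batch_entropy(logits)
    # Temporal difference in entropy with respect to the previous batch.
    if entropy - prev_entropy > threshold:
        memory.extend(batch)
    return entropy  # carried over as prev_entropy for the next batch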
Minimum Margin.
Similar to Hayes and Kanan
(2021), who introduced a margin-based method for
CV replay models, we define the margin as the
difference between the probability of the true class
and the probability of the other most likely class.
We store the most uncertain examples, that is, those
with the smallest margin for which the probability
of the true class is only marginally different from
the probability of the other most likely class.
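A sketch of the margin computation and per-batch selection is given below; the number of stored examples per batch (n_store) and all names are illustrative assumptions.

import torch
import torch.nn.functional as F

def min_margin_update(memory, batch, logits, labels, n_store=8):
    probs = F.softmax(logits, dim=-1)
    true_p = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    # Probability of the most likely class other than the true one.
    others = probs.scatter(1, labels.unsqueeze(1), float("-inf"))
    margins = true_p - others.max(dim=1).values
    # Store the examples with the smallest margin, i.e. the most uncertain ones.
    keep = torch.topk(margins, min(n_store, len(batch)), largest=False).indices
    memory.extend(batch[i] for i in keep.tolist())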
Maximum Loss.
Analogous to the previous strat-
egy, the maximum loss strategy aims to store sam-
ples with high uncertainty. However, this time it
is based on storing samples with a high loss value
(Hayes and Kanan,2021). Here, we slightly mod-
ify the strategy by evaluating the loss for an en-
tire batch, therefore storing and overriding whole
batches in memory.
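One possible realization at batch granularity keeps a min-heap of stored batches keyed by their loss, so a new high-loss batch overrides the lowest-loss batch currently in memory; the heap layout and names below are our own assumptions.

import heapq

def max_loss_update(memory_heap, batch, batch_loss, max_batches):
    # memory_heap holds (loss, tie_breaker, batch) entries as a min-heap.
    entry = (batch_loss, id(batch), batch)
    if len(memory_heap) < max_batches:
        heapq.heappush(memory_heap, entry)
    elif batch_loss > memory_heap[0][0]:
        heapq.heapreplace(memory_heap, entry)  # evict the lowest-loss batch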
Mean of Features (MoF).
Similar to Rebuffi et al. (2017) and Chaudhry et al. (2019), we calculate
the average feature vector based on averaging the
final
[CLS]
representations in memory for a given
class. If the representation of an input example has
a smaller distance to its average feature vector than
the entry in the memory with the largest distance
to the average, we store the new incoming example
and update the respective average feature vector.
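The per-class update can be sketched as follows; treating the class mean as recomputed from the stored features after each replacement is our reading of the description, and all names are illustrative.

import torch

def mof_update(class_examples, class_features, new_example, new_feature):
    # class_features: (N, d) tensor of stored [CLS] features for one class.
    mean = class_features.mean(dim=0)
    dists = torch.norm(class_features - mean, dim=1)
    farthest = int(torch.argmax(dists))
    # Store the new example if it lies closer to the class mean than the
    # entry currently farthest from it, then refresh the mean.
    if torch.norm(new_feature - mean) < dists[farthest]:
        class_examples[farthest] = new_example
        class_features[farthest] = new_feature
    return class_features.mean(dim=0)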
5 Experimental Setup
Datasets.
We adopt the evaluation methodology
and datasets proposed by d'Autume et al. (2019).
For text classification, we use five datasets from
Zhang et al. (2015): AGNews classification, Yelp
sentiment analysis, Amazon sentiment analysis,
DBPedia article classification and Yahoo questions