How Relevant is Selective Memory Population in
Lifelong Language Learning?
Vladimir Araujo1,2, Helena Balabin1, Julio Hurtado3, Alvaro Soto2, Marie-Francine Moens1
1KU Leuven, 2Pontificia Universidad Católica de Chile, 3University of Pisa
vgaraujo@uc.cl,helena.balabin@kuleuven.be,
julio.hurtado@di.unipi.it,asoto@ing.puc.cl,sien.moens@kuleuven.be
Abstract
Lifelong language learning seeks to have mod-
els continuously learn multiple tasks in a se-
quential order without suffering from catas-
trophic forgetting. State-of-the-art approaches
rely on sparse experience replay as the primary mechanism to prevent forgetting. Experi-
ence replay usually adopts sampling methods
for the memory population; however, the ef-
fect of the chosen sampling strategy on model
performance has not yet been studied. In this
paper, we investigate how relevant the selec-
tive memory population is in the lifelong learn-
ing process of text classification and question-
answering tasks. We found that methods that randomly store a uniform number of samples from the entire data stream lead to high performance, especially at low memory sizes, which is consistent with findings from computer vision studies.
1 Introduction
While humans learn throughout their lifetime,
current deep learning models are restricted to a
bounded environment, where the input distribu-
tion is fixed. When those models are sequentially
learning new tasks, they suffer from catastrophic
forgetting (McCloskey and Cohen,1989;Ratcliff,
1990) because the input distribution changes.
Several methods have been proposed to address
catastrophic forgetting, mainly for computer vision
(CV) (Delange et al., 2021) and a few others for nat-
ural language processing (NLP) (Biesialska et al.,
2020). In both, one of the prominent approaches
is experience replay with episodic memory (Hayes
et al.,2021), which aims to store previously seen
training examples and later use them to perform
gradient updates while training on new tasks.
In the experience replay approach, random sam-
pling is the de facto method for the memory popula-
tion, as it has shown good results in CV (Chaudhry
et al.,2019;Wu et al.,2019;Hayes et al.,2020).
In contrast, other works have shown that memory
selection is relevant for deep reinforcement learn-
ing (Isele and Cosgun,2018), image classification
(Chaudhry et al.,2018;Sun et al.,2022), and ana-
logical reasoning (Hayes and Kanan,2021). How-
ever, no previous work has explored NLP tasks,
which raises the question of whether memory se-
lection is necessary for lifelong language learning.
In this paper, we adopt and evaluate seven mem-
ory population methods under a lifelong language
learning setup with sparse experience replay. We
conducted experiments with text classification and
question answering tasks. We find that, for text classification, methods that populate memory with a random sample from the global data distribution provide the best results in both high and low memory
regimes. Conversely, for the question answering
task, a method that provides a balanced memory
composition per task performs better.
2 Related Work
Lifelong Learning in NLP.
Rather than training
a language model on a fixed dataset, lifelong (con-
tinual) language learning setups consist of a stream
of tasks (e.g., text classification). In this setup, a
model aims to retain the most relevant informa-
tion to prevent catastrophic forgetting. Existing
approaches for NLP include purely replay-based
methods (d'Autume et al., 2019; Han et al., 2020; Araujo et al., 2022), meta-learning-based methods (Wang et al., 2020; Holla et al., 2020) and generative replay-based methods (Sun et al., 2020a,b).
Memory Selection in Lifelong Learning.
Sev-
eral strategies have been proposed to store and se-
lect the most relevant training examples in memory.
Early work has shown that reservoir sampling pre-
vents catastrophic forgetting in lifelong reinforce-
ment learning (Isele and Cosgun,2018) and super-
vised learning (Chaudhry et al.,2019) with limited
memory. More recent works have explored criteria-
based selection methods, showing that maximum-
loss examples are helpful for analogical reason-
ing (Hayes and Kanan,2021) and gradient-based
(Aljundi et al.,2019) or information-theoretic (Sun
et al.,2022) selection for image classification.
3 Lifelong Language Learning Setup
We consider the lifelong language learning setting
proposed by d'Autume et al. (2019), in which a
model learns multiple tasks in sequential order
from a stream of training examples [1]. In this setup, each example is only allowed to be viewed once.
This setup adopts sparse experience replay, which performs an additional gradient update on examples retrieved from memory at a fixed interval during training. We leverage this method, as d'Autume et al. (2019) have shown that a sparse replay rate of 1% relative to newly learned examples is sufficient for lifelong language learning.
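As an illustration, the following minimal sketch shows how sparse experience replay can be interleaved with training on the stream. All names are our own; the replay interval of 100 steps is an assumption that, with a replay batch the same size as a training batch, roughly corresponds to the 1% replay rate, and memory is assumed to be a list of (input, label) tensor pairs.

import random
import torch

def train_on_stream(model, optimizer, loss_fn, stream, memory,
                    replay_interval=100, replay_batch_size=32):
    for step, (inputs, labels) in enumerate(stream, start=1):
        # Regular update on the incoming batch (each example is seen once).
        optimizer.zero_grad()
        loss_fn(model(inputs), labels).backward()
        optimizer.step()

        # Sparse experience replay: an extra gradient update from memory.
        if memory and step % replay_interval == 0:
            replay = random.sample(memory, min(replay_batch_size, len(memory)))
            replay_inputs = torch.stack([x for x, _ in replay])
            replay_labels = torch.stack([y for _, y in replay])
            optimizer.zero_grad()
            loss_fn(model(replay_inputs), replay_labels).backward()
            optimizer.step()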
This setting also includes local adaptation
(Sprechmann et al.,2018), which is a process that
retrieves the K nearest neighbor examples from mem-
ory to update model parameters used to predict a
particular test example. However, recent works
have tried to reduce its use (Wang et al.,2020) or
even avoid it (Holla et al.,2020) because it signifi-
cantly slows down the inference speed. We do not
use this mechanism in our main experimentation
because our goal is to analyze the effect of selective
memory on the generalization of the model. Nev-
ertheless, Section 6 briefly shows how the resulting memory composition influences local adaptation.
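For reference, a rough sketch of such K-nearest-neighbor local adaptation is given below. The retrieval by feature distance, the hyperparameters, and all names are illustrative assumptions rather than the authors' implementation; memory_examples is assumed to hold (input, label) tensor pairs with matching rows in memory_features.

import copy
import torch
import torch.nn.functional as F

def locally_adapt(model, test_feature, memory_features, memory_examples,
                  k=32, steps=5, lr=1e-3):
    # Retrieve the K stored examples closest to the test example's feature.
    dists = torch.cdist(test_feature.unsqueeze(0), memory_features).squeeze(0)
    ids = torch.topk(dists, min(k, len(memory_examples)), largest=False).indices

    # Fine-tune a temporary copy of the model on those neighbors only;
    # the copy predicts this particular test example and is then discarded.
    adapted = copy.deepcopy(model)
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        for i in ids.tolist():
            inputs, label = memory_examples[i]
            optimizer.zero_grad()
            F.cross_entropy(adapted(inputs), label).backward()
            optimizer.step()
    return adapted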
4 Selective Episodic Memory
For the previously described lifelong learning setup,
we extend a replay model (see Section 5) with the
following seven memory population methods:
Naive Random.
A basic method for memory
population. It samples a percentage of the elements of each task. In our experiments, this percentage equals the memory capacity expressed as a fraction of the data stream, and we sample the elements on the fly from the current batch.
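A minimal sketch of this update, with an assumed 1% storage fraction and illustrative names, could look as follows (memory and batch are plain lists of examples):

import random

def naive_random_update(memory, batch, store_fraction=0.01):
    # Keep a fixed percentage of each incoming batch, chosen uniformly at random.
    n_store = max(1, int(round(len(batch) * store_fraction)))
    memory.extend(random.sample(batch, n_store))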
Reservoir.
A reservoir (Vitter,1985) allows sam-
pling elements from a stream without knowing how
many elements to expect. It samples each element
with probability M/N, where N is the number of elements observed so far and M is the memory size.
This way, it acts randomly to maintain a uniform
sample from the already seen stream.
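The standard reservoir update (Vitter, 1985) is sketched below with illustrative names; n_seen counts the stream examples observed so far, including the current one.

import random

def reservoir_update(memory, example, n_seen, memory_size):
    if len(memory) < memory_size:
        memory.append(example)
    else:
        # The new example replaces a stored one with probability M / N.
        j = random.randrange(n_seen)
        if j < memory_size:
            memory[j] = example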
[1] We use an available implementation of this setup: https://github.com/vgaraujov/LLL-NLP
Ring Buffer.
Similar to Lopez-Paz and Ranzato
(2017), this method allocates M/C elements for each class C of the task in memory. The strategy is a FIFO buffer, so the memory is always filled with the latest task observations. If the total number of classes is unknown, the value of M is gradually reduced as new tasks are observed.
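A sketch of such a per-class ring buffer is given below (illustrative names); for simplicity it assumes the number of classes is known in advance, so each class receives a fixed M/C quota.

from collections import defaultdict, deque

class RingBufferMemory:
    def __init__(self, memory_size, num_classes):
        # M / C first-in-first-out slots per class.
        slots = max(1, memory_size // num_classes)
        self.buffers = defaultdict(lambda: deque(maxlen=slots))

    def update(self, example, label):
        # Once a class buffer is full, its oldest example is evicted.
        self.buffers[label].append(example)

    def examples(self):
        return [x for buffer in self.buffers.values() for x in buffer]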
Surprise.
Unexpected events have been shown
to influence episodic memory in humans (Cheng
and Frank,2008). One way to measure surprise is
by computing the entropy of the output distribution
of an input batch. Analogous to Isele and Cosgun
(2018), we use the time difference between the
current entropy value and that of the previous batch
to sample high-surprise elements.
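A rough sketch of this criterion is shown below; the entropy computation follows the description above, while the threshold and the decision to store the whole batch are our own illustrative assumptions.

import torch.nn.functional as F

def batch_entropy(logits):
    # Mean entropy of the model's output distribution over the batch.
    probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean().item()

def surprise_update(memory, batch, logits, prev_entropy, threshold=0.1):
    entropy = batch_entropy(logits)
    # Temporal difference in entropy with respect to the previous batch.
    if entropy - prev_entropy > threshold:
        memory.extend(batch)
    return entropy  # carried over as prev_entropy for the next batch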
Minimum Margin.
Similar to Hayes and Kanan
(2021), who introduced a margin-based method for
CV replay models, we define the margin as the
difference between the probability of the true class
and the probability of the other most likely class.
We store the most uncertain examples, that is, those
with the smallest margin for which the probability
of the true class is only marginally different from
the probability of the other most likely class.
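A sketch of the margin computation and per-batch selection is given below; the number of stored examples per batch (n_store) and all names are illustrative assumptions.

import torch
import torch.nn.functional as F

def min_margin_update(memory, batch, logits, labels, n_store=8):
    probs = F.softmax(logits, dim=-1)
    true_p = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    # Probability of the most likely class other than the true one.
    others = probs.scatter(1, labels.unsqueeze(1), float("-inf"))
    margins = true_p - others.max(dim=1).values
    # Store the examples with the smallest margin, i.e. the most uncertain ones.
    keep = torch.topk(margins, min(n_store, len(batch)), largest=False).indices
    memory.extend(batch[i] for i in keep.tolist())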
Maximum Loss.
Analogous to the previous strat-
egy, the maximum loss strategy aims to store sam-
ples with high uncertainty. However, this time it
is based on storing samples with a high loss value
(Hayes and Kanan,2021). Here, we slightly mod-
ify the strategy by evaluating the loss for an en-
tire batch, therefore storing and overriding whole
batches in memory.
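One possible realization at batch granularity keeps a min-heap of stored batches keyed by their loss, so a new high-loss batch overrides the lowest-loss batch currently in memory; the heap layout and names below are our own assumptions.

import heapq

def max_loss_update(memory_heap, batch, batch_loss, max_batches):
    # memory_heap holds (loss, tie_breaker, batch) entries as a min-heap.
    entry = (batch_loss, id(batch), batch)
    if len(memory_heap) < max_batches:
        heapq.heappush(memory_heap, entry)
    elif batch_loss > memory_heap[0][0]:
        heapq.heapreplace(memory_heap, entry)  # evict the lowest-loss batch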
Mean of Features (MoF).
Similar to Rebuffi et al. (2017) and Chaudhry et al. (2019), we calculate
the average feature vector based on averaging the
final
[CLS]
representations in memory for a given
class. If the representation of an input example has
a smaller distance to its average feature vector than
the entry in the memory with the largest distance
to the average, we store the new incoming example
and update the respective average feature vector.
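The per-class update can be sketched as follows; treating the class mean as recomputed from the stored features after each replacement is our reading of the description, and all names are illustrative.

import torch

def mof_update(class_examples, class_features, new_example, new_feature):
    # class_features: (N, d) tensor of stored [CLS] features for one class.
    mean = class_features.mean(dim=0)
    dists = torch.norm(class_features - mean, dim=1)
    farthest = int(torch.argmax(dists))
    # Store the new example if it lies closer to the class mean than the
    # entry currently farthest from it, then refresh the mean.
    if torch.norm(new_feature - mean) < dists[farthest]:
        class_examples[farthest] = new_example
        class_features[farthest] = new_feature
    return class_features.mean(dim=0)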
5 Experimental Setup
Datasets.
We adopt the evaluation methodology
and datasets proposed by d'Autume et al. (2019).
For text classification, we use five datasets from
Zhang et al. (2015): AGNews classification, Yelp
sentiment analysis, Amazon sentiment analysis,
DBPedia article classification and Yahoo questions