Efficient Meta-Learning for Continual Learning
with Taylor Expansion Approximation
1st Xiaohan Zou
Boston University
zxh@bu.edu
2nd Tong Lin
Key Lab. of Machine Perception (MoE), School of AI,
Center for Data Science, Peking University
lintong@pku.edu.cn
Abstract—Continual learning aims to alleviate catastrophic
forgetting when handling consecutive tasks under non-stationary
distributions. Gradient-based meta-learning algorithms have
shown the capability to implicitly solve the transfer-interference
trade-off problem between different examples. However, they still
suffer from the catastrophic forgetting problem in the setting
of continual learning, since the past data of previous tasks are
no longer available. In this work, we propose a novel efficient
meta-learning algorithm for solving the online continual learning
problem, where the regularization terms and learning rates are
adapted to the Taylor approximation of the parameter importance to mitigate forgetting. The proposed method expresses the gradient of the meta-loss in closed form and thus avoids computing second-order derivatives, which is computationally prohibitive. We also use proximal gradient descent to further improve computational efficiency and accuracy. Experiments on
diverse benchmarks show that our method achieves better or on-par performance and much higher efficiency compared with state-of-the-art approaches.
Index Terms—meta-learning, continual learning
I. INTRODUCTION
Catastrophic forgetting [1], [2] poses a major challenge to
artificial intelligence systems: when switching to a new task,
the system performance may degrade on the previously trained
tasks. Continual learning addresses this challenge: models must be stable enough to prevent forgetting while remaining flexible enough to acquire new knowledge.
To alleviate catastrophic forgetting, several categories of continual learning methods have been proposed: regularization approaches that penalize changes to important weights [3], [4], methods that modify the architecture of neural networks [5], [6], and rehearsal approaches that introduce an episodic memory to store and replay previously learned samples [7], [8].
The basic idea of rehearsal-based approaches like Gra-
dient Episodic Memory (GEM) [7] is to ensure gradient-
alignment across tasks such that the losses of the past tasks
in episodic memory will not increase. Interestingly, this ob-
jective coincides with the implicit objective of gradient-based
meta-learning algorithms [9]–[11]. Further, meta-learning al-
gorithms show promise to generalize better on future tasks
[10], [12]. Meta Experience Replay (MER) [10] integrates the first-order meta-learning algorithm Reptile [9] with an experience replay module to reduce interference between old and new tasks. To alleviate the slow training speed of MER, Lookahead-MAML (La-MAML) [11] proposes a more efficient meta-objective for online continual learning. La-MAML further incorporates learnable per-parameter learning rates to reduce catastrophic forgetting, achieving state-of-the-art performance.

This work was supported by NSFC Tianyuan Fund for Mathematics (No. 12026606), National Key R&D Program of China (No. 2018AAA0100300), and Beijing Academy of Artificial Intelligence (BAAI). Correspondence to Tong Lin (lintong@pku.edu.cn).
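The gradient-alignment idea behind GEM mentioned above can be illustrated with a simplified single-constraint projection (in the spirit of A-GEM; GEM itself solves a quadratic program over all tasks in memory). The function name and the two-parameter toy example below are hypothetical, for illustration only:

```python
import numpy as np

def project_gradient(g, g_ref):
    """If the proposed update g is negatively aligned with the memory
    gradient g_ref (i.e., it would increase the loss on replayed
    examples), project g onto the half-space where the memory loss
    does not increase. Simplified single-constraint sketch; GEM
    solves a quadratic program with one constraint per past task."""
    dot = float(g @ g_ref)
    if dot < 0:
        g = g - (dot / float(g_ref @ g_ref)) * g_ref
    return g

g = np.array([1.0, -1.0])      # current-task gradient
g_ref = np.array([0.0, 1.0])   # gradient on episodic memory
g_proj = project_gradient(g, g_ref)
assert g_proj @ g_ref >= 0     # projected update no longer interferes
```

When the two gradients already align (non-negative dot product), the update is left unchanged, which is exactly the "transfer" side of the transfer-interference trade-off.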
However, though La-MAML proposes a more efficient objective, it still requires explicitly computing the Hessian matrix, which is computationally prohibitive for large networks. Moreover, learning a separate learning rate for each parameter entails additional computational overhead and increases memory usage.
To overcome these difficulties, in this paper, we present
a novel efficient gradient-based meta-learning algorithm for
online continual learning. The proposed method solves the
meta-optimization problem without accessing the Hessian in-
formation of the empirical risk. Inspired by regularization-
based methods, we compute the parameter importance using
the first-order Taylor series and assign the learning rates
according to the parameter importance. In this way, no extra trainable parameters are introduced, which reduces computational complexity and memory usage. We also
impose explicit regularization terms in the inner loss to achieve
better performance and apply proximal gradient descent to
improve efficiency. Our approach performs competitively on four commonly used benchmark datasets, achieving better or on-par performance relative to La-MAML and other state-of-the-art approaches in much shorter training time.
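The importance-adapted learning rate idea can be sketched as follows. The accumulation rule (magnitude of the first-order Taylor term of the loss change) and the scaling rule `base_lr / (1 + importance)` are hypothetical illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def update_importance(omega, grads, delta_theta):
    """Accumulate a first-order Taylor estimate of importance: the loss
    change from moving parameter i by delta_theta_i is approximately
    g_i * delta_theta_i, so its magnitude scores how sensitive the
    loss is to that parameter."""
    return omega + np.abs(grads * delta_theta)

def importance_scaled_lr(base_lr, omega):
    """Shrink the step size of important parameters instead of
    learning a separate trainable rate per parameter."""
    return base_lr / (1.0 + omega)

# toy example: the first parameter has a much larger gradient,
# so it accumulates more importance and receives a smaller step
omega = np.zeros(2)
omega = update_importance(omega, np.array([2.0, 0.1]), np.array([0.5, 0.5]))
lrs = importance_scaled_lr(0.1, omega)
assert lrs[0] < lrs[1]
```

Because the importance statistics are derived from quantities already available during training (gradients and parameter updates), no additional trainable parameters or backward passes are needed.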
II. RELATED WORK
A. Continual Learning
Existing continual learning approaches are broadly classified into regularization-based, rehearsal-based, and dynamic network architecture-based approaches [13].
Regularization-based methods penalize major changes by quantifying parameter importance on previous tasks while using a fixed model capacity. Parameter importance can be estimated from the Fisher information matrix [3], the loss [4], the sensitivity of the outputs with respect to the parameters [14], or trainable attention masks [15]. A number of studies restrain weight updates from
Bayesian perspectives [16]–[20]. Several recently proposed
methods also consider forcing weight updates to belong to
the null space of the feature covariance [21], [22].
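The quadratic-penalty idea shared by these regularization-based methods can be illustrated as follows; the importance weights stand in for, e.g., the diagonal Fisher information used by EWC [3], and the helper name is hypothetical:

```python
import numpy as np

def quadratic_importance_penalty(theta, theta_old, importance, lam):
    """Anchor each parameter to its value after the previous task,
    weighted by its estimated importance (e.g., diagonal Fisher
    information in EWC [3]). Illustrative sketch only."""
    return 0.5 * lam * float(np.sum(importance * (theta - theta_old) ** 2))

theta_old = np.array([1.0, 1.0])
importance = np.array([10.0, 0.1])  # first parameter mattered for task 1
# moving the important parameter is penalized far more heavily
p_hi = quadratic_importance_penalty(np.array([2.0, 1.0]), theta_old, importance, 1.0)
p_lo = quadratic_importance_penalty(np.array([1.0, 2.0]), theta_old, importance, 1.0)
assert p_hi > p_lo
```

Adding this penalty to the new-task loss lets the network use its fixed capacity for new knowledge while discouraging changes that would erase old knowledge.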
arXiv:2210.00713v1 [cs.LG] 3 Oct 2022