Efficient Meta-Learning for Continual Learning
with Taylor Expansion Approximation
1st Xiaohan Zou
Boston University
zxh@bu.edu
2nd Tong Lin
Key Lab. of Machine Perception (MoE), School of AI,
Center for Data Science, Peking University
lintong@pku.edu.cn
Abstract—Continual learning aims to alleviate catastrophic
forgetting when handling consecutive tasks under non-stationary
distributions. Gradient-based meta-learning algorithms have
shown the capability to implicitly solve the transfer-interference
trade-off problem between different examples. However, they still
suffer from the catastrophic forgetting problem in the setting
of continual learning, since the past data of previous tasks are
no longer available. In this work, we propose a novel efficient
meta-learning algorithm for solving the online continual learning
problem, where the regularization terms and learning rates are
adapted to the Taylor approximation of the parameter’s im-
portance to mitigate forgetting. The proposed method expresses
the gradient of the meta-loss in closed form and thus avoids computing second-order derivatives, which is computationally prohibitive. We also use Proximal Gradient Descent to further improve computational efficiency and accuracy. Experiments on diverse benchmarks show that our method achieves better or on-par performance and much higher efficiency compared to state-of-the-art approaches.
Index Terms—meta-learning, continual learning
I. INTRODUCTION
Catastrophic forgetting [1], [2] poses a major challenge to
artificial intelligence systems: when switching to a new task,
the system performance may degrade on the previously trained
tasks. Continual learning is proposed to address this challenge,
which requires models to be stable enough to prevent forget-
ting while being flexible to acquire new knowledge.
To alleviate catastrophic forgetting, several categories of continual learning methods have been proposed: regularization approaches that penalize changes to important weights [3], [4], methods that modify the architecture of neural networks [5], [6], and approaches that introduce an episodic memory to store and replay previously learned samples [7], [8].
The basic idea of rehearsal-based approaches like Gra-
dient Episodic Memory (GEM) [7] is to ensure gradient-
alignment across tasks such that the losses of the past tasks
in episodic memory will not increase. Interestingly, this ob-
jective coincides with the implicit objective of gradient-based
meta-learning algorithms [9]–[11]. Further, meta-learning al-
gorithms show promise to generalize better on future tasks
[10], [12]. Meta Experience Replay (MER) [10] integrates
This work was supported by NSFC Tianyuan Fund for Mathematics (No.
12026606), National Key R&D Program of China (No. 2018AAA0100300),
and Beijing Academy of Artificial Intelligence (BAAI).
Correspondence to Tong Lin (lintong@pku.edu.cn).
the first-order meta-learning algorithm Reptile [9] with an
experience replay module to reduce interference between old
and new tasks. To alleviate the slow training speed of MER,
Lookahead-MAML (La-MAML) [11] proposes a more effi-
cient meta-objective for online continual learning. La-MAML
then incorporates learnable per-parameter learning rates to further reduce catastrophic forgetting, achieving state-of-the-art performance.
However, though La-MAML proposes a more efficient objective, it still requires directly computing the Hessian matrix, which is computationally prohibitive for large networks.
Also, learning a learning rate for each parameter entails more
computational overhead and increases memory usage.
To overcome these difficulties, in this paper, we present
a novel efficient gradient-based meta-learning algorithm for
online continual learning. The proposed method solves the
meta-optimization problem without accessing the Hessian in-
formation of the empirical risk. Inspired by regularization-
based methods, we compute the parameter importance using
the first-order Taylor series and assign the learning rates
according to the parameter importance. In this way, no extra
trainable parameters will be incorporated so that the computa-
tional complexity and memory usage can be reduced. We also
impose explicit regularization terms in the inner loss to achieve
better performance and apply proximal gradient descent to
improve efficiency. Our approach performs competitively on
four commonly used benchmark datasets, achieving better or
on-par performance against La-MAML and other state-of-the-
art approaches in a much shorter training time.
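To make the core idea concrete, the following is a simplified sketch (not the exact algorithm of this paper; the network, hyper-parameters, and names such as `base_lr` and `lam` are illustrative). The first-order Taylor expansion $\ell(\theta + \Delta\theta) \approx \ell(\theta) + \nabla\ell(\theta) \cdot \Delta\theta$ suggests $|g_i \Delta\theta_i|$ as a per-parameter importance score, which can then shrink the learning rates of important parameters:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # toy network
base_lr, lam = 0.1, 1.0                        # illustrative hyper-parameters
importance = [torch.zeros_like(p) for p in model.parameters()]

def train_step(x, y):
    loss = nn.functional.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    with torch.no_grad():
        for p, g, imp in zip(model.parameters(), grads, importance):
            # First-order Taylor: ell(theta + delta) - ell(theta) ~ g . delta,
            # so |g_i * delta_i| measures how much coordinate i matters.
            lr = base_lr / (1.0 + lam * imp)   # important => smaller step
            delta = -lr * g
            imp += (g * delta).abs()           # accumulate importance online
            p += delta

train_step(torch.randn(4, 10), torch.randint(0, 2, (4,)))
```

Because the importance scores are derived from quantities already available during training (gradients and updates), this scheme adds no trainable parameters, in contrast to learning a learning rate per parameter.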
II. RELATED WORK
A. Continual Learning
Existing continual learning approaches are broadly classified into regularization-based, rehearsal-based, and dynamic network architecture-based approaches [13].
Regularization-based methods penalize major changes by
quantifying parameter importance on previous tasks while us-
ing a fixed capacity. Parameter importance can be estimated via the Fisher information matrix [3], the loss [4], the sensitivity of outputs with respect to the parameters [14], or trainable attention masks [15]. A number of studies constrain weight updates from
Bayesian perspectives [16]–[20]. Several recently proposed
methods also consider forcing weight updates to belong to
the null space of the feature covariance [21], [22].
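As a generic illustration of this family (a sketch, not any specific cited method): given pre-computed per-parameter importance weights `omega` and the parameters `theta_star` learned on previous tasks, a quadratic penalty pulls important parameters back toward their old values; `lam` is a hypothetical trade-off coefficient.

```python
import torch

def regularized_loss(task_loss, params, omega, theta_star, lam=1.0):
    # Importance-weighted quadratic penalty: parameters with large omega
    # are pulled back toward their post-previous-task values theta_star.
    penalty = sum((o * (p - t) ** 2).sum()
                  for o, p, t in zip(omega, params, theta_star))
    return task_loss + lam * penalty
```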
Rehearsal-based methods maintain a small episodic mem-
ory of previously seen samples for replay [7], [8], [23] or train
a generative model to produce pseudo-data for past tasks [24]–
[26]. Generative models reduce working memory effectively but incur the additional complexity of the generative modeling task. In contrast,
episodic memory methods are simpler and more effective.
Gradient Episodic Memory (GEM) [7] aligns gradients across
tasks to avoid interference with previous tasks. Averaged-GEM (A-GEM) [8] simplifies GEM by replacing the per-task gradients with a single gradient computed on a batch sampled from memory. Experience Replay (ER)
[27] considers the online setting and jointly trains the model on
the samples from new tasks and episodic memory. A number
of methods focus on improving the memory selection process,
like MIR [28] that selects the most interfered samples for memory
rehearsal, HAL [29] that selects the anchor points of past tasks
and interleaves them with new tasks for future training, and
GMED [30] that edits stored examples via gradient updates to
create more “challenging” examples for replay.
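For concreteness, the A-GEM correction described above can be sketched as follows (assuming the current and memory gradients are already flattened into single vectors; this paraphrases the published method, not the authors' code):

```python
import torch

def agem_project(g, g_ref):
    # If the current gradient g conflicts with the memory gradient g_ref
    # (a negative dot product would increase the loss on the memory
    # batch), project g onto the half-space where that loss does not
    # increase; otherwise leave g unchanged.
    dot = torch.dot(g, g_ref)
    if dot < 0:
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g
```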
Dynamic network architecture-based methods overcome
catastrophic forgetting by dynamically allocating task-specific
parameters to accommodate new tasks. In [5], [6], [31]–
[33], the model is expanded for each new task. Progressive
Neural Network (PNN) [5] leverages prior knowledge via
lateral connections to previously learned features. Dynamically
Expandable Network (DEN) [6] splits or duplicates important neurons when expanding the network on new tasks to limit redundancy, whereas [32] shares part of the base
network. Reinforced Continual Learning (RCL) [33] searches
for the best network architecture for arriving tasks using
reinforcement learning. To keep the model compact, [34] performs weight pruning after training on each task, which greatly increases the computational overhead.
Dirichlet process mixture models have also been applied
to expand a set of networks [35]. Instead of learning the
weights of the sub-networks, [36], [37] find binary masks to
assign different subsets of the weights for different tasks. By
design, these approaches often result in higher model and time
complexities.
B. Meta-Learning for Continual Learning
Recently, it has been shown that gradient-based meta-
learning algorithms integrated with episodic memory outper-
form many previous approaches in online settings [10]–[12].
Meta-Experience Replay (MER) [10] aligns gradients between
old and new tasks using samples from an experience replay
module. However, MER trains slowly, making it impractical for real-world scenarios. Online-
aware Meta-Learning (OML) [12] proposes a meta-objective
to learn a sparse representation offline. Lookahead-MAML
(La-MAML) [11] introduces a more efficient online objective
and incorporates trainable parameter-specific learning rates to
reduce the interference. Both La-MAML and MER require the
computation of second-order derivatives.
III. PRELIMINARIES
A. Continual Learning
Suppose that a sequence of $T$ tasks $[\tau_1, \tau_2, \ldots, \tau_T]$ is observed sequentially. Each task $\tau_t$ is associated with a dataset $\{X^t, Y^t\} = \{(x_m^t, y_m^t)\}_{m=1}^{n_t}$ of $n_t$ example pairs. At any time-step $j$ during online learning, we would like to minimize the loss on all the $t$ tasks seen so far ($\tau_{1:t}$):

$$\theta^j = \arg\min_{\theta^j} \sum_{i=1}^{t} \mathbb{E}_{\tau_i}\,\ell_i(\theta^j) = \arg\min_{\theta^j} \mathbb{E}_{\tau_{1:t}}\, L_t(\theta^j), \qquad (1)$$

where $\ell_i$ is the loss on $\tau_i$ using $\theta^j$, the learnt model parameters at time-step $j$, and $L_t = \sum_{i=1}^{t} \ell_i$ is the sum of all task-wise losses for tasks $\tau_{1:t}$. GEM [7] reformulates this problem as:

$$\min_{\tilde{g}} \ \frac{1}{2}\, \| g - \tilde{g} \|_2^2, \quad \text{s.t.}\ \langle \tilde{g}, g_p \rangle \geq 0, \ \forall p < t, \qquad (2)$$

where $g$ and $g_p$ are the gradient vectors computed on the current task and previous task $p$, respectively. This objective can also be treated as maximizing the dot products between gradients of a set of tasks [11]:

$$\theta^j = \arg\min_{\theta^j} \left( \sum_{i=1}^{t} \ell_i(\theta^j) - \alpha \sum_{p, q \leq t} \frac{\partial \ell_p(\theta^j)}{\partial \theta^j} \cdot \frac{\partial \ell_q(\theta^j)}{\partial \theta^j} \right), \qquad (3)$$

where $\alpha$ is a trade-off hyper-parameter.
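For illustration, the objective of Eq. (3) can be evaluated directly as a sum of pairwise gradient dot products. The sketch below is faithful to the formula but impractical for large networks, which motivates the approximations used by meta-learning methods; the toy model and batch construction are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
params = list(model.parameters())
# One loss per task batch (toy stand-in for the t tasks seen so far).
losses = [nn.functional.cross_entropy(model(torch.randn(4, 10)),
                                      torch.randint(0, 2, (4,)))
          for _ in range(3)]

def alignment_objective(losses, params, alpha=0.1):
    # Eq. (3): sum of task losses minus alpha times the sum of pairwise
    # dot products between task gradients; larger dot products mean the
    # tasks' gradients agree, i.e. less interference.
    grads = [torch.cat([g.reshape(-1)
                        for g in torch.autograd.grad(l, params,
                                                     retain_graph=True)])
             for l in losses]
    align = sum(torch.dot(gp, gq) for gp in grads for gq in grads)
    return sum(losses) - alpha * align

print(alignment_objective(losses, params))
```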
B. Model-Agnostic Meta-Learning
Model-Agnostic Meta-Learning (MAML) [38] is a gradient-based meta-learning approach aiming to learn meta-parameters that produce good task-specific parameters after
adaptation. Meta-parameters are learned in the meta-update
(outer-loop), while task-specific models are learned in the
inner-update (inner-loop). In every meta-update, its objective
at time-step $j$ can be formulated as:

$$\min_{\theta_0^j} \mathbb{E}_{\tau_{1:t}} \Big[ L_{\text{meta}} \big( \underbrace{\overbrace{U_k(\theta_0^j)}^{\text{inner-loop}}}_{\text{outer-loop}} \big) \Big] = \min_{\theta_0^j} \mathbb{E}_{\tau_{1:t}} \Big[ L_{\text{meta}}\big(\theta_k^j\big) \Big], \qquad (4)$$

where $\theta_0^j$ is the meta-parameter at time-step $j$ and $U_k(\theta_0^j) = \theta_k^j$ represents an update function where $\theta_k^j$ is the parameter after $k$ steps of stochastic gradient descent.
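The two loops of Eq. (4) can be sketched as follows (a first-order simplification for brevity: exact MAML would set `create_graph=True` in the inner loop and differentiate through it; the functional parameter-passing style and names are illustrative):

```python
import torch
import torch.nn as nn

def inner_update(params, inner_loss_fn, k, inner_lr):
    # U_k(theta_0^j): k steps of SGD starting from the meta-parameters.
    theta = [p.detach().clone().requires_grad_(True) for p in params]
    for _ in range(k):
        loss = inner_loss_fn(theta)
        grads = torch.autograd.grad(loss, theta)
        theta = [p - inner_lr * g for p, g in zip(theta, grads)]
    return theta  # theta_k^j

def meta_step(params, inner_loss_fn, meta_loss_fn, k, inner_lr, meta_lr):
    # Outer loop: evaluate the meta-loss at the adapted parameters and
    # apply its gradient to theta_0 (first-order approximation).
    theta_k = inner_update(params, inner_loss_fn, k, inner_lr)
    grads = torch.autograd.grad(meta_loss_fn(theta_k), theta_k)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= meta_lr * g

# Toy usage: a linear model written functionally so parameters can be swapped.
W = torch.randn(2, 10, requires_grad=True)
b = torch.zeros(2, requires_grad=True)
x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
loss_fn = lambda th: nn.functional.cross_entropy(
    nn.functional.linear(x, th[0], th[1]), y)
meta_step([W, b], loss_fn, loss_fn, k=3, inner_lr=0.01, meta_lr=0.1)
```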
[9] proved that MAML and its first-order variants like Reptile approximately optimize the same objective, which encourages gradients to align both within-task and across-task. [10] then showed the equivalence between the objective of GEM (Eq. (3)) and Reptile. This implies that the procedure of meta-learning an initialization coincides with learning optimal parameters for continual learning.