Efficient Meta-Learning for Continual Learning
with Taylor Expansion Approximation
1st Xiaohan Zou
Boston University
zxh@bu.edu
2nd Tong Lin
Key Lab. of Machine Perception (MoE), School of AI,
Center for Data Science, Peking University
lintong@pku.edu.cn
Abstract—Continual learning aims to alleviate catastrophic
forgetting when handling consecutive tasks under non-stationary
distributions. Gradient-based meta-learning algorithms have
shown the capability to implicitly solve the transfer-interference
trade-off problem between different examples. However, they still
suffer from the catastrophic forgetting problem in the setting
of continual learning, since the past data of previous tasks are
no longer available. In this work, we propose a novel efficient
meta-learning algorithm for solving the online continual learning
problem, where the regularization terms and learning rates are
adapted to the Taylor approximation of the parameter importance to mitigate forgetting. The proposed method expresses the gradient of the meta-loss in closed form and thus avoids computing second-order derivatives, which is computationally prohibitive. We also use proximal gradient descent to further improve computational efficiency and accuracy. Experiments on
diverse benchmarks show that our method achieves better or on-par performance and much higher efficiency compared with state-of-the-art approaches.
Index Terms—meta-learning, continual learning
I. INTRODUCTION
Catastrophic forgetting [1], [2] poses a major challenge to
artificial intelligence systems: when switching to a new task,
the system performance may degrade on the previously trained
tasks. Continual learning addresses this challenge: models must be stable enough to prevent forgetting while remaining flexible enough to acquire new knowledge.
To alleviate catastrophic forgetting, several categories of continual learning methods have been proposed: regularization approaches that penalize changes to important weights [3], [4], methods that modify the architecture of neural networks [5], [6], and rehearsal approaches that introduce an episodic memory to store and replay previously learned samples [7], [8].
The basic idea of rehearsal-based approaches like Gra-
dient Episodic Memory (GEM) [7] is to ensure gradient-
alignment across tasks such that the losses of the past tasks
in episodic memory will not increase. Interestingly, this ob-
jective coincides with the implicit objective of gradient-based
meta-learning algorithms [9]–[11]. Further, meta-learning al-
gorithms show promise to generalize better on future tasks
[10], [12]. Meta Experience Replay (MER) [10] integrates the first-order meta-learning algorithm Reptile [9] with an experience replay module to reduce interference between old and new tasks. To alleviate the slow training speed of MER, Lookahead-MAML (La-MAML) [11] proposes a more efficient meta-objective for online continual learning. La-MAML further incorporates learnable per-parameter learning rates to reduce catastrophic forgetting, achieving state-of-the-art performance.

This work was supported by NSFC Tianyuan Fund for Mathematics (No. 12026606), National Key R&D Program of China (No. 2018AAA0100300), and Beijing Academy of Artificial Intelligence (BAAI). Correspondence to Tong Lin (lintong@pku.edu.cn).
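The gradient-alignment idea behind GEM mentioned above can be illustrated with a simplified single-constraint projection (in the spirit of A-GEM; GEM itself solves a quadratic program over all tasks in memory). The function name and the two-parameter toy example below are hypothetical, for illustration only:

```python
import numpy as np

def project_gradient(g, g_ref):
    """If the proposed update g is negatively aligned with the memory
    gradient g_ref (i.e., it would increase the loss on replayed
    examples), project g onto the half-space where the memory loss
    does not increase. Simplified single-constraint sketch; GEM
    solves a quadratic program with one constraint per past task."""
    dot = float(g @ g_ref)
    if dot < 0:
        g = g - (dot / float(g_ref @ g_ref)) * g_ref
    return g

g = np.array([1.0, -1.0])      # current-task gradient
g_ref = np.array([0.0, 1.0])   # gradient on episodic memory
g_proj = project_gradient(g, g_ref)
assert g_proj @ g_ref >= 0     # projected update no longer interferes
```

When the two gradients already align (non-negative dot product), the update is left unchanged, which is exactly the "transfer" side of the transfer-interference trade-off.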
However, though La-MAML proposes a more efficient objective, it still requires explicitly computing the Hessian matrix, which is computationally prohibitive for large networks. Moreover, learning a separate learning rate for each parameter entails additional computational overhead and increases memory usage.
To overcome these difficulties, in this paper, we present
a novel efficient gradient-based meta-learning algorithm for
online continual learning. The proposed method solves the
meta-optimization problem without accessing the Hessian in-
formation of the empirical risk. Inspired by regularization-
based methods, we compute the parameter importance using
the first-order Taylor series and assign the learning rates
according to the parameter importance. In this way, no extra trainable parameters are introduced, which reduces computational complexity and memory usage. We also
impose explicit regularization terms in the inner loss to achieve
better performance and apply proximal gradient descent to
improve efficiency. Our approach performs competitively on four commonly used benchmark datasets, achieving better or on-par performance relative to La-MAML and other state-of-the-art approaches in much shorter training time.
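The importance-adapted learning rate idea can be sketched as follows. The accumulation rule (magnitude of the first-order Taylor term of the loss change) and the scaling rule `base_lr / (1 + importance)` are hypothetical illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def update_importance(omega, grads, delta_theta):
    """Accumulate a first-order Taylor estimate of importance: the loss
    change from moving parameter i by delta_theta_i is approximately
    g_i * delta_theta_i, so its magnitude scores how sensitive the
    loss is to that parameter."""
    return omega + np.abs(grads * delta_theta)

def importance_scaled_lr(base_lr, omega):
    """Shrink the step size of important parameters instead of
    learning a separate trainable rate per parameter."""
    return base_lr / (1.0 + omega)

# toy example: the first parameter has a much larger gradient,
# so it accumulates more importance and receives a smaller step
omega = np.zeros(2)
omega = update_importance(omega, np.array([2.0, 0.1]), np.array([0.5, 0.5]))
lrs = importance_scaled_lr(0.1, omega)
assert lrs[0] < lrs[1]
```

Because the importance statistics are derived from quantities already available during training (gradients and parameter updates), no additional trainable parameters or backward passes are needed.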
II. RELATED WORK
A. Continual Learning
Existing continual learning approaches are broadly classified into regularization-based, rehearsal-based, and dynamic network architecture-based approaches [13].
Regularization-based methods penalize major changes by quantifying parameter importance on previous tasks while using a fixed model capacity. Parameter importance can be estimated from the Fisher information matrix [3], the loss [4], the sensitivity of the outputs with respect to the parameters [14], or trainable attention masks [15]. A number of studies restrain weight updates from
Bayesian perspectives [16]–[20]. Several recently proposed
methods also consider forcing weight updates to belong to
the null space of the feature covariance [21], [22].
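The quadratic-penalty idea shared by these regularization-based methods can be illustrated as follows; the importance weights stand in for, e.g., the diagonal Fisher information used by EWC [3], and the helper name is hypothetical:

```python
import numpy as np

def quadratic_importance_penalty(theta, theta_old, importance, lam):
    """Anchor each parameter to its value after the previous task,
    weighted by its estimated importance (e.g., diagonal Fisher
    information in EWC [3]). Illustrative sketch only."""
    return 0.5 * lam * float(np.sum(importance * (theta - theta_old) ** 2))

theta_old = np.array([1.0, 1.0])
importance = np.array([10.0, 0.1])  # first parameter mattered for task 1
# moving the important parameter is penalized far more heavily
p_hi = quadratic_importance_penalty(np.array([2.0, 1.0]), theta_old, importance, 1.0)
p_lo = quadratic_importance_penalty(np.array([1.0, 2.0]), theta_old, importance, 1.0)
assert p_hi > p_lo
```

Adding this penalty to the new-task loss lets the network use its fixed capacity for new knowledge while discouraging changes that would erase old knowledge.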
arXiv:2210.00713v1 [cs.LG] 3 Oct 2022