with a continuum of data at inference, the model automati-
cally identifies the task and quickly adapts to it with just a
single update. However, at training time, the inner loop of their algorithm, which generates task-specific models for each task that are then combined in the outer loop to form a more generic model, requires knowledge of the task labels.
At inference, the task is predicted using the generalized model parameters. Specifically, for each sample in the continuum, the output of the generalized model is obtained and the maximum response per task is recorded. The average of these maximum responses over the continuum serves as the task score, and the task with the highest score is predicted. iTAML counteracts catastrophic forgetting by keeping a memory buffer of samples from different tasks and using it to fine-tune the generalized parameters, which represent all tasks, to the currently seen task. Strictly speaking, this method is not task-agnostic, since it requires task labels at training time, even though the authors categorize it as task-agnostic.
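The described prediction step can be summarized with the sketch below; the model outputs, the class-to-task grouping, and the function name are hypothetical placeholders rather than the authors' implementation.

```python
import numpy as np

def predict_task(logits, classes_per_task):
    """Sketch of the described task-inference step (not the iTAML code).

    logits: array of shape (num_samples, num_classes), outputs of the
        generalized model for every sample in the continuum.
    classes_per_task: list of index arrays, one per task, marking which
        output units belong to that task.
    """
    task_scores = []
    for cls_idx in classes_per_task:
        # maximum response within this task's classes, for each sample
        max_per_sample = logits[:, cls_idx].max(axis=1)
        # average the maxima over the continuum to obtain the task score
        task_scores.append(max_per_sample.mean())
    # the task with the highest score is predicted
    return int(np.argmax(task_scores))

# toy usage: 5 samples, 2 tasks with 3 classes each
logits = np.random.randn(5, 6)
print(predict_task(logits, [np.arange(0, 3), np.arange(3, 6)]))
```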
CN-DPM [18] is an expansion-based method that eliminates catastrophic forgetting by allocating new resources to learn new data. They formulate
the task-agnostic continual learning problem as an online
variational inference of Dirichlet process mixture models
consisting of a set of neural experts. Each expert is in charge
of a subset of the data. Each expert is associated with a dis-
criminative model (classifier) and a generative model (den-
sity estimator). For a new sample, they first decide whether
the sample should be assigned to an existing expert or a new
expert should be created for it. This is done by computing
the responsibility scores of the experts for the considered
sample and is supported by a short-term memory (STM)
collecting sufficient data. Specifically, when a data point is classified as new, they store it in the STM; once the STM reaches its maximum capacity, they train a new expert on the data in the STM.
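The assignment loop can be sketched as follows; the expert interface, the novelty score, and the training callback are simplifying assumptions and not the CN-DPM implementation.

```python
import numpy as np

class ExpertRouter:
    """Sketch of the expert-assignment loop with a short-term memory."""

    def __init__(self, stm_capacity, novelty_score, train_new_expert):
        self.experts = []                         # objects exposing .log_prob(x)
        self.stm = []                             # short-term memory buffer
        self.stm_capacity = stm_capacity
        self.novelty_score = novelty_score        # fixed score of the "new expert" option
        self.train_new_expert = train_new_expert  # callable: list of samples -> expert

    def observe(self, x):
        # responsibility of each existing expert plus the "new expert" option
        scores = [e.log_prob(x) for e in self.experts] + [self.novelty_score]
        winner = int(np.argmax(scores))
        if winner == len(self.experts):
            # the sample looks novel: keep it in the STM
            self.stm.append(x)
            if len(self.stm) >= self.stm_capacity:
                # enough novel data collected: fit a new expert on the STM
                self.experts.append(self.train_new_expert(self.stm))
                self.stm = []
            return None
        return winner                             # index of the responsible expert
```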
Another technique for task-agnostic continual learning, known as HCL [15], models the distribution of each task and each class with a normalizing flow
model. For task identification, they use state-of-the-art anomaly detection techniques based on measuring the typicality of the model's statistics. To avoid catastrophic forgetting, they use a combination of generative replay and a functional regularization technique.
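A minimal sketch of such typicality-based task identification is given below; the per-task density models and their expected negative log-likelihoods are assumed to be available and are not part of the HCL code.

```python
import numpy as np

def identify_task(batch, task_models, expected_nll):
    """Assign a batch to the task whose flow model finds it most typical.

    batch: samples drawn from the current, unknown task.
    task_models: dict task_id -> density model exposing .log_prob(x).
    expected_nll: dict task_id -> expected negative log-likelihood of the
        model on its own task, estimated on held-out data beforehand.
    """
    deviations = {}
    for task_id, model in task_models.items():
        batch_nll = -np.mean([model.log_prob(x) for x in batch])
        # a batch is "typical" if its NLL is close to the expected NLL
        deviations[task_id] = abs(batch_nll - expected_nll[task_id])
    return min(deviations, key=deviations.get)
```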
In the unsupervised learning setting, the VASE method [1] addresses representation learning from piecewise stationary visual data based on a variational autoencoder with shared embeddings. The emphasis of this work
is put on learning shared representations across domains.
The method automatically detects shifts in the training data
distribution and uses this information to allocate spare latent
capacity to novel data set-specific disentangled representa-
tions, while reusing previously acquired representations of
latent dimensions where applicable. The authors represent data sets using a set of data generative factors, where two data sets may use the same generative factors but render them differently, or may use a different subset of factors altogether. They then use the Minimum Description Length principle to determine, via a threshold on the average reconstruction error of the relevant generative factors, whether the current data matches any of the previously seen data sets. Allocating spare representational capacity to new knowledge protects previously learnt representations from catastrophic forgetting.
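The shift-detection rule described above can be illustrated with the toy check below; the interface, the error estimates, and the threshold are placeholders rather than the VASE implementation.

```python
import numpy as np

def matches_previous_dataset(current_errors, baseline_errors, threshold):
    """Decide whether the current data matches a previously seen data set.

    current_errors: average reconstruction error of the relevant generative
        factors on the current data, one value per previous data set.
    baseline_errors: the error each previous data set achieved on its own data.
    Returns the index of the matching data set, or None if the data is novel
    and spare latent capacity should be allocated.
    """
    gaps = np.abs(np.asarray(current_errors) - np.asarray(baseline_errors))
    best = int(np.argmin(gaps))
    return best if gaps[best] < threshold else None
```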
Another technique called CURL [32] learns a task-specific representation on top of a larger set of shared
parameters while dynamically expanding model capacity
to capture new tasks. The method represents tasks using
a mixture of Gaussians and expands the model as needed,
by maintaining a small set of poorly-modelled samples and
then initialising and fitting a new mixture component to this
set when it reaches a critical size. The method also relies
on generative replay models to alleviate catastrophic forgetting.
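The expansion step can be sketched as follows; the Gaussian components, the threshold, and the component-fitting routine are illustrative assumptions, not the CURL code.

```python
import numpy as np

def maybe_expand(z, components, buffer, threshold, critical_size, fit_component):
    """Track poorly-modelled latent samples and expand the mixture when needed.

    z: latent representation of the current sample.
    components: list of (mean, cov) tuples of the Gaussian mixture.
    buffer: running set of poorly-modelled samples.
    """
    def log_gauss(z, mean, cov):
        d = z - mean
        return -0.5 * (d @ np.linalg.solve(cov, d)
                       + np.linalg.slogdet(2.0 * np.pi * cov)[1])

    # density of z under its best-matching component
    best = max(log_gauss(z, m, c) for m, c in components)
    if best < threshold:                      # sample is poorly modelled
        buffer.append(z)
        if len(buffer) >= critical_size:
            # initialise and fit a new mixture component to the buffer
            components.append(fit_component(buffer))
            buffer.clear()
```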
Non Task-Agnostic Continual Learning The first family of non task-agnostic continual learning techniques consists of
complementary learning systems and memory replay meth-
ods. They rely on replaying selected samples from the prior
tasks. These samples are incorporated into the current learn-
ing process so that at each step the model is trained on a
mixture of samples from a new task as well as a small sub-
set of samples from the previously seen tasks. Some tech-
niques focus on efficiently selecting and storing prior expe-
riences through different selection strategies [4,13]. Other
approaches, e.g., GEM [21], A-GEM [6], and MER [33], focus on favoring positive backward transfer to previous tasks.
Finally, there are deep generative replay approaches [34,36]
that substitute the replay memory buffer with a generative
model that learns the data distribution of previous tasks and generates samples accordingly when learning a new task.
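A generic version of this rehearsal scheme is sketched below; the replay fraction and the reservoir-sampling buffer update are illustrative choices rather than the procedure of any one cited method.

```python
import random

def replay_batch(new_task_batch, memory_buffer, replay_fraction=0.2):
    """Mix current-task samples with a small subset of stored past samples."""
    k = int(len(new_task_batch) * replay_fraction)
    replayed = random.sample(memory_buffer, min(k, len(memory_buffer)))
    return list(new_task_batch) + replayed

def update_buffer(memory_buffer, new_task_batch, capacity, seen_so_far):
    """Reservoir sampling keeps the buffer a uniform sample over all tasks."""
    for x in new_task_batch:
        seen_so_far += 1
        if len(memory_buffer) < capacity:
            memory_buffer.append(x)
        else:
            j = random.randrange(seen_so_far)
            if j < capacity:
                memory_buffer[j] = x
    return seen_so_far
```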
Another family of techniques, known as regularization-based methods, enforces a constraint on the parameter updates of the neural network, usually by adding a regularization term to the objective function. This term penalizes changes in the model parameters when a new task is observed and ensures that they stay close to the parameters learned on the previous tasks. Among these techniques, we identify a few well-known algorithms, such as EWC [16], SI [43], MAS [3], and RWALK [5], that introduce different notions of the importance of synapses or parameters and penalize changes to high-importance parameters, as well as the LwF [20] method, which can be seen as a combination of knowledge distillation and fine-tuning.
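For instance, EWC [16] instantiates this term as a quadratic penalty weighted by the diagonal Fisher information estimated on the previous task A,

\[
\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta^{*}_{A,i}\right)^2 ,
\]

where $\theta^{*}_{A}$ are the parameters learned on task A, $F_i$ measures how important the $i$-th parameter was for that task, and $\lambda$ controls how strongly old knowledge is protected.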
Finally, the last family of techniques consists of the dynamic architecture methods, which
expand the architecture of the network by allocating addi-
tional resources, i.e., neurons or layers, to new tasks, which is usually accompanied by additional parameter pruning and masking. This family consists of such techniques as the expert-gate method [2], progressive networks [35], dynam-