TAME: Task Agnostic Continual Learning using Multiple Experts
Haoran Zhu¹, Maryam Majzoubi², Arihant Jain¹, Anna Choromanska¹
¹New York University  ²Google
{hz1922, aj2622, ac5455}@nyu.edu, maryam.majzoubi@gmail.com
Abstract
The goal of lifelong learning is to continuously
learn from non-stationary distributions, where the non-
stationarity is typically imposed by a sequence of distinct
tasks. Prior works have mostly considered idealistic set-
tings, where the identity of tasks is known at least at train-
ing. In this paper we focus on a fundamentally harder, so-
called task-agnostic, setting where the task identities are
not known and the learning machine needs to infer them
from the observations. Our algorithm, which we call TAME
(Task-Agnostic continual learning using Multiple Experts),
automatically detects the shift in data distributions and
switches between task expert networks in an online man-
ner. At training, the strategy for switching between tasks
hinges on an extremely simple observation that for each
newly arriving task there occurs a statistically significant de-
viation in the value of the loss function that marks the onset
of this new task. At inference, the switching between ex-
perts is governed by the selector network that forwards the
test sample to its relevant expert network. The selector net-
work is trained on a small subset of data drawn uniformly
at random. We control the growth of the task expert networks
as well as of the selector network by employing pruning.
Our experimental results show the efficacy of our approach
on benchmark continual learning data sets, outperforming
the previous task-agnostic methods and even the techniques
that admit task identities at both training and testing, while
at the same time using a comparable model size.
1. Introduction
Learning agents deployed in real world applications are ex-
posed to a continuous stream of incrementally available in-
formation usually from non-stationary data distributions.
The agent is required to adaptively learn over time by ac-
commodating new experience while preserving previously
learned knowledge. This is referred to as lifelong or contin-
ual learning, which has been a long-established challenge in
artificial intelligence, including deep learning [8,30,39].
In the commonly considered scenario of lifelong learn-
ing, where the tasks come sequentially and each task is
a sequence of events from the same distribution, one of
the main challenges is to overcome catastrophic forget-
ting, where training the model on a new task interferes
with the previously acquired knowledge and leads to
performance deterioration on the previously seen tasks.
Deep neural networks generally perform well on classifi-
cation tasks, but they heavily rely on having i.i.d. data
samples drawn from a stationary distribution during training
time [7,17,33]. In the case of sequential tasks, their perfor-
mance significantly deteriorates when learning newly arriving
tasks [14, 24–26, 30].
A number of approaches have been suggested in the liter-
ature to deal with catastrophic forgetting. Some works [11,
40] provide a systematic categorization of the continual learn-
ing frameworks and identify three different scenarios: in-
cremental task learning, incremental domain learning, and
incremental class learning, where their differences stem
from the availability of task labels at testing and the number
of output heads. In incremental class and domain learn-
ing the task identity is not known during testing. All
of these scenarios however are based on the assumption
that the task labels are known at the training phase. This
assumption is limiting in practical real-world applications,
where the agent needs to learn in a more challenging task-
agnostic setting [18,31,32,44]. In this learning setting the
task identities are not available at either training or infer-
ence time. The literature started exploring this setting only
recently, and it is the central focus of our paper.
In this work, we present an approach for handling task-
agnostic continual learning inspired by the older approaches
dedicated to learning non-stationary sequences based on ex-
pert advice [10,27,28], which explore and exploit the in-
termittent switches between distinct stationary processes. In
these approaches the learner can make predictions on the
basis of a fixed set of experts. Since the learner does not
know the mechanisms by which the experts arrive at their
predictions, it ought to exploit the information obtained by
observing the losses of the experts. Based on the experts’
losses it weights the experts to attenuate poor performers
and emphasize the good ones, and forms the final prediction as the weighted sum of experts' predictions.
Figure 1. Deviation of the value of the loss function of the expert when the task is switched.
Thus the
learner needs to identify the best expert at each time and
switch between the experts when the task switches occur.
In the aforementioned works, the weights over the experts
are the only carriers of the memory of previous experi-
ences. Also, the discussed methods rely on the assumption
that the number of experts/tasks is known in advance. Fi-
nally, these methods do not consider a separate train and
test phase, but rather their optimization process is focused
on minimizing the regret, which is the difference between
the cumulative loss of the algorithm and the loss of the
best method in the same class, chosen in hindsight (hind-
sight refers to full knowledge of the sequence to be pre-
dicted). Minimizing the regret however is not equivalent to
counter-acting catastrophic forgetting since previous tasks
that present little relevance to the currently learned ones
are gradually being overwritten in memory. These meth-
ods thus are not directly applicable to the continual learning
setting.
Motivated by having a set of experts representing a se-
quence of tasks, where each task is essentially a stationary
segment of a longer non-stationary distribution,
we propose a learning system that initially starts with
one expert and gradually adds or switches between experts
when the tasks change. During the online training phase our
algorithm automatically identifies when the task switches
and either selects or creates the best expert for a new task,
depending on whether this task was seen before or not. The de-
tection of task switches relies on the statistically significant
deviation of the loss function value of the current expert,
which marks the onset of the new task (see Figure 1). Sim-
ilarly, the determination whether the task was seen before
or not relies on the behavior of the per-expert loss func-
tions (if the deviations of all per-expert loss values are high,
a new expert is created to represent the current task). Such
a simple detection mechanism is inspired by the classical ex-
pert advice literature discussed in the previous paragraph,
where switching between experts is governed by the values
of the loss functions of the experts. Moreover, we introduce
a selector network which predicts the task identity of the
samples at inference time. The selector network is trained
on a small subset of training examples that were sampled
uniformly at random from different tasks during the learn-
ing process. Despite the simplicity of our approach, it leads
to a task-agnostic continual learning algorithm that com-
pares favorably to existing methods and proves that a rich
historical literature on online processing of non-stationary
sequences can provide useful signal processing tools for ad-
dressing challenges in the modern continual learning discipline.
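To make the switching mechanism concrete, the sketch below illustrates one plausible instantiation of the detection and expert-selection logic described above. The paper only requires that a statistically significant deviation in the loss value marks a task switch; the sliding-window z-score test, the window size and threshold, and the helper names (SwitchDetector, select_or_create_expert, make_expert) are illustrative assumptions rather than TAME's exact procedure.

```python
from collections import deque
import statistics


class SwitchDetector:
    """Sliding-window z-score test on an expert's loss values.

    A large positive deviation of the current loss from the recent mean is
    treated as the onset of a new task. Window size and threshold are
    illustrative defaults, not values taken from the paper.
    """

    def __init__(self, window: int = 50, z_threshold: float = 3.0, min_history: int = 10):
        self.losses = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.min_history = min_history

    def is_deviation(self, loss_value: float) -> bool:
        """Return True if loss_value deviates significantly from recent history."""
        if len(self.losses) < self.min_history:   # not enough history yet
            self.losses.append(loss_value)
            return False
        mean = statistics.mean(self.losses)
        std = statistics.pstdev(self.losses) or 1e-8
        deviated = (loss_value - mean) / std > self.z_threshold
        if not deviated:                          # keep the window free of outliers
            self.losses.append(loss_value)
        return deviated


def select_or_create_expert(experts, detectors, batch_losses, make_expert):
    """Pick an existing expert whose loss on the incoming batch is not anomalous;
    if every expert's loss deviates, spawn a new expert for the unseen task.

    batch_losses[i] is expert i's loss on the current batch, and make_expert()
    builds a fresh (pruned) expert network; both are hypothetical helpers.
    """
    deviations = [det.is_deviation(loss) for det, loss in zip(detectors, batch_losses)]
    if all(deviations):                           # no expert fits: grow the pool
        experts.append(make_expert())
        detectors.append(SwitchDetector())
        return len(experts) - 1                   # index of the newly created expert
    candidates = [i for i, dev in enumerate(deviations) if not dev]
    return min(candidates, key=lambda i: batch_losses[i])  # best-fitting existing expert
```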
The rest of the paper is organized as follows: Section 2
discusses the most relevant work. Section 3 introduces our
algorithm which we call TAME: Task Agnostic continual
learning using Multiple Experts. Section 4 reports empir-
ical results on benchmark continual learning data sets, and
finally Section 5 concludes the paper.
2. Related Work
In recent years, there has been a plethora of techniques pro-
posed for continual learning that mitigate the catastrophic
forgetting problem in deep neural networks. The existing
approaches can be divided into three categories: i) com-
plementary learning systems and memory replay methods,
ii) regularization-based methods, and iii) dynamic architec-
ture methods. These techniques are not dedicated to the
task-agnostic scenario since they assume the identity of the
tasks is provided at least during the training phase. On the
other hand, the more challenging task-agnostic continual learn-
ing setting was addressed only recently in a handful of pa-
pers. We review them first since our paper considers the
same setting. For completeness we also discuss the most
relevant works from the broad continual learning literature
and refer the reader to a survey paper [30] that provides a
more comprehensive review of these approaches.
Task-Agnostic Continual Learning In the context of the su-
pervised learning setting, which is the central focus of this
paper, one of the first methods addressing task-agnostic
continual learning is the Bayesian Gradient Descent algo-
rithm, popularly known as BGD [44]. This approach is
based on an online version of variational Bayes and pro-
poses a Bayesian learning update rule for the mean and vari-
ance of each parameter. Like all Bayesian approaches, this
method counter-acts catastrophic forgetting by using the
posterior distribution of the parameters for the previous task
as a prior for the new task. BGD obtains the most promising
empirical results in the setting where the method relies on
the so-called “label trick”, in which the task identity is inferred
from the class label. The label trick, however, breaks the task-
agnostic assumption. Another approach, called iTAML [31],
proposes to use meta-learning to maintain a set of gener-
alized parameters that represent all tasks. When presented
with a continuum of data at inference, the model automati-
cally identifies the task and quickly adapts to it with just a
single update. However, at training, the inner loop of their al-
gorithm, which generates task-specific models for each task
that are then combined in the outer loop to form a more
generic model, requires knowledge of the task label. At
inference, the task is predicted using generalized model pa-
rameters. Specifically, for each sample in the continuum,
the output of the general model is obtained and a maxi-
mum response per task is recorded. An average of the max-
imum responses per task is used as the task score. The task
with the maximum score is finally predicted. iTAML counter-
acts catastrophic forgetting by keeping a memory buffer of
samples from different tasks and using it to fine-tune gener-
alized parameters representing all tasks to a currently seen
one. This method is not task-agnostic, since it requires
task labels at training, though the authors categorize their
method as task-agnostic. CN-DPM [18] is an expansion-
based method that eliminates catastrophic forgetting by al-
locating new resources to learn new data. They formulate
the task-agnostic continual learning problem as an online
variational inference of Dirichlet process mixture models
consisting of a set of neural experts. Each expert is in charge
of a subset of the data. Each expert is associated with a dis-
criminative model (classifier) and a generative model (den-
sity estimator). For a new sample, they first decide whether
the sample should be assigned to an existing expert or a new
expert should be created for it. This is done by computing
the responsibility scores of the experts for the considered
sample and is supported by a short-term memory (STM)
collecting sufficient data. Specifically, when a data point is
classified as new, they store it to the STM. Once the STM
reaches its maximum capacity, they train a new expert with
the data in the STM. Another technique for task-agnostic
continual learning, known as HCL [15], models the distri-
bution of each task and each class with a normalizing flow
model. For task identification, they use the state-of-the-art
anomaly detection techniques based on measuring the typ-
icality of the model's statistics. To avoid catastrophic
forgetting, they use a combination of generative replay and
a functional regularization technique.
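As an illustration of the task-prediction rule described above for iTAML, the sketch below assumes the generalized model returns one logit per class and that task_classes[t] lists the class indices belonging to task t; these names and the PyTorch-style tensors are assumptions made for exposition, not iTAML's actual implementation.

```python
import torch


def predict_task(general_model, continuum, task_classes):
    """Average-of-maximum-responses task inference, as described above.

    For every sample in the continuum, record the maximum logit restricted
    to each task's classes, average these maxima per task over the whole
    continuum, and predict the task with the highest average score.
    task_classes[t] is the (assumed) list of class indices owned by task t.
    """
    with torch.no_grad():
        logits = general_model(continuum)                        # (num_samples, num_classes)
    scores = []
    for classes in task_classes:
        per_sample_max = logits[:, classes].max(dim=1).values    # max response per sample
        scores.append(per_sample_max.mean().item())              # average over the continuum
    return max(range(len(scores)), key=scores.__getitem__)       # task with the highest score
```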
In the context of the unsupervised learning setting, the VASE
method [1] addresses representation learning from piece-
wise stationary visual data based on a variational autoen-
coder with shared embeddings. The emphasis of this work
is put on learning shared representations across domains.
The method automatically detects shifts in the training data
distribution and uses this information to allocate spare latent
capacity to novel data set-specific disentangled representa-
tions, while reusing previously acquired representations of
latent dimensions where applicable. The authors represent data
sets using a set of data generative factors, where two data
sets may use the same generative factors but render them
differently, or they may use a different subset of factors
altogether. They next use the Minimum Description Length
principle to determine, up to a threshold, whether the average re-
construction error of the relevant generative factors for the
current data matches that of the previous data sets. Al-
locating spare representational capacity to new knowledge
protects previously learnt representations from catastrophic
forgetting. Another technique called CURL [32] learns a
task-specific representation on top of a larger set of shared
parameters while dynamically expanding model capacity
to capture new tasks. The method represents tasks using
a mixture of Gaussians and expands the model as needed,
by maintaining a small set of poorly-modelled samples and
then initialising and fitting a new mixture component to this
set when it reaches a critical size. The method also relies
on generative replay models to alleviate catastrophic for-
getting.
Non Task-Agnostic Continual Learning The first family of
non task-agnostic continual learning techniques consists of
complementary learning systems and memory replay meth-
ods. They rely on replaying selected samples from the prior
tasks. These samples are incorporated into the current learn-
ing process so that at each step the model is trained on a
mixture of samples from a new task as well as a small sub-
set of samples from the previously seen tasks. Some tech-
niques focus on efficiently selecting and storing prior expe-
riences through different selection strategies [4,13]. Other
approaches, e.g. GEM [21], A-GEM [6], and MER [33] fo-
cus on favoring positive backward transfer to previous tasks.
Finally, there are deep generative replay approaches [34,36]
that substitute the replay memory buffer with a generative
model to learn the data distribution of previous tasks and
generate samples accordingly when learning a new task.
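For concreteness, the sketch below illustrates the generic replay recipe described in this paragraph: each optimization step mixes the current task's batch with a small uniformly sampled subset of stored examples. The buffer format (a list of (x, y) tensor pairs), the mixing ratio, and the function name replay_training_step are assumptions for illustration and do not reproduce the exact procedure of any method cited above.

```python
import random
import torch


def replay_training_step(model, optimizer, loss_fn, new_batch, replay_buffer, replay_size=32):
    """One optimization step on a mixture of current-task data and a small
    random sample from a buffer of previously seen (x, y) tensor pairs.

    This is a generic illustration of memory replay, not the exact procedure
    of the methods cited above.
    """
    xs, ys = new_batch
    if replay_buffer:
        old = random.sample(replay_buffer, min(replay_size, len(replay_buffer)))
        xs = torch.cat([xs, torch.stack([x for x, _ in old])])   # append replayed inputs
        ys = torch.cat([ys, torch.stack([y for _, y in old])])   # append replayed labels
    optimizer.zero_grad()
    loss = loss_fn(model(xs), ys)
    loss.backward()
    optimizer.step()
    return loss.item()
```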
Another family of techniques, known as regularization-
based methods, enforces a constraint on the parameter up-
date of the neural network, usually by adding a regular-
ization term to the objective function. This term penal-
izes the change in the model parameters when the new task
is observed and ensures they stay close to the parameters
learned on the previous tasks. Among these techniques,
we identify a few famous algorithms such as EWC [16],
SI [43], MAS [3], and RWALK [5] that introduce different
notions of the importance of synapses or parameters and pe-
nalize changes to high-importance parameters, as well as
the LwF [20] method that can be seen as a combination of
knowledge distillation and fine-tuning. Finally, the last fam-
ily of techniques comprises the dynamic architecture methods that
expand the architecture of the network by allocating addi-
tional resources, i.e., neurons or layers, to new tasks which
is usually accompanied by additional parameter pruning
and masking. This family consists of such techniques as
expert-gate method [2], progressive networks [35], dynam-