TAME: Task Agnostic Continual Learning using Multiple Experts
Haoran Zhu¹, Maryam Majzoubi², Arihant Jain¹, Anna Choromanska¹
¹New York University  ²Google
{hz1922, aj2622, ac5455}@nyu.edu, maryam.majzoubi@gmail.com
Abstract
The goal of lifelong learning is to continuously
learn from non-stationary distributions, where the non-
stationarity is typically imposed by a sequence of distinct
tasks. Prior works have mostly considered idealistic set-
tings, where the identity of tasks is known at least at train-
ing. In this paper we focus on a fundamentally harder, so-
called task-agnostic, setting where the task identities are
not known and the learning machine needs to infer them
from the observations. Our algorithm, which we call TAME
(Task-Agnostic continual learning using Multiple Experts),
automatically detects the shift in data distributions and
switches between task expert networks in an online man-
ner. At training, the strategy for switching between tasks
hinges on an extremely simple observation that for each
newly arriving task there occurs a statistically significant de-
viation in the value of the loss function that marks the onset
of this new task. At inference, the switching between ex-
perts is governed by the selector network that forwards the
test sample to its relevant expert network. The selector net-
work is trained on a small subset of data drawn uniformly
at random. We control the growth of the task expert networks
as well as of the selector network by employing pruning.
Our experimental results show the efficacy of our approach
on benchmark continual learning data sets, outperforming
the previous task-agnostic methods and even the techniques
that admit task identities at both training and testing, while
at the same time using a comparable model size.
1. Introduction
Learning agents deployed in real world applications are ex-
posed to a continuous stream of incrementally available in-
formation usually from non-stationary data distributions.
The agent is required to adaptively learn over time by ac-
commodating new experience while preserving previously
learned knowledge. This is referred to as lifelong or contin-
ual learning, which has been a long-established challenge in
artificial intelligence, including deep learning [8,30,39].
In the commonly considered scenario of lifelong learn-
ing, where the tasks come sequentially and each task is
a sequence of events from the same distribution, one of
the main challenges is to overcome catastrophic forget-
ting, where training the model on a new task interferes
with the previously acquired knowledge and leads to
performance deterioration on the previously seen tasks.
Deep neural networks generally perform well on classifi-
cation tasks, but they heavily rely on having i.i.d. data
samples drawn from a stationary distribution during training
time [7,17,33]. In the case of sequential tasks, their perfor-
mance significantly deteriorates when learning newly arriving
tasks [14, 24–26, 30].
A number of approaches have been suggested in the liter-
ature to deal with catastrophic forgetting. Some works [11,
40] provide a systematic categorization of the continual learn-
ing frameworks and identify three different scenarios: in-
cremental task learning, incremental domain learning, and
incremental class learning, where their differences stem
from the availability of task labels at testing and the number
of output heads. In incremental class and domain learn-
ing the task identity is not known during testing. All
of these scenarios however are based on the assumption
that the task labels are known at the training phase. This
assumption is limiting in practical real-world applications,
where the agent needs to learn in a more challenging task-
agnostic setting [18,31,32,44]. In this learning setting the
task identities are not available at either training or infer-
ence time. The literature started exploring this setting only
recently, and it is the central focus of our paper.
In this work, we present an approach for handling task-
agnostic continual learning inspired by the older approaches
dedicated to learning non-stationary sequences based on ex-
pert advice [10,27,28], which explore and exploit the in-
termittent switches between distinct stationary processes. In
these approaches the learner can make predictions on the
basis of a fixed set of experts. Since the learner does not
know the mechanisms by which the experts arrive at their
predictions, it ought to exploit the information obtained by
observing the losses of the experts. Based on the experts’
losses it weights the experts to attenuate poor performers
and emphasize the good ones, and forms the final prediction as the weighted sum of experts' predictions.
Figure 1. Deviation of the value of the loss function of the expert when the task is switched.
Thus the
learner needs to identify the best expert at each time and
switch between the experts when the task switches occur.
In the aforementioned works, the weights over the experts
are the only carriers of the memory of previous experi-
ences. Also, the discussed methods rely on the assumption
that the number of experts/tasks is known in advance. Fi-
nally, these methods do not consider a separate train and
test phase, but rather their optimization process is focused
on minimizing the regret, which is the difference between
the cumulative loss of the algorithm and the loss of the
best method in the same class, chosen in hindsight (hind-
sight refers to full knowledge of the sequence to be pre-
dicted). Minimizing the regret however is not equivalent to
counter-acting catastrophic forgetting since previous tasks
that present little relevance to the currently learned ones
are gradually being overwritten in memory. These meth-
ods thus are not directly applicable to the continual learning
setting.
Motivated by having a set of experts representing a se-
quence of tasks, where each task is essentially a stationary
segment of a longer non-stationary distribution,
we propose a learning system that initially starts with
one expert and gradually adds or switches between experts
when the tasks change. During the online training phase our
algorithm automatically identifies when the task switches
and either selects or creates the best expert for a new task,
depending on whether this task was seen before or not. The de-
tection of task switches relies on the statistically significant
deviation of the loss function value of the current expert,
which marks the onset of the new task (see Figure 1). Sim-
ilarly, the determination whether the task was seen before
or not relies on the behavior of the per-expert loss func-
tions (if the deviations of all per-expert loss values are high,
a new expert is created to represent the current task). Such
a simple detection mechanism is inspired by the classical ex-
pert advice literature discussed in the previous paragraph,
where switching between experts is governed by the values
of the loss functions of the experts. Moreover, we introduce
a selector network which predicts the task identity of the
samples at inference time. The selector network is trained
on a small subset of training examples that were sampled
uniformly at random from different tasks during the learn-
ing process. Despite the simplicity of our approach, it leads
to a task-agnostic continual learning algorithm that com-
pares favorably to existing methods and proves that a rich
historical literature on online processing of non-stationary
sequences can provide useful signal processing tools for ad-
dressing challenges in the modern continual learning discipline.
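To make the switching mechanism concrete, the sketch below illustrates one plausible instantiation of the detection and expert-selection logic described above. The paper only requires that a statistically significant deviation in the loss value marks a task switch; the sliding-window z-score test, the window size and threshold, and the helper names (SwitchDetector, select_or_create_expert, make_expert) are illustrative assumptions rather than TAME's exact procedure.

```python
from collections import deque
import statistics


class SwitchDetector:
    """Sliding-window z-score test on an expert's loss values.

    A large positive deviation of the current loss from the recent mean is
    treated as the onset of a new task. Window size and threshold are
    illustrative defaults, not values taken from the paper.
    """

    def __init__(self, window: int = 50, z_threshold: float = 3.0, min_history: int = 10):
        self.losses = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.min_history = min_history

    def is_deviation(self, loss_value: float) -> bool:
        """Return True if loss_value deviates significantly from recent history."""
        if len(self.losses) < self.min_history:   # not enough history yet
            self.losses.append(loss_value)
            return False
        mean = statistics.mean(self.losses)
        std = statistics.pstdev(self.losses) or 1e-8
        deviated = (loss_value - mean) / std > self.z_threshold
        if not deviated:                          # keep the window free of outliers
            self.losses.append(loss_value)
        return deviated


def select_or_create_expert(experts, detectors, batch_losses, make_expert):
    """Pick an existing expert whose loss on the incoming batch is not anomalous;
    if every expert's loss deviates, spawn a new expert for the unseen task.

    batch_losses[i] is expert i's loss on the current batch, and make_expert()
    builds a fresh (pruned) expert network; both are hypothetical helpers.
    """
    deviations = [det.is_deviation(loss) for det, loss in zip(detectors, batch_losses)]
    if all(deviations):                           # no expert fits: grow the pool
        experts.append(make_expert())
        detectors.append(SwitchDetector())
        return len(experts) - 1                   # index of the newly created expert
    candidates = [i for i, dev in enumerate(deviations) if not dev]
    return min(candidates, key=lambda i: batch_losses[i])  # best-fitting existing expert
```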
The rest of the paper is organized as follows: Section 2
discusses the most relevant work. Section 3 introduces our
algorithm which we call TAME: Task Agnostic continual
learning using Multiple Experts. Section 4 reports empir-
ical results on benchmark continual learning data sets, and
finally Section 5 concludes the paper.
2. Related Work
In recent years, there has been a plethora of techniques pro-
posed for continual learning that mitigate the catastrophic
forgetting problem in deep neural networks. The existing
approaches can be divided into three categories: i) com-
plementary learning systems and memory replay methods,
ii) regularization-based methods, and iii) dynamic architec-
ture methods. These techniques are not dedicated to the
task-agnostic scenario since they assume the identity of the
tasks is provided at least during the training phase. On the
other hand, the more challenging task-agnostic continual learn-
ing setting was addressed only recently in a handful of pa-
pers. We review them first since our paper considers the
same setting. For completeness we also discuss the most
relevant works from the broad continual learning literature
and refer the reader to a survey paper [30] that provides a
more comprehensive review of these approaches.
Task-Agnostic Continual Learning In the context of the su-
pervised learning setting, which is the central focus of this
paper, one of the first methods addressing task-agnostic
continual learning is the Bayesian Gradient Descent algo-
rithm, popularly known as BGD [44]. This approach is
based on an online version of variational Bayes and pro-
poses a Bayesian learning update rule for the mean and vari-
ance of each parameter. Like all Bayesian approaches, this
method counter-acts catastrophic forgetting by using the
posterior distribution of the parameters for the previous task
as a prior for the new task. BGD obtains the most promising
empirical results in the setting where the method relies on
the so-called “label trick”, in which the task identity is inferred
from the class label. The label trick, however, breaks the task-
agnostic assumption. Another approach, called iTAML [31],
proposes to use meta-learning to maintain a set of gener-
alized parameters that represent all tasks. When presented
with a continuum of data at inference, the model automati-
cally identifies the task and quickly adapts to it with just a
single update. However, at training, the inner loop of their al-
gorithm, which generates task-specific models for each task
that are then combined in the outer loop to form a more
generic model, requires knowledge of the task label. At
inference, the task is predicted using generalized model pa-
rameters. Specifically, for each sample in the continuum,
the output of the general model is obtained and a maxi-
mum response per task is recorded. An average of the max-
imum responses per task is used as the task score. The task
with the maximum score is finally predicted. iTAML counter-
acts catastrophic forgetting by keeping a memory buffer of
samples from different tasks and using it to fine-tune gener-
alized parameters representing all tasks to a currently seen
one. This method is not task-agnostic, since it requires
task labels at training, though the authors categorize their
method as task-agnostic. CN-DPM [18] is an expansion-
based method that eliminates catastrophic forgetting by al-
locating new resources to learn new data. They formulate
the task-agnostic continual learning problem as an online
variational inference of Dirichlet process mixture models
consisting of a set of neural experts. Each expert is in charge
of a subset of the data. Each expert is associated with a dis-
criminative model (classifier) and a generative model (den-
sity estimator). For a new sample, they first decide whether
the sample should be assigned to an existing expert or a new
expert should be created for it. This is done by computing
the responsibility scores of the experts for the considered
sample and is supported by a short-term memory (STM)
collecting sufficient data. Specifically, when a data point is
classified as new, they store it to the STM. Once the STM
reaches its maximum capacity, they train a new expert with
the data in the STM. Another technique for task-agnostic
continual learning, known as HCL [15], models the distri-
bution of each task and each class with a normalizing flow
model. For task identification, they use the state-of-the-art
anomaly detection techniques based on measuring the typ-
icality of the model's statistics. To avoid catastrophic
forgetting, they use a combination of generative replay and
a functional regularization technique.
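As an illustration of the task-prediction rule described above for iTAML, the sketch below assumes the generalized model returns one logit per class and that task_classes[t] lists the class indices belonging to task t; these names and the PyTorch-style tensors are assumptions made for exposition, not iTAML's actual implementation.

```python
import torch


def predict_task(general_model, continuum, task_classes):
    """Average-of-maximum-responses task inference, as described above.

    For every sample in the continuum, record the maximum logit restricted
    to each task's classes, average these maxima per task over the whole
    continuum, and predict the task with the highest average score.
    task_classes[t] is the (assumed) list of class indices owned by task t.
    """
    with torch.no_grad():
        logits = general_model(continuum)                        # (num_samples, num_classes)
    scores = []
    for classes in task_classes:
        per_sample_max = logits[:, classes].max(dim=1).values    # max response per sample
        scores.append(per_sample_max.mean().item())              # average over the continuum
    return max(range(len(scores)), key=scores.__getitem__)       # task with the highest score
```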
In the context of the unsupervised learning setting, the VASE
method [1] addresses representation learning from piece-
wise stationary visual data based on a variational autoen-
coder with shared embeddings. The emphasis of this work
is put on learning shared representations across domains.
The method automatically detects shifts in the training data
distribution and uses this information to allocate spare latent
capacity to novel data set-specific disentangled representa-
tions, while reusing previously acquired representations of
latent dimensions where applicable. The authors represent data
sets using a set of data generative factors, where two data
sets may use the same generative factors but render them
differently, or they may use a different subset of factors
altogether. They next use the Minimum Description Length
principle to determine, up to a threshold, whether the average re-
construction error of the relevant generative factors for the
current data matches that of the previous data sets. Al-
locating spare representational capacity to new knowledge
protects previously learnt representations from catastrophic
forgetting. Another technique called CURL [32] learns a
task-specific representation on top of a larger set of shared
parameters while dynamically expanding model capacity
to capture new tasks. The method represents tasks using
a mixture of Gaussians and expands the model as needed,
by maintaining a small set of poorly-modelled samples and
then initialising and fitting a new mixture component to this
set when it reaches a critical size. The method also relies
on generative replay models to alleviate catastrophic for-
getting.
Non Task-Agnostic Continual Learning The first family of
non task-agnostic continual learning techniques consists of
complementary learning systems and memory replay meth-
ods. They rely on replaying selected samples from the prior
tasks. These samples are incorporated into the current learn-
ing process so that at each step the model is trained on a
mixture of samples from a new task as well as a small sub-
set of samples from the previously seen tasks. Some tech-
niques focus on efficiently selecting and storing prior expe-
riences through different selection strategies [4,13]. Other
approaches, e.g. GEM [21], A-GEM [6], and MER [33] fo-
cus on favoring positive backward transfer to previous tasks.
Finally, there are deep generative replay approaches [34,36]
that substitute the replay memory buffer with a generative
model to learn the data distribution of previous tasks and
generate samples accordingly when learning a new task.
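For concreteness, the sketch below illustrates the generic replay recipe described in this paragraph: each optimization step mixes the current task's batch with a small uniformly sampled subset of stored examples. The buffer format (a list of (x, y) tensor pairs), the mixing ratio, and the function name replay_training_step are assumptions for illustration and do not reproduce the exact procedure of any method cited above.

```python
import random
import torch


def replay_training_step(model, optimizer, loss_fn, new_batch, replay_buffer, replay_size=32):
    """One optimization step on a mixture of current-task data and a small
    random sample from a buffer of previously seen (x, y) tensor pairs.

    This is a generic illustration of memory replay, not the exact procedure
    of the methods cited above.
    """
    xs, ys = new_batch
    if replay_buffer:
        old = random.sample(replay_buffer, min(replay_size, len(replay_buffer)))
        xs = torch.cat([xs, torch.stack([x for x, _ in old])])   # append replayed inputs
        ys = torch.cat([ys, torch.stack([y for _, y in old])])   # append replayed labels
    optimizer.zero_grad()
    loss = loss_fn(model(xs), ys)
    loss.backward()
    optimizer.step()
    return loss.item()
```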
Another family of techniques, known as regularization-
based methods, enforces a constraint on the parameter up-
date of the neural network, usually by adding a regular-
ization term to the objective function. This term penal-
izes the change in the model parameters when the new task
is observed and ensures they stay close to the parameters
learned on the previous tasks. Among these techniques,
we identify a few famous algorithms such as EWC [16],
SI [43], MAS [3], and RWALK [5] that introduce different
notions of the importance of synapses or parameters and pe-
nalize changes to high-importance parameters, as well as
the LwF [20] method that can be seen as a combination of
knowledge distillation and fine-tuning. Finally, the last fam-
ily of techniques comprises the dynamic architecture methods that
expand the architecture of the network by allocating addi-
tional resources, i.e., neurons or layers, to new tasks which
is usually accompanied by additional parameter pruning
and masking. This family consists of such techniques as
expert-gate method [2], progressive networks [35], dynam-