Toward Sustainable Continual Learning: Detection
and Knowledge Repurposing of Similar Tasks
Sijia Wang1, Yoojin Choi2, Junya Chen1, Mostafa El-Khamy2, and Ricardo Henao1
1Duke University
2SoC R&D, Samsung Semiconductor Inc.
sijia.wang@duke.edu
Abstract
Most existing works on continual learning (CL) focus on overcoming the catastrophic forgetting (CF) problem, with dynamic models and replay methods performing exceptionally well. However, since current works tend to assume exclusivity or dissimilarity among learning tasks, these methods require constantly accumulating task-specific knowledge in memory for each task. This results in the eventual prohibitive expansion of the knowledge repository if we consider learning from a long sequence of tasks. In this work, we introduce a paradigm where the continual learner receives a sequence of mixed similar and dissimilar tasks. We propose a new continual learning framework that uses a task similarity detection function, requiring no additional learning, to determine whether a task seen in the past is similar to the current one. We can then reuse previous task knowledge to slow down parameter expansion, ensuring that the CL system expands the knowledge repository sublinearly with the number of learned tasks. Our experiments show that the proposed framework performs competitively on widely used computer vision benchmarks such as CIFAR10, CIFAR100, and EMNIST.
1 Introduction
Human intelligence is distinguished by the ability to learn new tasks over time while remembering how to perform previously experienced tasks. Continual learning (CL), an advanced machine learning paradigm requiring intelligent agents to continuously learn new knowledge while trying not to forget past knowledge, plays a pivotal role in machines imitating human-level intelligence [1]. The main problem in continual learning is catastrophic forgetting (CF) of previous knowledge when new tasks, observed over time, are incorporated into the model via training or (parameter) expansion.

In general, continual learning methods can be categorized into three major groups based on: (i) regularization, (ii) replay, and (iii) dynamic (expansion) models. Approaches based on regularization [2-4] alleviate forgetting by constraining the updates of parameters that are important for previous tasks, adding a regularization term to the loss function that penalizes changes between parameters for current and previous tasks. While this type of approach excels at keeping a low memory footprint, the ability to remember previous knowledge eventually declines, especially in scenarios with long sequences of tasks [5]. Methods based on replay keep a memory bank to store either a subset of exemplars representing each class and task [6-8], or generative models for pseudo-replay [9, 10]. Approaches based on dynamic models [11-16] allow the architecture (or components of it) to expand over time. Specifically, [15, 16] leverage modular networks and study a practical continual learning scenario with 100-task-long sequences. Nevertheless, these parameter expansion approaches face the challenge of keeping model growth under control (ideally sublinear). This problem associated with parameter expansion in CL systems is critical and requires further attention.
This paper discusses CL under a task continual learning (TCL) setting, i.e., one in which data arrives sequentially in groups of tasks. Works under this scenario [17, 18] usually assume that once a new task is presented, all of its data becomes readily available for batch (offline) training. In this setting, a task is defined as an individual training phase with a new collection of data that belongs to a new (never seen) group of classes or, in general, a new domain. Further, TCL also (implicitly) requires a task identifier during training. However, in practice, once the model has seen enough tasks, a newly arriving batch of data becomes increasingly likely to belong to the same group of classes or domain as a previously seen task. Importantly, most existing works on TCL fail to acknowledge this possibility. Moreover, and in general, the task definition or identifier may not be available during training, e.g., the model may not have access to the task description due to (user) privacy concerns. In such a case, which mostly concerns dynamic models, the system has to treat every task as new, thus constantly learning new sets of parameters regardless of task similarity or overlap. This clearly constitutes a suboptimal use of resources (predominantly memory), especially as the number of tasks experienced by the CL system grows.

This study investigates the aforementioned scenario and endeavors to create a memory-efficient CL system which, though focused on image classification tasks, is general and in principle can be readily used for other applications or data modalities. We provide a solution for dynamic models to identify similar tasks when no task identifier is provided during the training phase. To the best of our knowledge, the only other work that discusses learning a continual learning system from mixed similar and dissimilar tasks is [19], which proposes a task similarity function to identify previously seen similar tasks, but requires training a reference model every time a new task becomes available. Alternatively, in this work, we identify similar tasks without the need to train a new model, by leveraging a task similarity metric, which in practice results in high task similarity identification accuracy. We also discuss memory usage under challenging scenarios where longer, more realistic sequences of more than 20 tasks are used.
To summarize, our contributions are listed below:
• We propose a new framework for an under-explored and yet practical TCL setting in which we seek to learn a sequence of mixed similar and dissimilar tasks, while preventing (catastrophic) forgetting and repurposing task-specific parameters from a previously seen similar task, thus slowing down parameter expansion.
• The proposed TCL framework is characterized by a task similarity detection module that determines, without additional learning, whether the CL system can reuse the task-specific parameters of the model for a previous task or needs to instantiate new ones.
• Our task similarity detection module shows remarkable performance on widely used computer vision benchmarks, such as CIFAR10 [20], CIFAR100 [20], and EMNIST [21], from which we create sequences of 10 to 100 tasks.
2 Related Work
TCL in Practical Scenarios
Task continual learning (TCL), being an intuitive imitation of the human learning process, constitutes one of the most studied scenarios in CL. Though TCL systems have achieved impressive performance [22, 23], previous works have mainly focused on circumventing the problems associated with CF. Historically, task sequences have been restricted to no more than 10 tasks, and strong assumptions have been imposed so that all the tasks in the sequence are unique and classes among tasks are disjoint [10, 24]. The authors of [25] rightly argue that currently discussed CL settings are oversimplified, and that more general and practical CL forms should be discussed to advance the field. Only recently have solutions been proposed for longer sequences and more practical CL scenarios. In particular, [19] proposed CAT, which learns from a sequence of mixed similar and dissimilar tasks, thus enabling knowledge transfer between future and past tasks detected to be similar. To characterize the tasks, a set of task-specific masks, i.e., binary matrices indicating which parameters are important for a given task [22], are trained along with the other model parameters. Specifically, these masks are activated and the parameters associated with them are finetuned once the current task is identified as "similar" by a task similarity function, or otherwise held fixed by the masking parameters to protect them from changing, hence preventing CF. Alternatively, [15] introduces a modular network composed of neural modules that can potentially be shared with related tasks. Each task is optimized by selecting a set of modules that are either freshly trained on the new task or borrowed from similar past tasks. These similar tasks are detected by a task-driven prior. [15, 16] evaluate their approaches on CtrL, a real-world CL benchmark that contains 100 tasks with class overlap between tasks. The data-driven prior used in [15] for recognizing similar tasks is a very simple approach, which leaves room for improved substitutes yielding slower parameter growth. In summary, these works serve as a reminder that identifying similar tasks in TCL settings is in general a hard problem that deserves more attention.
Computational Efficiency in CL
The first and foremost step in tackling computational inefficiencies is to identify the framework components that are the most computationally demanding. For instance, to avoid storing real images for replay, [26] proposes to keep lower-dimensional feature embeddings of images instead. Another way to tackle the issue is to limit the trainable parameters in the model architecture itself by partitioning a convolutional layer into a backbone and task-specific adapting modules, as in [27, 28]. Similarly, Filter Atom Swapping [29] decomposes the filters in neural network layers and ensembles them with filters from past tasks. We can also make use of task relevancy or similarity; for instance, approaches with model structures similar to Progressive Networks [14] optimize each task by searching for neural modules to use. Further, [15] proposes to narrow this search space through task similarity.
3 Background
3.1 Problem Setting
We consider the TCL scenario for image classification tasks, where we seek to incrementally learn a sequence of tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_{t-1}$ and denote the collection of tasks currently learned as $\mathcal{T} = \{\mathcal{T}_1, \ldots, \mathcal{T}_{t-1}\}$. The underlying assumption is that, as the number of tasks in $\mathcal{T}$ grows, the current task $\mathcal{T}_t$ will eventually have a corresponding similar task $\mathcal{T}_t^{\rm sim} \in \mathcal{T}$. Let the set of all dissimilar tasks be $\mathcal{T}_t^{\rm dis} = \mathcal{T} \setminus \mathcal{T}_t^{\rm sim}$. We define similar and dissimilar tasks as follows.
Similar and Dissimilar tasks: Consider two tasks $A$ and $B$, represented by datasets $D_A = \{x_i^A, y_i^A\}_{i=1}^{n_A}$ and $D_B = \{x_i^B, y_i^B\}_{i=1}^{n_B}$, where $y_i^A \in \{Y_A\}$ and $y_i^B \in \{Y_B\}$. If the predictors (e.g., images) $\{x_i^A\}_{i=1}^{n_A} \sim P$ and $\{x_i^B\}_{i=1}^{n_B} \sim P$, and the labels $\{Y_A\} = \{Y_B\}$, indicating that data from $A$ and $B$ belong to the same group of classes and their predictors are drawn from the same distribution $P$, then we say $A$ and $B$ are similar tasks; otherwise, $A$ and $B$ are deemed dissimilar. Notably, when both tasks share the same distribution $P$ but have different label spaces, they are considered dissimilar.
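To make the definition operational, the short Python sketch below checks the two conditions: equality of label spaces and a two-sample criterion on the predictors. The function name `are_similar` and the injected `same_distribution` callable are illustrative assumptions; the latter stands in for the distribution comparison discussed in Sections 3.3 and 4, which is the genuinely hard part.

```python
from typing import Callable, Sequence, Tuple

def are_similar(D_A: Tuple[Sequence, Sequence],
                D_B: Tuple[Sequence, Sequence],
                same_distribution: Callable[[Sequence, Sequence], bool]) -> bool:
    """Sketch of the similar-task definition: tasks A and B are similar iff
    their label spaces coincide AND their predictors come from the same
    distribution (the latter test is delegated to `same_distribution`)."""
    (x_A, y_A), (x_B, y_B) = D_A, D_B
    same_label_space = set(y_A) == set(y_B)      # {Y_A} = {Y_B}
    return same_label_space and same_distribution(x_A, x_B)
```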
The main objective in this work is to identify $\mathcal{T}_t^{\rm sim}$ among $\mathcal{T}$ without training a new model (or learning parameters) for $\mathcal{T}_t$, by leveraging a task similarity identification function that enables the system to reuse the parameters of $\mathcal{T}_t^{\rm sim}$ when identification of a previously seen task is successful. Alternatively, the system will instantiate new parameters for the dissimilar task. As a result, the system will attempt to learn parameters only for the set of unique tasks, which in a long sequence of tasks is assumed to be smaller than the sequence length. In practice, in order to handle memory efficiently, we do not have to instantiate completely different sets of parameters for every unique task; rather, we define global and task-specific parameters as a means to further control model growth. For this purpose, we leverage the efficient feature transformations for convolutional models described below.
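Before detailing these components, the following Python sketch outlines the intended training loop under this setting. The callables `detect_similar_task`, `train_task_specific_params`, and `evaluate` are hypothetical placeholders for the task similarity detection module (Section 4), the EFT-based task-specific parameters (Section 3.2), and task evaluation; they are assumptions for illustration, not an implementation of our method.

```python
def continual_learning_loop(task_stream, backbone, detect_similar_task,
                            train_task_specific_params, evaluate):
    """Sketch of the intended TCL loop: reuse task-specific parameters when a
    similar past task is detected, otherwise instantiate (and store) new ones."""
    knowledge = {}                                # task id -> task-specific parameters
    for t, (train_data, test_data) in enumerate(task_stream):
        sim_id = detect_similar_task(train_data, knowledge, backbone)
        if sim_id is not None:
            params = knowledge[sim_id]            # repurpose a previous task's parameters
        else:
            params = train_task_specific_params(backbone, train_data)
            knowledge[t] = params                 # expand only for genuinely new tasks
        evaluate(backbone, params, test_data)
    # memory grows with the number of unique tasks, not with the sequence length
    return knowledge
```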
3.2 Task-Specific Adaptation via Feature Transformation Techniques
The efficient feature transformation (EFT) framework in [30] proposed that instead of finetuning all the parameters of a well-trained (pretrained) model, one can partition the network into a (global) backbone model and task-specific feature transformations. Note that similar ideas have also been explored in [28, 29, 31-33]. Given a trained backbone convolutional neural network, we can transform the convolutional feature maps $F$ of each layer into task-specific feature maps $W$ by implementing small convolutional transformations.

In our setting, only $W$ is learned for task $\mathcal{T}_t$ as a means to reduce the parameter count, and thus the memory footprint, required for new tasks. Specifically, the feature transformation involves two types of convolutional kernels, namely, $\omega^s \in \mathbb{R}^{3 \times 3 \times a}$ for capturing spatial features within groups of channels and $\omega^d \in \mathbb{R}^{1 \times 1 \times b}$ for capturing features across channels at every location in $F$, where $a$ and $b$ are hyperparameters controlling the size of each feature map group.
The transformed feature maps $W = W^s + \gamma W^d$ are obtained from $W^s = [W^s_{0:a-1} \,|\, \cdots \,|\, W^s_{(K-a):K}]$ and $W^d = [W^d_{0:b-1} \,|\, \cdots \,|\, W^d_{(K-b):K}]$ via

$$
W^s_{ai:(ai+a-1)} = \big[\,\omega^s_{i,1} * F_{ai:(ai+a-1)} \,\big|\, \cdots \,\big|\, \omega^s_{i,a} * F_{ai:(ai+a-1)}\,\big], \quad i \in \{0, \ldots, K/a - 1\}, \qquad (1)
$$

$$
W^d_{bi:(bi+b-1)} = \big[\,\omega^d_{i,1} * F_{bi:(bi+b-1)} \,\big|\, \cdots \,\big|\, \omega^d_{i,b} * F_{bi:(bi+b-1)}\,\big], \quad i \in \{0, \ldots, K/b - 1\}, \qquad (2)
$$

where the feature maps $F \in \mathbb{R}^{M \times N \times K}$ and $W \in \mathbb{R}^{M \times N \times K}$ have spatial dimensions $M$ and $N$, $K$ is the number of feature maps, $|$ is the concatenation operation, $K/a$ and $K/b$ are the number of groups into which $F$ is split for each transformation, $\gamma \in \{0, 1\}$ indicates whether the point-wise convolutions $\omega^d$ are employed, and $W^s_{ai:(ai+a-1)} \in \mathbb{R}^{M \times N \times a}$ and $W^d_{bi:(bi+b-1)} \in \mathbb{R}^{M \times N \times b}$ are slices of the transformed feature maps. In practice, we set $a \ll K$ and $b \ll K$ so that the number of trainable parameters per task is substantially reduced. For instance, using a ResNet18 backbone, $a = 8$, and $b = 16$ results in 449k parameters per new task, which is 3.9% the size of the backbone. As empirically demonstrated in [30], EFT preserves the remarkable representation learning power of ResNet models while significantly reducing the number of trainable parameters per task.
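To make the grouped structure of (1)-(2) concrete, the PyTorch sketch below applies a grouped 3x3 convolution over channel groups of size $a$ and an optional grouped 1x1 convolution over groups of size $b$, summing the outputs as $W = W^s + \gamma W^d$. The class name `EFTBlock`, the bias-free choice, and the example shapes are our own illustrative assumptions rather than the exact configuration of [30].

```python
import torch
import torch.nn as nn

class EFTBlock(nn.Module):
    """Minimal sketch of an EFT-style task-specific feature transformation (Eqs. (1)-(2)):
    a grouped 3x3 convolution captures spatial structure within channel groups of size a,
    and an optional grouped 1x1 convolution mixes channels within groups of size b."""

    def __init__(self, K: int, a: int = 8, b: int = 16, use_pointwise: bool = True):
        super().__init__()
        assert K % a == 0 and K % b == 0, "K must be divisible by a and b"
        # Spatial branch: K/a groups, each mapping a -> a channels with 3x3 kernels.
        self.spatial = nn.Conv2d(K, K, kernel_size=3, padding=1, groups=K // a, bias=False)
        # Point-wise branch: K/b groups, each mapping b -> b channels with 1x1 kernels.
        self.pointwise = nn.Conv2d(K, K, kernel_size=1, groups=K // b, bias=False) if use_pointwise else None

    def forward(self, F: torch.Tensor) -> torch.Tensor:
        # W = W^s + gamma * W^d, with gamma = 1 when the point-wise branch is enabled.
        W = self.spatial(F)
        if self.pointwise is not None:
            W = W + self.pointwise(F)
        return W

# Example: transform a batch of 256-channel feature maps.
F = torch.randn(4, 256, 14, 14)
eft = EFTBlock(K=256, a=8, b=16)
W = eft(F)                                   # same shape as F: (4, 256, 14, 14)
n_params = sum(p.numel() for p in eft.parameters())
print(W.shape, n_params)                     # per-layer parameter count; the per-task total
                                             # depends on where such blocks are inserted
```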
3.3 Task Continual Learning: Mixture Model Perspective
One of the key components when trying to identify similar tasks is to decide whether $\{x_i^A\}_{i=1}^{n_A}$ and $\{x_i^B\}_{i=1}^{n_B}$ originate from the same distribution $P$. However, though conceptually simple, this is extremely challenging in practice, particularly when predictors are complex instances such as images. Intuitively, for a sequence of tasks $\mathcal{T} = \{\mathcal{T}_1, \ldots, \mathcal{T}_{t-1}\}$, with corresponding data $D_1, \ldots, D_{t-1}$, $D_t = \{x_i, y_i\}_{i=1}^{n_t}$ consisting of $n_t$ instances, where $x_i$ is an image and $y_i$ is its corresponding label, we can think of data instances $x$ from the collection of all unique tasks as a mixture model defined as

$$
p(x) = \pi\, p(x \mid \phi) + \sum_{j=1}^{t-1} \pi_j\, p(x \mid \phi_j), \qquad (3)
$$
from which we can see that $\pi_j$ is the probability that $x$ belongs to task $\mathcal{T}_j$ and $p(x \mid \phi_j)$ is the likelihood of $x$ under the distribution for task $\mathcal{T}_j$ parameterized by $\phi_j$. Further, $\pi$ and $\phi$ denote the hypothetical probability and parameters for a new unseen task, i.e., distinct from $\{p(x \mid \phi_j)\}_{j=1}^{t-1}$. The formulation in (3), which is reminiscent of a Dirichlet process mixture model (DPMM) [34, 35], can in principle be used to estimate the posterior probability $p(D_t \in \mathcal{T})$ by evaluating (3), which assumes that the parameters $\{\pi, \pi_1, \ldots, \pi_{t-1}\}$ and $\{\phi, \phi_1, \ldots, \phi_{t-1}\}$ are readily available. Unfortunately, though for the collection of existing tasks $\mathcal{T}$ we can effectively estimate $\pi_j$ and $p(x \mid \phi_j)$ using generative models (e.g., variational autoencoders), the parameters for a new task, $\phi$ and $p(x \mid \phi)$, are much more difficult to estimate because i) if we naively build a generative model for the new dataset $D_t$ to obtain $\phi_t$ and then evaluate (3), it is almost guaranteed that $D_t$ corresponding to $\mathcal{T}_t$ will be more likely under $p(x \mid \phi = \phi_t)$ than under any other existing task distribution $\{p(x \mid \phi_j)\}_{j=1}^{t-1}$, and alternatively, ii) if we set $p(x \mid \phi)$ to some prior distribution, e.g., a pretrained generative model, it will almost certainly never be selected, especially in scenarios with complex predictors such as images. In fact, in early stages of development we empirically verified this to be the case using both pretrained generative models admitting (marginal) likelihood estimation [36, 37] and anomaly detection models based on density estimators [38, 39].
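As a concrete illustration of how (3) assigns a sample to existing tasks once per-task likelihoods are available (e.g., from generative models fit to each $D_j$), the sketch below computes the mixture responsibilities over the $t-1$ seen tasks only; the problematic new-task component $p(x \mid \phi)$ is deliberately omitted, which is precisely the gap discussed above. The function name and example values are hypothetical.

```python
import numpy as np

def task_responsibilities(log_px_given_task, log_pi):
    """Responsibility of each previously seen task for a sample x, i.e., the
    normalized terms pi_j * p(x | phi_j) of Eq. (3), computed in log space
    for numerical stability (new-task component excluded).

    log_px_given_task : shape (t-1,), log p(x | phi_j) for each seen task
    log_pi            : shape (t-1,), log mixture weights log pi_j
    """
    logits = np.asarray(log_px_given_task, dtype=float) + np.asarray(log_pi, dtype=float)
    logits -= logits.max()                   # subtract max before exponentiating
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical example with three previously learned tasks and uniform weights.
resp = task_responsibilities(log_px_given_task=[-1210.4, -988.7, -1304.2],
                             log_pi=np.log([1 / 3, 1 / 3, 1 / 3]))
print(resp)   # the second task is overwhelmingly the most likely "owner" of x
```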
In Section 4, we will show how to leverage (3) to identify new tasks in the context of TCL without using data from $\mathcal{T}_t$ for learning nor specifying a prior distribution for $p(x \mid \phi)$.
3.4 Estimating the Association between Predictors and Labels
The mixture model perspective for TCL in (3) offers a compelling way to compare the distribution of predictors for different tasks; however, it does not provide any insight about the strength of the association between predictors and labels for the dataset $D_t$ corresponding to task $\mathcal{T}_t$. Further, [40] showed that overparameterized neural network classifiers can attain zero training error regardless of the strength of the association between predictors and labels, and in an extreme case, even for randomly labeled data. However, it is clear that a model trained with random labels will not generalize. So motivated, [41] first studied the properties of suitably labeled data that control generalization ability, and proposed a generalization bound on the test error (empirical risk) for arbitrary overparameterized