Toward Sustainable Continual Learning: Detection
and Knowledge Repurposing of Similar Tasks
Sijia Wang1, Yoojin Choi2, Junya Chen1, Mostafa El-Khamy2, and Ricardo Henao1
1Duke University
2SoC R&D, Samsung Semiconductor Inc.
sijia.wang@duke.edu
Abstract
Most existing works on continual learning (CL) focus on overcoming the catastrophic forgetting (CF) problem, with dynamic models and replay methods performing exceptionally well. However, since current works tend to assume exclusivity or dissimilarity among learning tasks, these methods require constantly accumulating task-specific knowledge in memory for each task. This results in the eventual prohibitive expansion of the knowledge repository if we consider learning from a long sequence of tasks. In this work, we introduce a paradigm where the continual learner receives a sequence of mixed similar and dissimilar tasks. We propose a new continual learning framework that uses a task similarity detection function, requiring no additional learning, to determine whether a task seen in the past is similar to the current one. We can then reuse previous task knowledge to slow down parameter expansion, ensuring that the CL system expands the knowledge repository sublinearly with the number of learned tasks. Our experiments show that the proposed framework performs competitively on widely used computer vision benchmarks such as CIFAR10, CIFAR100, and EMNIST.
1 Introduction
Human intelligence is distinguished by the ability to learn new tasks over time while remembering how to perform previously experienced tasks. Continual learning (CL), an advanced machine learning paradigm requiring intelligent agents to continuously learn new knowledge while trying not to forget past knowledge, plays a pivotal role in machines imitating human-level intelligence [1]. The main problem in continual learning is catastrophic forgetting (CF) of previous knowledge when new tasks, observed over time, are incorporated into the model via training or (parameter) expansion.

In general, continual learning methods can be categorized into three major groups based on: (i) regularization, (ii) replay, and (iii) dynamic (expansion) models. Approaches based on regularization [2-4] alleviate forgetting by constraining the updates of parameters that are important for previous tasks, adding a regularization term to the loss function that penalizes changes between parameters for current and previous tasks. While this type of approach excels at keeping a low memory footprint, the ability to remember previous knowledge eventually declines, especially in scenarios with long sequences of tasks [5]. Methods based on replay keep a memory bank to store either a subset of exemplars representing each class and task [6-8], or generative models for pseudo-replay [9, 10]. Approaches based on dynamic models [11-16] allow the architecture (or components of it) to expand over time. Specifically, [15, 16] leverage modular networks and study a practical continual learning scenario with 100-task-long sequences. Nevertheless, these parameter expansion approaches face the challenge of keeping model growth under control (ideally sublinear). This problem associated with parameter expansion in CL systems is critical and requires further attention.
This paper discusses CL under a task continual learning (TCL) setting, i.e., one in which data arrives sequentially in groups of tasks. Works under this scenario [17, 18] usually assume that once a new task is presented, all of its data becomes readily available for batch (offline) training. In this setting, a task is defined as an individual training phase with a new collection of data that belongs to a new (never seen) group of classes or, in general, a new domain. Further, TCL also (implicitly) requires a task identifier during training. However, in practice, once the model has seen enough tasks, a newly arriving batch of data becomes increasingly likely to belong to the same group of classes or domain as a previously seen task. Importantly, most existing works on TCL fail to acknowledge this possibility. Moreover, and in general, the task definition or identifier may not be available during training, e.g., the model may not have access to the task description due to (user) privacy concerns. In such a case, which mostly concerns dynamic models, the system has to treat every task as new, thus constantly learning new sets of parameters regardless of task similarity or overlap. This clearly constitutes a suboptimal use of resources (predominantly memory), especially as the number of tasks experienced by the CL system grows.

This study investigates the aforementioned scenario and endeavors to create a memory-efficient CL system which, though focused on image classification tasks, is general and in principle can be readily used for other applications or data modalities. We provide a solution for dynamic models to identify similar tasks when no task identifier is provided during the training phase. To the best of our knowledge, the only other work that discusses learning a continual learning system from mixed similar and dissimilar tasks is [19], which proposes a task similarity function to identify previously seen similar tasks, but requires training a reference model every time a new task becomes available. Alternatively, in this work, we identify similar tasks without the need to train a new model, by leveraging a task similarity metric, which in practice results in high task similarity identification accuracy. We also discuss memory usage under challenging scenarios where longer, more realistic sequences of more than 20 tasks are used.
To summarize, our contributions are listed below:
• We propose a new framework for an under-explored and yet practical TCL setting in which we seek to learn a sequence of mixed similar and dissimilar tasks, while preventing (catastrophic) forgetting and repurposing task-specific parameters from a previously seen similar task, thus slowing down parameter expansion.
• The proposed TCL framework is characterized by a task similarity detection module that determines, without additional learning, whether the CL system can reuse the task-specific parameters of the model for a previous task or needs to instantiate new ones.
• Our task similarity detection module shows remarkable performance on widely used computer vision benchmarks, such as CIFAR10 [20], CIFAR100 [20], and EMNIST [21], from which we create sequences of 10 to 100 tasks.
2 Related Work
TCL in Practical Scenarios
Task continual learning (TCL), being an intuitive imitation of the human learning process, constitutes one of the most studied scenarios in CL. Though TCL systems have achieved impressive performance [22, 23], previous works have mainly focused on circumventing the problems associated with CF. Historically, task sequences have been restricted to no more than 10 tasks, and strong assumptions have been imposed so that all the tasks in the sequence are unique and classes among tasks are disjoint [10, 24]. The authors of [25] rightly argue that currently discussed CL settings are oversimplified, and that more general and practical CL forms should be discussed to advance the field. Only recently have solutions been proposed for longer sequences and more practical CL scenarios. In particular, [19] proposed CAT, which learns from a sequence of mixed similar and dissimilar tasks, thus enabling knowledge transfer between future and past tasks detected to be similar. To characterize the tasks, a set of task-specific masks, i.e., binary matrices indicating which parameters are important for a given task [22], are trained along with the other model parameters. Specifically, these masks are activated and the parameters associated with them are finetuned once the current task is identified as "similar" by a task similarity function, or otherwise held fixed by the masking parameters to protect them from changing, hence preventing CF. Alternatively, [15] introduces a modular network composed of neural modules that can potentially be shared with related tasks. Each task is optimized by selecting a set of modules that are either freshly trained on the new task or borrowed from similar past tasks. These similar tasks are detected by a task-driven prior. [15, 16] evaluate their approaches on CtrL, a real-world CL benchmark that contains 100 tasks with class overlap between tasks. The data-driven prior used in [15] for recognizing similar tasks is a very simple approach, which leaves room for improved substitutes yielding slower parameter growth. In summary, these works serve as a reminder that identifying similar tasks in TCL settings is in general a hard problem that deserves more attention.
Computational Efficiency in CL
The first and foremost step in tackling computational inefficiencies is to identify the framework components that are the most computationally demanding. For instance, to avoid storing real images for replay, [26] proposes to keep lower-dimensional feature embeddings of images instead. Another way to tackle the issue is to limit the trainable parameters in the model architecture itself by partitioning a convolutional layer into a backbone and task-specific adapting modules, as in [27, 28]. Similarly, Filter Atom Swapping [29] decomposes the filters in neural network layers and ensembles them with filters from past tasks. We can also make use of task relevancy or similarity; for instance, approaches with model structures similar to Progressive Networks [14] optimize each task by searching for neural modules to use. Further, [15] proposes to narrow this search space through task similarity.
3 Background
3.1 Problem Setting
We consider the TCL scenario for image classification tasks, where we seek to incrementally learn a sequence of tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_{t-1}$ and denote the collection of tasks currently learned as $\mathcal{T} = \{\mathcal{T}_1, \ldots, \mathcal{T}_{t-1}\}$. The underlying assumption is that, as the number of tasks in $\mathcal{T}$ grows, the current task $\mathcal{T}_t$ will eventually have a corresponding similar task $\mathcal{T}_t^{\rm sim} \in \mathcal{T}$. Let the set of all dissimilar tasks be $\mathcal{T}_t^{\rm dis} = \mathcal{T} \setminus \mathcal{T}_t^{\rm sim}$. We define similar and dissimilar tasks as follows.
Similar and Dissimilar tasks: Consider two tasks $A$ and $B$, represented by datasets $D_A = \{x_i^A, y_i^A\}_{i=1}^{n_A}$ and $D_B = \{x_i^B, y_i^B\}_{i=1}^{n_B}$, where $y_i^A \in \{Y_A\}$ and $y_i^B \in \{Y_B\}$. If the predictors (e.g., images) $\{x_i^A\}_{i=1}^{n_A} \sim P$ and $\{x_i^B\}_{i=1}^{n_B} \sim P$, and the labels $\{Y_A\} = \{Y_B\}$, indicating that data from $A$ and $B$ belong to the same group of classes and their predictors are drawn from the same distribution $P$, then we say $A$ and $B$ are similar tasks; otherwise, $A$ and $B$ are deemed dissimilar. Notably, when both tasks share the same distribution $P$ but have different label spaces, they are considered dissimilar.
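To make the definition operational, the short Python sketch below checks the two conditions: equality of label spaces and a two-sample criterion on the predictors. The function name `are_similar` and the injected `same_distribution` callable are illustrative assumptions; the latter stands in for the distribution comparison discussed in Sections 3.3 and 4, which is the genuinely hard part.

```python
from typing import Callable, Sequence, Tuple

def are_similar(D_A: Tuple[Sequence, Sequence],
                D_B: Tuple[Sequence, Sequence],
                same_distribution: Callable[[Sequence, Sequence], bool]) -> bool:
    """Sketch of the similar-task definition: tasks A and B are similar iff
    their label spaces coincide AND their predictors come from the same
    distribution (the latter test is delegated to `same_distribution`)."""
    (x_A, y_A), (x_B, y_B) = D_A, D_B
    same_label_space = set(y_A) == set(y_B)      # {Y_A} = {Y_B}
    return same_label_space and same_distribution(x_A, x_B)
```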
The main objective in this work is to identify $\mathcal{T}_t^{\rm sim}$ among $\mathcal{T}$ without training a new model (or learning parameters) for $\mathcal{T}_t$, by leveraging a task similarity identification function that enables the system to reuse the parameters of $\mathcal{T}_t^{\rm sim}$ when identification of a previously seen task is successful. Alternatively, the system will instantiate new parameters for the dissimilar task. As a result, the system will attempt to learn parameters only for the set of unique tasks, which in a long sequence of tasks is assumed to be smaller than the sequence length. In practice, in order to handle memory efficiently, we do not have to instantiate completely different sets of parameters for every unique task; rather, we define global and task-specific parameters as a means to further control model growth. For this purpose, we leverage the efficient feature transformations for convolutional models described below.
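Before detailing these components, the following Python sketch outlines the intended training loop under this setting. The callables `detect_similar_task`, `train_task_specific_params`, and `evaluate` are hypothetical placeholders for the task similarity detection module (Section 4), the EFT-based task-specific parameters (Section 3.2), and task evaluation; they are assumptions for illustration, not an implementation of our method.

```python
def continual_learning_loop(task_stream, backbone, detect_similar_task,
                            train_task_specific_params, evaluate):
    """Sketch of the intended TCL loop: reuse task-specific parameters when a
    similar past task is detected, otherwise instantiate (and store) new ones."""
    knowledge = {}                                # task id -> task-specific parameters
    for t, (train_data, test_data) in enumerate(task_stream):
        sim_id = detect_similar_task(train_data, knowledge, backbone)
        if sim_id is not None:
            params = knowledge[sim_id]            # repurpose a previous task's parameters
        else:
            params = train_task_specific_params(backbone, train_data)
            knowledge[t] = params                 # expand only for genuinely new tasks
        evaluate(backbone, params, test_data)
    # memory grows with the number of unique tasks, not with the sequence length
    return knowledge
```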
3.2 Task-Specific Adaptation via Feature Transformation Techniques
The efficient feature transformation (EFT) framework in [30] proposed that instead of finetuning all the parameters of a well-trained (pretrained) model, one can partition the network into a (global) backbone model and task-specific feature transformations. Note that similar ideas have also been explored in [28, 29, 31-33]. Given a trained backbone convolutional neural network, we can transform the convolutional feature maps $F$ of each layer into task-specific feature maps $W$ by implementing small convolutional transformations.

In our setting, only $W$ is learned for task $\mathcal{T}_t$ as a means to reduce the parameter count, and thus the memory footprint, required for new tasks. Specifically, the feature transformation involves two types of convolutional kernels, namely, $\omega^s \in \mathbb{R}^{3 \times 3 \times a}$ for capturing spatial features within groups of channels and $\omega^d \in \mathbb{R}^{1 \times 1 \times b}$ for capturing features across channels at every location in $F$, where $a$ and $b$ are hyperparameters controlling the size of each feature map group.
The transformed feature maps $W = W^s + \gamma W^d$ are obtained from $W^s = [W^s_{0:a-1} \,|\, \cdots \,|\, W^s_{(K-a):K}]$ and $W^d = [W^d_{0:b-1} \,|\, \cdots \,|\, W^d_{(K-b):K}]$ via

$$
W^s_{ai:(ai+a-1)} = \big[\,\omega^s_{i,1} * F_{ai:(ai+a-1)} \,\big|\, \cdots \,\big|\, \omega^s_{i,a} * F_{ai:(ai+a-1)}\,\big], \quad i \in \{0, \ldots, K/a - 1\}, \qquad (1)
$$

$$
W^d_{bi:(bi+b-1)} = \big[\,\omega^d_{i,1} * F_{bi:(bi+b-1)} \,\big|\, \cdots \,\big|\, \omega^d_{i,b} * F_{bi:(bi+b-1)}\,\big], \quad i \in \{0, \ldots, K/b - 1\}, \qquad (2)
$$

where the feature maps $F \in \mathbb{R}^{M \times N \times K}$ and $W \in \mathbb{R}^{M \times N \times K}$ have spatial dimensions $M$ and $N$, $K$ is the number of feature maps, $|$ is the concatenation operation, $K/a$ and $K/b$ are the number of groups into which $F$ is split for each transformation, $\gamma \in \{0, 1\}$ indicates whether the point-wise convolutions $\omega^d$ are employed, and $W^s_{ai:(ai+a-1)} \in \mathbb{R}^{M \times N \times a}$ and $W^d_{bi:(bi+b-1)} \in \mathbb{R}^{M \times N \times b}$ are slices of the transformed feature maps. In practice, we set $a \ll K$ and $b \ll K$ so that the number of trainable parameters per task is substantially reduced. For instance, using a ResNet18 backbone, $a = 8$, and $b = 16$ results in 449k parameters per new task, which is 3.9% the size of the backbone. As empirically demonstrated in [30], EFT preserves the remarkable representation learning power of ResNet models while significantly reducing the number of trainable parameters per task.
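To make the grouped structure of (1)-(2) concrete, the PyTorch sketch below applies a grouped 3x3 convolution over channel groups of size $a$ and an optional grouped 1x1 convolution over groups of size $b$, summing the outputs as $W = W^s + \gamma W^d$. The class name `EFTBlock`, the bias-free choice, and the example shapes are our own illustrative assumptions rather than the exact configuration of [30].

```python
import torch
import torch.nn as nn

class EFTBlock(nn.Module):
    """Minimal sketch of an EFT-style task-specific feature transformation (Eqs. (1)-(2)):
    a grouped 3x3 convolution captures spatial structure within channel groups of size a,
    and an optional grouped 1x1 convolution mixes channels within groups of size b."""

    def __init__(self, K: int, a: int = 8, b: int = 16, use_pointwise: bool = True):
        super().__init__()
        assert K % a == 0 and K % b == 0, "K must be divisible by a and b"
        # Spatial branch: K/a groups, each mapping a -> a channels with 3x3 kernels.
        self.spatial = nn.Conv2d(K, K, kernel_size=3, padding=1, groups=K // a, bias=False)
        # Point-wise branch: K/b groups, each mapping b -> b channels with 1x1 kernels.
        self.pointwise = nn.Conv2d(K, K, kernel_size=1, groups=K // b, bias=False) if use_pointwise else None

    def forward(self, F: torch.Tensor) -> torch.Tensor:
        # W = W^s + gamma * W^d, with gamma = 1 when the point-wise branch is enabled.
        W = self.spatial(F)
        if self.pointwise is not None:
            W = W + self.pointwise(F)
        return W

# Example: transform a batch of 256-channel feature maps.
F = torch.randn(4, 256, 14, 14)
eft = EFTBlock(K=256, a=8, b=16)
W = eft(F)                                   # same shape as F: (4, 256, 14, 14)
n_params = sum(p.numel() for p in eft.parameters())
print(W.shape, n_params)                     # per-layer parameter count; the per-task total
                                             # depends on where such blocks are inserted
```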
3.3 Task Continual Learning: Mixture Model Perspective
One of the key components when trying to identify similar tasks is to decide whether $\{x_i^A\}_{i=1}^{n_A}$ and $\{x_i^B\}_{i=1}^{n_B}$ originate from the same distribution $P$. However, though conceptually simple, this is extremely challenging in practice, particularly when predictors are complex instances such as images. Intuitively, for a sequence of tasks $\mathcal{T} = \{\mathcal{T}_1, \ldots, \mathcal{T}_{t-1}\}$, with corresponding data $D_1, \ldots, D_{t-1}$, $D_t = \{x_i, y_i\}_{i=1}^{n_t}$ consisting of $n_t$ instances, where $x_i$ is an image and $y_i$ is its corresponding label, we can think of data instances $x$ from the collection of all unique tasks as a mixture model defined as

$$
p(x) = \pi\, p(x \mid \phi) + \sum_{j=1}^{t-1} \pi_j\, p(x \mid \phi_j), \qquad (3)
$$
from which we can see that $\pi_j$ is the probability that $x$ belongs to task $\mathcal{T}_j$ and $p(x \mid \phi_j)$ is the likelihood of $x$ under the distribution for task $\mathcal{T}_j$ parameterized by $\phi_j$. Further, $\pi$ and $\phi$ denote the hypothetical probability and parameters for a new unseen task, i.e., distinct from $\{p(x \mid \phi_j)\}_{j=1}^{t-1}$. The formulation in (3), which is reminiscent of a Dirichlet process mixture model (DPMM) [34, 35], can in principle be used to estimate the posterior probability $p(D_t \in \mathcal{T})$ by evaluating (3), which assumes that the parameters $\{\pi, \pi_1, \ldots, \pi_{t-1}\}$ and $\{\phi, \phi_1, \ldots, \phi_{t-1}\}$ are readily available. Unfortunately, though for the collection of existing tasks $\mathcal{T}$ we can effectively estimate $\pi_j$ and $p(x \mid \phi_j)$ using generative models (e.g., variational autoencoders), the parameters for a new task, $\phi$ and $p(x \mid \phi)$, are much more difficult to estimate because i) if we naively build a generative model for the new dataset $D_t$ to obtain $\phi_t$ and then evaluate (3), it is almost guaranteed that $D_t$ corresponding to $\mathcal{T}_t$ will be more likely under $p(x \mid \phi = \phi_t)$ than under any other existing task distribution $\{p(x \mid \phi_j)\}_{j=1}^{t-1}$, and alternatively, ii) if we set $p(x \mid \phi)$ to some prior distribution, e.g., a pretrained generative model, it will almost certainly never be selected, especially in scenarios with complex predictors such as images. In fact, in early stages of development we empirically verified this to be the case using both pretrained generative models admitting (marginal) likelihood estimation [36, 37] and anomaly detection models based on density estimators [38, 39].
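As a concrete illustration of how (3) assigns a sample to existing tasks once per-task likelihoods are available (e.g., from generative models fit to each $D_j$), the sketch below computes the mixture responsibilities over the $t-1$ seen tasks only; the problematic new-task component $p(x \mid \phi)$ is deliberately omitted, which is precisely the gap discussed above. The function name and example values are hypothetical.

```python
import numpy as np

def task_responsibilities(log_px_given_task, log_pi):
    """Responsibility of each previously seen task for a sample x, i.e., the
    normalized terms pi_j * p(x | phi_j) of Eq. (3), computed in log space
    for numerical stability (new-task component excluded).

    log_px_given_task : shape (t-1,), log p(x | phi_j) for each seen task
    log_pi            : shape (t-1,), log mixture weights log pi_j
    """
    logits = np.asarray(log_px_given_task, dtype=float) + np.asarray(log_pi, dtype=float)
    logits -= logits.max()                   # subtract max before exponentiating
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical example with three previously learned tasks and uniform weights.
resp = task_responsibilities(log_px_given_task=[-1210.4, -988.7, -1304.2],
                             log_pi=np.log([1 / 3, 1 / 3, 1 / 3]))
print(resp)   # the second task is overwhelmingly the most likely "owner" of x
```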
In Section 4, we will show how to leverage (3) to identify new tasks in the context of TCL without using data from $\mathcal{T}_t$ for learning nor specifying a prior distribution for $p(x \mid \phi)$.
3.4 Estimating the Association between Predictors and Labels
The mixture model perspective for TCL in (3) offers a compelling way to compare the distribution of predictors for different tasks; however, it does not provide any insight about the strength of the association between predictors and labels for the dataset $D_t$ corresponding to task $\mathcal{T}_t$. Further, [40] showed that overparameterized neural network classifiers can attain zero training error regardless of the strength of the association between predictors and labels, and in an extreme case, even for randomly labeled data. However, it is clear that a model trained with random labels will not generalize. So motivated, [41] first studied the properties of suitably labeled data that control generalization ability, and proposed a generalization bound on the test error (empirical risk) for arbitrary overparameterized