Exclusive Supermask Subnetwork Training for Continual Learning
Prateek Yadav & Mohit Bansal
Department of Computer Science
UNC Chapel Hill
{praty,mbansal}@cs.unc.edu
Abstract
Continual Learning (CL) methods focus on accumulating knowledge over time while avoiding catastrophic forgetting. Recently, Wortsman et al. (2020) proposed a CL method, SupSup, which uses a randomly initialized, fixed base network (model) and finds a supermask for each new task that selectively keeps or removes each weight to produce a subnetwork. They prevent forgetting as the network weights are not being updated. Although there is no forgetting, the performance of SupSup is sub-optimal because fixed weights restrict its representational power. Furthermore, there is no accumulation or transfer of knowledge inside the model when new tasks are learned. Hence, we propose EXSSNET (Exclusive Supermask SubNEtwork Training), which performs exclusive and non-overlapping subnetwork weight training. This avoids conflicting updates to the shared weights by subsequent tasks to improve performance while still preventing forgetting. Furthermore, we propose a novel KNN-based Knowledge Transfer (KKT) module that utilizes previously acquired knowledge to learn new tasks better and faster. We demonstrate that EXSSNET outperforms strong previous methods on both NLP and Vision domains while preventing forgetting. Moreover, EXSSNET is particularly advantageous for sparse masks that activate 2-10% of the model parameters, resulting in an average improvement of 8.3% over SupSup. Furthermore, EXSSNET scales to a large number of tasks (100). Our code is available at
https://github.com/prateeky2806/exessnet.
1 Introduction
Artificial intelligence aims to develop agents that
can learn to accomplish a set of tasks. Continual
Learning (CL) (Ring,1998;Thrun,1998) is crucial
for this, but when a model is sequentially trained
on different tasks with different data distributions,
it can lose its ability to perform well on previous
tasks, a phenomenon known as catastrophic forgetting (CF) (McCloskey and Cohen, 1989; Zhao and Schmidhuber, 1996; Thrun, 1998). This is
caused by the lack of access to data from previ-
ous tasks, as well as conflicting updates to shared
model parameters when sequentially learning mul-
tiple tasks, which is called parameter interference
(McCloskey and Cohen,1989).
Recently, some CL methods avoid parameter
interference by taking inspiration from the Lottery
Ticket Hypothesis (Frankle and Carbin,2018) and
Supermasks (Zhou et al.,2019) to exploit the
expressive power of sparse subnetworks. Given
that we have a combinatorial number of sparse
subnetworks inside a network, Zhou et al. (2019)
noted that even within randomly weighted neural
networks, there exist certain subnetworks known
as supermasks that achieve good performance. A
supermask is a sparse binary mask that selectively
keeps or removes each connection in a fixed
and randomly initialized network to produce a
subnetwork with good performance on a given
task. We call this subnetwork the supermask subnetwork; it is shown in Figure 1, highlighted by the red weights. Building upon this idea, Wortsman
et al. (2020) proposed a CL method, SupSup,
which initializes a network with fixed and random
weights and then learns a different supermask for
each new task. This allows them to prevent catas-
trophic forgetting (CF) as there is no parameter
interference (because the model weights are fixed).
Although SupSup (Wortsman et al.,2020) pre-
vents CF, there are some problems with using su-
permasks for CL: (1) The fixed random model weights in SupSup limit the supermask subnetwork's representational power, resulting in sub-optimal performance. (2) When learning a task, there is no mechanism for transferring learned knowledge from previous tasks to better learn the current task. Moreover, the model does not accumulate knowledge over time as the weights are never updated.
Figure 1: EXSSNET diagram. We start with random weights $W^{(0)}$. For task 1, we first learn a supermask $M_1$ (the corresponding subnetwork is marked by red color, column 2 row 1) and then train the weights corresponding to $M_1$, resulting in weights $W^{(1)}$ (bold red lines, column 1 row 2). For task 2, we learn the mask $M_2$ over the fixed weights $W^{(1)}$. If the weights of mask $M_2$ overlap with $M_1$ (marked by bold dashed green lines in column 3 row 1), then only the non-overlapping weights (solid green lines) of the task 2 subnetwork are updated (as shown by the bold, solid green lines in column 3 row 2). These already-trained weights (bold lines) are not updated by any subsequent task. Finally, for task 3, we learn the mask $M_3$ (blue lines) and update the solid blue weights.
To overcome the aforementioned issues, we
propose our method, EXSSNET (Exclusive
Supermask SubNEtwork Training), pronounced
as ‘excess-net’, which first learns a mask for a task
and then selectively trains a subset of weights from
the supermask subnetwork. We train the weights of
this subnetwork via exclusion, avoiding updates to parameters of the current subnetwork that have already been updated by any of the previous tasks; as illustrated in Figure 1, this also helps prevent forgetting. Training
the supermask subnetwork’s weights increases its
representational power and allows EXSSNET to
encode task-specific knowledge inside the subnet-
work (see Figure 2). This solves the first problem
and allows EXSSNET to perform comparably to a
fully trained network on individual tasks; and when
learning multiple tasks, the exclusive subnetwork
training improves the performance of each task
while still preventing forgetting (see Figure 3).
To address the second problem of knowledge transfer, we propose a $k$-nearest neighbors-based knowledge transfer (KKT) module that utilizes relevant information from previously learned tasks to improve performance on new tasks while learning them faster. Our KKT module uses KNN classification to select a subnetwork from the previously learned tasks that has better-than-random predictive power for the current task and uses it as a starting point to learn the new task.
Next, we show our method’s advantage by ex-
perimenting with both natural language and vi-
sion tasks. For natural language, we evaluate on
WebNLP classification tasks (de Masson d'Autume
et al.,2019) and GLUE benchmark tasks (Wang
et al.,2018), whereas, for vision, we evaluate on
SplitMNIST (Zenke et al.,2017), SplitCIFAR100
(De Lange and Tuytelaars,2021), and SplitTiny-
ImageNet (Buzzega et al.,2020) datasets. We
show that for both language and vision domains,
EXSSNET outperforms multiple strong and recent
continual learning methods based on replay, reg-
ularization, distillation, and parameter isolation.
For the vision domain, EXSSNET outperforms the
strongest baseline by 4.8% and 1.4% on the SplitCIFAR and SplitTinyImageNet datasets, respectively, while surpassing the multitask model and bridging the gap to training individual models for each task.
In addition, for GLUE datasets, EXSSNET is 2%
better than the strongest baseline methods and sur-
passes the performance of multitask learning that
uses all the data at once. Moreover, EXSSNET ob-
tains an average improvement of 8.3% over SupSup for sparse masks that activate 2-10% of the model parameters, and scales to a large number of tasks (100).
Furthermore, EXSSNET with the KKT module
learns new tasks in as few as 30 epochs compared
to 100 epochs without it, while achieving 3.2%
higher accuracy on the SplitCIFAR100 dataset. In
summary, our contributions are listed below:
- We propose a simple and novel method to improve mask learning by combining it with exclusive subnetwork weight training to improve CL performance while preventing CF.
- We propose a KNN-based Knowledge Transfer (KKT) module for supermask initialization that dynamically identifies previous tasks to transfer knowledge to learn new tasks better and faster.
- Extensive experiments on NLP and vision tasks show that EXSSNET outperforms strong baselines and is comparable to the multitask model for NLP tasks while surpassing it for vision tasks. Moreover, EXSSNET works well for sparse masks and scales to a large number of tasks.

Figure 2: Test accuracy versus the mask density for 100-way CIFAR100 classification. Averaged over 3 seeds. (The plot compares ExSSNeT, SSNeT, SupSup, and a fully trained model.)
2 Motivation
Using sparsity for CL is an effective technique to
learn multiple tasks, i.e., by encoding them in dif-
ferent subnetworks inside a single model. SupSup
(Wortsman et al.,2020) is an instantiation of this
that initializes the network weights randomly and
then learns a separate supermask for each task (see
Figure 7). They prevent CF because the weights of
the network are fixed and never updated. However,
this choice also gives rise to crucial problems, as discussed below.
Problem 1 - Sub-Optimal Performance of Su-
permask: Although fixed network weights in
SupSup prevent CF, this also restricts the repre-
sentational capacity, leading to worse performance
compared to a fully trained network. In Figure 2,
we report the test accuracy with respect to the frac-
tion of network parameters selected by the mask,
i.e., the mask density, for an underlying ResNet18 model on a single 100-way classification task on the CIFAR100 dataset. The fully trained ResNet18 model (dashed green line) achieves an accuracy of 63.9%. Similar to Zhou et al. (2019), we observe that the performance of SupSup (yellow dashed line) is at least 8.3% worse compared to a fully trained
model. As a possible partial remedy, we propose a
simple solution, SSNET (Supermask SubNEtwork
Training), that first finds a subnetwork for a task
and then trains the subnetwork’s weights. This in-
creases the representational capacity of the subnet-
work because there are more trainable parameters.
For a single task, the test accuracy of SSNET is bet-
ter than SupSup for all mask densities and matches
the performance of the fully trained model beyond
a density threshold.

Figure 3: Average test accuracy on five 20-way tasks from SplitCIFAR100 versus sparse overlap. Averaged over 3 seeds. (The plot compares ExSSNeT, SSNeT, and SupSup.)

But as shown below, when learning multiple tasks sequentially, SSNET gives rise to parameter interference that results in CF.
Problem 2 - Parameter Interference Due to Sub-
network Weight Training for Multiple Tasks:
Next, we demonstrate that when learning multi-
ple tasks sequentially, SSNET can still lead to CF.
In Figure 3, we report the average test accuracy
versus the fraction of overlapping parameters be-
tween the masks of different tasks, i.e., the sparse
overlap (see Equation 2) for five different 20-way
classification tasks from the SplitCIFAR100 dataset with a ResNet18 model. We observe that SSNET
outperforms SupSup for lower sparse overlap but
as the sparse overlap increases, the performance
declines because the supermask subnetworks for
different tasks have more overlapping (common)
weights (bold dashed lines in Figure 1). This leads
to higher parameter interference resulting in in-
creased forgetting which suppresses the gain from
subnetwork weight training.
Our final proposal, EXSSNET, resolves both of
these problems by selectively training a subset of
the weights in the supermask subnetwork to prevent
parameter interference. When learning multiple
tasks, this prevents CF, resulting in strictly better
performance than SupSup (Figure 3) while having
the representational power to bridge the gap with fully trained models (Figure 2).
3 Method
As shown in Figure 1, when learning a new task $t_i$, EXSSNET follows three steps: (1) we learn a supermask $M_i$ for the task; (2) we use all the previous tasks' masks $M_1, \ldots, M_{i-1}$ to create a free parameter mask $M^{free}_i$, which finds the parameters selected by the mask $M_i$ that were not selected by any of the previous masks; (3) we update the weights corresponding to the mask $M^{free}_i$, as this avoids parameter interference. A high-level sketch of this loop is given below; we then formally describe each step of our method EXSSNET (Exclusive Supermask SubNEtwork Training) for a multi-layer perceptron (MLP).
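As a high-level illustration only, the following minimal Python sketch outlines the per-task loop; the callables learn_supermask, exclusive_mask, and train_exclusive are hypothetical placeholders for steps (1)-(3), which are sketched in the subsections below. This is an illustration of the procedure, not the authors' released code.

```python
def exssnet_continual_learning(model, tasks, learn_supermask, exclusive_mask, train_exclusive):
    """Sketch of the EXSSNET per-task loop: find a supermask, exclude weights
    already trained by earlier tasks, then train only the remaining free weights.
    The three step functions are passed in as callables (hypothetical helpers)."""
    previous_masks = []
    for task in tasks:
        # Step 1: learn a supermask M_i over the current (partly trained) weights.
        mask = learn_supermask(model, task)
        # Step 2: keep only the parameters of M_i never selected by earlier tasks.
        mask_free = exclusive_mask(mask, previous_masks)
        # Step 3: train the weights under mask_free; previously trained weights stay fixed.
        train_exclusive(model, task, mask, mask_free)
        previous_masks.append(mask)
    return model, previous_masks
```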
Notation: During training, we can treat each layer $l$ of an MLP network separately. An intermediate layer $l$ has $n_l$ nodes denoted by $V^{(l)} = \{v_1, \ldots, v_{n_l}\}$. For a node $v$ in layer $l$, let $I_v$ denote its input and $Z_v = \sigma(I_v)$ denote its output, where $\sigma(\cdot)$ is the activation function. Given this notation, $I_v$ can be written as $I_v = \sum_{u \in V^{(l-1)}} w_{uv} Z_u$, where $w_{uv}$ is the network weight connecting node $u$ to node $v$. The complete network weights for the MLP are denoted by $W$. When training the task $t_i$, we have access to the supermasks from all previous tasks $\{M_j\}_{j=1}^{i-1}$ and the model weights $W^{(i-1)}$ obtained after learning task $t_{i-1}$.
3.1 EXSSNET: Exclusive Supermask
SubNEtwork Training
Finding Supermasks: Following Wortsman et al. (2020), we use the algorithm of Ramanujan et al. (2019) to learn a supermask $M_i$ for the current task $t_i$. The supermask $M_i$ is learned with respect to the underlying model weights $W^{(i-1)}$, and the mask selects a fraction of weights that lead to good performance on the task without training the weights. To achieve this, we learn a score $s_{uv}$ for each weight $w_{uv}$, and once trained, these scores are thresholded to obtain the mask. Here, the input to a node $v$ is $I_v = \sum_{u \in V^{(l-1)}} w_{uv} Z_u m_{uv}$, where $m_{uv} = h(s_{uv})$ is the binary mask value and $h(\cdot)$ is a function which outputs 1 for the top-$k\%$ of the scores in the layer, with $k$ being the mask density. Next, we use a straight-through gradient estimator (Bengio et al., 2013) and iterate over the current task's data samples to update the scores for the corresponding supermask $M_i$ as follows,

$$s_{uv} = s_{uv} - \alpha \, \hat{g}_{s_{uv}}; \qquad \hat{g}_{s_{uv}} = \frac{\partial \mathcal{L}}{\partial I_v} \frac{\partial I_v}{\partial s_{uv}} = \frac{\partial \mathcal{L}}{\partial I_v} w_{uv} Z_u \qquad (1)$$
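To make the mask-finding step concrete, below is a minimal PyTorch-style sketch of a masked linear layer in the spirit of the edge-popup algorithm of Ramanujan et al. (2019): the weights stay fixed while the scores $s_{uv}$ are trained, and the top-$k\%$ thresholding $h(\cdot)$ uses a straight-through gradient estimator. Class names such as GetSubnet and MaskedLinear are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GetSubnet(torch.autograd.Function):
    """Top-k% thresholding h(s) with a straight-through gradient estimator."""

    @staticmethod
    def forward(ctx, scores, density):
        k = int(density * scores.numel())
        mask = torch.zeros_like(scores)
        _, idx = scores.flatten().topk(k)   # indices of the top-k scores in the layer
        mask.view(-1)[idx] = 1.0            # keep the top-k connections, drop the rest
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients to the scores unchanged.
        return grad_output, None


class MaskedLinear(nn.Module):
    """Linear layer with fixed weights gated by a learned supermask."""

    def __init__(self, in_features, out_features, density=0.1):
        super().__init__()
        # Weights are frozen while the mask (scores) is being learned.
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features),
                                   requires_grad=False)
        self.scores = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.density = density

    def forward(self, x):
        mask = GetSubnet.apply(self.scores, self.density)
        return F.linear(x, self.weight * mask)
```

During mask learning, only the scores receive gradients; after training, thresholding the scores yields the binary supermask $M_i$ used in the subsequent steps.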
Finding Exclusive Mask Parameters: Given a learned mask $M_i$, we use all the previous tasks' masks $M_1, \ldots, M_{i-1}$ to create a free parameter mask $M^{free}_i$, which finds the parameters selected by the mask $M_i$ that were not selected by any of the previous masks. We do this by (1) creating a new mask $M_{1:i-1}$ containing all the parameters already updated by any of the previous tasks, obtained by taking a union of all the previous masks $\{M_j\}_{j=1}^{i-1}$ using the logical or operation, and (2) obtaining the mask $M^{free}_i$ by intersecting the network parameters not used by any previous task, given by the negation of the mask $M_{1:i-1}$, with the current task mask $M_i$ via a logical and operation. Next, we use this mask $M^{free}_i$ for the exclusive supermask subnetwork weight training.
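A minimal sketch of this step is shown below, assuming each mask is stored as a binary tensor with the same shape as its weight tensor; the function name exclusive_mask is illustrative.

```python
import torch

def exclusive_mask(current_mask, previous_masks):
    """Return M_free_i: parameters selected by the current mask M_i that were
    never selected (and hence never trained) by any previous task's mask."""
    if not previous_masks:
        return current_mask.bool()
    # M_{1:i-1}: logical OR (union) over all previous masks.
    used = torch.zeros_like(current_mask, dtype=torch.bool)
    for m in previous_masks:
        used |= m.bool()
    # M_free_i = M_i AND (NOT M_{1:i-1}).
    return current_mask.bool() & ~used
```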
Exclusive Supermask Subnetwork Weight Training: For training the subnetwork parameters for task $t_i$ given the free parameter mask $M^{free}_i$, we perform the forward pass on the model as $\mathrm{model}(x, W \odot \hat{M}_i)$, where $\hat{M}_i = M^{free}_i + ((1 - M^{free}_i) \odot M_i).\mathrm{detach}()$ and $\odot$ is element-wise multiplication. Hence, $\hat{M}_i$ allows us to use all the connections in $M_i$ during the forward pass of training, but during the backward pass, only the parameters in $M^{free}_i$ are updated because the gradient value is 0 for all the weights $w_{uv}$ where $m^{free}_{uv} = 0$. During inference on task $t_i$, we use the mask $M_i$. In contrast, SSNET uses the task mask $M_i$ both during training and inference as $\mathrm{model}(x, W^{(i-1)} \odot M_i)$. This updates all the parameters in the mask, including the parameters that were already updated by previous tasks, which results in CF. Therefore, in cases where the sparse overlap is high, EXSSNET is preferred over SSNET. To summarize, EXSSNET circumvents the CF issue of SSNET while benefiting from subnetwork training to improve overall performance, as shown in Figure 3.
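A minimal PyTorch sketch of the training-time forward pass for one layer is given below. To realize the behavior described above, the stop-gradient (detach) is applied to the contribution of the overlapping, already-trained weights, so they participate in the forward pass but receive no gradient; this is a sketch under that interpretation of $\hat{M}_i$, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def exclusive_forward(x, weight, mask, mask_free):
    """Training-time forward pass for task t_i on one linear layer.

    mask      : binary tensor for M_i (the full supermask of the task)
    mask_free : binary tensor for M_free_i (parameters not used by earlier tasks)

    All connections in M_i are active in the forward pass, but gradients only
    reach the weights selected by M_free_i; the overlapping weights enter
    through a detached branch and therefore stay fixed.
    """
    mask = mask.float()
    mask_free = mask_free.float()
    trainable = weight * mask_free                          # updated by this task
    frozen = (weight * (1.0 - mask_free) * mask).detach()   # already trained, no gradient
    return F.linear(x, trainable + frozen)
```

At inference time for task $t_i$, the plain masked weights $W \odot M_i$ are used instead.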
3.2 KKT: KNN-Based Knowledge Transfer
When learning multiple tasks, it is desirable to transfer information learned from previous tasks to achieve better performance on new tasks and to learn them faster (Biesialska et al., 2020). Hence, we propose a K-Nearest Neighbours (KNN) based knowledge transfer (KKT) module that uses KNN classification to dynamically find the most relevant previous task (Veniat et al., 2021) to initialize the supermask for the current task.
To be more specific, before learning the mask $M_i$ for the current task $t_i$, we randomly sample a small fraction of data from task $t_i$ and split it into a train and test set. Next, we use the trained subnetworks of each previous task $t_1, \ldots, t_{i-1}$ to obtain features on this sampled data. Then we learn $i-1$ independent KNN-classification models using these features. Then we evaluate these $i-1$ models on the sampled test set to obtain accuracy scores, which denote the predictive power of the features from each previous task for the current task. Finally, we select the previous task with the highest accuracy; if its features have better-than-random predictive power for the current task, we use its supermask as the starting point for learning the current task's mask $M_i$.
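The selection step of the KKT module can be sketched as follows, assuming a callable extract_features(task_id, x) that returns features of the sampled data from the frozen subnetwork of a previous task; the function name, split ratio, and number of neighbors are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def select_transfer_task(sample_x, sample_y, extract_features, num_prev_tasks, k=5):
    """Pick the previous task whose frozen-subnetwork features give the highest
    KNN accuracy on a small sample of the current task's data."""
    x_tr, x_te, y_tr, y_te = train_test_split(sample_x, sample_y, test_size=0.3)
    best_task, best_acc = None, 0.0
    for j in range(num_prev_tasks):
        # Features of the sampled data under task j's trained subnetwork.
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(extract_features(j, x_tr), y_tr)
        acc = knn.score(extract_features(j, x_te), y_te)
        if acc > best_acc:
            best_task, best_acc = j, acc
    # Only transfer if the best task beats chance; otherwise start from a fresh mask.
    chance = 1.0 / len(np.unique(sample_y))
    return best_task if best_acc > chance else None
```

If a previous task is returned, its supermask is used to initialize the current task's mask before the mask-learning step described in Section 3.1.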