Exclusive Supermask Subnetwork Training for Continual Learning
Prateek Yadav & Mohit Bansal
Department of Computer Science
UNC Chapel Hill
{praty,mbansal}@cs.unc.edu
Abstract
Continual Learning (CL) methods focus on accumulating knowledge over time while avoiding catastrophic forgetting. Recently, Wortsman et al. (2020) proposed a CL method, SupSup, which uses a randomly initialized, fixed base network (model) and finds a supermask for each new task that selectively keeps or removes each weight to produce a subnetwork. They prevent forgetting as the network weights are not being updated. Although there is no forgetting, the performance of SupSup is sub-optimal because fixed weights restrict its representational power. Furthermore, there is no accumulation or transfer of knowledge inside the model when new tasks are learned. Hence, we propose EXSSNET (Exclusive Supermask SubNEtwork Training), which performs exclusive and non-overlapping subnetwork weight training. This avoids conflicting updates to the shared weights by subsequent tasks to improve performance while still preventing forgetting. Furthermore, we propose a novel KNN-based Knowledge Transfer (KKT) module that utilizes previously acquired knowledge to learn new tasks better and faster. We demonstrate that EXSSNET outperforms strong previous methods on both NLP and Vision domains while preventing forgetting. Moreover, EXSSNET is particularly advantageous for sparse masks that activate 2-10% of the model parameters, resulting in an average improvement of 8.3% over SupSup. Furthermore, EXSSNET scales to a large number of tasks (100). Our code is available at
https://github.com/prateeky2806/exessnet.
1 Introduction
Artificial intelligence aims to develop agents that
can learn to accomplish a set of tasks. Continual
Learning (CL) (Ring,1998;Thrun,1998) is crucial
for this, but when a model is sequentially trained
on different tasks with different data distributions,
it can lose its ability to perform well on previous
tasks, a phenomenon known as catastrophic forgetting (CF) (McCloskey and Cohen, 1989; Zhao and Schmidhuber, 1996; Thrun, 1998). This is
caused by the lack of access to data from previ-
ous tasks, as well as conflicting updates to shared
model parameters when sequentially learning mul-
tiple tasks, which is called parameter interference
(McCloskey and Cohen,1989).
Recently, some CL methods avoid parameter
interference by taking inspiration from the Lottery
Ticket Hypothesis (Frankle and Carbin,2018) and
Supermasks (Zhou et al.,2019) to exploit the
expressive power of sparse subnetworks. Given
that we have a combinatorial number of sparse
subnetworks inside a network, Zhou et al. (2019)
noted that even within randomly weighted neural
networks, there exist certain subnetworks known
as supermasks that achieve good performance. A
supermask is a sparse binary mask that selectively
keeps or removes each connection in a fixed
and randomly initialized network to produce a
subnetwork with good performance on a given
task. We call this subnetwork the supermask subnetwork; it is shown in Figure 1, highlighted by the red weights. Building upon this idea, Wortsman
et al. (2020) proposed a CL method, SupSup,
which initializes a network with fixed and random
weights and then learns a different supermask for
each new task. This allows them to prevent catas-
trophic forgetting (CF) as there is no parameter
interference (because the model weights are fixed).
Although SupSup (Wortsman et al.,2020) pre-
vents CF, there are some problems with using su-
permasks for CL: (1) The fixed random model weights in SupSup limit the supermask subnetwork's representational power, resulting in sub-optimal performance. (2) When learning a task, there is no mechanism for transferring learned knowledge from previous tasks to better learn the current task. Moreover, the model does not accumulate knowledge over time as the weights are never updated.
Figure 1: EXSSNET diagram. We start with random weights $W^{(0)}$. For task 1, we first learn a supermask $M_1$ (the corresponding subnetwork is marked by red color, column 2 row 1) and then train the weights corresponding to $M_1$, resulting in weights $W^{(1)}$ (bold red lines, column 1 row 2). For task 2, we learn the mask $M_2$ over the fixed weights $W^{(1)}$. If the weights of mask $M_2$ overlap with $M_1$ (marked by bold dashed green lines in column 3 row 1), then only the non-overlapping weights (solid green lines) of the task 2 subnetwork are updated (as shown by the bold, solid green lines in column 3 row 2). These already-trained weights (bold lines) are not updated by any subsequent task. Finally, for task 3, we learn the mask $M_3$ (blue lines) and update the solid blue weights.
To overcome the aforementioned issues, we
propose our method, EXSSNET (Exclusive
Supermask SubNEtwork Training), pronounced
as ‘excess-net’, which first learns a mask for a task
and then selectively trains a subset of weights from
the supermask subnetwork. We train the weights of
this subnetwork via exclusion, avoiding updates to parameters of the current subnetwork that have already been updated by any of the previous tasks; as illustrated in Figure 1, this also helps prevent forgetting. Training
the supermask subnetwork’s weights increases its
representational power and allows EXSSNET to
encode task-specific knowledge inside the subnet-
work (see Figure 2). This solves the first problem
and allows EXSSNET to perform comparably to a
fully trained network on individual tasks; and when
learning multiple tasks, the exclusive subnetwork
training improves the performance of each task
while still preventing forgetting (see Figure 3).
To address the second problem of knowledge transfer, we propose a $k$-nearest neighbors-based knowledge transfer (KKT) module that utilizes relevant information from previously learned tasks to improve performance on new tasks while learning them faster. Our KKT module uses KNN classification to select a subnetwork from the previously learned tasks that has better-than-random predictive power for the current task and uses it as a starting point to learn the new task.
Next, we show our method’s advantage by ex-
perimenting with both natural language and vi-
sion tasks. For natural language, we evaluate on
WebNLP classification tasks (de Masson d'Autume
et al.,2019) and GLUE benchmark tasks (Wang
et al.,2018), whereas, for vision, we evaluate on
SplitMNIST (Zenke et al.,2017), SplitCIFAR100
(De Lange and Tuytelaars,2021), and SplitTiny-
ImageNet (Buzzega et al.,2020) datasets. We
show that for both language and vision domains,
EXSSNET outperforms multiple strong and recent
continual learning methods based on replay, reg-
ularization, distillation, and parameter isolation.
For the vision domain, EXSSNET outperforms the
strongest baseline by 4.8% and 1.4% on the SplitCIFAR and SplitTinyImageNet datasets, respectively, while surpassing the multitask model and bridging the gap to training individual models for each task.
In addition, for GLUE datasets, EXSSNET is 2%
better than the strongest baseline methods and sur-
passes the performance of multitask learning that
uses all the data at once. Moreover, EXSSNET ob-
tains an average improvement of 8.3% over SupSup for sparse masks that activate 2-10% of the model parameters, and scales to a large number of tasks (100).
Furthermore, EXSSNET with the KKT module
learns new tasks in as few as 30 epochs compared
to 100 epochs without it, while achieving 3.2%
higher accuracy on the SplitCIFAR100 dataset. In
summary, our contributions are listed below:
- We propose a simple and novel method to improve mask learning by combining it with exclusive subnetwork weight training to improve CL performance while preventing CF.
- We propose a KNN-based Knowledge Transfer (KKT) module for supermask initialization that dynamically identifies previous tasks to transfer knowledge to learn new tasks better and faster.
- Extensive experiments on NLP and vision tasks show that EXSSNET outperforms strong baselines and is comparable to the multitask model for NLP tasks while surpassing it for vision tasks. Moreover, EXSSNET works well for sparse masks and scales to a large number of tasks.

Figure 2: Test accuracy versus the mask density for 100-way CIFAR100 classification. Averaged over 3 seeds. (The plot compares ExSSNeT, SSNeT, SupSup, and a fully trained model.)
2 Motivation
Using sparsity for CL is an effective technique to
learn multiple tasks, i.e., by encoding them in dif-
ferent subnetworks inside a single model. SupSup
(Wortsman et al.,2020) is an instantiation of this
that initializes the network weights randomly and
then learns a separate supermask for each task (see
Figure 7). They prevent CF because the weights of
the network are fixed and never updated. However,
this choice also gives rise to crucial problems, as discussed below.
Problem 1 - Sub-Optimal Performance of Su-
permask: Although fixed network weights in
SupSup prevent CF, this also restricts the repre-
sentational capacity, leading to worse performance
compared to a fully trained network. In Figure 2,
we report the test accuracy with respect to the frac-
tion of network parameters selected by the mask,
i.e., the mask density, for an underlying ResNet18 model on a single 100-way classification task on the CIFAR100 dataset. The fully trained ResNet18 model (dashed green line) achieves an accuracy of 63.9%. Similar to Zhou et al. (2019), we observe that the performance of SupSup (yellow dashed line) is at least 8.3% worse compared to a fully trained
model. As a possible partial remedy, we propose a
simple solution, SSNET (Supermask SubNEtwork
Training), that first finds a subnetwork for a task
and then trains the subnetwork’s weights. This in-
creases the representational capacity of the subnet-
work because there are more trainable parameters.
For a single task, the test accuracy of SSNET is bet-
ter than SupSup for all mask densities and matches
the performance of the fully trained model beyond
a density threshold.

Figure 3: Average test accuracy on five 20-way tasks from SplitCIFAR100 versus sparse overlap. Averaged over 3 seeds. (The plot compares ExSSNeT, SSNeT, and SupSup.)

But as shown below, when learning multiple tasks sequentially, SSNET gives rise to parameter interference that results in CF.
Problem 2 - Parameter Interference Due to Sub-
network Weight Training for Multiple Tasks:
Next, we demonstrate that when learning multi-
ple tasks sequentially, SSNET can still lead to CF.
In Figure 3, we report the average test accuracy
versus the fraction of overlapping parameters be-
tween the masks of different tasks, i.e., the sparse
overlap (see Equation 2) for five different 20-way
classification tasks from the SplitCIFAR100 dataset with a ResNet18 model. We observe that SSNET
outperforms SupSup for lower sparse overlap but
as the sparse overlap increases, the performance
declines because the supermask subnetworks for
different tasks have more overlapping (common)
weights (bold dashed lines in Figure 1). This leads
to higher parameter interference resulting in in-
creased forgetting which suppresses the gain from
subnetwork weight training.
Our final proposal, EXSSNET, resolves both of
these problems by selectively training a subset of
the weights in the supermask subnetwork to prevent
parameter interference. When learning multiple
tasks, this prevents CF, resulting in strictly better
performance than SupSup (Figure 3) while having
the representational power to bridge the gap with fully trained models (Figure 2).
3 Method
As shown in Figure 1, when learning a new task $t_i$, EXSSNET follows three steps: (1) we learn a supermask $M_i$ for the task; (2) we use all the previous tasks' masks $M_1, \ldots, M_{i-1}$ to create a free parameter mask $M^{free}_i$, which finds the parameters selected by the mask $M_i$ that were not selected by any of the previous masks; (3) we update the weights corresponding to the mask $M^{free}_i$, as this avoids parameter interference. A high-level sketch of this loop is given below; we then formally describe each step of our method EXSSNET (Exclusive Supermask SubNEtwork Training) for a multi-layer perceptron (MLP).
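As a high-level illustration only, the following minimal Python sketch outlines the per-task loop; the callables learn_supermask, exclusive_mask, and train_exclusive are hypothetical placeholders for steps (1)-(3), which are sketched in the subsections below. This is an illustration of the procedure, not the authors' released code.

```python
def exssnet_continual_learning(model, tasks, learn_supermask, exclusive_mask, train_exclusive):
    """Sketch of the EXSSNET per-task loop: find a supermask, exclude weights
    already trained by earlier tasks, then train only the remaining free weights.
    The three step functions are passed in as callables (hypothetical helpers)."""
    previous_masks = []
    for task in tasks:
        # Step 1: learn a supermask M_i over the current (partly trained) weights.
        mask = learn_supermask(model, task)
        # Step 2: keep only the parameters of M_i never selected by earlier tasks.
        mask_free = exclusive_mask(mask, previous_masks)
        # Step 3: train the weights under mask_free; previously trained weights stay fixed.
        train_exclusive(model, task, mask, mask_free)
        previous_masks.append(mask)
    return model, previous_masks
```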
Notation: During training, we can treat each layer $l$ of an MLP network separately. An intermediate layer $l$ has $n_l$ nodes denoted by $V^{(l)} = \{v_1, \ldots, v_{n_l}\}$. For a node $v$ in layer $l$, let $I_v$ denote its input and $Z_v = \sigma(I_v)$ denote its output, where $\sigma(\cdot)$ is the activation function. Given this notation, $I_v$ can be written as $I_v = \sum_{u \in V^{(l-1)}} w_{uv} Z_u$, where $w_{uv}$ is the network weight connecting node $u$ to node $v$. The complete network weights for the MLP are denoted by $W$. When training the task $t_i$, we have access to the supermasks from all previous tasks $\{M_j\}_{j=1}^{i-1}$ and the model weights $W^{(i-1)}$ obtained after learning task $t_{i-1}$.
3.1 EXSSNET: Exclusive Supermask
SubNEtwork Training
Finding Supermasks: Following Wortsman et al. (2020), we use the algorithm of Ramanujan et al. (2019) to learn a supermask $M_i$ for the current task $t_i$. The supermask $M_i$ is learned with respect to the underlying model weights $W^{(i-1)}$, and the mask selects a fraction of weights that lead to good performance on the task without training the weights. To achieve this, we learn a score $s_{uv}$ for each weight $w_{uv}$, and once trained, these scores are thresholded to obtain the mask. Here, the input to a node $v$ is $I_v = \sum_{u \in V^{(l-1)}} w_{uv} Z_u m_{uv}$, where $m_{uv} = h(s_{uv})$ is the binary mask value and $h(\cdot)$ is a function which outputs 1 for the top-$k\%$ of the scores in the layer, with $k$ being the mask density. Next, we use a straight-through gradient estimator (Bengio et al., 2013) and iterate over the current task's data samples to update the scores for the corresponding supermask $M_i$ as follows,

$$s_{uv} = s_{uv} - \alpha \, \hat{g}_{s_{uv}}; \qquad \hat{g}_{s_{uv}} = \frac{\partial \mathcal{L}}{\partial I_v} \frac{\partial I_v}{\partial s_{uv}} = \frac{\partial \mathcal{L}}{\partial I_v} w_{uv} Z_u \qquad (1)$$
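To make the mask-finding step concrete, below is a minimal PyTorch-style sketch of a masked linear layer in the spirit of the edge-popup algorithm of Ramanujan et al. (2019): the weights stay fixed while the scores $s_{uv}$ are trained, and the top-$k\%$ thresholding $h(\cdot)$ uses a straight-through gradient estimator. Class names such as GetSubnet and MaskedLinear are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GetSubnet(torch.autograd.Function):
    """Top-k% thresholding h(s) with a straight-through gradient estimator."""

    @staticmethod
    def forward(ctx, scores, density):
        k = int(density * scores.numel())
        mask = torch.zeros_like(scores)
        _, idx = scores.flatten().topk(k)   # indices of the top-k scores in the layer
        mask.view(-1)[idx] = 1.0            # keep the top-k connections, drop the rest
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients to the scores unchanged.
        return grad_output, None


class MaskedLinear(nn.Module):
    """Linear layer with fixed weights gated by a learned supermask."""

    def __init__(self, in_features, out_features, density=0.1):
        super().__init__()
        # Weights are frozen while the mask (scores) is being learned.
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features),
                                   requires_grad=False)
        self.scores = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.density = density

    def forward(self, x):
        mask = GetSubnet.apply(self.scores, self.density)
        return F.linear(x, self.weight * mask)
```

During mask learning, only the scores receive gradients; after training, thresholding the scores yields the binary supermask $M_i$ used in the subsequent steps.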
Finding Exclusive Mask Parameters: Given a learned mask $M_i$, we use all the previous tasks' masks $M_1, \ldots, M_{i-1}$ to create a free parameter mask $M^{free}_i$, which finds the parameters selected by the mask $M_i$ that were not selected by any of the previous masks. We do this by (1) creating a new mask $M_{1:i-1}$ containing all the parameters already updated by any of the previous tasks, obtained by taking a union of all the previous masks $\{M_j\}_{j=1}^{i-1}$ using the logical or operation, and (2) obtaining the mask $M^{free}_i$ by intersecting the network parameters not used by any previous task, given by the negation of the mask $M_{1:i-1}$, with the current task mask $M_i$ via a logical and operation. Next, we use this mask $M^{free}_i$ for the exclusive supermask subnetwork weight training.
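A minimal sketch of this step is shown below, assuming each mask is stored as a binary tensor with the same shape as its weight tensor; the function name exclusive_mask is illustrative.

```python
import torch

def exclusive_mask(current_mask, previous_masks):
    """Return M_free_i: parameters selected by the current mask M_i that were
    never selected (and hence never trained) by any previous task's mask."""
    if not previous_masks:
        return current_mask.bool()
    # M_{1:i-1}: logical OR (union) over all previous masks.
    used = torch.zeros_like(current_mask, dtype=torch.bool)
    for m in previous_masks:
        used |= m.bool()
    # M_free_i = M_i AND (NOT M_{1:i-1}).
    return current_mask.bool() & ~used
```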
Exclusive Supermask Subnetwork Weight Training: For training the subnetwork parameters for task $t_i$ given the free parameter mask $M^{free}_i$, we perform the forward pass on the model as $\mathrm{model}(x, W \odot \hat{M}_i)$, where $\hat{M}_i = M^{free}_i + ((1 - M^{free}_i) \odot M_i).\mathrm{detach}()$ and $\odot$ is element-wise multiplication. Hence, $\hat{M}_i$ allows us to use all the connections in $M_i$ during the forward pass of training, but during the backward pass, only the parameters in $M^{free}_i$ are updated because the gradient value is 0 for all the weights $w_{uv}$ where $m^{free}_{uv} = 0$. During inference on task $t_i$, we use the mask $M_i$. In contrast, SSNET uses the task mask $M_i$ both during training and inference as $\mathrm{model}(x, W^{(i-1)} \odot M_i)$. This updates all the parameters in the mask, including the parameters that were already updated by previous tasks, which results in CF. Therefore, in cases where the sparse overlap is high, EXSSNET is preferred over SSNET. To summarize, EXSSNET circumvents the CF issue of SSNET while benefiting from subnetwork training to improve overall performance, as shown in Figure 3.
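A minimal PyTorch sketch of the training-time forward pass for one layer is given below. To realize the behavior described above, the stop-gradient (detach) is applied to the contribution of the overlapping, already-trained weights, so they participate in the forward pass but receive no gradient; this is a sketch under that interpretation of $\hat{M}_i$, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def exclusive_forward(x, weight, mask, mask_free):
    """Training-time forward pass for task t_i on one linear layer.

    mask      : binary tensor for M_i (the full supermask of the task)
    mask_free : binary tensor for M_free_i (parameters not used by earlier tasks)

    All connections in M_i are active in the forward pass, but gradients only
    reach the weights selected by M_free_i; the overlapping weights enter
    through a detached branch and therefore stay fixed.
    """
    mask = mask.float()
    mask_free = mask_free.float()
    trainable = weight * mask_free                          # updated by this task
    frozen = (weight * (1.0 - mask_free) * mask).detach()   # already trained, no gradient
    return F.linear(x, trainable + frozen)
```

At inference time for task $t_i$, the plain masked weights $W \odot M_i$ are used instead.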
3.2 KKT: KNN-Based Knowledge Transfer
When learning multiple tasks, it is desirable to transfer information learned from previous tasks to achieve better performance on new tasks and to learn them faster (Biesialska et al., 2020). Hence, we propose a K-Nearest Neighbours (KNN) based knowledge transfer (KKT) module that uses KNN classification to dynamically find the most relevant previous task (Veniat et al., 2021) to initialize the supermask for the current task.
To be more specific, before learning the mask $M_i$ for the current task $t_i$, we randomly sample a small fraction of data from task $t_i$ and split it into a train and test set. Next, we use the trained subnetworks of each previous task $t_1, \ldots, t_{i-1}$ to obtain features on this sampled data. Then we learn $i-1$ independent KNN-classification models using these features. Then we evaluate these $i-1$ models on the sampled test set to obtain accuracy scores, which denote the predictive power of the features from each previous task for the current task. Finally, we select the previous task with the highest accuracy; if its features have better-than-random predictive power for the current task, we use its supermask as the starting point for learning the current task's mask $M_i$.
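The selection step of the KKT module can be sketched as follows, assuming a callable extract_features(task_id, x) that returns features of the sampled data from the frozen subnetwork of a previous task; the function name, split ratio, and number of neighbors are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def select_transfer_task(sample_x, sample_y, extract_features, num_prev_tasks, k=5):
    """Pick the previous task whose frozen-subnetwork features give the highest
    KNN accuracy on a small sample of the current task's data."""
    x_tr, x_te, y_tr, y_te = train_test_split(sample_x, sample_y, test_size=0.3)
    best_task, best_acc = None, 0.0
    for j in range(num_prev_tasks):
        # Features of the sampled data under task j's trained subnetwork.
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(extract_features(j, x_tr), y_tr)
        acc = knn.score(extract_features(j, x_te), y_te)
        if acc > best_acc:
            best_task, best_acc = j, acc
    # Only transfer if the best task beats chance; otherwise start from a fresh mask.
    chance = 1.0 / len(np.unique(sample_y))
    return best_task if best_acc > chance else None
```

If a previous task is returned, its supermask is used to initialize the current task's mask before the mask-learning step described in Section 3.1.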