Exclusive Supermask Subnetwork Training for Continual Learning
Prateek Yadav & Mohit Bansal
Department of Computer Science
UNC Chapel Hill
{praty,mbansal}@cs.unc.edu
Abstract
Continual Learning (CL) methods focus on ac-
cumulating knowledge over time while avoid-
ing catastrophic forgetting. Recently, Worts-
man et al. (2020) proposed a CL method,
SupSup, which uses a randomly initialized,
fixed base network (model) and finds a su-
permask for each new task that selectively
keeps or removes each weight to produce a
subnetwork. They prevent forgetting as the
network weights are not being updated. Al-
though there is no forgetting, the performance
of SupSup is sub-optimal because fixed weights
restrict its representational power. Further-
more, there is no accumulation or transfer of
knowledge inside the model when new tasks
are learned. Hence, we propose EXSSNET
(Exclusive Supermask SubNEtwork Training),
which performs exclusive and non-overlapping
subnetwork weight training. This avoids con-
flicting updates to the shared weights by subse-
quent tasks, improving performance while still
preventing forgetting. Furthermore, we pro-
pose a novel KNN-based Knowledge Transfer
(KKT) module that utilizes previously acquired
knowledge to learn new tasks better and faster.
We demonstrate that EXSSNET outperforms
strong previous methods in both NLP and
Vision domains while preventing forgetting.
Moreover, EXSSNET is particularly advan-
tageous for sparse masks that activate 2-10%
of the model parameters, resulting in an aver-
age improvement of 8.3% over SupSup. Fur-
thermore, EXSSNET scales to a large num-
ber of tasks (100). Our code is available at
https://github.com/prateeky2806/exessnet.
1 Introduction
Artificial intelligence aims to develop agents that
can learn to accomplish a set of tasks. Continual
Learning (CL) (Ring, 1998; Thrun, 1998) is crucial
for this, but when a model is sequentially trained
on different tasks with different data distributions,
it can lose its ability to perform well on previous
tasks, a phenomenon known as catastrophic for-
getting (CF) (McCloskey and Cohen, 1989; Zhao
and Schmidhuber, 1996; Thrun, 1998). This is
caused by the lack of access to data from previ-
ous tasks, as well as conflicting updates to shared
model parameters when sequentially learning mul-
tiple tasks, which is called parameter interference
(McCloskey and Cohen, 1989).
Recently, some CL methods have avoided parameter
interference by taking inspiration from the Lottery
Ticket Hypothesis (Frankle and Carbin, 2018) and
Supermasks (Zhou et al., 2019) to exploit the
expressive power of sparse subnetworks. Given
that we have a combinatorial number of sparse
subnetworks inside a network, Zhou et al. (2019)
noted that even within randomly weighted neural
networks, there exist certain subnetworks known
as supermasks that achieve good performance. A
supermask is a sparse binary mask that selectively
keeps or removes each connection in a fixed
and randomly initialized network to produce a
subnetwork with good performance on a given
task. We call this subnetwork the supermask
subnetwork; it is shown in Figure 1 with its selected
weights highlighted in red. Building upon this idea, Wortsman
et al. (2020) proposed a CL method, SupSup,
which initializes a network with fixed and random
weights and then learns a different supermask for
each new task. This allows them to prevent catas-
trophic forgetting (CF) as there is no parameter
interference (because the model weights are fixed).
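As a rough sketch (not the authors' implementation), this mechanism can be expressed in PyTorch as below; the SupermaskLinear class, the add_task_mask helper, and the random mask initialization are illustrative assumptions, and the per-task mask learning itself (e.g., via trainable scores and top-k selection) is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SupermaskLinear(nn.Module):
    """Linear layer with fixed random weights and one binary supermask per task."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Fixed, randomly initialized weights; requires_grad=False means
        # they are never updated, so earlier tasks cannot be overwritten.
        self.weight = nn.Parameter(0.1 * torch.randn(out_features, in_features),
                                   requires_grad=False)
        self.masks = {}  # task_id -> binary mask over the weights

    def add_task_mask(self, task_id, sparsity=0.9):
        # Illustrative random mask keeping (1 - sparsity) of the connections;
        # in SupSup the mask for each task is learned, not drawn at random.
        self.masks[task_id] = (torch.rand_like(self.weight) > sparsity).float()

    def forward(self, x, task_id):
        # The task's subnetwork: fixed weights gated by its binary supermask.
        return F.linear(x, self.weight * self.masks[task_id])

# Two tasks share the same fixed weights but select different subnetworks,
# so learning a new mask cannot interfere with a previous task.
layer = SupermaskLinear(16, 8)
layer.add_task_mask(0)
layer.add_task_mask(1)
x = torch.randn(4, 16)
out_task0 = layer(x, task_id=0)
out_task1 = layer(x, task_id=1)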
Although SupSup (Wortsman et al., 2020) pre-
vents CF, there are some problems with using su-
permasks for CL: (1) Fixed random model weights
in SupSup limit the supermask subnetwork’s rep-
resentational power, resulting in sub-optimal perfor-
mance. (2) When learning a task, there is no mecha-
nism for transferring learned knowledge from previ-
ous tasks to better learn the current task. Moreover,
the model is not accumulating knowledge over time
as the weights are not being updated.