ImpressLearn: Continual Learning via Combined Task Impressions
Dhrupad Bhardwaj db4045@nyu.edu
Center For Data Science
New York University
Julia Kempe jk185@nyu.edu
Center For Data Science
New York University
Artem Vysogorets amv458@nyu.edu
Center For Data Science
New York University
Angela M. Teng at2507@nyu.edu
Center For Data Science
New York University
Evaristus C. Ezekwem ece278@nyu.edu
Center For Data Science
New York University
Abstract
This work proposes a new method to sequentially train deep neural networks on multiple
tasks without suffering catastrophic forgetting, while endowing them with the capability to
quickly adapt to unseen tasks. Starting from existing work on network masking (Wortsman
et al., 2020), we show that simply learning a linear combination of a small number of
task-specific supermasks (impressions) on a randomly initialized backbone network is sufficient
both to retain accuracy on previously learned tasks and to achieve high accuracy on
unseen tasks. In contrast to previous methods, we do not need to generate dedicated
masks or contexts for each new task; instead, we leverage transfer learning to keep per-task
parameter overhead small. Our work illustrates the power of linearly combining individual
impressions, each of which fares poorly in isolation, to achieve performance comparable to
a dedicated mask. Moreover, even repeated impressions from the same task (homogeneous
masks), when combined, can approach the performance of heterogeneous combinations if
sufficiently many impressions are used. Our approach scales more efficiently than existing
methods, often requiring orders of magnitude fewer parameters, and can function without
modification even when task identity is missing. In addition, when task labels are not
given at inference, our algorithm offers an often favorable alternative to the
one-shot procedure used by Wortsman et al. (2020). We evaluate our method on a number
of well-known image classification datasets and network architectures.
1 Introduction
Sequential learning without catastrophic forgetting has been an area of active research in machine learning
for some time (Maes et al., 1996; Thrun & Pratt, 1998; Serra et al., 2018). A precondition for achieving
Artificial General Intelligence is that models should be able to learn and remember a wide variety of tasks
sequentially, without forgetting previously learned ones.

Figure 1: Intuitive representation of ImpressLearn. Binary basis-masks (left) are linearly combined via
learnable coefficients α to construct a task-specific real-valued mask (right) that is applied to a fixed,
randomly initialized backbone network at inference.

In real-world scenarios, data from different tasks may not be available simultaneously, which makes it
imperative to allow continued learning of a potentially unbounded number of tasks (see also The Sequential
Learning Problem (McCloskey & Cohen,
1989), Constraints Imposed by Learning and Forgetting Functions (Ratcliff, 1990) and Lifelong Learning
Algorithms (Thrun & Pratt, 1998)). Recently, some successful approaches to combat this problem use
task-specific sub-models, which allow neural networks to context-switch between different learning tasks
(Wortsman et al., 2020; Mallya et al., 2018; Mancini et al., 2018). The underlying context for each task
can be represented as “filters” or “masks”, altering the network’s structure for each task. Yet all of these
approaches scale unfavorably with the number of unique tasks to be learned.
ImpressLearn. We propose a novel method leveraging transfer learning and network masking to sequen-
tially learn a practically unlimited number of tasks with much lower per-task parameter overhead compared
to prevailing benchmarks. Our method, termed ImpressLearn, uses elements from Supermasks in Superposition
(SupSup) by Wortsman et al. (2020); SupSup builds on the observation that randomly initialized neural
networks contain subnetworks, obtained by applying binary parameter masks (supermasks), that achieve
good performance on any particular task (Zhou et al., 2019). These supermasks can be learned with stan-
dard gradient descent and stored, one mask per task. At inference, the appropriate task-specific mask is
applied when task identity is known. When the identity of a previously seen task is not provided, the correct
mask can be inferred via entropy minimization.
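For illustration, a minimal PyTorch-style sketch of how a single supermask modulates a frozen, randomly initialized layer is given below. The layer type, sizes, and initialization scale are placeholders rather than the exact setup of Wortsman et al. (2020), and the binary mask itself would be found separately (e.g., via their edge-popup procedure) and stored per task.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SupermaskedLinear(nn.Module):
        """Frozen random linear layer modulated by one binary supermask per task."""
        def __init__(self, in_features, out_features):
            super().__init__()
            # Backbone weights are randomly initialized and never updated.
            self.weight = nn.Parameter(0.05 * torch.randn(out_features, in_features),
                                       requires_grad=False)
            # A binary mask (learned elsewhere, one per task) selects a subnetwork.
            self.register_buffer("mask", torch.ones_like(self.weight))

        def forward(self, x):
            # Only the masked-in weights contribute to the output.
            return F.linear(x, self.weight * self.mask)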
While SupSup provides strong results, it restricts knowledge transfer between tasks and requires considerable
memory overhead for each additional task. ImpressLearn, on the other hand, addresses both problems at
once: it accommodates positive knowledge transfer by reusing existing supermasks and, as a result, requires
far fewer additional parameters for each new task. Our supermasks (basis-masks) are constructed from
a small batch of initial basis-tasks (heterogeneous setting), or from just a single basis-task (homogeneous
setting). By default, we select the first N tasks as basis-tasks, where N controls the performance-cost
trade-off. Leveraging transfer learning, this set of learned basis-masks, each of which can be interpreted
as an impression of a previously seen basis-task, serves as a collection of latent features for new learning
objectives, encoding common structural information. This might be reminiscent of associative learning where
impressions of previous scenarios are combined to cope with new ones. Once a set of basis-masks is identified,
we learn an appropriate linear combination of these impressions to quickly construct a real-valued mask that
performs well on an unseen task. Hence, apart from a fixed number of basis-masks, only a small number
of floating-point coefficients need to be learned and stored for each subsequent task. This greatly benefits
scalability, a major drawback of previous methods (Wortsman et al., 2020; Mallya et al., 2018). In principle,
the efficiency of ImpressLearn allows for an unlimited number of new tasks. In particular, as the number of
tasks increases, the per-task memory overhead converges to just a few extra parameters after amortizing the
cost of basis-masks, which is considerably cheaper than allocating additional parameter masks for each task.
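A minimal sketch of this construction is given below (PyTorch-style, for illustration only; class and variable names are placeholders, and the per-layer coefficient granularity mirrors the "one coefficient per basis-mask per layer" accounting used in the example further on). Only the coefficients α are trainable for a new task; the backbone weights and the binary basis-masks stay frozen.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ImpressedLinear(nn.Module):
        """Frozen random layer whose effective weights come from a learned
        linear combination of N frozen binary basis-masks."""
        def __init__(self, in_features, out_features, basis_masks):
            super().__init__()
            self.weight = nn.Parameter(0.05 * torch.randn(out_features, in_features),
                                       requires_grad=False)          # frozen backbone
            self.register_buffer("basis", torch.stack(basis_masks))  # (N, out, in), frozen
            n = len(basis_masks)
            # The only per-task trainable parameters: N floating-point coefficients.
            self.alpha = nn.Parameter(torch.full((n,), 1.0 / n))

        def forward(self, x):
            # Real-valued task mask = alpha-weighted sum of the binary basis-masks.
            task_mask = torch.einsum("n,noi->oi", self.alpha, self.basis)
            return F.linear(x, self.weight * task_mask)

    # Training a new task then updates only the alpha coefficients, e.g.:
    # opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad])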
Homogeneous and random basis-masks. Somewhat surprisingly, we can even generate all basis-masks
from the same initial task using different random seeds for the learning algorithm (but the same randomly
initialized backbone network). We show that, with a sufficiently large number of such homogeneous
impressions, our algorithm learns linear combinations with close-to-benchmark accuracy on new tasks. We
are reminded of an infant learning by taking different “snapshots” of the same object to infer properties of
another. This homogeneous setting is particularly useful for addressing possible drift in the data; akin to
ensembling, it leverages the power of linear combinations for transfer learning. An additional important
advantage is that the homogeneous setting places no limit on the number of basis-masks we can generate ab initio.
As another baseline for ImpressLearn, we experimented with optimizing for a linear combination of entirely
random masks of desired sparsity. We demonstrate on several benchmarks that if we choose a sufficiently
large collection of such random basis-masks, our optimization still yields competitive performance. While
combinations of random masks naturally lag behind the heterogeneous and the homogeneous settings, we
show that there is a trade-off between the number of masks and their task-specificity (non-randomness). In
settings where producing task-specific basis-masks is costly, optimizing for a linear combination of a large
number of random masks can still yield satisfactory results.
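As an illustration of this baseline, one simple way to draw random binary masks with a fixed fraction of active weights is sketched below; the helper name and the 10% default density are our illustrative choices, not a prescribed recipe.

    import torch

    def random_basis_masks(shape, num_masks, density=0.10, seed=0):
        """Return binary masks that each keep a `density` fraction of the entries."""
        gen = torch.Generator().manual_seed(seed)
        masks = []
        for _ in range(num_masks):
            scores = torch.rand(shape, generator=gen).flatten()
            k = max(1, int(density * scores.numel()))
            mask = torch.zeros_like(scores)
            mask[torch.topk(scores, k).indices] = 1.0   # keep the top-k random scores
            masks.append(mask.view(shape))
        return masks

    # e.g., 40 random basis-masks for a 300x784 layer at 10% density:
    # masks = random_basis_masks((300, 784), num_masks=40)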
Example: LeNet-300-100 on RotatedMNIST. As a preview of our approach and its performance,
Figure 2 shows the accuracy of ImpressLearn compared to SupSup on the RotatedMNIST dataset. First, as a
sanity check, we apply basis-masks obtained from one task to tasks they were not optimized for. As expected,
this yields essentially random accuracy (see X in Figure 2), confirming that the performance of ImpressLearn
is beyond pure transfer learning and comes from linearly combining the initial impressions. Next, Figure
2 illustrates that ImpressLearn, even with a small number of heterogeneous basis-masks, is on par with or
even superior to SupSup on unseen tasks. Note that, for each additional non-basis task, ImpressLearn with 10
basis-masks requires only 3×10 = 30 parameters (one per basis-mask per layer); in contrast, SupSup needs
to generate an entire binary mask, which requires 25,000+ tensor indices to specify assuming 10% sparsity.
Figure 2 also illustrates the performance of ImpressLearn over homogeneous basis-masks; in this setting,
we need a larger number of basis-masks to achieve accuracy comparable to the heterogeneous scenario.
Ultimately, however, we still match the performance of SupSup with a vastly smaller number of additional
per-task parameters. Lastly, the rightmost plot in Figure 2 shows that our algorithm can successfully
operate when task identity is not provided at inference (cf. GN regime (Wortsman et al., 2020)). Here, our
optimization procedure with the entropy objective finds the “correct” basis-mask or a linear combination
of basis-masks yielding similar or better performance compared to the one-shot baseline from Wortsman
et al. (2020). In Section 4, we provide empirical evidence of the efficacy of ImpressLearn on a variety of
benchmarks, highlighting the drastic savings in parameters that need to be stored per task compared to SupSup.
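For concreteness, the per-task overhead comparison in the example above can be reproduced with a few lines of arithmetic (weight counts follow the LeNet-300-100 architecture with biases ignored; 10% reflects the sparsity level quoted above, interpreted here as the fraction of weights kept).

    # LeNet-300-100 weight counts per layer (784-300-100-10, biases ignored).
    layer_weights = [784 * 300, 300 * 100, 100 * 10]   # 235,200 + 30,000 + 1,000
    total_weights = sum(layer_weights)                  # 266,200

    # ImpressLearn: one coefficient per basis-mask per layer.
    num_basis_masks, num_layers = 10, 3
    impresslearn_per_task = num_basis_masks * num_layers     # 3 x 10 = 30 floats

    # SupSup: a fresh binary mask per task; at the quoted 10% level this means
    # roughly 0.10 * 266,200 = 26,620 active-weight indices to record.
    supsup_per_task = int(0.10 * total_weights)

    print(impresslearn_per_task, supsup_per_task)   # 30 vs 26,620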
The GN regime. In principle, both SupSup and ImpressLearn require access to task identities in order
to apply the right mask to the backbone network during inference. To relax this requirement, Wortsman
et al. (2020) extend their algorithm to a more challenging regime (coined GN, Given/Not given) where
task identifiers are present during training but unavailable at inference. In this regime, Wortsman et al.
(2020) use a one-shot minimization of the entropy of the model’s outputs to single out the correct mask. In
Section 4, we show that our algorithm is able to achieve this feat, too. We demonstrate that, when applied
to basis-tasks in the GN regime, the optimization routine of ImpressLearn either identifies the corresponding
basis-mask or even finds a better-performing combination of basis-masks.
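A hedged sketch of what such task inference could look like for ImpressLearn is given below: it minimizes the entropy of the model's outputs on an unlabeled batch over the mixing coefficients. The model interface model(x, alpha), the step count, and the learning rate are illustrative assumptions; Wortsman et al. (2020) instead apply a one-shot version of the entropy criterion over their stored masks.

    import torch
    import torch.nn.functional as F

    def infer_task_by_entropy(model, x, num_basis, steps=20, lr=0.1):
        """Find mixing coefficients over the basis-masks that minimize the entropy of
        the model's predictions on an unlabeled batch x; model(x, alpha) is assumed
        to apply the alpha-weighted combination of basis-masks."""
        alpha = torch.full((num_basis,), 1.0 / num_basis, requires_grad=True)
        opt = torch.optim.SGD([alpha], lr=lr)
        for _ in range(steps):
            probs = F.softmax(model(x, alpha), dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
            opt.zero_grad()
            entropy.backward()
            opt.step()
        # A confident (low-entropy) solution typically concentrates on the basis-mask of
        # the correct task, or on a combination that performs at least as well.
        return alpha.detach()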
The remainder of the paper is organized as follows. In Section 2, we briefly review related work and
general approaches to countering catastrophic forgetting, highlighting research that motivated our approach.
In Section 3, we detail the ImpressLearn algorithm and discuss its components and features. In Section
4, we demonstrate the effectiveness of ImpressLearn on a variety of datasets and architectures to show
close-to-benchmark performance with a drastically reduced parameter count on unseen tasks, especially
where the number of incoming tasks is large. Additionally, we conduct several ablation studies to provide
a better understanding of the algorithm. Finally, in Section 5, we discuss the limitations of our work and
avenues for future research.