
Homogeneous and random basis-masks. Somewhat surprisingly, we can even generate all basis-masks
from the same initial task using different random seeds for the learning algorithm (but the same randomly
initialized backbone network). We show that with a sufficiently large number of such homogeneous im-
pressions, our algorithm learns linear combinations with close-to-benchmark accuracy on new tasks. We
are reminded of an infant taking different “snapshots” of the same object in order to infer the properties of
a new one. This homogeneous setting is particularly useful for addressing possible drift in the data; akin to
ensembling, it leverages the power of linear combinations for transfer learning. An additional important
advantage is that the homogeneous setting places no limit on the number of basis-masks we can generate ab initio.
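To make this concrete, the following is a minimal PyTorch sketch of how a single layer can modulate frozen, randomly initialized weights with a learned linear combination of fixed binary basis-masks; the class name, shapes, and initialization are illustrative assumptions rather than the exact implementation used in our experiments, but the only per-task parameters are the mixing coefficients (one per basis-mask for the layer).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskCombinationLinear(nn.Module):
    """Sketch of a layer whose frozen random weights are modulated by a
    learned linear combination of fixed binary basis-masks. Only the
    mixing coefficients `alpha` are trained for a new task."""

    def __init__(self, in_features, out_features, basis_masks):
        super().__init__()
        # Frozen, randomly initialized backbone weights (never trained).
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01,
                                   requires_grad=False)
        # Fixed basis-masks, stacked into a (K, out_features, in_features) buffer.
        self.register_buffer("basis_masks", torch.stack(basis_masks))
        # Per-task parameters: one coefficient per basis-mask for this layer.
        self.alpha = nn.Parameter(torch.ones(len(basis_masks)) / len(basis_masks))

    def forward(self, x):
        # Soft mask = sum_k alpha_k * m_k, applied elementwise to the frozen weights.
        soft_mask = torch.einsum("k,koi->oi", self.alpha, self.basis_masks)
        return F.linear(x, self.weight * soft_mask)
```

For a new task, the backbone weights and the basis-masks stay fixed and only the coefficients are optimized with the task's training loss.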
As another baseline for ImpressLearn, we experimented with optimizing for a linear combination of entirely
random masks of desired sparsity. We demonstrate on several benchmarks that if we choose a sufficiently
large collection of such random basis-masks, our optimization still yields competitive performance. While
combinations of random masks naturally lag behind the heterogeneous and the homogeneous settings, we
show that there is a trade-off between the number of masks and their task-specificity (non-randomness). In
settings where producing task-specific basis-masks is costly, optimizing for a linear combination of a large
number of random masks can still yield satisfactory results.
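As an illustration of this baseline, the snippet below samples a collection of random binary basis-masks in which roughly a chosen fraction of the weights stays active (the fraction we refer to as sparsity above); the function name, seeding, and sampling scheme are illustrative assumptions and may differ from the exact procedure used in our experiments.

```python
import torch

def random_basis_masks(shape, num_masks, density, seed=0):
    """Sample `num_masks` independent binary masks in which roughly a
    `density` fraction of the entries are ones, i.e., roughly that
    fraction of the backbone weights in this layer stays active."""
    gen = torch.Generator().manual_seed(seed)
    return [(torch.rand(shape, generator=gen) < density).float()
            for _ in range(num_masks)]

# e.g., 100 random basis-masks keeping ~10% of the first LeNet-300-100 layer
masks = random_basis_masks((300, 784), num_masks=100, density=0.10)
```

Masks sampled this way can be plugged into the layer sketched earlier, with only the mixing coefficients trained per task.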
Example: LeNet-300-100 on RotatedMNIST. As a preview of our approach and its performance,
Figure 2 shows the accuracy of ImpressLearn compared to SupSup on the RotatedMNIST dataset. First, as a
sanity check, we apply basis-masks obtained from one task to tasks they were not optimized for. As expected,
this yields essentially random accuracy (see X in Figure 2), confirming that the performance of ImpressLearn
is beyond pure transfer learning and comes from linearly combining the initial impressions. Next, Figure
2 illustrates that, even with a small number of heterogeneous basis-masks, ImpressLearn is on par with or
superior to SupSup on unseen tasks. Note that, for each additional non-basis task, ImpressLearn with 10
basis-masks requires only 3×10 = 30 parameters (one per basis-mask per layer); in contrast, SupSup needs
to generate an entire binary mask, which at 10% sparsity requires specifying 25,000+ tensor indices.
Figure 2 also illustrates the performance of ImpressLearn with homogeneous basis-masks; in this setting,
we need a larger number of basis-masks to achieve accuracy comparable to the heterogeneous scenario.
Ultimately, however, we still match the performance of SupSup with a vastly smaller number of additional
per-task parameters. Lastly, the rightmost plot in Figure 2 shows that our algorithm can successfully
operate when task identity is not provided at inference (cf. GN regime (Wortsman et al., 2020)). Here, our
optimization procedure with the entropy objective finds the “correct” basis-mask or a linear combination
of basis-masks that performs similarly to or better than the one-shot baseline of Wortsman et al. (2020).
In Section 4, we provide empirical evidence of the efficacy of ImpressLearn on a variety of benchmarks,
outlining the radical savings in the number of parameters that must be stored per task compared to SupSup.
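The per-task storage comparison quoted above can be reproduced with a quick back-of-the-envelope calculation over the three LeNet-300-100 weight matrices (biases and any per-mask bookkeeping overhead are ignored).

```python
# LeNet-300-100 weight-matrix shapes for the 784-300-100-10 architecture.
layer_shapes = [(784, 300), (300, 100), (100, 10)]
total_weights = sum(fan_in * fan_out for fan_in, fan_out in layer_shapes)  # 266,200

# SupSup: a fresh binary mask per task; at 10% sparsity, ~10% of the
# weight indices must be recorded.
supsup_per_task = round(0.10 * total_weights)                 # 26,620 indices

# ImpressLearn: one coefficient per basis-mask per layer.
num_basis_masks = 10
impresslearn_per_task = num_basis_masks * len(layer_shapes)   # 30 coefficients

print(supsup_per_task, impresslearn_per_task)                 # 26620 30
```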
The GN regime. In principle, both SupSup and ImpressLearn require access to task identities in order
to apply the right mask to the backbone network during inference. To relax this requirement, Wortsman
et al. (2020) extend their algorithm to a more challenging regime, coined GN (Given/Not given), where
task identifiers are present during training but unavailable at inference. In this regime, Wortsman et al.
(2020) use a one-shot minimization of the entropy of the model’s outputs to single out the correct mask. In
Section 4, we show that our algorithm is able to achieve this feat, too. We demonstrate that, when applied
to basis-tasks in the GN regime, the optimization routine of ImpressLearn either identifies the corresponding
basis-mask or even finds a better performing combination of basis-masks.
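To convey the flavor of entropy-based task inference, the sketch below scores each candidate basis-mask by the mean entropy of the model's predictions on an unlabeled batch and returns the most confident one; `apply_mask` is a hypothetical callable producing logits under a given mask, and this simple argmin is a stand-in rather than the exact one-shot procedure of Wortsman et al. (2020) or our coefficient optimization.

```python
import torch
import torch.nn.functional as F

def infer_task_by_entropy(apply_mask, candidate_masks, x):
    """Return the index of the candidate mask under which the model's
    predictions on the unlabeled batch `x` have the lowest mean entropy."""
    entropies = []
    for mask in candidate_masks:
        log_probs = F.log_softmax(apply_mask(x, mask), dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        entropies.append(entropy)
    return int(torch.stack(entropies).argmin())
```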
The remainder of the paper is organized as follows. In Section 2, we briefly review related work and
general approaches to countering catastrophic forgetting, highlighting research that motivated our approach.
In Section 3, we detail the ImpressLearn algorithm and discuss its components and features. In Section
4, we demonstrate the effectiveness of ImpressLearn on a variety of datasets and architectures, showing
close-to-benchmark performance on unseen tasks with a drastically reduced parameter count, especially
when the number of incoming tasks is large. Additionally, we conduct several ablation studies to provide
a better understanding of the algorithm. Finally, in Section 5, we discuss the limitations of our work and
avenues for future research.