On the Effectiveness of Lipschitz-Driven Rehearsal in Continual Learning

Lorenzo Bonicelli¹, Matteo Boschini¹, Angelo Porrello¹, Concetto Spampinato², Simone Calderara¹

¹AImageLab - University of Modena and Reggio Emilia
²PeRCeiVe Lab - University of Catania
Abstract
Rehearsal approaches enjoy immense popularity with Continual Learning (CL)
practitioners. These methods collect samples from previously encountered data
distributions in a small memory buffer; subsequently, they repeatedly optimize
on the latter to prevent catastrophic forgetting. This work draws attention to a
hidden pitfall of this widespread practice: repeated optimization on a small pool
of data inevitably leads to tight and unstable decision boundaries, which are a
major hindrance to generalization. To address this issue, we propose Lipschitz-
DrivEn Rehearsal (LiDER), a surrogate objective that induces smoothness in the
backbone network by constraining its layer-wise Lipschitz constants w.r.t. replay
examples. By means of extensive experiments, we show that applying LiDER
delivers a stable performance gain to several state-of-the-art rehearsal CL methods
across multiple datasets, both in the presence and absence of pre-training. Through
additional ablative experiments, we highlight peculiar aspects of buffer overfitting
in CL and better characterize the effect produced by LiDER. Code is available at
https://github.com/aimagelab/LiDER.
1 Introduction
The last few years have seen a renewed interest in aiding Deep Neural Networks (DNNs) to acquire new knowledge and, at the same time, retain high performance on previously encountered data. In this regard, the mitigation of catastrophic forgetting [52] has driven recent research towards novel incremental methods [42, 68], often framed under the field of Continual Learning (CL). Among other valid CL strategies, rehearsal approaches have caught the attention of a large body of literature [62, 21, 15] thanks to their advantages. Simply put, they maintain a small fixed-size buffer containing a fraction of examples from previous tasks; afterward, these examples are mixed with the ones of the current task and hence provided continuously as training data. In this respect, different approaches establish different regularization strategies on top of the retained examples [63, 49], as well as different choices of which kind of information to store (e.g., model responses [15, 12], explanations [25], etc.).
In spite of their widespread application, these approaches fall into a common pitfall: as the memory buffer holds only a small fraction of past examples, there is a high risk of overfitting on that memory [75], thus harming generalization. Several approaches mitigate such an issue through data-augmentation techniques, either by generating different versions of the same buffer datapoint [7, 15] or by combining different examples into a single one [14, 11]. Other works [3, 4, 16, 85], instead, carefully select the valuable samples that should be inserted into the buffer: they argue that random selection may pick non-informative and noisy instances, harming the model's generalization.
This work tackles the issue described above from a different perspective, viewing catastrophic
forgetting in the light of the progressive deterioration of decision boundaries between classes.
[Figure omitted: two rows of panels plotting the decision magnitude S(x) against random and adversarial perturbation magnitudes, for rehearsed examples (first row) and non-rehearsed examples (second row), across tasks τ0 to τ4.]

Figure 1: Diagrams derived from [87] describing how the magnitude of an input perturbation affects the model's prediction, measured as the difference between the correct logit and the maximum one. The dashed lines delimit the green areas, where the correct response is preserved. The analysis is carried out on the examples of the first task of Split CIFAR-10 and spans across tasks, progressing from τ0 (left) to τ4 (right). Different rows show the decision surfaces around datapoints either contained in the memory buffer (first) or non-rehearsed (second). We refer to Sec. 5.3 for additional insights.
Indeed, while for the examples of the current task we expect the decision boundaries to be already smooth and robust against local perturbations, the same cannot be said for past examples. In fact, the restricted access to only a small portion of past tasks increases the epistemic uncertainty [38] of the model: as a consequence, we expect the decision surfaces tied to past classes to slowly erode everywhere, with the exception of certain input regions, i.e., those close to the neighborhood of buffer datapoints (thanks to their repeated optimization). We refer to Fig. 1 for a visualization of such a phenomenon, which shows the evolution of the decision surfaces around the points of the first task of Split CIFAR-10, from the first task (left) to the last one (right). We differentiate the target of the analysis (rehearsed vs. non-rehearsed examples) in distinct rows: as can be seen, the green area – the input region where the model outputs correct predictions against local input perturbations [87] – tightens around buffer datapoints (first row) and erodes for non-rehearsed examples (second row).
Such an intuition motivates our search for novel mechanisms to guarantee the robustness of the decision boundaries. To this aim, we resort to enforcing the Lipschitz continuity of the model w.r.t. its input: indeed, a long-standing research trend [82, 72, 45, 46, 80, 29] has pointed out that such a property favors generalization capabilities and robustness to adversarial attacks. In particular, constraining the Lipschitz constant of a model – intuitively, a bound on how much the model's response can change in proportion to a change in its input [5] – has proven to strengthen the decision surface around a point [22, 45, 46, 80, 91], preventing attacks of a given magnitude from changing the output of the classifier.
[Inset figure: adversarial accuracy and the Lipschitz constant L as functions of the buffer size |M| ∈ {1k, 2.5k, 4k}.]
While these works assessed Lipschitz regularization in the classical scenario (i.e., a single joint i.i.d. task), we advocate that it is even more beneficial in continual learning, particularly for those approaches based on replay memories. As shown in the inset figure, without explicit regularization, the Lipschitz constant of a model increases for smaller memory buffers: in other words, its corresponding function space becomes increasingly sensitive w.r.t. local input perturbations (as also highlighted by the lower accuracy attained in the presence of adversarial attacks). In light of the above considerations, we ascribe such a tendency to the higher uncertainty that derives from subjecting the model to a low-data training regime.
To the best of our knowledge, our work is the first attempt to assess the effectiveness of Lipschitz-constrained DNNs in continual learning: in particular, we equip several widely-known and state-of-the-art rehearsal approaches with our Lipschitz-guided optimization objective, named Lipschitz-DrivEn Rehearsal (LiDER), showing that it systematically leads to better results on several benchmarks.
2 Related works
Continual Learning. CL examines the capability of a deep model to learn from a sequence of non-i.i.d. classification tasks [23, 58] while preventing the onset of catastrophic forgetting [52]. To achieve this goal, models are trained according to specifically designed strategies, meant to influence their evolution and maximize the retention of previously acquired knowledge.
Among them, regularization methods work by introducing additional constraints in the form of loss terms; they are designed to limit the amount of total change either in parameter space [42, 89, 19, 2] or in functional space [47, 10]. Differently, structural methods purposefully organize the allocation of model capacity to prevent interference and facilitate parameter sharing [1, 51, 66, 37]. Lastly, rehearsal methods store and reuse a subset of previously seen datapoints to prevent overfitting on current data and avert forgetting [16, 49, 4, 3]. While rehearsal strategies are by far the most frequently adopted thanks to their effectiveness and flexibility [26, 4], it is not uncommon to adopt solutions combining multiple approaches [32, 1, 15, 60, 18]. In this paper, we similarly propose a strategy that leverages an existing replay memory buffer to compute an additional regularization term, aimed at conditioning the learning dynamics and avoiding overfitting.
CL evaluations are often carried out in the so-called Task-Incremental setting (TIL) [42, 89, 49, 23, 19] – that is, the model is provided a task identifier at test time to avoid interference across predictions of distinct tasks. However, recent works put an increasing focus on the harder Class-Incremental setting (CIL) [4, 32, 15, 81], which entails the production of a unified prediction encompassing all seen classes. W.r.t. the latter, TIL has been criticized as a less challenging and realistic benchmark [26, 74, 4]; we therefore conduct our main experiments on state-of-the-art CL models in the CIL setting¹.
Lipschitz-based Regularization. Standard DNNs typically suffer from their overparametrization [90, 57], leading to the tendency to overfit the training data by producing jagged decision boundaries that closely fit the seen examples. On the contrary, a model's reliability depends on its capability for generalization, which is linked to the appearance of smooth decision boundaries [82, 8, 86, 29]. Starting from the first studies focusing on this simple dichotomy, the Lipschitz constant $L$ of a DNN has been established as a commonplace measure of both smoothness and generalization [82, 72, 39] and still constitutes a key ingredient for current evaluations of model capacity [8, 30].
Most notably, Szegedy et al. [72] verify that constraining $L$ reduces the model's vulnerability to adversarial perturbations. Many current approaches to Adversarial Learning similarly operate either by minimizing $L$ at the global or local level [46, 73, 45] or by devising models characterized by a small $L$ by design [22, 33]. In other areas, the smoothing effect of $L$-based regularization has been favorably applied to both GAN training [55] and neural fields [48].
3 Method
A CL problem usually involves learning a function $f$ from a stream of data, which we formalize as a succession of separate datasets $\mathcal{T} = \{\tau_0, \tau_1, \dots, \tau_{|\mathcal{T}|}\}$, where $\tau_t = \{(x_i, y_i)\}_{i=1}^{N_t}$ and $\tau_i \cap \tau_j = \emptyset$; the label sets $Y_t$ for each $\tau_t$ are non-overlapping. In this setting, the ideal objective consists in minimizing the overall loss over all tasks experienced, formally:

$$f^* = \operatorname*{argmin}_{f} \; \mathbb{E}_{t=0}^{|\mathcal{T}|} \, \mathbb{E}_{(x,y) \sim \tau_t} \big[ \mathcal{L}(f(x), y) \big], \qquad (1)$$

where $\mathcal{L}$ is an appropriate loss for solving the task at hand. In a continual scenario, only data from the current task $\tau_t$ is freely available; therefore, CL methods need to maintain knowledge from the past $t-1$ tasks in order to solve the overall problem.
For the sake of simplicity, we consider a feed-forward neural network $f(\cdot) = (H_K \circ \sigma_K \circ H_{K-1} \circ \sigma_{K-1} \circ \dots \circ H_1)(\cdot)$, i.e., a sequence of $\sigma$-activated linear functions $H_k(h) = W_k^T h$ (biases are omitted). A final projection head $g(\cdot) = \mathrm{softmax}(\cdot)$ is applied to produce per-class output probabilities. As stated in other works [29], other common transformations that make up DNNs (e.g., convolutions, max-pooling) can also be seen in terms of matrix multiplications, thus making our approach applicable to more complex networks.
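To ground this layer-wise view, the sketch below builds such a $\sigma$-activated network and collects each layer's feature map with forward hooks; these maps are what the feature-based Lipschitz estimate introduced later consumes. The architecture and sizes are illustrative.

```python
import torch
import torch.nn as nn

# f = H_K ∘ σ_K ∘ ... ∘ H_1: a sequence of ReLU-activated linear maps.
net = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

features = []  # will hold F^1, ..., F^K for a given batch

def save_output(module, inputs, output):
    features.append(output)

# Hook every linear layer H_k to expose its feature map F^k.
for layer in net:
    if isinstance(layer, nn.Linear):
        layer.register_forward_hook(save_output)

x = torch.randn(16, 32)    # a batch of B = 16 inputs
logits = net(x)            # the forward pass fills `features`
assert len(features) == 3  # one map per linear layer
```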
¹ However, we remark that TIL can also be useful, as it reveals forgetting disentangled from other incremental learning effects. This motivates us to adopt TIL in some of our additional experiments.
Lipschitz continuity. A function $f$ is said to be Lipschitz continuous if there exists a value $L \in \mathbb{R}^+$ such that the following inequality holds:

$$\|f(x) - f(y)\|_2 \le L \, \|x - y\|_2, \qquad \forall x, y \in \mathbb{R}^n. \qquad (2)$$

If such a value exists, the smallest $L$ that satisfies the condition is usually referred to as the Lipschitz norm $\|f\|_L$. Therefore, for a single point $x \in \mathbb{R}^n$, we obtain:

$$\|f\|_L = \sup_{y \in \mathbb{R}^n,\; y \ne x} \frac{\|f(x) - f(y)\|_2}{\|x - y\|_2}. \qquad (3)$$
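As a quick numerical companion to Eq. (3), one can probe a finite-sample lower bound on $\|f\|_L$ by maximizing the ratio over random pairs; a sketch (the sampling scheme is our choice, not part of the method):

```python
import torch

def lipschitz_lower_bound(f, dim, n_pairs=10_000):
    """Empirical lower bound on ||f||_L over random input pairs."""
    x = torch.randn(n_pairs, dim)
    y = torch.randn(n_pairs, dim)
    num = (f(x) - f(y)).norm(dim=1)
    den = (x - y).norm(dim=1)
    return (num / den).max().item()

W = torch.randn(8, 8)
bound = lipschitz_lower_bound(lambda v: v @ W.T, dim=8)
# For a linear map the true constant is sigma_max(W); the sampled
# ratio approaches it from below.
print(bound, torch.linalg.svdvals(W)[0].item())
```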
Unfortunately, computing the Lipschitz constant of even the simplest multi-layer perceptron is an NP-hard problem [77]. Therefore, several works rely on its estimation by computing reliable upper bounds. As stated in [86, 67], an effective way to bound the Lipschitz constant of $f(\cdot)$ is to compute the constants of each linear projection $H_k$ and then aggregate them. In more detail:
$$\|H_k\|_L = \sup_{x \ne y;\; x, y \in \mathbb{R}^n} \frac{\|W_k^T x - W_k^T y\|_2}{\|x - y\|_2} = \sup_{\xi \ne 0;\; \xi \in \mathbb{R}^n} \frac{\|W_k \xi\|_2}{\|\xi\|_2} = \sigma_{\max}(W_k), \qquad (4)$$
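In code, Eq. (4) amounts to one exact SVD per linear layer; a minimal sketch, assuming a plain nn.Sequential of linear layers as defined above:

```python
import torch
import torch.nn as nn

def layer_spectral_norms(net: nn.Sequential):
    """||H_k||_L = sigma_max(W_k) for every linear layer, via exact SVD."""
    return [
        torch.linalg.svdvals(layer.weight)[0].item()
        for layer in net
        if isinstance(layer, nn.Linear)
    ]
```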
where $\sigma_{\max}(W_k)$ is the largest singular value of the weight matrix $W_k$ (also known as its spectral norm $\|W_k\|_{SN}$). To account for non-linear composite functions (e.g., the residual building blocks of most convolutional architectures), we leverage the following inequality:

$$\|g(z(x)) - g(z(y))\|_2 \le \|g\|_L \, \|z(x) - z(y)\|_2 \le \|g\|_L \, \|z\|_L \, \|x - y\|_2 \;\Rightarrow\; \|g \circ z\|_L \le \|g\|_L \, \|z\|_L,$$
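The composition bound can be verified numerically for the purely linear case, where composition is a matrix product; a quick illustrative check:

```python
import torch

W1, W2 = torch.randn(64, 32), torch.randn(10, 64)
lhs = torch.linalg.svdvals(W2 @ W1)[0]             # ||g ∘ z||_L (linear case)
rhs = torch.linalg.svdvals(W2)[0] * torch.linalg.svdvals(W1)[0]
assert lhs <= rhs + 1e-4                           # ||g ∘ z||_L <= ||g||_L ||z||_L
```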
where $g(\cdot)$ and $z(\cdot)$ are two Lipschitz continuous functions characterized by the constants $\|g\|_L$ and $\|z\|_L$. In the case of ReLU-activated networks (although the following result can be extended to other common non-linear functions), the forward pass through $\sigma_l$, $l = 1, 2, \dots, K$, can be re-arranged as a matrix multiplication by a diagonal matrix $D_l \in \mathbb{R}^{d_l \times d_{l+1}}$ whose diagonal elements equal either zero or one. Therefore, the corresponding Lipschitz constants satisfy $\|\sigma_l\|_L \le 1$. On top of that, we can compute an upper bound on the Lipschitz constant of the entire network:

$$\|f\|_L \le \|H_K\|_L \cdot \|\sigma_K\|_L \cdot \ldots \cdot \|H_1\|_L \le \prod_{k=1}^{K} \|H_k\|_L = \prod_{k=1}^{K} \|W_k\|_{SN}. \qquad (5)$$
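Chaining Eq. (4) through Eq. (5) gives the network-level bound as a simple product over layers; a minimal sketch for the fully-connected case (convolutions instead go through the feature-based proxy described next):

```python
import math
import torch
import torch.nn as nn

def lipschitz_upper_bound(net: nn.Sequential) -> float:
    # Eq. (5): ||f||_L <= prod_k ||W_k||_SN; ReLU contributes at most 1.
    return math.prod(
        torch.linalg.svdvals(m.weight)[0].item()
        for m in net if isinstance(m, nn.Linear)
    )
```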
Computing the spectral norm of weight matrices. The computation of $\|W_k\|_{SN}$ can be done naively through the Singular Value Decomposition (SVD), yielding, among others, the largest singular value. Such an approach has been applied in recent works [55, 29]; however, for complex structures (e.g., convolutions or entire residual blocks) the SVD is inaccessible. Hence, we rely on the approximation introduced in [67] and compute the largest eigenvalue $\lambda_1^k$ of the Transmitting Matrix $\mathcal{TM}^k$ (which represents a good proxy of $\|W_k\|_{SN}$²):

$$\mathcal{TM}^k \triangleq \left( (F^k)^T F^{k-1} \right)^T \left( (F^k)^T F^{k-1} \right), \qquad (6)$$

where $F^k \in \mathbb{R}^{B \times d_k}$ is the L2-normalized feature map produced by the $k$-th layer from a batch of $B$ samples. Finally, our approach exploits the power iteration method [56] to compute the largest eigenvalue of $\mathcal{TM}^k$, which is backpropagation-friendly.
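A sketch of this estimate under our reading of Eq. (6): build $\mathcal{TM}^k$ from consecutive L2-normalized feature maps (flattened to $B \times d$) and run a few differentiable power-iteration steps to obtain $\lambda_1^k$. Function names and the iteration count are illustrative; the official implementation at https://github.com/aimagelab/LiDER is the reference.

```python
import torch
import torch.nn.functional as F

def transmitting_matrix(f_prev: torch.Tensor, f_curr: torch.Tensor):
    """TM^k = A^T A with A = (F^k)^T F^{k-1}, features row-normalized (Eq. 6)."""
    f_prev = F.normalize(f_prev, p=2, dim=1)   # B x d_{k-1}
    f_curr = F.normalize(f_curr, p=2, dim=1)   # B x d_k
    a = f_curr.t() @ f_prev                    # d_k x d_{k-1}
    return a.t() @ a                           # d_{k-1} x d_{k-1}, PSD

def top_eigenvalue(tm: torch.Tensor, n_iters: int = 10):
    """Largest eigenvalue of a PSD matrix via power iteration; every step
    is a plain matmul, so gradients flow through (backprop-friendly)."""
    v = torch.rand(tm.shape[0], 1, device=tm.device)
    for _ in range(n_iters):
        v = tm @ v
        v = v / (v.norm() + 1e-12)
    return (v.t() @ tm @ v) / (v.t() @ v)      # Rayleigh quotient
```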
3.1 Lipschitz-Driven Rehearsal
In a continual setting, a model is asked i) to be adaptable to incoming samples from the stream (plasticity), and ii) to remain accurate on past tasks (stability). We seek to ensure a balance between these clashing objectives through the two following loss terms.

Controlling Lipschitz continuity. To mitigate overfitting on buffer datapoints, we first impose that each layer behaves as a $c$-Lipschitz continuous function, for a given real positive target constant $c_k$:

$$\mathcal{L}_{c\text{-Lip}} = \frac{1}{K} \sum_{k=0}^{K} \left| \lambda_1^k - c_k \right|. \qquad (7)$$
² We refer the reader to [67] for additional justifications for this step.
During the computation of each $\lambda_1^k$, we discard the activation maps incoming from the examples of the current task. Indeed, as we have access to its entire training set (and not a subset, as holds for old tasks), additional regularization is not needed: the decision boundaries tied to the current task are less prone to the risk of over-adapting to certain points. Regarding the target constants $c_k$, we could fix them as hyperparameters of our learning objective (as done in [48]) and exploit them as a sort of budget assigned to each layer; however, we empirically observed that it is beneficial to learn these targets by means of gradient descent instead (see Sec. 5.4), especially in a CL scenario where there is no access to the full data distribution. Indeed, they can be interpreted as additional learnable parameters, representing the appropriate level of strictness each layer should be subjected to. To avoid trivial solutions, we also encourage the estimated upper bounds to be as close as possible to zero:

$$\mathcal{L}_{0\text{-Lip}} = \frac{1}{K} \sum_{k=0}^{K} \left| \lambda_1^k \right|. \qquad (8)$$
Intuitively, when $\lambda_1^k \to 0$, the outputs of the corresponding $k$-th layer have low sensitivity to changes in its input. In our intentions, this should relieve continuous rehearsal from eroding the decision surface in a way that fits well only certain examples (i.e., those retained in the memory buffer).
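A sketch of the two terms as a single module, with the targets $c_k$ realized as learnable parameters updated by gradient descent; keeping them positive via softplus is our assumption for illustration, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class LipschitzLosses(nn.Module):
    """Eq. (7) and Eq. (8) over the per-layer estimates lambda_1^k."""
    def __init__(self, n_layers: int):
        super().__init__()
        # Raw parameters for the learnable targets c_k (kept positive below).
        self.raw_c = nn.Parameter(torch.zeros(n_layers))

    def forward(self, lambdas: torch.Tensor):
        c = nn.functional.softplus(self.raw_c)    # c_k > 0
        loss_c_lip = (lambdas - c).abs().mean()   # Eq. (7)
        loss_0_lip = lambdas.abs().mean()         # Eq. (8)
        return loss_c_lip, loss_0_lip
```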
Overall objective. The overall objective of LiDER combines the two introduced loss terms; formally:

$$\mathcal{L}_{\text{LiDER}} = \alpha \, \mathcal{L}_{c\text{-Lip}} + \beta \, \mathcal{L}_{0\text{-Lip}}. \qquad (9)$$

This objective can be plugged into almost any rehearsal approach. For this reason, we keep it general and avoid reporting the common loss terms asking for accurate predictions, as their form depends on the specific choices made by each approach. Finally, we remark that the introduced loss terms require minimal additional computation. Moreover, they do not require any additional samples to be retained besides those already present in the memory buffer.
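To illustrate this plug-in nature, a hypothetical training step for a plain Experience-Replay-style host method, reusing the helpers sketched above; buffer.sample, model.features, alpha, and beta are placeholder names, not the paper's API.

```python
import torch
import torch.nn.functional as F

def training_step(model, lip_losses, opt, stream_batch, buffer, alpha, beta):
    x, y = stream_batch
    bx, by = buffer.sample()                    # rehearsed examples only

    # Host method's own rehearsal loss (here: plain ER cross-entropy).
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(bx), by)

    # LiDER terms, computed on buffer activations only: estimate each
    # lambda_1^k from consecutive feature maps (F^0 being the input).
    feats = [f.flatten(1) for f in model.features(bx)]  # assumed: [F^0, ..., F^K]
    lambdas = torch.stack([
        top_eigenvalue(transmitting_matrix(fp, fc)).squeeze()
        for fp, fc in zip(feats[:-1], feats[1:])
    ])
    l_c, l_0 = lip_losses(lambdas)              # Eqs. (7) and (8)
    loss = loss + alpha * l_c + beta * l_0      # Eq. (9) added to the host loss

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```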
3.2 Relation with other regularization approaches
At first glance, the regularization of our approach can be understood as a means to enforce flat minima for each of the tasks, as advocated by Mirzadeh et al. [54] and Yin et al. [84]. We remark that they reason in parameter space and pursue flatness of the loss landscape w.r.t. the weights: namely, they encourage the model to be robust when perturbations are applied to its weights. Differently, we seek models that are robust w.r.t. changes in input space. The two lines may exploit the same mathematical tools – such as the Hessian and Lipschitz continuity – but build upon orthogonal axes (weights vs. input): we recommend always assessing which of the two is used as the reference property.

In this respect, the bridge between the two objectives is worth exploring and still open to debate [83, 71, 87, 36]. The authors of [83] reported that, theoretically, no correlation exists between the Hessian w.r.t. the weights and the robustness of the model w.r.t. the input. Such a statement is corroborated by Fig. 1 of [87]: although a flat minimum is reached in parameter space, non-smooth variations appear in input space. However, the authors of [83] empirically found that models with a higher Hessian spectrum w.r.t. the weights are also more prone to adversarial attacks. A similar thesis has been argued by the authors of [87], while the third result reported in [36] seems to refute it. Furthermore, Sec. 5.3 of our paper investigates the opposite link and reveals that CL models trained to be robust w.r.t. input changes tend to attain flatter minima in parameter space.
4 Experiments
To assess our proposal, we perform a suite of experiments through the Mammoth framework [12, 17, 9, 6, 27, 13, 14, 50], an open-source codebase introduced in [15] for testing CL algorithms. In particular, we show that our method can be easily applied to state-of-the-art replay methods and enhance their performance in a wide variety of challenging settings and backbone architectures. Moreover, we show that our proposal remains rewarding and can improve the generalization capabilities of CL models even when a pre-trained model is employed. Such a scenario is important for a twofold reason: i) as shown in [53], pre-training implicitly mitigates forgetting by widening the local minima found in function space, thus making the model more robust to input perturbations; additionally, ii) we account for real-world scenarios, where pre-training is usually involved as an initial step. Due to space constraints, we kindly refer the reader to the supplementary material for additional experimental details (e.g., optimizer, hyperparameters, etc.).