On the Effectiveness of Lipschitz-Driven Rehearsal in Continual Learning

Lorenzo Bonicelli¹, Matteo Boschini¹, Angelo Porrello¹, Concetto Spampinato², Simone Calderara¹

¹AImageLab - University of Modena and Reggio Emilia
²PeRCeiVe Lab - University of Catania
Abstract
Rehearsal approaches enjoy immense popularity with Continual Learning (CL)
practitioners. These methods collect samples from previously encountered data
distributions in a small memory buffer; subsequently, they repeatedly optimize
on the latter to prevent catastrophic forgetting. This work draws attention to a
hidden pitfall of this widespread practice: repeated optimization on a small pool
of data inevitably leads to tight and unstable decision boundaries, which are a
major hindrance to generalization. To address this issue, we propose Lipschitz-
DrivEn Rehearsal (LiDER), a surrogate objective that induces smoothness in the
backbone network by constraining its layer-wise Lipschitz constants w.r.t. replay
examples. By means of extensive experiments, we show that applying LiDER
delivers a stable performance gain to several state-of-the-art rehearsal CL methods
across multiple datasets, both in the presence and absence of pre-training. Through
additional ablative experiments, we highlight peculiar aspects of buffer overfitting
in CL and better characterize the effect produced by LiDER. Code is available at
https://github.com/aimagelab/LiDER.
1 Introduction
The last few years have seen a renewed interest in aiding Deep Neural Networks (DNNs) to acquire new knowledge and, at the same time, retain high performance on previously encountered data. In this regard, the mitigation of catastrophic forgetting [52] has driven recent research towards novel incremental methods [42, 68], often framed under the field of Continual Learning (CL). Among other valid CL strategies, rehearsal approaches have caught the attention of a large body of literature [62, 21, 15] thanks to their advantages. Simply put, they maintain a small fixed-size buffer containing a fraction of examples from previous tasks; afterward, these examples are mixed with the ones of the current task and hence provided continuously as training data. In this respect, different approaches establish different regularization strategies on top of the retained examples [63, 49], as well as different choices of which kind of information to store (e.g., model responses [15, 12], explanations [25], etc.).
In spite of their widespread application, these approaches fall into a common pitfall: as the memory buffer holds only a small fraction of past examples, there is a high risk of overfitting on that memory [75], thus harming generalization. Several approaches mitigate such an issue through data-augmentation techniques, either by generating different versions of the same buffer datapoint [7, 15] or by combining different examples into a single one [14, 11]. Other works [3, 4, 16, 85], instead, carefully select the valuable samples that should be inserted into the buffer: they argue that random selection may pick non-informative and noisy instances, harming the model's generalization.
This work tackles the issue described above from a different perspective, viewing catastrophic
forgetting in the light of the progressive deterioration of decision boundaries between classes.
[Figure omitted: two rows of panels plotting the decision magnitude S(x) against random and adversarial perturbation magnitudes, for rehearsed examples (first row) and non-rehearsed examples (second row), across tasks τ0 to τ4.]

Figure 1: Diagrams derived from [87] describing how the magnitude of an input perturbation affects the model's prediction, measured as the difference between the correct logit and the maximum one. The dashed lines delimit the green areas, where the correct response is preserved. The analysis is carried out on the examples of the first task of Split CIFAR-10 and spans across tasks, progressing from τ0 (left) to τ4 (right). Different rows show the decision surfaces around datapoints either contained in the memory buffer (first) or non-rehearsed (second). We refer to Sec. 5.3 for additional insights.
Indeed, while for the examples of the current task we expect the decision boundaries to be already smooth and robust against local perturbations, the same cannot be said for past examples. In fact, the restricted access to only a small portion of past tasks increases the epistemic uncertainty [38] of the model: as a consequence, we expect the decision surfaces tied to past classes to slowly erode everywhere, with the exception of certain input regions, i.e., those close to the neighborhood of buffer datapoints (thanks to their repeated optimization). We refer to Fig. 1 for a visualization of such a phenomenon, which shows the evolution of the decision surfaces around the points of the first task of Split CIFAR-10, from the first task (left) to the last one (right). We differentiate the target of the analysis (rehearsed vs. non-rehearsed examples) in distinct rows: as can be seen, the green area – the input region where the model outputs correct predictions against local input perturbations [87] – tightens around buffer datapoints (first row) and erodes for non-rehearsed examples (second row).
Such an intuition motivates our search for novel mechanisms to guarantee the robustness of the decision boundaries. To this aim, we resort to enforcing the Lipschitz continuity of the model w.r.t. its input: indeed, a long-standing research trend [82, 72, 45, 46, 80, 29] has pointed out that such a property favors generalization capabilities and robustness to adversarial attacks. In particular, constraining the Lipschitz constant of a model – intuitively, a bound on how much the model's response can change in proportion to a change in its input [5] – has proven to strengthen the decision surface around a point [22, 45, 46, 80, 91], preventing attacks of a given magnitude from changing the output of the classifier.
[Inset figure: adversarial accuracy and the Lipschitz constant L as functions of the buffer size |M| ∈ {1k, 2.5k, 4k}.]
While these works assessed Lipschitz regularization in the classical scenario (i.e., a single joint i.i.d. task), we advocate that it is even more beneficial in continual learning, particularly for those approaches based on replay memories. As shown in the inset figure, without explicit regularization, the Lipschitz constant of a model increases for smaller memory buffers: in other words, its corresponding function space becomes increasingly sensitive w.r.t. local input perturbations (as also highlighted by the lower accuracy attained in the presence of adversarial attacks). In light of the above considerations, we ascribe such a tendency to the higher uncertainty that derives from subjecting the model to a low-data training regime.
To the best of our knowledge, our work is the first attempt to assess the effectiveness of Lipschitz-constrained DNNs in continual learning: in particular, we equip several widely-known and state-of-the-art rehearsal approaches with our Lipschitz-guided optimization objective, named Lipschitz-DrivEn Rehearsal (LiDER), showing that it systematically leads to better results on several benchmarks.
2 Related works
Continual Learning. CL examines the capability of a deep model to learn from a sequence of non-i.i.d. classification tasks [23, 58] while preventing the onset of catastrophic forgetting [52]. To achieve this goal, models are trained according to specifically designed strategies, meant to influence their evolution and maximize the retention of previously acquired knowledge.
Among them, regularization methods work by introducing additional constraints in the form of loss terms; they are designed to limit the amount of total change either in parameter space [42, 89, 19, 2] or in functional space [47, 10]. Differently, structural methods purposefully organize the allocation of model capacity to prevent interference and facilitate parameter sharing [1, 51, 66, 37]. Lastly, rehearsal methods store and reuse a subset of previously seen datapoints to prevent overfitting on current data and avert forgetting [16, 49, 4, 3]. While rehearsal strategies are by far the most frequently adopted thanks to their effectiveness and flexibility [26, 4], it is not uncommon to adopt solutions combining multiple approaches [32, 1, 15, 60, 18]. In this paper, we similarly propose a strategy that leverages an existing replay memory buffer to compute an additional regularization term, aimed at conditioning the learning dynamics and avoiding overfitting.
CL evaluations are often carried out in the so-called Task-Incremental setting (TIL) [42, 89, 49, 23, 19] – that is, the model is provided a task identifier at test time to avoid interference across predictions of distinct tasks. However, recent works put an increasing focus on the harder Class-Incremental setting (CIL) [4, 32, 15, 81], which entails the production of a unified prediction encompassing all seen classes. W.r.t. the latter, TIL has been criticized as a less challenging and realistic benchmark [26, 74, 4]; we therefore conduct our main experiments on state-of-the-art CL models in the CIL setting¹.
Lipschitz-based Regularization. Standard DNNs typically suffer from their overparametrization [90, 57], leading to the tendency to overfit the training data by producing jagged decision boundaries that closely fit the seen examples. On the contrary, a model's reliability depends on its capability for generalization, which is linked to the appearance of smooth decision boundaries [82, 8, 86, 29]. Starting from the first studies focusing on this simple dichotomy, the Lipschitz constant $L$ of a DNN has been established as a commonplace measure of both smoothness and generalization [82, 72, 39] and still constitutes a key ingredient for current evaluations of model capacity [8, 30].
Most notably, Szegedy et al. [72] verify that constraining $L$ reduces the model's vulnerability to adversarial perturbations. Many current approaches to Adversarial Learning similarly operate either by minimizing $L$ at the global or local level [46, 73, 45] or by devising models characterized by a small $L$ by design [22, 33]. In other areas, the smoothing effect of $L$-based regularization has been favorably applied to both GAN training [55] and neural fields [48].
3 Method
A CL problem usually involves learning a function $f$ from a stream of data, which we formalize as a succession of separate datasets $\mathcal{T} = \{\tau_0, \tau_1, \dots, \tau_{|\mathcal{T}|}\}$, where $\tau_t = \{(x_i, y_i)\}_{i=1}^{N_t}$ and $\tau_i \cap \tau_j = \emptyset$; the label sets $Y_t$ for each $\tau_t$ are non-overlapping. In this setting, the ideal objective consists in minimizing the overall loss over all tasks experienced, formally:

$$f^* = \operatorname*{argmin}_{f} \; \mathbb{E}_{t=0}^{|\mathcal{T}|} \, \mathbb{E}_{(x,y) \sim \tau_t} \big[ \mathcal{L}(f(x), y) \big], \qquad (1)$$

where $\mathcal{L}$ is an appropriate loss for solving the task at hand. In a continual scenario, only data from the current task $\tau_t$ is freely available; therefore, CL methods need to maintain knowledge from the past $t-1$ tasks in order to solve the overall problem.
For the sake of simplicity, we consider a feed-forward neural network $f(\cdot) = (H_K \circ \sigma_K \circ H_{K-1} \circ \sigma_{K-1} \circ \dots \circ H_1)(\cdot)$, i.e., a sequence of $\sigma$-activated linear functions $H_k(h) = W_k^T h$ (biases are omitted). A final projection head $g(\cdot) = \mathrm{softmax}(\cdot)$ is applied to produce per-class output probabilities. As stated in other works [29], other common transformations that make up DNNs (e.g., convolutions, max-pooling) can also be seen in terms of matrix multiplications, thus making our approach applicable to more complex networks.
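To ground this layer-wise view, the sketch below builds such a $\sigma$-activated network and collects each layer's feature map with forward hooks; these maps are what the feature-based Lipschitz estimate introduced later consumes. The architecture and sizes are illustrative.

```python
import torch
import torch.nn as nn

# f = H_K ∘ σ_K ∘ ... ∘ H_1: a sequence of ReLU-activated linear maps.
net = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

features = []  # will hold F^1, ..., F^K for a given batch

def save_output(module, inputs, output):
    features.append(output)

# Hook every linear layer H_k to expose its feature map F^k.
for layer in net:
    if isinstance(layer, nn.Linear):
        layer.register_forward_hook(save_output)

x = torch.randn(16, 32)    # a batch of B = 16 inputs
logits = net(x)            # the forward pass fills `features`
assert len(features) == 3  # one map per linear layer
```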
¹ However, we remark that TIL can also be useful, as it reveals forgetting disentangled from other incremental learning effects. This motivates us to adopt TIL in some of our additional experiments.
Lipschitz continuity. A function $f$ is said to be Lipschitz continuous if there exists a value $L \in \mathbb{R}^+$ such that the following inequality holds:

$$\|f(x) - f(y)\|_2 \le L \, \|x - y\|_2, \qquad \forall x, y \in \mathbb{R}^n. \qquad (2)$$

If such a value exists, the smallest $L$ that satisfies the condition is usually referred to as the Lipschitz norm $\|f\|_L$. Therefore, for a single point $x \in \mathbb{R}^n$, we obtain:

$$\|f\|_L = \sup_{y \in \mathbb{R}^n,\; y \ne x} \frac{\|f(x) - f(y)\|_2}{\|x - y\|_2}. \qquad (3)$$
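As a quick numerical companion to Eq. (3), one can probe a finite-sample lower bound on $\|f\|_L$ by maximizing the ratio over random pairs; a sketch (the sampling scheme is our choice, not part of the method):

```python
import torch

def lipschitz_lower_bound(f, dim, n_pairs=10_000):
    """Empirical lower bound on ||f||_L over random input pairs."""
    x = torch.randn(n_pairs, dim)
    y = torch.randn(n_pairs, dim)
    num = (f(x) - f(y)).norm(dim=1)
    den = (x - y).norm(dim=1)
    return (num / den).max().item()

W = torch.randn(8, 8)
bound = lipschitz_lower_bound(lambda v: v @ W.T, dim=8)
# For a linear map the true constant is sigma_max(W); the sampled
# ratio approaches it from below.
print(bound, torch.linalg.svdvals(W)[0].item())
```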
Unfortunately, computing the Lipschitz constant of even the simplest multi-layer perceptron is an NP-hard problem [77]. Therefore, several works rely on its estimation by computing reliable upper bounds. As stated in [86, 67], an effective way to bound the Lipschitz constant of $f(\cdot)$ is to compute the constants of each linear projection $H_k$ and then aggregate them. In more detail:
$$\|H_k\|_L = \sup_{x \ne y;\; x, y \in \mathbb{R}^n} \frac{\|W_k^T x - W_k^T y\|_2}{\|x - y\|_2} = \sup_{\xi \ne 0;\; \xi \in \mathbb{R}^n} \frac{\|W_k \xi\|_2}{\|\xi\|_2} = \sigma_{\max}(W_k), \qquad (4)$$
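In code, Eq. (4) amounts to one exact SVD per linear layer; a minimal sketch, assuming a plain nn.Sequential of linear layers as defined above:

```python
import torch
import torch.nn as nn

def layer_spectral_norms(net: nn.Sequential):
    """||H_k||_L = sigma_max(W_k) for every linear layer, via exact SVD."""
    return [
        torch.linalg.svdvals(layer.weight)[0].item()
        for layer in net
        if isinstance(layer, nn.Linear)
    ]
```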
where $\sigma_{\max}(W_k)$ is the largest singular value of the weight matrix $W_k$ (also known as its spectral norm $\|W_k\|_{SN}$). To account for non-linear composite functions (e.g., the residual building blocks of most convolutional architectures), we leverage the following inequality:

$$\|g(z(x)) - g(z(y))\|_2 \le \|g\|_L \, \|z(x) - z(y)\|_2 \le \|g\|_L \, \|z\|_L \, \|x - y\|_2 \;\Rightarrow\; \|g \circ z\|_L \le \|g\|_L \, \|z\|_L,$$
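The composition bound can be verified numerically for the purely linear case, where composition is a matrix product; a quick illustrative check:

```python
import torch

W1, W2 = torch.randn(64, 32), torch.randn(10, 64)
lhs = torch.linalg.svdvals(W2 @ W1)[0]             # ||g ∘ z||_L (linear case)
rhs = torch.linalg.svdvals(W2)[0] * torch.linalg.svdvals(W1)[0]
assert lhs <= rhs + 1e-4                           # ||g ∘ z||_L <= ||g||_L ||z||_L
```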
where $g(\cdot)$ and $z(\cdot)$ are two Lipschitz continuous functions characterized by the constants $\|g\|_L$ and $\|z\|_L$. In the case of ReLU-activated networks (although the following result can be extended to other common non-linear functions), the forward pass through $\sigma_l$, $l = 1, 2, \dots, K$, can be re-arranged as a matrix multiplication by a diagonal matrix $D_l \in \mathbb{R}^{d_l \times d_{l+1}}$ whose diagonal elements equal either zero or one. Therefore, the corresponding Lipschitz constants satisfy $\|\sigma_l\|_L \le 1$. On top of that, we can compute an upper bound on the Lipschitz constant of the entire network:

$$\|f\|_L \le \|H_K\|_L \cdot \|\sigma_K\|_L \cdot \ldots \cdot \|H_1\|_L \le \prod_{k=1}^{K} \|H_k\|_L = \prod_{k=1}^{K} \|W_k\|_{SN}. \qquad (5)$$
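Chaining Eq. (4) through Eq. (5) gives the network-level bound as a simple product over layers; a minimal sketch for the fully-connected case (convolutions instead go through the feature-based proxy described next):

```python
import math
import torch
import torch.nn as nn

def lipschitz_upper_bound(net: nn.Sequential) -> float:
    # Eq. (5): ||f||_L <= prod_k ||W_k||_SN; ReLU contributes at most 1.
    return math.prod(
        torch.linalg.svdvals(m.weight)[0].item()
        for m in net if isinstance(m, nn.Linear)
    )
```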
Computing the spectral norm of weight matrices. The computation of $\|W_k\|_{SN}$ can be done naively through the Singular Value Decomposition (SVD), yielding, among others, the largest singular value. Such an approach has been applied in recent works [55, 29]; however, for complex structures (e.g., convolutions or entire residual blocks) the SVD is inaccessible. Hence, we rely on the approximation introduced in [67] and compute the largest eigenvalue $\lambda_1^k$ of the Transmitting Matrix $\mathcal{TM}^k$ (which represents a good proxy of $\|W_k\|_{SN}$²):

$$\mathcal{TM}^k \triangleq \left( (F^k)^T F^{k-1} \right)^T \left( (F^k)^T F^{k-1} \right), \qquad (6)$$

where $F^k \in \mathbb{R}^{B \times d_k}$ is the L2-normalized feature map produced by the $k$-th layer from a batch of $B$ samples. Finally, our approach exploits the power iteration method [56] to compute the largest eigenvalue of $\mathcal{TM}^k$, which is backpropagation-friendly.
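A sketch of this estimate under our reading of Eq. (6): build $\mathcal{TM}^k$ from consecutive L2-normalized feature maps (flattened to $B \times d$) and run a few differentiable power-iteration steps to obtain $\lambda_1^k$. Function names and the iteration count are illustrative; the official implementation at https://github.com/aimagelab/LiDER is the reference.

```python
import torch
import torch.nn.functional as F

def transmitting_matrix(f_prev: torch.Tensor, f_curr: torch.Tensor):
    """TM^k = A^T A with A = (F^k)^T F^{k-1}, features row-normalized (Eq. 6)."""
    f_prev = F.normalize(f_prev, p=2, dim=1)   # B x d_{k-1}
    f_curr = F.normalize(f_curr, p=2, dim=1)   # B x d_k
    a = f_curr.t() @ f_prev                    # d_k x d_{k-1}
    return a.t() @ a                           # d_{k-1} x d_{k-1}, PSD

def top_eigenvalue(tm: torch.Tensor, n_iters: int = 10):
    """Largest eigenvalue of a PSD matrix via power iteration; every step
    is a plain matmul, so gradients flow through (backprop-friendly)."""
    v = torch.rand(tm.shape[0], 1, device=tm.device)
    for _ in range(n_iters):
        v = tm @ v
        v = v / (v.norm() + 1e-12)
    return (v.t() @ tm @ v) / (v.t() @ v)      # Rayleigh quotient
```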
3.1 Lipschitz-Driven Rehearsal
In a continual setting, a model is asked i) to be adaptable to incoming samples from the stream (plasticity), and ii) to remain accurate on past tasks (stability). We seek to ensure a balance between these clashing objectives through the two following loss terms.

Controlling Lipschitz continuity. To mitigate overfitting on buffer datapoints, we first impose that each layer behaves as a $c$-Lipschitz continuous function, for a given real positive target constant $c_k$:

$$\mathcal{L}_{c\text{-Lip}} = \frac{1}{K} \sum_{k=0}^{K} \left| \lambda_1^k - c_k \right|. \qquad (7)$$
² We refer the reader to [67] for additional justifications for this step.
During the computation of each $\lambda_1^k$, we discard the activation maps incoming from the examples of the current task. Indeed, as we have access to its entire training set (and not a subset, as holds for old tasks), additional regularization is not needed: the decision boundaries tied to the current task are less prone to the risk of over-adapting to certain points. Regarding the target constants $c_k$, we could fix them as hyperparameters of our learning objective (as done in [48]) and exploit them as a sort of budget assigned to each layer; however, we empirically observed that it is beneficial to learn these targets by means of gradient descent instead (see Sec. 5.4), especially in a CL scenario where there is no access to the full data distribution. Indeed, they can be interpreted as additional learnable parameters, representing the appropriate level of strictness each layer should be subjected to. To avoid trivial solutions, we also encourage the estimated upper bounds to be as close as possible to zero:

$$\mathcal{L}_{0\text{-Lip}} = \frac{1}{K} \sum_{k=0}^{K} \left| \lambda_1^k \right|. \qquad (8)$$
Intuitively, when $\lambda_1^k \to 0$, the outputs of the corresponding $k$-th layer have low sensitivity to changes in its input. In our intentions, this should relieve continuous rehearsal from eroding the decision surface in a way that fits well only certain examples (i.e., those retained in the memory buffer).
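A sketch of the two terms as a single module, with the targets $c_k$ realized as learnable parameters updated by gradient descent; keeping them positive via softplus is our assumption for illustration, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class LipschitzLosses(nn.Module):
    """Eq. (7) and Eq. (8) over the per-layer estimates lambda_1^k."""
    def __init__(self, n_layers: int):
        super().__init__()
        # Raw parameters for the learnable targets c_k (kept positive below).
        self.raw_c = nn.Parameter(torch.zeros(n_layers))

    def forward(self, lambdas: torch.Tensor):
        c = nn.functional.softplus(self.raw_c)    # c_k > 0
        loss_c_lip = (lambdas - c).abs().mean()   # Eq. (7)
        loss_0_lip = lambdas.abs().mean()         # Eq. (8)
        return loss_c_lip, loss_0_lip
```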
Overall objective. The overall objective of LiDER combines the two introduced loss terms; formally:

$$\mathcal{L}_{\text{LiDER}} = \alpha \, \mathcal{L}_{c\text{-Lip}} + \beta \, \mathcal{L}_{0\text{-Lip}}. \qquad (9)$$

This objective can be plugged into almost any rehearsal approach. For this reason, we keep it general and avoid reporting the common loss terms asking for accurate predictions, as their form depends on the specific choices made by each approach. Finally, we remark that the introduced loss terms require minimal additional computation. Moreover, they do not require any additional samples to be retained besides those already present in the memory buffer.
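To illustrate this plug-in nature, a hypothetical training step for a plain Experience-Replay-style host method, reusing the helpers sketched above; buffer.sample, model.features, alpha, and beta are placeholder names, not the paper's API.

```python
import torch
import torch.nn.functional as F

def training_step(model, lip_losses, opt, stream_batch, buffer, alpha, beta):
    x, y = stream_batch
    bx, by = buffer.sample()                    # rehearsed examples only

    # Host method's own rehearsal loss (here: plain ER cross-entropy).
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(bx), by)

    # LiDER terms, computed on buffer activations only: estimate each
    # lambda_1^k from consecutive feature maps (F^0 being the input).
    feats = [f.flatten(1) for f in model.features(bx)]  # assumed: [F^0, ..., F^K]
    lambdas = torch.stack([
        top_eigenvalue(transmitting_matrix(fp, fc)).squeeze()
        for fp, fc in zip(feats[:-1], feats[1:])
    ])
    l_c, l_0 = lip_losses(lambdas)              # Eqs. (7) and (8)
    loss = loss + alpha * l_c + beta * l_0      # Eq. (9) added to the host loss

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```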
3.2 Relation with other regularization approaches
At first glance, the regularization of our approach can be understood as a means to enforce flat minima for each of the tasks, as advocated by Mirzadeh et al. [54] and Yin et al. [84]. We remark that they reason in parameter space and pursue flatness of the loss landscape w.r.t. the weights: namely, they encourage the model to be robust when perturbations are applied to its weights. Differently, we seek models that are robust w.r.t. changes in input space. The two lines may exploit the same mathematical tools – such as the Hessian and Lipschitz continuity – but build upon orthogonal axes (weights vs. input): we recommend always assessing which of the two is used as the reference property.

In this respect, the bridge between the two objectives is worth exploring and still open to debate [83, 71, 87, 36]. The authors of [83] reported that, theoretically, no correlation exists between the Hessian w.r.t. the weights and the robustness of the model w.r.t. the input. Such a statement is corroborated by Fig. 1 of [87]: although a flat minimum is reached in parameter space, non-smooth variations appear in input space. However, the authors of [83] empirically found that models with a higher Hessian spectrum w.r.t. the weights are also more prone to adversarial attacks. A similar thesis has been argued by the authors of [87], while the third result reported in [36] seems to refute it. Furthermore, Sec. 5.3 of our paper investigates the opposite link and reveals that CL models trained to be robust w.r.t. input changes tend to attain flatter minima in parameter space.
4 Experiments
To assess our proposal, we perform a suite of experiments through the Mammoth framework [12, 17, 9, 6, 27, 13, 14, 50], an open-source codebase introduced in [15] for testing CL algorithms. In particular, we show that our method can be easily applied to state-of-the-art replay methods and enhance their performance in a wide variety of challenging settings and backbone architectures. Moreover, we show that our proposal remains rewarding and can improve the generalization capabilities of CL models even when a pre-trained model is employed. Such a scenario is important for a twofold reason: i) as shown in [53], pre-training implicitly mitigates forgetting by widening the local minima found in function space, thus making the model more robust to input perturbations; additionally, ii) we account for real-world scenarios, where pre-training is usually involved as an initial step. Due to space constraints, we kindly refer the reader to the supplementary material for additional experimental details (e.g., optimizer, hyperparameters, etc.).