
2 Related works
Continual Learning. CL examines the capability of a deep model to learn from a sequence of
non-i.i.d. classification tasks [23, 58] while preventing the onset of catastrophic forgetting [52]. To
achieve this goal, models are trained according to specifically designed strategies, meant to influence
their evolution and maximize the retention of previously acquired knowledge.
Among them, regularization methods work by introducing additional constraints in the form of loss
terms; they are designed to limit the amount of total change either in parameter space [
42
,
89
,
19
,
2
]
or functional space [
47
,
10
]. Differently, structural methods purposefully organize the allocation
of model capacity to prevent interference and facilitate parameter sharing [
1
,
51
,
66
,
37
]. Lastly,
rehearsal methods store and reuse a subset of previously seen data-points to prevent overfitting
on current data and avert forgetting [
16
,
49
,
4
,
3
]. While rehearsal strategies are by far the most
frequently adopted thanks to their effectiveness and flexibility [
26
,
4
], it is not uncommon to adopt
solutions combining multiple approaches [
32
,
1
,
15
,
60
,
18
]. In this paper, we similarly propose a
strategy that leverages an existing replay memory buffer to compute an additional regularization term,
aimed at conditioning the learning dynamics and avoiding overfitting.
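To ground the terminology, the following minimal Python sketch shows the reservoir-sampling buffer commonly underlying rehearsal approaches; the class name and interface are our own illustration, not taken from any specific method in the literature:

```python
import random

class ReservoirBuffer:
    """Fixed-size replay memory filled via reservoir sampling.

    Every example observed so far has equal probability of being
    retained, regardless of which task it came from.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []          # stored (x, y) pairs
        self.num_seen = 0       # total examples observed so far

    def add(self, x, y):
        self.num_seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            # Replace a stored item with probability capacity / num_seen.
            idx = random.randrange(self.num_seen)
            if idx < self.capacity:
                self.data[idx] = (x, y)

    def sample(self, batch_size):
        # Draw a random replay mini-batch (without replacement).
        return random.sample(self.data, min(batch_size, len(self.data)))
```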
CL evaluations are often carried out in the so-called Task-Incremental setting (TIL) [42, 89, 49, 23, 19]
– that is, the model is provided a task identifier at test time to avoid interference across predictions of
distinct tasks. However, recent works put an increasing focus on the harder Class-Incremental setting
(CIL) [4, 32, 15, 81], which entails producing a unified prediction encompassing all seen
classes. Compared with the latter, TIL has been criticized as a less challenging and less realistic benchmark [26, 74, 4]; we therefore conduct our main experiments on state-of-the-art CL models in the CIL setting¹.
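The practical difference between the two protocols is most visible at inference time; the hypothetical snippet below (assuming a 1-D `logits` tensor over all seen classes and a per-task list of class indices) contrasts them:

```python
import torch

def predict_cil(logits):
    """Class-Incremental: one unified prediction over all classes seen so far."""
    return logits.argmax().item()

def predict_til(logits, task_classes, task_id):
    """Task-Incremental: the given task id restricts the candidate classes."""
    classes = task_classes[task_id]            # e.g., [4, 5, 6, 7] for task 1
    local = logits[classes].argmax().item()    # argmax within the task only
    return classes[local]
```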
Lipschitz-based Regularization. Standard DNNs typically suffer from their overparametrization [90, 57], leading to the tendency to overfit the training data by producing jagged decision boundaries
that closely fit the seen examples. On the contrary, a model's reliability depends on its capability
for generalization, which is linked to the appearance of smooth decision boundaries [82, 8, 86, 29].
Starting from the first studies focusing on this simple dichotomy, the Lipschitz constant L of a DNN
has been established as a commonplace measure of both smoothness and generalization [82, 72, 39]
and still constitutes a key ingredient for current evaluations of model capacity [8, 30].
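For reference, we recall the textbook definition (a standard fact, not specific to this paper): a function f is L-Lipschitz w.r.t. the chosen input and output norms if \|f(x_1) - f(x_2)\| \le L \|x_1 - x_2\| for all x_1, x_2; the smallest such constant,

    L(f) = \sup_{x_1 \neq x_2} \frac{\|f(x_1) - f(x_2)\|}{\|x_1 - x_2\|},

quantifies how sharply the model's output can vary under perturbations of its input.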
Most notably, Szegedy et al. [72] verify that constraining L reduces the model's vulnerability to
adversarial perturbations. Many current approaches to Adversarial Learning similarly operate either
by minimizing L at the global or local level [46, 73, 45] or by devising models characterized by a
small L by design [22, 33]. In other areas, the smoothing effect of L-based regularization has been
favorably applied to both GAN training [55] and neural fields [48].
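As a concrete instance of the "small L by design" family, spectral normalization constrains each layer's largest singular value: since composing functions multiplies their Lipschitz constants, and activations such as ReLU are 1-Lipschitz, bounding per-layer spectral norms bounds the end-to-end L. A minimal PyTorch sketch (layer sizes are arbitrary placeholders):

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Each linear layer is reparametrized so that its weight matrix has
# spectral norm (largest singular value) ~1; with 1-Lipschitz activations,
# the network's Lipschitz constant is then upper-bounded by ~1.
model = nn.Sequential(
    spectral_norm(nn.Linear(784, 256, bias=False)),
    nn.ReLU(),
    spectral_norm(nn.Linear(256, 10, bias=False)),
)

def lipschitz_upper_bound(mlp):
    """Product-of-spectral-norms upper bound on L for a plain (unconstrained) MLP."""
    bound = 1.0
    for layer in mlp:
        if isinstance(layer, nn.Linear):
            bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return bound
```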
3 Method
A CL problem usually involves learning a function f from a stream of data, which we formalize as a
succession of separate datasets T = {τ_0, τ_1, ..., τ_{|T|}}, where τ_t = {(x_i, y_i)}_{i=1}^{N_t} and τ_i ∩ τ_j = ∅ for i ≠ j; the
label sets Y_t of the individual tasks τ_t are non-overlapping. In this setting, the ideal objective consists in minimizing
the overall loss over all experienced tasks, formally:
    f^* = \operatorname*{argmin}_f \; \mathbb{E}_{t=0}^{|T|} \, \mathbb{E}_{(x,y)\sim\tau_t} \big[ \mathcal{L}(f(x), y) \big],    (1)
where \mathcal{L} is an appropriate loss for solving the task at hand. In a continual scenario, only data from
the current task τ_t is freely available; therefore, CL methods need to maintain knowledge from the
past t − 1 tasks in order to solve the overall problem.
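Schematically, the resulting protocol can be sketched as follows (a simplified illustration, not the algorithm proposed in this paper; `tasks`, `optimizer` and the optional rehearsal `buffer`, e.g. the reservoir sketch from Section 2, are placeholders):

```python
import torch
import torch.nn.functional as F

def train_continual(model, tasks, optimizer, buffer=None):
    """Visit tasks strictly in sequence: at step t, only tau_t is available."""
    for t, task_loader in enumerate(tasks):
        for x, y in task_loader:
            loss = F.cross_entropy(model(x), y)  # loss on the current task only
            if buffer is not None and buffer.data:
                # Rehearsal: replay a small batch of stored past examples.
                xb, yb = zip(*buffer.sample(32))
                loss = loss + F.cross_entropy(model(torch.stack(xb)),
                                              torch.tensor(yb))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if buffer is not None:
                for xi, yi in zip(x, y):
                    buffer.add(xi, int(yi))
```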
For the sake of simplicity, we consider a feed-forward neural network f(·) = (H_K ∘ σ_K ∘ H_{K−1} ∘
σ_{K−1} ∘ ... ∘ H_1)(·), i.e., a sequence of σ-activated linear functions H_k(h) = W_k^T h (biases are omitted). A final projection head g(·) = softmax(·) is applied to produce per-class output probabilities.
As stated in other works [29], other common transformations that make up DNNs (e.g., convolutions,
max-pooling) can also be seen in terms of matrix multiplications, thus making our approach applicable
to more complex networks.
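For illustration, such a network is straightforward to instantiate; the sketch below (widths and activation are arbitrary placeholders, not the paper's architecture) mirrors the composition of bias-free linear maps H_k(h) = W_k^T h with the softmax head g:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """f(.) = (H_K o sigma_K o ... o H_1)(.), with biases omitted as in the text."""

    def __init__(self, widths=(784, 256, 256, 10), sigma=nn.ReLU):
        super().__init__()
        layers = []
        for k in range(len(widths) - 1):
            layers.append(nn.Linear(widths[k], widths[k + 1], bias=False))
            if k < len(widths) - 2:        # no activation after the last map H_K
                layers.append(sigma())
        self.f = nn.Sequential(*layers)

    def forward(self, x):
        logits = self.f(x)
        return torch.softmax(logits, dim=-1)   # projection head g(.) = softmax(.)
```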
¹However, we remark that TIL can also be useful, as it reveals forgetting disentangled from other incremental learning effects. This motivates us to adopt TIL in some of our additional experiments.