is observed changes. In some settings, the model has ac-
cess to the context variable or context-ID during train-
ing and testing, which is referred to as task-incremental
learning. Finally, in this work we assume that P(Y|X)
is not subject to change, i.e., the label y of a sample x never changes [5,6]. See Figure 1 for an overview and
stratification of common benchmarks, including those
proposed here.
2.2. Contemporary CL for Computer Vision
In theory, continual learning is concerned with any
kind of distribution change in the training data. Given
the many possible choices, a wide variety of benchmarks has been proposed and used. To understand which ones are currently popular in computer vision, we surveyed the 2021 editions of CVPR, ICCV and NeurIPS, three highly ranked computer vision and machine learning conferences. From their proceedings, we selected all papers with the words continual, lifelong, sequential, incremental or forget in their titles. These keywords were chosen using a manually collected list of 50 CL papers, spanning publication years 2017-2022, of which 98% matched at least one keyword. After filtering for false positives, 60 relevant papers remained. Of these works, 73% included at
least one classification problem, 10% a semantic seg-
mentation problem, 7% a generative problem, 3% an
object detection problem and 10% various other prob-
lems. In the papers that focused on classification, 188 experiments (excluding ablations, etc.) were conducted. Of these, 90% used a non-continual dataset and randomly changed P(X) at discrete intervals, such that each class has non-zero probability in only a single context (strictly class-incremental). 8.5% changed P(X) without affecting P(Y) (strictly domain-incremental), and only 1.5% included more gradual context switches; see Section 2.3. See Figure 1 for the distribution of datasets
used. While the random and discrete context switches in these benchmarks cover only a small part of the space of CL-problems, they currently serve to assess the quality of almost all new CL-algorithms.
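To make this dominant protocol concrete, the following sketch (assuming integer class labels; function and variable names are illustrative) turns any static classification dataset into a strictly class-incremental benchmark by assigning each class to exactly one context:

    import numpy as np

    def class_incremental_split(labels, num_contexts, seed=0):
        # Randomly partition the classes over the contexts, so that
        # each class has non-zero probability in only one context.
        rng = np.random.default_rng(seed)
        classes = rng.permutation(np.unique(labels))
        groups = np.array_split(classes, num_contexts)
        # Per context, return the indices of the samples it contains.
        return [np.flatnonzero(np.isin(labels, g)) for g in groups]

For example, applied to CIFAR-10 labels with num_contexts=5, this yields five contexts of two classes each, with P(X) switching randomly and discretely at every context boundary.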
2.3. Towards more realistic benchmarks
While discrete and random distribution shifts are an
interesting tool to study continual learning, they are,
as previously discussed, not necessarily representative
of the changing context in which continual learning
systems can be used. In fact, as shown in Figure 1,
there is a whole continuum of context changes between strictly class-incremental and strictly domain-incremental learning. To aid progress towards applicable continual learning, Lomonaco et al. introduced CORe50 [7] and argue that realistic CL-benchmarks should provide access to multiple views of the same object. This corresponds to a more gradually changing data distribution P(X), an idea shared by [8,9,10], who introduce iCub World, Toys-200 and Stream-51, respectively.
Cossu et al. are critical of the lack of class repetition in continual learning benchmarks, arguing that it makes CL artificially difficult and unlike real-world scenarios [11]. They consider it realistic that a class has non-zero probability during multiple contexts, albeit with different frequencies. Additional support for this hypothesis can be found in related works [12,7].
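A minimal sketch of a stream with such repetition (the per-context class frequencies class_probs are a free choice here, not prescribed by [11]) could look like this:

    import numpy as np

    def sample_context(labels, class_probs, stream_length, seed=0):
        # Every class may occur in every context, but with a
        # context-specific frequency given by class_probs.
        rng = np.random.default_rng(seed)
        classes = np.unique(labels)
        drawn = rng.choice(classes, size=stream_length, p=class_probs)
        # For each drawn class, pick a random sample of that class.
        return np.array([rng.choice(np.flatnonzero(labels == c))
                         for c in drawn])

Calling this once per context with a different class_probs produces contexts in which all classes repeat, but with shifting frequencies.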
Repetition occurs naturally in benchmarks that leverage the temporal meta-data of images, an idea implemented by the benchmarks Wanderlust [13], Clear [14] and CLOC [15]. Using time of day as a context variable, both the data and label distributions change gradually and non-randomly, which comes closest to a real-world setting; see Figure 1 for how these benchmarks compare to more traditional ones.
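A minimal sketch of such a temporal benchmark, assuming an array of numeric capture timestamps, replays the data chronologically and uses time bins as contexts:

    import numpy as np

    def temporal_contexts(timestamps, bin_edges):
        # Replay the data in chronological order, so that P(X) and
        # P(Y) drift gradually and non-randomly over the stream.
        timestamps = np.asarray(timestamps)
        order = np.argsort(timestamps)
        bins = np.digitize(timestamps[order], bin_edges)
        return [order[bins == b] for b in np.unique(bins)]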
Regardless of context changes, some benchmarks allow only online learning, where a single pass over the data within a context is permitted [16]. This is regarded as a realistic scenario in other works [17,15,18,19].
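In code, this constraint simply amounts to a single pass over each context's stream (a sketch with a hypothetical update function standing in for the learner's training step):

    def train_online(model, context_stream, update):
        # Online CL: each sample (or mini-batch) is seen exactly once;
        # revisiting the context's data in a second epoch is not allowed.
        for x, y in context_stream:
            update(model, x, y)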
Finally, using the context as an input variable, such that the model knows the context of a sample (task-incremental learning), has been criticized as too restrictive [16]. Despite these
critiques and proposals to work towards more natural
benchmarks, the survey in Section 2.2 indicates that such benchmarks are seldom adopted in papers proposing new methods.
Most of these benchmarks focus on classification problems, as do most of the papers surveyed in Section 2.2. Despite its prevalence, classification is likely not the only scenario in which CL will be applied in practice, since it often requires a single (centered) object per image. Recent works [20,21] have started exploring object detection and semantic segmentation in CL, two problems that are more likely to benefit from CL in practice.
2.4. Evaluation of Continual Learning
Besides good benchmarks, metrics that accurately re-
flect the goals of continual learning are indispensable.
CL-methods are commonly evaluated using the average accuracy over all tasks at the end of training, and the average backward transfer (BWT): the difference between the accuracy on a task directly after it was trained and after all tasks have been trained [22].
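Writing R_{j,i} for the accuracy on task i after training on tasks 1 through j, with T tasks in total, one common formalization of these two metrics is

    \mathrm{ACC} = \frac{1}{T}\sum_{i=1}^{T} R_{T,i}, \qquad
    \mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1}\bigl(R_{T,i} - R_{i,i}\bigr)

so that a negative BWT indicates forgetting: training on later tasks degraded performance on earlier ones.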
These metrics are not necessarily aligned with the CL-goal of incorporating new knowledge into an already working system. According