
represents the class label of $x^{t*}_i$, and $Y^t$ is the set of classes in the $t$-th increment. Note that, unlike most continual learning setups, $Y^i \cap Y^j \neq \emptyset$, $\forall i \neq j$. After training on $D^t$, the model $M$ is tested on all the classes encountered so far, $Y^1, Y^2, \ldots, Y^t$. The main challenges of FoCAL are three-fold: (1) avoid catastrophic forgetting, (2) prevent overfitting on the few training samples, and (3) efficiently choose the most informative samples in each increment.
For FoCAL on the task of object classification, we consider the model $M$ (a CNN) as a composition of a feature extractor $f(\cdot;\theta)$ with parameters $\theta$ and a classification model with weights $W$. The feature extractor transforms the input images into a feature space $F \in \mathbb{R}^n$. The classification model maps the extracted features to an output vector, followed by a softmax function that produces multi-class probabilities. In this paper, we use a pre-trained feature extractor, so the parameters $\theta$ are fixed. Thus, we incrementally finetune the classification model on $D^1, D^2, \ldots$ and obtain parameters $W^1, W^2, \ldots$. In increment $t$, we expand the output layer by $|Y^t|$ neurons to incorporate the new classes. Note that this setup does not alleviate the three challenges of FoCAL mentioned above. The subsections below describe the main components of our framework (Figure 1) that transform this setup for FoCAL.
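To make this baseline setup concrete, the sketch below shows one way the output layer could be expanded by $|Y^t|$ neurons at the start of an increment while keeping the weights already learned for earlier classes. It is a minimal sketch assuming a PyTorch linear classification head; the helper name `expand_classifier` and the specific dimensions are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

def expand_classifier(old_fc: nn.Linear, num_new_classes: int) -> nn.Linear:
    """Widen a linear head by num_new_classes outputs, copying the old weights."""
    new_fc = nn.Linear(old_fc.in_features, old_fc.out_features + num_new_classes)
    with torch.no_grad():
        new_fc.weight[: old_fc.out_features] = old_fc.weight
        new_fc.bias[: old_fc.out_features] = old_fc.bias
    return new_fc

# Example: increment t introduces |Y^t| = 3 new classes on top of 10 known ones.
head = nn.Linear(512, 10)          # classifier weights W after increment t-1 (512-d features)
head = expand_classifier(head, 3)  # 13 output neurons; softmax is applied over all of them
```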
2.1 GMM Based Continual Learning (GBCL)
We aim to develop a model that not only helps the system learn continually but is also motivated by the newness of an object. To accomplish this, we must evaluate how different an incoming object is from previously learned object classes, ideally without any additional supervision. We therefore consider a clustering-based approach to represent the distribution of object classes. Unlike previous clustering-based approaches for continual learning [10, 11] that represent each object class as a mean feature vector (centroid), we estimate the distribution of each object class using a uniform Gaussian mixture model (GMM). We believe that representing each class's data as a GMM may capture the true distribution of the data better than assuming that the distribution is circular. We call our complete algorithm for continually learning GMMs of multiple object classes GMM based continual learning (GBCL).
Once the $k$ feature vectors ($D^t$) selected by the acquisition function (Section 2.2) as the most informative samples are labeled by the oracle in increment $t$, GBCL is applied to learn GMMs for the classes $Y^t$.
For each feature vector $x^{t*}_i$ in $D^t$ labeled as $y^t_i$, if $y^t_i$ is a new class never seen by the model before, we initialize a new Gaussian distribution $\mathcal{N}(x^t_i, O)$ for class $y$ with $x^t_i$ as the mean (centroid) and a zero matrix ($O$) as the covariance matrix.²
Otherwise, if $y^t_i$ is a known class, we find the probabilities $\mathcal{N}(x^t_i \mid c^y_1, \sigma^y_1), \ldots, \mathcal{N}(x^t_i \mid c^y_j, \sigma^y_j), \ldots, \mathcal{N}(x^t_i \mid c^y_{n_y}, \sigma^y_{n_y})$ of $x^t_i$ belonging to each of the previously learned Gaussian distributions for class $y$, where $n_y$ is the total number of mixture components in the GMM for class $y$, and $c^y_j$ and $\sigma^y_j$ represent the centroid and covariance matrix of the $j$th mixture component of class $y$, respectively. If the maximum of these probabilities is higher than a pre-defined probability threshold $P$, $x^t_i$ is used to update the parameters (centroid and covariance matrix) of the most probable distribution ($\mathcal{N}(c^y_j, \sigma^y_j)$) in class $y$. The updated centroid $\hat{c}^y_j$ is calculated as a weighted mean between the previous centroid $c^y_j$ and $x^t_i$:

$$\hat{c}^y_j = \frac{w^y_j \times c^y_j + x^t_i}{w^y_j + 1} \qquad (1)$$
where $w^y_j$ is the number of images already clustered in the $j$th (most probable) Gaussian distribution. The updated covariance matrix $\hat{\sigma}^y_j$ is calculated based on the procedure described in [23]:
$$\hat{\sigma}^y_j = \frac{w^y_j - 1}{w^y_j}\,\sigma^y_j + \frac{w^y_j - 1}{(w^y_j)^2}\,(x^t_i - \hat{c}^y_j)^T (x^t_i - \hat{c}^y_j) \qquad (2)$$
where $\sigma^y_j$ is the previous covariance matrix and $(x^t_i - \hat{c}^y_j)^T(x^t_i - \hat{c}^y_j)$ represents the covariance between $x^t_i$ and $\hat{c}^y_j$.
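The update step above can be summarized in a short sketch. This is a minimal reconstruction under stated assumptions: the per-class list of components, the helper name `gbcl_update`, and the small ridge added to the covariance before evaluating the densities (needed because new components start with a zero covariance matrix) are illustrative choices rather than details from the paper, and the below-threshold branch is left open because its description continues in the following sentence.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gbcl_update(gmms, x, y, P):
    """One GBCL step for a feature vector x labeled with class y.

    gmms maps each class label to a list of components, each a dict with the
    centroid 'c', covariance matrix 'sigma', and image count 'w'.
    P is the pre-defined probability threshold.
    """
    if y not in gmms:  # a class never seen before: one component with zero covariance
        gmms[y] = [{"c": x.copy(), "sigma": np.zeros((x.size, x.size)), "w": 1}]
        return

    # Probability of x under every previously learned component of class y
    # (the tiny ridge keeps the density well defined; it is an assumption, not from the paper).
    probs = [
        multivariate_normal.pdf(x, mean=comp["c"], cov=comp["sigma"] + 1e-6 * np.eye(x.size))
        for comp in gmms[y]
    ]
    j = int(np.argmax(probs))
    if probs[j] > P:  # update the most probable component using Eqs. (1) and (2)
        comp = gmms[y][j]
        w, c, sigma = comp["w"], comp["c"], comp["sigma"]
        c_hat = (w * c + x) / (w + 1)                                           # Eq. (1)
        diff = (x - c_hat).reshape(1, -1)
        sigma_hat = ((w - 1) / w) * sigma + ((w - 1) / w**2) * (diff.T @ diff)  # Eq. (2)
        comp.update(c=c_hat, sigma=sigma_hat, w=w + 1)
    else:
        # Below-threshold case: handled as described in the continuation of the text.
        pass
```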
If, on the other hand, the maximum probability among the calculated probabilities
² We do not describe mixing coefficients here, as they will always be $1/n$ for a uniform GMM, where $n$ is the number of mixture components.