
new categories, which is generally not true in real applications. In this work, we examine the dynamic
setup of novel category discovery, where the system is initially given a set of data labeled with known
classes, and unlabeled data are then continuously streamed into the system for discovering new classes.
The system is required to consistently yield satisfactory performance on known classes and, at the
same time, dynamically discover new categories from the streaming unlabeled data. We refer to this
setting as Continuous Category Discovery, or CCD for short.
We illustrate the process of CCD in Figure 1. It comprises two main stages: the initial stage,
where a classification model is trained on a set of labeled examples, and the continuous category
discovery stage, where new categories are continuously discovered from a stream of unlabeled
data belonging to both known and unknown classes. An intuitive approach to addressing the dynamic
nature of CCD is to combine existing methods for open-set recognition [9–11], novel category
discovery [12, 6, 13], and incremental learning [14, 15]. This is, however, insufficient because our
learning system has to accomplish two tasks at the same time, i.e., accurately classify instances into
the known classes and discover new categories from an unlabeled data stream. It turns out that these
two tasks usually favor different types of features: discriminative features on known classes
are preferred by the classification model, while rich and diverse features are critical for identifying new
classes, as illustrated in Figure 2. A simple combination of novel category discovery and incremental
learning fails to address this trade-off consistently over time, which is further verified by our
empirical studies.
[Diagram. Initial stage: a labeled set (Dolphin, Fox) is used to train the classifier. Continuous category discovery stage: unlabeled sets 1 through t arrive at time-steps 1 through t, containing known and unknown classes, and the classifier's categories grow from Dolphin and Fox to include Novel Category 1 through Novel Category n.]
Figure 1: Overview of Continuous Category Discovery (CCD). The continuous data stream is
mixed with unlabeled samples from both known and novel categories. CCD requires distinguishing
known categories, discovering novel categories, and merging the discovered categories into the known set.
To address the challenge of continuous category discovery, we propose a framework of Grow and
Merge, or GM for short. After pre-training a static model A on the labeled data, we update
model A with respect to the unlabeled data stream by alternating between a growing phase and a
merging phase: in the growing phase, we increase the diversity of features by continuously
training our model on the received unlabeled data through a combination of supervised and
self-supervised learning; in the merging phase, we merge the grown model with the static one by
taking a weighted combination of both models. By alternating between the growing and merging
phases, we are able to maintain good performance on known classes and, at the same time, the ability
to discover new categories. This is visualized in Figure 2, where the first two panels show
that existing approaches can do well on one of the two tasks but not both, and the last panel shows that
features learned by the proposed GM framework work well for both tasks.
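The merging phase can be pictured as a parameter-wise weighted average of the two models. The following is a minimal sketch under stated assumptions, not the implementation used in this work: parameters are stored as plain name-to-list dictionaries, both models are assumed to share the same parameter names and shapes, and the function `merge_models` and the weight `alpha` are hypothetical names introduced here for illustration only.

```python
from typing import Dict, List

# Hypothetical flat parameter store: parameter name -> list of values.
Params = Dict[str, List[float]]


def merge_models(static: Params, grown: Params, alpha: float = 0.5) -> Params:
    """Return alpha * static + (1 - alpha) * grown, parameter by parameter.

    Assumption: both models have identical parameter names and shapes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if static.keys() != grown.keys():
        raise ValueError("models must have identical parameter names")
    merged: Params = {}
    for name, s_vals in static.items():
        g_vals = grown[name]
        if len(s_vals) != len(g_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [alpha * s + (1.0 - alpha) * g
                        for s, g in zip(s_vals, g_vals)]
    return merged


# Toy example: a larger alpha keeps the merged model closer to the static one.
static_model = {"w": [1.0, 2.0], "b": [0.0]}
grown_model = {"w": [3.0, 4.0], "b": [1.0]}
merged_model = merge_models(static_model, grown_model, alpha=0.75)
```

In this sketch, `alpha` controls the trade-off: values near 1 favor the static model (and hence known-class performance), while values near 0 favor the grown model (and hence feature diversity). How the weight is actually chosen is not specified in this section.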
Finally, one of the common issues with continuous training is catastrophic forgetting. To alleviate
the forgetting effect as the number of categories grows over time, we maintain a small
set of labeled samples from known categories and pseudo-labeled samples from novel categories.
These selected examples are used in the growing phase to expand feature diversity for effective
category discovery. Extensive experimental results show that our proposed method consistently
achieves satisfactory performance under multiple practical scenarios compared with existing methods.
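One way to picture the small exemplar set is as a fixed-capacity buffer per category. The sketch below is an illustration under loud assumptions: this section does not state how examples are selected, so the confidence-based selection rule, the class `ExemplarBuffer`, and its parameter `per_class` are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


class ExemplarBuffer:
    """Keeps at most `per_class` samples per category (hypothetical design).

    Assumption: each sample carries a confidence score (e.g. 1.0 for labeled
    samples, a model score for pseudo-labeled ones), and the buffer keeps the
    highest-confidence samples of each category.
    """

    def __init__(self, per_class: int) -> None:
        if per_class <= 0:
            raise ValueError("per_class must be positive")
        self.per_class = per_class
        # Min-heap per category of (confidence, insertion index, sample).
        # The unique insertion index ensures samples are never compared.
        self._heaps: Dict[str, List[Tuple[float, int, object]]] = {}
        self._counter = 0

    def add(self, label: str, sample: object, confidence: float = 1.0) -> None:
        entry = (confidence, self._counter, sample)
        self._counter += 1
        heap = self._heaps.setdefault(label, [])
        if len(heap) < self.per_class:
            heapq.heappush(heap, entry)
        else:
            # Push the new entry, then evict the lowest-confidence one.
            heapq.heappushpop(heap, entry)

    def samples(self, label: str) -> List[object]:
        """Return stored samples for `label`, highest confidence first."""
        entries = sorted(self._heaps.get(label, []), reverse=True)
        return [sample for _, _, sample in entries]


buffer = ExemplarBuffer(per_class=2)
buffer.add("fox", "img_a")                      # labeled sample, known class
buffer.add("novel_1", "img_b", confidence=0.6)  # pseudo-labeled samples
buffer.add("novel_1", "img_c", confidence=0.9)
buffer.add("novel_1", "img_d", confidence=0.7)  # evicts img_b (lowest score)
```

Bounding the buffer per category keeps memory constant per class as the number of categories grows, which is the property the paragraph above relies on; the selection criterion itself is interchangeable.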
The main contributions of this paper are summarized as follows:
• We study a new problem named continuous category discovery, or CCD, which better
  reflects the challenge of category discovery in the wild. It requires simultaneously maintaining
  good performance on known categories and the ability to discover novel categories.
• We propose a framework of grow and merge, or GM, for CCD that is able to resolve the
  conflicts between the classification task and the task of discovering new categories.