
Benchmark        #TPs  #classes per TP        #train per TP  #test per TP
CIFAR-LECO       2     20 / 100               10k            10k
iNat-LECO        2     123 / 810              50k            4k
Mapillary-LECO   2     66 / 116               2.5k           2k
iNat-4TP-LECO    4     123 / 339 / 729 / 810  25k            4k
Table 1: LECO benchmarks use CIFAR100 [42], iNaturalist [69, 75], and Mapillary [52, 55]. We also vary the amount of training data in experiments; as conclusions hold across settings, we report these results in Tables 9, 10, and 11.
dataset is now labeled with inconsistent old-vs-new annotations, complicating learning. We show that joint-training [47] on both new and old ontologies, when combined with semi-supervised learning (SSL), can be remarkably effective. Concretely, we generate pseudo labels for the new ontology on the old data. But because pseudo labels are noisy [44, 58, 68, 79] and potentially biased [77], we use the coarse labels from the old ontology as "coarse supervision", or the coarse-fine label relationships [26, 45, 61, 73], to reconcile conflicts between the pseudo fine labels and the true coarse labels (Section 6; see the sketch below). Another natural question arises: should one finetune the previous TP's model or train from scratch? One might expect the former to offer no benefit (because it accesses the same training data) and to risk poor local minima. Surprisingly, we find that finetuning actually works much better, echoing curriculum learning [5, 21, 63]. Perhaps most surprisingly, we demonstrate that such strategies are near optimal, approaching an "upper bound" that trains on all available data re-annotated with the new ontology.
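To make this reconciliation concrete, below is a minimal sketch, assuming a PyTorch setup in which each fine class has a known parent coarse class; the function name, the `fine_to_coarse` tensor, and the confidence threshold are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def reconcile_pseudo_labels(fine_logits, coarse_labels, fine_to_coarse, thresh=0.9):
    """Pseudo-label old-TP data with the new (fine) ontology, but only over
    fine classes consistent with each image's true old-ontology (coarse) label.

    fine_logits:    (N, F) new-model logits for N old-TP images
    coarse_labels:  (N,)   ground-truth coarse labels from the old ontology
    fine_to_coarse: (F,)   parent coarse class of each fine class
    """
    probs = F.softmax(fine_logits, dim=1)
    # Zero out fine classes whose parent disagrees with the true coarse label.
    consistent = fine_to_coarse.unsqueeze(0) == coarse_labels.unsqueeze(1)  # (N, F)
    probs = probs * consistent
    probs = probs / probs.sum(dim=1, keepdim=True).clamp_min(1e-12)  # renormalize
    conf, pseudo = probs.max(dim=1)
    return pseudo, conf >= thresh  # pseudo fine labels + confidence mask
```

Only pseudo labels that both agree with the stored coarse label and clear the confidence threshold would then be mixed into joint training with the newly labeled data.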
Salient results. To study LECO, we repurpose three datasets: CIFAR100, iNaturalist, and Mapillary (Table 1). The latter two large-scale datasets have long-tailed distributions of class labels; notably, Mapillary's new ontology does not have a strictly surjective mapping to the old one, because the same data were relabeled from scratch with the new ontology. Our results on these datasets lead to the consistent technical insights summarized above. We preview the results on the iNaturalist-LECO setup (which simulates the large-scale, long-tailed scenario): (1) finetuning the previous TP's model outperforms training from scratch on data labeled with the new ontology: 73.64% vs. 65.40% accuracy; (2) jointly training on both old and new data significantly boosts accuracy to 82.98%; (3) exploiting the relationship between old and new labels (via Learning-with-Partial-Labels and SSL with pseudo-label refinement) further improves accuracy to 84.34%, effectively reaching the "upper bound" of 84.51%!
Contributions. We make three major contributions. First, we motivate the LECO problem and define its benchmarking protocol. Second, we extensively study approaches to LECO by adapting methods from related problems, including semi-supervised learning, class-incremental learning, and learning with partial labels. Third, we draw consistent conclusions and technical insights, as described above.
2 Related Work
Class-incremental learning (CIL) [18, 47, 74, 86] and LECO both require learning classifiers for new classes over distinct time periods (TPs), but they have important differences. First, the typical CIL setup assumes new labels have no relation to the old ones [1, 47, 49], while LECO's new labels refine or expand the old "coarse" labels (cf. Fig. 2). Second, CIL purposely sets a small memory buffer (for storing labeled data) and evaluates accuracy over both old and new classes, with an emphasis on information retention, i.e., avoiding catastrophic forgetting [10, 40, 47, 83]. In contrast, LECO allows all (historical) labeled data to be stored and highlights the difficulty of learning new classes that are fine-grained subclasses of the old ones. Further, we note that many real-world applications store historical data (and should not cap a buffer size to save storage) for privacy-related considerations (e.g., medical data records) and as forensic evidence (e.g., videos from surveillance cameras). Therefore, to approach LECO, we apply CIL methods that do not restrict buffer size, which usually serve as upper bounds in CIL [17]. In this work, we repurpose such "upper bound" CIL methods for LECO, including finetuning [3, 80] and joint training [11, 47, 50, 51], as well as the "lower bound" method of freezing the backbone [3, 20, 66]; a rough sketch of these baselines follows.
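As an illustration of how the baselines differ, the sketch below (our paraphrase, not the paper's training code) varies only the initialization and the set of trainable parameters; the checkpoint path, the ResNet-18 backbone, and the `num_fine_classes` argument are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_new_tp_model(prev_ckpt, num_fine_classes, mode="finetune"):
    """Initialize the new-TP classifier under different CIL-style baselines:
    'scratch' (random init), 'finetune' (warm-start from the previous TP),
    or 'freeze' (previous TP's backbone frozen; train only a new head)."""
    model = resnet18()
    if mode != "scratch":
        # Warm-start from the previous TP's weights, dropping its coarse head.
        state = torch.load(prev_ckpt, map_location="cpu")
        state = {k: v for k, v in state.items() if not k.startswith("fc.")}
        model.load_state_dict(state, strict=False)
    # New classification head for the fine (new-ontology) classes.
    model.fc = nn.Linear(model.fc.in_features, num_fine_classes)
    if mode == "freeze":
        for name, p in model.named_parameters():
            p.requires_grad = name.startswith("fc.")  # linear probing only
    return model
```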
Semi-supervised learning (SSL) learns over both labeled and unlabeled data. State-of-the-art SSL methods follow a self-training paradigm [54, 65, 79] that uses a model trained on labeled data to pseudo-label the unlabeled samples, which are then used together with the labeled data for training. For example, Lee et al. [44] use confident predictions as target labels for unlabeled data, and Rizve et al. [62] further incorporate low-probability predictions as negative pseudo labels. Others improve SSL using self-supervised learning techniques [24, 84], which force predictions to be similar for different augmentations of the same data [6, 7, 68, 79, 85]. We leverage these SSL insights to approach LECO, e.g., by pseudo-labeling the old data. As pseudo labels are often biased [77], we further exploit the old ontology to reject or improve inconsistent pseudo labels, yielding better performance (see the sketch below).
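For concreteness, the self-training paradigm described above reduces to a short loop; this is a generic, confidence-thresholded pseudo-labeling sketch in the spirit of Lee et al. [44], with hypothetical loader and threshold names rather than any specific method's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, thresh=0.95, device="cuda"):
    """Keep only confident model predictions on unlabeled data as targets;
    the resulting pairs are then mixed with labeled data for retraining."""
    model.eval()
    kept_images, kept_targets = [], []
    for x in unlabeled_loader:
        probs = F.softmax(model(x.to(device)), dim=1)
        conf, pred = probs.max(dim=1)
        keep = (conf >= thresh).cpu()      # reject low-confidence predictions
        kept_images.append(x[keep])
        kept_targets.append(pred.cpu()[keep])
    return torch.cat(kept_images), torch.cat(kept_targets)
```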