Continual Learning with Evolving Class Ontologies
Zhiqiu Lin1Deepak Pathak1Yu-Xiong Wang2Deva Ramanan1,3Shu Kong4
1CMU 2UIUC 3Argo AI 4Texas A&M University
Open-source code on the project webpage.
Abstract
Lifelong learners must recognize concept vocabularies that evolve over time. A common yet underexplored scenario is learning with class labels that continually refine/expand old classes. For example, humans learn to recognize dog before dog breeds. In practical settings, dataset versioning often introduces refinement to ontologies, such as autonomous vehicle benchmarks that refine a previous vehicle class into school-bus as autonomous operations expand to new cities. This paper formalizes a protocol for studying the problem of Learning with Evolving Class Ontology (LECO). LECO requires learning classifiers in distinct time periods (TPs); each TP introduces a new ontology of "fine" labels that refines old ontologies of "coarse" labels (e.g., dog breeds that refine the previous dog). LECO explores such questions as whether to annotate new data or relabel the old, how to exploit coarse labels, and whether to finetune the previous TP's model or train from scratch. To answer these questions, we leverage insights from related problems such as class-incremental learning. We validate them under the LECO protocol through the lens of image classification (on CIFAR and iNaturalist) and semantic segmentation (on Mapillary). Extensive experiments lead to some surprising conclusions: while the current status quo in the field is to relabel existing datasets with new class ontologies (such as COCO-to-LVIS or Mapillary1.2-to-2.0), LECO demonstrates that a far better strategy is to annotate new data with the new ontology. However, this produces an aggregate dataset with inconsistent old-vs-new labels, complicating learning. To address this challenge, we adopt methods from semi-supervised and partial-label learning. We demonstrate that such strategies can surprisingly be made near-optimal, in the sense of approaching an "oracle" that learns on the aggregate dataset exhaustively labeled with the newest ontology.
1 Introduction
Humans, as lifelong learners, learn to recognize an ontology of concepts that is refined and expanded over time. For example, we learn to recognize dog and then dog breeds (refining the class dog). The class ontology often evolves from coarse to fine concepts in the real open world and can sometimes introduce new classes. We call this Learning with Evolving Class Ontology (LECO).
Motivation. LECO is common when building machine-learning systems in practice. One often trains and maintains machine-learned models with class labels (or a class ontology) that are refined over time periods (TPs) (Fig. 2). Such ontology evolution can be caused by new requirements in applications, such as autonomous vehicles that expand operations to new cities. This is well demonstrated in many contemporary large-scale datasets that release updated versions by refining and expanding classes, such as Mapillary [52, 55], Argoverse [12, 78], KITTI [4, 23], and iNaturalist [33–37, 69, 75]. For example, Mapillary V1.2 [55] defined the road class, and Mapillary V2.0 [52] refined it with fine labels including parking-aisle and road-shoulder; Argoverse V1.0 [12] defined the bus class, and Argoverse V2.0 [78] refined it with fine classes including school-bus.

*Authors share senior authorship.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04993v4 [cs.CV] 15 Dec 2022

Figure 1: Learning with Evolving Class Ontology (LECO) requires training models in time periods (TPs) with new ontologies that refine/expand the previous ones. Left: The Mapillary dataset, which was constructed to study semantic segmentation for autonomous vehicles, has updated versions from V1.2 [55] to V2.0 [52], which refined previous labels with fine-grained ones on the same data, e.g., V1.2's sidewalk is split into V2.0's sidewalk and driveway. The visual demonstration with semantic segmentation ground truth blacks out unrelated classes for clarity. We explore LECO using this dataset. Right: To study LECO, we also repurpose the large-scale iNaturalist dataset [33–37, 69, 75] to simulate the coarse-to-fine ontology evolution through the lens of image classification. LECO asks the first basic question: for a new TP that defines a new ontology of classes, should one relabel the old data or label new data? Interestingly, both labeling protocols have been used in the community for large-scale datasets. Our study provides the (perhaps obvious) answer – one should always annotate new data with the new ontology rather than re-annotating the old. One reason is that the former produces more labeled data. But this produces an aggregate dataset with inconsistent old-vs-new labels. We make use of insights from semi-supervised learning and learning-with-partial-labels to learn from such heterogeneous annotations, approaching the upper bound of learning from an oracle aggregate dataset with all new labels.
Prior art. LECO requires learning new classes continually in TPs, similar to class-incremental learning (CIL) [18, 47, 74, 86]. They differ in important ways. First, CIL assumes brand-new labels that have no relations with the old [47, 49, 60], while LECO's new labels are fine-grained ones that expand the old / "coarse" labels (cf. Fig. 2). Interestingly, CIL can be seen as a special case of LECO by making use of a "catch-all" background class [41] (often denoted as void or unlabeled in existing datasets [14, 22, 49, 52, 55]), which can be refined over time to include specific classes that were previously ignored. Second, CIL restricts itself to a small memory buffer for storing historical data to focus on information retention and catastrophic forgetting [10, 47]. However, LECO stores all historical data (since large-scale human annotations typically come at a much higher cost than that of the additional storage) and is concerned with the most-recent classes of interest. Most related to LECO are approaches that explore the relationship of new classes to old ones; Sariyildiz et al. [64] study concept similarity but remove common superclasses in their study, while Abdelsalam et al. [1] explore CIL with the goal of discovering inter-class relations from labeled data (as opposed to assuming such relationships can be derived from a given taxonomic hierarchy). To explore LECO, we set up its benchmarking protocol (Section 3) and study extensive methods modified from those of related problems (Sections 4, 5, and 6).
Technical insights. LECO studies learning strategies and answers questions as basic as whether one should label new data or relabel the old data². Interestingly, both protocols have been used in the community for large-scale dataset versioning: Argoverse [12, 78] annotates new data, while Mapillary [52, 55] relabels its old data. Our experiments provide a definitive answer – one should always annotate new data with the new ontology rather than re-annotating the old data. The simple reason is that the former produces more labeled data. However, this comes at a cost: the aggregate
dataset is now labeled with inconsistent old-vs-new annotations, complicating learning. We show that joint-training [47] on both new and old ontologies, when combined with semi-supervised learning (SSL), can be remarkably effective. Concretely, we generate pseudo labels for the new ontology on the old data. But because pseudo labels are noisy [44, 58, 68, 79] and potentially biased [77], we make use of the coarse labels from the old ontology as "coarse supervision", or the coarse-fine label relationships [26, 45, 61, 73], to reconcile conflicts between the pseudo fine labels and the true coarse labels (Section 6). There is another natural question: should one finetune the previous TP's model or train from scratch? One may think the former has no benefit (because it accesses the same training data) and might suffer from local minima. Surprisingly, we find finetuning actually works much better, echoing curriculum learning [5, 21, 63]. Perhaps most surprisingly, we demonstrate that such strategies are near-optimal, approaching an "upper bound" that trains on all available data re-annotated with the new ontology.

²We assume the labeling cost is the same for either case. One may think re-annotating old data is cheaper, because one does not need to curate new data and may be able to redesign an annotation interface that exploits the old annotations (e.g., show old vehicle labels and just ask for the fine-grained make-and-model label). But in practice, in order to prevent carry-over of annotation errors and to simplify the annotation interface, most benchmark curators tend to relabel from scratch, regardless of whether the data is old or new. Examples of relabeling old data from scratch include Mapillary V1.2 [55] → V2.0 [52], and COCO [48] → LVIS [27].

Benchmark        #TPs  #classes/TP      #train/TP  #test/TP
CIFAR-LECO       2     20/100           10k        10k
iNat-LECO        2     123/810          50k        4k
Mapillary-LECO   2     66/116           2.5k       2k
iNat-4TP-LECO    4     123/339/729/810  25k        4k

Table 1: LECO benchmarks use CIFAR100 [42], iNaturalist [69, 75], and Mapillary [52, 55]. We also vary the amount of training data in experiments; as conclusions hold across settings, we report those results in Tables 9, 10, and 11.
Salient results. To study LECO, we repurpose three datasets: CIFAR100, iNaturalist, and Mapillary (Table 1). The latter two large-scale datasets have long-tailed distributions of class labels; in particular, Mapillary's new ontology does not have a strictly surjective mapping to the old one, because the same data was relabeled from scratch with the new ontology. Our results on these datasets lead to the consistent technical insights summarized above. We preview the results on the iNaturalist-LECO setup (which simulates the large-scale long-tailed scenario): (1) finetuning the previous TP's model outperforms training from scratch on data labeled with the new ontology: 73.64% vs. 65.40% in accuracy; (2) jointly training on both old and new data significantly boosts accuracy to 82.98%; (3) taking advantage of the relationship between old and new labels (via learning-with-partial-labels and SSL with pseudo-label refinement) further improves accuracy to 84.34%, effectively reaching the "upper bound" of 84.51%!
Contributions. We make three major contributions. First, we motivate the LECO problem and define its benchmarking protocol. Second, we extensively study approaches to LECO by modifying methods from related problems, including semi-supervised learning, class-incremental learning, and learning with partial labels. Third, we draw consistent conclusions and technical insights, as described above.
2 Related Work
Class-incremental learning (CIL) [18, 47, 74, 86] and LECO both require learning classifiers for new classes over distinct time periods (TPs), but they have important differences. First, the typical CIL setup assumes new labels have no relations with the old [1, 47, 49], while LECO's new labels refine or expand the old / "coarse" labels (cf. Fig. 2). Second, CIL purposely sets a small memory buffer (for storing labeled data) and evaluates accuracy over both old and new classes with an emphasis on information retention or catastrophic forgetting [10, 40, 47, 83]. In contrast, LECO allows all historical labeled data to be stored and highlights the difficulty of learning new classes that are fine/subclasses of the old ones. Further, we note that many real-world applications store historical data (and should not cap a buffer size to save storage) for privacy-related considerations (e.g., medical data records) and as forensic evidence (e.g., videos from surveillance cameras). Therefore, to approach LECO, we apply CIL methods that do not restrict buffer size, which usually serve as upper bounds in CIL [17]. In this work, we repurpose "upper bound" CIL methods for LECO including finetuning [3, 80] and joint training [11, 47, 50, 51], as well as the "lower bound" method of freezing the backbone [3, 20, 66].
Semi-supervised learning (SSL) learns over both labeled and unlabeled data. State-of-the-art SSL methods follow a self-training paradigm [54, 65, 79] that uses a model trained over labeled data to pseudo-label the unlabeled samples, which are then used together with labeled data for training. For example, Lee et al. [44] use confident predictions as target labels for unlabeled data, and Rizve et al. [62] further incorporate low-probability predictions as negative pseudo labels. Others improve SSL using self-supervised learning techniques [24, 84], which force predictions to be similar for different augmentations of the same data [6, 7, 68, 79, 85]. We leverage insights from SSL to approach LECO, e.g., pseudo-labeling the old data. As pseudo labels are often biased [77], we further exploit the old ontology to reject or improve inconsistent pseudo labels, yielding better performance.
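To make the rejection step concrete, below is a minimal sketch (not necessarily the exact implementation in this paper) of coarse-consistent pseudo-labeling: fine-class probabilities whose coarse parent contradicts an example's known coarse label are zeroed out before the pseudo label is taken. The `fine_to_coarse` tensor and the 0.95 confidence threshold are illustrative assumptions.

```python
import torch

def refine_pseudo_labels(fine_probs: torch.Tensor,
                         coarse_labels: torch.Tensor,
                         fine_to_coarse: torch.Tensor,
                         threshold: float = 0.95):
    """Reject fine pseudo labels that conflict with the known coarse label.

    fine_probs:     (B, F) softmax probabilities over fine classes.
    coarse_labels:  (B,)   ground-truth coarse labels of the old data.
    fine_to_coarse: (F,)   long tensor mapping each fine class to its coarse parent.
    Returns (pseudo_labels, keep_mask).
    """
    # A fine class is admissible only if its coarse parent matches the true coarse label.
    consistent = fine_to_coarse.unsqueeze(0) == coarse_labels.unsqueeze(1)  # (B, F)
    masked = fine_probs * consistent.float()
    # Renormalize over the surviving (coarse-consistent) fine classes.
    masked = masked / masked.sum(dim=1, keepdim=True).clamp_min(1e-12)
    conf, pseudo = masked.max(dim=1)
    return pseudo, conf >= threshold  # keep only confident, coarse-consistent labels
```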
Learning with partial labels (LPL) tackles the case where some examples are fully labeled while others are only partially labeled [9, 15, 26, 53, 56, 88]. In the area of fine-grained recognition, partial labels can be coarse superclasses that annotate some of the training data [31, 45, 61, 73]. In a TP of LECO, the old data from previous TPs can be used as partially-labeled examples, as they contain only coarse labels. Therefore, to approach LECO, we explore state-of-the-art methods [70] in this line of work. However, it is important to note that ontology evolution in LECO can be more than splitting old classes; e.g., the new class special-subject can be the result of merging the old classes child and police-officer. Indeed, we find this happens quite often in the ontology evolution from Mapillary V1.2 [55] to V2.0 [52]. In this case, the LPL method does not show better results than the simple joint-training approach (31.04 vs. 31.05 on Mapillary in Table 4).
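For intuition, a common way to use a coarse superclass label as partial supervision over a fine-grained classifier is to marginalize the fine-class distribution into coarse probabilities and apply cross-entropy there. The sketch below illustrates this generic construction under assumed tensor shapes; it is not necessarily the method of [70].

```python
import torch

def coarse_supervision_loss(fine_logits: torch.Tensor,
                            coarse_labels: torch.Tensor,
                            fine_to_coarse: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on coarse labels by marginalizing fine-class probabilities.

    fine_logits:    (B, F) scores over fine classes.
    coarse_labels:  (B,)   coarse labels over C coarse classes.
    fine_to_coarse: (F,)   long tensor: parent coarse index of each fine class.
    """
    fine_probs = fine_logits.softmax(dim=1)                      # (B, F)
    num_coarse = int(fine_to_coarse.max()) + 1
    coarse_probs = fine_probs.new_zeros(fine_probs.size(0), num_coarse)
    # P(coarse = c) = sum of P(fine = f) over fine classes f whose parent is c.
    coarse_probs.index_add_(1, fine_to_coarse, fine_probs)
    picked = coarse_probs.gather(1, coarse_labels.unsqueeze(1)).squeeze(1)
    return -picked.clamp_min(1e-12).log().mean()
```

In a LECO TP, old-TP examples would contribute this coarse term while new-TP examples use the standard fine-label cross-entropy; note this construction assumes a tree-structured (split-only) ontology, which, as discussed above, does not always hold for Mapillary.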
3 Problem Setup of Learning with Evolving Class Ontology (LECO)
Notations. Among T time periods (TPs), TP^t has a data distribution³ D^t = X × Y^t, where X is the input space and Y^t is the label set. The evolution of the class ontology is ideally modeled as a tree structure: the i-th class y_i^t ∈ Y^t is a node that has a unique parent y_j^{t-1} ∈ Y^{t-1}, which can be split into multiple fine classes in TP^t. Therefore, |Y^t| > |Y^{t-1}|. TP^t concerns classification w.r.t. the label set Y^t.

³The data distribution need not be static for LECO to apply, as discussed in Section 1. Our work assumes a static data distribution to simplify the study of LECO.
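As a toy illustration (class names are made up), the tree structure reduces to a parent map, which is also why a fine label trivially infers its coarse label under the LabelNew strategy of Section 4:

```python
# Hypothetical two-TP ontology: Y^0 = {vehicle, animal}, and Y^1 refines
# each coarse class into fine subclasses.
parent = {
    "school-bus": "vehicle",
    "sedan":      "vehicle",
    "dog":        "animal",
    "cat":        "animal",
}

def coarsen(fine_label: str) -> str:
    """Each fine class has a unique parent (tree property), so Y^1 -> Y^0 is a function."""
    return parent[fine_label]

assert coarsen("school-bus") == "vehicle"            # fine labels infer coarse labels
assert len(set(parent)) > len(set(parent.values()))  # |Y^1| > |Y^0|
```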
Benchmarks. We define LECO benchmarks using three datasets: CIFAR100 [42], iNaturalist [69, 75], and Mapillary [52, 55] (Table 1). CIFAR100 is released under the MIT license, and iNaturalist and Mapillary are publicly available for non-commercial research and educational purposes. CIFAR100 and Mapillary contain classes related to person and potentially raise fairness and privacy concerns, hence we proceed cautiously and release our code under the MIT license without redistributing the data. iNaturalist and Mapillary are large-scale and have long-tailed class distributions (refer to Tables 4 and 5 for class distributions). For each benchmark, we sample data from the corresponding dataset to construct time periods (TPs); e.g., TP0 and TP1 define their own ontologies of class labels Y^0 and Y^1, respectively. For CIFAR-LECO, we use the two-level hierarchy offered by the original CIFAR100 dataset [42]: the 20 superclasses serve as the ontology in TP0, and each of them is split into 5 subclasses to form the new ontology in TP1. For iNat-LECO, we use the most recent version of Semi-iNat-2021 [71], which comes with rich taxonomic labels in a seven-level hierarchy. To construct two TPs, we choose the third level (i.e., "order") with 123 categories for TP0, and the seventh ("species") with 810 categories for TP1. We further construct four TPs, each with an ontology at one of four distinct taxonomic levels ["order", "family", "genus", "species"]. For Mapillary-LECO, we use the V1.2 ontology [55] (2017) in TP0 and the V2.0 ontology [52] (2021) in TP1. It is worth noting that, as a real-world LECO example, Mapillary includes a catch-all background class (aka void) in V1.2, which was split into meaningful fine classes in V2.0, such as temporary-barrier, traffic-island, and void-ground.

Moreover, in each TP of the CIFAR-LECO and iNat-LECO benchmarks, we randomly sample 20% of the data as a validation set for hyperparameter tuning and model selection, and use the official validation sets as our test sets for benchmarking. In Mapillary-LECO, we do not use a validation set but instead use the default hyperparameters reported in [72], tuned to optimize the related dataset Cityscapes [14]; we find it unreasonably computationally demanding to tune hyperparameters on large-scale semantic segmentation datasets. Table 1 summarizes our benchmarks' statistics.
Remark: beyond class refinement in ontology evolution. In practice, ontology evolution also includes class merging, as well as class renaming (which is trivial to address). Mapillary's versioning clearly demonstrates this: 10 of the 66 classes in V1.2 were refined into new classes, resulting in 116 classes in total in V2.0; objects of the same old class were either merged into a new class or assigned new fine classes. Because of these non-trivial cases in the Mapillary-LECO benchmark, state-of-the-art methods for learning with partial labels are less effective, but our other solutions work well.
Remark: number of TPs and buffer size. For most experiments, we set two TPs. One may think more TPs are required because the CIL literature uses them [18, 47, 74, 86]. Recall that CIL emphasizes the catastrophic forgetting issue caused by using a limited buffer to store historical data, so using more TPs with more classes and a limited buffer helps exacerbate this issue. In contrast, LECO emphasizes the difficulty of learning new classes that are subclasses of the old ones, and does not limit a buffer for storing historical data. Therefore, using two TPs is sufficient to expose this difficulty. Even so, we include experiments with four TPs, which demonstrate conclusions consistent with those using two TPs. Moreover, even with an unlimited buffer, CIL methods still work poorly on LECO benchmarks (cf. "TrainScratch" [59] and "FreezePrev" [47] in Table 2).
Metric. Our benchmarks study LECO through the lens of image classification (on CIFAR-LECO and iNat-LECO) and semantic segmentation (on Mapillary-LECO). Therefore, we evaluate methods using mean accuracy (mAcc), averaged over per-class accuracies, and mean intersection-over-union (mIoU), averaged over classes, respectively. These metrics treat all classes equally, avoiding bias towards common classes under the naturally long-tailed class distributions (cf. iNaturalist [69, 75] and Mapillary [55]; see class distributions in the appendix). We do not use specialized algorithms to address the long-tailed recognition problem, but instead adopt a recent simple technique that improves performance by tuning weight decay [2]. We run each method five times with random seeds on CIFAR-LECO and iNat-LECO, and report the averaged mAcc with standard deviations. For Mapillary, due to the exorbitant training cost, we perform one run to benchmark methods.
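For clarity, here is a minimal sketch of the mAcc metric (mIoU is computed analogously by averaging per-class IoUs); `preds` and `labels` are assumed to be integer arrays:

```python
import numpy as np

def mean_per_class_accuracy(preds: np.ndarray, labels: np.ndarray, num_classes: int) -> float:
    """mAcc: mean of per-class accuracies, so rare classes weigh as much as common ones."""
    per_class = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():  # skip classes absent from the test set
            per_class.append((preds[mask] == c).mean())
    return float(np.mean(per_class))

# Example on an imbalanced test set: overall accuracy is 0.8, but mAcc is 0.5
# because the rare class is entirely misclassified.
labels = np.array([0, 0, 0, 0, 1])
preds  = np.array([0, 0, 0, 0, 0])
print(mean_per_class_accuracy(preds, labels, 2))  # 0.5
```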
4 Baseline Approaches to LECO
We first repurpose CIL and continual-learning methods as baselines for LECO, as both require learning classifiers for new classes across TPs. We begin with preliminaries.
Preliminaries. Following prior art [39, 47], we train convolutional neural networks (CNNs) and treat a network as a feature extractor f (parameterized by θ_f) plus a classifier g (parameterized by θ_g). The feature extractor f consists of all the layers below the penultimate layer of a ResNet [28], and the classifier g is the linear classifier followed by softmax normalization. Specifically, in TP^t, an input image x is fed into f to obtain its feature representation z = f(x; θ_f). The feature z is then fed into g to produce softmax probabilities q = g(z; θ_g) w.r.t. the classes in Y^t. We train CNNs by minimizing the cross-entropy (CE) loss using mini-batch SGD. At TP^t, to construct a training batch B^t, we randomly sample K examples, i.e., B^t = {(x_1, y_1^t), ..., (x_K, y_K^t)}. The CE loss on B^t is

    L(B^t) = Σ_{(x_k, y_k^t) ∈ B^t} H(y_k^t, g(f(x_k)))          (1)

where H(p, q) = −Σ_c p(c) log q(c) is the cross-entropy between two probability vectors.

Similarly, for semantic segmentation, we use the same CE loss (Eq. 1) at the pixel level, along with the recent HRNet architecture with an OCR module [72, 76, 81] (refer to Appendix C for details).
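The following is a minimal PyTorch sketch of one training step of Eq. (1); the tiny MLP stands in for the paper's ResNet feature extractor f, and the 100-way head (|Y^t| = 100) is illustrative. Note that `cross_entropy` averages over the batch rather than summing, which only rescales the gradient.

```python
import torch
import torch.nn as nn

# Stand-ins for f (feature extractor) and g (linear classifier).
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())
g = nn.Linear(512, 100)
opt = torch.optim.SGD(list(f.parameters()) + list(g.parameters()),
                      lr=0.1, momentum=0.9, weight_decay=5e-4)

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One mini-batch SGD step: CE loss H(y_k, softmax(g(f(x_k)))), averaged over the batch."""
    loss = nn.functional.cross_entropy(g(f(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage on a dummy CIFAR-sized batch.
loss = train_step(torch.randn(8, 3, 32, 32), torch.randint(0, 100, (8,)))
```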
Annotation Strategies. Closely related to LECO is the problem of class-incremental learning (CIL) or, more generally, continual learning (CL) [47, 59, 60, 83, 86]. However, unlike typical CL benchmarks, LECO does not artificially limit the buffer size for storing historical samples. Indeed, the expense of hard-drive storage is less of a problem compared to the high cost of data annotation. Therefore, LECO embraces all the historical labeled data, regardless of whether it was annotated using the old or new ontology. This leads to a fundamental question: should one annotate new data or re-label the old, given a fixed labeling budget N at each TP (cf. Fig. 2)?

• (LabelNew) Sample and annotate N new examples using the new ontology, which trivially infers the old-ontology labels. We additionally have the historical data labeled with the old ontology.

• (RelabelOld) Re-annotate the N historical examples using the new ontology. These N examples are then all the available data, although they carry both old- and new-ontology labels.

The answer is perhaps obvious – LabelNew – because it produces more labeled data. Furthermore, LabelNew allows one to exploit the old data to boost performance (Section 5). With proper techniques, this significantly reduces the performance gap with a model trained over both old and new data hypothetically annotated with the new ontology, termed AllFine, short for "supervised learning with all fine labels".
Training Strategies. We consider CL baseline methods as LECO baselines. When new tasks (i.e., the classes of Y^t) arrive in TP^t, it is desirable to transfer model weights trained for Y^{t-1} in TP^{t-1} [47, 60]. To do so, we initialize the weights θ_{f^t} of the feature extractor f^t with θ_{f^{t-1}} trained in TP^{t-1}, i.e., θ_{f^t} := θ_{f^{t-1}}. Then we can

• (FreezePrev) Freeze f^t to extract features, and learn a classifier g^t over them [47], as sketched below.
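A minimal sketch of FreezePrev under illustrative assumptions: the previous TP's model is a ResNet-18 (in practice one would load the TP^{t-1} checkpoint rather than a fresh network), and the new ontology has |Y^t| = 810 classes as in iNat-LECO.

```python
import torch
import torch.nn as nn
import torchvision

f = torchvision.models.resnet18()
f.fc = nn.Identity()            # keep only the feature extractor f^{t-1}
for p in f.parameters():
    p.requires_grad_(False)     # FreezePrev: f^t := f^{t-1}, kept frozen
f.eval()                        # also fix batch-norm statistics

g = nn.Linear(512, 810)         # new classifier g^t over Y^t
opt = torch.optim.SGD(g.parameters(), lr=0.1, momentum=0.9)

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """Train only the new head; features come from the frozen extractor."""
    with torch.no_grad():
        z = f(x)
    loss = nn.functional.cross_entropy(g(z), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```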