Continual Learning with Evolving Class Ontologies
Zhiqiu Lin1Deepak Pathak1Yu-Xiong Wang2Deva Ramanan1,3Shu Kong4
1CMU 2UIUC 3Argo AI 4Texas A&M University
Open-source code on the project webpage.
Abstract
Lifelong learners must recognize concept vocabularies that evolve over time. A common yet underexplored scenario is learning with class labels that continually refine/expand old classes. For example, humans learn to recognize dog before dog breeds. In practical settings, dataset versioning often introduces refinement to ontologies, such as autonomous vehicle benchmarks that refine a previous vehicle class into school-bus as autonomous operations expand to new cities. This paper formalizes a protocol for studying the problem of Learning with Evolving Class Ontology (LECO). LECO requires learning classifiers in distinct time periods (TPs); each TP introduces a new ontology of "fine" labels that refines old ontologies of "coarse" labels (e.g., dog breeds that refine the previous dog). LECO explores such questions as whether to annotate new data or relabel the old, how to exploit coarse labels, and whether to finetune the previous TP's model or train from scratch. To answer these questions, we leverage insights from related problems such as class-incremental learning. We validate them under the LECO protocol through the lens of image classification (on CIFAR and iNaturalist) and semantic segmentation (on Mapillary). Extensive experiments lead to some surprising conclusions: while the current status quo in the field is to relabel existing datasets with new class ontologies (such as COCO-to-LVIS or Mapillary1.2-to-2.0), LECO demonstrates that a far better strategy is to annotate new data with the new ontology. However, this produces an aggregate dataset with inconsistent old-vs-new labels, complicating learning. To address this challenge, we adopt methods from semi-supervised and partial-label learning. We demonstrate that such strategies can surprisingly be made near-optimal, in the sense of approaching an "oracle" that learns on the aggregate dataset exhaustively labeled with the newest ontology.
1 Introduction
Humans, as lifelong learners, learn to recognize an ontology of concepts that is refined and expanded over time. For example, we learn to recognize dog and then dog breeds (refining the class dog). The class ontology often evolves from coarse to fine concepts in the real open world and can sometimes introduce new classes. We call this Learning with Evolving Class Ontology (LECO).
Motivation. LECO is common when building machine-learning systems in practice. One often trains and maintains machine-learned models with class labels (or a class ontology) that are refined over time periods (TPs) (Fig. 2). Such ontology evolution can be caused by new requirements in applications, such as autonomous vehicles that expand operations to new cities. This is well demonstrated in many contemporary large-scale datasets that release updated versions by refining and expanding classes, such as Mapillary [52, 55], Argoverse [12, 78], KITTI [4, 23], and iNaturalist [33–37, 69, 75]. For example, Mapillary V1.2 [55] defined the road class, and Mapillary V2.0 [52] refined it with fine labels including parking-aisle and road-shoulder; Argoverse V1.0 [12] defined the bus class, and Argoverse V2.0 [78] refined it with fine classes including school-bus.

*Authors share senior authorship.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04993v4 [cs.CV] 15 Dec 2022

Figure 1: Learning with Evolving Class Ontology (LECO) requires training models in time periods (TPs) with new ontologies that refine/expand the previous ones. Left: The Mapillary dataset, which was constructed to study semantic segmentation for autonomous vehicles, has updated versions from V1.2 [55] to V2.0 [52], which refined previous labels with fine-grained ones on the same data, e.g., V1.2's sidewalk is split into V2.0's sidewalk and driveway. The visual demonstration with semantic segmentation ground truth blacks out unrelated classes for clarity. We explore LECO using this dataset. Right: To study LECO, we also repurpose the large-scale iNaturalist dataset [33–37, 69, 75] to simulate the coarse-to-fine ontology evolution through the lens of image classification. LECO asks the first basic question: for a new TP that defines a new ontology of classes, should one relabel the old data or label new data? Interestingly, both labeling protocols have been used in the community for large-scale datasets. Our study provides the (perhaps obvious) answer – one should always annotate new data with the new ontology rather than re-annotating the old. One reason is that the former produces more labeled data. But this produces an aggregate dataset with inconsistent old-vs-new labels. We make use of insights from semi-supervised learning and learning-with-partial-labels to learn from such heterogeneous annotations, approaching the upper bound of learning from an oracle aggregate dataset with all new labels.
Prior art. LECO requires learning new classes continually in TPs, similar to class-incremental learning (CIL) [18, 47, 74, 86]. They differ in important ways. First, CIL assumes brand-new labels that have no relations with the old [47, 49, 60], while LECO's new labels are fine-grained ones that expand the old / "coarse" labels (cf. Fig. 2). Interestingly, CIL can be seen as a special case of LECO by making use of a "catch-all" background class [41] (often denoted as void or unlabeled in existing datasets [14, 22, 49, 52, 55]), which can be refined over time to include specific classes that were previously ignored. Second, CIL restricts itself to a small memory buffer for storing historical data to focus on information retention and catastrophic forgetting [10, 47]. However, LECO stores all historical data (since large-scale human annotations typically come at a much higher cost than that of the additional storage) and is concerned with the most-recent classes of interest. Most related to LECO are approaches that explore the relationship of new classes to old ones; Sariyildiz et al. [64] study concept similarity but remove common superclasses in their study, while Abdelsalam et al. [1] explore CIL with the goal of discovering inter-class relations from labeled data (as opposed to assuming such relationships can be derived from a given taxonomic hierarchy). To explore LECO, we set up its benchmarking protocol (Section 3) and study extensive methods modified from those of related problems (Sections 4, 5, and 6).
Technical insights. LECO studies learning strategies and answers questions as basic as whether one should label new data or relabel the old data². Interestingly, both protocols have been used in the community for large-scale dataset versioning: Argoverse [12, 78] annotates new data, while Mapillary [52, 55] relabels its old data. Our experiments provide a definitive answer – one should always annotate new data with the new ontology rather than re-annotating the old data. The simple reason is that the former produces more labeled data. However, this comes at a cost: the aggregate
dataset is now labeled with inconsistent old-vs-new annotations, complicating learning. We show that joint-training [47] on both new and old ontologies, when combined with semi-supervised learning (SSL), can be remarkably effective. Concretely, we generate pseudo labels for the new ontology on the old data. But because pseudo labels are noisy [44, 58, 68, 79] and potentially biased [77], we make use of the coarse labels from the old ontology as "coarse supervision", or the coarse-fine label relationships [26, 45, 61, 73], to reconcile conflicts between the pseudo fine labels and the true coarse labels (Section 6). There is another natural question: should one finetune the previous TP's model or train from scratch? One may think the former has no benefit (because it accesses the same training data) and might suffer from local minima. Surprisingly, we find finetuning actually works much better, echoing curriculum learning [5, 21, 63]. Perhaps most surprisingly, we demonstrate that such strategies are near-optimal, approaching an "upper bound" that trains on all available data re-annotated with the new ontology.

²We assume the labeling cost is the same for either case. One may think re-annotating old data is cheaper, because one does not need to curate new data and may be able to redesign an annotation interface that exploits the old annotations (e.g., show old vehicle labels and just ask for the fine-grained make-and-model label). But in practice, in order to prevent carry-over of annotation errors and to simplify the annotation interface, most benchmark curators tend to relabel from scratch, regardless of whether the data is old or new. Examples of relabeling old data from scratch include Mapillary V1.2 [55] → V2.0 [52], and COCO [48] → LVIS [27].

Benchmark        #TPs  #classes/TP      #train/TP  #test/TP
CIFAR-LECO       2     20/100           10k        10k
iNat-LECO        2     123/810          50k        4k
Mapillary-LECO   2     66/116           2.5k       2k
iNat-4TP-LECO    4     123/339/729/810  25k        4k

Table 1: LECO benchmarks use CIFAR100 [42], iNaturalist [69, 75], and Mapillary [52, 55]. We also vary the amount of training data in experiments; as conclusions hold across settings, we report those results in Tables 9, 10, and 11.
Salient results. To study LECO, we repurpose three datasets: CIFAR100, iNaturalist, and Mapillary (Table 1). The latter two large-scale datasets have long-tailed distributions of class labels; in particular, Mapillary's new ontology does not have a strictly surjective mapping to the old one, because the same data was relabeled from scratch with the new ontology. Our results on these datasets lead to the consistent technical insights summarized above. We preview the results on the iNaturalist-LECO setup (which simulates the large-scale long-tailed scenario): (1) finetuning the previous TP's model outperforms training from scratch on data labeled with the new ontology: 73.64% vs. 65.40% in accuracy; (2) jointly training on both old and new data significantly boosts accuracy to 82.98%; (3) taking advantage of the relationship between old and new labels (via learning-with-partial-labels and SSL with pseudo-label refinement) further improves accuracy to 84.34%, effectively reaching the "upper bound" of 84.51%!
Contributions. We make three major contributions. First, we motivate the LECO problem and define its benchmarking protocol. Second, we extensively study approaches to LECO by modifying methods from related problems, including semi-supervised learning, class-incremental learning, and learning with partial labels. Third, we draw consistent conclusions and technical insights, as described above.
2 Related Work
Class-incremental learning (CIL) [18, 47, 74, 86] and LECO both require learning classifiers for new classes over distinct time periods (TPs), but they have important differences. First, the typical CIL setup assumes new labels have no relations with the old [1, 47, 49], while LECO's new labels refine or expand the old / "coarse" labels (cf. Fig. 2). Second, CIL purposely sets a small memory buffer (for storing labeled data) and evaluates accuracy over both old and new classes with an emphasis on information retention or catastrophic forgetting [10, 40, 47, 83]. In contrast, LECO allows all historical labeled data to be stored and highlights the difficulty of learning new classes that are fine/subclasses of the old ones. Further, we note that many real-world applications store historical data (and should not cap a buffer size to save storage) for privacy-related considerations (e.g., medical data records) and as forensic evidence (e.g., videos from surveillance cameras). Therefore, to approach LECO, we apply CIL methods that do not restrict buffer size, which usually serve as upper bounds in CIL [17]. In this work, we repurpose "upper bound" CIL methods for LECO including finetuning [3, 80] and joint training [11, 47, 50, 51], as well as the "lower bound" method of freezing the backbone [3, 20, 66].
Semi-supervised learning (SSL) learns over both labeled and unlabeled data. State-of-the-art SSL methods follow a self-training paradigm [54, 65, 79] that uses a model trained over labeled data to pseudo-label the unlabeled samples, which are then used together with labeled data for training. For example, Lee et al. [44] use confident predictions as target labels for unlabeled data, and Rizve et al. [62] further incorporate low-probability predictions as negative pseudo labels. Others improve SSL using self-supervised learning techniques [24, 84], which force predictions to be similar for different augmentations of the same data [6, 7, 68, 79, 85]. We leverage insights from SSL to approach LECO, e.g., pseudo-labeling the old data. As pseudo labels are often biased [77], we further exploit the old ontology to reject or improve inconsistent pseudo labels, yielding better performance.
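To make the rejection step concrete, below is a minimal sketch (not necessarily the exact implementation in this paper) of coarse-consistent pseudo-labeling: fine-class probabilities whose coarse parent contradicts an example's known coarse label are zeroed out before the pseudo label is taken. The `fine_to_coarse` tensor and the 0.95 confidence threshold are illustrative assumptions.

```python
import torch

def refine_pseudo_labels(fine_probs: torch.Tensor,
                         coarse_labels: torch.Tensor,
                         fine_to_coarse: torch.Tensor,
                         threshold: float = 0.95):
    """Reject fine pseudo labels that conflict with the known coarse label.

    fine_probs:     (B, F) softmax probabilities over fine classes.
    coarse_labels:  (B,)   ground-truth coarse labels of the old data.
    fine_to_coarse: (F,)   long tensor mapping each fine class to its coarse parent.
    Returns (pseudo_labels, keep_mask).
    """
    # A fine class is admissible only if its coarse parent matches the true coarse label.
    consistent = fine_to_coarse.unsqueeze(0) == coarse_labels.unsqueeze(1)  # (B, F)
    masked = fine_probs * consistent.float()
    # Renormalize over the surviving (coarse-consistent) fine classes.
    masked = masked / masked.sum(dim=1, keepdim=True).clamp_min(1e-12)
    conf, pseudo = masked.max(dim=1)
    return pseudo, conf >= threshold  # keep only confident, coarse-consistent labels
```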
Learning with partial labels (LPL) tackles the case where some examples are fully labeled while others are only partially labeled [9, 15, 26, 53, 56, 88]. In the area of fine-grained recognition, partial labels can be coarse superclasses that annotate some of the training data [31, 45, 61, 73]. In a TP of LECO, the old data from previous TPs can be used as partially-labeled examples, as they contain only coarse labels. Therefore, to approach LECO, we explore state-of-the-art methods [70] in this line of work. However, it is important to note that ontology evolution in LECO can be more than splitting old classes; e.g., the new class special-subject can be the result of merging the old classes child and police-officer. Indeed, we find this happens quite often in the ontology evolution from Mapillary V1.2 [55] to V2.0 [52]. In this case, the LPL method does not show better results than the simple joint-training approach (31.04 vs. 31.05 on Mapillary in Table 4).
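For intuition, a common way to use a coarse superclass label as partial supervision over a fine-grained classifier is to marginalize the fine-class distribution into coarse probabilities and apply cross-entropy there. The sketch below illustrates this generic construction under assumed tensor shapes; it is not necessarily the method of [70].

```python
import torch

def coarse_supervision_loss(fine_logits: torch.Tensor,
                            coarse_labels: torch.Tensor,
                            fine_to_coarse: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on coarse labels by marginalizing fine-class probabilities.

    fine_logits:    (B, F) scores over fine classes.
    coarse_labels:  (B,)   coarse labels over C coarse classes.
    fine_to_coarse: (F,)   long tensor: parent coarse index of each fine class.
    """
    fine_probs = fine_logits.softmax(dim=1)                      # (B, F)
    num_coarse = int(fine_to_coarse.max()) + 1
    coarse_probs = fine_probs.new_zeros(fine_probs.size(0), num_coarse)
    # P(coarse = c) = sum of P(fine = f) over fine classes f whose parent is c.
    coarse_probs.index_add_(1, fine_to_coarse, fine_probs)
    picked = coarse_probs.gather(1, coarse_labels.unsqueeze(1)).squeeze(1)
    return -picked.clamp_min(1e-12).log().mean()
```

In a LECO TP, old-TP examples would contribute this coarse term while new-TP examples use the standard fine-label cross-entropy; note this construction assumes a tree-structured (split-only) ontology, which, as discussed above, does not always hold for Mapillary.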
3 Problem Setup of Learning with Evolving Class Ontology (LECO)
Notations. Among T time periods (TPs), TP^t has a data distribution³ D^t = X × Y^t, where X is the input space and Y^t is the label set. The evolution of the class ontology is ideally modeled as a tree structure: the i-th class y_i^t ∈ Y^t is a node that has a unique parent y_j^{t-1} ∈ Y^{t-1}, which can be split into multiple fine classes in TP^t. Therefore, |Y^t| > |Y^{t-1}|. TP^t concerns classification w.r.t. the label set Y^t.

³The data distribution need not be static for LECO to apply, as discussed in Section 1. Our work assumes a static data distribution to simplify the study of LECO.
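As a toy illustration (class names are made up), the tree structure reduces to a parent map, which is also why a fine label trivially infers its coarse label under the LabelNew strategy of Section 4:

```python
# Hypothetical two-TP ontology: Y^0 = {vehicle, animal}, and Y^1 refines
# each coarse class into fine subclasses.
parent = {
    "school-bus": "vehicle",
    "sedan":      "vehicle",
    "dog":        "animal",
    "cat":        "animal",
}

def coarsen(fine_label: str) -> str:
    """Each fine class has a unique parent (tree property), so Y^1 -> Y^0 is a function."""
    return parent[fine_label]

assert coarsen("school-bus") == "vehicle"            # fine labels infer coarse labels
assert len(set(parent)) > len(set(parent.values()))  # |Y^1| > |Y^0|
```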
Benchmarks. We define LECO benchmarks using three datasets: CIFAR100 [42], iNaturalist [69, 75], and Mapillary [52, 55] (Table 1). CIFAR100 is released under the MIT license, and iNaturalist and Mapillary are publicly available for non-commercial research and educational purposes. CIFAR100 and Mapillary contain classes related to person and potentially raise fairness and privacy concerns, hence we proceed cautiously and release our code under the MIT license without redistributing the data. iNaturalist and Mapillary are large-scale and have long-tailed class distributions (refer to Tables 4 and 5 for class distributions). For each benchmark, we sample data from the corresponding dataset to construct time periods (TPs); e.g., TP0 and TP1 define their own ontologies of class labels Y^0 and Y^1, respectively. For CIFAR-LECO, we use the two-level hierarchy offered by the original CIFAR100 dataset [42]: the 20 superclasses serve as the ontology in TP0, and each of them is split into 5 subclasses to form the new ontology in TP1. For iNat-LECO, we use the most recent version of Semi-iNat-2021 [71], which comes with rich taxonomic labels in a seven-level hierarchy. To construct two TPs, we choose the third level (i.e., "order") with 123 categories for TP0, and the seventh ("species") with 810 categories for TP1. We further construct four TPs, each with an ontology at one of four distinct taxonomic levels ["order", "family", "genus", "species"]. For Mapillary-LECO, we use the V1.2 ontology [55] (2017) in TP0 and the V2.0 ontology [52] (2021) in TP1. It is worth noting that, as a real-world LECO example, Mapillary includes a catch-all background class (aka void) in V1.2, which was split into meaningful fine classes in V2.0, such as temporary-barrier, traffic-island, and void-ground.

Moreover, in each TP of the CIFAR-LECO and iNat-LECO benchmarks, we randomly sample 20% of the data as a validation set for hyperparameter tuning and model selection, and use the official validation sets as our test sets for benchmarking. In Mapillary-LECO, we do not use a validation set but instead use the default hyperparameters reported in [72], tuned to optimize the related dataset Cityscapes [14]; we find it unreasonably computationally demanding to tune hyperparameters on large-scale semantic segmentation datasets. Table 1 summarizes our benchmarks' statistics.
Remark: beyond class refinement in ontology evolution. In practice, ontology evolution also includes class merging, as well as class renaming (which is trivial to address). Mapillary's versioning clearly demonstrates this: 10 of the 66 classes in V1.2 were refined into new classes, resulting in 116 classes in total in V2.0; objects of the same old class were either merged into a new class or assigned new fine classes. Because of these non-trivial cases in the Mapillary-LECO benchmark, state-of-the-art methods for learning with partial labels are less effective, but our other solutions work well.
Remark: number of TPs and buffer size. For most experiments, we set two TPs. One may think more TPs are required because the CIL literature uses them [18, 47, 74, 86]. Recall that CIL emphasizes the catastrophic forgetting issue caused by using a limited buffer to store historical data, so using more TPs with more classes and a limited buffer helps exacerbate this issue. In contrast, LECO emphasizes the difficulty of learning new classes that are subclasses of the old ones, and does not limit a buffer for storing historical data. Therefore, using two TPs is sufficient to expose this difficulty. Even so, we include experiments with four TPs, which demonstrate conclusions consistent with those using two TPs. Moreover, even with an unlimited buffer, CIL methods still work poorly on LECO benchmarks (cf. "TrainScratch" [59] and "FreezePrev" [47] in Table 2).
Metric. Our benchmarks study LECO through the lens of image classification (on CIFAR-LECO and iNat-LECO) and semantic segmentation (on Mapillary-LECO). Therefore, we evaluate methods using mean accuracy (mAcc), averaged over per-class accuracies, and mean intersection-over-union (mIoU), averaged over classes, respectively. These metrics treat all classes equally, avoiding bias towards common classes under the naturally long-tailed class distributions (cf. iNaturalist [69, 75] and Mapillary [55]; see class distributions in the appendix). We do not use specialized algorithms to address the long-tailed recognition problem, but instead adopt a recent simple technique that improves performance by tuning weight decay [2]. We run each method five times with random seeds on CIFAR-LECO and iNat-LECO, and report the averaged mAcc with standard deviations. For Mapillary, due to the exorbitant training cost, we perform one run to benchmark methods.
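For clarity, here is a minimal sketch of the mAcc metric (mIoU is computed analogously by averaging per-class IoUs); `preds` and `labels` are assumed to be integer arrays:

```python
import numpy as np

def mean_per_class_accuracy(preds: np.ndarray, labels: np.ndarray, num_classes: int) -> float:
    """mAcc: mean of per-class accuracies, so rare classes weigh as much as common ones."""
    per_class = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():  # skip classes absent from the test set
            per_class.append((preds[mask] == c).mean())
    return float(np.mean(per_class))

# Example on an imbalanced test set: overall accuracy is 0.8, but mAcc is 0.5
# because the rare class is entirely misclassified.
labels = np.array([0, 0, 0, 0, 1])
preds  = np.array([0, 0, 0, 0, 0])
print(mean_per_class_accuracy(preds, labels, 2))  # 0.5
```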
4 Baseline Approaches to LECO
We first repurpose CIL and continual-learning methods as baselines for LECO, as both require learning classifiers for new classes across TPs. We begin with preliminaries.
Preliminaries. Following prior art [39, 47], we train convolutional neural networks (CNNs) and treat a network as a feature extractor f (parameterized by θ_f) plus a classifier g (parameterized by θ_g). The feature extractor f consists of all the layers below the penultimate layer of a ResNet [28], and the classifier g is the linear classifier followed by softmax normalization. Specifically, in TP^t, an input image x is fed into f to obtain its feature representation z = f(x; θ_f). The feature z is then fed into g to produce softmax probabilities q = g(z; θ_g) w.r.t. the classes in Y^t. We train CNNs by minimizing the cross-entropy (CE) loss using mini-batch SGD. At TP^t, to construct a training batch B^t, we randomly sample K examples, i.e., B^t = {(x_1, y_1^t), ..., (x_K, y_K^t)}. The CE loss on B^t is

    L(B^t) = Σ_{(x_k, y_k^t) ∈ B^t} H(y_k^t, g(f(x_k)))          (1)

where H(p, q) = −Σ_c p(c) log q(c) is the cross-entropy between two probability vectors.

Similarly, for semantic segmentation, we use the same CE loss (Eq. 1) at the pixel level, along with the recent HRNet architecture with an OCR module [72, 76, 81] (refer to Appendix C for details).
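The following is a minimal PyTorch sketch of one training step of Eq. (1); the tiny MLP stands in for the paper's ResNet feature extractor f, and the 100-way head (|Y^t| = 100) is illustrative. Note that `cross_entropy` averages over the batch rather than summing, which only rescales the gradient.

```python
import torch
import torch.nn as nn

# Stand-ins for f (feature extractor) and g (linear classifier).
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())
g = nn.Linear(512, 100)
opt = torch.optim.SGD(list(f.parameters()) + list(g.parameters()),
                      lr=0.1, momentum=0.9, weight_decay=5e-4)

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One mini-batch SGD step: CE loss H(y_k, softmax(g(f(x_k)))), averaged over the batch."""
    loss = nn.functional.cross_entropy(g(f(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage on a dummy CIFAR-sized batch.
loss = train_step(torch.randn(8, 3, 32, 32), torch.randint(0, 100, (8,)))
```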
Annotation Strategies. Closely related to LECO is the problem of class-incremental learning (CIL) or, more generally, continual learning (CL) [47, 59, 60, 83, 86]. However, unlike typical CL benchmarks, LECO does not artificially limit the buffer size for storing historical samples. Indeed, the expense of hard-drive storage is less of a problem compared to the high cost of data annotation. Therefore, LECO embraces all the historical labeled data, regardless of whether it was annotated using the old or new ontology. This leads to a fundamental question: should one annotate new data or re-label the old, given a fixed labeling budget N at each TP (cf. Fig. 2)?

• (LabelNew) Sample and annotate N new examples using the new ontology, which trivially infers the old-ontology labels. We additionally have the historical data labeled with the old ontology.

• (RelabelOld) Re-annotate the N historical examples using the new ontology. These N examples are then all the available data, although they carry both old- and new-ontology labels.

The answer is perhaps obvious – LabelNew – because it produces more labeled data. Furthermore, LabelNew allows one to exploit the old data to boost performance (Section 5). With proper techniques, this significantly reduces the performance gap with a model trained over both old and new data hypothetically annotated with the new ontology, termed AllFine, short for "supervised learning with all fine labels".
Training Strategies. We consider CL baseline methods as LECO baselines. When new tasks (i.e., the classes of Y^t) arrive in TP^t, it is desirable to transfer model weights trained for Y^{t-1} in TP^{t-1} [47, 60]. To do so, we initialize the weights θ_{f^t} of the feature extractor f^t with θ_{f^{t-1}} trained in TP^{t-1}, i.e., θ_{f^t} := θ_{f^{t-1}}. Then we can

• (FreezePrev) Freeze f^t to extract features, and learn a classifier g^t over them [47], as sketched below.
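A minimal sketch of FreezePrev under illustrative assumptions: the previous TP's model is a ResNet-18 (in practice one would load the TP^{t-1} checkpoint rather than a fresh network), and the new ontology has |Y^t| = 810 classes as in iNat-LECO.

```python
import torch
import torch.nn as nn
import torchvision

f = torchvision.models.resnet18()
f.fc = nn.Identity()            # keep only the feature extractor f^{t-1}
for p in f.parameters():
    p.requires_grad_(False)     # FreezePrev: f^t := f^{t-1}, kept frozen
f.eval()                        # also fix batch-norm statistics

g = nn.Linear(512, 810)         # new classifier g^t over Y^t
opt = torch.optim.SGD(g.parameters(), lr=0.1, momentum=0.9)

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """Train only the new head; features come from the frozen extractor."""
    with torch.no_grad():
        z = f(x)
    loss = nn.functional.cross_entropy(g(z), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```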