Margin-Based Few-Shot Class-Incremental Learning
with Class-Level Overfitting Mitigation
Yixiong Zou1, Shanghang Zhang2, Yuhua Li1and Ruixuan Li1
1School of Computer Science and Technology, Huazhong University of Science and Technology
2School of Computer Science, Peking University
1{yixiongz, idcliyuhua, rxli}@hust.edu.cn, 2shanghang@pku.edu.cn
Abstract
Few-shot class-incremental learning (FSCIL) is designed to incrementally recognize novel classes with only a few training samples after (pre-)training on base classes with sufficient samples, and it therefore cares about both base-class performance and novel-class generalization. A well-known modification to the base-class training is to apply a margin to the base-class classification. However, a dilemma exists: we can hardly achieve both good base-class performance and good novel-class generalization simultaneously by applying a margin during the base-class training, and this dilemma is still under-explored. In this paper, we study the cause of this dilemma for FSCIL. We first interpret the dilemma as a class-level overfitting (CO) problem from the aspect of pattern learning, and then find that its cause lies in the easily satisfied constraint of learning margin-based patterns. Based on this analysis, we propose a novel margin-based FSCIL method that mitigates the CO problem by providing the pattern learning process with an extra constraint derived from the margin-based patterns themselves. Extensive experiments on CIFAR100, Caltech-UCSD Birds-200-2011 (CUB200), and miniImageNet demonstrate that the proposed method effectively mitigates the CO problem and achieves state-of-the-art performance.
1 Introduction
With the development of deep learning, deep neural networks have gradually demonstrated superior performance on the recognition of pre-defined classes given large amounts of training data [31, 16]. However, the models' generalization to downstream novel classes is much less explored and still needs to be improved [20, 14]. To deal with this problem, the few-shot class-incremental learning (FSCIL) task [17, 29, 4, 35, 44, 48] comes into sight. FSCIL first (pre-)trains a model on a set of pre-defined classes (base classes), and then generalizes the model to incremental novel classes with only a few training samples, simulating the human ability to continually learn novel concepts from only a few examples, and emphasizing both the performance on the pre-defined base classes and the generalization to the downstream novel classes.
However, a dilemma has recently been revealed [20, 7, 9]: better loss functions, which lead to higher performance on the pre-training data, can lead to worse generalization on downstream tasks. As introduced by [24] and depicted in Fig. 1, a similar phenomenon also exists in the FSCIL task: a positive classification margin [24, 38, 11, 33] applied to the classification of the base-class (pre-)training can lead to higher base-class performance but lower novel-class performance, while a negative margin can result in lower base-class performance but higher novel-class performance. Although this dilemma widely exists in tasks involving novel-class generalization, such as few-shot learning (FSL) and FSCIL, only a few works [24] have tried to explore its cause, and their findings can hardly be used to handle it. Due to space limitations, we provide extended related works in the appendix.
Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04524v1 [cs.CV] 10 Oct 2022
[Figure 1: feature illustrations for four settings — No Margin, Positive Margin, Negative Margin, Ours — each showing Base Class and Novel Class features.]
Figure 1: A dilemma exists between base-class performance and downstream novel-class generalization. By applying positive margins, base-class features are better separated, which indicates better base-class performance, but the novel-class features are confused, which indicates lower novel-class generalization. In contrast, by applying negative margins, base-class features are confused but the novel-class features are better separated. In this paper, we study the cause of this dilemma for the few-shot class-incremental learning task, and propose a method to mitigate it so as to better separate both base and novel classes.
In this paper, we study the cause of the dilemma in margin-based classification for the FSCIL problem from the aspect of pattern learning. We find that this dilemma can be understood as a class-level overfitting (CO) problem, which can be interpreted through the fitness of the learned patterns to each base class. The fitness determines how much the learned patterns are specific to some base classes or shared among classes, making the learned patterns either discriminative (tending to overfit base classes) or transferable (tending to underfit base classes) and thus causing the dilemma. Based on this interpretation, we discover that the cause of the dilemma lies in the easily satisfied constraint of learning shared or class-specific patterns. Therefore, we further design a novel FSCIL method that mitigates the dilemma of CO by providing the pattern learning process with an extra constraint from the margin-based patterns themselves, improving performance on both base and novel classes as shown in Fig. 1 and achieving state-of-the-art performance in terms of the all-class accuracy. Our contributions are:
• We interpret the dilemma of margin-based classification as a class-level overfitting problem from the aspect of pattern learning.
• We find that the cause of the class-level overfitting problem lies in the easily satisfied constraint of learning shared or class-specific patterns.
• We propose a novel FSCIL method to mitigate the class-level overfitting problem based on the interpretation and analysis of its cause.
• Extensive experiments on three public datasets verify the rationale of the model design and show that we achieve state-of-the-art performance.
2 Interpreting the Dilemma of Few-Shot Class-Incremental Learning
In this section, we first describe the Few-Shot Class-Incremental Learning (FSCIL) task and the
baseline model, and then conduct experiments to analyze the dilemma.
2.1 Task and Baseline Description
The FSCIL task aims to incrementally recognize novel classes with only a few training samples. Basically, the model is first (pre-)trained on a set of base classes with sufficient training samples (a.k.a. the base session), then confronted with novel classes with limited training samples (a.k.a. the incremental sessions), and finally required to recognize test samples from all encountered classes.
Specifically, given the base-session dataset $D_0 = \{(x_i, y_i)\}_{i=1}^{n_0}$ with label space $Y_0$, the model is trained to recognize all $|Y_0|$ classes from $Y_0$ by minimizing the loss
$$\sum_{(x_i, y_i) \in D_0} L(\phi(x_i), y_i), \qquad (1)$$
where $L(\cdot, \cdot)$ is typically a cross-entropy loss and $\phi(\cdot)$ is the predictor, composed of a backbone network $f(\cdot)$ for feature extraction and a linear classifier, represented as $\phi(x) = W^{\top} f(x)$ with $\phi(x) \in \mathbb{R}^{N_0 \times 1}$, $W \in \mathbb{R}^{d \times N_0}$, and $f(x) \in \mathbb{R}^{d \times 1}$. Typically, $f(x)$ and $W$ are $L_2$-normalized [44].
When the $k$-th incremental session arrives, the model needs to learn from its training data $D_k = \{(x_i, y_i)\}_{i=1}^{n_k}$. The classifier weights are extended to represent the novel label space $Y_k$ introduced by this session, i.e., $W = \{w_1^0, w_2^0, \ldots, w_{|Y_0|}^0\} \cup \ldots \cup \{w_1^k, \ldots, w_{|Y_k|}^k\}$, where $w_j^k$ denotes the classifier weight corresponding to the $j$-th class of the $k$-th session.
A strong baseline [44] is to freeze the model's parameters to avoid the catastrophic forgetting brought by fine-tuning on novel classes. For the incremental sessions (i.e., $k > 0$), the average of the features extracted from the training data is used as the classifier weight [44] (a.k.a. the prototype), i.e., $w_j^k = \frac{1}{n_k^j} \sum_{i=1}^{n_k^j} f(x_i)$, where $n_k^j$ denotes the number of training samples of class $j$ in session $k$. As this baseline focuses on the base-class training, in this paper the term training, if not otherwise stated, refers to the base-class training. Finally, the performance on the $k$-th session is obtained by classifying the test samples from all $\sum_{i=0}^{k} |Y_i|$ encountered classes.
2.2 Margin-Based Classification
A well-known modification to the base-class training loss (Eq. 1) is to integrate a margin [24, 11, 38] as
$$L(x_i, y_i) = -\log \frac{e^{\tau\,(w_{y_i} \cdot f(x_i) - m)}}{e^{\tau\,(w_{y_i} \cdot f(x_i) - m)} + \sum_{j \neq y_i} e^{\tau\, w_j \cdot f(x_i)}}, \qquad (2)$$
where $w_{y_i}$ refers to the classifier weight for class $y_i$, $\tau$ is typically set to 16.0, and $m$ is the margin.
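The following is a minimal PyTorch sketch of Eq. 2, assuming cosine logits between the L2-normalized features and classifier weights from Sec. 2.1; the function name and arguments are illustrative. A positive m gives a positive margin and a negative m a negative margin.

```python
import torch
import torch.nn.functional as F

def margin_based_loss(features, weights, labels, m=0.0, tau=16.0):
    """Cross-entropy where the margin m is subtracted from the ground-truth cosine logit (Eq. 2)."""
    f = F.normalize(features, dim=1)                       # L2-normalized f(x_i)
    w = F.normalize(weights, dim=1)                        # L2-normalized classifier weights w_j
    logits = f @ w.t()                                     # cosine similarities w_j · f(x_i)
    logits = logits - m * F.one_hot(labels, logits.size(1)).float()  # apply m only to class y_i
    return F.cross_entropy(tau * logits, labels)           # -log softmax of the scaled logits
```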
As analyzed in [24], a dilemma empirically exists: a positive margin can improve the base-class performance but harm the novel-class generalization, and conversely, a negative margin can contribute to the novel-class performance but decrease that of the base classes, as shown in Fig. 1. A similar phenomenon has been observed in other works such as [20], where a better loss function for the pre-training task can harm the generalization on downstream tasks.
2.3 Interpretation of Class-Level Overfitting from Pattern Learning View
Experiments are conducted on the CIFAR100 [22] dataset and reported in Fig. 2. CIFAR100 contains 100 classes in all. Following the split of [35], 60 classes are chosen as base classes, and the remaining 40 classes (with 5 training samples in each class) are chosen as novel classes². Experiments are conducted on the last incremental session, where all 100 classes are involved.
From Fig. 2 (left), we can see that as the margin increases, the base-class accuracy increases while the novel-class accuracy decreases, which is consistent with [24] and validates that the dilemma exists. Compared with the well-known overfitting between training and testing data, this dilemma, although measured entirely on testing data, resembles overfitting to base classes rather than to samples. Therefore, we term it class-level overfitting (CO). Additionally, the balance is reached when no margin is added, i.e., FSCIL cannot be improved by simply applying the margin. For this dilemma, [24] offered an explanation based on the degraded mapping from novel to base classes; however, it can hardly be used to develop methods for handling the dilemma. In this section, we go a step further and explain the phenomenon from the aspect of pattern learning, in order to develop methods that handle it.
A pattern denotes a part of the information that the model extracts from the input, and it provides a finer-grained level at which to analyze the model's behavior. As studied in the interpretability of deep networks [47, 2], each channel of the feature extracted by a deep network can correspond to a certain pattern of the input², and these patterns can be viewed as composing the base and novel classes [49]. Therefore, we conduct experiments on feature channels to study the patterns learned by applying different margins.
2.3.1 Class-Level Overfitting Interpreted by Pattern Fitness to Base Classes
Pattern's fitness to each base class. We first evaluate the sparsity of the base-class patterns, measured by the $L_1$ norm of each feature vector. As the extracted features are $L_2$-normalized, the smaller the $L_1$ norm of a feature is, the sparser its highly activated patterns are. Results are plotted in Fig. 2 (mid), where we can see a consistent decrease in the $L_1$ norm as the margin increases, which means the model needs fewer activated patterns (channels) to represent each base class. As the number of activated patterns decreases, the effectiveness of each activated pattern must increase to account for the performance increase in the base-class pre-training in Fig. 2 (left). Therefore, we hold that as the margin increases, the patterns learned by the model fit each base class better.
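This sparsity statistic can be computed along the lines of the following sketch, assuming a frozen backbone and a base-class data loader; the function name is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_l1_norm(backbone, loader):
    """Mean L1 norm of L2-normalized features; a smaller value means sparser high activations."""
    total, count = 0.0, 0
    for images, _ in loader:
        feats = F.normalize(backbone(images), dim=1)   # each feature has unit L2 norm
        total += feats.abs().sum(dim=1).sum().item()   # per-sample L1 norms, accumulated
        count += feats.size(0)
    return total / count
```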
²Please refer to the appendix for details.
Figure 2: Left: Class-level overfitting exists between base and novel classes, and simply applying margins to the training cannot help the overall performance. Mid: Patterns fit base classes more as the margin increases, making them more discriminative but less transferable. Right: The transferability of patterns decreases as the margin increases, pushing classes away from each other.
Pattern fitness measured by the template-matching score. To further verify the increase in fitness, we view each pattern as a semantic template and measure its matching score to each base class. As analyzed in [47] and [2], each pattern can be understood as a template [5] that the model matches against the input (so that each class has its own set of templates for recognition), and the activation can be viewed as the matching score. Therefore, we can estimate how much the patterns fit (match) each class by finding the most important patterns for that class and comparing their activations. As analyzed in [47] and [49], patterns (channels) with higher weights in the classification layer are more important, and the most important ones dominate the model's decisions. Therefore, given an input, we select its most important patterns according to the top classification weights of its ground-truth class and record the average activation on these patterns. The mean value of such top activation across all samples is denoted as MTA in Fig. 2 (mid). As can be seen, MTA increases consistently as the margin increases, which further verifies that the patterns fit each base class better.
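A sketch of how such an MTA statistic could be computed is given below; the number of selected channels k is an assumption, as its exact value is not stated in this excerpt, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_top_activation(backbone, classifier_weights, loader, k=10):
    """Average activation on the channels with the top-k classification weights of each
    sample's ground-truth class, then averaged over all samples (MTA)."""
    per_sample = []
    for images, labels in loader:
        feats = F.normalize(backbone(images), dim=1)        # [B, d] features
        w = classifier_weights[labels]                      # [B, d] weights of the true classes
        top_idx = w.topk(k, dim=1).indices                  # most important channels per sample
        per_sample.append(feats.gather(1, top_idx).mean(dim=1))
    return torch.cat(per_sample).mean().item()
```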
Better pattern fitness, worse pattern transferability. As each pattern fits its corresponding base class better, its discriminability increases accordingly, but can it still be transferred across classes? To answer this, we test the transferability of patterns. Since classes are related (e.g., cat and tiger), transferable patterns activated in one class should also be activated in other classes (e.g., felid patterns). Therefore, we first find the important patterns for each base class by the classification weights, then record the activation of these patterns on other classes, and measure the transferability of patterns by the mean value of this other-class activation. The results are plotted in Fig. 2 (right). As can be seen, the transferability consistently decreases as the margin increases. Combining this result with Fig. 2 (mid), we hold that patterns tend to be less transferable when they fit each base class better.
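The transferability measure can be sketched as follows, again with k and the variable names as illustrative assumptions; features are assumed to be the L2-normalized base-class features.

```python
import torch

@torch.no_grad()
def other_class_activation(features, labels, classifier_weights, k=10):
    """Mean activation of each class's top-k weighted channels on samples of OTHER classes.
    features: [N, d] (L2-normalized), labels: [N], classifier_weights: [C, d]."""
    top_idx = classifier_weights.topk(k, dim=1).indices     # [C, k] important channels per class
    scores = []
    for c in range(classifier_weights.size(0)):
        other = features[labels != c]                       # samples not belonging to class c
        scores.append(other[:, top_idx[c]].mean())          # mean activation on class-c channels
    return torch.stack(scores).mean().item()
```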
Discussion. The fitness also reflects how much a given pattern is specific to a base class. Imagine the extreme situation where each base class needs only one pattern for its representation: the fitness would reach its upper bound, making that pattern entirely specific to the corresponding class. Therefore, we interpret that the higher the margin is, the more specific (overfitting) the patterns are to each base class, which makes the patterns more discriminative but less transferable. Meanwhile, the lower the margin is, the more the patterns are shared between classes (underfitting), making them more transferable but less discriminative. The CO dilemma lies in the fact that patterns can hardly be both class-specific and shared among classes by simply applying the classification margin.
2.3.2 Inherent Class Relations Lead to the Change in Pattern’s Base-Class Fitness
Pattern's fitness negatively influences class relations. In Fig. 2 (right), we also plot the class relations w.r.t. the margins. The class relations are measured by the average of the cosine similarities between every two classes' prototypes. As can be seen, the relation drops as the margin grows, consistent with the trend of the patterns' transferability. This is reasonable because if two prototypes share some patterns, the activations of the corresponding channels will be similar, making the cosine similarity larger. As the transferability of patterns is negatively related to the patterns' base-class fitness, we hold that the class relations are also negatively related to the base-class fitness.
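This class-relation measure is simply the mean pairwise cosine similarity between class prototypes and can be sketched as follows; the function name is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_relation(prototypes):
    """Average cosine similarity over all C*(C-1)/2 pairs of class prototypes."""
    p = F.normalize(prototypes, dim=1)                      # [C, d] L2-normalized prototypes
    sim = p @ p.t()                                         # [C, C] cosine similarity matrix
    iu = torch.triu_indices(sim.size(0), sim.size(1), offset=1)
    return sim[iu[0], iu[1]].mean().item()                  # mean over the strict upper triangle
```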
Inherent class relations influence pattern's fitness. The margin applied to the classification directly modifies the decision boundary between every two classes, and the decision boundary is related to the relationship between those two classes. Therefore, we study how the class relation influences the pattern's fitness to base classes. Specifically, given the 60 base classes, for the model trained without margins we first calculate the cosine similarity between every two different classes, which gives 60 × (60 − 1) / 2 = 1,770 relations denoted as $R_0$, representing the inherent relations between all classes. Similarly, we calculate 1,770 relations for the models trained with positive and negative margins, respectively, denoted as $R_{pos}$ and $R_{neg}$. Then we calculate $D_{pos} = R_{pos} - R_0$ and $D_{neg} = R_{neg} - R_0$.