Margin-Based Few-Shot Class-Incremental Learning
with Class-Level Overfitting Mitigation
Yixiong Zou1, Shanghang Zhang2, Yuhua Li1and Ruixuan Li1
1School of Computer Science and Technology, Huazhong University of Science and Technology
2School of Computer Science, Peking University
1{yixiongz, idcliyuhua, rxli}@hust.edu.cn, 2shanghang@pku.edu.cn
Abstract
Few-shot class-incremental learning (FSCIL) is designed to incrementally recognize novel classes with only a few training samples after (pre-)training on base classes with sufficient samples, and it therefore cares about both base-class performance and novel-class generalization. A well-known modification to the base-class training is to apply a margin to the base-class classification. However, a dilemma exists: we can hardly achieve both good base-class performance and good novel-class generalization simultaneously by applying a margin during the base-class training, and this dilemma is still under-explored. In this paper, we study the cause of this dilemma for FSCIL. We first interpret the dilemma as a class-level overfitting (CO) problem from the aspect of pattern learning, and then find that its cause lies in the easily satisfied constraint of learning margin-based patterns. Based on this analysis, we propose a novel margin-based FSCIL method that mitigates the CO problem by providing the pattern learning process with an extra constraint derived from the margin-based patterns themselves. Extensive experiments on CIFAR100, Caltech-UCSD Birds-200-2011 (CUB200), and miniImageNet demonstrate that the proposed method effectively mitigates the CO problem and achieves state-of-the-art performance.
1 Introduction
With the development of deep learning, deep neural networks have gradually demonstrated superior performance on the recognition of pre-defined classes given large amounts of training data [31, 16]. However, the models' generalization to downstream novel classes is much less explored and still needs to be improved [20, 14]. To deal with this problem, the few-shot class-incremental learning (FSCIL) task [17, 29, 4, 35, 44, 48] comes into sight. FSCIL first (pre-)trains a model on a set of pre-defined classes (base classes), and then generalizes the model to incremental novel classes with only a few training samples, simulating the human ability to continually learn novel concepts from only a few examples, and emphasizing both the performance on the pre-defined base classes and the generalization to the downstream novel classes.
However, a dilemma has recently been revealed [20, 7, 9]: better loss functions, which lead to higher performance on the pre-training data, can lead to worse generalization on downstream tasks. As introduced by [24] and depicted in Fig. 1, a similar phenomenon also exists in the FSCIL task: a positive classification margin [24, 38, 11, 33] applied to the classification of the base-class (pre-)training can lead to higher base-class performance but lower novel-class performance, while a negative margin can result in lower base-class performance but higher novel-class performance. Although this dilemma widely exists in tasks involving novel-class generalization, such as few-shot learning (FSL) and FSCIL, only a few works [24] have tried to explore its cause, and their findings can hardly be used to handle it. Due to space limitations, we provide extended related works in the appendix.
Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04524v1 [cs.CV] 10 Oct 2022
[Figure 1: feature illustrations for four settings — No Margin, Positive Margin, Negative Margin, Ours — each showing Base Class and Novel Class features.]
Figure 1: A dilemma exists between base-class performance and downstream novel-class generalization. By applying positive margins, base-class features are better separated, which indicates better base-class performance, but the novel-class features are confused, which indicates lower novel-class generalization. In contrast, by applying negative margins, base-class features are confused but the novel-class features are better separated. In this paper, we study the cause of this dilemma for the few-shot class-incremental learning task, and propose a method to mitigate it so as to better separate both base and novel classes.
In this paper, we study the cause of the dilemma in margin-based classification for the FSCIL problem from the aspect of pattern learning. We find that this dilemma can be understood as a class-level overfitting (CO) problem, which can be interpreted through the fitness of the learned patterns to each base class. The fitness determines how much the learned patterns are specific to some base classes or shared among classes, making the learned patterns either discriminative (tending to overfit base classes) or transferable (tending to underfit base classes) and thus causing the dilemma. Based on this interpretation, we discover that the cause of the dilemma lies in the easily satisfied constraint of learning shared or class-specific patterns. Therefore, we further design a novel FSCIL method that mitigates the dilemma of CO by providing the pattern learning process with an extra constraint from the margin-based patterns themselves, improving performance on both base and novel classes as shown in Fig. 1 and achieving state-of-the-art performance in terms of the all-class accuracy. Our contributions are:
• We interpret the dilemma of margin-based classification as a class-level overfitting problem from the aspect of pattern learning.
• We find that the cause of the class-level overfitting problem lies in the easily satisfied constraint of learning shared or class-specific patterns.
• We propose a novel FSCIL method to mitigate the class-level overfitting problem based on the interpretation and analysis of its cause.
• Extensive experiments on three public datasets verify the rationale of the model design and show that we achieve state-of-the-art performance.
2 Interpreting the Dilemma of Few-Shot Class-Incremental Learning
In this section, we first describe the Few-Shot Class-Incremental Learning (FSCIL) task and the
baseline model, and then conduct experiments to analyze the dilemma.
2.1 Task and Baseline Description
The FSCIL task aims to incrementally recognize novel classes with only a few training samples. Basically, the model is first (pre-)trained on a set of base classes with sufficient training samples (a.k.a. the base session), then confronted with novel classes with limited training samples (a.k.a. the incremental sessions), and finally required to recognize test samples from all encountered classes.
Specifically, given the base-session dataset $D_0 = \{(x_i, y_i)\}_{i=1}^{n_0}$ with label space $Y_0$, the model is trained to recognize all $|Y_0|$ classes from $Y_0$ by minimizing the loss
$$\sum_{(x_i, y_i) \in D_0} L(\phi(x_i), y_i), \qquad (1)$$
where $L(\cdot, \cdot)$ is typically a cross-entropy loss and $\phi(\cdot)$ is the predictor, composed of a backbone network $f(\cdot)$ for feature extraction and a linear classifier, represented as $\phi(x) = W^{\top} f(x)$ with $\phi(x) \in \mathbb{R}^{N_0 \times 1}$, $W \in \mathbb{R}^{d \times N_0}$, and $f(x) \in \mathbb{R}^{d \times 1}$. Typically, $f(x)$ and $W$ are $L_2$-normalized [44].
When the $k$-th incremental session arrives, the model needs to learn from its training data $D_k = \{(x_i, y_i)\}_{i=1}^{n_k}$. The classifier weights are extended to represent the novel label space $Y_k$ introduced by this session, i.e., $W = \{w_1^0, w_2^0, \ldots, w_{|Y_0|}^0\} \cup \ldots \cup \{w_1^k, \ldots, w_{|Y_k|}^k\}$, where $w_j^k$ denotes the classifier weight corresponding to the $j$-th class of the $k$-th session.
A strong baseline [44] is to freeze the model's parameters to avoid the catastrophic forgetting brought by fine-tuning on novel classes. For the incremental sessions (i.e., $k > 0$), the average of the features extracted from the training data is used as the classifier weight [44] (a.k.a. the prototype), i.e., $w_j^k = \frac{1}{n_k^j} \sum_{i=1}^{n_k^j} f(x_i)$, where $n_k^j$ denotes the number of training samples of class $j$ in session $k$. As this baseline focuses on the base-class training, in this paper the term training, if not otherwise stated, refers to the base-class training. Finally, the performance on the $k$-th session is obtained by classifying the test samples from all $\sum_{i=0}^{k} |Y_i|$ encountered classes.
2.2 Margin-Based Classification
A well-known modification to the base-class training loss (Eq. 1) is to integrate a margin [24, 11, 38] as
$$L(x_i, y_i) = -\log \frac{e^{\tau\,(w_{y_i} \cdot f(x_i) - m)}}{e^{\tau\,(w_{y_i} \cdot f(x_i) - m)} + \sum_{j \neq y_i} e^{\tau\, w_j \cdot f(x_i)}}, \qquad (2)$$
where $w_{y_i}$ refers to the classifier weight for class $y_i$, $\tau$ is typically set to 16.0, and $m$ is the margin.
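The following is a minimal PyTorch sketch of Eq. 2, assuming cosine logits between the L2-normalized features and classifier weights from Sec. 2.1; the function name and arguments are illustrative. A positive m gives a positive margin and a negative m a negative margin.

```python
import torch
import torch.nn.functional as F

def margin_based_loss(features, weights, labels, m=0.0, tau=16.0):
    """Cross-entropy where the margin m is subtracted from the ground-truth cosine logit (Eq. 2)."""
    f = F.normalize(features, dim=1)                       # L2-normalized f(x_i)
    w = F.normalize(weights, dim=1)                        # L2-normalized classifier weights w_j
    logits = f @ w.t()                                     # cosine similarities w_j · f(x_i)
    logits = logits - m * F.one_hot(labels, logits.size(1)).float()  # apply m only to class y_i
    return F.cross_entropy(tau * logits, labels)           # -log softmax of the scaled logits
```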
As analyzed in [24], a dilemma empirically exists: a positive margin can improve the base-class performance but harm the novel-class generalization, and conversely, a negative margin can contribute to the novel-class performance but decrease that of the base classes, as shown in Fig. 1. A similar phenomenon has been observed in other works such as [20], where a better loss function for the pre-training task can harm the generalization on downstream tasks.
2.3 Interpretation of Class-Level Overfitting from Pattern Learning View
Experiments are conducted on the CIFAR100 [22] dataset and reported in Fig. 2. CIFAR100 contains 100 classes in all. Following the split of [35], 60 classes are chosen as base classes, and the remaining 40 classes (with 5 training samples in each class) are chosen as novel classes². Experiments are conducted on the last incremental session, where all 100 classes are involved.
From Fig. 2 (left), we can see that as the margin increases, the base-class accuracy increases while the novel-class accuracy decreases, which is consistent with [24] and validates that the dilemma exists. Compared with the well-known overfitting between training and testing data, this dilemma, although measured entirely on testing data, resembles overfitting to base classes rather than to samples. Therefore, we term it class-level overfitting (CO). Additionally, the balance is reached when no margin is added, i.e., FSCIL cannot be improved by simply applying the margin. For this dilemma, [24] offered an explanation based on the degraded mapping from novel to base classes; however, it can hardly be used to develop methods for handling the dilemma. In this section, we go a step further and explain the phenomenon from the aspect of pattern learning, in order to develop methods that handle it.
A pattern denotes a part of the information that the model extracts from the input, and it provides a finer-grained level at which to analyze the model's behavior. As studied in the interpretability of deep networks [47, 2], each channel of the feature extracted by a deep network can correspond to a certain pattern of the input², and these patterns can be viewed as composing the base and novel classes [49]. Therefore, we conduct experiments on feature channels to study the patterns learned by applying different margins.
2.3.1 Class-Level Overfitting Interpreted by Pattern Fitness to Base Classes
Pattern's fitness to each base class. We first evaluate the sparsity of the base-class patterns, measured by the $L_1$ norm of each feature vector. As the extracted features are $L_2$-normalized, the smaller the $L_1$ norm of a feature is, the sparser its highly activated patterns are. Results are plotted in Fig. 2 (mid), where we can see a consistent decrease in the $L_1$ norm as the margin increases, which means the model needs fewer activated patterns (channels) to represent each base class. As the number of activated patterns decreases, the effectiveness of each activated pattern must increase to account for the performance increase in the base-class pre-training in Fig. 2 (left). Therefore, we hold that as the margin increases, the patterns learned by the model fit each base class better.
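This sparsity statistic can be computed along the lines of the following sketch, assuming a frozen backbone and a base-class data loader; the function name is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_l1_norm(backbone, loader):
    """Mean L1 norm of L2-normalized features; a smaller value means sparser high activations."""
    total, count = 0.0, 0
    for images, _ in loader:
        feats = F.normalize(backbone(images), dim=1)   # each feature has unit L2 norm
        total += feats.abs().sum(dim=1).sum().item()   # per-sample L1 norms, accumulated
        count += feats.size(0)
    return total / count
```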
²Please refer to the appendix for details.
Figure 2: Left: Class-level overfitting exists between base and novel classes, and simply applying margins to the training cannot help the overall performance. Mid: Patterns fit base classes more as the margin increases, making them more discriminative but less transferable. Right: The transferability of patterns decreases as the margin increases, pushing classes away from each other.
Pattern fitness measured by the template-matching score. To further verify the increase in fitness, we view each pattern as a semantic template and measure its matching score to each base class. As analyzed in [47] and [2], each pattern can be understood as a template [5] that the model matches against the input (so that each class has its own set of templates for recognition), and the activation can be viewed as the matching score. Therefore, we can estimate how much the patterns fit (match) each class by finding the most important patterns for that class and comparing their activations. As analyzed in [47] and [49], patterns (channels) with higher weights in the classification layer are more important, and the most important ones dominate the model's decisions. Therefore, given an input, we select its most important patterns according to the top classification weights of its ground-truth class and record the average activation on these patterns. The mean value of such top activation across all samples is denoted as MTA in Fig. 2 (mid). As can be seen, MTA increases consistently as the margin increases, which further verifies that the patterns fit each base class better.
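A sketch of how such an MTA statistic could be computed is given below; the number of selected channels k is an assumption, as its exact value is not stated in this excerpt, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_top_activation(backbone, classifier_weights, loader, k=10):
    """Average activation on the channels with the top-k classification weights of each
    sample's ground-truth class, then averaged over all samples (MTA)."""
    per_sample = []
    for images, labels in loader:
        feats = F.normalize(backbone(images), dim=1)        # [B, d] features
        w = classifier_weights[labels]                      # [B, d] weights of the true classes
        top_idx = w.topk(k, dim=1).indices                  # most important channels per sample
        per_sample.append(feats.gather(1, top_idx).mean(dim=1))
    return torch.cat(per_sample).mean().item()
```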
Better pattern fitness, worse pattern transferability. As each pattern fits its corresponding base class better, its discriminability increases accordingly, but can it still be transferred across classes? To answer this, we test the transferability of patterns. Since classes are related (e.g., cat and tiger), transferable patterns activated in one class should also be activated in other classes (e.g., felid patterns). Therefore, we first find the important patterns for each base class by the classification weights, then record the activation of these patterns on other classes, and measure the transferability of patterns by the mean value of this other-class activation. The results are plotted in Fig. 2 (right). As can be seen, the transferability consistently decreases as the margin increases. Combining this result with Fig. 2 (mid), we hold that patterns tend to be less transferable when they fit each base class better.
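The transferability measure can be sketched as follows, again with k and the variable names as illustrative assumptions; features are assumed to be the L2-normalized base-class features.

```python
import torch

@torch.no_grad()
def other_class_activation(features, labels, classifier_weights, k=10):
    """Mean activation of each class's top-k weighted channels on samples of OTHER classes.
    features: [N, d] (L2-normalized), labels: [N], classifier_weights: [C, d]."""
    top_idx = classifier_weights.topk(k, dim=1).indices     # [C, k] important channels per class
    scores = []
    for c in range(classifier_weights.size(0)):
        other = features[labels != c]                       # samples not belonging to class c
        scores.append(other[:, top_idx[c]].mean())          # mean activation on class-c channels
    return torch.stack(scores).mean().item()
```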
Discussion. The fitness also reflects how much a given pattern is specific to a base class. Imagine the extreme situation where each base class needs only one pattern for its representation: the fitness would reach its upper bound, making that pattern entirely specific to the corresponding class. Therefore, we interpret that the higher the margin is, the more specific (overfitting) the patterns are to each base class, which makes the patterns more discriminative but less transferable. Meanwhile, the lower the margin is, the more the patterns are shared between classes (underfitting), making them more transferable but less discriminative. The CO dilemma lies in the fact that patterns can hardly be both class-specific and shared among classes by simply applying the classification margin.
2.3.2 Inherent Class Relations Lead to the Change in Pattern’s Base-Class Fitness
Pattern's fitness negatively influences class relations. In Fig. 2 (right), we also plot the class relations w.r.t. the margins. The class relations are measured by the average of the cosine similarities between every two classes' prototypes. As can be seen, the relation drops as the margin grows, consistent with the trend of the patterns' transferability. This is reasonable because if two prototypes share some patterns, the activations of the corresponding channels will be similar, making the cosine similarity larger. As the transferability of patterns is negatively related to the patterns' base-class fitness, we hold that the class relations are also negatively related to the base-class fitness.
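This class-relation measure is simply the mean pairwise cosine similarity between class prototypes and can be sketched as follows; the function name is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_relation(prototypes):
    """Average cosine similarity over all C*(C-1)/2 pairs of class prototypes."""
    p = F.normalize(prototypes, dim=1)                      # [C, d] L2-normalized prototypes
    sim = p @ p.t()                                         # [C, C] cosine similarity matrix
    iu = torch.triu_indices(sim.size(0), sim.size(1), offset=1)
    return sim[iu[0], iu[1]].mean().item()                  # mean over the strict upper triangle
```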
Inherent class relations influence pattern's fitness. The margin applied to the classification directly modifies the decision boundary between every two classes, and the decision boundary is related to the relationship between those two classes. Therefore, we study how the class relation influences the pattern's fitness to base classes. Specifically, given the 60 base classes, for the model trained without margins we first calculate the cosine similarity between every two different classes, which gives 60 × (60 − 1) / 2 = 1,770 relations denoted as $R_0$, representing the inherent relations between all classes. Similarly, we calculate 1,770 relations for the models trained with positive and negative margins, respectively, denoted as $R_{pos}$ and $R_{neg}$. Then we calculate $D_{pos} = R_{pos} - R_0$ and $D_{neg} = R_{neg} - R_0$.