The Equalization Losses: Gradient-Driven
Training for Long-tailed Object Recognition
Jingru Tan, Bo Li, Xin Lu, Yongqiang Yao, Fengwei Yu, Tong He,
and Wanli Ouyang, Senior Member, IEEE
Abstract—Long-tail distribution is widely spread in real-world applications. Due to the extremely small ratio of instances, tail categories
often show inferior accuracy. In this paper, we find such performance bottleneck is mainly caused by the imbalanced gradients, which
can be categorized into two parts: (1) positive part, deriving from the samples of the same category, and (2) negative part, contributed
by other categories. Based on comprehensive experiments, it is also observed that the gradient ratio of accumulated positives to
negatives is a good indicator to measure how balanced a category is trained. Inspired by this, we come up with a gradient-driven
training mechanism to tackle the long-tail problem: re-balancing the positive/negative gradients dynamically according to current
accumulative gradients, with a unified goal of achieving balanced gradient ratios. Taking advantage of the simple and flexible gradient
mechanism, we introduce a new family of gradient-driven loss functions, namely equalization losses. We conduct extensive
experiments on a wide spectrum of visual tasks, including two-stage/single-stage long-tailed object detection (LVIS), long-tailed image
classification (ImageNet-LT, Places-LT, iNaturalist), and long-tailed semantic segmentation (ADE20K). Our method consistently
outperforms the baseline models, demonstrating the effectiveness and generalization ability of the proposed equalization losses. Code
will be released at https://github.com/ModelTC/United-Perception.
Index Terms—Long-tailed Object Recognition, Object Detection, Image Classification, Semantic Segmentation
1 INTRODUCTION
Object recognition is one of the most fundamental tasks
in computer vision. It is an important step in a host
of visual challenges, including object detection, semantic
segmentation, and object tracking. Despite this fact, the task
remains an open problem, not least due to the discrep-
ancy among the proportions of different categories. Current
benchmarks such as ImageNet [1], PASCAL VOC [2], COCO
[3], and Cityscapes [4] are carefully collected with balanced
annotations for each category, which contradicts the long-
tailed Zipfian distribution in natural images. Although ex-
isting methods have achieved impressive results, we still
can observe performance bottlenecks [5], [6], [7] on various
benchmarks, especially in the non-dominant classes with
fewer samples. As substantiated by recent literature [8], [9],
[10], tail categories are easily overwhelmed by the head
categories when learning on a dataset whose distribution is imbalanced and diverse.
Previous approaches can be roughly categorised into two
groups: data resampling [6], [11], [12] and cost-sensitive
learning [13], [14]. These methods address the above prob-
lems by either designing complex sampling strategies or
adjusting loss weights. Although promising, most of these
methods are designed based on categories’ frequency and
often suffer from several drawbacks: (1) those frequency-based methods are not robust enough due to widespread easy negative samples [15] and redundant positive samples [14]; (2) the accuracy is sensitive to the predefined hyper-parameters.

Jingru Tan and Bo Li are with Tongji University, Shanghai, China.
E-mail: {tjr120,1911030}@tongji.edu.cn
Xin Lu, Fengwei Yu, and Yongqiang Yao are with SenseTime Research, Shanghai, China.
E-mail: {luxin,yufengwei}@sensetime.com, soundbupt@gmail.com
Wanli Ouyang is with the University of Sydney and Shanghai AI Laboratory.
E-mail: wanli.ouyang@sydney.edu.au
Tong He is with Shanghai AI Laboratory. E-mail: tonghe90@gmail.com
Corresponding author. Equal contribution.
In this paper, we tackle the problem of long-tailed recog-
nition from a novel perspective. We start by analyzing the
distribution of the accumulated gradients across different
categories. Specifically, the gradients of one category con-
sist of two parts: (1) the positive part, deriving from the
samples of the same category, and (2) the negative part,
contributed by other categories. Since the tail categories
have limited positive samples, their positive gradients can
be easily overwhelmed by the negative part. As illustrated
in Fig. 1 (bottom row), the gradient ratio of positives to negatives is distributed off balance when training on long-tailed datasets such as LVIS [6], ImageNet-LT [5], and ADE20K-LT [7]. The gradient ratio is close to 1 for the head categories but close to 0 for the tail categories. We
hypothesize that such gradient imbalance in training is the
main obstacle impeding tail classes from obtaining satisfac-
tory performance. Besides, we also conduct experiments on
the well-balanced datasets such as COCO [3], ImageNet [1],
and ADE20K [7]. It can be observed that all categories have
a gradient ratio close to 1 without introducing any bias
toward positives or negatives. Therefore, we believe that
the gradient ratio can serve as a significant indicator of how
balanced a category is trained.
Such a gradient-based indicator provides useful guid-
ance to adjust gradients of the positive and negative parts,
which can be easily plugged into different classifiers. To
this end, we adapt it to various loss functions. (1) For
binary cross-entropy loss (BCE), we introduce a gradient-
driven re-weighting mechanism and propose the sigmoid
Fig. 1: Gradient observation on four different types of tasks. Each column is responsible for a specific task. We define the
gradient of a sample as the derivative of the loss function with respect to its network output logits. For each category, we
demonstrate the accumulated gradient of positive samples (row 1), gradients of negative samples (row 2), and the gradient
ratio of positive samples to negative samples (row 3). Category indices are sorted by instance number, and the 80 COCO categories are aligned with the 1203 LVIS categories. The left and right y-axes correspond to the imbalanced and balanced datasets, respectively.
equalization loss (Sigmoid-EQL). It treats the overall classi-
fication problem as a set of independent binary classification
tasks. Then the accumulative gradient ratio is used to up-
weight the positive gradients and down-weight the negative
gradients accordingly, aiming to balance the gradients of the
two parts. (2) For cross-entropy loss (CE), we propose the
softmax equalization loss (Softmax-EQL), which calibrates
the decision boundary dynamically based on the statistics of
the gradients. (3) For focal loss (FL) [15], we come up with
the equalized focal loss (EFL) by decoupling the coefficients
in [15] into category-agnostic and category-specific parts. By
introducing the gradient into the category-specific parts, the
model is able to focus more on the learning of rare cate-
gories. Those losses do not rely on the pre-computed data
statistics to determine the rebalancing terms. Instead, they
control the training process in a dynamic way. This data-distribution-agnostic property makes them more suitable for streaming and real-world data.
To demonstrate the effectiveness of our proposed
method, comprehensive experiments have been conducted
on various datasets and tasks. For object detection on the
challenging LVIS [6] benchmark, our proposed Sigmoid-
EQL and Softmax-EQL outperform Mask R-CNN [16] by
about 6.4% and 5.7% in terms of AP, respectively. With-
out introducing extra computation overhead, our approach
improves the performance substantially. With the help of
equalization losses, we won first place in both the COCO-LVIS Challenge 2019 and 2020 [17]. We also validate the effectiveness of our proposed EFL on the task of single-stage object detection. Our method achieves 29.2% AP, delivering
significant improvements over state-of-the-art results. In ad-
dition to the effectiveness, the equalization losses also show
strong generalization ability when transferring to other
datasets and visual tasks. For example, equalization losses
maintain huge improvements when moving from LVIS to
Openimages [18] without further hyper-parameter tuning.
In Openimages, Sigmoid-EQL and EFL outperform the base-
line CE method by 9.1% AP and 6.6% AP, respectively. For
the image classification task, Softmax-EQL achieves state-
of-the-art results on three long-tailed image classification
datasets (ImageNet-LT, Places-LT, and iNaturalist 2018). We
also evaluate our method on semantic segmentation using
ADE20K [7]. Our proposed Sigmoid-EQL improves the
powerful baseline, DeepLabV3+ [19], by 1.56% and 2.27%
in terms of mIoU and mAcc, respectively, showing strong
generalization of gradient-driven losses to varying tasks.
2 RELATED WORK
Long-tailed Object Recognition. Common solutions for
long-tailed image recognition are data re-sampling and
loss re-weighting. Re-sampling methods under-sample the
head categories [20], [21] or over-sample the tail cate-
gories [11], [12], [22], [23]. Re-weighting methods assign dif-
ferent weights to different categories [14], [24], [25], [26] or
instances [15], [27], [28]. Decoupled training methods [29],
[30] address the classifier imbalance problem with a two-
stage training pipeline by decoupling the learning of rep-
resentation and classifier. In addition, margin calibration methods [10], [13], [31], [32] inject category-specific margins into the CE loss to re-balance the logit distributions of categories.
Recently, other works address the long-tailed problem from
different perspectives such as transfer learning [5], [33], [34],
supervised contrastive learning [35], [36], [37], [38], ensem-
ble learning [39], [40], [41], [42], [43], and so on. Some works
adapt those ideas to object detection [6], including data re-
sampling [6], [44], loss re-weighting [8], [9], [45], decoupled
training [30], [44], [46], margin calibration [10], [47], incre-
mental learning [48] and causal inference [49]. Despite the
efforts, most of them somehow utilize the sample number
as the indicator of imbalance to design their algorithms. In
contrast, we use the gradient as the indicator. It is more stable and precise and thus reflects the model's training status better.
Gradient as Indicator. There are some works [15], [27] that
attempt to solve the imbalance problems from the gradient
view. They use the instant gradient to reflect the learning
difficulty of a sample at a certain moment and determine
its loss contribution dynamically, which can be viewed as
an online version of hard negative example mining [50].
Those methods are designed for the serious foreground-
background imbalance problem. Different from them, we
use accumulative gradients to reflect the imbalanced training
status of categories. Our method is designed for long-tailed
object recognition. Meanwhile, our method is complemen-
tary to theirs. We can solve the foreground-background im-
balance problem and foreground-foreground (long-tailed)
problem simultaneously by combining the instant and accumulative gradient indicators.
3 GRADIENT IMBALANCE PROBLEM
In this section, we introduce the imbalanced gradients,
which we believe should be responsible for the inferior
performance of the tail categories. It comes from the entanglement of instances and categories, which we describe next. Building upon comprehensive experiments, we argue that such gradient statistics can serve as an effective indicator of the status of category classifiers.
Notation Definition. Suppose we have a training set $\mathcal{X}=\{x_i, y_i\}^{N}$ with $C$ categories. Let the total instance number over the dataset be $N$, with $N=\sum_{j=1}^{C} n_j$, where $n_j$ is the instance number of category $j$. For each iteration, we have a batch of instances $\mathcal{I}$ with a batch size of $B$. $\mathbf{Y}\in\mathbb{R}^{B\times C}$ are the one-hot labels of the batch. We adopt a CNN $f$ with parameters $\theta$ as the feature extractor. Then the feature representations of the batch can be computed by $f(\mathcal{I};\theta)$. A linear transformation is used as the classifier to output the logits $\mathbf{Z}\in\mathbb{R}^{B\times C}$: $\mathbf{Z}=\mathbf{W}^{T}f(\mathcal{I};\theta)+\mathbf{b}$, where $\mathbf{W}$ denotes the classifier weight matrix and $\mathbf{b}$ is the bias.

We denote each image as an instance. $\mathbf{W}$ can be regarded as $C$ classifiers, each of which is responsible for bi-categorizing instances as one class. Each instance can be regarded as a positive sample for one specific category and a negative sample for the remaining $C-1$ categories. We denote by $y_i^j\in\{0,1\}$ the label, which equals 1 if the $i$-th instance belongs to the $j$-th category.
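For concreteness, a minimal PyTorch sketch of this setup is given below; the backbone, feature dimension, image size, and tensor names are illustrative assumptions rather than the implementation used in this work.

```python
import torch
import torch.nn as nn

B, C, D = 16, 1203, 256                      # batch size, categories, feature dim (illustrative)
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, D))  # stand-in for the CNN f(.; theta)
classifier = nn.Linear(D, C)                 # holds the weight matrix W and the bias b

images = torch.randn(B, 3, 32, 32)           # a batch of instances I
labels = torch.randint(0, C, (B,))           # category index of each instance
Y = torch.zeros(B, C).scatter_(1, labels.unsqueeze(1), 1.0)  # one-hot labels Y in R^{B x C}

features = backbone(images)                  # f(I; theta)
Z = classifier(features)                     # Z = W^T f(I; theta) + b, shape (B, C)
```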
3.1 Entanglement of Instances and Categories
The total number of positive samples $M^{\text{pos}}_j$ and negative samples $M^{\text{neg}}_j$ for the $j$-th classifier can be easily obtained:

$$M^{\text{pos}}_j=\sum_{i\in\mathcal{X}} y_i^j, \quad M^{\text{neg}}_j=\sum_{i\in\mathcal{X}} (1-y_i^j) \quad (1)$$

The ratio of the number of positive samples to negative samples over the dataset is then:

$$\frac{M^{\text{pos}}_j}{M^{\text{neg}}_j}=\frac{n_j}{N-n_j}\approx\frac{n_j}{N}\ll 1 \quad (2)$$

From Eq. 2, we observe $M^{\text{pos}}_j\ll M^{\text{neg}}_j$ for the tail categories that have a very limited number of instances, indicating that these categories often suffer from an extremely imbalanced ratio of positive to negative samples. Previous methods [13], [14] address the problem by applying different loss weights or decision margins to different categories according to their sample numbers. However, they often fail to generalize well to other datasets because the sample numbers cannot reflect the training status of each classifier well. For example, a large number of easy negative samples and some redundant positive samples hardly contribute to the learning of the model. In contrast, we propose to use gradient statistics as our metric to indicate whether a category is in a balanced training status, and we conjecture that there is a similar positive-negative imbalance problem in the gradient ratios of rare categories.
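As an illustrative example (with numbers chosen for exposition rather than taken from a particular dataset), a tail category with $n_j = 50$ instances in a dataset of $N = 100{,}000$ instances gives $M^{\text{pos}}_j/M^{\text{neg}}_j = 50/99{,}950 \approx 5\times 10^{-4}$, i.e., each positive sample is outnumbered by roughly two thousand negatives.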
3.2 Gradient Computation
We define the gradient over the batch $\mathcal{I}$ as the derivative of the objective cost function $\mathcal{L}$ with respect to the logits $\mathbf{Z}$. The gradient $\mathbf{G}=\frac{\partial\mathcal{L}}{\partial\mathbf{Z}}\in\mathbb{R}^{B\times C}$ corresponds to the gradients of all samples belonging to the $C$ categories. We denote the gradient of a certain sample as $g_i^j$. Then the positive gradient $g(t)^{\text{pos}}_j$ and negative gradient $g(t)^{\text{neg}}_j$ of category $j$ at iteration $t$ can be computed as follows:

$$g(t)^{\text{pos}}_j=\sum_{i\in\mathcal{I}} y_i^j \,|g_i^j|, \quad g(t)^{\text{neg}}_j=\sum_{i\in\mathcal{I}} (1-y_i^j)\,|g_i^j| \quad (3)$$

The accumulated positive gradients $G(T)^{\text{pos}}_j$ and negative gradients $G(T)^{\text{neg}}_j$ at iteration $T$ can be defined as:

$$G(T)^{\text{pos}}_j=\sum_{t=0}^{T} g(t)^{\text{pos}}_j, \quad G(T)^{\text{neg}}_j=\sum_{t=0}^{T} g(t)^{\text{neg}}_j \quad (4)$$

For simplicity, we omit $T$ in the notation and directly adopt $G^{\text{pos}}_j$ and $G^{\text{neg}}_j$ as the accumulated positive and negative gradients, respectively. The accumulated gradient ratio is calculated as $G_j=\frac{G^{\text{pos}}_j}{G^{\text{neg}}_j}$.
3.3 Gradient Observation
To validate our hypothesis that rare categories suffer from
gradient imbalance problems, we collect gradient statistics
during the training process across a wide spectrum of recog-
nition tasks and datasets, including long-tailed image clas-
sification (ImageNet-LT [5]), two-stage/single-stage long-
tailed object detection (LVIS [6]), and long-tailed semantic
segmentation (ADE20K [7]). The results are shown in Fig.
1. We consistently observe four key phenomena: (1) the positive gradients $G^{\text{pos}}$ follow a long-tailed distribution; (2) the negative gradients $G^{\text{neg}}$ follow a long-tailed distribution; (3) the magnitudes of the positive and negative gradients differ across categories, so their ratio $G$ also follows a long-tailed distribution; (4) the gradient ratio $G$ of head categories is close to 1, while that of tail categories is close to 0.
The x-axis of Fig. 1 is sorted by category instance numbers $n_j$. We notice that the positive gradients are positively correlated with the positive sample number, while the negative gradients are not positively correlated with the negative sample number. This is because tail categories receive scarcely any training: the model hardly ever predicts the tail categories as positives, so most of their negative samples are easy samples. Although easy negative
samples have small gradients, the effect accumulated from
a large number of them is not negligible. The observation
of the gradient ratio proves that the category classifiers
with fewer samples suffer from a more serious gradient
imbalance problem between positives and negatives, which
validates our conjecture. For the head categories with abun-
dant training samples, the received gradient ratio of the
corresponding classifier is close to 1, indicating the classifier
gives no inclination to positives or negatives, which we
refer to as a balanced training status. For the tail classifier,
the gradient ratio is close to 0, indicating the classifier is
heavily biased towards negative, which we refer to as an
imbalanced training status. It is worth noting that there is
a vast number of background negative samples in single-
stage object detection, as discussed in [15]. We find that the
head categories still have a gradient ratio close to 1 in a well-
trained model, as shown in Fig. 1 (column 3). This proves
that the gradient indicator is more stable and reliable than
the sample number.
As illustrated in Fig. 1, more experiments are conducted on several datasets that have a similar number of instances across categories, including ImageNet [1] and COCO [3]. We observe that the gradients are distributed in a balanced way and that the category classifiers have a balanced gradient ratio close to 1.
By observing the gradient statistics under imbalanced and balanced data distributions, we conclude that the gradients (i.e., $G^{\text{pos}}_j$, $G^{\text{neg}}_j$) and the gradient ratio (i.e., $G_j$) can serve as an important indicator of the training status of categories.
4 THE EQUALIZATION LOSSES
Our central idea is to utilize gradient statistics as an in-
dicator to reflect the training status of category classifiers
and then adjust their training process dynamically. In this
section, by applying this idea to several loss functions, such
as binary cross-entropy loss, cross-entropy loss, and focal
loss, we introduce a new family of gradient-driven loss
functions, namely the equalization losses.
4.1 Sigmoid Equalization Loss
4.1.1 Binary Cross-Entropy
Binary cross-entropy (BCE) loss estimates the probability of each category independently using $C$ sigmoid loss functions. Specifically, in a batch of instances $\mathcal{I}$, the classifier outputs the estimated probabilities $\mathbf{P}\in\mathbb{R}^{B\times C}$ by applying a sigmoid activation function to the logits $\mathbf{Z}$, i.e., $\mathbf{P}=\sigma(\mathbf{Z})$. We define the estimated probability of a certain sample as $p_i^j\in[0,1]$. For this sample, the loss term is computed as¹:

$$\mathrm{BCE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1-p) & \text{otherwise} \end{cases} \quad (5)$$

Following the notation in [15], we define $p_t$ as:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1-p & \text{otherwise} \end{cases} \quad (6)$$

Then we can rewrite $\mathrm{BCE}(p, y) = \mathrm{BCE}(p_t) = -\log(p_t)$. The final loss is calculated by summing up the loss values from all samples:

$$\mathcal{L}(\mathbf{P},\mathbf{Y}) = \sum_{i\in\mathcal{I}}\sum_{j=1}^{C} \mathrm{BCE}(p_t) \quad (7)$$
The probability of each category in the BCE is estimated
independently without cross normalization. This property
makes the binary cross-entropy suitable for tasks that con-
sist of a set of independent sub-tasks, such as object detection and multi-label image classification.
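The following short sketch spells out Eqs. (5)-(7) directly; the clamp on $p_t$ is a numerical-stability choice added here, and in practice a fused implementation such as PyTorch's binary_cross_entropy_with_logits would typically be preferred.

```python
import torch

def bce_loss(logits: torch.Tensor, one_hot_targets: torch.Tensor) -> torch.Tensor:
    """Sum of -log(p_t) over all samples and categories, mirroring Eqs. (5)-(7)."""
    p = torch.sigmoid(logits)                              # P = sigma(Z)
    p_t = torch.where(one_hot_targets == 1, p, 1.0 - p)    # Eq. (6)
    return -torch.log(p_t.clamp(min=1e-12)).sum()          # Eq. (7)
```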
4.1.2 Gradient-Driven Re-weighting
Under a long-tailed distribution, models are in an unbalanced training status. As mentioned in Section 3.3, the accumulated gradient ratio $G_j$ can reflect the training status of that category. Therefore we adopt it to adjust the training process for each sub-task in BCE independently and equally. Concretely, we propose a gradient-driven re-weighting mechanism in which we up-weight the positive gradients and down-weight the negative gradients for each classifier dynamically. This re-weighting strategy aims to make the gradient ratio as close to 1 as possible.

We denote $q_j$ as the weight term for positive samples of category $j$ and $r_j$ for negative samples. We propose the following formulation:

$$r_j = f(G_j) \quad (8)$$
$$q_j = 1 + \alpha(1 - r_j) \quad (9)$$

where $f(\cdot)$ is a mapping function that remaps the value of the gradient ratio to a more controllable range. Basically, for a small gradient ratio with imbalanced training status, we
1. For neatness, we ignore the superscripts and subscripts of $y$ and $p$. Unless otherwise stated, they are also ignored in the formulas of CE and focal loss for a certain sample.
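To make the mechanism concrete, the sketch below applies Eqs. (8)-(9) to the per-term BCE loss. The mapping function $f(\cdot)$ and the hyper-parameter $\alpha$ are illustrative placeholders here (a clamp to $[0,1]$ and $\alpha = 4.0$), not the specific choices of this paper, and the gradient ratio is treated as a constant (detached) so that weighting the loss terms scales the corresponding gradients.

```python
import torch

def sigmoid_eql_loss(logits: torch.Tensor, one_hot_targets: torch.Tensor,
                     grad_ratio: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Gradient-driven re-weighted BCE: q_j on positive terms, r_j on negative terms."""
    r = grad_ratio.detach().clamp(0.0, 1.0)    # r_j = f(G_j); clamp stands in for the mapping f
    q = 1.0 + alpha * (1.0 - r)                # q_j = 1 + alpha * (1 - r_j)
    p = torch.sigmoid(logits)
    pos_term = -one_hot_targets * torch.log(p.clamp(min=1e-12))
    neg_term = -(1.0 - one_hot_targets) * torch.log((1.0 - p).clamp(min=1e-12))
    return (q * pos_term + r * neg_term).sum() # per-category weights broadcast over the batch
```

In such a setup, grad_ratio would be supplied by an accumulator like the GradientTracker sketched in Section 3.2, updated after every iteration from the absolute gradients of this loss with respect to the logits.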