The Equalization Losses: Gradient-Driven
Training for Long-tailed Object Recognition
Jingru Tan, Bo Li, Xin Lu, Yongqiang Yao, Fengwei Yu, Tong He,
and Wanli Ouyang, Senior Member, IEEE
Abstract—Long-tail distribution is widely spread in real-world applications. Due to the extremely small ratio of instances, tail categories
often show inferior accuracy. In this paper, we find such performance bottleneck is mainly caused by the imbalanced gradients, which
can be categorized into two parts: (1) positive part, deriving from the samples of the same category, and (2) negative part, contributed
by other categories. Based on comprehensive experiments, it is also observed that the gradient ratio of accumulated positives to
negatives is a good indicator to measure how balanced a category is trained. Inspired by this, we come up with a gradient-driven
training mechanism to tackle the long-tail problem: re-balancing the positive/negative gradients dynamically according to current
accumulative gradients, with a unified goal of achieving balanced gradient ratios. Taking advantage of the simple and flexible gradient
mechanism, we introduce a new family of gradient-driven loss functions, namely equalization losses. We conduct extensive
experiments on a wide spectrum of visual tasks, including two-stage/single-stage long-tailed object detection (LVIS), long-tailed image
classification (ImageNet-LT, Places-LT, iNaturalist), and long-tailed semantic segmentation (ADE20K). Our method consistently
outperforms the baseline models, demonstrating the effectiveness and generalization ability of the proposed equalization losses. Code
will be released at https://github.com/ModelTC/United-Perception.
Index Terms—Long-tailed Object Recognition, Object Detection, Image Classification, Semantic Segmentation
1 INTRODUCTION
Object recognition is one of the most fundamental tasks
in computer vision. It is an important step in a host
of visual challenges, including object detection, semantic
segmentation, and object tracking. Despite this fact, the task
remains an open problem, not least due to the discrep-
ancy among the proportions of different categories. Current
benchmarks such as ImageNet [1], PASCAL VOC [2], COCO
[3], and Cityscapes [4] are carefully collected with balanced
annotations for each category, which contradicts the long-
tailed Zipfian distribution in natural images. Although ex-
isting methods have achieved impressive results, we still
can observe performance bottlenecks [5], [6], [7] on various
benchmarks, especially in the non-dominant classes with
fewer samples. As substantiated by recent literature [8], [9],
[10], tail categories are easily overwhelmed by the head
categories when learning on a dataset whose distribution is imbalanced and diverse.
Previous approaches can be roughly categorised into two
groups: data resampling [6], [11], [12] and cost-sensitive
learning [13], [14]. These methods address the above prob-
lems by either designing complex sampling strategies or
adjusting loss weights. Although promising, most of these
methods are designed based on categories’ frequency and
often suffer from several drawbacks: (1) those frequency-based methods are not robust enough due to widespread easy negative samples [15] and redundant positive samples [14]; (2) the accuracy is sensitive to the predefined hyper-parameters.

Jingru Tan and Bo Li are with Tongji University, Shanghai, China.
E-mail: {tjr120,1911030}@tongji.edu.cn
Xin Lu, Fengwei Yu, and Yongqiang Yao are with SenseTime Research, Shanghai, China.
E-mail: {luxin,yufengwei}@sensetime.com, soundbupt@gmail.com
Wanli Ouyang is with the University of Sydney and Shanghai AI Laboratory.
E-mail: wanli.ouyang@sydney.edu.au
Tong He is with Shanghai AI Laboratory. E-mail: tonghe90@gmail.com
Corresponding author. Equal contribution.
In this paper, we tackle the problem of long-tailed recog-
nition from a novel perspective. We start by analyzing the
distribution of the accumulated gradients across different
categories. Specifically, the gradients of one category con-
sist of two parts: (1) the positive part, deriving from the
samples of the same category, and (2) the negative part,
contributed by other categories. Since the tail categories
have limited positive samples, their positive gradients can
be easily overwhelmed by the negative part. As illustrated
in Fig. 1 (bottom row), the gradient ratio of positives to negatives is distributed off balance when training on long-tailed datasets such as LVIS [6], ImageNet-LT [5], and ADE20K-LT [7]. The gradient ratio is close to 1 for the head categories but close to 0 for the tail categories. We
hypothesize that such gradient imbalance in training is the
main obstacle impeding tail classes from obtaining satisfac-
tory performance. Besides, we also conduct experiments on
the well-balanced datasets such as COCO [3], ImageNet [1],
and ADE20K [7]. It can be observed that all categories have
a gradient ratio close to 1 without introducing any bias
toward positives or negatives. Therefore, we believe that
the gradient ratio can serve as a significant indicator of how
balanced a category is trained.
Such a gradient-based indicator provides useful guid-
ance to adjust gradients of the positive and negative parts,
which can be easily plugged into different classifiers. To
this end, we adapt it to various loss functions. (1) For
binary cross-entropy loss (BCE), we introduce a gradient-
driven re-weighting mechanism and propose the sigmoid
Fig. 1: Gradient observation on four different types of tasks. Each column is responsible for a specific task. We define the
gradient of a sample as the derivative of the loss function with respect to its network output logits. For each category, we
demonstrate the accumulated gradient of positive samples (row 1), gradients of negative samples (row 2), and the gradient
ratio of positive samples to negative samples (row 3). Category indices are sorted by instance number, and the 80 COCO categories are aligned with the 1203 LVIS categories. The left and right y-axes correspond to the imbalanced and balanced datasets, respectively.
equalization loss (Sigmoid-EQL). It treats the overall classi-
fication problem as a set of independent binary classification
tasks. Then the accumulative gradient ratio is used to up-
weight the positive gradients and down-weight the negative
gradients accordingly, aiming to balance the gradients of the
two parts. (2) For cross-entropy loss (CE), we propose the
softmax equalization loss (Softmax-EQL), which calibrates
the decision boundary dynamically based on the statistics of
the gradients. (3) For focal loss (FL) [15], we come up with
the equalized focal loss (EFL) by decoupling the coefficients
in [15] into category-agnostic and category-specific parts. By
introducing the gradient into the category-specific parts, the
model is able to focus more on the learning of rare cate-
gories. Those losses do not rely on the pre-computed data
statistics to determine the rebalancing terms. Instead, they
control the training process in a dynamic way. This data-distribution-agnostic property makes them more suitable for streaming and real-world data.
To demonstrate the effectiveness of our proposed
method, comprehensive experiments have been conducted
on various datasets and tasks. For object detection on the
challenging LVIS [6] benchmark, our proposed Sigmoid-
EQL and Softmax-EQL outperform Mask R-CNN [16] by
about 6.4% and 5.7% in terms of AP, respectively. With-
out introducing extra computation overhead, our approach
improves the performance substantially. With the help of
equalization losses, we won first place in both the COCO-LVIS Challenge 2019 and 2020 [17]. We also validate the effectiveness of our proposed EFL on the task of single-stage object detection. Our method achieves 29.2% AP, delivering
significant improvements over state-of-the-art results. In ad-
dition to the effectiveness, the equalization losses also show
strong generalization ability when transferring to other
datasets and visual tasks. For example, equalization losses
maintain huge improvements when moving from LVIS to
Openimages [18] without further hyper-parameter tuning.
In Openimages, Sigmoid-EQL and EFL outperform the base-
line CE method by 9.1% AP and 6.6% AP, respectively. For
the image classification task, Softmax-EQL achieves state-
of-the-art results on three long-tailed image classification
datasets (ImageNet-LT, Places-LT, and iNaturalist 2018). We
also evaluate our method on semantic segmentation using
ADE20K [7]. Our proposed Sigmoid-EQL improves the
powerful baseline, DeepLabV3+ [19], by 1.56% and 2.27%
in terms of mIoU and mAcc, respectively, showing strong
generalization of gradient-driven losses to varying tasks.
2 RELATED WORK
Long-tailed Object Recognition. Common solutions for
long-tailed image recognition are data re-sampling and
loss re-weighting. Re-sampling methods under-sample the
head categories [20], [21] or over-sample the tail cate-
gories [11], [12], [22], [23]. Re-weighting methods assign dif-
ferent weights to different categories [14], [24], [25], [26] or
instances [15], [27], [28]. Decoupled training methods [29],
[30] address the classifier imbalance problem with a two-
stage training pipeline by decoupling the learning of rep-
resentation and classifier. In addition, margin calibration methods [10], [13], [31], [32] inject category-specific margins into the CE loss to re-balance the logit distributions of categories.
Recently, other works address the long-tailed problem from
different perspectives such as transfer learning [5], [33], [34],
supervised contrastive learning [35], [36], [37], [38], ensem-
ble learning [39], [40], [41], [42], [43], and so on. Some works
adapt those ideas to object detection [6], including data re-
sampling [6], [44], loss re-weighting [8], [9], [45], decoupled
training [30], [44], [46], margin calibration [10], [47], incre-
mental learning [48] and causal inference [49]. Despite the
efforts, most of them somehow utilize the sample number
as the indicator of imbalance to design their algorithms. In
contrast, we use the gradient as the indicator. It is more stable and precise and thus reflects the model's training status better.
Gradient as Indicator. There are some works [15], [27] that
attempt to solve the imbalance problems from the gradient
view. They use the instant gradient to reflect the learning
difficulty of a sample at a certain moment and determine
its loss contribution dynamically, which can be viewed as
an online version of hard negative example mining [50].
Those methods are designed for the serious foreground-
background imbalance problem. Different from them, we
use accumulative gradients to reflect the imbalanced training
status of categories. Our method is designed for long-tailed
object recognition. Meanwhile, our method is complemen-
tary to theirs. We can solve the foreground-background im-
balance problem and foreground-foreground (long-tailed)
problem simultaneously by combining the instant and accumulative gradient indicators.
3 GRADIENT IMBALANCE PROBLEM
In this section, we introduce the imbalanced gradients,
which we believe should be responsible for the inferior
performance of the tail categories. It comes from the entanglement of instances and categories, which we describe next. Building upon comprehensive experiments, we argue that such gradient statistics can serve as an effective indicator of the status of category classifiers.
Notation Definition. Suppose we have a training set $\mathcal{X}=\{x_i, y_i\}^{N}$ with $C$ categories. Let the total instance number over the dataset be $N$, with $N=\sum_{j=1}^{C} n_j$, where $n_j$ is the instance number of category $j$. For each iteration, we have a batch of instances $\mathcal{I}$ with a batch size of $B$. $\mathbf{Y}\in\mathbb{R}^{B\times C}$ are the one-hot labels of the batch. We adopt a CNN $f$ with parameters $\theta$ as the feature extractor. Then the feature representations of the batch can be computed by $f(\mathcal{I};\theta)$. A linear transformation is used as the classifier to output the logits $\mathbf{Z}\in\mathbb{R}^{B\times C}$: $\mathbf{Z}=\mathbf{W}^{T}f(\mathcal{I};\theta)+\mathbf{b}$, where $\mathbf{W}$ denotes the classifier weight matrix and $\mathbf{b}$ is the bias.

We denote each image as an instance. $\mathbf{W}$ can be regarded as $C$ classifiers, each of which is responsible for bi-categorizing instances as one class. Each instance can be regarded as a positive sample for one specific category and a negative sample for the remaining $C-1$ categories. We denote by $y_i^j\in\{0,1\}$ the label, which equals 1 if the $i$-th instance belongs to the $j$-th category.
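For concreteness, a minimal PyTorch sketch of this setup is given below; the backbone, feature dimension, image size, and tensor names are illustrative assumptions rather than the implementation used in this work.

```python
import torch
import torch.nn as nn

B, C, D = 16, 1203, 256                      # batch size, categories, feature dim (illustrative)
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, D))  # stand-in for the CNN f(.; theta)
classifier = nn.Linear(D, C)                 # holds the weight matrix W and the bias b

images = torch.randn(B, 3, 32, 32)           # a batch of instances I
labels = torch.randint(0, C, (B,))           # category index of each instance
Y = torch.zeros(B, C).scatter_(1, labels.unsqueeze(1), 1.0)  # one-hot labels Y in R^{B x C}

features = backbone(images)                  # f(I; theta)
Z = classifier(features)                     # Z = W^T f(I; theta) + b, shape (B, C)
```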
3.1 Entanglement of Instances and Categories
The total number of positive samples $M^{\text{pos}}_j$ and negative samples $M^{\text{neg}}_j$ for the $j$-th classifier can be easily obtained:

$$M^{\text{pos}}_j=\sum_{i\in\mathcal{X}} y_i^j, \quad M^{\text{neg}}_j=\sum_{i\in\mathcal{X}} (1-y_i^j) \quad (1)$$

The ratio of the number of positive samples to negative samples over the dataset is then:

$$\frac{M^{\text{pos}}_j}{M^{\text{neg}}_j}=\frac{n_j}{N-n_j}\approx\frac{n_j}{N}\ll 1 \quad (2)$$

From Eq. 2, we observe $M^{\text{pos}}_j\ll M^{\text{neg}}_j$ for the tail categories that have a very limited number of instances, indicating that these categories often suffer from an extremely imbalanced ratio of positive to negative samples. Previous methods [13], [14] address the problem by applying different loss weights or decision margins to different categories according to their sample numbers. However, they often fail to generalize well to other datasets because the sample numbers cannot reflect the training status of each classifier well. For example, a large number of easy negative samples and some redundant positive samples hardly contribute to the learning of the model. In contrast, we propose to use gradient statistics as our metric to indicate whether a category is in a balanced training status, and we conjecture that there is a similar positive-negative imbalance problem in the gradient ratios of rare categories.
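As an illustrative example (with numbers chosen for exposition rather than taken from a particular dataset), a tail category with $n_j = 50$ instances in a dataset of $N = 100{,}000$ instances gives $M^{\text{pos}}_j/M^{\text{neg}}_j = 50/99{,}950 \approx 5\times 10^{-4}$, i.e., each positive sample is outnumbered by roughly two thousand negatives.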
3.2 Gradient Computation
We define the gradient over the batch $\mathcal{I}$ as the derivative of the objective cost function $\mathcal{L}$ with respect to the logits $\mathbf{Z}$. The gradient $\mathbf{G}=\frac{\partial\mathcal{L}}{\partial\mathbf{Z}}\in\mathbb{R}^{B\times C}$ corresponds to the gradients of all samples belonging to the $C$ categories. We denote the gradient of a certain sample as $g_i^j$. Then the positive gradient $g(t)^{\text{pos}}_j$ and negative gradient $g(t)^{\text{neg}}_j$ of category $j$ at iteration $t$ can be computed as follows:

$$g(t)^{\text{pos}}_j=\sum_{i\in\mathcal{I}} y_i^j \,|g_i^j|, \quad g(t)^{\text{neg}}_j=\sum_{i\in\mathcal{I}} (1-y_i^j)\,|g_i^j| \quad (3)$$

The accumulated positive gradients $G(T)^{\text{pos}}_j$ and negative gradients $G(T)^{\text{neg}}_j$ at iteration $T$ can be defined as:

$$G(T)^{\text{pos}}_j=\sum_{t=0}^{T} g(t)^{\text{pos}}_j, \quad G(T)^{\text{neg}}_j=\sum_{t=0}^{T} g(t)^{\text{neg}}_j \quad (4)$$

For simplicity, we omit $T$ in the notation and directly adopt $G^{\text{pos}}_j$ and $G^{\text{neg}}_j$ as the accumulated positive and negative gradients, respectively. The accumulated gradient ratio is calculated as $G_j=\frac{G^{\text{pos}}_j}{G^{\text{neg}}_j}$.
3.3 Gradient Observation
To validate our hypothesis that rare categories suffer from
gradient imbalance problems, we collect gradient statistics
during the training process across a wide spectrum of recog-
nition tasks and datasets, including long-tailed image clas-
sification (ImageNet-LT [5]), two-stage/single-stage long-
tailed object detection (LVIS [6]), and long-tailed semantic
segmentation (ADE20K [7]). The results are shown in Fig.
1. We consistently observe four key phenomena: (1) the positive gradients $G^{\text{pos}}$ follow a long-tailed distribution; (2) the negative gradients $G^{\text{neg}}$ follow a long-tailed distribution; (3) the magnitudes of the positive and negative gradients differ across categories, so their ratio $G$ also follows a long-tailed distribution; (4) the gradient ratio $G$ of head categories is close to 1, while that of tail categories is close to 0.
The x-axis of Fig. 1 is sorted by category instance numbers $n_j$. We notice that the positive gradients are positively correlated with the positive sample number, while the negative gradients are not positively correlated with the negative sample number. This is because tail categories receive scarcely any training: the model hardly ever predicts the tail categories as positives, so most of their negative samples are easy samples. Although easy negative
samples have small gradients, the effect accumulated from
a large number of them is not negligible. The observation
of the gradient ratio proves that the category classifiers
with fewer samples suffer from a more serious gradient
imbalance problem between positives and negatives, which
validates our conjecture. For the head categories with abun-
dant training samples, the received gradient ratio of the
corresponding classifier is close to 1, indicating the classifier
gives no inclination to positives or negatives, which we
refer to as a balanced training status. For the tail classifier,
the gradient ratio is close to 0, indicating the classifier is
heavily biased towards negative, which we refer to as an
imbalanced training status. It is worth noting that there is
a vast number of background negative samples in single-
stage object detection, as discussed in [15]. We find that the
head categories still have a gradient ratio close to 1 in a well-
trained model, as shown in Fig. 1 (column 3). This proves
that the gradient indicator is more stable and reliable than
the sample number.
As illustrated in Fig. 1, more experiments are conducted on several datasets that have a similar number of instances across categories, including ImageNet [1] and COCO [3]. We observe that the gradients are distributed in a balanced way and that the category classifiers have a balanced gradient ratio close to 1.
By observing the gradient statistics under imbalanced and balanced data distributions, we conclude that the gradients (i.e., $G^{\text{pos}}_j$, $G^{\text{neg}}_j$) and the gradient ratio (i.e., $G_j$) can serve as an important indicator of the training status of categories.
4 THE EQUALIZATION LOSSES
Our central idea is to utilize gradient statistics as an in-
dicator to reflect the training status of category classifiers
and then adjust their training process dynamically. In this
section, by applying this idea to several loss functions, such
as binary cross-entropy loss, cross-entropy loss, and focal
loss, we introduce a new family of gradient-driven loss
functions, namely the equalization losses.
4.1 Sigmoid Equalization Loss
4.1.1 Binary Cross-Entropy
Binary cross-entropy (BCE) loss estimates the probability of each category independently using $C$ sigmoid loss functions. Specifically, in a batch of instances $\mathcal{I}$, the classifier outputs the estimated probabilities $\mathbf{P}\in\mathbb{R}^{B\times C}$ by applying a sigmoid activation function to the logits $\mathbf{Z}$, i.e., $\mathbf{P}=\sigma(\mathbf{Z})$. We define the estimated probability of a certain sample as $p_i^j\in[0,1]$. For this sample, the loss term is computed as¹:

$$\mathrm{BCE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1-p) & \text{otherwise} \end{cases} \quad (5)$$

Following the notation in [15], we define $p_t$ as:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1-p & \text{otherwise} \end{cases} \quad (6)$$

Then we can rewrite $\mathrm{BCE}(p, y) = \mathrm{BCE}(p_t) = -\log(p_t)$. The final loss is calculated by summing up the loss values from all samples:

$$\mathcal{L}(\mathbf{P},\mathbf{Y}) = \sum_{i\in\mathcal{I}}\sum_{j=1}^{C} \mathrm{BCE}(p_t) \quad (7)$$
The probability of each category in the BCE is estimated
independently without cross normalization. This property
makes the binary cross-entropy suitable for tasks that con-
sist of a set of independent sub-tasks, such as object detection and multi-label image classification.
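The following short sketch spells out Eqs. (5)-(7) directly; the clamp on $p_t$ is a numerical-stability choice added here, and in practice a fused implementation such as PyTorch's binary_cross_entropy_with_logits would typically be preferred.

```python
import torch

def bce_loss(logits: torch.Tensor, one_hot_targets: torch.Tensor) -> torch.Tensor:
    """Sum of -log(p_t) over all samples and categories, mirroring Eqs. (5)-(7)."""
    p = torch.sigmoid(logits)                              # P = sigma(Z)
    p_t = torch.where(one_hot_targets == 1, p, 1.0 - p)    # Eq. (6)
    return -torch.log(p_t.clamp(min=1e-12)).sum()          # Eq. (7)
```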
4.1.2 Gradient-Driven Re-weighting
Under a long-tailed distribution, models are in an unbalanced training status. As mentioned in Section 3.3, the accumulated gradient ratio $G_j$ can reflect the training status of that category. Therefore we adopt it to adjust the training process for each sub-task in BCE independently and equally. Concretely, we propose a gradient-driven re-weighting mechanism in which we up-weight the positive gradients and down-weight the negative gradients for each classifier dynamically. This re-weighting strategy aims to make the gradient ratio as close to 1 as possible.

We denote $q_j$ as the weight term for positive samples of category $j$ and $r_j$ for negative samples. We propose the following formulation:

$$r_j = f(G_j) \quad (8)$$
$$q_j = 1 + \alpha(1 - r_j) \quad (9)$$

where $f(\cdot)$ is a mapping function that remaps the value of the gradient ratio to a more controllable range. Basically, for a small gradient ratio with imbalanced training status, we
1. For neatness, we ignore the superscripts and subscripts of $y$ and $p$. Unless otherwise stated, they are also ignored in the formulas of CE and focal loss for a certain sample.
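To make the mechanism concrete, the sketch below applies Eqs. (8)-(9) to the per-term BCE loss. The mapping function $f(\cdot)$ and the hyper-parameter $\alpha$ are illustrative placeholders here (a clamp to $[0,1]$ and $\alpha = 4.0$), not the specific choices of this paper, and the gradient ratio is treated as a constant (detached) so that weighting the loss terms scales the corresponding gradients.

```python
import torch

def sigmoid_eql_loss(logits: torch.Tensor, one_hot_targets: torch.Tensor,
                     grad_ratio: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Gradient-driven re-weighted BCE: q_j on positive terms, r_j on negative terms."""
    r = grad_ratio.detach().clamp(0.0, 1.0)    # r_j = f(G_j); clamp stands in for the mapping f
    q = 1.0 + alpha * (1.0 - r)                # q_j = 1 + alpha * (1 - r_j)
    p = torch.sigmoid(logits)
    pos_term = -one_hot_targets * torch.log(p.clamp(min=1e-12))
    neg_term = -(1.0 - one_hot_targets) * torch.log((1.0 - p).clamp(min=1e-12))
    return (q * pos_term + r * neg_term).sum() # per-category weights broadcast over the batch
```

In such a setup, grad_ratio would be supplied by an accumulator like the GradientTracker sketched in Section 3.2, updated after every iteration from the absolute gradients of this loss with respect to the logits.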