
loss re-weighting. Re-sampling methods under-sample the
head categories [20], [21] or over-sample the tail cate-
gories [11], [12], [22], [23]. Re-weighting methods assign dif-
ferent weights to different categories [14], [24], [25], [26] or
instances [15], [27], [28]. Decoupled training methods [29],
[30] address the classifier imbalance problem with a two-
stage training pipeline by decoupling the learning of rep-
resentation and classifier. In addition, margin calibration
[10], [13], [31], [32] injects category-specific margins into the CE loss to re-balance the logit distribution of each category.
Recently, other works address the long-tailed problem from
different perspectives such as transfer learning [5], [33], [34],
supervised contrastive learning [35], [36], [37], [38], ensem-
ble learning [39], [40], [41], [42], [43], and so on. Some works
adapt those ideas to object detection [6], including data re-
sampling [6], [44], loss re-weighting [8], [9], [45], decoupled
training [30], [44], [46], margin calibration [10], [47], incre-
mental learning [48] and causal inference [49]. Despite the
efforts, most of these methods rely on the sample number as the indicator of imbalance when designing their algorithms. In contrast, we use the gradient as the indicator, which is more stable and precise and thus better reflects the model's training status.
Gradient as Indicator. Some works [15], [27] attempt to solve the imbalance problems from the gradient
view. They use the instant gradient to reflect the learning
difficulty of a sample at a certain moment and determine
its loss contribution dynamically, which can be viewed as
an online version of hard negative example mining [50].
Those methods are designed for the severe foreground-background imbalance problem. Different from them, we use accumulative gradients to reflect the imbalanced training status of categories, and our method is designed for long-tailed object recognition. Meanwhile, our method is complementary to theirs: by combining the instant gradient and accumulative gradient indicators, we can address the foreground-background imbalance problem and the foreground-foreground (long-tailed) imbalance problem simultaneously.
3 GRADIENT IMBALANCE PROBLEM
In this section, we introduce the gradient imbalance, which we believe is responsible for the inferior performance of the tail categories. It arises from the entanglement of instances and categories, which we describe next. Building upon comprehensive experiments, we argue that such gradient statistics can serve as an effective indicator of the training status of the category classifiers.
Notation Definition. Suppose we have a training set $\mathcal{X} = \{x_i, y_i\}^N$ with $C$ categories. Let the total instance number over the dataset be $N$, with $N = \sum_j^C n_j$, where $n_j$ is the instance number of category $j$. For each iteration, we have a batch of instances $\mathcal{I}$ with a batch size of $B$. $\mathcal{Y} \in \mathbb{R}^{B \times C}$ are the one-hot labels of the batch. We adopt a CNN $f$ with parameters $\theta$ as the feature extractor. Then the feature representations of the batch can be computed by $f(\mathcal{I}; \theta)$. A linear transformation is used as the classifier to output the logits $\mathcal{Z} \in \mathbb{R}^{B \times C}$, i.e., $\mathcal{Z} = W^{\mathsf{T}} f(\mathcal{I}; \theta) + b$, where $W$ denotes the classifier weight matrix and $b$ is the bias.
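To make the notation concrete, the following is a minimal PyTorch sketch of how the logits $\mathcal{Z}$ are produced. The toy network, feature dimension, image size, and category count are illustrative assumptions, not the exact setup used in this work.

```python
# Illustrative sketch of the notation above (assumed sizes, toy CNN).
import torch
import torch.nn as nn

B, C, D = 16, 100, 256           # batch size B, categories C, feature dim D (assumed)

f = nn.Sequential(               # stand-in feature extractor f(.; theta)
    nn.Conv2d(3, D, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
classifier = nn.Linear(D, C)     # W (weight matrix) and b (bias)

images = torch.randn(B, 3, 64, 64)      # a batch of instances I
features = f(images)                    # f(I; theta), shape (B, D)
logits = classifier(features)           # Z = W^T f(I; theta) + b, shape (B, C)
```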
We denote each image as an instance. $W$ can be regarded as $C$ classifiers, each of which is responsible for the binary classification of instances for one class. Each instance can be regarded as a positive sample for one specific category and a negative sample for the remaining $C-1$ categories. We denote by $y_i^j \in \{0, 1\}$ the label, which equals 1 if the $i$-th instance belongs to the $j$-th category.
3.1 Entanglement of Instances and Categories
The total number of positive samples $M_j^{\text{pos}}$ and negative samples $M_j^{\text{neg}}$ for the $j$-th classifier can be easily obtained:
$$M_j^{\text{pos}} = \sum_{i \in \mathcal{X}} y_i^j, \qquad M_j^{\text{neg}} = \sum_{i \in \mathcal{X}} (1 - y_i^j) \tag{1}$$
The ratio of the number of positive samples to the number of negative samples over the dataset is then:
$$\frac{M_j^{\text{pos}}}{M_j^{\text{neg}}} \propto \frac{n_j}{N - n_j} \propto \frac{1}{\frac{N}{n_j} - 1} \tag{2}$$
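As a quick illustration of Eqs. 1 and 2, the sketch below counts positive and negative samples per classifier from integer labels; the long-tailed label distribution is synthetic and only meant to expose the tail-category ratio.

```python
# Counting positives/negatives per category (Eq. 1) and their ratio (Eq. 2).
import torch
import torch.nn.functional as F

C, N = 5, 1000
probs = torch.tensor([0.6, 0.25, 0.1, 0.04, 0.01])          # made-up long-tailed frequencies
labels = torch.multinomial(probs, N, replacement=True)      # y_i for N instances
one_hot = F.one_hot(labels, C).float()                      # y_i^j, shape (N, C)

M_pos = one_hot.sum(dim=0)                                  # Eq. 1: positives per classifier
M_neg = (1.0 - one_hot).sum(dim=0)                          # Eq. 1: negatives per classifier
ratio = M_pos / M_neg                                       # Eq. 2: n_j / (N - n_j)
print(ratio)                                                # tail categories give ratio << 1
```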
From Eq. 2, we observe $M_j^{\text{pos}} \ll M_j^{\text{neg}}$ for the tail categories that have a very limited number of instances, indicating that these categories often suffer from an extremely imbalanced ratio of positive to negative samples. Previous methods [13], [14] address the problem by applying different loss weights or decision margins to different categories according to their sample numbers. However, they often fail to generalize well to other datasets because the sample numbers cannot reflect the training status of each classifier well. For example, a large number of easy negative samples and some redundant positive samples hardly contribute to the learning of the model. In contrast, we propose to use gradient statistics as our metric to indicate whether a category is in a balanced training status, and we conjecture that a similar positive-negative imbalance problem exists in the gradient ratios of rare categories.
3.2 Gradient Computation
We define the gradient over the batch $\mathcal{I}$ as the derivative of the objective cost function $\mathcal{L}$ with respect to the logits $\mathcal{Z}$. The gradient $\mathcal{G} = \frac{\partial \mathcal{L}}{\partial \mathcal{Z}} \in \mathbb{R}^{B \times C}$ corresponds to the gradients of all samples belonging to the $C$ categories. We denote the gradient of a certain sample as $g_i^j$. Then the positive gradients $g_j^{\text{pos}}(t)$ and negative gradients $g_j^{\text{neg}}(t)$ of category $j$ at iteration $t$ can be computed as follows:
$$g_j^{\text{pos}}(t) = \sum_{i \in \mathcal{I}} y_i^j \, |g_i^j|, \qquad g_j^{\text{neg}}(t) = \sum_{i \in \mathcal{I}} (1 - y_i^j) \, |g_i^j| \tag{3}$$
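A hedged sketch of Eq. 3 in PyTorch is given below: the per-sample gradients $|g_i^j|$ are taken from the derivative of the loss with respect to the logits and split into positive and negative parts per category. The sigmoid cross-entropy loss is only an assumed stand-in to produce a concrete gradient, not necessarily the loss used in this work.

```python
# Per-iteration positive/negative gradients of each category classifier (Eq. 3).
import torch
import torch.nn.functional as F

B, C = 16, 5
logits = torch.randn(B, C, requires_grad=True)               # Z
targets = F.one_hot(torch.randint(0, C, (B,)), C).float()    # Y (one-hot labels)

loss = F.binary_cross_entropy_with_logits(logits, targets)   # assumed loss choice
grad = torch.autograd.grad(loss, logits)[0].abs()            # |g_i^j| = |dL/dz_i^j|

pos_grad_t = (targets * grad).sum(dim=0)                     # g_j^pos(t), shape (C,)
neg_grad_t = ((1.0 - targets) * grad).sum(dim=0)             # g_j^neg(t), shape (C,)
```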
The accumulated positive gradients $G_j^{\text{pos}}(T)$ and negative gradients $G_j^{\text{neg}}(T)$ at iteration $T$ could be defined as:
$$G_j^{\text{pos}}(T) = \sum_{t=0}^{T} g_j^{\text{pos}}(t), \qquad G_j^{\text{neg}}(T) = \sum_{t=0}^{T} g_j^{\text{neg}}(t) \tag{4}$$
For simplicity, we omit $T$ in the notation and directly adopt $G_j^{\text{pos}}$ and $G_j^{\text{neg}}$ as the accumulated positive and negative gradients, respectively. The accumulated gradient ratio can then be calculated as $G_j = \frac{G_j^{\text{pos}}}{G_j^{\text{neg}}}$.
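The accumulation in Eq. 4 and the ratio $G_j$ can be maintained with a small running tracker, sketched below. The `GradientRatioTracker` class and its epsilon guard against division by zero early in training are our own illustrative additions, not part of the original formulation.

```python
# Running accumulation of positive/negative gradients (Eq. 4) and the ratio G_j.
import torch

class GradientRatioTracker:
    def __init__(self, num_classes: int, eps: float = 1e-12):
        self.pos = torch.zeros(num_classes)    # G_j^pos, accumulated over iterations
        self.neg = torch.zeros(num_classes)    # G_j^neg, accumulated over iterations
        self.eps = eps                         # guards the ratio early in training

    def update(self, pos_grad_t: torch.Tensor, neg_grad_t: torch.Tensor) -> None:
        # Add the per-iteration gradients from Eq. 3 (called once per iteration).
        self.pos += pos_grad_t.detach()
        self.neg += neg_grad_t.detach()

    def ratio(self) -> torch.Tensor:
        # G_j = G_j^pos / G_j^neg; small values indicate under-trained (tail) classes.
        return self.pos / (self.neg + self.eps)
```

In a training loop, `update` would be fed the `pos_grad_t` and `neg_grad_t` computed as in the previous sketch, and `ratio` would be queried whenever the training status of each classifier needs to be inspected.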