accuracy at any operating point. We further introduce a soft structured-prediction loss that dominates
the flat softmax.
While it was already known that top-down approaches provide worse leaf-node predictions than a
flat softmax [31,2], we did not expect this to hold for non-leaf predictions. We hypothesise that the
top-down approaches obtain worse accuracy than the “bottom-up” flat softmax because the coarse
classes can be highly diverse, and are thus better learned by a union of distinct classifiers when the
fine-grained labels are available. To support this hypothesis, we show that training a flat softmax
classifier at a lower level and testing at a higher level provides better accuracy than training at
the higher level. Finally, we consider a synthetic out-of-distribution problem where a hierarchical
classifier that explicitly assigns scores to higher-level classes may be expected to have an advantage.
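To make the coarse-level evaluation above concrete: testing a leaf-level softmax classifier at a higher level only requires summing its predicted probabilities within each coarse class. A minimal sketch, with hypothetical class assignments and probabilities:

```python
import numpy as np

# Hypothetical mapping: leaves 0,1 belong to coarse class 0; leaves 2,3 to class 1.
coarse_of_leaf = np.array([0, 0, 1, 1])
leaf_probs = np.array([0.1, 0.5, 0.3, 0.1])  # softmax output over the fine classes

# A coarse class's probability is the sum of its leaves' probabilities.
coarse_probs = np.zeros(coarse_of_leaf.max() + 1)
np.add.at(coarse_probs, coarse_of_leaf, leaf_probs)

print(coarse_probs)           # [0.6 0.4]
print(coarse_probs.argmax())  # coarse prediction of the fine-grained classifier
```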
The key contributions of this paper are as follows.
• We introduce a novel, efficient technique for evaluating hierarchical classifiers that captures the full trade-off between specificity and correctness using a threshold-based inference rule (see the sketch after this list).
• We propose two novel loss functions, soft-max-descendant and soft-max-margin, and entertain a simplification of Deep RTC [39] which we refer to as parameter sharing (PS) softmax. While soft-max-descendant is ineffective, soft-max-margin and PS softmax achieve the best results.
• We conduct an empirical comparison of loss functions and inference rules using the iNat21-Mini dataset and its hierarchy of 10,000 species. The naïve softmax is found to be surprisingly effective, and the simple threshold-based inference rule performs well compared to alternative options.
• We evaluate the robustness of different methods to unseen leaf-node classes. The top-down methods are more competitive for unseen classes, while PS softmax is the most effective.
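To illustrate the threshold-based inference rule referenced above, the sketch below assigns each node the total probability of its leaf descendants and predicts the deepest node whose probability exceeds a threshold; sweeping the threshold traces out the specificity/correctness trade-off. The toy tree, probabilities and tie-breaking rule are illustrative assumptions, not the exact rule from the paper.

```python
# Toy tree: root -> {animal, plant}; animal -> {cat, dog}.
parents = {"animal": "root", "plant": "root", "cat": "animal", "dog": "animal"}
leaf_probs = {"cat": 0.45, "dog": 0.35, "plant": 0.20}  # flat softmax over leaves

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return path

# Each node's probability is the total probability of its leaf descendants.
node_probs = {}
for leaf, p in leaf_probs.items():
    for node in ancestors(leaf):
        node_probs[node] = node_probs.get(node, 0.0) + p

def predict(threshold):
    """Deepest node whose probability exceeds the threshold (ties: higher prob)."""
    valid = [n for n, p in node_probs.items() if p > threshold]
    return max(valid, key=lambda n: (len(ancestors(n)), node_probs[n]))

print(predict(0.5))  # 'animal': no single leaf clears 0.5, but cat+dog = 0.8 does
print(predict(0.3))  # 'cat': the most likely node at full specificity
```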
2 Related work
There are several different types of hierarchical classification problem. In the terminology of Silla and
Freitas [33], we consider problems with tree structure, Single-Path Labels (SPL) and Non-Mandatory
Leaf-Node Prediction (NMLNP). Tree structure means that there is a unique root node and every
other node has exactly one parent; Single-Path Labels means that a sample cannot belong to two
separate classes (unless one is a superclass of the other); and Non-Mandatory Leaf-Node Prediction
means that the classifier may predict any class in the hierarchy, not just leaf nodes. This work assumes
the hierarchy is known and does not consider the distinct problem of learning the hierarchy.
MLNP with deep learning.
Several recent works have sought to
improve leaf-node prediction by leveraging a class hierarchy. Wu et al. [38] trained a softmax classifier by optimising a “win” metric
comprising a weighted combination of likelihoods on the path from the root to the label. Bertinetto et
al. [2] proposed a similar Hierarchical Cross-Entropy (HXE) loss comprising weighted losses for
the conditional likelihood of each node given its parent, as well as a Soft Label loss that generalised
label smoothing [27] to incorporate the hierarchy.
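Both [38] and HXE weight likelihoods along the root-to-label path. A minimal sketch of the HXE idea, assuming node probabilities have already been obtained by summing leaf probabilities over descendants, with illustrative values and the exponentially decaying weights described by Bertinetto et al.:

```python
import numpy as np

# Probabilities along the path from the label leaf to the root,
# e.g. p(cat), p(animal), p(root); illustrative values only.
path_probs = np.array([0.45, 0.80, 1.00])

alpha = 0.5  # decay hyper-parameter; larger alpha focuses the loss near the leaf
cond = path_probs[:-1] / path_probs[1:]          # p(node | parent) along the path
weights = np.exp(-alpha * np.arange(len(cond)))  # heavier weight near the leaf
hxe = -(weights * np.log(cond)).sum()
print(hxe)
```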
Karthik et al. [20] demonstrated that a flat softmax
classifier is still effective for MLNP and proposed an alternative inference method, Conditional Risk
Minimisation (CRM). Guo et al. [18] performed inference using an RNN starting from the root node.
Other works have proposed architectures that reflect the hierarchy (e.g. [41,1,43]). This paper seeks
to compare loss functions under a simple inference procedure.
Non-MLNP methods.
Comparatively few works have entertained hierarchical classifiers that can
predict arbitrary nodes. Deng et al. [11] highlighted the trade-off between specificity and accuracy,
and sought to obtain the most-specific classifier for a given error rate. Davis et al. [7,8] re-calibrated
a flat softmax classifier within a hierarchy using a held-out set and performed inference by starting
at the most-likely leaf node and climbing the tree until a threshold was met, comparing methods
at several fixed thresholds.
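A minimal sketch of this bottom-up procedure on a toy tree (the held-out calibration step is omitted; probabilities and threshold are illustrative):

```python
# Toy tree: root -> {animal, plant}; animal -> {cat, dog}.
parents = {"animal": "root", "plant": "root", "cat": "animal", "dog": "animal"}
leaf_probs = {"cat": 0.45, "dog": 0.35, "plant": 0.20}  # calibrated leaf probabilities

# Node probability = total probability of the node's leaf descendants.
node_probs = dict(leaf_probs)
for leaf, p in leaf_probs.items():
    node = leaf
    while node in parents:
        node = parents[node]
        node_probs[node] = node_probs.get(node, 0.0) + p

def climb(threshold):
    """Start at the most likely leaf and climb until the threshold is met."""
    node = max(leaf_probs, key=leaf_probs.get)
    while node_probs[node] < threshold and node in parents:
        node = parents[node]
    return node

print(climb(0.5))  # 'animal': cat alone (0.45) is below 0.5, cat+dog (0.8) is not
```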
Wu et al. [39] proposed the Deep Realistic Taxonomic Classifier (Deep
RTC), which obtains a score for each node by summing the scores of its ancestors [32] and whose
loss is evaluated at random cuts of the tree. They perform inference by traversing down from the
root until a score threshold is crossed (by default, zero).
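A minimal sketch of the ancestor-sum scoring and thresholded top-down traversal, with hypothetical scalar parameters standing in for the learned per-node predictors (an illustration of the idea, not the authors' implementation):

```python
# Hypothetical per-node parameters; a node's score is the sum of its own
# parameter and those of its ancestors (the ancestor-sum idea of [32, 39]).
theta = {"root": 0.0, "animal": 1.2, "plant": -0.4, "cat": 0.5, "dog": -0.2}
parents = {"animal": "root", "plant": "root", "cat": "animal", "dog": "animal"}
children = {"root": ["animal", "plant"], "animal": ["cat", "dog"]}

def score(node):
    s = theta[node]
    while node in parents:
        node = parents[node]
        s += theta[node]
    return s

def predict(threshold=0.0):
    """Descend from the root along the best child; stop when no child's
    score crosses the threshold (zero by default, as in Deep RTC)."""
    node = "root"
    while node in children:
        best = max(children[node], key=score)
        if score(best) <= threshold:
            break
        node = best
    return node

print(predict())  # descends root -> animal -> cat
```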
In the YOLO-9000 detector, Redmon and
Farhadi [31] introduced a conditional softmax classifier that resembled the efficiency-driven approach
of Morin and Bengio [26]. The model outputs a concatenation of logit vectors that each parametrise
(via softmax) the conditional distribution of children given a parent. It was noted to provide elegant
degradation, but the hierarchical predictions were not rigorously evaluated.
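A minimal sketch of such a conditional softmax on a toy tree, with hypothetical logits; the probability of any node is the product of the conditionals along its path from the root:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy tree: root -> {animal, plant}; animal -> {cat, dog}.
children = {"root": ["animal", "plant"], "animal": ["cat", "dog"]}
parents = {c: p for p, cs in children.items() for c in cs}

# The network emits one logit vector per internal node (concatenated in practice).
logits = {"root": np.array([1.0, 0.2]),     # over (animal, plant)
          "animal": np.array([0.3, -0.1])}  # over (cat, dog)

# Conditional distribution over children given each parent.
cond = {p: dict(zip(children[p], softmax(z))) for p, z in logits.items()}

def joint(node):
    """Probability of a node = product of conditionals along its root path."""
    prob = 1.0
    while node in parents:
        prob *= cond[parents[node]][node]
        node = parents[node]
    return prob

print(joint("cat"))  # p(animal | root) * p(cat | animal)
```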
Brust and Denzler [4]
generalised the conditional approach to multi-path labels in a DAG hierarchy by replacing the softmax
with a sigmoid for each class given each of its parents. Inference was performed by seeking the class