Hierarchical classification
at multiple operating points
Jack Valmadre
Australian Institute for Machine Learning, University of Adelaide
jack.valmadre@adelaide.edu.au
Abstract
Many classification problems consider classes that form a hierarchy. Classifiers
that are aware of this hierarchy may be able to make confident predictions at a
coarse level despite being uncertain at the fine-grained level. While it is generally
possible to vary the granularity of predictions using a threshold at inference
time, most contemporary work considers only leaf-node prediction, and almost
no prior work has compared methods at multiple operating points. We present an
efficient algorithm to produce operating characteristic curves for any method that
assigns a score to every class in the hierarchy. Applying this technique to evaluate
existing methods reveals that top-down classifiers are dominated by a naïve flat
softmax classifier across the entire operating range. We further propose two novel
loss functions and show that a soft variant of the structured hinge loss is able to
significantly outperform the flat baseline. Finally, we investigate the poor accuracy
of top-down classifiers and demonstrate that they perform relatively well on unseen
classes. Code is available online at https://github.com/jvlmdr/hiercls.
1 Introduction
Many classification problems involve classes that can be recursively grouped into a hierarchy of
progressively larger superclasses. This hierarchy can be represented by a directed graph where the
nodes are classes and the edges define a superset-of relation. Knowledge of the hierarchy can be
useful in many different respects. For example, mistake severity can be quantified using distance in
the graph [9, 2], classifiers can make predictions at a coarse level to avoid an error at the fine-grained
level [11, 39], classes with few labels in a long-tailed distribution can benefit from the examples of
similar classes [39], and a cascade can be used to reduce inference time [26, 16, 14].
Whereas many recent works consider only leaf-node prediction, we are interested in the setting where
the classifier may predict any class in the hierarchy, including internal nodes. In this setting, prediction
involves an inherent trade-off between specificity and correctness: more general predictions contain
less information but have a greater chance of being correct. The trade-off can typically be controlled
using an inference threshold, analogous to the trade-off between sensitivity and specificity in binary
classification or that between precision and recall in detection. However, while it is standard to
evaluate these problems using trade-off curves, most works on hierarchical classification consider
only a single operating point. It is important to consider the full trade-off curve in order to ensure a
fair comparison and enable the selection of classifiers according to design specifications. This paper
presents an efficient algorithm to obtain trade-off curves for existing hierarchical metrics.
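To make this threshold-based inference concrete, the following sketch (in Python, with names of our own choosing; it is an illustration rather than the exact rule evaluated later) predicts the most specific class whose score clears a confidence threshold, falling back to the root when no class qualifies. Sweeping the threshold from low to high traces out the trade-off from specific-but-risky to general-but-safe predictions.

```python
def predict_at_threshold(scores, parent, threshold):
    """Predict the deepest class whose score is at least `threshold`.

    scores: dict mapping each class to a score in [0, 1], assumed to be
            non-increasing from root to leaf (a parent scores at least as
            high as any of its children).
    parent: dict mapping each class to its parent, with the root mapped to None.
    """
    root = next(c for c, p in parent.items() if p is None)

    def depth(c):
        d = 0
        while parent[c] is not None:
            c, d = parent[c], d + 1
        return d

    admissible = [c for c, s in scores.items() if s >= threshold]
    if not admissible:
        return root  # fall back to the most general (least informative) prediction
    # Most specific admissible class; ties broken by score.
    return max(admissible, key=lambda c: (depth(c), scores[c]))
```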
The proposed technique is subsequently used to compare the trade-offs realised by different methods.
Given the effectiveness of deep learning, we focus on loss functions for training differentiable models.
Experiments on the iNat21 [36] and ImageNet-1k [10] datasets for image classification reveal that
a naïve flat softmax classifier dominates the more elegant top-down classifiers, obtaining better
accuracy at any operating point. We further introduce a soft structured-prediction loss that dominates
the flat softmax.
While it was already known that top-down approaches provide worse leaf-node predictions than a
flat softmax [31,2], we did not expect this to hold for non-leaf predictions. We hypothesise that the
top-down approaches obtain worse accuracy than the “bottom-up” flat softmax because the coarse
classes can be highly diverse, and thus better learned by a union of distinct classifiers when the
fine-grained labels are available. To support this hypothesis, we show that training a flat softmax
classifier at a lower level and testing at a higher level provides better accuracy than training at
the higher level. Finally, we consider a synthetic out-of-distribution problem where a hierarchical
classifier that explicitly assigns scores to higher-level classes may be expected to have an advantage.
The key contributions of this paper are as follows.
• We introduce a novel, efficient technique for evaluating hierarchical classifiers that captures the full trade-off between specificity and correctness using a threshold-based inference rule.
• We propose two novel loss functions, soft-max-descendant and soft-max-margin, and entertain a simplification of Deep RTC [39] which we refer to as parameter sharing (PS) softmax. While soft-max-descendant is ineffective, soft-max-margin and PS softmax achieve the best results.
• We conduct an empirical comparison of loss functions and inference rules using the iNat21-Mini dataset and its hierarchy of 10,000 species. The naïve softmax is found to be surprisingly effective, and the simple threshold-based inference rule performs well compared to alternative options.
• We evaluate the robustness of different methods to unseen leaf-node classes. The top-down methods are more competitive for unseen classes, while PS softmax is the most effective.
2 Related work
There are several different types of hierarchical classification problem. In the terminology of Silla and
Freitas [33], we consider problems with tree structure, Single-Path Labels (SPL) and Non-Mandatory
Leaf-Node Prediction (NMLNP). Tree structure means that there is a unique root node and every
other node has exactly one parent, Single-Path Labels means that a sample cannot belong to two
separate classes (unless one is a superclass of the other) and Non-Mandatory Leaf-Node Prediction
means that the classifier may predict any class in the hierarchy, not just leaf nodes. This work assumes
the hierarchy is known and does not consider the distinct problem of learning the hierarchy.
MLNP with deep learning.
Several recent works have sought to leverage a class hierarchy to improve leaf-node prediction.
Wu et al. [38] trained a softmax classifier by optimising a “win” metric comprising a weighted
combination of likelihoods on the path from the root to the label. Bertinetto et al. [2] proposed a
similar Hierarchical Cross-Entropy (HXE) loss comprising weighted losses for the conditional
likelihood of each node given its parent, as well as a Soft Label loss that generalised label
smoothing [27] to incorporate the hierarchy. Karthik et al. [20] demonstrated that a flat softmax
classifier is still effective for MLNP and proposed an alternative inference method, Conditional Risk
Minimisation (CRM). Guo et al. [18] performed inference using an RNN starting from the root node.
Other works have proposed architectures that reflect the hierarchy (e.g. [41, 1, 43]). This paper seeks
to compare loss functions under a simple inference procedure.
Non-MLNP methods.
Comparatively few works have entertained hierarchical classifiers that can predict arbitrary nodes.
Deng et al. [11] highlighted the trade-off between specificity and accuracy, and sought to obtain the
most-specific classifier for a given error rate. Davis et al. [7, 8] re-calibrated a flat softmax classifier
within a hierarchy using a held-out set and performed inference by starting at the most-likely
leaf-node and climbing the tree until a threshold was met, comparing methods at several fixed
thresholds. Wu et al. [39] proposed the Deep Realistic Taxonomic Classifier (Deep RTC), which
obtains a score for each node by summing over those of its ancestors [32] and whose loss is evaluated
at random cuts of the tree. They perform inference by traversing down from the root until a score
threshold is crossed (by default, zero). In the YOLO-9000 detector, Redmon and Farhadi [31]
introduced a conditional softmax classifier that resembled the efficiency-driven approach of Morin
and Bengio [26]. The model outputs a concatenation of logit vectors that each parametrise (via
softmax) the conditional distribution of children given a parent. It was noted to provide elegant
degradation but the hierarchical predictions were not rigorously evaluated. Brust and Denzler [4]
generalised the conditional approach to multi-path labels in a DAG hierarchy by replacing the softmax
with a sigmoid for each class given each of its parents. Inference was performed by seeking the class
with the maximum likelihood excluding its children, which may predict non-leaf nodes. Most of the
above loss functions will be compared in the main evaluation.
Hierarchical metrics.
There are many ways to measure the accuracy of hierarchical classifiers [6, 21]. For the specific case
of NMLNP, it is important to consider metrics for both correctness and specificity. The Wu-Palmer
metrics [40] measure precision and recall using the depth of the Lowest Common Ancestor (LCA).
Deng et al. [11] proposed to measure specificity using information content to account for an
imbalanced tree, and Zhao et al. [42] modified the Wu-Palmer metrics to use information rather than
depth. These are the metrics that we adopt, although our technique can be applied to construct curves
for any simple metric. Examples of other metrics include the fraction of examples receiving non-root
classifications [7] and the fraction of true-negative leaf nodes for correct predictions [39].
Structured prediction.
The max-margin structured-prediction loss has seen historical use for hierarchical classification with
linear SVMs, originally for document classification [5] and later to achieve efficient inference [34].
More recently, a soft max-margin loss [29, 15] has been used to train deep models for
sequence-to-sequence learning [13] and long-tail classification [25]. We believe this paper is the first
to recognise its utility for hierarchical classification with deep learning.
3 Problem definition
3.1 Class hierarchy
Let $\mathcal{Y}$ be the set of classes, including both leaf nodes and their superclasses. The hierarchy is defined
by a tree with edges $E \subset \mathcal{Y}^2$. The edges define a transitive, non-strict superset relation $\supseteq$ over the
classes. It is helpful to think of the classes as sets of samples. A sample $x$ that belongs to class $y$ also
belongs to the ancestors (superclasses) of $y$; that is, $x \in y$ and $y \subseteq y'$ implies $x \in y'$. Hierarchical
classification can be seen as multi-label classification, since samples belong to multiple classes
simultaneously. However, samples cannot belong to arbitrary subsets of $\mathcal{Y}$, as belonging to one
class implies belonging to its superclasses. Furthermore, we consider only class hierarchies in which
siblings are mutually exclusive; that is, a sample $x$ can only belong to two classes $y$ and $y'$ if one is a
superclass of the other. The problem is therefore to assign a single label $y \in \mathcal{Y}$ to an example, with
this label signifying membership in class $y$ and all of its superclasses. While any superclass of the
ground-truth label is deemed to be a correct classification, it is preferable for the classifier to predict
the most-specific correct label.
We briefly introduce our notation for the tree. Let $r \in \mathcal{Y}$ be the unique root node, let $\pi(y) \in \mathcal{Y}$ be
the unique parent for every node $y \in \mathcal{Y} \setminus \{r\}$, let $\mathcal{C}(y) \subseteq \mathcal{Y}$ be the children of node $y$, let $\mathcal{L} \subseteq \mathcal{Y}$ be
the set of leaf nodes, let $\mathcal{A}(y) \subseteq \mathcal{Y}$ be the ancestors of node $y$, and let $\mathcal{D}(y) \subseteq \mathcal{Y}$ be the descendants
of node $y$. We define the ancestors and descendants inclusively, such that $\mathcal{D}(y) \cap \mathcal{A}(y) = \{y\}$. It
will also be useful to introduce $\mathcal{L}(y) = \mathcal{L} \cap \mathcal{D}(y)$ to denote the set of leaf descendants of node $y$ and
$\mathcal{S}(y) = \mathcal{C}(\pi(y))$ to denote its siblings. The superset relation over classes is equivalent to the ancestor
relation in the tree; that is, $u \supseteq v$ iff $u \in \mathcal{A}(v)$.
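For concreteness, the following is a minimal sketch of this notation as a small data structure, assuming the hierarchy is supplied as a list of (parent, child) edges; the class and method names are ours and are not taken from the released code.

```python
from collections import defaultdict

class Hierarchy:
    """Tree hierarchy over classes, built from (parent, child) edges."""

    def __init__(self, edges):
        self.parent = {v: u for u, v in edges}          # pi(y), defined for y != r
        self.children = defaultdict(set)                # C(y)
        for u, v in edges:
            self.children[u].add(v)
        nodes = {u for edge in edges for u in edge}
        self.root = next(u for u in nodes if u not in self.parent)   # r
        self.leaves = {u for u in nodes if not self.children[u]}     # L

    def ancestors(self, y):
        """A(y): inclusive ancestors of y, from y up to the root."""
        out = [y]
        while out[-1] != self.root:
            out.append(self.parent[out[-1]])
        return set(out)

    def descendants(self, y):
        """D(y): inclusive descendants of y."""
        out, stack = set(), [y]
        while stack:
            u = stack.pop()
            out.add(u)
            stack.extend(self.children[u])
        return out

    def leaf_descendants(self, y):
        """L(y): leaf nodes that are descendants of y."""
        return self.leaves & self.descendants(y)

    def siblings(self, y):
        """S(y) = C(pi(y)); defined for non-root nodes."""
        return set(self.children[self.parent[y]])

    def is_superclass(self, u, v):
        """u is a superclass of v iff u is in A(v)."""
        return u in self.ancestors(v)
```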
We focus on models that map a sample $x$ to a conditional likelihood $p(y \mid x)$ over all labels in the
hierarchy. If we consider $\mathcal{Y}$ with its superset relation and mutual exclusivity of siblings as an event
space, then a function $p : \mathcal{Y} \to [0, 1]$ must satisfy
$\forall u \in \mathcal{Y}: \quad p(u) \geq \sum_{v \in \mathcal{C}(u)} p(v) \quad (1)$
to be a probability function on $\mathcal{Y}$. If this holds with equality, we say that the children are exhaustive.
We generally expect that the root node is uninformative, $p(r) = 1$.
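As an illustration, constraint (1) and the root condition can be verified directly from a table of likelihoods; the sketch below assumes `p` is a dict mapping every class to its likelihood and `h` is a hierarchy object like the sketch above (the tolerance is an arbitrary choice).

```python
def is_valid_hierarchical_distribution(p, h, tol=1e-6):
    """Check p(r) = 1 and p(u) >= sum over children v of p(v), for every class u."""
    if abs(p[h.root] - 1.0) > tol:  # root should be uninformative
        return False
    for u in p:
        child_mass = sum(p[v] for v in h.children[u])
        if child_mass > p[u] + tol:  # children may not carry more mass than the parent
            return False
    return True
```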
Metrics $M(y, \hat{y})$ measure the quality of predicted label $\hat{y}$ for ground-truth label $y$. When making
non-leaf predictions, it is important to capture both correctness and specificity, either using a pair of
metrics or a single combined metric. The simplest such pair are the binary metrics
$\mathrm{Correct}(y, \hat{y}) = [\hat{y} \supseteq y], \qquad \mathrm{Exact}(y, \hat{y}) = [\hat{y} = y]. \quad (2)$
If the ground-truth label is not a leaf node, then it is possible for the predicted label to be below it,
$\hat{y} \subset y$. This is not considered an error, and we replace $\hat{y} \leftarrow y$ where this occurs.
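These binary metrics, together with the replacement rule for predictions below a non-leaf ground truth, can be written directly against the hierarchy sketch above; the helper names here are ours.

```python
def truncate_prediction(h, y, y_hat):
    """If the prediction lies strictly below the ground truth, replace it by y."""
    return y if (y_hat != y and y_hat in h.descendants(y)) else y_hat

def correct(h, y, y_hat):
    """Correct(y, y_hat): the prediction is y itself or one of its superclasses."""
    return y_hat in h.ancestors(y)

def exact(y, y_hat):
    """Exact(y, y_hat): the prediction is exactly the ground-truth label."""
    return y_hat == y
```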