accuracy at any operating point. We further introduce a soft structured-prediction loss that dominates
the flat softmax.
While it was already known that top-down approaches provide worse leaf-node predictions than a
flat softmax [31,2], we did not expect this to hold for non-leaf predictions. We hypothesise that the
top-down approaches obtain worse accuracy than the “bottom-up” flat softmax because the coarse
classes can be highly diverse, and are thus better learned by a union of distinct classifiers when the
fine-grained labels are available. To support this hypothesis, we show that training a flat softmax
classifier at a lower level and testing at a higher level provides better accuracy than training at
the higher level. Finally, we consider a synthetic out-of-distribution problem where a hierarchical
classifier that explicitly assigns scores to higher-level classes may be expected to have an advantage.
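To make the coarse-level evaluation above concrete: testing a leaf-level softmax classifier at a higher level only requires summing its predicted probabilities within each coarse class. A minimal sketch, with hypothetical class assignments and probabilities:

```python
import numpy as np

# Hypothetical mapping: leaves 0,1 belong to coarse class 0; leaves 2,3 to class 1.
coarse_of_leaf = np.array([0, 0, 1, 1])
leaf_probs = np.array([0.1, 0.5, 0.3, 0.1])  # softmax output over the fine classes

# A coarse class's probability is the sum of its leaves' probabilities.
coarse_probs = np.zeros(coarse_of_leaf.max() + 1)
np.add.at(coarse_probs, coarse_of_leaf, leaf_probs)

print(coarse_probs)           # [0.6 0.4]
print(coarse_probs.argmax())  # coarse prediction of the fine-grained classifier
```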
The key contributions of this paper are as follows.
• We introduce a novel, efficient technique for evaluating hierarchical classifiers that captures the full trade-off between specificity and correctness using a threshold-based inference rule (see the sketch after this list).
• We propose two novel loss functions, soft-max-descendant and soft-max-margin, and entertain a simplification of Deep RTC [39] which we refer to as parameter sharing (PS) softmax. While soft-max-descendant is ineffective, soft-max-margin and PS softmax achieve the best results.
• We conduct an empirical comparison of loss functions and inference rules using the iNat21-Mini dataset and its hierarchy of 10,000 species. The naïve softmax is found to be surprisingly effective, and the simple threshold-based inference rule performs well compared to alternative options.
• We evaluate the robustness of different methods to unseen leaf-node classes. The top-down methods are more competitive for unseen classes, while PS softmax is the most effective.
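To illustrate the threshold-based inference rule referenced above, the sketch below assigns each node the total probability of its leaf descendants and predicts the deepest node whose probability exceeds a threshold; sweeping the threshold traces out the specificity/correctness trade-off. The toy tree, probabilities and tie-breaking rule are illustrative assumptions, not the exact rule from the paper.

```python
# Toy tree: root -> {animal, plant}; animal -> {cat, dog}.
parents = {"animal": "root", "plant": "root", "cat": "animal", "dog": "animal"}
leaf_probs = {"cat": 0.45, "dog": 0.35, "plant": 0.20}  # flat softmax over leaves

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return path

# Each node's probability is the total probability of its leaf descendants.
node_probs = {}
for leaf, p in leaf_probs.items():
    for node in ancestors(leaf):
        node_probs[node] = node_probs.get(node, 0.0) + p

def predict(threshold):
    """Deepest node whose probability exceeds the threshold (ties: higher prob)."""
    valid = [n for n, p in node_probs.items() if p > threshold]
    return max(valid, key=lambda n: (len(ancestors(n)), node_probs[n]))

print(predict(0.5))  # 'animal': no single leaf clears 0.5, but cat+dog = 0.8 does
print(predict(0.3))  # 'cat': the most likely node at full specificity
```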
2 Related work
There are several different types of hierarchical classification problem. In the terminology of Silla and
Freitas [33], we consider problems with tree structure, Single-Path Labels (SPL) and Non-Mandatory
Leaf-Node Prediction (NMLNP). Tree structure means that there is a unique root node and every
other node has exactly one parent; Single-Path Labels means that a sample cannot belong to two
separate classes (unless one is a superclass of the other); and Non-Mandatory Leaf-Node Prediction
means that the classifier may predict any class in the hierarchy, not just leaf nodes. This work assumes
the hierarchy is known and does not consider the distinct problem of learning the hierarchy.
MLNP with deep learning.
Several recent works have sought to
improve leaf-node prediction by leveraging a class hierarchy. Wu et al. [38] trained a softmax classifier by optimising a “win” metric
comprising a weighted combination of likelihoods on the path from the root to the label. Bertinetto et
al. [2] proposed a similar Hierarchical Cross-Entropy (HXE) loss comprising weighted losses for
the conditional likelihood of each node given its parent, as well as a Soft Label loss that generalised
label smoothing [27] to incorporate the hierarchy.
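Both [38] and HXE weight likelihoods along the root-to-label path. A minimal sketch of the HXE idea, assuming node probabilities have already been obtained by summing leaf probabilities over descendants, with illustrative values and the exponentially decaying weights described by Bertinetto et al.:

```python
import numpy as np

# Probabilities along the path from the label leaf to the root,
# e.g. p(cat), p(animal), p(root); illustrative values only.
path_probs = np.array([0.45, 0.80, 1.00])

alpha = 0.5  # decay hyper-parameter; larger alpha focuses the loss near the leaf
cond = path_probs[:-1] / path_probs[1:]          # p(node | parent) along the path
weights = np.exp(-alpha * np.arange(len(cond)))  # heavier weight near the leaf
hxe = -(weights * np.log(cond)).sum()
print(hxe)
```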
Karthik et al. [20] demonstrated that a flat softmax
classifier is still effective for MLNP and proposed an alternative inference method, Conditional Risk
Minimisation (CRM). Guo et al. [18] performed inference using an RNN starting from the root node.
Other works have proposed architectures that reflect the hierarchy (e.g. [41,1,43]). This paper seeks
to compare loss functions under a simple inference procedure.
Non-MLNP methods.
Comparatively few works have entertained hierarchical classifiers that can
predict arbitrary nodes. Deng et al. [11] highlighted the trade-off between specificity and accuracy,
and sought to obtain the most-specific classifier for a given error rate. Davis et al. [7,8] re-calibrated
a flat softmax classifier within a hierarchy using a held-out set and performed inference by starting
at the most-likely leaf node and climbing the tree until a threshold was met, comparing methods
at several fixed thresholds.
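A minimal sketch of this bottom-up procedure on a toy tree (the held-out calibration step is omitted; probabilities and threshold are illustrative):

```python
# Toy tree: root -> {animal, plant}; animal -> {cat, dog}.
parents = {"animal": "root", "plant": "root", "cat": "animal", "dog": "animal"}
leaf_probs = {"cat": 0.45, "dog": 0.35, "plant": 0.20}  # calibrated leaf probabilities

# Node probability = total probability of the node's leaf descendants.
node_probs = dict(leaf_probs)
for leaf, p in leaf_probs.items():
    node = leaf
    while node in parents:
        node = parents[node]
        node_probs[node] = node_probs.get(node, 0.0) + p

def climb(threshold):
    """Start at the most likely leaf and climb until the threshold is met."""
    node = max(leaf_probs, key=leaf_probs.get)
    while node_probs[node] < threshold and node in parents:
        node = parents[node]
    return node

print(climb(0.5))  # 'animal': cat alone (0.45) is below 0.5, cat+dog (0.8) is not
```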
Wu et al. [39] proposed the Deep Realistic Taxonomic Classifier (Deep
RTC), which obtains a score for each node by summing the scores of its ancestors [32] and whose
loss is evaluated at random cuts of the tree. They perform inference by traversing down from the
root until a score threshold is crossed (by default, zero).
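A minimal sketch of the ancestor-sum scoring and thresholded top-down traversal, with hypothetical scalar parameters standing in for the learned per-node predictors (an illustration of the idea, not the authors' implementation):

```python
# Hypothetical per-node parameters; a node's score is the sum of its own
# parameter and those of its ancestors (the ancestor-sum idea of [32, 39]).
theta = {"root": 0.0, "animal": 1.2, "plant": -0.4, "cat": 0.5, "dog": -0.2}
parents = {"animal": "root", "plant": "root", "cat": "animal", "dog": "animal"}
children = {"root": ["animal", "plant"], "animal": ["cat", "dog"]}

def score(node):
    s = theta[node]
    while node in parents:
        node = parents[node]
        s += theta[node]
    return s

def predict(threshold=0.0):
    """Descend from the root along the best child; stop when no child's
    score crosses the threshold (zero by default, as in Deep RTC)."""
    node = "root"
    while node in children:
        best = max(children[node], key=score)
        if score(best) <= threshold:
            break
        node = best
    return node

print(predict())  # descends root -> animal -> cat
```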
In the YOLO-9000 detector, Redmon and
Farhadi [31] introduced a conditional softmax classifier that resembled the efficiency-driven approach
of Morin and Bengio [26]. The model outputs a concatenation of logit vectors that each parametrise
(via softmax) the conditional distribution of children given a parent. It was noted to provide elegant
degradation, but the hierarchical predictions were not rigorously evaluated.
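A minimal sketch of such a conditional softmax on a toy tree, with hypothetical logits; the probability of any node is the product of the conditionals along its path from the root:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy tree: root -> {animal, plant}; animal -> {cat, dog}.
children = {"root": ["animal", "plant"], "animal": ["cat", "dog"]}
parents = {c: p for p, cs in children.items() for c in cs}

# The network emits one logit vector per internal node (concatenated in practice).
logits = {"root": np.array([1.0, 0.2]),     # over (animal, plant)
          "animal": np.array([0.3, -0.1])}  # over (cat, dog)

# Conditional distribution over children given each parent.
cond = {p: dict(zip(children[p], softmax(z))) for p, z in logits.items()}

def joint(node):
    """Probability of a node = product of conditionals along its root path."""
    prob = 1.0
    while node in parents:
        prob *= cond[parents[node]][node]
        node = parents[node]
    return prob

print(joint("cat"))  # p(animal | root) * p(cat | animal)
```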
Brust and Denzler [4]
generalised the conditional approach to multi-path labels in a DAG hierarchy by replacing the softmax
with a sigmoid for each class given each of its parents. Inference was performed by seeking the class