Concise and interpretable multi-label rule sets
Martino Ciaperoni
Department of Computer Science
Aalto University
Espoo, Finland
martino.ciaperoni@aalto.fi
Han Xiao
Department of Computer Science
Aalto University
Espoo, Finland
han.xiao@aalto.fi
Aristides Gionis
Division of Theoretical Computer Science
KTH Royal Institute of Technology
Stockholm, Sweden
argioni@kth.se
Abstract—Multi-label classification is becoming increasingly ubiquitous, but not much attention has been paid to interpretability. In this paper, we develop a multi-label classifier that can be represented as a concise set of simple "if-then" rules, and thus offers better interpretability compared to black-box models. Notably, our method is able to find a small set of relevant patterns that lead to accurate multi-label classification, while existing rule-based classifiers are myopic and wasteful in searching for rules, requiring a large number of rules to achieve high accuracy. In particular, we formulate the problem of choosing multi-label rules to maximize a target function that considers not only discrimination ability with respect to labels, but also diversity. Accounting for diversity helps to avoid redundancy, and thus to control the number of rules in the solution set. To tackle this maximization problem we propose a 2-approximation algorithm, which relies on a novel technique to sample high-quality rules. In addition to our theoretical analysis, we provide a thorough experimental evaluation, which indicates that our approach offers a trade-off between predictive performance and interpretability that is unmatched in previous work.
I. INTRODUCTION
Machine-learning algorithms are nowadays used in almost every domain. While such algorithms are known to perform well in many tasks, they are often used as "black boxes," i.e., the decision processes involved are too complex for humans to interpret. The lack of interpretability considerably limits the trust humans put in machine-learning algorithms and thus poses a barrier to the wide adoption of machine-learning techniques in the real world. In an attempt to overcome this barrier, interpretable and explainable machine learning have recently emerged as increasingly prominent topics. In the standard classification setting, the goal is to learn a classifier that accurately maps data points to two or more mutually exclusive classes.

In this paper, we focus on a different setting, namely, multi-label classification. In contrast to the standard setting, in multi-label classification a point can be associated with more than one class at the same time. Though multi-label classification has been extensively studied, the main focus is still on improving predictive performance [24]. Significantly less attention has been paid to interpretability.
Classification rules, owing to their simple structure, are gaining popularity in the interpretable multi-label classification literature. In rule-based approaches, the goal is to learn a set of rules that captures the most prominent patterns between features and labels in the data. A rule usually takes the form "{set of predicates} → {set of labels}." For a given data point, a rule predicts the associated labels to be present if all the predicates in the rule evaluate to true. Due to the structural simplicity of rules, classifiers based on a set of rules are generally considered more interpretable than other types of classifiers, such as neural networks or even decision trees.
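As a concrete illustration of this rule semantics, here is a minimal sketch (our own Python illustration, not the paper's implementation), where heads and tails are encoded as sets of feature and label names:

```python
# A multi-label rule: if all head features are present, predict all tail labels.
# Feature and label names below are made up for illustration.

def rule_predicts(head, tail, features):
    """Return the labels the rule (head -> tail) predicts for a record."""
    if head <= features:      # all predicates (features) evaluate to true
        return set(tail)
    return set()              # rule does not fire: no labels predicted

rule_head = {"learning", "neural"}
rule_tail = {"machine-learning", "AI"}

record = {"learning", "neural", "network"}
print(sorted(rule_predicts(rule_head, rule_tail, record)))  # ['AI', 'machine-learning']
print(rule_predicts(rule_head, rule_tail, {"learning"}))    # set(): one predicate is false
```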
The research question that we bring forward is whether we can design rule-based multi-label classification methods that are both accurate and interpretable. BOOMER [19], [20], a recently proposed rule-based classifier based on gradient boosting, gives promising results in accuracy. However, despite being a rule-based approach, its interpretability is limited because it produces a set of rules that is both too large and redundant.

In this work, we propose CORSET, a rule-based method that significantly improves over the state-of-the-art BOOMER. The improvement is due to (1) reducing rule redundancy, which is achieved by incorporating a term in our objective that penalizes rule overlap, and (2) explicitly limiting the complexity of rules via a suite of novel sampling schemes. As a result, our method produces a concise set of interpretable rules. An illustration of the concept of our approach is given in Fig. 1.
Example. To illustrate the improvement of CORSET over BOOMER, we consider as an example the bibtex dataset, where each data point represents a scientific article, with bag-of-words as features and topics as labels. We first consider predictive performance as a function of the number of rules. In Fig. 2, we show the (micro-averaged) balanced F1 scores, a popular measure for multi-label classification used throughout this paper, for both CORSET and BOOMER. Due to the conciseness of its learned rules, CORSET achieves a score close to 0.36 with about 100 rules, whereas BOOMER needs over 800 rules to achieve similar performance. Note that CORSET's performance starts to drop after about 100 rules, as there are no more good rules to learn. The drop indicates overfitting, which can be addressed by standard methods, e.g., cross-validation. In addition, Fig. 3 demonstrates the conciseness of the rules found by CORSET vs. those found by BOOMER. Here, we show a subset of rules as a bipartite graph, where nodes at the top represent labels and nodes at the bottom represent predicates (features). Rules are represented by colors, and two nodes are connected if they are part of the same rule. CORSET uses fewer rules than BOOMER, and its rules tend to contain fewer predicates, resulting in a sparser graph.
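As background for the scores reported above, standard micro-averaged F1 pools true positives, false positives, and false negatives across all labels before computing precision and recall (the paper's balanced variant may adjust this further). A minimal sketch, with one label set per data point:

```python
def micro_f1(true_labels, pred_labels):
    """Micro-averaged F1 over a list of (true, predicted) label-set pairs."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        tp += len(t & p)   # labels predicted and actually present
        fp += len(p - t)   # labels predicted but absent
        fn += len(t - p)   # labels present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [{"a", "b"}, {"b"}, {"c"}]
y_pred = [{"a"}, {"b", "c"}, {"c"}]
print(round(micro_f1(y_true, y_pred), 3))  # 0.75
```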
arXiv:2210.01533v2 [cs.LG] 7 Nov 2022
Fig. 1: Illustration of the concept of our approach for multi-label rule selection. A toy dataset is visualized as a feature matrix and a label matrix. Four rules are shown as colored regions. Regions covered by the same rule are connected by a dashed arrow. The rules in green are chosen because of accuracy, generality, and diversity. The rules in red are discarded.
Fig. 2: Micro F1, as a function of the number of rules, for CORSET vs. BOOMER on the bibtex dataset. (The figure annotates the point where overfitting starts and there are no more good rules to learn.)
Concretely, in this work we make the following contributions:
• We frame the problem of learning concise rule sets as an optimization problem. The problem is NP-hard, and our proposed algorithm CORSET, given a set of rule candidates, achieves an approximation ratio of 2.
• The performance of CORSET depends on the quality of the candidate rules. To find good rules efficiently, we design a suite of fast sampling algorithms with probabilistic guarantees, as well as an effective heuristic.
• Our experiments show that CORSET achieves competitive predictive performance compared to the state-of-the-art, while offering significantly better interpretability.
The rest of this paper is organized as follows. Section II discusses related work. Section III formalizes the problem we consider. Section IV presents CORSET, omitting the details of the rule-sampling algorithms it relies on, which are described in Sections V and VI. Afterwards, Section VII analyzes the complexity of CORSET, and finally Section VIII presents a thorough experimental evaluation of CORSET.
II. RELATED WORK
Multi-label classification. In multi-label classification the goal is to learn a function that maps input points to one or more predefined categories. For instance, a song can be associated with multiple music genres. A plethora of algorithms have been proposed for this problem; interested readers may refer to a recent survey [24]. The simplest approaches for multi-label classification are the so-called transformation methods, which convert the original problem into multiple single-label classification problems.

Fig. 3: An example set of rules returned by our algorithm (top) and BOOMER (bottom) on the bibtex dataset. We depict all rules for the set of labels {SNA, socialnets, social, networks, analysis}.

The main drawback of these approaches is that they fail to capture label correlations. To overcome this issue, label power-set approaches map each distinct set of labels to a unique meta-label, which serves as the target label for a single-label classifier. Clearly, these approaches do not scale with the number of labels, and the pruned problem transformation method [21] has been proposed as a remedy.
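The label power-set transformation described above can be sketched in a few lines (our own illustration; the label names are hypothetical): each distinct label combination becomes one meta-label, and any single-label classifier can then be trained on the transformed targets.

```python
def powerset_transform(label_sets):
    """Map each distinct label set to a meta-label id (label power-set transformation)."""
    meta = {}
    targets = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in meta:
            meta[key] = len(meta)   # assign a new meta-label to an unseen combination
        targets.append(meta[key])
    return targets, meta

y = [{"rock"}, {"rock", "pop"}, {"rock"}, {"jazz"}]
targets, meta = powerset_transform(y)
print(targets)    # [0, 1, 0, 2]
print(len(meta))  # 3 distinct meta-labels
```

The scaling problem is visible here: with |L| labels, up to 2^|L| meta-labels may appear, which is why pruning remedies were proposed.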
Another line of research focuses on designing ad-hoc multi-label classification methods by extending existing single-label algorithms. Examples include adaptations of support vector machines [7], k-nearest-neighbor classifiers [28], and perceptrons [6].
Interpretable machine learning. There is no agreed formal definition of interpretability, but it can be loosely defined as the degree to which a human can understand the cause of a decision [16]. Broadly speaking, interpretability in machine learning can be achieved by constraining the complexity of algorithms so that the process behind an algorithm's decision is understandable to humans. A related topic is explainable machine learning, where the goal is to provide explanations for the predictions of black-box models.
Rule-based approaches to single-label classification. Research in interpretable machine learning has boomed in recent years. Rule-based (or associative) approaches have shown promising potential, because decisions are driven by a simple set of "if-then" rules. Liu et al. [15] are among the first to investigate association rule mining for single-label classification tasks, followed by extensions such as MCAR [22] and ADA [25]. These approaches are conceptually similar, but differ in their methodologies for rule learning, ranking, pruning, and prediction.
Concise rule sets. Our work pursues for the first time the goal of designing a multi-label associative classifier that achieves a given classification performance with the smallest possible number of rules. A similar objective has recently been considered in the context of single-label classification. In particular, Zhang et al. [26] frame the problem of learning a set of classification rules as an optimization problem with an objective function combining rule quality and diversity. A 2-approximation algorithm is then proposed to solve this problem, which relies on existing frameworks for max-sum diversification and pattern sampling. In this paper, we investigate how to extend these ideas to the multi-label classification setting. The problem of controlling the number of rules has also been studied for rule boosting, where learned rules are combined additively [3]. An extension to multi-label classification represents a possible direction for future work. In addition to the number of rules, conciseness of a rule set, and thus interpretability, has been defined in terms of the number of conditions [10] as well as the Minimum Description Length principle [8].
Rule-based approaches to multi-label classification. In general, adaptation from the single-label to the multi-label setting is not trivial, and while single-label associative classification has been studied extensively, relatively few attempts have been made at associative multi-label classification. In an early work, Thabtah et al. [23] propose a label-ranking-based assignment method. More recently, new approaches have been developed, and SECO [12] and BOOMER [20] are the state-of-the-art in the current literature on rule-based multi-label classification. The main limitation of the existing works, addressed in our paper, is that they use a very large set of highly redundant rules, which hinders interpretability. We compare our method against SECO [12] and BOOMER [20] in Section VIII.
Pattern sampling. Association-pattern discovery is challenging due to the prohibitive size of the pattern space. This challenge is inherited by rule-based classifiers. To avoid exhaustively searching the pattern space, efficient pattern-sampling methods have been proposed [1], [2]. In this work we extend these sampling methods to efficiently find high-quality candidate multi-label rules, as discussed in detail in Section V.
III. PROBLEM STATEMENT
At a high level, our objective is to capture the relevant
patterns in the data that best discriminate a set of labels and
are as concise as possible. Next we formally define the problem.
A. Preliminaries
We denote sets and multisets by uppercase letters, e.g., $X$. For a finite set $X$, we denote by $\mathcal{P}(X)$ its power set. We consider a binary dataset $\mathcal{D}$ over a feature set $\mathcal{F}$ and a label set $\mathcal{L}$. The dataset $\mathcal{D}$ is a set of data records $D_1, \dots, D_n$. A data record $D = (F, L)$ consists of a set of features $F \subseteq \mathcal{F}$ and a set of labels $L \subseteq \mathcal{L}$. We denote by $F_D$ and $L_D$ the feature set and label set of $D$, respectively. Furthermore, we denote by $|\mathcal{F}|$ and $|\mathcal{L}|$ the dimensions of the feature and label space, respectively, and by $|\mathcal{D}|$ the total number of data records. We use $\|\mathcal{F}\|$ and $\|\mathcal{L}\|$ to refer to the total number of feature and label occurrences over all data records.
In multi-label classification, the goal is to learn a function mapping as accurately as possible the features $F_D$ to one or more labels $L_D$. We use mappings consisting of conjunctive rules. A conjunctive rule $R = (H \rightarrow T)$ consists of a non-empty feature set $H$ (called the head) and a non-empty label set $T$ (called the tail). The head $H$ can be viewed as a predicate $H : \{0,1\}^{|\mathcal{F}|} \rightarrow \{\text{true}, \text{false}\}$, which states whether an instance $F$ contains all the features in $H$. If the predicate evaluates to true for some instance, the tail $T$ of $R$ specifies that the labels in $T$ should be predicted as present.
We say that a head $H$ matches a data record $D$ if $H \subseteq F_D$. Similarly, a tail $T$ matches $D$ if $T \subseteq L_D$. We say that a rule $R$ covers a data record $D$ if $H_R \subseteq F_D$, and similarly, $R$ matches a data record $D$ if both $H_R$ and $T_R$ match $D$. For a dataset $\mathcal{D}$, we denote the support set of $X \in \{H, T, R\}$ by

$\mathcal{D}[X] = \{D \in \mathcal{D} \mid X \text{ matches } D\}$.

The space of all possible rules we consider is $\mathcal{U} = \mathcal{P}(\mathcal{F}) \times \mathcal{P}(\mathcal{L})$, i.e., the Cartesian product of the power set of the feature set and the power set of the label set.
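Under these definitions, the support sets $\mathcal{D}[H]$, $\mathcal{D}[T]$, and $\mathcal{D}[R]$ reduce to simple subset tests. A minimal sketch, encoding each record as a (features, labels) pair of sets (our own illustrative encoding, not the paper's code):

```python
def support(dataset, head=None, tail=None):
    """Indices of records matched by the head, the tail, or both (i.e., the rule)."""
    out = []
    for i, (feats, labels) in enumerate(dataset):
        if head is not None and not head <= feats:
            continue   # head does not match: some feature in H is missing
        if tail is not None and not tail <= labels:
            continue   # tail does not match: some label in T is missing
        out.append(i)
    return out

D = [({"f1", "f2"}, {"l1"}),
     ({"f1"},       {"l2"}),
     ({"f2", "f3"}, {"l1", "l2"})]

print(support(D, head={"f1"}))               # D[H] -> [0, 1]
print(support(D, tail={"l1"}))               # D[T] -> [0, 2]
print(support(D, head={"f1"}, tail={"l1"}))  # D[R] -> [0]
```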
B. Problem formulation
We want to discover rules that are accurate and general, but also sufficiently different from each other. To capture this trade-off, we design an objective function that consists of a quality term $q : \mathcal{U} \rightarrow \mathbb{R}$, measuring the accuracy and generality of a single rule, and a diversity term $d : \mathcal{U} \times \mathcal{U} \rightarrow \mathbb{R}$, measuring the distance between pairs of rules.
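To give a feel for how such an objective can be optimized, here is a generic greedy selection loop for a quality-plus-diversity objective. The paper's actual objective and its 2-approximation algorithm are specified later; the linear weighting `lam` and this loop are our own sketch, not CORSET itself:

```python
def greedy_select(candidates, k, quality, diversity, lam=1.0):
    """Greedily pick k items, scoring each candidate by its quality plus
    lam times its summed diversity to the items already selected."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda c: quality(c, selected)
                   + lam * sum(diversity(c, s) for s in selected))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy usage: items are numbers, quality is the value itself,
# diversity is the absolute difference between two items.
picked = greedy_select([1, 2, 9, 10], k=2,
                       quality=lambda c, S: c,
                       diversity=lambda a, b: abs(a - b),
                       lam=2.0)
print(picked)  # [10, 1]: 10 wins on quality, then 1 wins on distance from 10
```

Without the diversity term, the second pick would be 9; rewarding distance to already-selected items is what steers the set away from redundancy.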
Quality term. Given a rule $R$ and a set of rules $\mathcal{R}$, the quality $q(R;\mathcal{R})$ of $R$ with respect to $\mathcal{R}$ is the product of two values: the uncovered area $\mathrm{area}(R;\mathcal{R})$, capturing the generality of $R$ with respect to $\mathcal{R}$, and its adjusted accuracy $a(R)$:

$q(R;\mathcal{R}) = \mathrm{area}(R;\mathcal{R}) \cdot a(R)$.

Next we describe these two functions. To capture generality, we first define the coverage of $R$ as

$\mathrm{cov}(R) = \{(i, k) \mid R \text{ matches } D_i \in \mathcal{D} \text{ and } k \in T\}$.  (1)
In other words, the coverage of a rule is the set of label occurrences it matches in a dataset. To incorporate what is already covered by a set of selected rules $\mathcal{R}$, we define the uncovered area of $R$ with respect to $\mathcal{R}$ as

$\mathrm{area}(R;\mathcal{R}) = \left|\, \mathrm{cov}(R) \setminus \bigcup_{R' \in \mathcal{R}} \mathrm{cov}(R') \,\right|$,  (2)

that is, the number of label occurrences covered by $R$ after excluding those already covered by $\mathcal{R}$. Thus, a rule $R$ is considered general with respect to $\mathcal{R}$ if $\mathrm{area}(R;\mathcal{R})$ is large.
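Equations (1) and (2) translate directly into code. A small sketch, again encoding records as (features, labels) set pairs (our own illustration):

```python
def cov(rule, dataset):
    """Coverage (Eq. 1): label occurrences (i, k) matched by the rule."""
    head, tail = rule
    return {(i, k)
            for i, (feats, labels) in enumerate(dataset)
            if head <= feats and tail <= labels   # rule matches record D_i
            for k in tail}                        # one occurrence per tail label

def area(rule, selected, dataset):
    """Uncovered area (Eq. 2): coverage of `rule` minus that of selected rules."""
    covered = set().union(*(cov(r, dataset) for r in selected)) if selected else set()
    return len(cov(rule, dataset) - covered)

D = [({"f1"},       {"l1", "l2"}),
     ({"f1", "f2"}, {"l1"})]
R1 = ({"f1"}, {"l1"})
R2 = ({"f1"}, {"l2"})
print(area(R1, [], D))    # 2: covers (0, 'l1') and (1, 'l1')
print(area(R2, [R1], D))  # 1: (0, 'l2') only, disjoint from R1's coverage
```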
Before introducing the adjusted-accuracy function, we need some additional notation. Data records whose labels contain $T$ are said to be positive with respect to $T$, whilst the remaining ones are negative. More formally, a tail $T$ bi-partitions a dataset $\mathcal{D}$ into two disjoint sets: a set of positive data records $\mathcal{D}^{+}_{T} = \{D \in \mathcal{D} \mid T \subseteq L_D\}$ and a set of negative data records $\mathcal{D}^{-}_{T} = \{D \in \mathcal{D} \mid T \not\subseteq L_D\}$. Given a rule $R$, let $P_{\mathcal{D}}[R] = |\mathcal{D}[R]| / |\mathcal{D}[H]|$ be the precision of $R$ and $P_{\mathcal{D}} = |\mathcal{D}[T]| / |\mathcal{D}|$ the base rate of
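The precision and base-rate quantities just defined reduce to counting support-set sizes. A short sketch under the same illustrative record encoding (the adjusted-accuracy function $a(R)$ built on them is defined beyond this excerpt):

```python
def precision_and_base_rate(rule, dataset):
    """Precision |D[R]| / |D[H]| and base rate |D[T]| / |D| of a rule."""
    head, tail = rule
    n_head = sum(1 for feats, _ in dataset if head <= feats)          # |D[H]|
    n_rule = sum(1 for feats, labels in dataset
                 if head <= feats and tail <= labels)                 # |D[R]|
    n_tail = sum(1 for _, labels in dataset if tail <= labels)        # |D[T]|
    precision = n_rule / n_head if n_head else 0.0
    base_rate = n_tail / len(dataset)
    return precision, base_rate

D = [({"f1"},       {"l1"}),
     ({"f1"},       {"l2"}),
     ({"f2"},       {"l1"}),
     ({"f1", "f2"}, {"l1"})]
p, b = precision_and_base_rate(({"f1"}, {"l1"}), D)
print(round(p, 3), b)  # 0.667 0.75
```

A useful rule is one whose precision clearly exceeds the base rate, i.e., firing the rule is more informative than always predicting the tail.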