Concise and interpretable multi-label rule sets
Martino Ciaperoni
Department of Computer Science
Aalto University
Espoo, Finland
martino.ciaperoni@aalto.fi
Han Xiao
Department of Computer Science
Aalto University
Espoo, Finland
han.xiao@aalto.fi
Aristides Gionis
Division of Theoretical Computer Science
KTH Royal Institute of Technology
Stockholm, Sweden
argioni@kth.se
Abstract—Multi-label classification is becoming increasingly ubiquitous, but not much attention has been paid to interpretability. In this paper, we develop a multi-label classifier that can be represented as a concise set of simple "if-then" rules, and thus offers better interpretability compared to black-box models. Notably, our method is able to find a small set of relevant patterns that lead to accurate multi-label classification, while existing rule-based classifiers are myopic and wasteful in searching for rules, requiring a large number of rules to achieve high accuracy. In particular, we formulate the problem of choosing multi-label rules to maximize a target function that considers not only discrimination ability with respect to labels, but also diversity. Accounting for diversity helps to avoid redundancy, and thus to control the number of rules in the solution set. To tackle this maximization problem we propose a 2-approximation algorithm, which relies on a novel technique to sample high-quality rules. In addition to our theoretical analysis, we provide a thorough experimental evaluation, which indicates that our approach offers a trade-off between predictive performance and interpretability that is unmatched in previous work.
I. INTRODUCTION
Machine-learning algorithms are nowadays used in almost every domain. While such algorithms are known to perform well in many tasks, they are often used as "black boxes," i.e., the decision processes involved are too complex for humans to interpret. The lack of interpretability considerably limits the trust humans put in machine-learning algorithms and thus poses a barrier to the wide adoption of machine-learning techniques in the real world. In an attempt to overcome this barrier, interpretable and explainable machine learning have recently emerged as increasingly prominent topics. In the standard classification setting, the goal is to learn a classifier that accurately maps data points to two or more mutually exclusive classes.

In this paper, we focus on a different setting, namely, multi-label classification. In contrast to the standard setting, in multi-label classification a point can be associated with more than one class at the same time. Though multi-label classification has been extensively studied, the main focus is still on improving predictive performance [24]. Significantly less attention has been paid to interpretability.
Classification rules, owing to their simple structure, are gaining popularity in the interpretable multi-label classification literature. In rule-based approaches, the goal is to learn a set of rules that captures the most prominent patterns between features and labels in the data. A rule usually takes the form "{set of predicates} → {set of labels}." For a given data point, a rule predicts the associated labels to be present if all the predicates in the rule evaluate to true. Due to the structural simplicity of rules, classifiers based on a set of rules are generally considered more interpretable than other types of classifiers, such as neural networks or even decision trees.
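As a concrete illustration of this rule semantics, here is a minimal sketch (our own Python illustration, not the paper's implementation), where heads and tails are encoded as sets of feature and label names:

```python
# A multi-label rule: if all head features are present, predict all tail labels.
# Feature and label names below are made up for illustration.

def rule_predicts(head, tail, features):
    """Return the labels the rule (head -> tail) predicts for a record."""
    if head <= features:      # all predicates (features) evaluate to true
        return set(tail)
    return set()              # rule does not fire: no labels predicted

rule_head = {"learning", "neural"}
rule_tail = {"machine-learning", "AI"}

record = {"learning", "neural", "network"}
print(sorted(rule_predicts(rule_head, rule_tail, record)))  # ['AI', 'machine-learning']
print(rule_predicts(rule_head, rule_tail, {"learning"}))    # set(): one predicate is false
```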
The research question that we bring forward is whether we can design rule-based multi-label classification methods that are both accurate and interpretable. BOOMER [19], [20], a recently proposed rule-based classifier based on gradient boosting, gives promising results in accuracy. However, despite being a rule-based approach, its interpretability is limited because it produces a set of rules that is both too large and redundant.

In this work, we propose CORSET, a rule-based method that significantly improves over the state-of-the-art BOOMER. The improvement is due to (1) reducing rule redundancy, which is achieved by incorporating a term in our objective that penalizes rule overlap, and (2) explicitly limiting the complexity of rules via a suite of novel sampling schemes. As a result, our method produces a concise set of interpretable rules. An illustration of the concept of our approach is given in Fig. 1.
Example. To illustrate the improvement of CORSET over BOOMER, we consider as an example the bibtex dataset, where each data point represents a scientific article, with bag-of-words as features and topics as labels. We first consider predictive performance as a function of the number of rules. In Fig. 2, we show the (micro-averaged) balanced F1 scores, a popular measure for multi-label classification used throughout this paper, for both CORSET and BOOMER. Due to the conciseness of its learned rules, CORSET achieves a score close to 0.36 with about 100 rules, whereas BOOMER needs over 800 rules to achieve similar performance. Note that CORSET's performance starts to drop after about 100 rules, as there are no more good rules to learn. The drop indicates overfitting, which can be addressed by standard methods, e.g., cross-validation. In addition, Fig. 3 demonstrates the conciseness of the rules found by CORSET vs. those found by BOOMER. Here, we show a subset of rules as a bipartite graph, where nodes at the top represent labels and nodes at the bottom represent predicates (features). Rules are represented by colors, and two nodes are connected if they are part of the same rule. CORSET uses fewer rules than BOOMER, and its rules tend to contain fewer predicates, resulting in a sparser graph.
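As background for the scores reported above, standard micro-averaged F1 pools true positives, false positives, and false negatives across all labels before computing precision and recall (the paper's balanced variant may adjust this further). A minimal sketch, with one label set per data point:

```python
def micro_f1(true_labels, pred_labels):
    """Micro-averaged F1 over a list of (true, predicted) label-set pairs."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        tp += len(t & p)   # labels predicted and actually present
        fp += len(p - t)   # labels predicted but absent
        fn += len(t - p)   # labels present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [{"a", "b"}, {"b"}, {"c"}]
y_pred = [{"a"}, {"b", "c"}, {"c"}]
print(round(micro_f1(y_true, y_pred), 3))  # 0.75
```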
arXiv:2210.01533v2 [cs.LG] 7 Nov 2022
Fig. 1: Illustration of the concept of our approach for multi-label rule selection. A toy dataset is visualized as a feature matrix and a label matrix. Four rules are shown as colored regions. Regions covered by the same rule are connected by a dashed arrow. The rules in green are chosen because of accuracy, generality, and diversity. The rules in red are discarded.
Fig. 2: Micro F1, as a function of the number of rules, for CORSET vs. BOOMER on the bibtex dataset. (The figure annotates the point where overfitting starts and there are no more good rules to learn.)
Concretely, in this work we make the following contributions:
• We frame the problem of learning concise rule sets as an optimization problem. The problem is NP-hard, and our proposed algorithm CORSET, given a set of rule candidates, achieves an approximation ratio of 2.
• The performance of CORSET depends on the quality of the candidate rules. To find good rules efficiently, we design a suite of fast sampling algorithms with probabilistic guarantees, as well as an effective heuristic.
• Our experiments show that CORSET achieves competitive predictive performance compared to the state-of-the-art, while offering significantly better interpretability.
The rest of this paper is organized as follows. Section II discusses related work. Section III formalizes the problem we consider. Section IV presents CORSET, omitting the details of the rule-sampling algorithms it relies on, which are described in Sections V and VI. Afterwards, Section VII analyzes the complexity of CORSET, and finally Section VIII presents a thorough experimental evaluation of CORSET.
II. RELATED WORK
Multi-label classification. In multi-label classification the goal is to learn a function that maps input points to one or more predefined categories. For instance, a song can be associated with multiple music genres. A plethora of algorithms have been proposed for this problem; interested readers may refer to a recent survey [24]. The simplest approaches for multi-label classification are the so-called transformation methods, which convert the original problem into multiple single-label classification problems.

Fig. 3: An example set of rules returned by our algorithm (top) and BOOMER (bottom) on the bibtex dataset. We depict all rules for the set of labels {SNA, socialnets, social, networks, analysis}.

The main drawback of these approaches is that they fail to capture label correlations. To overcome this issue, label power-set approaches map each distinct set of labels to a unique meta-label, which serves as the target label for a single-label classifier. Clearly, these approaches do not scale with the number of labels, and the pruned problem transformation method [21] has been proposed as a remedy.
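The label power-set transformation described above can be sketched in a few lines (our own illustration; the label names are hypothetical): each distinct label combination becomes one meta-label, and any single-label classifier can then be trained on the transformed targets.

```python
def powerset_transform(label_sets):
    """Map each distinct label set to a meta-label id (label power-set transformation)."""
    meta = {}
    targets = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in meta:
            meta[key] = len(meta)   # assign a new meta-label to an unseen combination
        targets.append(meta[key])
    return targets, meta

y = [{"rock"}, {"rock", "pop"}, {"rock"}, {"jazz"}]
targets, meta = powerset_transform(y)
print(targets)    # [0, 1, 0, 2]
print(len(meta))  # 3 distinct meta-labels
```

The scaling problem is visible here: with |L| labels, up to 2^|L| meta-labels may appear, which is why pruning remedies were proposed.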
Another line of research focuses on designing ad-hoc multi-label classification methods by extending existing single-label algorithms. Examples include adaptations of support vector machines [7], k-nearest-neighbor classifiers [28], and perceptrons [6].
Interpretable machine learning. There is no agreed formal definition of interpretability, but it can be loosely defined as the degree to which a human can understand the cause of a decision [16]. Broadly speaking, interpretability in machine learning can be achieved by constraining the complexity of algorithms so that the process behind an algorithm's decision is understandable to humans. A related topic is explainable machine learning, where the goal is to provide explanations for the predictions of black-box models.
Rule-based approaches to single-label classification. Research in interpretable machine learning has boomed in recent years. Rule-based (or associative) approaches have shown promising potential, because decisions are driven by a simple set of "if-then" rules. Liu et al. [15] are among the first to investigate association rule mining for single-label classification tasks, followed by extensions such as MCAR [22] and ADA [25]. These approaches are conceptually similar, but differ in their methodologies for rule learning, ranking, pruning, and prediction.
Concise rule sets. Our work pursues for the first time the goal of designing a multi-label associative classifier that achieves a given classification performance with the smallest possible number of rules. A similar objective has recently been considered in the context of single-label classification. In particular, Zhang et al. [26] frame the problem of learning a set of classification rules as an optimization problem with an objective function combining rule quality and diversity. A 2-approximation algorithm is then proposed to solve this problem, which relies on existing frameworks for max-sum diversification and pattern sampling. In this paper, we investigate how to extend these ideas to the multi-label classification setting. The problem of controlling the number of rules has also been studied for rule boosting, where learned rules are combined additively [3]. An extension to multi-label classification represents a possible direction for future work. In addition to the number of rules, conciseness of a rule set, and thus interpretability, has been defined in terms of the number of conditions [10] as well as the Minimum Description Length principle [8].
Rule-based approaches to multi-label classification. In general, adaptation from the single-label to the multi-label setting is not trivial, and while single-label associative classification has been studied extensively, relatively few attempts have been made at associative multi-label classification. In an early work, Thabtah et al. [23] propose a label-ranking-based assignment method. More recently, new approaches have been developed, and SECO [12] and BOOMER [20] are the state-of-the-art in the current literature on rule-based multi-label classification. The main limitation of the existing works, addressed in our paper, is that they use a very large set of highly redundant rules, which hinders interpretability. We compare our method against SECO [12] and BOOMER [20] in Section VIII.
Pattern sampling. Association-pattern discovery is challenging due to the prohibitive size of the pattern space. This challenge is inherited by rule-based classifiers. To avoid exhaustively searching the pattern space, efficient pattern-sampling methods have been proposed [1], [2]. In this work we extend these sampling methods to efficiently find high-quality candidate multi-label rules, as discussed in detail in Section V.
III. PROBLEM STATEMENT
At a high level, our objective is to capture the relevant
patterns in the data that best discriminate a set of labels and
are as concise as possible. Next we formally define the problem.
A. Preliminaries
We denote sets and multisets by uppercase letters, e.g., $X$. For a finite set $X$, we denote by $\mathcal{P}(X)$ its power set. We consider a binary dataset $\mathcal{D}$ over a feature set $\mathcal{F}$ and a label set $\mathcal{L}$. The dataset $\mathcal{D}$ is a set of data records $D_1, \dots, D_n$. A data record $D = (F, L)$ consists of a set of features $F \subseteq \mathcal{F}$ and a set of labels $L \subseteq \mathcal{L}$. We denote by $F_D$ and $L_D$ the feature set and label set of $D$, respectively. Furthermore, we denote by $|\mathcal{F}|$ and $|\mathcal{L}|$ the dimensions of the feature and label space, respectively, and by $|\mathcal{D}|$ the total number of data records. We use $\|\mathcal{F}\|$ and $\|\mathcal{L}\|$ to refer to the total number of feature and label occurrences over all data records.
In multi-label classification, the goal is to learn a function mapping as accurately as possible the features $F_D$ to one or more labels $L_D$. We use mappings consisting of conjunctive rules. A conjunctive rule $R = (H \rightarrow T)$ consists of a non-empty feature set $H$ (called the head) and a non-empty label set $T$ (called the tail). The head $H$ can be viewed as a predicate $H : \{0,1\}^{|\mathcal{F}|} \rightarrow \{\text{true}, \text{false}\}$, which states whether an instance $F$ contains all the features in $H$. If the predicate evaluates to true for some instance, the tail $T$ of $R$ specifies that the labels in $T$ should be predicted as present.
We say that a head $H$ matches a data record $D$ if $H \subseteq F_D$. Similarly, a tail $T$ matches $D$ if $T \subseteq L_D$. We say that a rule $R$ covers a data record $D$ if $H_R \subseteq F_D$, and similarly, $R$ matches a data record $D$ if both $H_R$ and $T_R$ match $D$. For a dataset $\mathcal{D}$, we denote the support set of $X \in \{H, T, R\}$ by

$\mathcal{D}[X] = \{D \in \mathcal{D} \mid X \text{ matches } D\}$.

The space of all possible rules we consider is $\mathcal{U} = \mathcal{P}(\mathcal{F}) \times \mathcal{P}(\mathcal{L})$, i.e., the Cartesian product of the power set of the feature set and the power set of the label set.
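Under these definitions, the support sets $\mathcal{D}[H]$, $\mathcal{D}[T]$, and $\mathcal{D}[R]$ reduce to simple subset tests. A minimal sketch, encoding each record as a (features, labels) pair of sets (our own illustrative encoding, not the paper's code):

```python
def support(dataset, head=None, tail=None):
    """Indices of records matched by the head, the tail, or both (i.e., the rule)."""
    out = []
    for i, (feats, labels) in enumerate(dataset):
        if head is not None and not head <= feats:
            continue   # head does not match: some feature in H is missing
        if tail is not None and not tail <= labels:
            continue   # tail does not match: some label in T is missing
        out.append(i)
    return out

D = [({"f1", "f2"}, {"l1"}),
     ({"f1"},       {"l2"}),
     ({"f2", "f3"}, {"l1", "l2"})]

print(support(D, head={"f1"}))               # D[H] -> [0, 1]
print(support(D, tail={"l1"}))               # D[T] -> [0, 2]
print(support(D, head={"f1"}, tail={"l1"}))  # D[R] -> [0]
```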
B. Problem formulation
We want to discover rules that are accurate and general, but also sufficiently different from each other. To capture this trade-off, we design an objective function that consists of a quality term $q : \mathcal{U} \rightarrow \mathbb{R}$, measuring the accuracy and generality of a single rule, and a diversity term $d : \mathcal{U} \times \mathcal{U} \rightarrow \mathbb{R}$, measuring the distance between pairs of rules.
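To give a feel for how such an objective can be optimized, here is a generic greedy selection loop for a quality-plus-diversity objective. The paper's actual objective and its 2-approximation algorithm are specified later; the linear weighting `lam` and this loop are our own sketch, not CORSET itself:

```python
def greedy_select(candidates, k, quality, diversity, lam=1.0):
    """Greedily pick k items, scoring each candidate by its quality plus
    lam times its summed diversity to the items already selected."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda c: quality(c, selected)
                   + lam * sum(diversity(c, s) for s in selected))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy usage: items are numbers, quality is the value itself,
# diversity is the absolute difference between two items.
picked = greedy_select([1, 2, 9, 10], k=2,
                       quality=lambda c, S: c,
                       diversity=lambda a, b: abs(a - b),
                       lam=2.0)
print(picked)  # [10, 1]: 10 wins on quality, then 1 wins on distance from 10
```

Without the diversity term, the second pick would be 9; rewarding distance to already-selected items is what steers the set away from redundancy.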
Quality term. Given a rule $R$ and a set of rules $\mathcal{R}$, the quality $q(R;\mathcal{R})$ of $R$ with respect to $\mathcal{R}$ is the product of two values: the uncovered area $\mathrm{area}(R;\mathcal{R})$, capturing the generality of $R$ with respect to $\mathcal{R}$, and its adjusted accuracy $a(R)$:

$q(R;\mathcal{R}) = \mathrm{area}(R;\mathcal{R}) \cdot a(R)$.

Next we describe these two functions. To capture generality, we first define the coverage of $R$ as

$\mathrm{cov}(R) = \{(i, k) \mid R \text{ matches } D_i \in \mathcal{D} \text{ and } k \in T\}$.  (1)
In other words, the coverage of a rule is the set of label occurrences it matches in a dataset. To incorporate what is already covered by a set of selected rules $\mathcal{R}$, we define the uncovered area of $R$ with respect to $\mathcal{R}$ as

$\mathrm{area}(R;\mathcal{R}) = \left|\, \mathrm{cov}(R) \setminus \bigcup_{R' \in \mathcal{R}} \mathrm{cov}(R') \,\right|$,  (2)

that is, the number of label occurrences covered by $R$ after excluding those already covered by $\mathcal{R}$. Thus, a rule $R$ is considered general with respect to $\mathcal{R}$ if $\mathrm{area}(R;\mathcal{R})$ is large.
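Equations (1) and (2) translate directly into code. A small sketch, again encoding records as (features, labels) set pairs (our own illustration):

```python
def cov(rule, dataset):
    """Coverage (Eq. 1): label occurrences (i, k) matched by the rule."""
    head, tail = rule
    return {(i, k)
            for i, (feats, labels) in enumerate(dataset)
            if head <= feats and tail <= labels   # rule matches record D_i
            for k in tail}                        # one occurrence per tail label

def area(rule, selected, dataset):
    """Uncovered area (Eq. 2): coverage of `rule` minus that of selected rules."""
    covered = set().union(*(cov(r, dataset) for r in selected)) if selected else set()
    return len(cov(rule, dataset) - covered)

D = [({"f1"},       {"l1", "l2"}),
     ({"f1", "f2"}, {"l1"})]
R1 = ({"f1"}, {"l1"})
R2 = ({"f1"}, {"l2"})
print(area(R1, [], D))    # 2: covers (0, 'l1') and (1, 'l1')
print(area(R2, [R1], D))  # 1: (0, 'l2') only, disjoint from R1's coverage
```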
Before introducing the adjusted-accuracy function, we need some additional notation. Data records whose labels contain $T$ are said to be positive with respect to $T$, whilst the remaining ones are negative. More formally, a tail $T$ bi-partitions a dataset $\mathcal{D}$ into two disjoint sets: a set of positive data records $\mathcal{D}^{+}_{T} = \{D \in \mathcal{D} \mid T \subseteq L_D\}$ and a set of negative data records $\mathcal{D}^{-}_{T} = \{D \in \mathcal{D} \mid T \not\subseteq L_D\}$. Given a rule $R$, let $P_{\mathcal{D}}[R] = |\mathcal{D}[R]| / |\mathcal{D}[H]|$ be the precision of $R$ and $P_{\mathcal{D}} = |\mathcal{D}[T]| / |\mathcal{D}|$ the base rate of
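The precision and base-rate quantities just defined reduce to counting support-set sizes. A short sketch under the same illustrative record encoding (the adjusted-accuracy function $a(R)$ built on them is defined beyond this excerpt):

```python
def precision_and_base_rate(rule, dataset):
    """Precision |D[R]| / |D[H]| and base rate |D[T]| / |D| of a rule."""
    head, tail = rule
    n_head = sum(1 for feats, _ in dataset if head <= feats)          # |D[H]|
    n_rule = sum(1 for feats, labels in dataset
                 if head <= feats and tail <= labels)                 # |D[R]|
    n_tail = sum(1 for _, labels in dataset if tail <= labels)        # |D[T]|
    precision = n_rule / n_head if n_head else 0.0
    base_rate = n_tail / len(dataset)
    return precision, base_rate

D = [({"f1"},       {"l1"}),
     ({"f1"},       {"l2"}),
     ({"f2"},       {"l1"}),
     ({"f1", "f2"}, {"l1"})]
p, b = precision_and_base_rate(({"f1"}, {"l1"}), D)
print(round(p, 3), b)  # 0.667 0.75
```

A useful rule is one whose precision clearly exceeds the base rate, i.e., firing the rule is more informative than always predicting the tail.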