
Table 1. Experiments at a glance. The methods highlighted in red were augmented with training distribution learning in our experiments; the year in which each method was proposed is shown in parentheses. Notably, this strategy brings relatively old methods on par with newer specialized techniques for varied notions of model size. What "model size" means for each task is specified in blue. The p-value from a statistical test comparing the original and augmented models appears in the last column; smaller is better.

| Task | Methods | Metric | Significance Tests | Datasets | p-value of improvements (smaller is better) |
|---|---|---|---|---|---|
| (1) Explainable Clustering; size = # leaves in a tree (Section 5) | Iterative Mistake Minimization (2020), ExShallow (2021), CART (1984) | Cost Ratio (ratio of F1 scores) | Friedman, Wilcoxon | (1) avila, (2) Sensorless, (3) covtype.binary, (4) covtype, (5) mice-protein | p = 1.4783 × 10^-6 |
| (2) Prototype-based Classification; size = # prototypes (Section 6) | ProtoNN (2017), Stochastic Neighbor Compression (2014), Fast Condensed Nearest Neighbor Rule (2005), RBF Network (1988) | F1-macro | Friedman, Wilcoxon | (1) adult, (2) covtype.binary, (3) senseit-sei, (4) senseit-aco, (5) phishing | p = 1.699 × 10^-4 |
| (3) Classification using Random Forests; size = {tree depth, # trees} (Section 7) | Optimal Tree Ensemble (2020), subforest-by-prediction (2009), Random Forest (2001) | F1-macro | Friedman, Wilcoxon | (1) Sensorless, (2) heart, (3) covtype, (4) breast cancer, (5) ionosphere | p = 1.44 × 10^-11 |
1.2 Organization
This is conceptually a short paper, i.e., it has a simple, well-defined central thesis, and it is predominantly an empirical paper, i.e., the thesis is validated through experiments. We begin with a discussion of prior work (Section 2), followed by a brief overview of the technique we use to learn the training distribution (Section 3); the latter provides relevant context for our experiments. Critical to this paper is our measurement methodology, which is described in Section 4. The tasks listed in Table 1 are detailed in Sections 5, 6 and 7, respectively. We summarize our results and discuss future work in Section 8, which concludes the paper.
2 Previous Work
As mentioned, the only works we are aware of that discuss learning the training distribution as a means to construct small models are Ghose and Ravindran [17, 18]. They propose techniques that may be seen as forms of "adaptive sampling": parameters of the training distribution are iteratively learned by adapting them to maximize held-out accuracy. In the rest of this paper, we refer to these techniques as "Compaction by Adaptive Sampling" (COAS).
3 Overview of COAS
COAS iteratively learns the parameters of a training distribution based on performance on a held-out subset of the data, given a training algorithm, a desired model size η and an accuracy metric of interest. It uses Bayesian Optimization (BO) [39] for this iterative learning. This choice of optimizer allows for models with non-differentiable losses, e.g., decision trees. The output of one run of COAS is a sample of the training data drawn according to the learned distribution. For our experiments, the following hyperparameters of COAS are relevant:
1. Optimization budget, T: the number of iterations for which the optimizer runs. There is no other stopping criterion.
2. The lower and upper bounds on the size of the sample to be returned. This sample size is denoted by Ns.
Reasonable defaults exist for other hyperparameters [17]. We use the
reference library compactem [16] in our experiments. A detailed re-
view of COAS appears in Section A.1 of the Appendix.
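For concreteness, below is a minimal sketch of such an adaptive-sampling loop. It is illustrative only: random search stands in for the Bayesian Optimization that COAS uses, the training distribution is parameterized simply by per-class sampling weights, and the dataset, model family (a size-constrained decision tree) and all names are assumptions made for illustration rather than the compactem API.

```python
# Minimal, illustrative sketch of an adaptive-sampling loop in the spirit of COAS.
# Random search stands in for Bayesian Optimization; the training distribution is
# parameterized by per-class sampling weights only. Not the compactem API.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

eta = 8                            # desired model size: max # leaves in the tree
T = 50                             # optimization budget (no other stopping criterion)
Ns_lo, Ns_hi = 100, len(X_train)   # bounds on the size of the returned sample

rng = np.random.default_rng(0)
best_score, best_sample = -np.inf, None

for _ in range(T):
    # Propose parameters of the training distribution (COAS would propose these
    # via BO): per-class sampling weights and a sample size Ns within the bounds.
    class_weights = rng.dirichlet(np.ones(len(np.unique(y_train))))
    Ns = rng.integers(Ns_lo, Ns_hi + 1)

    # Draw a training sample according to the proposed distribution.
    p = class_weights[y_train]
    p = p / p.sum()
    idx = rng.choice(len(X_train), size=Ns, replace=True, p=p)

    # Train a size-constrained model on the sample, score it on the held-out split.
    model = DecisionTreeClassifier(max_leaf_nodes=eta, random_state=0)
    model.fit(X_train[idx], y_train[idx])
    score = f1_score(y_val, model.predict(X_val), average="macro")

    if score > best_score:
        best_score, best_sample = score, idx

# The output of one run is the best sample found, drawn from the learned distribution.
print(f"best held-out F1-macro: {best_score:.3f}, sample size: {len(best_sample)}")
```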
4 Measurement
While each task-specific section contains a detailed discussion of its experiment setup, we discuss some common aspects here:
1. To compare model families F1, F2, F3, each of which is, say, used to construct models of different sizes η ∈ {2, 3}, for datasets D1, D2, D3, we use the mean rank, and support our conclusions with statistical tests such as the Friedman [14] and Wilcoxon signed-rank [40] tests1.
Typically, mean rank is used to compare model families based on their accuracies across datasets - which, ignoring model sizes, may be visualized as a 3 × 3 table here, with rows representing datasets and columns denoting model families - see Figure 1(a). An entry such as (D2, F3) represents the accuracy (or some other metric) of a model from family F3 on dataset D2. Models are ranked on a per-dataset basis, i.e., row-wise, and the average ranks (computed per family, i.e., column-wise) are reported (lower is better). For the statistical tests, the column values are used directly.
However, we have an additional factor here: the model size. To avoid inventing a custom metric, we assimilate it into the previous scheme by using each combination of dataset and model size as a row - see Figure 1(b). We think of such combinations as "pseudo-dataset" entries, i.e., we now have a 6 × 3 table, with rows for D1^2, D1^3, D2^2, D2^3, D3^2, D3^3 (where the superscript denotes model size), and the same columns as before. The entry for (D1^2, F3) indicates the accuracy of a model of size 2 from family F3 on dataset D1.
Effectively, the comparisons now automatically account for model size since we use pseudo-datasets instead of datasets. Note that no new datasets are being created - we are merely defining a convention to include model size in the familiar dataset-model cross-product table (a minimal sketch of this ranking scheme appears at the end of this section).
2. For each model family, model size and dataset combination (es-
sentially a cell in this cross-product table), models are constructed
multiple times (we refer to these as multiple “trials”), and their
scores are averaged. For tasks #1 and #2, five trials were used,
whereas for task #3, three trials were used.
3. Key results for tasks #1, #2 and #3 appear in Sections 5.3, 6.3 and
7.3 respectively.
1 The Wilcoxon signed-rank test was used here since it has been advocated by various studies for measuring classification performance [9, 3, 21].
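The following is a minimal sketch of the ranking-and-testing scheme described in point 1 above. The family names, pseudo-dataset labels and scores are invented purely for illustration; in our experiments each cell holds the trial-averaged metric described in point 2.

```python
# Illustrative mean-rank computation over the pseudo-dataset table:
# rows are (dataset, model size) combinations, columns are model families.
import pandas as pd
from scipy.stats import friedmanchisquare, wilcoxon

scores = pd.DataFrame(
    {
        "F1": [0.81, 0.84, 0.70, 0.73, 0.66, 0.69],  # made-up accuracies
        "F2": [0.79, 0.83, 0.72, 0.74, 0.65, 0.70],
        "F3": [0.83, 0.85, 0.71, 0.76, 0.68, 0.72],
    },
    index=pd.MultiIndex.from_product(
        [["D1", "D2", "D3"], [2, 3]], names=["dataset", "size"]
    ),
)

# Rank families per pseudo-dataset (row-wise); rank 1 goes to the highest score.
ranks = scores.rank(axis=1, ascending=False)
print("mean rank per family (lower is better):")
print(ranks.mean(axis=0))

# Friedman test across all families, then a pairwise Wilcoxon signed-rank test,
# both applied to the column values directly.
print(friedmanchisquare(scores["F1"], scores["F2"], scores["F3"]))
print(wilcoxon(scores["F1"], scores["F3"]))
```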