Data Selection: A General Principle for Building Small
Interpretable Models
Abhishek Ghose
Abstract. We present convincing empirical evidence for an effec-
tive and general strategy for building accurate small models. Such
models are attractive for interpretability and also find use in resource-
constrained environments. The strategy is to learn the training distri-
bution and sample accordingly from the provided training data. The
distribution learning algorithm is not a contribution of this work; our
contribution is a rigorous demonstration of the broad utility of this
strategy in various practical settings. We apply it to the tasks of (1)
building cluster explanation trees, (2) prototype-based classification,
and (3) classification using Random Forests, and show that it im-
proves the accuracy of decades-old weak traditional baselines to be
competitive with specialized modern techniques.
This strategy is also versatile with respect to the notion of model size. In the
first two tasks, model size is taken to be the number of leaves in the tree and
the number of prototypes, respectively. In the final task, involving Random
Forests, the strategy is shown to be effective even when model size comprises
more than one factor: the number of trees and their maximum depth.
Positive results are presented on multiple datasets and are shown to be
statistically significant.
1 Introduction
The application of Machine Learning to various domains often leads
to specialized requirements. One such requirement is small model
size, which is useful in the following scenarios:
1. Interpretability: It is easier for humans to parse models when they
are small. For example, a decision tree with a depth of 5 is likely
easier to understand than one with a depth of 50. Multiple user
studies have established model size as an important factor for in-
terpretability [12, 27, 36]. This is also true for explanations that
are models, where post-hoc interpretability is desired, e.g., local
linear models with a small number of non-zero coefficients as in
LIME [37] or cluster explanation trees with a small number of
leaves [31, 26].
2. Resource-constrained devices: Small models are preferred when
the compute environment is limited in various ways, such as mem-
ory and power [38, 19, 32]. Examples of such environments are
micro-controllers, embedded devices and edge devices.
Of course, in these cases, we also prefer that small models do not
drastically sacrifice predictive power. Typically this trade-off be-
tween size and accuracy is controlled in a manner that is specific to a
model’s formulation, e.g., L1 regularization for linear models, early
stopping for Decision Trees. However, in this work we highlight a
universal strategy: learn the training distribution. This can often in-
crease accuracy of small models, and thus addresses this trade-off in
a model-agnostic manner. This was originally demonstrated in Ghose
and Ravindran [17, 18]. Our work significantly extends the empirical
evidence to show that such a strategy is indeed general and effective,
and presents new evidence to show that it is also competitive.
1.1 Motivation and Contributions
We empirically show that the strategy of learning the training dis-
tribution improves the accuracy of small models in diverse setups. The
prior works that proposed this technique [17, 18] demonstrate such
improvements for a given model; however, they leave an important
question unanswered: are these improvements significant enough to
be comparable to other models tailored to a task? Why might we
use this strategy instead of using a contemporary specialized tech-
nique? This is the practical gap this work fills, and our answer is
affirmative.
We consider the following tasks, for which we set ourselves the
goal of constructing easy-to-interpret models:
1. Building cluster explanation trees.
2. Prototype-based classification.
3. Classification using Random Forests (RF).
For evaluation on each of these tasks, we follow a common theme:
(a) first, we show that a traditional technique is almost always not as
good as newer and specialized techniques, and, (b) then we show that
its performance may be radically improved by learning the training
distribution. Collectively, these evaluations show that the strategy of
learning the training distribution is both general - it may be applied to
different tasks, models, and notions of model size - and effective - it results
in competitive performance. These rigorous evaluations in diverse
practical setups are the primary contribution of this work.
Table 1 concisely presents our setup and (one set of) observations.
It lists various techniques compared on a task, including the tradi-
tional technique whose performance we seek to improve (marked with an
asterisk in the table), and the task-specific notion of model size (given
alongside each task). For all
evaluations multiple trials are run (per dataset and model size com-
bination; the datasets are also listed), and tests for statistical signif-
icance are conducted. The p-value of the improvement in prediction
accuracy (based on a Wilcoxon signed-rank test) - of the traditional
model using the provided training data vs using sampled data based
on a learned distribution - is reported for each task in Table 1 (smaller
numbers are better). The general setup is further detailed in Section 4.
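As an illustration of how such a p-value can be obtained, the snippet below applies SciPy's Wilcoxon signed-rank test to paired scores of the same traditional model trained on the provided data versus on data sampled from a learned distribution. The scores, the number of pairs, and the one-sided alternative are illustrative assumptions for this sketch, not values taken from the paper.

```python
# Hedged sketch: paired comparison of a traditional model trained on the
# original data vs. on resampled data, using the Wilcoxon signed-rank test.
# The score arrays below are placeholders, one entry per
# (dataset, model size, trial) combination.
import numpy as np
from scipy.stats import wilcoxon

scores_original = np.array([0.71, 0.64, 0.80, 0.55, 0.69, 0.73])
scores_resampled = np.array([0.75, 0.70, 0.81, 0.61, 0.74, 0.78])

# One-sided test: are the resampled-data models better than the originals?
stat, p_value = wilcoxon(scores_resampled, scores_original, alternative="greater")
print(f"Wilcoxon statistic = {stat:.3f}, p-value = {p_value:.4g}")
```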
Table 1. Experiments at a glance. The methods marked with an asterisk (*) were augmented
with training distribution learning in our experiments. The year in which a method was
proposed is also mentioned. Notably, this strategy improves relatively old methods to be at
par with newer specialized techniques for varied notions of model size. What “model size”
means for a task is specified alongside the task name. The p-value from a statistical test
comparing the original and augmented models appears in the last line of each entry; smaller
is better.
Task (1): Explainable Clustering (size = # leaves in a tree; Section 5)
  Methods: Iterative Mistake Minimization (2020), ExShallow (2021), CART (1984)*
  Metric: Cost Ratio (ratio of F1 scores)
  Significance tests: Friedman, Wilcoxon
  Datasets: (1) avila, (2) Sensorless, (3) covtype.binary, (4) covtype, (5) mice-protein
  p-value of improvement (smaller is better): p = 1.4783 × 10^-6

Task (2): Prototype-based Classification (size = # prototypes; Section 6)
  Methods: ProtoNN (2017), Stochastic Neighbor Compression (2014),
           Fast Condensed Nearest Neighbor Rule (2005), RBF Network (1988)*
  Metric: F1-macro
  Significance tests: Friedman, Wilcoxon
  Datasets: (1) adult, (2) covtype.binary, (3) senseit-sei, (4) senseit-aco, (5) phishing
  p-value of improvement (smaller is better): p = 1.699 × 10^-4

Task (3): Classification using Random Forests (size = {tree depth, # trees}; Section 7)
  Methods: Optimal Tree Ensemble (2020), subforest-by-prediction (2009), Random Forest (2001)*
  Metric: F1-macro
  Significance tests: Friedman, Wilcoxon
  Datasets: (1) Sensorless, (2) heart, (3) covtype, (4) breast cancer, (5) ionosphere
  p-value of improvement (smaller is better): p = 1.44 × 10^-11
1.2 Organization
This is conceptually a short paper, i.e., it has a simple, well-defined
central thesis, and it is predominantly an empirical paper, i.e., the thesis
is validated using experiments. We begin with a discussion of
prior work (Section 2), followed by a brief overview of the technique
we use to learn the training distribution (Section 3). The latter pro-
vides relevant context for our experiments. Critical to this paper is
our measurement methodology - this is described in Section 4. The
tasks mentioned in Table 1 are respectively detailed in Sections 5, 6
and 7. We summarize the various results and discuss future work in
Section 8, which concludes the paper.
2 Previous Work
As mentioned, the only works we are aware of that discuss learning
of the training distribution as a means to construct small models are
Ghose and Ravindran [17, 18]. They propose techniques that may be
seen as forms of “adaptive sampling”: parameters for the training dis-
tribution are iteratively learned by adapting them to maximize held-
out accuracy. In the rest of this paper, we will refer to these tech-
niques as “Compaction by Adaptive Sampling” (COAS).
3 Overview of COAS
COAS iteratively learns the parameters of a training distribution
based on performance on a held-out subset of the data, given a
training algorithm, a desired model size η, and an accuracy metric of
interest. It uses Bayesian Optimization (BO) [39] for its iterative
learning. This specific optimizer choice allows for models with non-
differentiable losses, e.g., decision trees. The output of one run of
COAS is a sample of the training data drawn according to the learned
distribution. For our experiments, the following hyperparameters of
COAS are relevant:
1. Optimization budget, T: this is the number of iterations for which
the optimizer runs. There is no other stopping criterion.
2. The lower and upper bounds for the size of the sample to be re-
turned. This sample size is denoted by Ns.
Reasonable defaults exist for other hyperparameters [17]. We use the
reference library compactem [16] in our experiments. A detailed re-
view of COAS appears in Section A.1 of the Appendix.
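To make this loop concrete, the sketch below is a minimal, self-contained rendition of the adaptive-sampling idea; it is not the compactem API and not the exact COAS parameterization. It uses scikit-optimize's gp_minimize as the Bayesian optimizer and per-class sampling weights as a stand-in for the learned training distribution; the dataset, the depth-limited tree, the weight bounds, and the values of T and Ns are illustrative choices.

```python
# Simplified COAS-style loop (an assumption-laden sketch, not compactem):
# BO proposes distribution parameters; each proposal is scored by drawing a
# sample of size N_s, fitting a small model on it, and measuring held-out
# F1-macro, which the optimizer then tries to maximize.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from skopt import gp_minimize  # Bayesian Optimization

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
classes = np.unique(y_train)
N_s, T = 300, 30          # sample size and optimization budget
rng = np.random.default_rng(0)

def objective(class_weights):
    # Turn the candidate per-class weights into per-instance probabilities.
    w = np.asarray(class_weights)[np.searchsorted(classes, y_train)]
    idx = rng.choice(len(X_train), size=N_s, replace=True, p=w / w.sum())
    # Train a *small* model (depth-limited tree) on the drawn sample.
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X_train[idx], y_train[idx])
    # gp_minimize minimizes, so return the negative held-out F1-macro.
    return -f1_score(y_val, model.predict(X_val), average="macro")

result = gp_minimize(objective, dimensions=[(0.05, 1.0)] * len(classes),
                     n_calls=T, random_state=0)
print("Best held-out F1-macro:", -result.fun)
```

In COAS proper the output of a run is the sample drawn according to the learned distribution; the analogue in this sketch would be a final draw made using the best weights found in result.x.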
4 Measurement
While each task-specific section contains a detailed discussion on the
experiment setup, we discuss some common aspects here:
1. To compare model families F1, F2, F3, each of which is, say,
used to construct models for different sizes η ∈ {2, 3}, for
datasets D1, D2, D3, we use the mean rank, and support our
conclusions with statistical tests such as the Friedman [14] and
Wilcoxon signed-rank [40] tests¹.
Typically mean rank is used to compare model families based on
their accuracies across datasets - which, ignoring model sizes, may
be visualized as a 3×3 table here, with rows representing datasets,
and columns denoting model families - see Figure 1(a). An entry
such as “D2, F3” represents the accuracy (or some other metric)
of a model from family F3 on dataset D2. Models are ranked on a
per-dataset basis, i.e., row-wise, and the average ranks (computed
per family, i.e., column-wise) are reported (lower is better). For
statistical tests, the column values are directly used.
However, we have an additional factor here - the model size. To
avoid inventing a custom metric, we assimilate it in the previous
scheme by using the combination of datasets and model sizes as a
row - see Figure 1(b). We think of such combinations as “pseudo-
dataset” entries, i.e., now we have a 6×3 table, with rows for
D1^2, D1^3, D2^2, D2^3, D3^2, D3^3 (where the superscript denotes the
model size), and the same columns as before. The entry for “D1^2, F3”
indicates the accuracy of a model of size 2 from family F3 on dataset D1.
Effectively, the comparisons now automatically account for model
size since we use pseudo-datasets instead of datasets. Note that no
new datasets are being created - we are merely defining a convention
to include model size in the familiar dataset-model cross-product
table. A minimal sketch of this ranking scheme appears at the end of
this section.
2. For each model family, model size and dataset combination (es-
sentially a cell in this cross-product table), models are constructed
multiple times (we refer to these as multiple “trials”), and their
scores are averaged. For tasks #1 and #2, five trials were used,
whereas for task #3, three trials were used.
3. Key results for tasks #1, #2 and #3 appear in Sections 5.3, 6.3 and
7.3 respectively.
¹ The Wilcoxon signed-rank test was used here since it has been advocated
by various studies for measuring classification performance [9, 3, 21].
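The pseudo-dataset ranking scheme described in point 1 can be sketched as follows. The score table is a placeholder with made-up numbers for 3 datasets × 2 model sizes × 3 families; it is not data from our experiments.

```python
# Hedged sketch of mean-rank computation over pseudo-datasets plus the
# Friedman test. Rows: (dataset, model size) combinations; columns: families.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

families = ["F1", "F2", "F3"]
scores = np.array([
    [0.62, 0.71, 0.69],   # D1, size 2
    [0.66, 0.74, 0.72],   # D1, size 3
    [0.58, 0.61, 0.65],   # D2, size 2
    [0.60, 0.66, 0.67],   # D2, size 3
    [0.81, 0.84, 0.83],   # D3, size 2
    [0.83, 0.88, 0.86],   # D3, size 3
])

# Rank within each row: higher score -> better -> rank 1; report column means.
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
for name, r in zip(families, ranks.mean(axis=0)):
    print(f"{name}: mean rank {r:.2f}")   # lower is better

# Friedman test across families, one sample of 6 pseudo-dataset scores each.
stat, p_value = friedmanchisquare(*scores.T)
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.4g}")
```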