
Table 1. Experiments at a glance. The methods highlighted in red were augmented with training distribution learning in our experiments; the year in which each method was proposed is shown in parentheses. Notably, this strategy brings relatively old methods on par with newer specialized techniques for varied notions of model size. What "model size" means for each task is specified in blue. The p-value from a statistical test comparing the original and augmented models appears in the last column; smaller is better.

| Task | Methods | Metric | Significance Tests | Datasets | p-value of improvements (smaller is better) |
|---|---|---|---|---|---|
| (1) Explainable Clustering; size = # leaves in a tree (Section 5) | Iterative Mistake Minimization (2020), ExShallow (2021), CART (1984) | Cost Ratio (ratio of F1 scores) | Friedman, Wilcoxon | (1) avila, (2) Sensorless, (3) covtype.binary, (4) covtype, (5) mice-protein | p = 1.4783 × 10^-6 |
| (2) Prototype-based Classification; size = # prototypes (Section 6) | ProtoNN (2017), Stochastic Neighbor Compression (2014), Fast Condensed Nearest Neighbor Rule (2005), RBF Network (1988) | F1-macro | Friedman, Wilcoxon | (1) adult, (2) covtype.binary, (3) senseit-sei, (4) senseit-aco, (5) phishing | p = 1.699 × 10^-4 |
| (3) Classification using Random Forests; size = {tree depth, # trees} (Section 7) | Optimal Tree Ensemble (2020), subforest-by-prediction (2009), Random Forest (2001) | F1-macro | Friedman, Wilcoxon | (1) Sensorless, (2) heart, (3) covtype, (4) breast cancer, (5) ionosphere | p = 1.44 × 10^-11 |
1.2 Organization
This is conceptually a short paper, i.e., it has a simple, well-defined central thesis, and it is predominantly an empirical paper, i.e., the thesis is validated through experiments. We begin with a discussion of prior work (Section 2), followed by a brief overview of the technique we use to learn the training distribution (Section 3); the latter provides relevant context for our experiments. Critical to this paper is our measurement methodology, which is described in Section 4. The tasks listed in Table 1 are detailed in Sections 5, 6 and 7, respectively. We summarize our results and discuss future work in Section 8, which concludes the paper.
2 Previous Work
As mentioned, the only works we are aware of that discuss learning the training distribution as a means to construct small models are Ghose and Ravindran [17, 18]. They propose techniques that may be seen as forms of "adaptive sampling": parameters of the training distribution are iteratively learned by adapting them to maximize held-out accuracy. In the rest of this paper, we refer to these techniques as "Compaction by Adaptive Sampling" (COAS).
3 Overview of COAS
COAS iteratively learns the parameters of a training distribution based on performance on a held-out subset of the data, given a training algorithm, a desired model size η and an accuracy metric of interest. It uses Bayesian Optimization (BO) [39] for this iterative learning. This choice of optimizer allows for models with non-differentiable losses, e.g., decision trees. The output of one run of COAS is a sample of the training data drawn according to the learned distribution. For our experiments, the following hyperparameters of COAS are relevant:
1. Optimization budget, T: the number of iterations for which the optimizer runs. There is no other stopping criterion.
2. The lower and upper bounds on the size of the sample to be returned. This sample size is denoted by Ns.
Reasonable defaults exist for other hyperparameters [17]. We use the
reference library compactem [16] in our experiments. A detailed re-
view of COAS appears in Section A.1 of the Appendix.
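For concreteness, below is a minimal sketch of such an adaptive-sampling loop. It is illustrative only: random search stands in for the Bayesian Optimization that COAS uses, the training distribution is parameterized simply by per-class sampling weights, and the dataset, model family (a size-constrained decision tree) and all names are assumptions made for illustration rather than the compactem API.

```python
# Minimal, illustrative sketch of an adaptive-sampling loop in the spirit of COAS.
# Random search stands in for Bayesian Optimization; the training distribution is
# parameterized by per-class sampling weights only. Not the compactem API.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

eta = 8                            # desired model size: max # leaves in the tree
T = 50                             # optimization budget (no other stopping criterion)
Ns_lo, Ns_hi = 100, len(X_train)   # bounds on the size of the returned sample

rng = np.random.default_rng(0)
best_score, best_sample = -np.inf, None

for _ in range(T):
    # Propose parameters of the training distribution (COAS would propose these
    # via BO): per-class sampling weights and a sample size Ns within the bounds.
    class_weights = rng.dirichlet(np.ones(len(np.unique(y_train))))
    Ns = rng.integers(Ns_lo, Ns_hi + 1)

    # Draw a training sample according to the proposed distribution.
    p = class_weights[y_train]
    p = p / p.sum()
    idx = rng.choice(len(X_train), size=Ns, replace=True, p=p)

    # Train a size-constrained model on the sample, score it on the held-out split.
    model = DecisionTreeClassifier(max_leaf_nodes=eta, random_state=0)
    model.fit(X_train[idx], y_train[idx])
    score = f1_score(y_val, model.predict(X_val), average="macro")

    if score > best_score:
        best_score, best_sample = score, idx

# The output of one run is the best sample found, drawn from the learned distribution.
print(f"best held-out F1-macro: {best_score:.3f}, sample size: {len(best_sample)}")
```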
4 Measurement
While each task-specific section contains a detailed discussion of its experiment setup, we discuss some common aspects here:
1. To compare model families F1, F2, F3, each of which is, say, used to construct models of different sizes η ∈ {2, 3}, for datasets D1, D2, D3, we use the mean rank, and support our conclusions with statistical tests such as the Friedman [14] and Wilcoxon signed-rank [40] tests1.
Typically, mean rank is used to compare model families based on their accuracies across datasets - which, ignoring model sizes, may be visualized as a 3 × 3 table here, with rows representing datasets and columns denoting model families - see Figure 1(a). An entry such as (D2, F3) represents the accuracy (or some other metric) of a model from family F3 on dataset D2. Models are ranked on a per-dataset basis, i.e., row-wise, and the average ranks (computed per family, i.e., column-wise) are reported (lower is better). For the statistical tests, the column values are used directly.
However, we have an additional factor here: the model size. To avoid inventing a custom metric, we assimilate it into the previous scheme by using each combination of dataset and model size as a row - see Figure 1(b). We think of such combinations as "pseudo-dataset" entries, i.e., we now have a 6 × 3 table, with rows for D1^2, D1^3, D2^2, D2^3, D3^2, D3^3 (where the superscript denotes model size), and the same columns as before. The entry for (D1^2, F3) indicates the accuracy of a model of size 2 from family F3 on dataset D1.
Effectively, the comparisons now automatically account for model size since we use pseudo-datasets instead of datasets. Note that no new datasets are being created - we are merely defining a convention to include model size in the familiar dataset-model cross-product table (a minimal sketch of this ranking scheme appears at the end of this section).
2. For each model family, model size and dataset combination (es-
sentially a cell in this cross-product table), models are constructed
multiple times (we refer to these as multiple “trials”), and their
scores are averaged. For tasks #1 and #2, five trials were used,
whereas for task #3, three trials were used.
3. Key results for tasks #1, #2 and #3 appear in Sections 5.3, 6.3 and
7.3 respectively.
1 The Wilcoxon signed-rank test was used here since it has been advocated by various studies for measuring classification performance [9, 3, 21].
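The following is a minimal sketch of the ranking-and-testing scheme described in point 1 above. The family names, pseudo-dataset labels and scores are invented purely for illustration; in our experiments each cell holds the trial-averaged metric described in point 2.

```python
# Illustrative mean-rank computation over the pseudo-dataset table:
# rows are (dataset, model size) combinations, columns are model families.
import pandas as pd
from scipy.stats import friedmanchisquare, wilcoxon

scores = pd.DataFrame(
    {
        "F1": [0.81, 0.84, 0.70, 0.73, 0.66, 0.69],  # made-up accuracies
        "F2": [0.79, 0.83, 0.72, 0.74, 0.65, 0.70],
        "F3": [0.83, 0.85, 0.71, 0.76, 0.68, 0.72],
    },
    index=pd.MultiIndex.from_product(
        [["D1", "D2", "D3"], [2, 3]], names=["dataset", "size"]
    ),
)

# Rank families per pseudo-dataset (row-wise); rank 1 goes to the highest score.
ranks = scores.rank(axis=1, ascending=False)
print("mean rank per family (lower is better):")
print(ranks.mean(axis=0))

# Friedman test across all families, then a pairwise Wilcoxon signed-rank test,
# both applied to the column values directly.
print(friedmanchisquare(scores["F1"], scores["F2"], scores["F3"]))
print(wilcoxon(scores["F1"], scores["F3"]))
```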