amount of data available for training purposes has boosted the performance of many ML applications, because the additional data provides more information for the learning process, reduces epistemic uncertainty, and helps to prevent over-fitting [1]. However, having more data is problematic for many learning algorithms, since most of them have a computational complexity that grows super-linearly, often quadratically or cubically, with the number of training samples. Moreover, some non-parametric models use a prototype-based approach, in which
some (or all) of the training samples are stored to be used for new predictions; examples are the well-known k-nearest neighbors (k-NN), Support Vector Machines (SVM) [2], and Gaussian Process (GP) classification and regression methods [3]. In those models, predictions are based on comparing (through a distance, a similarity measure, or a kernel function) the new samples against a set of the stored training samples, which demands considerable memory and processing time [4].
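As a rough illustration (not taken from this work), the following Python sketch shows why both memory and prediction time in a prototype-based model such as k-NN grow with the number of stored training samples: every query must be compared against all n stored prototypes. The class name and its interface are hypothetical.

```python
# Minimal illustrative sketch (not the authors' implementation): a naive k-NN
# classifier that stores every training sample and, for each query, computes
# the distance to all n stored prototypes, so each prediction costs O(n * d).
import numpy as np

class NaiveKNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # The whole training set is kept in memory (prototype-based approach).
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X_new):
        X_new = np.asarray(X_new, dtype=float)
        preds = []
        for x in X_new:
            # Pairwise comparison of the query against every stored sample.
            dists = np.linalg.norm(self.X - x, axis=1)
            nearest = np.argsort(dists)[: self.k]
            labels, counts = np.unique(self.y[nearest], return_counts=True)
            preds.append(labels[np.argmax(counts)])
        return np.array(preds)
```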
The specialized literature shows that there are mainly three approaches to tackle this problem: parallel/distributed implementations of ML training algorithms [5]; model-based solutions that use variants of the original models and provide approximate solutions (e.g., pruned or cluster-based k-NN, sparse or inducing-variable GPs) [6, 7]; and instance selection (IS) methods that reduce the number of samples to be considered during training [4]. The first alternative is perhaps the most attractive one, but it requires a considerable effort from the academic community to devise a solution for every single model that one wants to apply in a Big data context. Moreover, some training algorithms are inherently sequential; in those cases, the parallel/distributed variants provide approximate solutions with unavoidable performance degradation. The second alternative is similar to the first one, but its purpose is to reduce the computational complexity of the training process instead of parallelizing it. In this case, variants of the original models or training algorithms are also required, so that the training data is used in a different and less demanding way, but most of the time they also yield approximate solutions. The last alternative focuses on the data used for training: the methods in this group reduce the number of samples required to achieve proper model fitting and performance, without modifying the original models or processing all the data. This strategy can also be combined with the approaches based on parallel or distributed model training. Notwithstanding their advantages, the methods in the last group must satisfy two conditions: the selected samples must represent the underlying data distribution, and their computational cost must be light enough to make their use worthwhile.
In this sense, many IS algorithms proposed in the literature are not suitable for applications involving large volumes of data, because their complexity is at least O(n log n), and also because they rely on sorting algorithms that, by design, require all samples to fit on a single machine [8]. The computational load of these algorithms comes from the fact that they usually must search for critical or close-to-the-decision-frontier instances within the whole dataset, which either requires comparisons between every pair of instances in the dataset or depends on iterative optimization algorithms [9, 10].
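To make the pairwise-comparison cost concrete, here is a minimal sketch of a hypothetical, simplified selection rule (not one of the cited methods): keep only the instances whose nearest neighbor belongs to a different class, i.e., the instances near the decision frontier. Finding them naively requires the full pairwise distance matrix, hence quadratic time and memory.

```python
# Illustrative sketch only (a hypothetical, simplified instance-selection rule,
# not a method from this paper): keep the samples that lie close to the decision
# frontier, i.e., whose nearest neighbor has a different label. Finding these
# points requires comparing every pair of instances, hence O(n^2) distances.
import numpy as np

def select_border_instances(X, y):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Full pairwise distance matrix: this is the quadratic cost that makes
    # such selectors impractical for large datasets on a single machine.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)      # exclude self-distances
    nearest = np.argmin(dists, axis=1)   # index of each sample's 1-NN
    keep = y != y[nearest]               # keep samples on the class border
    return X[keep], y[keep]
```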
In addition to the computational complexity issue, ML models have to deal with skewed data distributions in many real-life situations [11], such as the detection of fraudulent transactions, mammogram lesions, or credit defaults. In those contexts, the severity