
They typically fit an accuracy curve and use the fitted curve to extrapolate the effect of training with sufficient
data. In contrast, our work investigates how to learn from existing datasets and make data budgeting
predictions for a brand-new dataset. Our experiments show that datasets with similar data budgeting results
share common features. Another difference is that existing work on data scaling laws for pre-training
foundation models in natural language and computer vision has mostly studied web-scale datasets, whereas
we focus on tabular datasets, which are often smaller and more common in science and social science.
We have built a collection of 383 tabular datasets for binary and multi-class classification tasks, sourced
from OpenML and Kaggle. We use this collection to verify that we can estimate answers to the two data
budgeting problems for a new dataset by learning from other datasets. We then propose a method called
Multiple Splitting to compensate for the lack of test data in the pilot-data setting and to generate a feature
array for each dataset. Finally, we apply machine learning models to explore the relationship between the
basic features of a dataset and its true data budget.
Through our experiments, we find that learning from other datasets helps us make predictions for new
datasets. Our empirical evaluation shows that it is possible to perform data budgeting given a small pilot
study dataset with as few as 50 data points. Moreover, we analyze the contribution of different features to
solving these problems. Finally, we identify some causes of incorrect estimates of the final performance and
some conditions under which our methods do not perform well.
2 Problem and Method
We formulate the data budgeting problem as: (1) predicting final performance: the saturating ML
performance we can achieve if given sufficient training data; and (2) predicting the needed amount of data:
the minimum amount of data required to achieve nearly the saturating performance, as if given sufficient
data. For a machine learning task T, we want to enable these predictions given only a tiny pilot study
dataset (e.g., 50 labeled points drawn from the same distribution as the whole dataset). An overview of the
problem and the method is shown in Figure 1.
Formally, for a large tabular dataset D with training set D_train and test set D_test, we define the final
performance of the dataset as the result of an AutoML model trained on D_train and tested on D_test.
Though many AutoML models with good performance exist, the differences among them are small relative
to our estimation task. Here we combine the AutoGluon library Erickson et al. (2020) and Auto-sklearn
Feurer et al. (2015), taking the larger of the two test results as the final performance. The needed amount
of data is the minimum amount of data sampled from D_train such that learning from it achieves nearly
the same result as the final performance when tested on D_test.
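The definition above can be sketched in a few lines: train two AutoML systems and keep the larger test score. As a minimal illustration, two scikit-learn models stand in for AutoGluon and Auto-sklearn (an assumption for brevity; the paper uses the actual AutoML libraries), and a synthetic dataset stands in for D:

```python
# Sketch: final performance = the larger of two systems' test results.
# RandomForest and LogisticRegression are stand-ins (assumptions) for the
# AutoGluon and Auto-sklearn models used in the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular dataset D, split into D_train / D_test.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = []
for model in (RandomForestClassifier(random_state=0),
              LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# The larger test result defines the final performance of D.
final_performance = max(scores)
```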
Then we sample a few data points from each large dataset as a pilot dataset and learn the relation between
the pilot dataset and the final performance / needed amount of data. To quantify the pilot data, we
represent it with an array ~s, where s_x is the training performance obtained when training with x data
points. We then use learning models such as linear models (linear regression or logistic regression) and
RandomForest models to find the mapping from ~s to the final performance / needed amount of data.
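The mapping step can be sketched as follows. Each row of the feature matrix is one dataset's array ~s, and the target is that dataset's final performance. The data here is synthetic (an assumption for illustration; in the paper each row would come from one of the 383 tabular datasets):

```python
# Sketch: learn the mapping from the pilot-curve array ~s to the final
# performance with the model families named in the text. The curves and
# targets below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_datasets, curve_len = 100, 9  # curve length is illustrative

# Each row is a (sorted, hence roughly increasing) synthetic pilot curve ~s.
S = np.sort(rng.uniform(0.5, 0.9, size=(n_datasets, curve_len)), axis=1)
# Synthetic target: final performance near the curve's last value.
final_perf = S[:, -1] + rng.normal(0, 0.01, n_datasets)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(S, final_perf)
    preds = model.predict(S)
```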
The generation of ~s. Since ~s must be generated from the pilot data alone, boosting the credibility of
each s_x is important. The Single Splitting method splits the pilot dataset once, sampling x data points
from the pilot data as the train set and using the remaining points as the test set. The Multiple Splitting
method repeats this operation many times and averages the results. Figure 2 suggests at least 500
repetitions. Increasing the amount of training data should improve the training result; however, without
enough repeated sampling we observe fluctuations in the curve, where training with more data does not
yield better results, meaning the estimate is not robust. Considering efficiency, we find little difference
between 500 and 1000 repetitions, so repeating 500 times suffices. Formally, for a pilot dataset P,
s_x = avg_{d ⊆ P, |d| = x} Result(test on P \ d, train on d)
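The Multiple Splitting estimate of s_x can be sketched directly from this formula: repeatedly sample x points from the pilot set P as the train split, test on the remainder P \ d, and average. A decision tree is used as the pilot learner and only 50 repetitions are run to keep the example fast (both assumptions; the text does not fix the pilot learner, and recommends at least 500 repetitions):

```python
# Sketch of Multiple Splitting: s_x = average over repeated random splits of
# (train on x sampled points, test on the rest of the pilot set P).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 50-point pilot dataset P.
X, y = make_classification(n_samples=50, n_features=8, random_state=0)

def s_x(x, n_repeats=50, seed=0):
    """Estimate s_x by averaging test accuracy over repeated random splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(X))
        train, test = idx[:x], idx[x:]          # d and P \ d
        clf = DecisionTreeClassifier(random_state=0).fit(X[train], y[train])
        scores.append(clf.score(X[test], y[test]))
    return float(np.mean(scores))

# The array ~s for this pilot dataset, at illustrative training sizes x.
s = [s_x(x) for x in range(15, 45, 5)]
```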