Data Budgeting for Machine Learning
Xinyi Zhao xinyizhao991231@gmail.com
Institute for Interdisciplinary Information Sciences, Tsinghua University, China
Schwarzman College, Tsinghua University, China
Weixin Liang wxliang@cs.stanford.edu
Department of Computer Science, Stanford University, CA, USA
James Zou jamesz@stanford.edu
Department of Computer Science, Stanford University, CA, USA
Department of Biomedical Data Science, Stanford University, CA, USA
Chan Zuckerberg Biohub, San Francisco, CA, USA
Abstract
Data is the fuel powering AI and creates tremendous value for many domains. However,
collecting datasets for AI is a time-consuming, expensive, and complicated endeavor. For
practitioners, data investment remains a leap of faith in practice. In this work, we
study the data budgeting problem and formulate it as two sub-problems: predicting (1) the
saturating performance achievable given enough data, and (2) how many data points are needed
to reach near-saturating performance. Unlike traditional dataset-independent
methods such as power-law fitting, we propose a learning-based method to solve the data budgeting problem.
To support and systematically evaluate the learning-based method for data budgeting, we
curate a large collection of 383 tabular ML datasets, along with their data-versus-performance
curves. Our empirical evaluation shows that it is possible to perform data budgeting given
a small pilot study dataset with as few as 50 data points.
1 Introduction
Collecting the appropriate training and evaluation data is often the biggest challenge in developing AI in
practice. While the emerging data-centric AI movement has garnered tremendous interest and excitement
in researching best practices for curating, cleaning, annotating, and evaluating datasets for AI, a
critically important piece of the data-for-AI pipeline remains under-explored: data budgeting. Currently,
investment in data remains a leap of faith: practitioners estimate their AI data budget mostly
based on their experience. Systematic and principled approaches for estimating the number of data points
needed for a given ML task are still lacking.
Throughout this work, we formulate Data Budgeting¹ as two closely related research problems. The first
research problem, Final Performance Prediction, is to predict the saturating ML performance that
we can achieve given sufficient training data. This allows practitioners to gauge whether the
ML tasks they are handling are feasible for ML in their current form. The second research problem, Needed
Amount of Data Prediction, is to predict the minimum amount of data needed to achieve nearly the saturating
performance. This allows practitioners to judge whether their ML tasks are
practically feasible given their real-world budget for data collection and data annotation.
Theoretical work has sought lower and upper bounds for data budgeting
problems, both for general learning models and for specific ones. On the applied side, power-law
methods (fitting curves of the form $y = a + b \times x^c$, where $y$ represents the test result and $x$ the number
of data points used for training) are prevalent for such problems, as introduced in Rosenfeld et al. (2019)
and Johnson et al. (2018). Many works also focus on problems requiring large training datasets, such as
Sun et al. (2017) and Kaplan et al. (2020), targeting deep learning models and language models, respectively.
¹GitHub link: https://github.com/xinyi-zhao/Data-Budgeting
These approaches fit the accuracy curve and use the fitted curve to estimate the training effect with enough
data. In contrast, throughout our work we investigate how we can learn from existing datasets and make
data budgeting predictions for a brand-new dataset. Our experiments show that datasets with similar
data budgeting results share common features. Another difference is that existing work on data
scaling laws for pre-training foundation models on natural language and computer vision mostly looks at
web-scale datasets, whereas we focus on tabular datasets, which are often smaller and
more common in science and social science.
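For concreteness, such a power-law baseline can be fitted with standard tools. Below is a minimal sketch using SciPy's curve_fit; the pilot sizes and scores are hypothetical measurements, and the initial guess p0 is an assumption that may need tuning per dataset.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    # y = a + b * x^c; with b < 0 and c < 0, y approaches a as x grows.
    return a + b * np.power(x, c)

sizes = np.array([10, 20, 30, 40, 50])             # hypothetical pilot training sizes
scores = np.array([0.61, 0.68, 0.72, 0.74, 0.75])  # hypothetical test scores

params, _ = curve_fit(power_law, sizes, scores, p0=[0.8, -1.0, -0.5], maxfev=10000)
a, b, c = params
print(f"saturating performance estimate: {a:.3f}")  # asymptote of the fitted curve
```

This illustrates the dataset-independent nature of the baseline: the extrapolation uses only the pilot curve of the dataset at hand, with no information from other datasets.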
We have built a dataset of tabular datasets containing 383 datasets for binary and multi-class
classification tasks, sourced from OpenML and Kaggle. We use this collection
to verify that we can estimate the two data budgeting quantities for a new dataset by
learning from other datasets. We then propose a method called Multiple Splitting to compensate for the lack
of test data when working with pilot data, generating a feature array for each dataset. Finally, we employ
machine learning to explore the relationship between the basic features of a dataset and its true
data budget.
Through our experiments, we find that learning from other datasets helps us make predictions for
new datasets. Our empirical evaluation shows that it is possible to perform data budgeting given a small
pilot study dataset with as few as 50 data points. Furthermore, we analyze the value of different features
for solving these problems. Finally, we identify reasons that may lead to wrong estimates of the final
performance, as well as conditions under which our methods do not perform well.
2 Problem and Method
We formulate the data budgeting problem as: (1) predicting the final performance, i.e., the saturating
ML performance that we can achieve given sufficient training data, and (2) predicting the needed amount
of data, i.e., the minimum amount of data needed to achieve nearly the saturating performance. For a
machine learning task $T$, we want to enable this prediction given only a tiny pilot study
dataset (e.g., 50 labeled data points drawn from the same distribution as the whole dataset). An overview of the problem and the
method is shown in Figure 1.
Formally, for a large tabular dataset $D$ with training set $D_{train}$ and test set $D_{test}$, we define the final
performance of the dataset as the result of an AutoML model trained on $D_{train}$ and tested on $D_{test}$. Although
many well-performing AutoML models exist, the differences among them are small relative to our estimation
task. Here we combine the AutoGluon library (Erickson et al., 2020) and Auto-sklearn (Feurer
et al., 2015) by taking the larger of the two test results as the final test result. The needed amount of data is
the minimum amount of data sampled from $D_{train}$ such that learning from it achieves nearly the
same result as the final performance when tested on $D_{test}$.
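A minimal sketch of this definition is given below, assuming the dataset is a pandas DataFrame with a designated label column. The AutoGluon and auto-sklearn calls follow their documented high-level APIs, but the time budget and data handling here are illustrative assumptions, not the paper's exact configuration.

```python
import autosklearn.classification
import pandas as pd
from autogluon.tabular import TabularPredictor
from sklearn.metrics import f1_score

def final_performance(train_df: pd.DataFrame, test_df: pd.DataFrame, label: str) -> float:
    # AutoGluon: train on D_train, score F1 macro on D_test.
    predictor = TabularPredictor(label=label, eval_metric="f1_macro").fit(train_df)
    ag_score = f1_score(test_df[label], predictor.predict(test_df), average="macro")

    # Auto-sklearn: same split and metric; time limit is an assumed setting.
    automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=600)
    automl.fit(train_df.drop(columns=[label]), train_df[label])
    ask_pred = automl.predict(test_df.drop(columns=[label]))
    ask_score = f1_score(test_df[label], ask_pred, average="macro")

    # Per the definition above, take the larger of the two test results.
    return max(ag_score, ask_score)
```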
We then sample a few data points as a pilot dataset from each large dataset, and learn the relation between
the pilot dataset and the final performance / needed amount of data. To quantify the pilot data, we represent it
with an array $\vec{s}$, where $s_x$ is the training performance when training with $x$ data points.
We then use learning models such as linear models (linear regression or logistic regression)
and RandomForest models to find the mapping from $\vec{s}$ to the final performance / needed amount of data.
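A sketch of this mapping step, assuming the learning-curve arrays $\vec{s}$ of the existing datasets have been precomputed and stacked into a matrix (the file names are hypothetical); the regressor and its hyperparameters are defaults, not the paper's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# curves: one row per existing dataset, columns are s_x for x = 10, 20, ..., 90.
# final_perfs: the measured final performance O_D of each existing dataset.
curves = np.load("pilot_curves.npy")       # hypothetical precomputed arrays
final_perfs = np.load("final_perfs.npy")

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(curves, final_perfs)

# Predict the final performance of a brand-new dataset from its pilot curve alone.
new_curve = np.load("new_pilot_curve.npy").reshape(1, -1)
print("predicted final performance:", model.predict(new_curve)[0])
```

An analogous regressor, trained on the same curves with the needed amount of data as the target, handles the second sub-problem.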
The generation of $\vec{s}$. Since $\vec{s}$ must be generated from the pilot data alone, boosting the credibility of each
$s_x$ is important. The Single Splitting method splits the pilot dataset once, sampling $x$ data points
from the pilot data as the training set and using the remaining data points as the test set. The Multiple Splitting
method repeats this operation many times and averages the results. Figure 2 suggests the number
of repetitions should be at least 500. We know that more training data should improve the
training result; however, without enough repeated sampling, the curve fluctuates, and training with more data
sometimes fails to yield better results, which means the estimate is not robust. Considering efficiency, we find
little difference between 500 and 1000 repetitions, so repeating 500 times is enough. Formally, for a pilot dataset $P$,
$$s_x = \operatorname{avg}_{d \subset P,\, |d| = x} \operatorname{Result}(\text{test on } P \setminus d,\ \text{train on } d)$$
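A minimal sketch of Multiple Splitting under this definition, assuming the pilot data is given as a NumPy feature matrix X and label vector y (e.g., 50 rows); RandomForest and macro F1 match the paper's stated choices, while the helper name curve_point is ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def curve_point(X, y, x, repetitions=500, seed=0):
    """Estimate s_x: average macro F1 over repeated random splits of the pilot data."""
    rng = np.random.default_rng(seed)
    n, scores = len(y), []
    for _ in range(repetitions):
        train_idx = rng.choice(n, size=x, replace=False)   # sample d with |d| = x
        test_idx = np.setdiff1d(np.arange(n), train_idx)   # test on P \ d
        clf = RandomForestClassifier().fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]), average="macro"))
    return float(np.mean(scores))

# The quantified pilot dataset: one curve entry per training size x.
# s = [curve_point(X, y, x) for x in range(10, 100, 10)]
```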
Figure 1: Data Budgeting: Problem and Method. (a) For a machine learning task with only a few
data points, we want to know what will happen if we obtain more data points. (b) To answer this, we
can refer to other existing large datasets. (c) We abstract the problem as predicting the final performance
and the needed amount of data, and quantify the pilot data by generating a learning curve as a function
of the number of data points used for training. We then learn two models that map the learning curves
to the final performance and the needed amount of data, respectively. This lets us make predictions for tasks
with only a few data points.
We choose the RandomForest method as the training method ($\text{train on } d$). We explain this choice
in the appendix: AutoML methods do not work well when the data size is small, and AutoML tends
to choose RandomForest-related models as the best model.
[Figure 2: four panels, "Repeat for 10 times", "Repeat for 100 times", "Repeat for 500 times", and "Repeat for 1000 times", each plotting F1 macro (y-axis, 0.60 to 0.85) against train set size (x-axis, 10 to 90).]
Figure 2: Curves generated with different numbers of repetitions. More training data points should
lead to better test results; fluctuation in the curve means the results are not yet robust. Increasing the
number of repetitions decreases the fluctuation, and considering the trade-off between curve quality and
time spent, repeating 500 times is a reasonable choice.
Final Performance Prediction. For a dataset $D$ with training set $D_{train}$ and test set $D_{test}$, we use
model $O$ (AutoML) to train on it and compute $\mathrm{F1}_{macro}$ on $D_{test}$; the $\mathrm{F1}_{macro}$ metric suits
real-world datasets, which are commonly imbalanced. Finally, the final performance is
$$O_D = \mathrm{F1}_{macro}(D_{test})$$
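For reference, this metric corresponds to scikit-learn's f1_score with average="macro"; the labels below are hypothetical.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2]   # hypothetical ground-truth labels on D_test
y_pred = [0, 1, 1, 1, 2]   # hypothetical AutoML predictions
O_D = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(O_D)  # ~0.822: class imbalance does not down-weight rare classes
```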