Data Budgeting for Machine Learning
Xinyi Zhao xinyizhao991231@gmail.com
Institute for Interdisciplinary Information Sciences, Tsinghua University, China
Schwarzman College, Tsinghua University, China
Weixin Liang wxliang@cs.stanford.edu
Department of Computer Science, Stanford University, CA, USA
James Zou jamesz@stanford.edu
Department of Computer Science, Stanford University, CA, USA
Department of Biomedical Data Science, Stanford University, CA, USA
Chan Zuckerberg Biohub, San Francisco, CA, USA
Abstract
Data is the fuel powering AI and creates tremendous value for many domains. However,
collecting datasets for AI is a time-consuming, expensive, and complicated endeavor. For
practitioners, data investment remains a leap of faith in practice. In this work, we
study the data budgeting problem and formulate it as two sub-problems: predicting (1) the
saturating performance achievable given enough data, and (2) how many data points are needed
to reach near-saturating performance. Unlike traditional dataset-independent
methods such as power-law fitting, we propose a learning-based method to solve the data budgeting problem.
To support and systematically evaluate the learning-based method for data budgeting, we
curate a large collection of 383 tabular ML datasets, along with their data-versus-performance
curves. Our empirical evaluation shows that it is possible to perform data budgeting given
a small pilot study dataset with as few as 50 data points.
1 Introduction
Collecting the appropriate training and evaluation data is often the biggest challenge in developing AI in
practice. While the emerging data-centric AI movement has garnered tremendous interest and excitement
in researching best practices for curating, cleaning, annotating, and evaluating datasets for AI, a
critically important piece of the data-for-AI pipeline remains under-explored: data budgeting. Currently,
investment in data remains a leap of faith: practitioners estimate their AI data budget mostly
based on their experience. Systematic and principled approaches for estimating the number of data points
needed for a given ML task are still lacking.
Throughout this work, we formulate Data Budgeting¹ as two closely related research problems. The first
research problem, Final Performance Prediction, is to predict the saturating ML performance that
we can achieve given sufficient training data. This allows practitioners to gauge whether the
ML tasks they are handling are feasible for ML in their current form. The second research problem, Needed
Amount of Data Prediction, is to predict the minimum amount of data needed to achieve nearly the saturating
performance. This allows practitioners to judge whether their ML tasks are
practically feasible given their real-world budget for data collection and data annotation.
Theoretical work has sought lower and upper bounds for data budgeting
problems, both for general learning models and for specific ones. On the applied side, power-law
methods (fitting curves of the form $y = a + b \times x^c$, where $y$ represents the test result and $x$ the number
of data points used for training) are prevalent for such problems, as introduced in Rosenfeld et al. (2019)
and Johnson et al. (2018). Many works also focus on problems requiring large training datasets, such as
Sun et al. (2017) and Kaplan et al. (2020), targeting deep learning models and language models, respectively.
¹GitHub link: https://github.com/xinyi-zhao/Data-Budgeting
These approaches fit the accuracy curve and use the fitted curve to estimate the training effect with enough
data. In contrast, throughout our work we investigate how we can learn from existing datasets and make
data budgeting predictions for a brand-new dataset. Our experiments show that datasets with similar
data budgeting results share common features. Another difference is that existing work on data
scaling laws for pre-training foundation models on natural language and computer vision mostly looks at
web-scale datasets, whereas we focus on tabular datasets, which are often smaller and
more common in science and social science.
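For concreteness, such a power-law baseline can be fitted with standard tools. Below is a minimal sketch using SciPy's curve_fit; the pilot sizes and scores are hypothetical measurements, and the initial guess p0 is an assumption that may need tuning per dataset.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    # y = a + b * x^c; with b < 0 and c < 0, y approaches a as x grows.
    return a + b * np.power(x, c)

sizes = np.array([10, 20, 30, 40, 50])             # hypothetical pilot training sizes
scores = np.array([0.61, 0.68, 0.72, 0.74, 0.75])  # hypothetical test scores

params, _ = curve_fit(power_law, sizes, scores, p0=[0.8, -1.0, -0.5], maxfev=10000)
a, b, c = params
print(f"saturating performance estimate: {a:.3f}")  # asymptote of the fitted curve
```

This illustrates the dataset-independent nature of the baseline: the extrapolation uses only the pilot curve of the dataset at hand, with no information from other datasets.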
We have built a dataset of tabular datasets containing 383 datasets for binary and multi-class
classification tasks, sourced from OpenML and Kaggle. We use this collection
to verify that we can estimate the two data budgeting quantities for a new dataset by
learning from other datasets. We then propose a method called Multiple Splitting to compensate for the lack
of test data when working with pilot data, generating a feature array for each dataset. Finally, we employ
machine learning to explore the relationship between the basic features of a dataset and its true
data budget.
Through our experiments, we find that learning from other datasets helps us make predictions for
new datasets. Our empirical evaluation shows that it is possible to perform data budgeting given a small
pilot study dataset with as few as 50 data points. Furthermore, we analyze the value of different features
for solving these problems. Finally, we identify reasons that may lead to wrong estimates of the final
performance, as well as conditions under which our methods do not perform well.
2 Problem and Method
We formulate the data budgeting problem as: (1) predicting the final performance, i.e., the saturating
ML performance that we can achieve given sufficient training data, and (2) predicting the needed amount
of data, i.e., the minimum amount of data needed to achieve nearly the saturating performance. For a
machine learning task $T$, we want to enable this prediction given only a tiny pilot study
dataset (e.g., 50 labeled data points drawn from the same distribution as the whole dataset). An overview of the problem and the
method is shown in Figure 1.
Formally, for a large tabular dataset $D$ with training set $D_{train}$ and test set $D_{test}$, we define the final
performance of the dataset as the result of an AutoML model trained on $D_{train}$ and tested on $D_{test}$. Although
many well-performing AutoML models exist, the differences among them are small relative to our estimation
task. Here we combine the AutoGluon library (Erickson et al., 2020) and Auto-sklearn (Feurer
et al., 2015) by taking the larger of the two test results as the final test result. The needed amount of data is
the minimum amount of data sampled from $D_{train}$ such that learning from it achieves nearly the
same result as the final performance when tested on $D_{test}$.
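A minimal sketch of this definition is given below, assuming the dataset is a pandas DataFrame with a designated label column. The AutoGluon and auto-sklearn calls follow their documented high-level APIs, but the time budget and data handling here are illustrative assumptions, not the paper's exact configuration.

```python
import autosklearn.classification
import pandas as pd
from autogluon.tabular import TabularPredictor
from sklearn.metrics import f1_score

def final_performance(train_df: pd.DataFrame, test_df: pd.DataFrame, label: str) -> float:
    # AutoGluon: train on D_train, score F1 macro on D_test.
    predictor = TabularPredictor(label=label, eval_metric="f1_macro").fit(train_df)
    ag_score = f1_score(test_df[label], predictor.predict(test_df), average="macro")

    # Auto-sklearn: same split and metric; time limit is an assumed setting.
    automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=600)
    automl.fit(train_df.drop(columns=[label]), train_df[label])
    ask_pred = automl.predict(test_df.drop(columns=[label]))
    ask_score = f1_score(test_df[label], ask_pred, average="macro")

    # Per the definition above, take the larger of the two test results.
    return max(ag_score, ask_score)
```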
We then sample a few data points as a pilot dataset from each large dataset, and learn the relation between
the pilot dataset and the final performance / needed amount of data. To quantify the pilot data, we represent it
with an array $\vec{s}$, where $s_x$ is the training performance when training with $x$ data points.
We then use learning models such as linear models (linear regression or logistic regression)
and RandomForest models to find the mapping from $\vec{s}$ to the final performance / needed amount of data.
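A sketch of this mapping step, assuming the learning-curve arrays $\vec{s}$ of the existing datasets have been precomputed and stacked into a matrix (the file names are hypothetical); the regressor and its hyperparameters are defaults, not the paper's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# curves: one row per existing dataset, columns are s_x for x = 10, 20, ..., 90.
# final_perfs: the measured final performance O_D of each existing dataset.
curves = np.load("pilot_curves.npy")       # hypothetical precomputed arrays
final_perfs = np.load("final_perfs.npy")

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(curves, final_perfs)

# Predict the final performance of a brand-new dataset from its pilot curve alone.
new_curve = np.load("new_pilot_curve.npy").reshape(1, -1)
print("predicted final performance:", model.predict(new_curve)[0])
```

An analogous regressor, trained on the same curves with the needed amount of data as the target, handles the second sub-problem.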
The generation of $\vec{s}$. Since $\vec{s}$ must be generated from the pilot data alone, boosting the credibility of each
$s_x$ is important. The Single Splitting method splits the pilot dataset once, sampling $x$ data points
from the pilot data as the training set and using the remaining data points as the test set. The Multiple Splitting
method repeats this operation many times and averages the results. Figure 2 suggests the number
of repetitions should be at least 500. We know that more training data should improve the
training result; however, without enough repeated sampling, the curve fluctuates, and training with more data
sometimes fails to yield better results, which means the estimate is not robust. Considering efficiency, we find
little difference between 500 and 1000 repetitions, so repeating 500 times is enough. Formally, for a pilot dataset $P$,
$$s_x = \operatorname{avg}_{d \subset P,\, |d| = x} \operatorname{Result}(\text{test on } P \setminus d,\ \text{train on } d)$$
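A minimal sketch of Multiple Splitting under this definition, assuming the pilot data is given as a NumPy feature matrix X and label vector y (e.g., 50 rows); RandomForest and macro F1 match the paper's stated choices, while the helper name curve_point is ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def curve_point(X, y, x, repetitions=500, seed=0):
    """Estimate s_x: average macro F1 over repeated random splits of the pilot data."""
    rng = np.random.default_rng(seed)
    n, scores = len(y), []
    for _ in range(repetitions):
        train_idx = rng.choice(n, size=x, replace=False)   # sample d with |d| = x
        test_idx = np.setdiff1d(np.arange(n), train_idx)   # test on P \ d
        clf = RandomForestClassifier().fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]), average="macro"))
    return float(np.mean(scores))

# The quantified pilot dataset: one curve entry per training size x.
# s = [curve_point(X, y, x) for x in range(10, 100, 10)]
```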
Figure 1: Data Budgeting: Problem and Method. (a) For a machine learning task with only a few
data points, we want to know what will happen if we obtain more data points. (b) To answer this, we
can refer to other existing large datasets. (c) We abstract the problem as predicting the final performance
and the needed amount of data, and quantify the pilot data by generating a learning curve as a function
of the number of data points used for training. We then learn two models that map the learning curves
to the final performance and the needed amount of data, respectively. This lets us make predictions for tasks
with only a few data points.
We choose the RandomForest method as the training method ($\text{train on } d$). We explain this choice
in the appendix: AutoML methods do not work well when the data size is small, and AutoML tends
to choose RandomForest-related models as the best model.
[Figure 2: four panels, "Repeat for 10 times", "Repeat for 100 times", "Repeat for 500 times", and "Repeat for 1000 times", each plotting F1 macro (y-axis, 0.60 to 0.85) against train set size (x-axis, 10 to 90).]
Figure 2: Curves generated with different numbers of repetitions. More training data points should
lead to better test results; fluctuation in the curve means the results are not yet robust. Increasing the
number of repetitions decreases the fluctuation, and considering the trade-off between curve quality and
time spent, repeating 500 times is a reasonable choice.
Final Performance Prediction. For a dataset $D$ with training set $D_{train}$ and test set $D_{test}$, we use
model $O$ (AutoML) to train on it and compute $\mathrm{F1}_{macro}$ on $D_{test}$; the $\mathrm{F1}_{macro}$ metric suits
real-world datasets, which are commonly imbalanced. Finally, the final performance is
$$O_D = \mathrm{F1}_{macro}(D_{test})$$
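For reference, this metric corresponds to scikit-learn's f1_score with average="macro"; the labels below are hypothetical.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2]   # hypothetical ground-truth labels on D_test
y_pred = [0, 1, 1, 1, 2]   # hypothetical AutoML predictions
O_D = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(O_D)  # ~0.822: class imbalance does not down-weight rare classes
```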