Optimizing Data Collection for Machine Learning
Rafid Mahmood¹, James Lucas¹, Jose M. Alvarez¹, Sanja Fidler¹,²,³, Marc T. Law¹
1NVIDIA 2University of Toronto 3Vector Institute
{rmahmood, jalucas, josea, sfidler, marcl}@nvidia.com
Project Page: https://nv-tlabs.github.io/LearnOptimizeCollect/
Abstract
Modern deep learning systems require huge data sets to achieve impressive per-
formance, but there is little guidance on how much or what kind of data to collect.
Over-collecting data incurs unnecessary present costs, while under-collecting may
incur future costs and delay workflows. We propose a new paradigm for modeling
the data collection workflow as a formal optimal data collection problem that al-
lows designers to specify performance targets, collection costs, a time horizon, and
penalties for failing to meet the targets. Additionally, this formulation generalizes
to tasks requiring multiple data sources, such as labeled and unlabeled data used
in semi-supervised learning. To solve our problem, we develop Learn-Optimize-
Collect (LOC), which minimizes expected future collection costs. Finally, we
numerically compare our framework to the conventional baseline of estimating data
requirements by extrapolating from neural scaling laws. We significantly reduce
the risks of failing to meet desired performance targets on several classification,
segmentation, and detection tasks, while maintaining low total collection costs.
1 Introduction
When deploying a deep learning model in an industrial application, designers often mandate that the
model must meet a pre-determined baseline performance, such as a target metric over a validation
data set. For example, an object detector may require a certain minimum mean average precision
before being deployed in a safety-critical setting. One of the most effective ways of meeting target
performances is by collecting more training data for a given model.
Determining how much data is needed to meet performance targets can impact costs and development
delays. Overestimating the data requirement incurs excess costs from collection, cleaning, and
annotation. For instance, annotating segmentation masks for a driving data set takes between 15 and 40 seconds per object; for 100,000 images, the annotation could require between 170 and 460 days-equivalent of time [1, 2]. On the other hand, collecting too little data may incur future costs and
workflow delays from having to collect more later. For example, in medical imaging applications,
this means further clinical data acquisition rounds that require expensive clinician time. In the worst
case, designers may even realize that a project is infeasible only after collecting insufficient data.
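As a sanity check on these figures, the quoted day range is consistent with roughly ten labeled objects per image. The back-of-the-envelope calculation below reproduces it; the ten-objects-per-image density is our assumption, not a number from the text.

```python
# Back-of-the-envelope annotation effort, assuming ~10 objects per image
# (our assumption; the text gives only per-object times and the day totals).
SECONDS_PER_DAY = 24 * 60 * 60

def annotation_days(num_images, objects_per_image, seconds_per_object):
    """Total annotation effort in days-equivalent of time."""
    total_seconds = num_images * objects_per_image * seconds_per_object
    return total_seconds / SECONDS_PER_DAY

low = annotation_days(100_000, 10, 15)   # fast annotators
high = annotation_days(100_000, 10, 40)  # slow annotators
print(f"{low:.0f} to {high:.0f} days-equivalent")  # roughly 170 to 460 days
```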
The growing literature on sample complexity in machine learning has identified neural scaling laws that relate model performance to data set size according to power laws [3–10]. For instance, Rosenfeld et al. [6] fit power law functions on the performance statistics of small data sets to extrapolate the learning curve with more data. In contrast, Mahmood et al. [2] consider estimating data requirements and show that even small errors in a power law model of the learning curve can translate to massively over- or underestimating how much data is needed. Beyond this, different data sources have different costs and scale differently with performance [11–14]. For example, although unlabeled data may be easier to collect than labeled data, some semi-supervised learning tasks may need an order of magnitude more unlabeled data to match the performance of a small labeled set. Thus, collecting more data based only on estimation will fail to capture uncertainty and collection costs.
arXiv:2210.01234v1 [cs.LG] 3 Oct 2022
[Figure 1 flowchart — the Learn, Optimize, Collect loop: initialize with q_0 data points and a target V*; (Learn) sample bootstrap performance with subsets of data to learn the data requirement distribution; (Optimize) use the estimated distribution to optimize collection cost plus the risk of failing to meet V*; (Collect) pay c(q_t − q_{t−1}) to collect additional data; if score V* is achieved, terminate; otherwise, if the time limit T is hit, pay P and terminate; else repeat.]
Figure 1: In the optimal data collection problem, we iteratively determine the amount of data that
we should have, pay to collect the additional data, and then re-evaluate our model. Our approach,
Learn-Optimize-Collect, optimizes for the minimum amount of data q*_t to collect.
In this paper, we propose a new paradigm for modeling the data collection workflow as an optimal
data collection problem. Here, a designer must minimize the cost of collecting enough data to obtain a
model capable of a desired performance score. They have multiple collection rounds, where after each
round, they re-evaluate the model and decide how much more data to order. The data has per-sample
costs and moreover, the designer pays a penalty if they fail to meet the target score within a finite
horizon. Using this formal framework, we develop an optimization approach for minimizing the
expected future collection costs and show that this problem can be optimized in each collection round
via gradient descent. Furthermore, our optimization problem immediately generalizes to decisions
over multiple data sources (e.g., unlabeled, long-tail, cross-domain, synthetic) that have different
costs and impacts on performance. Finally, we demonstrate the value of optimization over naïvely
estimating data set requirements (e.g., [2]) for several machine learning tasks and data sets.
Our contributions are as follows. (1) We propose the optimal data collection problem in machine
learning, which formalizes data collection workflows. (2) We introduce Learn-Optimize-Collect
(LOC), a learning-and-optimizing framework that minimizes future collection costs, can be solved
via gradient descent, and has analytic solutions in some settings. (3) We generalize the data collection
problem and LOC to a multi-variate setting where different types of data have different costs. To the
best of our knowledge, this is the first exploration of data collection with general multiple data sets
in machine learning, covering, for example, semi-supervised and long-tail learning. (4) We perform experiments over classification, segmentation, and detection tasks to show, on average, approximately a 2× reduction in the chances of failing to meet performance targets versus estimation baselines.
2 Related work
Neural Scaling Laws.
According to the neural scaling law literature, the performance of a model on a validation set scales with the size of the training data set q via a power law V ≈ θ_0 q^{θ_1} [5, 6, 8–10, 15–19]. Hestness et al. [5] observe this property over vision, language, and audio tasks; Bahri et al. [9] develop a theoretical relationship under assumptions on over-parametrization and the Lipschitz continuity of the loss, model, and data; and Rosenfeld et al. [6] estimate power laws using smaller data sets and models to extrapolate future performance. Multi-variate scaling laws have also been considered for some specific tasks, for example in transfer learning from synthetic to real data sets [11]. Finally, Mahmood et al. [2] explore data collection by estimating the minimum amount of data needed to meet a given target performance over multiple rounds. Our paper extends these prior studies by developing an optimization problem to minimize the expected total cost of data collected. Specifically, we incorporate the uncertainty in any regression estimate of data requirements and further generalize to multiple data sources with different costs.
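As a concrete illustration of the extrapolation recipe described above, the sketch below fits the power-law form v̂(q; θ) = θ_0 q^{θ_1} + θ_2 to a few small-set scores and solves for the size that reaches a target. The synthetic learning curve and the use of `scipy.optimize.curve_fit` are our choices for illustration, not the exact procedure of any cited work.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(q, theta0, theta1, theta2):
    # v(q) = theta0 * q**theta1 + theta2: with theta0 < 0 and theta1 < 0,
    # the score rises toward the saturation level theta2.
    return theta0 * q**theta1 + theta2

# Hypothetical learning-curve observations (set size, validation score),
# generated from a known power law so the fit is easy to check.
q_obs = np.array([1_000.0, 2_000.0, 4_000.0, 8_000.0, 16_000.0])
v_obs = -3.0 * q_obs**-0.35 + 0.95

theta, _ = curve_fit(power_law, q_obs, v_obs, p0=(-1.0, -0.3, 1.0), maxfev=10_000)

# Extrapolate: smallest q on a grid whose predicted score reaches V* = 0.90.
grid = np.arange(16_000, 1_000_000, 1_000)
d_hat = int(grid[np.argmax(power_law(grid, *theta) >= 0.90)])
```

Even with noise-free data, small perturbations to the fitted θ shift `d_hat` by tens of thousands of points, which is exactly the estimation fragility the paper highlights.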
Active Learning.
In active learning, a model sequentially collects data by selecting new subsets of
an unlabeled data pool to label under a pre-determined labeling budget that replenishes after each
round [20–24]. In contrast, our work focuses on systematically determining an optimal collection
budget. After determining how much data to collect, we can use active learning techniques to collect
the desired amount of data.
Statistical Learning Theory.
Theoretical analysis of the sample complexity of machine learning models is typically only tight asymptotically, but some recent works have empirically analyzed these relationships [25, 26]. In particular, Bisla et al. [10] study generalization bounds for deep neural networks, provide empirical validation, and suggest using them to estimate data requirements. In contrast, our paper formally explores the consequences of collection costs on data requirements.
Optimal Experiment Design.
The topic of how to collect data, select samples, and design scientific experiments or controlled trials is well-studied in econometrics [27–29]. For example, Bertsimas et al. [30] optimize the assignment of samples into control and trial groups to minimize inter-group variances. Most recently, Carneiro et al. [31] optimize how many samples and covariates to collect in a statistical experiment by minimizing a treatment effect estimation error or maximizing t-test power. However, our focus on industrial machine learning applications differs from experiment design by having target performance metrics and continual rounds of collection and modeling.
3 Main Problem
In this section, we give a motivating example before introducing the formal data collection problem.
We include a table of notation in Appendix A.
Motivating Example.
A startup is developing an object detector for use in autonomous vehicles within the next T = 5 years. Their model must achieve a mean Average Precision greater than V* = 95% on a pre-determined validation set, or else they will lose an expected profit of P = $1,000,000. Collecting training data requires employing drivers to record videos and annotators to label the data, where the marginal cost of obtaining each image is approximately c = $1. In order to manage annual finances, the startup must plan how much data to collect at the beginning of each year.
Let z ∼ p(z) be data drawn from a distribution p. For instance, z := (x, y) may correspond to images x and labels y. Consider a prediction problem for which we train a model with a data set D of points sampled from p(z). Let V(D) be a score function evaluating the model trained on D.
Optimal Data Collection.
We possess an initial data set D_{q_0} := {z_i}_{i=1}^{q_0} of q_0 points; we omit the subscript on D referring to its size when it is obvious. Our problem is defined by a target score V* > V(D_{q_0}), a cost-per-sample c of collection, a horizon of T rounds, and a penalty P for failure. At the end of each round t ∈ {1, …, T}, let q_t be the current amount of data collected. Our goal is to minimize the total collection cost while building a model that can achieve the target score:

  min_{q_1, …, q_T}  c(q_T − q_0) + P·1{V(D_{q_T}) < V*}   s.t.  q_0 ≤ q_1 ≤ ⋯ ≤ q_T
  = min_{q_1, …, q_T}  c Σ_{t=1}^{T} (q_t − q_{t−1}) + P·1{V(D_{q_T}) < V*}   s.t.  q_0 ≤ q_1 ≤ ⋯ ≤ q_T.   (1)

The collection cost is measured by the difference in data set size between the final and the 0-th round, c(q_T − q_0) = c Σ_{t=1}^{T} (q_t − q_{t−1}). Because we collect data iteratively over multiple rounds (see Figure 1), we break (1) into the sum of differences per round. Specifically, in each round, we
1. Decide to grow the data set to q_t ≥ q_{t−1} points by sampling D̂ := {ẑ_i}_{i=1}^{q_t − q_{t−1}} ∼ p(z). Pay a cost c(q_t − q_{t−1}) and update D ← D ∪ D̂.
2. Train the model and evaluate the score. If V(D) ≥ V*, then terminate.
3. If t = T, then pay the penalty P and terminate. Otherwise, repeat for the next round.
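The three-step round structure above can be written as a short simulation. The saturating oracle score function and the fixed-increment collection policies below are illustrative placeholders of ours, not part of the paper's method:

```python
def collect(score, policy, q0, target, T, c, P):
    """Simulate the T-round collection workflow above.

    score(q):     validation score of a model trained on q points (an oracle here).
    policy(q, t): total data set size to grow to in round t (step 1).
    Returns (total_cost, met_target).
    """
    q, cost = q0, 0.0
    for t in range(1, T + 1):
        q_new = max(policy(q, t), q)   # data set sizes never shrink
        cost += c * (q_new - q)        # step 1: pay c * (q_t - q_{t-1})
        q = q_new
        if score(q) >= target:         # step 2: train, evaluate, maybe stop
            return cost, True
    return cost + P, False             # step 3: horizon reached, pay penalty P

# Toy oracle score that saturates with data (a made-up learning curve).
score = lambda q: 0.95 - 3.0 * q ** -0.35

# A timid policy under-collects and pays the penalty...
cost_timid, ok_timid = collect(score, lambda q, t: q + 10_000,
                               q0=5_000, target=0.90, T=5, c=1.0, P=1_000_000.0)
# ...while a bolder policy meets the target in four rounds.
cost_bold, ok_bold = collect(score, lambda q, t: q + 30_000,
                             q0=5_000, target=0.90, T=5, c=1.0, P=1_000_000.0)
```

The contrast between the two policies illustrates the asymmetry in problem (1): under-collection risks the large penalty P, while over-collection merely wastes some per-sample cost c.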
The model score typically increases monotonically with data set size [5, 6]. This means that the minimum cost strategy for (1) is to collect just enough data such that V(D_{q_T}) = V*. We can estimate this minimum data requirement by modeling the score function as a stochastic process. Let V_q := V(D_q) and let {V_q}_{q ∈ Z_+} be a stochastic process whose indices represent training set sizes in different rounds. Then, collecting data in each round yields a sequence of subsampled data sets D_{q_{t−1}} ⊂ D_{q_t} and their performances V(D_{q_t}). The minimum data requirement is the stopping time

  D* := arg min_q { q | V_q ≥ V* },   (2)

which is a random variable giving the first time that we pass the target. Note that q*_1 = ⋯ = q*_T = D* is a minimum cost solution to the optimal data collection problem, incurring a total cost c(D* − q_0).¹

¹We assume that c(D* − q_0) < P, since otherwise the optimal strategy would be to collect no data.
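Eq. (2) is straightforward to evaluate on one sampled path of the process {V_q}; the noisy synthetic learning curve below is our stand-in for such a path:

```python
import numpy as np

def stopping_time(curve, target):
    """D* = min{q : V_q >= V*} over an observed path of {V_q} (Eq. (2)).
    Returns None if the target is never reached on this path."""
    for q in sorted(curve):
        if curve[q] >= target:
            return q
    return None

# One sampled path: a saturating learning curve plus evaluation noise.
rng = np.random.default_rng(0)
sizes = range(10_000, 200_001, 10_000)
curve = {q: 0.95 - 3.0 * q ** -0.35 + rng.normal(0.0, 0.002) for q in sizes}

d_star = stopping_time(curve, target=0.90)
```

Because of the evaluation noise, re-running with a different seed gives a different D*, which is why the paper treats the data requirement as a random variable rather than a point value.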
Estimating D* using past observations of the learning curve is difficult since we have only T rounds. Further, Mahmood et al. [2] empirically show that small errors in fitting the learning curve can cause massive over- or under-collection. Thus, robust policies must capture the uncertainty of estimation.
4 Learn-Optimize-Collect (LOC)
Our solution approach, which we refer to as Learn-Optimize-Collect (LOC), minimizes the total collection cost while incorporating the uncertainty of estimating D*. Although D* is a discrete random variable, it typically takes values on the order of thousands or greater. To simplify our problem and ensure differentiability, we assume that D* is continuous and has a well-defined density.

Assumption 1. The random variable D* is absolutely continuous and has a cumulative distribution function (CDF) F(q) and probability density function (PDF) f(q) := dF(q)/dq.
In Section 4.1, we first develop an optimization model given access to the CDF F(q) and PDF f(q). In Section 4.2, we estimate these distributions and combine them with the optimization model. Finally, in Section 4.3, we contrast our optimization approach with prior regression methods.
4.1 Optimization Model
We propose an optimization problem that, for any t, can simultaneously solve for the optimal amounts of data to collect q_t, …, q_T in all future rounds. Consider t = 1 and, to develop intuition, suppose we know a priori the exact stopping time D*. Then, problem (1) can be re-written as

  min_{q_1, …, q_T}  L(q_1, …, q_T; D*)   s.t.  q_0 ≤ q_1 ≤ ⋯ ≤ q_T,   (3)

where the objective function is defined recursively as follows:

  L(q_1, …, q_T; D*) := c(q_1 − q_0) + 1{q_1 < D*}( c(q_2 − q_1) + 1{q_2 < D*}( c(q_3 − q_2) + ⋯
      ⋯ + 1{q_{T−1} < D*}( c(q_T − q_{T−1}) + P·1{q_T < D*} ) ⋯ ))
  = c Σ_{t=1}^{T} (q_t − q_{t−1}) Π_{s=1}^{t−1} 1{q_s < D*} + P Π_{t=1}^{T} 1{q_t < D*}
  = c Σ_{t=1}^{T} (q_t − q_{t−1}) 1{q_{t−1} < D*} + P·1{q_T < D*}.

The objective differs slightly from (1) due to the indicator terms, which ensure that once we collect enough data, we terminate the problem. The second line follows from gathering the terms. The third line follows from the constraint q_1 ≤ q_2 ≤ ⋯ ≤ q_T.
In practice, we do not know D* a priori since it is an unobserved random variable. Instead, suppose we have access to the CDF F(q). Then, we take the expectation over the objective, E[L(q_1, …, q_T; D*)], to formulate a stochastic optimization problem for determining how much data to collect:

  min_{q_1, …, q_T}  c Σ_{t=1}^{T} (q_t − q_{t−1})(1 − F(q_{t−1})) + P(1 − F(q_T))   s.t.  q_0 ≤ q_1 ≤ ⋯ ≤ q_T.   (4)

Note that the collection variables should be discrete, q_1, …, q_T ∈ Z_+, but similar to the modeling of D*, we relax the integrality requirement, optimize over continuous variables, and round the final solutions. Furthermore, although problem (4) is constrained, we can re-formulate it with variables d_t := q_t − q_{t−1}; this replaces the ordering constraints with only non-negativity constraints d_t ≥ 0. Finally, due to Assumption 1, problem (4) can be optimized via gradient descent.
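A minimal sketch of solving problem (4) by projected gradient descent on the increments d_t. The Normal model for D* is our choice here, made for the sake of a closed-form density (LOC itself uses a KDE estimate of F and f), and sizes are in thousands of samples to keep step sizes simple. Note the objective is non-convex, so the starting point matters; we initialize near the mass of D*.

```python
import math

def F(x, mu, sigma):   # Normal CDF, standing in for the estimated CDF of D*
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def f(x, mu, sigma):   # Normal PDF, standing in for the estimated PDF of D*
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def solve_problem4(q0, T, c, P, mu, sigma, lr=2.0, iters=5_000):
    """Minimize c * sum_t d_t (1 - F(q_{t-1})) + P (1 - F(q_T)) over d_t >= 0,
    where q_t = q0 + d_1 + ... + d_t (the reformulation described above)."""
    d = [(mu + 2.0 * sigma - q0) / T] * T   # start near the mass of D*
    for _ in range(iters):
        q = [q0]
        for dt in d:                        # prefix sums give q_1, ..., q_T
            q.append(q[-1] + dt)
        grad = []
        for s in range(T):                  # dJ/dd_s by the chain rule:
            g = c * (1.0 - F(q[s], mu, sigma))          # direct cost term
            for t in range(s + 1, T):                   # d_s shifts later q_t
                g -= c * d[t] * f(q[t], mu, sigma)
            g -= P * f(q[T], mu, sigma)                 # penalty term via q_T
            grad.append(g)
        d = [max(0.0, ds - lr * g) for ds, g in zip(d, grad)]  # project d_t >= 0
    return [q0 + sum(d[: t + 1]) for t in range(T)]

# With T = 1 and q0 far below mu, the solution should approach the quantile
# F^{-1}(1 - eps) where eps solves P = c / f(F^{-1}(1 - eps)) (cf. Theorem 1).
q_star = solve_problem4(q0=40.0, T=1, c=1.0, P=1_000.0, mu=120.0, sigma=20.0)
q_multi = solve_problem4(q0=40.0, T=3, c=1.0, P=1_000.0, mu=120.0, sigma=20.0)
```

For these numbers, the T = 1 stationarity condition c ≈ P f(q_1) gives q_1 ≈ 168.9 (thousand samples), which the iteration recovers.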
4.2 Learning and Optimizing the Data Requirement
Solving problem (4) requires access to the true distribution F(q), which we do not have in reality. In each round, given a current training data set D_{q_t} of q_t points, we must estimate the distribution functions F(q) and f(q) and then incorporate them into our optimization problem.
Given a current data set D_{q_t}, we may sample an increasing sequence of R subsets D_{q_t/R} ⊂ D_{2q_t/R} ⊂ ⋯ ⊂ D_{q_t}, fit our model to each subset, and compute the scores to obtain a data set of the learning curve R := {(r q_t/R, V(D_{r q_t/R}))}_{r=1}^{R}. In order to model the distribution of D*, we can take B bootstrap resamples of R to fit a series of regression functions and obtain corresponding estimates {D̂_b}_{b=1}^{B}. Given this set of estimates of the data requirement, we estimate the PDF via Kernel Density Estimation (KDE). Finally, to fit the CDF, we numerically integrate the PDF.
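This estimation pipeline can be sketched as follows. The synthetic learning curve, the log-log linear fit (standing in for the paper's regression functions), and the residual-resampling bootstrap are all simplifying choices of ours:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Learning-curve data R = {(q_r, V(D_{q_r}))}: synthetic noisy scores.
q_obs = np.array([2_000.0, 4_000.0, 6_000.0, 8_000.0, 10_000.0])
v_obs = 0.95 - 3.0 * q_obs ** -0.35 + rng.normal(0.0, 0.003, q_obs.size)

log_q, log_v = np.log(q_obs), np.log(v_obs)
slope, intercept = np.polyfit(log_q, log_v, 1)   # log v ~ intercept + slope*log q
resid = log_v - (intercept + slope * log_q)

def requirement(a, b, target=0.90):
    """Smallest q with predicted score >= target under log v = a + b log q."""
    return float(np.exp((np.log(target) - a) / b))

# Bootstrap: resample residuals, refit, and re-solve for the data requirement.
B = 200
d_hats = []
for _ in range(B):
    v_b = intercept + slope * log_q + rng.choice(resid, size=resid.size)
    b, a = np.polyfit(log_q, v_b, 1)
    d_hats.append(requirement(a, b))

# KDE for the PDF of D*, then a numerically integrated (normalized) CDF.
pdf = gaussian_kde(d_hats)
grid = np.linspace(1_000.0, 500_000.0, 2_000)
density = pdf(grid)
cdf = np.cumsum(density) * (grid[1] - grid[0])
```

The resulting `density` and `cdf` arrays are exactly the f(q) and F(q) inputs that problem (4) consumes in the next step of LOC.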
In our complete framework, LOC, we first estimate F(q) and f(q). We then use these models to solve problem (4). Note that in the t-th round of collection, we fix the prior decision variables q_1, …, q_{t−1} constant. Finally, we collect data as determined by the optimal solution q*_t to problem (4). Full details of the learning and optimization steps, including the complete algorithm, are in Appendix B.
4.3 Comparison to Mahmood et al. [2]
Our prediction model extends the previous approach of Mahmood et al. [2], who consider only point estimation of D*. They (i) build the set R, (ii) fit a parametric function v̂(q; θ) to R via least-squares minimization, and (iii) solve for D̂ = arg min_q { q | v̂(q; θ) ≥ V* }. They use several parametric functions from the neural scaling law literature, including the power law function (i.e., v̂(q; θ) := θ_0 q^{θ_1} + θ_2 [2, 8], where θ := {θ_0, θ_1, θ_2}), and use an ad hoc correction factor, obtained by trial and error on past tasks, to help decrease the failure rate. Instead, we take bootstrap samples of R to fit multiple regression functions, estimate a distribution for D̂, and incorporate it into our novel optimization model. Finally, we show in the next two sections that our optimization problem has analytic solutions and extends to multiple sources.
5 Analytic Solutions for the T= 1 Setting
In this section, we explore analytic solutions for problem (4). The unobservable D* and the sequential decision-making structure suggest this problem can be formulated as a Partially Observable Markov Decision Process (POMDP) with an infinite state and action space (see Appendix C.1), but such problems rarely permit exact solution methods [32]. Nonetheless, we can derive exact solutions for the simple case of a single T = 1 round, re-stated below:

  min_{q_1}  c(q_1 − q_0) + P(1 − F(q_1))   s.t.  q_0 ≤ q_1.   (5)

Theorem 1. Assume F(q) is strictly increasing and continuous. If there exist q_1 ≥ q_0 and ε̂ ≥ 0 where

  c ≤ P (F(q_1) − F(q_0)) / (q_1 − q_0),   ε̂ ≤ 1 − F(q_0),   P = c / f(F^{−1}(1 − ε̂)),   (6)

then there exists an ε* ≤ 1 − F(q_0) that satisfies P = c / f(F^{−1}(1 − ε*)), and an optimal solution to the corresponding problem (5) is q*_1 := F^{−1}(1 − ε*). Otherwise, the optimal solution is q*_1 := q_0.

When the penalty P is specified via a failure risk ε, the optimal solution to problem (5) is equal to a quantile of the distribution of D*. We defer the proof and some auxiliary results to Appendix C.2.

Theorem 1 further provides guidelines on choosing values for the cost and penalty parameters. While c is the dollar-value cost per sample, which includes acquisition, cleaning, and annotation, P can reflect the designer's inherent regret or opportunity cost of failing to meet the target score. A designer can accept a risk ε of failing to collect enough data, Pr{q* < D*} = ε. From Theorem 1, their optimal strategy should be to collect F^{−1}(1 − ε) points, which is also the optimal solution to problem (5).
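Theorem 1's prescription can be checked numerically: fix a risk ε, set P = c / f(F^{−1}(1 − ε)), and confirm that a brute-force minimization of problem (5) lands on the (1 − ε)-quantile. The Normal model for D* and all the constants below are our illustrative assumptions:

```python
import math

MU, SIGMA = 120_000.0, 20_000.0   # hypothetical Normal model for D*

def F(x):   # CDF of D*
    return 0.5 * (1.0 + math.erf((x - MU) / (SIGMA * math.sqrt(2.0))))

def f(x):   # PDF of D*
    return math.exp(-0.5 * ((x - MU) / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))

def F_inv(p, lo=MU - 10 * SIGMA, hi=MU + 10 * SIGMA):
    for _ in range(200):          # bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F(mid) < p else (lo, mid)
    return 0.5 * (lo + hi)

c, eps, q0 = 1.0, 0.05, 20_000.0
q1_star = F_inv(1.0 - eps)        # Theorem 1: collect the (1 - eps)-quantile
P = c / f(q1_star)                # the matching penalty from condition (6)

def objective(q1):                # problem (5)
    return c * (q1 - q0) + P * (1.0 - F(q1))

# Brute-force check: minimize problem (5) on a fine grid.
grid = [q0 + 10.0 * k for k in range(30_000)]   # q0 up to 320,000, step 10
q1_brute = min(grid, key=objective)
```

With ε = 0.05, the quantile is roughly μ + 1.645σ, and the grid search recovers the same point, matching the theorem's claim that the penalty and the risk level encode the same preference.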
6 The Multi-variate LOC: Collecting Data from Multiple Sources
So far, we have assumed that a designer only chooses how much data to collect and must pay a fixed per-sample collection cost. We now explore the multi-variate extension of the data collection problem where there are different types of data with different costs. For example, consider long-tail learning, where samples for some rare classes are harder to obtain and thus more expensive [33]; semi-supervised learning, where labeling data may cost more than collecting unlabeled data [34]; or domain adaptation, where a source data set is easier to obtain than a target set [35]. In this section, we highlight our main formulation and defer the complete multi-variate LOC to Appendix D.