Optimizing Data Collection for Machine Learning
Rafid Mahmood¹, James Lucas¹, Jose M. Alvarez¹, Sanja Fidler¹,²,³, Marc T. Law¹
1NVIDIA 2University of Toronto 3Vector Institute
{rmahmood, jalucas, josea, sfidler, marcl}@nvidia.com
Project Page: https://nv-tlabs.github.io/LearnOptimizeCollect/
Abstract
Modern deep learning systems require huge data sets to achieve impressive per-
formance, but there is little guidance on how much or what kind of data to collect.
Over-collecting data incurs unnecessary present costs, while under-collecting may
incur future costs and delay workflows. We propose a new paradigm for modeling
the data collection workflow as a formal optimal data collection problem that al-
lows designers to specify performance targets, collection costs, a time horizon, and
penalties for failing to meet the targets. Additionally, this formulation generalizes
to tasks requiring multiple data sources, such as labeled and unlabeled data used
in semi-supervised learning. To solve our problem, we develop Learn-Optimize-
Collect (LOC), which minimizes expected future collection costs. Finally, we
numerically compare our framework to the conventional baseline of estimating data
requirements by extrapolating from neural scaling laws. We significantly reduce
the risks of failing to meet desired performance targets on several classification,
segmentation, and detection tasks, while maintaining low total collection costs.
1 Introduction
When deploying a deep learning model in an industrial application, designers often mandate that the
model must meet a pre-determined baseline performance, such as a target metric over a validation
data set. For example, an object detector may require a certain minimum mean average precision
before being deployed in a safety-critical setting. One of the most effective ways of meeting target
performances is by collecting more training data for a given model.
Determining how much data is needed to meet performance targets can impact costs and development
delays. Overestimating the data requirement incurs excess costs from collection, cleaning, and
annotation. For instance, annotating segmentation masks for a driving data set takes between 15 and 40 seconds per object; for 100,000 images, the annotation could require between 170 and 460 days-equivalent of time [1, 2]. On the other hand, collecting too little data may incur future costs and
workflow delays from having to collect more later. For example, in medical imaging applications,
this means further clinical data acquisition rounds that require expensive clinician time. In the worst
case, designers may even realize that a project is infeasible only after collecting insufficient data.
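As a sanity check on these figures, the quoted day range is consistent with roughly ten labeled objects per image. The back-of-the-envelope calculation below reproduces it; the ten-objects-per-image density is our assumption, not a number from the text.

```python
# Back-of-the-envelope annotation effort, assuming ~10 objects per image
# (our assumption; the text gives only per-object times and the day totals).
SECONDS_PER_DAY = 24 * 60 * 60

def annotation_days(num_images, objects_per_image, seconds_per_object):
    """Total annotation effort in days-equivalent of time."""
    total_seconds = num_images * objects_per_image * seconds_per_object
    return total_seconds / SECONDS_PER_DAY

low = annotation_days(100_000, 10, 15)   # fast annotators
high = annotation_days(100_000, 10, 40)  # slow annotators
print(f"{low:.0f} to {high:.0f} days-equivalent")  # roughly 170 to 460 days
```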
The growing literature on sample complexity in machine learning has identified neural scaling laws that relate model performance to data set size according to power laws [3–10]. For instance, Rosenfeld et al. [6] fit power law functions on the performance statistics of small data sets to extrapolate the learning curve with more data. In contrast, Mahmood et al. [2] consider estimating data requirements and show that even small errors in a power law model of the learning curve can translate to massively over- or underestimating how much data is needed. Beyond this, different data sources have different costs and scale differently with performance [11–14]. For example, although unlabeled data may be easier to collect than labeled data, some semi-supervised learning tasks may need an order of magnitude more unlabeled data to match the performance of a small labeled set. Thus, collecting more data based only on estimation will fail to capture uncertainty and collection costs.
arXiv:2210.01234v1 [cs.LG] 3 Oct 2022
[Figure 1 flowchart — the Learn, Optimize, Collect loop: initialize with q_0 data points and a target V*; (Learn) sample bootstrap performance with subsets of data to learn the data requirement distribution; (Optimize) use the estimated distribution to optimize collection cost plus the risk of failing to meet V*; (Collect) pay c(q_t − q_{t−1}) to collect additional data; if score V* is achieved, terminate; otherwise, if the time limit T is hit, pay P and terminate; else repeat.]
Figure 1: In the optimal data collection problem, we iteratively determine the amount of data that
we should have, pay to collect the additional data, and then re-evaluate our model. Our approach,
Learn-Optimize-Collect, optimizes for the minimum amount of data q*_t to collect.
In this paper, we propose a new paradigm for modeling the data collection workflow as an optimal
data collection problem. Here, a designer must minimize the cost of collecting enough data to obtain a
model capable of a desired performance score. They have multiple collection rounds, where after each
round, they re-evaluate the model and decide how much more data to order. The data has per-sample
costs and moreover, the designer pays a penalty if they fail to meet the target score within a finite
horizon. Using this formal framework, we develop an optimization approach for minimizing the
expected future collection costs and show that this problem can be optimized in each collection round
via gradient descent. Furthermore, our optimization problem immediately generalizes to decisions
over multiple data sources (e.g., unlabeled, long-tail, cross-domain, synthetic) that have different
costs and impacts on performance. Finally, we demonstrate the value of optimization over naïvely
estimating data set requirements (e.g., [2]) for several machine learning tasks and data sets.
Our contributions are as follows. (1) We propose the optimal data collection problem in machine
learning, which formalizes data collection workflows. (2) We introduce Learn-Optimize-Collect
(LOC), a learning-and-optimizing framework that minimizes future collection costs, can be solved
via gradient descent, and has analytic solutions in some settings. (3) We generalize the data collection
problem and LOC to a multi-variate setting where different types of data have different costs. To the
best of our knowledge, this is the first exploration of data collection with general multiple data sets
in machine learning, covering, for example, semi-supervised and long-tail learning. (4) We perform experiments over classification, segmentation, and detection tasks to show, on average, approximately a 2× reduction in the chances of failing to meet performance targets versus estimation baselines.
2 Related work
Neural Scaling Laws.
According to the neural scaling law literature, the performance of a model on a validation set scales with the size of the training data set q via a power law V ≈ θ_0 q^{θ_1} [5, 6, 8–10, 15–19]. Hestness et al. [5] observe this property over vision, language, and audio tasks; Bahri et al. [9] develop a theoretical relationship under assumptions on over-parametrization and the Lipschitz continuity of the loss, model, and data; and Rosenfeld et al. [6] estimate power laws using smaller data sets and models to extrapolate future performance. Multi-variate scaling laws have also been considered for some specific tasks, for example in transfer learning from synthetic to real data sets [11]. Finally, Mahmood et al. [2] explore data collection by estimating the minimum amount of data needed to meet a given target performance over multiple rounds. Our paper extends these prior studies by developing an optimization problem to minimize the expected total cost of data collected. Specifically, we incorporate the uncertainty in any regression estimate of data requirements and further generalize to multiple data sources with different costs.
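As a concrete illustration of the extrapolation recipe described above, the sketch below fits the power-law form v̂(q; θ) = θ_0 q^{θ_1} + θ_2 to a few small-set scores and solves for the size that reaches a target. The synthetic learning curve and the use of `scipy.optimize.curve_fit` are our choices for illustration, not the exact procedure of any cited work.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(q, theta0, theta1, theta2):
    # v(q) = theta0 * q**theta1 + theta2: with theta0 < 0 and theta1 < 0,
    # the score rises toward the saturation level theta2.
    return theta0 * q**theta1 + theta2

# Hypothetical learning-curve observations (set size, validation score),
# generated from a known power law so the fit is easy to check.
q_obs = np.array([1_000.0, 2_000.0, 4_000.0, 8_000.0, 16_000.0])
v_obs = -3.0 * q_obs**-0.35 + 0.95

theta, _ = curve_fit(power_law, q_obs, v_obs, p0=(-1.0, -0.3, 1.0), maxfev=10_000)

# Extrapolate: smallest q on a grid whose predicted score reaches V* = 0.90.
grid = np.arange(16_000, 1_000_000, 1_000)
d_hat = int(grid[np.argmax(power_law(grid, *theta) >= 0.90)])
```

Even with noise-free data, small perturbations to the fitted θ shift `d_hat` by tens of thousands of points, which is exactly the estimation fragility the paper highlights.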
Active Learning.
In active learning, a model sequentially collects data by selecting new subsets of
an unlabeled data pool to label under a pre-determined labeling budget that replenishes after each
round [20–24]. In contrast, our work focuses on systematically determining an optimal collection
budget. After determining how much data to collect, we can use active learning techniques to collect
the desired amount of data.
Statistical Learning Theory.
Theoretical analysis of the sample complexity of machine learning models is typically only tight asymptotically, but some recent works have empirically analyzed these relationships [25, 26]. In particular, Bisla et al. [10] study generalization bounds for deep neural networks, provide empirical validation, and suggest using them to estimate data requirements. In contrast, our paper formally explores the consequences of collection costs on data requirements.
Optimal Experiment Design.
The topic of how to collect data, select samples, and design scientific experiments or controlled trials is well-studied in econometrics [27–29]. For example, Bertsimas et al. [30] optimize the assignment of samples into control and trial groups to minimize inter-group variances. Most recently, Carneiro et al. [31] optimize how many samples and covariates to collect in a statistical experiment by minimizing a treatment effect estimation error or maximizing t-test power. However, our focus on industrial machine learning applications differs from experiment design by having target performance metrics and continual rounds of collection and modeling.
3 Main Problem
In this section, we give a motivating example before introducing the formal data collection problem.
We include a table of notation in Appendix A.
Motivating Example.
A startup is developing an object detector for use in autonomous vehicles within the next T = 5 years. Their model must achieve a mean Average Precision greater than V* = 95% on a pre-determined validation set, or else they will lose an expected profit of P = $1,000,000. Collecting training data requires employing drivers to record videos and annotators to label the data, where the marginal cost of obtaining each image is approximately c = $1. In order to manage annual finances, the startup must plan how much data to collect at the beginning of each year.
Let z ∼ p(z) be data drawn from a distribution p. For instance, z := (x, y) may correspond to images x and labels y. Consider a prediction problem for which we train a model with a data set D of points sampled from p(z). Let V(D) be a score function evaluating the model trained on D.
Optimal Data Collection.
We possess an initial data set D_{q_0} := {z_i}_{i=1}^{q_0} of q_0 points; we omit the subscript on D referring to its size when it is obvious. Our problem is defined by a target score V* > V(D_{q_0}), a cost-per-sample c of collection, a horizon of T rounds, and a penalty P for failure. At the end of each round t ∈ {1, …, T}, let q_t be the current amount of data collected. Our goal is to minimize the total collection cost while building a model that can achieve the target score:

  min_{q_1, …, q_T}  c(q_T − q_0) + P·1{V(D_{q_T}) < V*}   s.t.  q_0 ≤ q_1 ≤ ⋯ ≤ q_T
  = min_{q_1, …, q_T}  c Σ_{t=1}^{T} (q_t − q_{t−1}) + P·1{V(D_{q_T}) < V*}   s.t.  q_0 ≤ q_1 ≤ ⋯ ≤ q_T.   (1)

The collection cost is measured by the difference in data set size between the final and the 0-th round, c(q_T − q_0) = c Σ_{t=1}^{T} (q_t − q_{t−1}). Because we collect data iteratively over multiple rounds (see Figure 1), we break (1) into the sum of differences per round. Specifically, in each round, we
1. Decide to grow the data set to q_t ≥ q_{t−1} points by sampling D̂ := {ẑ_i}_{i=1}^{q_t − q_{t−1}} ∼ p(z). Pay a cost c(q_t − q_{t−1}) and update D ← D ∪ D̂.
2. Train the model and evaluate the score. If V(D) ≥ V*, then terminate.
3. If t = T, then pay the penalty P and terminate. Otherwise, repeat for the next round.
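The three-step round structure above can be written as a short simulation. The saturating oracle score function and the fixed-increment collection policies below are illustrative placeholders of ours, not part of the paper's method:

```python
def collect(score, policy, q0, target, T, c, P):
    """Simulate the T-round collection workflow above.

    score(q):     validation score of a model trained on q points (an oracle here).
    policy(q, t): total data set size to grow to in round t (step 1).
    Returns (total_cost, met_target).
    """
    q, cost = q0, 0.0
    for t in range(1, T + 1):
        q_new = max(policy(q, t), q)   # data set sizes never shrink
        cost += c * (q_new - q)        # step 1: pay c * (q_t - q_{t-1})
        q = q_new
        if score(q) >= target:         # step 2: train, evaluate, maybe stop
            return cost, True
    return cost + P, False             # step 3: horizon reached, pay penalty P

# Toy oracle score that saturates with data (a made-up learning curve).
score = lambda q: 0.95 - 3.0 * q ** -0.35

# A timid policy under-collects and pays the penalty...
cost_timid, ok_timid = collect(score, lambda q, t: q + 10_000,
                               q0=5_000, target=0.90, T=5, c=1.0, P=1_000_000.0)
# ...while a bolder policy meets the target in four rounds.
cost_bold, ok_bold = collect(score, lambda q, t: q + 30_000,
                             q0=5_000, target=0.90, T=5, c=1.0, P=1_000_000.0)
```

The contrast between the two policies illustrates the asymmetry in problem (1): under-collection risks the large penalty P, while over-collection merely wastes some per-sample cost c.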
The model score typically increases monotonically with data set size [5, 6]. This means that the minimum cost strategy for (1) is to collect just enough data such that V(D_{q_T}) = V*. We can estimate this minimum data requirement by modeling the score function as a stochastic process. Let V_q := V(D_q) and let {V_q}_{q ∈ Z_+} be a stochastic process whose indices represent training set sizes in different rounds. Then, collecting data in each round yields a sequence of subsampled data sets D_{q_{t−1}} ⊂ D_{q_t} and their performances V(D_{q_t}). The minimum data requirement is the stopping time

  D* := arg min_q { q | V_q ≥ V* },   (2)

which is a random variable giving the first time that we pass the target. Note that q*_1 = ⋯ = q*_T = D* is a minimum cost solution to the optimal data collection problem, incurring a total cost c(D* − q_0).¹

¹We assume that c(D* − q_0) < P, since otherwise the optimal strategy would be to collect no data.
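Eq. (2) is straightforward to evaluate on one sampled path of the process {V_q}; the noisy synthetic learning curve below is our stand-in for such a path:

```python
import numpy as np

def stopping_time(curve, target):
    """D* = min{q : V_q >= V*} over an observed path of {V_q} (Eq. (2)).
    Returns None if the target is never reached on this path."""
    for q in sorted(curve):
        if curve[q] >= target:
            return q
    return None

# One sampled path: a saturating learning curve plus evaluation noise.
rng = np.random.default_rng(0)
sizes = range(10_000, 200_001, 10_000)
curve = {q: 0.95 - 3.0 * q ** -0.35 + rng.normal(0.0, 0.002) for q in sizes}

d_star = stopping_time(curve, target=0.90)
```

Because of the evaluation noise, re-running with a different seed gives a different D*, which is why the paper treats the data requirement as a random variable rather than a point value.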
Estimating D* using past observations of the learning curve is difficult since we have only T rounds. Further, Mahmood et al. [2] empirically show that small errors in fitting the learning curve can cause massive over- or under-collection. Thus, robust policies must capture the uncertainty of estimation.
4 Learn-Optimize-Collect (LOC)
Our solution approach, which we refer to as Learn-Optimize-Collect (LOC), minimizes the total collection cost while incorporating the uncertainty of estimating D*. Although D* is a discrete random variable, it typically takes values on the order of thousands or greater. To simplify our problem and ensure differentiability, we assume that D* is continuous and has a well-defined density.

Assumption 1. The random variable D* is absolutely continuous and has a cumulative distribution function (CDF) F(q) and probability density function (PDF) f(q) := dF(q)/dq.
In Section 4.1, we first develop an optimization model given access to the CDF F(q) and PDF f(q). In Section 4.2, we estimate these distributions and combine them with the optimization model. Finally, in Section 4.3, we contrast our optimization approach with prior regression methods.
4.1 Optimization Model
We propose an optimization problem that, for any t, can simultaneously solve for the optimal amounts of data to collect q_t, …, q_T in all future rounds. Consider t = 1 and, to develop intuition, suppose we know a priori the exact stopping time D*. Then, problem (1) can be re-written as

  min_{q_1, …, q_T}  L(q_1, …, q_T; D*)   s.t.  q_0 ≤ q_1 ≤ ⋯ ≤ q_T,   (3)

where the objective function is defined recursively as follows:

  L(q_1, …, q_T; D*) := c(q_1 − q_0) + 1{q_1 < D*}( c(q_2 − q_1) + 1{q_2 < D*}( c(q_3 − q_2) + ⋯
      ⋯ + 1{q_{T−1} < D*}( c(q_T − q_{T−1}) + P·1{q_T < D*} ) ⋯ ))
  = c Σ_{t=1}^{T} (q_t − q_{t−1}) Π_{s=1}^{t−1} 1{q_s < D*} + P Π_{t=1}^{T} 1{q_t < D*}
  = c Σ_{t=1}^{T} (q_t − q_{t−1}) 1{q_{t−1} < D*} + P·1{q_T < D*}.

The objective differs slightly from (1) due to the indicator terms, which ensure that once we collect enough data, we terminate the problem. The second line follows from gathering the terms. The third line follows from the constraint q_1 ≤ q_2 ≤ ⋯ ≤ q_T.
In practice, we do not know D* a priori since it is an unobserved random variable. Instead, suppose we have access to the CDF F(q). Then, we take the expectation over the objective, E[L(q_1, …, q_T; D*)], to formulate a stochastic optimization problem for determining how much data to collect:

  min_{q_1, …, q_T}  c Σ_{t=1}^{T} (q_t − q_{t−1})(1 − F(q_{t−1})) + P(1 − F(q_T))   s.t.  q_0 ≤ q_1 ≤ ⋯ ≤ q_T.   (4)

Note that the collection variables should be discrete, q_1, …, q_T ∈ Z_+, but similar to the modeling of D*, we relax the integrality requirement, optimize over continuous variables, and round the final solutions. Furthermore, although problem (4) is constrained, we can re-formulate it with variables d_t := q_t − q_{t−1}; this replaces the ordering constraints with only non-negativity constraints d_t ≥ 0. Finally, due to Assumption 1, problem (4) can be optimized via gradient descent.
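A minimal sketch of solving problem (4) by projected gradient descent on the increments d_t. The Normal model for D* is our choice here, made for the sake of a closed-form density (LOC itself uses a KDE estimate of F and f), and sizes are in thousands of samples to keep step sizes simple. Note the objective is non-convex, so the starting point matters; we initialize near the mass of D*.

```python
import math

def F(x, mu, sigma):   # Normal CDF, standing in for the estimated CDF of D*
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def f(x, mu, sigma):   # Normal PDF, standing in for the estimated PDF of D*
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def solve_problem4(q0, T, c, P, mu, sigma, lr=2.0, iters=5_000):
    """Minimize c * sum_t d_t (1 - F(q_{t-1})) + P (1 - F(q_T)) over d_t >= 0,
    where q_t = q0 + d_1 + ... + d_t (the reformulation described above)."""
    d = [(mu + 2.0 * sigma - q0) / T] * T   # start near the mass of D*
    for _ in range(iters):
        q = [q0]
        for dt in d:                        # prefix sums give q_1, ..., q_T
            q.append(q[-1] + dt)
        grad = []
        for s in range(T):                  # dJ/dd_s by the chain rule:
            g = c * (1.0 - F(q[s], mu, sigma))          # direct cost term
            for t in range(s + 1, T):                   # d_s shifts later q_t
                g -= c * d[t] * f(q[t], mu, sigma)
            g -= P * f(q[T], mu, sigma)                 # penalty term via q_T
            grad.append(g)
        d = [max(0.0, ds - lr * g) for ds, g in zip(d, grad)]  # project d_t >= 0
    return [q0 + sum(d[: t + 1]) for t in range(T)]

# With T = 1 and q0 far below mu, the solution should approach the quantile
# F^{-1}(1 - eps) where eps solves P = c / f(F^{-1}(1 - eps)) (cf. Theorem 1).
q_star = solve_problem4(q0=40.0, T=1, c=1.0, P=1_000.0, mu=120.0, sigma=20.0)
q_multi = solve_problem4(q0=40.0, T=3, c=1.0, P=1_000.0, mu=120.0, sigma=20.0)
```

For these numbers, the T = 1 stationarity condition c ≈ P f(q_1) gives q_1 ≈ 168.9 (thousand samples), which the iteration recovers.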
4.2 Learning and Optimizing the Data Requirement
Solving problem (4) requires access to the true distribution F(q), which we do not have in reality. In each round, given a current training data set D_{q_t} of q_t points, we must estimate the distribution functions F(q) and f(q) and then incorporate them into our optimization problem.
Given a current data set D_{q_t}, we may sample an increasing sequence of R subsets D_{q_t/R} ⊂ D_{2q_t/R} ⊂ ⋯ ⊂ D_{q_t}, fit our model to each subset, and compute the scores to obtain a data set of the learning curve R := {(r q_t/R, V(D_{r q_t/R}))}_{r=1}^{R}. In order to model the distribution of D*, we can take B bootstrap resamples of R to fit a series of regression functions and obtain corresponding estimates {D̂_b}_{b=1}^{B}. Given this set of estimates of the data requirement, we estimate the PDF via Kernel Density Estimation (KDE). Finally, to fit the CDF, we numerically integrate the PDF.
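This estimation pipeline can be sketched as follows. The synthetic learning curve, the log-log linear fit (standing in for the paper's regression functions), and the residual-resampling bootstrap are all simplifying choices of ours:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Learning-curve data R = {(q_r, V(D_{q_r}))}: synthetic noisy scores.
q_obs = np.array([2_000.0, 4_000.0, 6_000.0, 8_000.0, 10_000.0])
v_obs = 0.95 - 3.0 * q_obs ** -0.35 + rng.normal(0.0, 0.003, q_obs.size)

log_q, log_v = np.log(q_obs), np.log(v_obs)
slope, intercept = np.polyfit(log_q, log_v, 1)   # log v ~ intercept + slope*log q
resid = log_v - (intercept + slope * log_q)

def requirement(a, b, target=0.90):
    """Smallest q with predicted score >= target under log v = a + b log q."""
    return float(np.exp((np.log(target) - a) / b))

# Bootstrap: resample residuals, refit, and re-solve for the data requirement.
B = 200
d_hats = []
for _ in range(B):
    v_b = intercept + slope * log_q + rng.choice(resid, size=resid.size)
    b, a = np.polyfit(log_q, v_b, 1)
    d_hats.append(requirement(a, b))

# KDE for the PDF of D*, then a numerically integrated (normalized) CDF.
pdf = gaussian_kde(d_hats)
grid = np.linspace(1_000.0, 500_000.0, 2_000)
density = pdf(grid)
cdf = np.cumsum(density) * (grid[1] - grid[0])
```

The resulting `density` and `cdf` arrays are exactly the f(q) and F(q) inputs that problem (4) consumes in the next step of LOC.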
In our complete framework, LOC, we first estimate F(q) and f(q). We then use these models to solve problem (4). Note that in the t-th round of collection, we fix the prior decision variables q_1, …, q_{t−1} constant. Finally, we collect data as determined by the optimal solution q*_t to problem (4). Full details of the learning and optimization steps, including the complete algorithm, are in Appendix B.
4.3 Comparison to Mahmood et al. [2]
Our prediction model extends the previous approach of Mahmood et al. [2], who consider only point estimation of D*. They (i) build the set R, (ii) fit a parametric function v̂(q; θ) to R via least-squares minimization, and (iii) solve for D̂ = arg min_q { q | v̂(q; θ) ≥ V* }. They use several parametric functions from the neural scaling law literature, including the power law function (i.e., v̂(q; θ) := θ_0 q^{θ_1} + θ_2 [2, 8], where θ := {θ_0, θ_1, θ_2}), and use an ad hoc correction factor, obtained by trial and error on past tasks, to help decrease the failure rate. Instead, we take bootstrap samples of R to fit multiple regression functions, estimate a distribution for D̂, and incorporate it into our novel optimization model. Finally, we show in the next two sections that our optimization problem has analytic solutions and extends to multiple sources.
5 Analytic Solutions for the T= 1 Setting
In this section, we explore analytic solutions for problem (4). The unobservable D* and the sequential decision-making structure suggest this problem can be formulated as a Partially Observable Markov Decision Process (POMDP) with an infinite state and action space (see Appendix C.1), but such problems rarely permit exact solution methods [32]. Nonetheless, we can derive exact solutions for the simple case of a single T = 1 round, re-stated below:

  min_{q_1}  c(q_1 − q_0) + P(1 − F(q_1))   s.t.  q_0 ≤ q_1.   (5)

Theorem 1. Assume F(q) is strictly increasing and continuous. If there exist q_1 ≥ q_0 and ε̂ ≥ 0 where

  c ≤ P (F(q_1) − F(q_0)) / (q_1 − q_0),   ε̂ ≤ 1 − F(q_0),   P = c / f(F^{−1}(1 − ε̂)),   (6)

then there exists an ε* ≤ 1 − F(q_0) that satisfies P = c / f(F^{−1}(1 − ε*)), and an optimal solution to the corresponding problem (5) is q*_1 := F^{−1}(1 − ε*). Otherwise, the optimal solution is q*_1 := q_0.

When the penalty P is specified via a failure risk ε, the optimal solution to problem (5) is equal to a quantile of the distribution of D*. We defer the proof and some auxiliary results to Appendix C.2.

Theorem 1 further provides guidelines on choosing values for the cost and penalty parameters. While c is the dollar-value cost per sample, which includes acquisition, cleaning, and annotation, P can reflect the designer's inherent regret or opportunity cost of failing to meet the target score. A designer can accept a risk ε of failing to collect enough data, Pr{q* < D*} = ε. From Theorem 1, their optimal strategy should be to collect F^{−1}(1 − ε) points, which is also the optimal solution to problem (5).
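Theorem 1's prescription can be checked numerically: fix a risk ε, set P = c / f(F^{−1}(1 − ε)), and confirm that a brute-force minimization of problem (5) lands on the (1 − ε)-quantile. The Normal model for D* and all the constants below are our illustrative assumptions:

```python
import math

MU, SIGMA = 120_000.0, 20_000.0   # hypothetical Normal model for D*

def F(x):   # CDF of D*
    return 0.5 * (1.0 + math.erf((x - MU) / (SIGMA * math.sqrt(2.0))))

def f(x):   # PDF of D*
    return math.exp(-0.5 * ((x - MU) / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))

def F_inv(p, lo=MU - 10 * SIGMA, hi=MU + 10 * SIGMA):
    for _ in range(200):          # bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F(mid) < p else (lo, mid)
    return 0.5 * (lo + hi)

c, eps, q0 = 1.0, 0.05, 20_000.0
q1_star = F_inv(1.0 - eps)        # Theorem 1: collect the (1 - eps)-quantile
P = c / f(q1_star)                # the matching penalty from condition (6)

def objective(q1):                # problem (5)
    return c * (q1 - q0) + P * (1.0 - F(q1))

# Brute-force check: minimize problem (5) on a fine grid.
grid = [q0 + 10.0 * k for k in range(30_000)]   # q0 up to 320,000, step 10
q1_brute = min(grid, key=objective)
```

With ε = 0.05, the quantile is roughly μ + 1.645σ, and the grid search recovers the same point, matching the theorem's claim that the penalty and the risk level encode the same preference.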
6 The Multi-variate LOC: Collecting Data from Multiple Sources
So far, we have assumed that a designer only chooses how much data to collect and must pay a fixed per-sample collection cost. We now explore the multi-variate extension of the data collection problem where there are different types of data with different costs. For example, consider long-tail learning, where samples for some rare classes are harder to obtain and thus more expensive [33]; semi-supervised learning, where labeling data may cost more than collecting unlabeled data [34]; or domain adaptation, where a source data set is easier to obtain than a target set [35]. In this section, we highlight our main formulation and defer the complete multi-variate LOC to Appendix D.