
[Figure 1 flowchart: initialize with q0 data points and target V*; Learn (sample bootstrap performance with subsets of data to learn the data-requirement distribution); Optimize (use the estimated distribution to optimize collection cost plus the risk of failing to meet V*); Collect (pay c(qt − qt−1) to collect additional data); if score V* is achieved, terminate; if the time limit T is hit, pay P and terminate; otherwise repeat.]
Figure 1: In the optimal data collection problem, we iteratively determine the amount of data that we should have, pay to collect the additional data, and then re-evaluate our model. Our approach, Learn-Optimize-Collect, optimizes for the minimum amount of data q∗_t to collect.
In this paper, we propose a new paradigm that models the data collection workflow as an optimal data collection problem. Here, a designer must minimize the cost of collecting enough data to obtain a model that achieves a desired performance score. They have multiple collection rounds, where after each round, they re-evaluate the model and decide how much more data to order. The data has per-sample costs, and the designer additionally pays a penalty if they fail to meet the target score within a finite horizon. Using this formal framework, we develop an optimization approach for minimizing the expected future collection costs and show that the resulting problem can be solved in each collection round via gradient descent. Furthermore, our optimization problem immediately generalizes to decisions over multiple data sources (e.g., unlabeled, long-tail, cross-domain, synthetic) that have different costs and impacts on performance. Finally, we demonstrate the value of optimization over naïvely estimating data set requirements (e.g., [2]) for several machine learning tasks and data sets.
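To make the per-round trade-off concrete, the following is a minimal sketch (function names, the sigmoid smoothing, and all parameter values are ours, not the paper's implementation) of one Optimize step: given bootstrapped samples of the data requirement, choose the next quantity q by gradient descent on the collection cost plus the penalized risk of falling short.

```python
import numpy as np

def optimize_collection(req_samples, q_t, cost_per_sample, penalty,
                        lr=10.0, steps=2000):
    """Illustrative Optimize step: pick q minimizing the expected future cost
        cost_per_sample * (q - q_t) + penalty * Pr[requirement > q],
    where Pr is estimated from bootstrapped requirement samples and the
    indicator is smoothed by a sigmoid so gradient descent applies."""
    req = np.asarray(req_samples, dtype=float)
    scale = req.std() + 1e-8           # smoothing width for the sigmoid
    q = req.mean()                     # start at the mean estimated requirement
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-(req - q) / scale))  # smoothed indicators
        grad_risk = -(s * (1.0 - s)).mean() / scale   # d/dq of mean sigmoid
        q -= lr * (cost_per_sample + penalty * grad_risk)
        q = max(q, q_t)                # cannot return already-collected data
    return q
```

A larger penalty for missing the target pushes the chosen q further into the upper tail of the estimated requirement distribution, which is the qualitative behavior the framework is designed to capture.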
Our contributions are as follows. (1) We propose the optimal data collection problem in machine
learning, which formalizes data collection workflows. (2) We introduce Learn-Optimize-Collect
(LOC), a learning-and-optimizing framework that minimizes future collection costs, can be solved
via gradient descent, and has analytic solutions in some settings. (3) We generalize the data collection
problem and LOC to a multi-variate setting where different types of data have different costs. To the
best of our knowledge, this is the first exploration of data collection with multiple general data sets in machine learning, covering, for example, semi-supervised and long-tail learning. (4) We perform experiments over classification, segmentation, and detection tasks to show, on average, approximately a 2× reduction in the chance of failing to meet performance targets, versus estimation baselines.
2 Related work
Neural Scaling Laws.
According to the neural scaling law literature, the performance of a model on a validation set scales with the size of the training data set q via a power law V ∝ θ0 · q^θ1 [5, 6, 8–10, 15–19]. Hestness et al. [5] observe this property over vision, language, and audio tasks; Bahri et al. [9] develop a theoretical relationship under assumptions on over-parametrization and the Lipschitz continuity of the loss, model, and data; and Rosenfeld et al. [6] estimate power laws using smaller data sets and models to extrapolate future performance. Multi-variate scaling laws have also been considered for some specific tasks, for example in transfer learning from synthetic to real data sets [11]. Finally, Mahmood et al. [2] explore data collection by estimating the minimum amount of data needed to meet a given target performance over multiple rounds. Our paper extends these prior studies by developing an optimization problem that minimizes the expected total cost of data collected. Specifically, we incorporate the uncertainty in any regression estimate of data requirements and further generalize to multiple data sources with different costs.
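The power-law regression underlying these estimates is simple to reproduce. The sketch below (our own illustration, not code from any cited work) fits V ≈ θ0 · q^θ1 by least squares in log space and inverts it to estimate the data requirement for a target score V*:

```python
import numpy as np

def fit_power_law(qs, vs):
    """Fit V = theta0 * q**theta1 via linear regression in log space:
    log V = log theta0 + theta1 * log q."""
    A = np.vstack([np.ones_like(qs, dtype=float), np.log(qs)]).T
    coef, *_ = np.linalg.lstsq(A, np.log(vs), rcond=None)
    return np.exp(coef[0]), coef[1]   # theta0, theta1

def required_data(theta0, theta1, v_target):
    """Invert V* = theta0 * q**theta1  =>  q = (V*/theta0)**(1/theta1)."""
    return (v_target / theta0) ** (1.0 / theta1)
```

Point estimates of this kind are exactly what our approach builds on: since the fitted θ0, θ1 are uncertain, we work with a distribution over the implied requirement rather than a single inverted value.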
Active Learning.
In active learning, a model sequentially collects data by selecting new subsets of an unlabeled data pool to label under a pre-determined labeling budget that replenishes after each round [20–24]. In contrast, our work focuses on systematically determining an optimal collection budget. After determining how much data to collect, we can use active learning techniques to collect the desired amount of data.
Statistical Learning Theory.
Theoretical analysis of the sample complexity of machine learning models is typically only tight asymptotically, but some recent works have empirically analyzed these