visited exactly once per training epoch [21, 54, 55], we design the data sharing and coordinated reads features of tf.data service with more relaxed data visitation guarantees. We find that relaxing data visitation to at-most-once (instead of exactly-once) guarantees generally has negligible impact on model accuracy, as models are typically trained with larger datasets than in academic settings. Relaxing data visitation guarantees simplifies the design of ephemeral data sharing, coordinated reads, and fault tolerance mechanisms (e.g., compared to other ML data caching systems [30, 45, 55]).
We present the first open-source disaggregated data processing framework for ML workloads and show it has three key advantages for large-scale ML training jobs: 1) horizontal scale-out to right-size CPU/RAM host resources for data processing, 2) ephemeral data sharing to avoid redundant preprocessing among fully or partially concurrent jobs, and 3) coordinated reads to avoid stragglers in distributed training that can arise due to differences in input data sizes. Our design is inspired by lessons from deploying tf.data service in production, including the ability to relax data visitation guarantees without impacting model accuracy. Since launching tf.data service open source in 2020, the system has already been used as the foundation for several research projects, including data echoing [13], Cachew [30], and FastFlow [71].
2 BACKGROUND AND RELATED WORK
ML input data processing. Input data processing involves an "extract transform load" (ETL) pipeline. Jobs read source data from storage, process data on-the-fly, and load batches of transformed data to ML accelerators for model training or inference. Common data transformations include decompressing data, parsing file formats, extracting features, and batching elements. It is also common to add randomness to input data (e.g., randomly sample, augment, and shuffle elements) to improve model generalization [15, 16, 18, 64].
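As a concrete illustration, the sketch below shows a minimal tf.data pipeline covering these ETL stages; the file pattern, feature spec, and augmentation choices are hypothetical placeholders rather than any specific production pipeline:

```python
import tensorflow as tf

# Extract: read serialized, compressed records from storage
# (the file pattern and GZIP compression are illustrative assumptions).
files = tf.data.Dataset.list_files("gs://bucket/train/*.tfrecord.gz")
dataset = tf.data.TFRecordDataset(files, compression_type="GZIP")

# Transform: parse the file format, extract features, augment, shuffle, batch.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_and_augment(record):
    example = tf.io.parse_single_example(record, feature_spec)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    image = tf.image.resize(image, [224, 224])
    image = tf.image.random_flip_left_right(image)  # random augmentation
    return image, example["label"]

dataset = (dataset
           .shuffle(10_000)                      # randomize element order
           .map(parse_and_augment, num_parallel_calls=8)
           .batch(128))

# Load: prefetch batches so the accelerator is not left waiting for data.
dataset = dataset.prefetch(2)
```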
While model training or inference typically executes on expensive, specialized hardware, user-defined input data transformations execute on general-purpose CPUs. Their relative cost difference makes it particularly important to ensure that accelerators operate at high utilization [4, 23]. If the ML computation does not have to wait for data, the job is considered model-bound. Otherwise, the job is considered input-bound. Input-bound jobs are more problematic as they leave valuable ML hardware underutilized; however, model-bound jobs can also be costly by leaving CPUs underutilized. Removing input bottlenecks is critical, as input-bound jobs hog valuable hardware for extended periods of time, incurring significant costs and significant delays both for the job itself and for other jobs waiting for the hardware resources to free up.
Data processing frameworks. Generic data processing frameworks, such as Beam [1], Flume [2], and Spark [76], are often used for offline ML data processing (i.e., processing that takes place prior to any ML compute and handles data cleaning, feature engineering, normalization, etc.). These generic frameworks are not suitable for online ML data processing (i.e., on-the-fly data processing during a training job that handles data augmentation, shuffling, batching, etc.), primarily due to the overhead they impose. For instance, Spark Streaming recommends batching work at 50 ms granularity [65], while ML training jobs often have step times below 1 ms. Other issues stem from API mismatches between generic data processing frameworks and ML training frameworks, a lack of holistic optimization opportunities to improve runtime performance (as the preprocessing and ML compute frameworks are disjoint), and the difficulty of accommodating ML-specific preprocessing requirements (such as relaxed data visitation guarantees, which we describe below).
ML-specific data frameworks, used for online data preprocessing, include PyTorch DataLoader [14], NVIDIA DALI [31], and tf.data [56]. We focus on tf.data as it is widely used internally at Google and in open-source TensorFlow programs. tf.data provides an API and a runtime for creating and executing efficient ML data processing pipelines. With tf.data, users can express and execute input data pipelines using a variety of libraries that integrate with ML frameworks like PyTorch [58] and TensorFlow [3]. It provides generic operators (e.g., map, filter, prefetch, etc.) and an autotuning feature that automatically adjusts runtime configuration knobs (e.g., parallelism, prefetching, etc.) [22].
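A small, self-contained sketch of these operators and of how autotuning is requested via tf.data.AUTOTUNE (the filter predicate and transformation are arbitrary placeholders):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1_000_000)
dataset = (dataset
           .filter(lambda x: x % 2 == 0)              # keep even elements
           .map(lambda x: x * x,                      # user-defined transform
                num_parallel_calls=tf.data.AUTOTUNE)  # parallelism chosen by autotuning
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))               # prefetch depth chosen by autotuning

for batch in dataset.take(2):
    print(batch.numpy()[:4])
```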
Colocated vs. disaggregated data processing. Traditionally, ML frameworks have colocated input data processing and ML computations on the same machine [14, 57, 67]. However, feeding powerful modern accelerators with data at sufficient throughput requires more CPU and memory resources than are often locally available [30, 77]. This motivates a disaggregated processing mode, where data processing executes on separate machines whose resources can be scaled independently from expensive ML accelerators. The ability to right-size resource allocations is critical to satisfy the wide variety of resource requirements in ML jobs (see Figure 2) and ensure that all allocated resources, both costly ML accelerators and CPU/MEM, remain highly utilized.
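For example, a colocated tf.data pipeline can be routed to a disaggregated deployment through tf.data service; the following is a minimal sketch in which the dispatcher address is a placeholder for an already-running service:

```python
import tensorflow as tf

# The host-side pipeline definition is unchanged...
dataset = (tf.data.Dataset.range(1_000_000)
           .map(lambda x: x * x, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32))

# ...but the preprocessing executes on remote tf.data service workers, whose
# CPU/MEM can be scaled independently of the training hosts.
# "grpc://dispatcher:5000" is a placeholder dispatcher address.
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode="parallel_epochs",   # each consumer sees the full dataset
        service="grpc://dispatcher:5000"))

dataset = dataset.prefetch(tf.data.AUTOTUNE)
```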
Data visitation guarantees. In the ML community, it is customary to train models with exactly-once data visitation semantics for each epoch [7, 8, 21, 54]. This means that every sample in the dataset is visited exactly once before any sample is re-visited during training.
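In tf.data, for instance, this distinction shows up in how shuffling and repetition are composed: shuffling before repeating preserves exactly-once visitation per epoch, while repeating before shuffling lets elements from different epochs mix. A sketch, with buffer size and epoch count chosen arbitrarily:

```python
import tensorflow as tf

elements = tf.data.Dataset.range(1_000)

# Exactly-once per epoch: each epoch is a complete, reshuffled pass over the data.
exact = elements.shuffle(1_000, reshuffle_each_iteration=True).repeat(3)

# Relaxed visitation: shuffling the repeated stream mixes epoch boundaries,
# so some elements may be seen again before others are seen at all.
relaxed = elements.repeat(3).shuffle(1_000)
```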
For small datasets, deviating from exactly-once guarantees can skew the data distribution, potentially leading to a less generalizable or lower-accuracy model [21, 54]. In §3, we will discuss how relaxing data visitation guarantees is possible for production ML jobs, as