tf.data service: A Case for Disaggregating ML Input Data Processing

Andrew Audibert (Google), Yang Chen (Google), Dan Graur (ETH Zürich), Ana Klimovic (ETH Zürich), Jiří Šimša (Google), Chandramohan A. Thekkath (Google)
ABSTRACT
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to either under-utilizing the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system.

We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale out to right-size CPU/RAM host resources for data processing in each job, saving 32× training time and 26× cost, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2×, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SoCC ’23, October 30–November 1, 2023, Santa Cruz, CA, USA
©2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0387-4/23/11.
https://doi.org/10.1145/3620678.3624666
ACM Reference Format:
Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiří Šimša,
and Chandramohan A. Thekkath. 2023. tf.data service: A Case for
Disaggregating ML Input Data Processing. In ACM Symposium on
Cloud Computing (SoCC ’23), October 30–November 1, 2023, Santa
Cruz, CA, USA. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3620678.3624666
1 INTRODUCTION
The ubiquity of machine learning (ML) has led to widespread deployment of specialized hardware to accelerate ML computations (e.g., GPUs and TPUs). While vastly improving the performance per watt of ML training and inference [39, 40], hardware accelerators are significantly more expensive than traditional CPU servers [4, 23]. Hence, achieving energy and cost efficiency benefits requires keeping hardware accelerators highly utilized.

Operating ML accelerators at high utilization requires feeding fresh batches of preprocessed data at the rate at which ML computations executing on accelerators request new data. Data preprocessing typically executes on CPU host resources of ML servers, as data transformations consist of user-defined functions [56]. This makes right-sizing host CPU/RAM (for data preprocessing) to ML accelerator cores (for ML model computations) imperative, in order to avoid data preprocessing stalls and maximize resource efficiency.

However, ML jobs require diverse ratios of host CPU/RAM and ML accelerator resources [30, 55]. Figure 1 shows the distributions of host CPU and memory resource requirements for data preprocessing across ML jobs at Google, normalized to the peak usage for each resource. As the distributions are heavy-tailed, picking a particular CPU and memory configuration for ML accelerator hosts satisfies the requirements of only a limited set of jobs at a particular point 𝑝 on the x-axes of the CDFs. This would leave many jobs on the left of 𝑝 underutilizing CPU/RAM and many jobs on the right of 𝑝 with insufficient CPU/RAM for data preprocessing, idling ML accelerators. Hence, provisioning ML jobs with a one-size-fits-all ratio of host CPU/RAM and ML accelerator resources is inefficient for most jobs.
Figure 1: CDFs of normalized ML host resource usage for over 73k colocated processing jobs running in Google datacenters over a 24-hour period; panels show (a) CPU usage, (b) memory capacity usage, and (c) memory bandwidth usage. The takeaway is that host resource requirements vary widely across ML jobs.
Figure 2: RetinaNet [48] CPU/MEM usage when training on COCO [49] using a GCP TPU v2-8 VM [24].
Meanwhile, datacenter providers must typically limit server heterogeneity to simplify resource management and server maintenance at scale [42].
As today’s ML accelerator servers are typically provisioned with generous amounts of CPU and RAM [5, 24], one could improve resource utilization by loaning spare CPU/RAM on ML accelerator hosts to other CPU workloads. However, colocating workloads with ML data preprocessing is challenging. Figure 2 shows that data preprocessing has highly bursty CPU usage as batches of data are processed and loaded to the accelerator. Even if the average CPU utilization on many ML accelerator hosts may be low, the bursty nature of ML data preprocessing makes it difficult to share host resources with other jobs without significant interference.
An alternative approach is to disaggregate resources for data preprocessing and ML computations. Disaggregation enables allocating CPU/RAM and ML accelerators independently for each job to meet its unique requirements. Just as disaggregating storage from compute is known to improve resource efficiency in datacenters [17, 42, 60], disaggregating ML accelerators from CPU/RAM enables right-sizing resources for ML workloads. For example, Meta has reported that their internal closed-source ML data preprocessing system, DPP [77], relies on scaling out data preprocessing to disaggregated CPU hosts to meet the data ingestion demands of large-scale recommender system model training [51]. However, disaggregating ML data processing does not come for free: it requires recruiting potentially many remote nodes for data processing, dealing with failures of these distributed nodes, and sending large volumes of preprocessed data over the network. There is a need for a data processing platform that optimizes these tradeoffs to support resource-efficient ML input data preprocessing.
We present the design and implementation of the first open-source disaggregated data processing framework, tf.data service, available through the TensorFlow GitHub project [68]. We evaluate tf.data service on a variety of production ML workloads at Google from the vision and natural language processing (NLP) domains. We show that the primary advantage of disaggregation is the ability to horizontally scale out data processing to remote workers to remove data stalls, which improves training throughput by 31.7× and reduces cost by 26.2×, on average, for input-bound jobs. The cost savings of maximizing accelerator utilization and reducing training time significantly outweigh the cost of using extra CPU hosts for data processing.
We also show that the disaggregated architecture of tf.data service has two additional advantages. First, because ML jobs fetch preprocessed data from remote hosts, intermediate preprocessing results can be shared between a set of jobs that execute in overlapping time windows. This ephemeral data sharing feature saves CPU cycles by avoiding redundant computations. Second, the disaggregated system architecture of tf.data service enables coordinating data ingestion across clients in distributed ML training jobs to reduce stragglers and improve training time and cost. For example, NLP models typically ingest data with a wide variety of input sizes, which can lead to stragglers among training clients. The coordinated reads feature in tf.data service allows all ML accelerator clients in a job to fetch data from remote workers that prepare batches of data with similar size per training iteration, resulting in up to 2× overall speedup due to reduced data padding and synchronization overheads.
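To make the coordinated reads interface concrete, the sketch below shows how a client might request them through the open-source tf.data service API; the dispatcher address, job name, and client count are illustrative placeholders rather than values from the paper, and each training client would supply its own consumer index.

```python
import tensorflow as tf

NUM_CLIENTS = 8    # illustrative: number of accelerator clients in the training job
CLIENT_INDEX = 0   # each client passes its own index in [0, NUM_CLIENTS)

# Hypothetical source files; any tf.data pipeline works here.
dataset = tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/text-*.tfrecord"))

# Coordinated reads require an infinite dataset; every client then receives
# data produced for the same round on each training step, which helps keep
# per-step batch shapes (e.g., padded sequence lengths) similar across clients.
dataset = dataset.repeat().apply(
    tf.data.experimental.service.distribute(
        processing_mode="parallel_epochs",
        service="grpc://dispatcher.example.com:5050",  # placeholder dispatcher address
        job_name="nlp_training",
        num_consumers=NUM_CLIENTS,
        consumer_index=CLIENT_INDEX,
    )
)
```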
Contrary to the conventional wisdom of enforcing strict
data ordering guarantees and ensuring that each example is
visited exactly once per training epoch [21, 54, 55], we design the data sharing and coordinated reads features of tf.data service with more relaxed data visitation guarantees. We find that relaxing data visitation to at-most-once (instead of exactly-once) guarantees generally has negligible impact on model accuracy, as models are typically trained with larger datasets than in academic settings. Relaxing data visitation guarantees simplifies the design of ephemeral data sharing, coordinated reads, and fault tolerance mechanisms (e.g., compared to other ML data caching systems [30, 45, 55]).
We present the rst open-source disaggregated data pro-
cessing framework for ML workloads and show it has three
key advantages for large-scale ML training jobs: 1) horizon-
tal scale-out to right-size CPU/RAM host resources for data
processing, 2) ephemeral data sharing to avoid redundant
preprocessing among fully or partially concurrent jobs, and
3) coordinated reads to avoid stragglers in distributed train-
ing that can arise due to dierences in input data sizes. Our
design is inspired by lessons from deploying tf.data service in
production, including the ability to relax data visitation guar-
antees without impacting model accuracy. Since launching
tf.data service open source in 2020, the system has already
been used as the foundation for several research projects,
including data echoing [
13
], Cachew [
30
], and FastFlow [
71
].
2 BACKGROUND AND RELATED WORK
ML input data processing. Input data processing involves an “extract transform load” (ETL) pipeline. Jobs read source data from storage, process data on-the-fly, and load batches of transformed data to ML accelerators for model training or inference. Common data transformations include decompressing data, parsing file formats, extracting features, and batching elements. It is also common to add randomness to input data (e.g., randomly sample, augment, and shuffle elements) to improve model generalization [15, 16, 18, 64]. While model training or inference typically executes on expensive, specialized hardware, user-defined input data transformations execute on general purpose CPUs. Their relative cost difference makes it particularly important to ensure that accelerators operate at high utilization [4, 23]. If the ML computation does not have to wait for data, the job is considered model-bound. Otherwise, the job is considered input-bound. Input-bound jobs are more problematic as they leave valuable ML hardware underutilized; however, model-bound jobs can also be costly by leaving CPUs underutilized. Removing input bottlenecks is critical as input-bound jobs hog valuable hardware for extended periods of time, incurring significant costs and significant delays for the job itself as well as other jobs that are waiting for the hardware resources to free up.
Data processing frameworks. Generic data processing frameworks, such as Beam [1], Flume [2], and Spark [76], are often used for offline ML data processing (i.e., processing that takes place prior to any ML compute and handles data cleaning, feature engineering, normalization, etc.). These generic frameworks are not suitable for online ML data processing (i.e., on-the-fly data processing during a training job that handles data augmentation, shuffling, batching, etc.), primarily due to the overhead they impose. For instance, Spark Streaming recommends batching work at 50ms granularity [65], while ML training jobs often have step times below 1ms. Other issues stem from API mismatches between generic data processing frameworks and ML training frameworks, a lack of holistic optimization opportunities to improve runtime performance (as the preprocessing and ML compute frameworks are disjoint), and the difficulty of accommodating ML-specific preprocessing requirements (such as relaxed data visitation guarantees, which we describe below).
ML-specic data frameworks, used for online data prepro-
cessing, include PyTorch DataLoader [
14
], NVIDIA DALI [
31
],
and tf.data [
56
]. We focus on tf.data as it is widely used in-
ternally at Google and in open-source Tensorow programs.
tf.data provides an API and a runtime for creating and exe-
cuting ecient ML data processing pipelines. With tf.data,
users can express and execute input data pipelines using
a variety of libraries that integrate with ML frameworks
like PyTorch [
58
] and Tensorow [
3
]. It provides generic
operators (e.g.
map
,
filter
,
prefetch
, etc.) and an autotun-
ing feature that automatically adjusts runtime conguration
knobs (e.g. parallelism, prefetching, etc.) [22].
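As an illustration of this API (not taken from the paper), the sketch below builds a small input pipeline with the operators and autotuning knobs mentioned above; the file pattern and the assumption that each record holds a raw JPEG image are hypothetical.

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let the tf.data runtime pick parallelism and buffer sizes

def parse_and_augment(record):
    # Illustrative user-defined transformation: assumes each TFRecord entry is
    # a raw JPEG-encoded image; decodes, randomly flips, and resizes it.
    image = tf.io.decode_jpeg(record, channels=3)
    image = tf.image.random_flip_left_right(image)
    return tf.image.resize(image, [224, 224])

files = tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord")  # hypothetical path
dataset = (
    tf.data.TFRecordDataset(files)
    .shuffle(buffer_size=10_000)                           # add randomness across epochs
    .map(parse_and_augment, num_parallel_calls=AUTOTUNE)   # user-defined transformation
    .batch(128)
    .prefetch(AUTOTUNE)                                    # overlap preprocessing with training
)
```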
Colocated vs. disaggregated data processing. Traditionally, ML frameworks have colocated input data processing and ML computations on the same machine [14, 57, 67]. However, feeding powerful modern accelerators with data at sufficient throughput requires more CPU and memory resources than are often locally available [30, 77]. This motivates a disaggregated processing mode, where data processing executes on separate machines whose resources can be scaled independently from expensive ML accelerators. The ability to right-size resource allocations is critical to satisfy the wide variety of resource requirements in ML jobs (see Figure 1) and ensure that all allocated resources — both costly ML accelerators and CPU/MEM — remain highly utilized.
Data visitation guarantees. In the ML community, it is customary to train models with exactly-once data visitation semantics for each epoch [7, 8, 21, 54]. This means that every sample in the dataset is visited exactly once before any sample is re-visited during training. For small datasets, deviating from exactly-once guarantees can skew the data distribution, potentially leading to a less generalizable or lower accuracy model [21, 54]. In §3, we will discuss how relaxing data visitation guarantees is possible for production ML jobs, as
they generally train on signicantly larger datasets that are
continuously updated [36].
Prior work on optimizing input processing. Several approaches have been proposed to alleviate input data stalls in ML training jobs. Plumber [44] and tf.data’s autotuning harness [67] dynamically tune software parallelism and memory buffer sizes to maximize performance on a given training node. However, this tuning does not scale resources beyond a single node. NVIDIA DALI supports offloading data processing to GPUs [31]. This alleviates CPU bottlenecks but may lead to GPU resource contention among input data transformation tasks and model training tasks. Other works cache and reuse (transformed) data across epochs [13, 46] or across jobs [30, 45, 55], trading storage capacity to save CPU cycles for repeated data processing. However, caching solutions are still not guaranteed to eliminate data stalls. In-memory caching solutions have capacity limitations since ML datasets often exceed the size of a training node’s RAM [67]. SSD-based caching requires high parallelism to avoid I/O bottlenecks when reading large volumes of data from storage.
Meta’s closed-source Distributed Data Processing (DPP) service [77] proposes horizontally scaling data workers with disaggregated data processing. Zhao et al. [77] characterize the compute, memory, and network resource requirements of data processing compared to model training for recommender system models at Meta and advocate for disaggregation to scale out data processing and avoid input bottlenecks. However, they do not quantify the performance and cost benefits of disaggregated data processing compared to colocated data processing. In contrast, we provide the first quantitative analysis of how much disaggregated data processing actually improves the overall performance and cost of production ML workloads. We also go beyond DPP’s design and show the benefits of disaggregation besides horizontal scaling, such as enabling ephemeral data sharing and coordinated reads.
FastFlow [71] builds on top of tf.data service, leveraging its ability to preprocess data on both local and remote resources. FastFlow extends our disaggregation mechanism to support splitting data preprocessing between local and remote workers at specific points in an input pipeline. The system selects a pipeline split that maximizes throughput in a fixed-size tf.data service deployment, i.e., the system decides which portion of an input pipeline to process locally vs. remotely. FastFlow is complementary to our work and further shows the benefits that flexible resource allocation with disaggregation can bring to ML data processing.
3 TF.DATA SERVICE DESIGN
We present tf.data service, a system that enables disaggregating and distributing data processing for ML workloads. tf.data service integrates seamlessly with ML frameworks, such as TensorFlow or PyTorch, and is particularly designed to meet the data processing needs of ML training jobs. We design tf.data service with three key principles in mind:

(1) Eliminate input data stalls. Eliminating data stalls is key to maximizing throughput, which is an important performance metric for ML training jobs. As we will demonstrate in §4, using extra CPU/RAM resources to alleviate data preprocessing stalls is often cost-efficient for input-bound jobs as it helps them complete faster and hence consume expensive hardware accelerators for less time.
(2) Avoid redundant data preprocessing. Input data pipelines are often repeatedly executed in ML clusters, for example in hyperparameter tuning or model search workflows [41, 56]. Since data preprocessing can be compute-, memory-, and power-intensive [56, 77], reducing redundant data preprocessing (i.e., sharing intermediate results) across jobs is important for hardware and energy efficiency.
(3) Simplify the design by relaxing constraints. While guaranteeing exactly-once data visitation is customary when benchmarking ML training on academic datasets [54, 55], we find that production ML training jobs allow for more relaxed data visitation guarantees with negligible impact on accuracy. These jobs train on vast volumes of data that are continuously updated [21, 36]. Some datasets are large enough that training on all data may not be practical, as the model can safely converge on a subset of the entire dataset without the risk of overfitting [21]. Hence, we find that at-most-once data visitation is sufficient for most jobs (i.e., each sample is visited at most once per epoch).¹ We can leverage relaxed data ordering guarantees to simplify fault tolerance.
With the above principles in mind, we design tf.data service as a disaggregated system that enables: eliminating input data bottlenecks by horizontally scaling out data workers (§3.1), reducing CPU usage for data processing by sharing ephemeral data transformations between jobs (§3.5), and avoiding stragglers in distributed training by coordinating reads between multiple clients (§3.6).
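As a concrete illustration of the client-side handoff to the service, the sketch below distributes an existing tf.data pipeline to remote workers through the open-source API; the dispatcher address and job name are placeholders, and the processing mode is one of the supported options rather than a value prescribed by the paper.

```python
import tensorflow as tf

# Any ordinary tf.data pipeline can be handed off to the service; a trivial
# pipeline stands in here for a real input pipeline.
dataset = tf.data.Dataset.range(1_000).map(lambda x: x * 2)

# Register the pipeline with the dispatcher; remote workers execute it and
# stream processed elements back to this client.
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode="distributed_epoch",           # dynamically shard the source across workers
        service="grpc://dispatcher.example.com:5050",  # placeholder dispatcher address
        job_name="my_training_job",                    # clients passing the same name share one job
    )
)
```

Clients that pass the same job_name consume from a single shared service-side job instead of each triggering its own preprocessing.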
3.1 System Architecture
Figure 3 shows the tf.data service architecture. The core system is composed of a dispatcher, which manages various metadata, and a pool of workers, which perform data preprocessing. The dispatcher and workers are managed by an ML job orchestrator. Clients execute ML computations (e.g., model training) on servers equipped with accelerators. Distributed training jobs consist of multiple clients. Source data
¹ Further relaxing visitation guarantees to zero, one, or more visits per sample can produce a high quality model in some cases, as long as the random transformations applied to the source data ensure a sufficiently diverse and representative distribution [13, 46].
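For readers who want to stand up the dispatcher/worker roles described in §3.1, the following minimal sketch runs both in a single process using the open-source API; this in-process setup and the chosen port are illustrative assumptions, not the orchestrator-managed production deployment described in the paper.

```python
import tensorflow as tf

# Start a dispatcher (metadata and coordination) and one worker (data
# preprocessing) in-process; a production deployment runs these as separate,
# orchestrator-managed server binaries. The port is an arbitrary choice.
dispatcher = tf.data.experimental.service.DispatchServer(
    tf.data.experimental.service.DispatcherConfig(port=5050)
)
dispatcher_address = dispatcher.target.split("://")[1]  # e.g. "localhost:5050"

worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(dispatcher_address=dispatcher_address)
)

# Clients then point tf.data.experimental.service.distribute(...) at
# dispatcher.target (e.g. "grpc://localhost:5050"), as sketched earlier.
```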