tf.data service: A Case for Disaggregating ML Input Data Processing

Andrew Audibert (Google), Yang Chen (Google), Dan Graur (ETH Zürich), Ana Klimovic (ETH Zürich), Jiří Šimša (Google), Chandramohan A. Thekkath (Google)
ABSTRACT
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to either under-utilizing the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system.

We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale out to right-size CPU/RAM host resources for data processing in each job, saving 32× training time and 26× cost, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2×, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SoCC ’23, October 30–November 1, 2023, Santa Cruz, CA, USA
©2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0387-4/23/11.
https://doi.org/10.1145/3620678.3624666
ACM Reference Format:
Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiří Šimša,
and Chandramohan A. Thekkath. 2023. tf.data service: A Case for
Disaggregating ML Input Data Processing. In ACM Symposium on
Cloud Computing (SoCC ’23), October 30–November 1, 2023, Santa
Cruz, CA, USA. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3620678.3624666
1 INTRODUCTION
The ubiquity of machine learning (ML) has led to widespread deployment of specialized hardware to accelerate ML computations (e.g., GPUs and TPUs). While vastly improving the performance per watt of ML training and inference [39, 40], hardware accelerators are significantly more expensive than traditional CPU servers [4, 23]. Hence, achieving energy and cost efficiency benefits requires keeping hardware accelerators highly utilized.

Operating ML accelerators at high utilization requires feeding fresh batches of preprocessed data at the rate at which ML computations executing on accelerators request new data. Data preprocessing typically executes on CPU host resources of ML servers, as data transformations consist of user-defined functions [56]. This makes right-sizing host CPU/RAM (for data preprocessing) to ML accelerator cores (for ML model computations) imperative, in order to avoid data preprocessing stalls and maximize resource efficiency.

However, ML jobs require diverse ratios of host CPU/RAM and ML accelerator resources [30, 55]. Figure 1 shows the distributions of host CPU and memory resource requirements for data preprocessing across ML jobs at Google, normalized to the peak usage for each resource. As the distributions are heavy-tailed, picking a particular CPU and memory configuration for ML accelerator hosts satisfies the requirements of only a limited set of jobs at a particular point 𝑝 on the x-axes of the CDFs. This would leave many jobs on the left of 𝑝 underutilizing CPU/RAM and many jobs on the right of 𝑝 with insufficient CPU/RAM for data preprocessing, idling ML accelerators. Hence, provisioning ML jobs with a one-size-fits-all ratio of host CPU/RAM and ML accelerator resources is inefficient for most jobs.
Figure 1: CDFs of normalized ML host resource usage for over 73k colocated processing jobs running in Google datacenters over a 24-hour period; panels show (a) CPU usage, (b) memory capacity usage, and (c) memory bandwidth usage. The takeaway is that host resource requirements vary widely across ML jobs.
Figure 2: RetinaNet [48] CPU/MEM usage when training on COCO [49] using a GCP TPU v2-8 VM [24].
Meanwhile, datacenter providers must typically limit server heterogeneity to simplify resource management and server maintenance at scale [42].
As today’s ML accelerator servers are typically provisioned with generous amounts of CPU and RAM [5, 24], one could improve resource utilization by loaning spare CPU/RAM on ML accelerator hosts to other CPU workloads. However, colocating workloads with ML data preprocessing is challenging. Figure 2 shows that data preprocessing has highly bursty CPU usage as batches of data are processed and loaded to the accelerator. Even if the average CPU utilization on many ML accelerator hosts may be low, the bursty nature of ML data preprocessing makes it difficult to share host resources with other jobs without significant interference.
An alternative approach is to disaggregate resources for data preprocessing and ML computations. Disaggregation enables allocating CPU/RAM and ML accelerators independently for each job to meet its unique requirements. Just as disaggregating storage from compute is known to improve resource efficiency in datacenters [17, 42, 60], disaggregating ML accelerators from CPU/RAM enables right-sizing resources for ML workloads. For example, Meta has reported that their internal closed-source ML data preprocessing system, DPP [77], relies on scaling out data preprocessing to disaggregated CPU hosts to meet the data ingestion demands of large-scale recommender system model training [51]. However, disaggregating ML data processing does not come for free: it requires recruiting potentially many remote nodes for data processing, dealing with failures of these distributed nodes, and sending large volumes of preprocessed data over the network. There is a need for a data processing platform that optimizes these tradeoffs to support resource-efficient ML input data preprocessing.
We present the design and implementation of the first open-source disaggregated data processing framework, tf.data service, available through the TensorFlow GitHub project [68]. We evaluate tf.data service on a variety of production ML workloads at Google from the vision and natural language processing (NLP) domains. We show that the primary advantage of disaggregation is the ability to horizontally scale out data processing to remote workers to remove data stalls, which improves training throughput by 31.7× and reduces cost by 26.2×, on average, for input-bound jobs. The cost savings of maximizing accelerator utilization and reducing training time significantly outweigh the cost of using extra CPU hosts for data processing.
We also show that the disaggregated architecture of tf.data service has two additional advantages. First, because ML jobs fetch preprocessed data from remote hosts, intermediate preprocessing results can be shared between a set of jobs that execute in overlapping time windows. This ephemeral data sharing feature saves CPU cycles by avoiding redundant computations. Second, the disaggregated system architecture of tf.data service enables coordinating data ingestion across clients in distributed ML training jobs to reduce stragglers and improve training time and cost. For example, NLP models typically ingest data with a wide variety of input sizes, which can lead to stragglers among training clients. The coordinated reads feature in tf.data service allows all ML accelerator clients in a job to fetch data from remote workers that prepare batches of data with similar size per training iteration, resulting in up to 2× overall speedup due to reduced data padding and synchronization overheads.
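To make the coordinated reads interface concrete, the sketch below shows how a client might request them through the open-source tf.data service API; the dispatcher address, job name, and client count are illustrative placeholders rather than values from the paper, and each training client would supply its own consumer index.

```python
import tensorflow as tf

NUM_CLIENTS = 8    # illustrative: number of accelerator clients in the training job
CLIENT_INDEX = 0   # each client passes its own index in [0, NUM_CLIENTS)

# Hypothetical source files; any tf.data pipeline works here.
dataset = tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/text-*.tfrecord"))

# Coordinated reads require an infinite dataset; every client then receives
# data produced for the same round on each training step, which helps keep
# per-step batch shapes (e.g., padded sequence lengths) similar across clients.
dataset = dataset.repeat().apply(
    tf.data.experimental.service.distribute(
        processing_mode="parallel_epochs",
        service="grpc://dispatcher.example.com:5050",  # placeholder dispatcher address
        job_name="nlp_training",
        num_consumers=NUM_CLIENTS,
        consumer_index=CLIENT_INDEX,
    )
)
```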
Contrary to the conventional wisdom of enforcing strict
data ordering guarantees and ensuring that each example is
visited exactly once per training epoch [21, 54, 55], we design the data sharing and coordinated reads features of tf.data service with more relaxed data visitation guarantees. We find that relaxing data visitation to at-most-once (instead of exactly-once) guarantees generally has negligible impact on model accuracy, as models are typically trained with larger datasets than in academic settings. Relaxing data visitation guarantees simplifies the design of ephemeral data sharing, coordinated reads, and fault tolerance mechanisms (e.g., compared to other ML data caching systems [30, 45, 55]).
We present the rst open-source disaggregated data pro-
cessing framework for ML workloads and show it has three
key advantages for large-scale ML training jobs: 1) horizon-
tal scale-out to right-size CPU/RAM host resources for data
processing, 2) ephemeral data sharing to avoid redundant
preprocessing among fully or partially concurrent jobs, and
3) coordinated reads to avoid stragglers in distributed train-
ing that can arise due to dierences in input data sizes. Our
design is inspired by lessons from deploying tf.data service in
production, including the ability to relax data visitation guar-
antees without impacting model accuracy. Since launching
tf.data service open source in 2020, the system has already
been used as the foundation for several research projects,
including data echoing [
13
], Cachew [
30
], and FastFlow [
71
].
2 BACKGROUND AND RELATED WORK
ML input data processing. Input data processing involves an “extract transform load” (ETL) pipeline. Jobs read source data from storage, process data on-the-fly, and load batches of transformed data to ML accelerators for model training or inference. Common data transformations include decompressing data, parsing file formats, extracting features, and batching elements. It is also common to add randomness to input data (e.g., randomly sample, augment, and shuffle elements) to improve model generalization [15, 16, 18, 64]. While model training or inference typically executes on expensive, specialized hardware, user-defined input data transformations execute on general purpose CPUs. Their relative cost difference makes it particularly important to ensure that accelerators operate at high utilization [4, 23]. If the ML computation does not have to wait for data, the job is considered model-bound. Otherwise, the job is considered input-bound. Input-bound jobs are more problematic as they leave valuable ML hardware underutilized; however, model-bound jobs can also be costly by leaving CPUs underutilized. Removing input bottlenecks is critical as input-bound jobs hog valuable hardware for extended periods of time, incurring significant costs and significant delays for the job itself as well as other jobs that are waiting for the hardware resources to free up.
Data processing frameworks. Generic data processing frameworks, such as Beam [1], Flume [2], and Spark [76], are often used for offline ML data processing (i.e., processing that takes place prior to any ML compute and handles data cleaning, feature engineering, normalization, etc.). These generic frameworks are not suitable for online ML data processing (i.e., on-the-fly data processing during a training job that handles data augmentation, shuffling, batching, etc.), primarily due to the overhead they impose. For instance, Spark Streaming recommends batching work at 50ms granularity [65], while ML training jobs often have step times below 1ms. Other issues stem from API mismatches between generic data processing frameworks and ML training frameworks, a lack of holistic optimization opportunities to improve runtime performance (as the preprocessing and ML compute frameworks are disjoint), and the difficulty of accommodating ML-specific preprocessing requirements (such as relaxed data visitation guarantees, which we describe below).
ML-specic data frameworks, used for online data prepro-
cessing, include PyTorch DataLoader [
14
], NVIDIA DALI [
31
],
and tf.data [
56
]. We focus on tf.data as it is widely used in-
ternally at Google and in open-source Tensorow programs.
tf.data provides an API and a runtime for creating and exe-
cuting ecient ML data processing pipelines. With tf.data,
users can express and execute input data pipelines using
a variety of libraries that integrate with ML frameworks
like PyTorch [
58
] and Tensorow [
3
]. It provides generic
operators (e.g.
map
,
filter
,
prefetch
, etc.) and an autotun-
ing feature that automatically adjusts runtime conguration
knobs (e.g. parallelism, prefetching, etc.) [22].
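As an illustration of this API (not taken from the paper), the sketch below builds a small input pipeline with the operators and autotuning knobs mentioned above; the file pattern and the assumption that each record holds a raw JPEG image are hypothetical.

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let the tf.data runtime pick parallelism and buffer sizes

def parse_and_augment(record):
    # Illustrative user-defined transformation: assumes each TFRecord entry is
    # a raw JPEG-encoded image; decodes, randomly flips, and resizes it.
    image = tf.io.decode_jpeg(record, channels=3)
    image = tf.image.random_flip_left_right(image)
    return tf.image.resize(image, [224, 224])

files = tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord")  # hypothetical path
dataset = (
    tf.data.TFRecordDataset(files)
    .shuffle(buffer_size=10_000)                           # add randomness across epochs
    .map(parse_and_augment, num_parallel_calls=AUTOTUNE)   # user-defined transformation
    .batch(128)
    .prefetch(AUTOTUNE)                                    # overlap preprocessing with training
)
```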
Colocated vs. disaggregated data processing. Traditionally, ML frameworks have colocated input data processing and ML computations on the same machine [14, 57, 67]. However, feeding powerful modern accelerators with data at sufficient throughput requires more CPU and memory resources than are often locally available [30, 77]. This motivates a disaggregated processing mode, where data processing executes on separate machines whose resources can be scaled independently from expensive ML accelerators. The ability to right-size resource allocations is critical to satisfy the wide variety of resource requirements in ML jobs (see Figure 1) and ensure that all allocated resources — both costly ML accelerators and CPU/MEM — remain highly utilized.
Data visitation guarantees. In the ML community, it is customary to train models with exactly-once data visitation semantics for each epoch [7, 8, 21, 54]. This means that every sample in the dataset is visited exactly once before any sample is re-visited during training. For small datasets, deviating from exactly-once guarantees can skew the data distribution, potentially leading to a less generalizable or lower accuracy model [21, 54]. In §3, we will discuss how relaxing data visitation guarantees is possible for production ML jobs, as
they generally train on signicantly larger datasets that are
continuously updated [36].
Prior work on optimizing input processing. Several approaches have been proposed to alleviate input data stalls in ML training jobs. Plumber [44] and tf.data’s autotuning harness [67] dynamically tune software parallelism and memory buffer sizes to maximize performance on a given training node. However, this tuning does not scale resources beyond a single node. NVIDIA DALI supports offloading data processing to GPUs [31]. This alleviates CPU bottlenecks but may lead to GPU resource contention among input data transformation tasks and model training tasks. Other works cache and reuse (transformed) data across epochs [13, 46] or across jobs [30, 45, 55], trading storage capacity to save CPU cycles for repeated data processing. However, caching solutions are still not guaranteed to eliminate data stalls. In-memory caching solutions have capacity limitations since ML datasets often exceed the size of a training node’s RAM [67]. SSD-based caching requires high parallelism to avoid I/O bottlenecks when reading large volumes of data from storage.
Meta’s closed-source Distributed Data Processing (DPP) service [77] proposes horizontally scaling data workers with disaggregated data processing. Zhao et al. [77] characterize the compute, memory, and network resource requirements of data processing compared to model training for recommender system models at Meta and advocate for disaggregation to scale out data processing and avoid input bottlenecks. However, they do not quantify the performance and cost benefits of disaggregated data processing compared to colocated data processing. In contrast, we provide the first quantitative analysis of how much disaggregated data processing actually improves the overall performance and cost of production ML workloads. We also go beyond DPP’s design and show the benefits of disaggregation besides horizontal scaling, such as enabling ephemeral data sharing and coordinated reads.
FastFlow [71] builds on top of tf.data service, leveraging its ability to preprocess data on both local and remote resources. FastFlow extends our disaggregation mechanism to support splitting data preprocessing between local and remote workers at specific points in an input pipeline. The system selects a pipeline split that maximizes throughput in a fixed-size tf.data service deployment, i.e., the system decides which portion of an input pipeline to process locally vs. remotely. FastFlow is complementary to our work and further shows the benefits that flexible resource allocation with disaggregation can bring to ML data processing.
3 TF.DATA SERVICE DESIGN
We present tf.data service, a system that enables disaggregating and distributing data processing for ML workloads. tf.data service integrates seamlessly with ML frameworks, such as TensorFlow or PyTorch, and is particularly designed to meet the data processing needs of ML training jobs. We design tf.data service with three key principles in mind:

(1) Eliminate input data stalls. Eliminating data stalls is key to maximizing throughput, which is an important performance metric for ML training jobs. As we will demonstrate in §4, using extra CPU/RAM resources to alleviate data preprocessing stalls is often cost-efficient for input-bound jobs as it helps them complete faster and hence consume expensive hardware accelerators for less time.
(2) Avoid redundant data preprocessing. Input data pipelines are often repeatedly executed in ML clusters, for example in hyperparameter tuning or model search workflows [41, 56]. Since data preprocessing can be compute-, memory-, and power-intensive [56, 77], reducing redundant data preprocessing (i.e., sharing intermediate results) across jobs is important for hardware and energy efficiency.
(3) Simplify the design by relaxing constraints. While guaranteeing exactly-once data visitation is customary when benchmarking ML training on academic datasets [54, 55], we find that production ML training jobs allow for more relaxed data visitation guarantees with negligible impact on accuracy. These jobs train on vast volumes of data that are continuously updated [21, 36]. Some datasets are large enough that training on all data may not be practical, as the model can safely converge on a subset of the entire dataset without the risk of overfitting [21]. Hence, we find that at-most-once data visitation is sufficient for most jobs (i.e., each sample is visited at most once per epoch).¹ We can leverage relaxed data ordering guarantees to simplify fault tolerance.
With the above principles in mind, we design tf.data service as a disaggregated system that enables: eliminating input data bottlenecks by horizontally scaling out data workers (§3.1), reducing CPU usage for data processing by sharing ephemeral data transformations between jobs (§3.5), and avoiding stragglers in distributed training by coordinating reads between multiple clients (§3.6).
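As a concrete illustration of the client-side handoff to the service, the sketch below distributes an existing tf.data pipeline to remote workers through the open-source API; the dispatcher address and job name are placeholders, and the processing mode is one of the supported options rather than a value prescribed by the paper.

```python
import tensorflow as tf

# Any ordinary tf.data pipeline can be handed off to the service; a trivial
# pipeline stands in here for a real input pipeline.
dataset = tf.data.Dataset.range(1_000).map(lambda x: x * 2)

# Register the pipeline with the dispatcher; remote workers execute it and
# stream processed elements back to this client.
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode="distributed_epoch",           # dynamically shard the source across workers
        service="grpc://dispatcher.example.com:5050",  # placeholder dispatcher address
        job_name="my_training_job",                    # clients passing the same name share one job
    )
)
```

Clients that pass the same job_name consume from a single shared service-side job instead of each triggering its own preprocessing.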
3.1 System Architecture
Figure 3 shows the tf.data service architecture. The core system is composed of a dispatcher, which manages various metadata, and a pool of workers, which perform data preprocessing. The dispatcher and workers are managed by an ML job orchestrator. Clients execute ML computations (e.g., model training) on servers equipped with accelerators. Distributed training jobs consist of multiple clients. Source data
¹ Further relaxing visitation guarantees to zero, one, or more visits per sample can produce a high quality model in some cases, as long as the random transformations applied to the source data ensure a sufficiently diverse and representative distribution [13, 46].
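For readers who want to stand up the dispatcher/worker roles described in §3.1, the following minimal sketch runs both in a single process using the open-source API; this in-process setup and the chosen port are illustrative assumptions, not the orchestrator-managed production deployment described in the paper.

```python
import tensorflow as tf

# Start a dispatcher (metadata and coordination) and one worker (data
# preprocessing) in-process; a production deployment runs these as separate,
# orchestrator-managed server binaries. The port is an arbitrary choice.
dispatcher = tf.data.experimental.service.DispatchServer(
    tf.data.experimental.service.DispatcherConfig(port=5050)
)
dispatcher_address = dispatcher.target.split("://")[1]  # e.g. "localhost:5050"

worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(dispatcher_address=dispatcher_address)
)

# Clients then point tf.data.experimental.service.distribute(...) at
# dispatcher.target (e.g. "grpc://localhost:5050"), as sketched earlier.
```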