DreamShard: Generalizable Embedding Table
Placement for Recommender Systems
Daochen Zha¹, Louis Feng², Qiaoyu Tan³, Zirui Liu¹, Kwei-Herng Lai¹,
Bhargav Bhushanam², Yuandong Tian², Arun Kejariwal², Xia Hu¹
¹Rice University  ²Meta Platforms, Inc.  ³Texas A&M University
{daochen.zha,Zirui.Liu,khlai,Xia.Hu}@rice.edu
{lofe,bbhushanam,yuandong,akejariwal}@fb.com
qytan@tamu.edu
Abstract
We study embedding table placement for distributed recommender systems, which
aims to partition and place the tables on multiple hardware devices (e.g., GPUs)
to balance the computation and communication costs. Although prior work has
explored learning-based approaches for the device placement of computational
graphs, embedding table placement remains a challenging problem because of
1) the operation fusion of embedding tables, and 2) the generalizability requirement
on unseen placement tasks with different numbers of tables and/or devices. To
this end, we present DreamShard, a reinforcement learning (RL) approach for
embedding table placement. DreamShard addresses operation fusion and generalizability with 1) a cost network that directly predicts the costs of fused operations, and 2) a policy network that is efficiently trained on an estimated Markov decision process (MDP) without real GPU execution, where the states and the rewards are estimated with the cost network. Equipped with sum and max representation reductions, the two networks can directly generalize to unseen tasks with different numbers of tables and/or devices without fine-tuning. Extensive experiments show that DreamShard substantially outperforms the existing human expert and RNN-based strategies, with up to 19% speedup over the strongest baseline on large-scale synthetic tables and our production tables.
The code is available at https://github.com/daochenzha/dreamshard.
1 Introduction
Embedding learning is a commonly used technique to deal with categorical features in deep recommendation models by mapping sparse features into dense vectors [1, 2, 3, 4, 5]. However, the embedding tables can be extremely large due to the large feature sizes [6]. For example, in the YouTube recommendation model, a single categorical feature contains tens of millions of video IDs [7]; the Meta recommendation model demands multi-terabyte memory [8]. Distributed training has been adopted to place the tables on multiple hardware devices such as GPUs [3, 6, 9, 10, 11]. However, even with distributed training, the embedding tables often remain the efficiency bottleneck. For instance, embedding lookup is shown to dominate the training throughput in the Meta recommendation model [8]. In our internal production model, which has hundreds of tables, embedding lookup accounts for 48% and 65% of the total computation and communication costs, respectively.

How the embedding tables are placed can significantly impact the costs. Figure 1 shows the traces of different placement strategies on a task of placing 50 tables on 4 devices. Typically, embedding lookup consists of four stages. In the forward pass, the sparse indices are mapped into dense vectors (forward computation), which are then sent to the target devices (forward communication).
Figure 1: Visualization of random placement, the existing best human expert strategy, and DreamShard on a task of placing 50 tables on 4 GPUs: (a) random placement, 56.6 ms; (b) the existing best human expert strategy, 42.8 ms; (c) DreamShard, 35.95 ms. The dense computations and communications are omitted in the traces because they do not have an imbalance issue. We provide more visualizations in Appendix L.
In the backward pass, the gradients of the embedding vectors are sent back from the target devices (backward communication) and applied to the embedding vectors (backward computation). The tables can easily cause imbalances if not carefully partitioned. The random placement in Figure 1a is bottlenecked by GPU2 with a latency of 56.6 milliseconds, while the more balanced placements in Figures 1b and 1c significantly reduce the costs to 42.8 and 35.95 milliseconds, respectively. This work asks: given a set of embedding tables, how can we identify the placement of the tables that best balances the costs?
Device placement is essentially a partition problem, which is one of the classical NP-hard combinatorial optimization problems [12]. A recent line of research uses reinforcement learning (RL) for the device placement of computational graphs [13, 14, 15, 16, 17, 18, 19, 20]. For example, [13] proposed to train an RNN controller with content-based attention to predict the placement. Other studies advanced [13] in different ways, such as using hierarchical models [14], more sophisticated RL algorithms [15], and graph neural networks [16].
However, embedding table placement remains an open and challenging problem due to the operation fusion [21] of tables and the generalizability requirement. 1) Modern embedding implementations (e.g., FBGEMM [22]) use a single operation to subsume multiple tables for acceleration. The speedup of the fused operation over the sum of the single-table operation costs is not constant and depends on the characteristics of the fused tables (e.g., table dimensions). Our analysis finds that the speedups vary significantly across different table combinations, ranging from 1X to 3X (Figure 12 in Appendix A.3.2). Thus, we not only need to reason about cost balance but also about how the tables should be fused to maximize the speedup. 2) In real-world scenarios, the adopted embedding tables and the available devices can change frequently (e.g., machine learning engineers/researchers may conduct concurrent experiments with various table combinations and numbers of devices). Thus, a practical algorithm should generalize to tasks with unseen tables, different numbers of tables, and different numbers of devices. It is non-trivial to achieve this with the existing device placement approaches.
To this end, we introduce DreamShard, an RL approach for embedding table placement. DreamShard addresses operation fusion and generalizability with two novel ideas. 1) It learns a cost network to directly predict the costs of the fused operations. Specifically, the network takes as input the table features (e.g., table dimension) of each single table and outputs the computation and communication costs. 2) It trains a policy network by interacting with an estimated Markov decision process (MDP) without real GPU execution, where the states and the rewards are estimated by the predictions of the cost network. Equipped with sum reductions for the table representations and max reductions for the device representations, the two networks can directly generalize to unseen placement tasks with different numbers of tables and/or devices without fine-tuning.
Extensive experiments show that DreamShard outperforms the existing human expert and RNN-based [13] strategies on open-sourced synthetic tables [23] and our production tables, achieving up to 19% speedup over the strongest baseline. Moreover, it generalizes to unseen tasks with different numbers of tables and/or devices with a negligible performance drop (< 0.5 milliseconds). Additionally, its inference is highly efficient: it can place hundreds of tables in less than one second.
2 Generalizable Embedding Table Placement Problem
The embedding table placement problem seeks a device placement¹ of all the tables such that the overall cost (in terms of execution time) is minimized (we provide background on the distributed training of recommendation models in Appendix A.1). Consider $M$ embedding tables $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_M\}$ and $D$ devices, where $\mathbf{e}_i \in \mathbb{R}^N$ denotes the table features that characterize the embedding lookup patterns. In our work, we use 21 table features, including hash size, dimension, table size, pooling factor, and distribution (their definitions are provided in Appendix A.2). A placement $\mathbf{a} = [a_1, a_2, \ldots, a_M]$, where $a_i \in \{1, 2, \ldots, D\}$, assigns each table to a device. Let $c(\mathbf{a})$ denote the cost measured on GPUs. The goal of embedding table placement is to find the $\mathbf{a}$ such that $c(\mathbf{a})$ is minimized. Due to the NP-hardness of the partition problem [12], identifying the exact solution demands extensive computational overhead. Thus, state-of-the-art algorithms often approximate the optimal partition via sampling with RL [24, 13]. However, sampling remains expensive because obtaining $c(\mathbf{a})$ requires running operations on GPUs. Given that the embedding tables and the available devices can frequently change, we wish to approximate the best $\mathbf{a}$ without GPU execution.
Motivated by this, we study the generalizable embedding table placement (GETP) problem. Let $\mathbb{E}$ be the space of all the embedding tables. A placement task can be denoted as $T_i = (\mathbb{E}_i, D_i)$, where $\mathbb{E}_i \subseteq \mathbb{E}$ is a set of tables, and $D_i$ is the number of devices. Given $N_{\text{train}}$ training tasks $\mathbb{T}_{\text{train}} = \{T_1, T_2, \ldots, T_{N_{\text{train}}}\}$ and $N_{\text{test}}$ testing tasks $\mathbb{T}_{\text{test}} = \{T_1, T_2, \ldots, T_{N_{\text{test}}}\}$, the goal is to train a placement policy based on $\mathbb{T}_{\text{train}}$ (GPU execution is allowed during training) such that the learned policy can minimize the costs for the tasks in $\mathbb{T}_{\text{test}}$ without GPU execution.
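To make the problem setup concrete, the sketch below shows one way the objects above could be represented in Python, together with a naive random-search baseline over placements. The class and function names (`PlacementTask`, `best_of_random`) and the externally supplied `cost_fn` are illustrative assumptions, not part of DreamShard's released interface; devices are 0-indexed for convenience.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np


@dataclass
class PlacementTask:
    """A GETP task T_i = (E_i, D_i): a set of tables and a device count."""
    table_features: np.ndarray  # shape (M, N): one N-dimensional feature vector e_i per table
    num_devices: int            # D


def is_valid_placement(task: PlacementTask, placement: Sequence[int]) -> bool:
    """A placement a = [a_1, ..., a_M] assigns every table to a device (0-indexed here)."""
    return len(placement) == len(task.table_features) and all(
        0 <= a < task.num_devices for a in placement
    )


def best_of_random(
    task: PlacementTask,
    cost_fn: Callable[[PlacementTask, List[int]], float],  # c(a), e.g., measured on GPUs
    num_samples: int = 100,
    seed: int = 0,
) -> List[int]:
    """Naive baseline: sample random placements and keep the cheapest one."""
    rng = np.random.default_rng(seed)
    num_tables = len(task.table_features)
    best_placement, best_cost = None, float("inf")
    for _ in range(num_samples):
        candidate = rng.integers(0, task.num_devices, size=num_tables).tolist()
        cost = cost_fn(task, candidate)
        if cost < best_cost:
            best_placement, best_cost = candidate, cost
    return best_placement
```

Random search of this kind still needs to evaluate $c(\mathbf{a})$ on GPUs for every candidate, which is exactly the expense that the estimated MDP of Section 3 is designed to avoid.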
3 DreamShard Framework
Figure 2: DreamShard framework. The agent interacts with the estimated MDP, which is trained with the cost data collected from GPUs.
We present DreamShard, an RL framework based on an estimated MDP, to tackle the GETP problem. An overview of the framework is shown in Figure 2. The key idea is to formulate the table placement process as an MDP (Section 3.1) and train a cost network to estimate its states and rewards (Section 3.2). A policy network with a tailored generalizable architecture is trained by efficiently interacting with the estimated MDP (Section 3.3). The two networks are updated iteratively to improve the state/reward estimation and the placement policy.
3.1 MDP Formulation
Given embedding tables $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_M\}$ and $D$ devices, we aim to generate a placement $\mathbf{a} = [a_1, a_2, \ldots, a_M]$. The key idea is to place the tables one by one at each step, where the state characterizes the tables that have been placed so far, the action is the device ID, and the reward represents the execution time on GPUs. Specifically, at a step $t$, the state $s_t = \{s_{t,d}\}_{d=1}^{D}$ is all the table features of the tables placed on all the devices, where $s_{t,d} = \{\mathbf{e}_i \mid i \in \mathbb{P}_d\}$ denotes all the table features corresponding to device $d$ ($\mathbb{P}_d$ is the set of table IDs that have been placed on device $d$). We further augment the raw features with cost features, which are obtained by collecting the operation computation and communication times from GPUs (Appendix A.3 provides a comprehensive analysis of the cost features). Formally, the augmented state is defined as $\tilde{s}_t = \{s_t, \{\mathbf{q}_{t,d}\}_{d=1}^{D}\}$, where $\mathbf{q}_{t,d} \in \mathbb{R}^3$ has three elements representing the forward computation time, backward computation time, and backward communication time for the current operation on device $d$ (we provide detailed explanations of why the forward communication time is excluded in Appendix A.4). We find that the augmented cost features can significantly boost the performance, as evidenced by the ablations in Table 3. The action $a_t \in \mathbb{A}_t$ is an integer specifying the device ID, where $\mathbb{A}_t$ is the set of legal actions at step $t$. A device ID is considered legal if placing the current table on the corresponding device does not cause a memory explosion. The reward $r_t$ is $0$ for all the intermediate steps, and the reward at the final step $M$ is the negative of the cost, i.e., $r_M = -c(\mathbf{a})$, which encourages the agent to achieve a lower cost.
¹In this work, we focus on GPU devices, where all the GPU devices are identical, which is the most common configuration in our production. We defer the mixed scenarios of both GPUs and CPUs to future work.
Figure 3: MDP formulation of embedding table placement (an example of placing eight tables on two devices over steps t = 0 to t = 8).

Figure 4: DreamShard's cost network (left) and policy network (right).
The procedure is illustrated in Figure 3 for an example task of placing $8$ tables $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_8\}$ on $2$ devices. At step $0$, no table has been placed, so $s_0 = \{\{\}, \{\}\}$ and the augmented state is $\tilde{s}_0 = \{s_0, \{\mathbf{q}_{0,1}, \mathbf{q}_{0,2}\}\}$, where both $\mathbf{q}_{0,1}$ and $\mathbf{q}_{0,2}$ are zero vectors (i.e., $[0, 0, 0]$) since all the computation and communication times are 0 as well. Then the action $a_0 = 1$ makes the MDP transition to the next state $s_1 = \{\{\mathbf{e}_1\}, \{\}\}$ with its corresponding augmented state $\tilde{s}_1 = \{s_1, \{\mathbf{q}_{1,1}, \mathbf{q}_{1,2}\}\}$, where $\mathbf{q}_{1,1}$ becomes a non-zero vector containing the computation and communication costs obtained by running $\{\mathbf{e}_1\}$ and $\{\}$ on GPUs. We repeat the above process, and finally at step $8$ we have $s_8 = \{\{\mathbf{e}_1, \mathbf{e}_4, \mathbf{e}_7, \mathbf{e}_8\}, \{\mathbf{e}_2, \mathbf{e}_3, \mathbf{e}_5, \mathbf{e}_6\}\}$. The corresponding $\mathbf{q}_{8,1}$ and $\mathbf{q}_{8,2}$ are the measured times of running $\{\mathbf{e}_1, \mathbf{e}_4, \mathbf{e}_7, \mathbf{e}_8\}$ and $\{\mathbf{e}_2, \mathbf{e}_3, \mathbf{e}_5, \mathbf{e}_6\}$ on the two devices. The action sequence $\mathbf{a} = [a_1, a_2, \ldots, a_8]$ is the generated placement, which is then evaluated on GPUs to obtain the reward.
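The environment dynamics described above can be summarized with a short sketch. The code below is an assumed, simplified rendering of the MDP: `memory_of` and `cost_fn` are hypothetical callables standing in for per-table memory accounting and the GPU-measured cost $c(\mathbf{a})$, and the cost features $\mathbf{q}_{t,d}$ are omitted from the state for brevity.

```python
class PlacementMDP:
    """Simplified sketch of the table-placement MDP (cost features q_{t,d} omitted)."""

    def __init__(self, table_features, num_devices, device_memory, memory_of, cost_fn):
        self.tables = table_features      # (M, N) array of raw table features
        self.num_devices = num_devices    # D
        self.device_memory = device_memory
        self.memory_of = memory_of        # hypothetical: per-table memory footprint
        self.cost_fn = cost_fn            # hypothetical: overall cost c(a), measured or estimated
        self.reset()

    def reset(self):
        self.t = 0
        self.placed = [[] for _ in range(self.num_devices)]  # s_t: table IDs per device
        self.placement = []                                  # the partial placement a
        return self.placed

    def legal_actions(self):
        """A device is legal if adding the current table does not exceed its memory."""
        need = self.memory_of(self.tables[self.t])
        used = [sum(self.memory_of(self.tables[i]) for i in ids) for ids in self.placed]
        return [d for d in range(self.num_devices) if used[d] + need <= self.device_memory]

    def step(self, action):
        self.placed[action].append(self.t)
        self.placement.append(action)
        self.t += 1
        done = self.t == len(self.tables)
        # Intermediate rewards are 0; the final reward is the negative overall cost.
        reward = -self.cost_fn(self.placement) if done else 0.0
        return self.placed, reward, done
```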
Discussion 1. The MDP enjoys two desirable properties. 1) The legal action set $\mathbb{A}_t$ guarantees that the generated placement satisfies the memory constraints. 2) The one-by-one placement enables the agent to generalize across different numbers of tables. For example, an agent trained on an MDP with very few tables can be applied to another MDP with more tables by simply executing more steps.
Discussion 2. A straightforward idea for solving the MDP is to greedily place the current table on the device with the lowest cost at each step, where the cost function can be one of the state features or a combination of them (e.g., the sum of the table dimensions, or the sum of all the cost features). However, greedy heuristics are often sub-optimal. Thus, we seek a learning-based algorithm that explores various placement possibilities and makes comprehensive decisions based on all the state features.
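For reference, the greedy heuristic mentioned in Discussion 2 can be written in a few lines against the hypothetical `PlacementMDP` sketch above, using an assumed per-table scalar cost `table_cost` (e.g., the table dimension); it is not part of DreamShard itself.

```python
def greedy_placement(mdp, table_cost):
    """Place each table on the legal device with the smallest accumulated per-table cost."""
    mdp.reset()
    done = False
    while not done:
        # Accumulated heuristic load on each device so far.
        loads = [sum(table_cost(mdp.tables[i]) for i in ids) for ids in mdp.placed]
        action = min(mdp.legal_actions(), key=lambda d: loads[d])
        _, reward, done = mdp.step(action)
    return mdp.placement, -reward  # -reward recovers the overall cost c(a)
```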
3.2 Learning an Estimated MDP
Interacting with the above MDP is computationally expensive since obtaining the cost features and the reward requires GPU execution. Motivated by world models [25, 26], we build an estimated MDP by approximating the cost features and the reward with a cost network. Let $f_{\text{cost}}$ denote the cost network. $f_{\text{cost}}$ takes as input the raw table features $s_t$ and predicts the cost features $\{\mathbf{q}_{t,d}\}_{d=1}^{D}$ and the overall cost $c(\mathbf{a})$. $f_{\text{cost}}$ is trained with a mean squared error (MSE) loss using the cost data collected from the GPUs. Once trained, it can predict the cost features or the reward with a single forward pass, without GPU execution. However, it is non-trivial to design the architecture of $f_{\text{cost}}$ because it needs to accommodate different numbers of devices (i.e., $s_t$ can have a variable size) and different numbers of tables on each device (i.e., $s_{t,d}$ can have a variable length).
Algorithm 1 Training of DreamShard
1: Input: Training tasks $\mathbb{T}_{\text{train}}$, the number of data collection steps $N_{\text{collect}}$, the number of cost network update steps $N_{\text{cost}}$, the batch size for updating the cost network $N_{\text{batch}}$, the number of RL update steps $N_{\text{RL}}$, the number of considered episodes in each RL update $N_{\text{episode}}$
2: Initialize a cost network, a policy network, and a buffer
3: for iteration = 1, 2, ... until convergence do
4:     for step = 1, 2, ..., $N_{\text{collect}}$ do    ▷ Collect cost data from GPUs
5:         Randomly sample a training task from $\mathbb{T}_{\text{train}}$
6:         Generate a placement by interacting with the estimated MDP using the policy network
7:         Evaluate the placement on the hardware and store the collected cost data in the buffer
8:     end for
9:     for step = 1, 2, ..., $N_{\text{cost}}$ do    ▷ Update the cost network (no GPU execution)
10:        Randomly sample $N_{\text{batch}}$ cost data from the buffer
11:        Update the cost network based on the MSE loss (Eq. 1 in Appendix B.4.1)
12:    end for
13:    for step = 1, 2, ..., $N_{\text{RL}}$ do    ▷ Update the policy network (no GPU execution)
14:        Randomly sample a training task from $\mathbb{T}_{\text{train}}$
15:        Collect $N_{\text{episode}}$ episodes by interacting with the estimated MDP using the policy network
16:        Update the policy network based on the policy gradient loss (Eq. 2 in Appendix B.4.1)
17:    end for
18: end for
The left-hand side of Figure 4 shows DreamShard's generalizable design of $f_{\text{cost}}$, which is based on two key ideas. First, it uses a shared MLP to map raw table features into table representations. For any unseen tables, this MLP can be directly applied to extract table representations. Second, it enables a fixed-dimension representation for each device with sum reductions (i.e., the element-wise sum of the table representations in the device), and similarly for the overall representation across devices with max reductions (Appendix B.3 compares different reduction choices and finds that this sum-max combination leads to the most accurate prediction). The reduced representations are then followed by multiple MLP heads for cost predictions. For unseen tasks with different numbers of tables and/or devices, the reductions will always lead to fixed-dimension device/overall representations, so that the prediction heads can be directly applied. Appendix B.1 provides more details.
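A minimal PyTorch sketch of this design is shown below, assuming hypothetical layer sizes and head shapes (the actual architecture is specified in Appendix B.1). The state is represented as a list of per-device tensors of raw table features, so the same module handles any number of tables and devices.

```python
import torch
import torch.nn as nn


class CostNetwork(nn.Module):
    """Sketch of f_cost: shared table MLP -> per-device sum -> cost heads; cross-device max -> overall cost."""

    def __init__(self, num_table_features: int = 21, hidden: int = 128):
        super().__init__()
        self.hidden = hidden
        self.table_mlp = nn.Sequential(
            nn.Linear(num_table_features, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Per-device heads: forward computation, backward computation, backward communication.
        self.forward_head = nn.Linear(hidden, 1)
        self.backward_head = nn.Linear(hidden, 1)
        self.comm_head = nn.Linear(hidden, 1)
        # Overall cost head applied to the max-reduced representation across devices.
        self.overall_head = nn.Linear(hidden, 1)

    def forward(self, state):
        # `state` is a list of D tensors; entry d has shape (num_tables_on_d, num_table_features).
        device_reps = []
        for tables in state:
            if tables.shape[0] == 0:
                device_reps.append(torch.zeros(self.hidden))
            else:
                device_reps.append(self.table_mlp(tables).sum(dim=0))  # sum reduction over tables
        device_reps = torch.stack(device_reps)  # (D, hidden)
        cost_feats = torch.cat(
            [self.forward_head(device_reps),
             self.backward_head(device_reps),
             self.comm_head(device_reps)],
            dim=1,
        )  # predicted cost features q_{t,d}, shape (D, 3)
        overall = self.overall_head(device_reps.max(dim=0).values)  # max reduction, then scalar c(a)
        return cost_feats, overall.squeeze(-1)
```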
3.3 Training the Policy Network on the Estimated MDP
Generalizable policy network architecture. Let $\pi$ be the policy network. $\pi$ maps the augmented state $\tilde{s}_t = \{s_t, \{\mathbf{q}_{t,d}\}_{d=1}^{D}\}$ to an action $a_t$, i.e., $a_t = \pi(\tilde{s}_t)$. $\pi$ also adopts a generalizable design, shown on the right-hand side of Figure 4. Like $f_{\text{cost}}$, $\pi$ uses a shared MLP and sum reductions to produce a fixed-dimension representation, which is then concatenated with the cost features to obtain the device representation. To accommodate the potentially variable action space (i.e., the number of available devices may vary), a shared MLP processes each device representation separately to obtain a confidence score, followed by a Softmax layer that produces the action probabilities. This design allows $\pi$ to generalize across different numbers of devices. Appendix B.2 provides more details.
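The sketch below mirrors this design in PyTorch under the same assumptions as the cost-network sketch above (hypothetical layer sizes; cost features passed in as a $(D, 3)$ tensor, either measured or predicted by the cost network); the legal-action mask implements the memory constraint from Section 3.1.

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """Sketch of pi: shared table MLP -> per-device sum -> concat cost features -> shared scoring MLP -> softmax."""

    def __init__(self, num_table_features: int = 21, num_cost_features: int = 3, hidden: int = 128):
        super().__init__()
        self.hidden = hidden
        self.table_mlp = nn.Sequential(
            nn.Linear(num_table_features, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.score_mlp = nn.Sequential(
            nn.Linear(hidden + num_cost_features, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state, cost_features, legal_mask):
        # state: list of D tensors of raw table features; cost_features: (D, 3); legal_mask: (D,) bool tensor.
        device_reps = []
        for tables in state:
            if tables.shape[0] == 0:
                device_reps.append(torch.zeros(self.hidden))
            else:
                device_reps.append(self.table_mlp(tables).sum(dim=0))  # sum reduction over tables
        device_reps = torch.stack(device_reps)  # (D, hidden)
        scores = self.score_mlp(torch.cat([device_reps, cost_features], dim=1)).squeeze(-1)  # (D,)
        scores = scores.masked_fill(~legal_mask, float("-inf"))  # mask devices that would overflow memory
        return torch.softmax(scores, dim=0)  # action probabilities over the D devices
```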
Training and inference. Algorithm 1 summarizes the training procedure of DreamShard, which iteratively executes the following: 1) collect cost data from GPUs based on the placements generated by the current policy, 2) update the cost network with the previously collected cost data, and 3) update the policy network by interacting with the current estimated MDP. Throughout the training process, the estimated MDP gradually becomes more accurate, and the resulting policy network tends to generate better placements. Appendix B.4.2 provides more details of the training procedure. For inference, the trained cost network and policy network can be directly applied to unseen tasks to generate placements without GPU execution, as summarized in Algorithm 2 in Appendix B.4.3.
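For concreteness, the two update steps in Algorithm 1 (lines 11 and 16) might look roughly as follows. The exact losses are Eq. 1 and Eq. 2 in Appendix B.4.1, which are not reproduced here, so the MSE and REINFORCE-style objectives below are standard stand-ins rather than the paper's precise formulations (e.g., no baseline or entropy terms are shown).

```python
import torch
import torch.nn.functional as F


def update_cost_network(cost_net, optimizer, batch):
    """One cost-network update: MSE between predicted and measured costs (a stand-in for Eq. 1)."""
    loss = 0.0
    for state, measured_q, measured_overall in batch:  # measured values are tensors collected from GPUs
        pred_q, pred_overall = cost_net(state)
        loss = loss + F.mse_loss(pred_q, measured_q) + F.mse_loss(pred_overall, measured_overall)
    optimizer.zero_grad()
    (loss / len(batch)).backward()
    optimizer.step()


def update_policy_network(policy_net, optimizer, episodes):
    """One REINFORCE-style update on episodes from the estimated MDP (a stand-in for Eq. 2)."""
    loss = 0.0
    for log_probs, final_reward in episodes:  # intermediate rewards are 0, so the return is the final reward
        loss = loss - final_reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    (loss / len(episodes)).backward()
    optimizer.step()
```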