DreamShard: Generalizable Embedding Table
Placement for Recommender Systems
Daochen Zha¹, Louis Feng², Qiaoyu Tan³, Zirui Liu¹, Kwei-Herng Lai¹,
Bhargav Bhushanam², Yuandong Tian², Arun Kejariwal², Xia Hu¹
¹Rice University  ²Meta Platforms, Inc.  ³Texas A&M University
{daochen.zha,Zirui.Liu,khlai,Xia.Hu}@rice.edu
{lofe,bbhushanam,yuandong,akejariwal}@fb.com
qytan@tamu.edu
Abstract
We study embedding table placement for distributed recommender systems, which
aims to partition and place the tables on multiple hardware devices (e.g., GPUs)
to balance the computation and communication costs. Although prior work has
explored learning-based approaches for the device placement of computational
graphs, embedding table placement remains a challenging problem because of
1) the operation fusion of embedding tables, and 2) the generalizability requirement
on unseen placement tasks with different numbers of tables and/or devices. To
this end, we present DreamShard, a reinforcement learning (RL) approach for
embedding table placement. DreamShard addresses operation fusion and generalizability with 1) a cost network that directly predicts the costs of fused operations, and 2) a policy network that is efficiently trained on an estimated Markov decision process (MDP) without real GPU execution, where the states and the rewards are estimated with the cost network. Equipped with sum and max representation reductions, the two networks can directly generalize to unseen tasks with different numbers of tables and/or devices without fine-tuning. Extensive experiments show that DreamShard substantially outperforms the existing human expert and RNN-based strategies, with up to 19% speedup over the strongest baseline on large-scale synthetic tables and our production tables.
The code is available at https://github.com/daochenzha/dreamshard.
1 Introduction
Embedding learning is a commonly used technique to deal with categorical features in deep recommendation models by mapping sparse features into dense vectors [1, 2, 3, 4, 5]. However, the embedding tables can be extremely large due to the large feature sizes [6]. For example, in the YouTube recommendation model, a single categorical feature contains tens of millions of video IDs [7]; the Meta recommendation model demands multi-terabyte memory [8]. Distributed training has been adopted to place the tables on multiple hardware devices such as GPUs [3, 6, 9, 10, 11]. However, even with distributed training, the embedding tables often remain the efficiency bottleneck. For instance, embedding lookup is shown to dominate the training throughput in the Meta recommendation model [8]. In our internal production model, which has hundreds of tables, embedding lookup accounts for 48% and 65% of the total computation and communication costs, respectively.

How the embedding tables are placed can significantly impact the costs. Figure 1 shows the traces of different placement strategies on a task of placing 50 tables on 4 devices. Typically, embedding lookup consists of four stages. In the forward pass, the sparse indices are mapped into dense vectors (forward computation), which are then sent to the target devices (forward communication).
Figure 1: Visualization of random placement, the existing best human expert strategy, and DreamShard on a task of placing 50 tables on 4 GPUs: (a) random placement, 56.6 ms; (b) the existing best human expert strategy, 42.8 ms; (c) DreamShard, 35.95 ms. The dense computations and communications are omitted in the traces because they do not have an imbalance issue. We provide more visualizations in Appendix L.
In the backward pass, the gradients of the embedding vectors are sent back from the target devices (backward communication) and applied to the embedding vectors (backward computation). The tables can easily cause imbalances if not carefully partitioned. The random placement in Figure 1a is bottlenecked by GPU2 with a latency of 56.6 milliseconds, while the more balanced placements in Figures 1b and 1c significantly reduce the costs to 42.8 and 35.95 milliseconds, respectively. This work asks: given a set of embedding tables, how can we identify the placement of the tables that best balances the costs?
Device placement is essentially a partition problem, which is one of the classical NP-hard combinatorial optimization problems [12]. A recent line of research uses reinforcement learning (RL) for the device placement of computational graphs [13, 14, 15, 16, 17, 18, 19, 20]. For example, [13] proposed to train an RNN controller with content-based attention to predict the placement. Other studies advanced [13] in different ways, such as using hierarchical models [14], more sophisticated RL algorithms [15], and graph neural networks [16].
However, embedding table placement remains an open and challenging problem due to the operation fusion [21] of tables and the generalizability requirement. 1) Modern embedding implementations (e.g., FBGEMM [22]) use a single operation to subsume multiple tables for acceleration. The speedup of the fused operation over the sum of the single-table operation costs is not constant and depends on the characteristics of the fused tables (e.g., table dimensions). Our analysis finds that the speedups vary significantly across different table combinations, ranging from 1X to 3X (Figure 12 in Appendix A.3.2). Thus, we not only need to reason about cost balance but also about how the tables should be fused to maximize the speedup. 2) In real-world scenarios, the adopted embedding tables and the available devices can change frequently (e.g., machine learning engineers/researchers may conduct concurrent experiments with various table combinations and numbers of devices). Thus, a practical algorithm should generalize to tasks with unseen tables, different numbers of tables, and different numbers of devices. It is non-trivial to achieve this with the existing device placement approaches.
To this end, we introduce DreamShard, an RL approach for embedding table placement. DreamShard addresses operation fusion and generalizability with two novel ideas. 1) It learns a cost network to directly predict the costs of the fused operations. Specifically, the network takes as input the table features (e.g., table dimension) of each single table and outputs the computation and communication costs. 2) It trains a policy network by interacting with an estimated Markov decision process (MDP) without real GPU execution, where the states and the rewards are estimated by the predictions of the cost network. Equipped with sum reductions for the table representations and max reductions for the device representations, the two networks can directly generalize to unseen placement tasks with different numbers of tables and/or devices without fine-tuning.
Extensive experiments show that DreamShard outperforms the existing human expert and RNN-based [13] strategies on open-sourced synthetic tables [23] and our production tables, achieving up to 19% speedup over the strongest baseline. Moreover, it generalizes to unseen tasks with different numbers of tables and/or devices with a negligible performance drop (< 0.5 milliseconds). Additionally, its inference is highly efficient: it can place hundreds of tables in less than one second.
2 Generalizable Embedding Table Placement Problem
The embedding table placement problem seeks a device placement¹ of all the tables such that the overall cost (in terms of execution time) is minimized (we provide background on the distributed training of recommendation models in Appendix A.1). Consider $M$ embedding tables $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_M\}$ and $D$ devices, where $\mathbf{e}_i \in \mathbb{R}^N$ denotes the table features that characterize the embedding lookup patterns. In our work, we use 21 table features, including hash size, dimension, table size, pooling factor, and distribution (their definitions are provided in Appendix A.2). A placement $\mathbf{a} = [a_1, a_2, \ldots, a_M]$, where $a_i \in \{1, 2, \ldots, D\}$, assigns each table to a device. Let $c(\mathbf{a})$ denote the cost measured on GPUs. The goal of embedding table placement is to find the $\mathbf{a}$ such that $c(\mathbf{a})$ is minimized. Due to the NP-hardness of the partition problem [12], identifying the exact solution demands extensive computational overhead. Thus, state-of-the-art algorithms often approximate the optimal partition via sampling with RL [24, 13]. However, sampling remains expensive because obtaining $c(\mathbf{a})$ requires running operations on GPUs. Given that the embedding tables and the available devices can frequently change, we wish to approximate the best $\mathbf{a}$ without GPU execution.
Motivated by this, we study the generalizable embedding table placement (GETP) problem. Let $\mathbb{E}$ be the space of all the embedding tables. A placement task can be denoted as $T_i = (\mathbb{E}_i, D_i)$, where $\mathbb{E}_i \subseteq \mathbb{E}$ is a set of tables, and $D_i$ is the number of devices. Given $N_{\text{train}}$ training tasks $\mathbb{T}_{\text{train}} = \{T_1, T_2, \ldots, T_{N_{\text{train}}}\}$ and $N_{\text{test}}$ testing tasks $\mathbb{T}_{\text{test}} = \{T_1, T_2, \ldots, T_{N_{\text{test}}}\}$, the goal is to train a placement policy based on $\mathbb{T}_{\text{train}}$ (GPU execution is allowed during training) such that the learned policy can minimize the costs for the tasks in $\mathbb{T}_{\text{test}}$ without GPU execution.
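To make the problem setup concrete, the sketch below shows one way the objects above could be represented in Python, together with a naive random-search baseline over placements. The class and function names (`PlacementTask`, `best_of_random`) and the externally supplied `cost_fn` are illustrative assumptions, not part of DreamShard's released interface; devices are 0-indexed for convenience.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np


@dataclass
class PlacementTask:
    """A GETP task T_i = (E_i, D_i): a set of tables and a device count."""
    table_features: np.ndarray  # shape (M, N): one N-dimensional feature vector e_i per table
    num_devices: int            # D


def is_valid_placement(task: PlacementTask, placement: Sequence[int]) -> bool:
    """A placement a = [a_1, ..., a_M] assigns every table to a device (0-indexed here)."""
    return len(placement) == len(task.table_features) and all(
        0 <= a < task.num_devices for a in placement
    )


def best_of_random(
    task: PlacementTask,
    cost_fn: Callable[[PlacementTask, List[int]], float],  # c(a), e.g., measured on GPUs
    num_samples: int = 100,
    seed: int = 0,
) -> List[int]:
    """Naive baseline: sample random placements and keep the cheapest one."""
    rng = np.random.default_rng(seed)
    num_tables = len(task.table_features)
    best_placement, best_cost = None, float("inf")
    for _ in range(num_samples):
        candidate = rng.integers(0, task.num_devices, size=num_tables).tolist()
        cost = cost_fn(task, candidate)
        if cost < best_cost:
            best_placement, best_cost = candidate, cost
    return best_placement
```

Random search of this kind still needs to evaluate $c(\mathbf{a})$ on GPUs for every candidate, which is exactly the expense that the estimated MDP of Section 3 is designed to avoid.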
3 DreamShard Framework
Figure 2: DreamShard framework. The agent interacts with the estimated MDP, which is trained with the cost data collected from GPUs.
We present DreamShard, an RL framework based on an estimated MDP, to tackle the GETP problem. An overview of the framework is shown in Figure 2. The key idea is to formulate the table placement process as an MDP (Section 3.1) and train a cost network to estimate its states and rewards (Section 3.2). A policy network with a tailored generalizable architecture is trained by efficiently interacting with the estimated MDP (Section 3.3). The two networks are updated iteratively to improve the state/reward estimation and the placement policy.
3.1 MDP Formulation
Given embedding tables $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_M\}$ and $D$ devices, we aim to generate a placement $\mathbf{a} = [a_1, a_2, \ldots, a_M]$. The key idea is to place the tables one by one at each step, where the state characterizes the tables that have been placed so far, the action is the device ID, and the reward represents the execution time on GPUs. Specifically, at a step $t$, the state $s_t = \{s_{t,d}\}_{d=1}^{D}$ is all the table features of the tables placed on all the devices, where $s_{t,d} = \{\mathbf{e}_i \mid i \in \mathbb{P}_d\}$ denotes all the table features corresponding to device $d$ ($\mathbb{P}_d$ is the set of table IDs that have been placed on device $d$). We further augment the raw features with cost features, which are obtained by collecting the operation computation and communication times from GPUs (Appendix A.3 provides a comprehensive analysis of the cost features). Formally, the augmented state is defined as $\tilde{s}_t = \{s_t, \{\mathbf{q}_{t,d}\}_{d=1}^{D}\}$, where $\mathbf{q}_{t,d} \in \mathbb{R}^3$ has three elements representing the forward computation time, backward computation time, and backward communication time for the current operation on device $d$ (we provide detailed explanations of why the forward communication time is excluded in Appendix A.4). We find that the augmented cost features can significantly boost the performance, as evidenced by the ablations in Table 3. The action $a_t \in \mathbb{A}_t$ is an integer specifying the device ID, where $\mathbb{A}_t$ is the set of legal actions at step $t$. A device ID is considered legal if placing the current table on the corresponding device does not cause a memory explosion. The reward $r_t$ is $0$ for all the intermediate steps, and the reward at the final step $M$ is the negative of the cost, i.e., $r_M = -c(\mathbf{a})$, which encourages the agent to achieve a lower cost.
¹In this work, we focus on GPU devices, where all the GPU devices are identical, which is the most common configuration in our production. We defer the mixed scenarios of both GPUs and CPUs to future work.
Figure 3: MDP formulation of embedding table placement (an example of placing eight tables on two devices over steps t = 0 to t = 8).

Figure 4: DreamShard's cost network (left) and policy network (right).
The procedure is illustrated in Figure 3 for an example task of placing $8$ tables $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_8\}$ on $2$ devices. At step $0$, no table has been placed, so $s_0 = \{\{\}, \{\}\}$ and the augmented state is $\tilde{s}_0 = \{s_0, \{\mathbf{q}_{0,1}, \mathbf{q}_{0,2}\}\}$, where both $\mathbf{q}_{0,1}$ and $\mathbf{q}_{0,2}$ are zero vectors (i.e., $[0, 0, 0]$) since all the computation and communication times are 0 as well. Then the action $a_0 = 1$ makes the MDP transition to the next state $s_1 = \{\{\mathbf{e}_1\}, \{\}\}$ with its corresponding augmented state $\tilde{s}_1 = \{s_1, \{\mathbf{q}_{1,1}, \mathbf{q}_{1,2}\}\}$, where $\mathbf{q}_{1,1}$ becomes a non-zero vector containing the computation and communication costs obtained by running $\{\mathbf{e}_1\}$ and $\{\}$ on GPUs. We repeat the above process, and finally at step $8$ we have $s_8 = \{\{\mathbf{e}_1, \mathbf{e}_4, \mathbf{e}_7, \mathbf{e}_8\}, \{\mathbf{e}_2, \mathbf{e}_3, \mathbf{e}_5, \mathbf{e}_6\}\}$. The corresponding $\mathbf{q}_{8,1}$ and $\mathbf{q}_{8,2}$ are the measured times of running $\{\mathbf{e}_1, \mathbf{e}_4, \mathbf{e}_7, \mathbf{e}_8\}$ and $\{\mathbf{e}_2, \mathbf{e}_3, \mathbf{e}_5, \mathbf{e}_6\}$ on the two devices. The action sequence $\mathbf{a} = [a_1, a_2, \ldots, a_8]$ is the generated placement, which is then evaluated on GPUs to obtain the reward.
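The environment dynamics described above can be summarized with a short sketch. The code below is an assumed, simplified rendering of the MDP: `memory_of` and `cost_fn` are hypothetical callables standing in for per-table memory accounting and the GPU-measured cost $c(\mathbf{a})$, and the cost features $\mathbf{q}_{t,d}$ are omitted from the state for brevity.

```python
class PlacementMDP:
    """Simplified sketch of the table-placement MDP (cost features q_{t,d} omitted)."""

    def __init__(self, table_features, num_devices, device_memory, memory_of, cost_fn):
        self.tables = table_features      # (M, N) array of raw table features
        self.num_devices = num_devices    # D
        self.device_memory = device_memory
        self.memory_of = memory_of        # hypothetical: per-table memory footprint
        self.cost_fn = cost_fn            # hypothetical: overall cost c(a), measured or estimated
        self.reset()

    def reset(self):
        self.t = 0
        self.placed = [[] for _ in range(self.num_devices)]  # s_t: table IDs per device
        self.placement = []                                  # the partial placement a
        return self.placed

    def legal_actions(self):
        """A device is legal if adding the current table does not exceed its memory."""
        need = self.memory_of(self.tables[self.t])
        used = [sum(self.memory_of(self.tables[i]) for i in ids) for ids in self.placed]
        return [d for d in range(self.num_devices) if used[d] + need <= self.device_memory]

    def step(self, action):
        self.placed[action].append(self.t)
        self.placement.append(action)
        self.t += 1
        done = self.t == len(self.tables)
        # Intermediate rewards are 0; the final reward is the negative overall cost.
        reward = -self.cost_fn(self.placement) if done else 0.0
        return self.placed, reward, done
```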
Discussion 1. The MDP enjoys two desirable properties. 1) The legal action set $\mathbb{A}_t$ guarantees that the generated placement satisfies the memory constraints. 2) The one-by-one placement enables the agent to generalize across different numbers of tables. For example, an agent trained on an MDP with very few tables can be applied to another MDP with more tables by simply executing more steps.
Discussion 2. A straightforward idea for solving the MDP is to greedily place the current table on the device with the lowest cost at each step, where the cost function can be one of the state features or a combination of them (e.g., the sum of the table dimensions, or the sum of all the cost features). However, greedy heuristics are often sub-optimal. Thus, we seek a learning-based algorithm that explores various placement possibilities and makes comprehensive decisions based on all the state features.
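For reference, the greedy heuristic mentioned in Discussion 2 can be written in a few lines against the hypothetical `PlacementMDP` sketch above, using an assumed per-table scalar cost `table_cost` (e.g., the table dimension); it is not part of DreamShard itself.

```python
def greedy_placement(mdp, table_cost):
    """Place each table on the legal device with the smallest accumulated per-table cost."""
    mdp.reset()
    done = False
    while not done:
        # Accumulated heuristic load on each device so far.
        loads = [sum(table_cost(mdp.tables[i]) for i in ids) for ids in mdp.placed]
        action = min(mdp.legal_actions(), key=lambda d: loads[d])
        _, reward, done = mdp.step(action)
    return mdp.placement, -reward  # -reward recovers the overall cost c(a)
```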
3.2 Learning an Estimated MDP
Interacting with the above MDP is computationally expensive since obtaining the cost features and the reward requires GPU execution. Motivated by world models [25, 26], we build an estimated MDP by approximating the cost features and the reward with a cost network. Let $f_{\text{cost}}$ denote the cost network. $f_{\text{cost}}$ takes as input the raw table features $s_t$ and predicts the cost features $\{\mathbf{q}_{t,d}\}_{d=1}^{D}$ and the overall cost $c(\mathbf{a})$. $f_{\text{cost}}$ is trained with a mean squared error (MSE) loss using the cost data collected from the GPUs. Once trained, it can predict the cost features or the reward with a single forward pass, without GPU execution. However, it is non-trivial to design the architecture of $f_{\text{cost}}$ because it needs to accommodate different numbers of devices (i.e., $s_t$ can have a variable size) and different numbers of tables on each device (i.e., $s_{t,d}$ can have a variable length).
Algorithm 1 Training of DreamShard
1: Input: Training tasks $\mathbb{T}_{\text{train}}$, the number of data collection steps $N_{\text{collect}}$, the number of cost network update steps $N_{\text{cost}}$, the batch size for updating the cost network $N_{\text{batch}}$, the number of RL update steps $N_{\text{RL}}$, the number of considered episodes in each RL update $N_{\text{episode}}$
2: Initialize a cost network, a policy network, and a buffer
3: for iteration = 1, 2, ... until convergence do
4:     for step = 1, 2, ..., $N_{\text{collect}}$ do    ▷ Collect cost data from GPUs
5:         Randomly sample a training task from $\mathbb{T}_{\text{train}}$
6:         Generate a placement by interacting with the estimated MDP using the policy network
7:         Evaluate the placement on the hardware and store the collected cost data in the buffer
8:     end for
9:     for step = 1, 2, ..., $N_{\text{cost}}$ do    ▷ Update the cost network (no GPU execution)
10:        Randomly sample $N_{\text{batch}}$ cost data from the buffer
11:        Update the cost network based on the MSE loss (Eq. 1 in Appendix B.4.1)
12:    end for
13:    for step = 1, 2, ..., $N_{\text{RL}}$ do    ▷ Update the policy network (no GPU execution)
14:        Randomly sample a training task from $\mathbb{T}_{\text{train}}$
15:        Collect $N_{\text{episode}}$ episodes by interacting with the estimated MDP using the policy network
16:        Update the policy network based on the policy gradient loss (Eq. 2 in Appendix B.4.1)
17:    end for
18: end for
The left-hand side of Figure 4 shows DreamShard's generalizable design of $f_{\text{cost}}$, which is based on two key ideas. First, it uses a shared MLP to map raw table features into table representations. For any unseen tables, this MLP can be directly applied to extract table representations. Second, it enables a fixed-dimension representation for each device with sum reductions (i.e., the element-wise sum of the table representations in the device), and similarly for the overall representation across devices with max reductions (Appendix B.3 compares different reduction choices and finds that this sum-max combination leads to the most accurate prediction). The reduced representations are then followed by multiple MLP heads for cost predictions. For unseen tasks with different numbers of tables and/or devices, the reductions will always lead to fixed-dimension device/overall representations, so that the prediction heads can be directly applied. Appendix B.1 provides more details.
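A minimal PyTorch sketch of this design is shown below, assuming hypothetical layer sizes and head shapes (the actual architecture is specified in Appendix B.1). The state is represented as a list of per-device tensors of raw table features, so the same module handles any number of tables and devices.

```python
import torch
import torch.nn as nn


class CostNetwork(nn.Module):
    """Sketch of f_cost: shared table MLP -> per-device sum -> cost heads; cross-device max -> overall cost."""

    def __init__(self, num_table_features: int = 21, hidden: int = 128):
        super().__init__()
        self.hidden = hidden
        self.table_mlp = nn.Sequential(
            nn.Linear(num_table_features, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Per-device heads: forward computation, backward computation, backward communication.
        self.forward_head = nn.Linear(hidden, 1)
        self.backward_head = nn.Linear(hidden, 1)
        self.comm_head = nn.Linear(hidden, 1)
        # Overall cost head applied to the max-reduced representation across devices.
        self.overall_head = nn.Linear(hidden, 1)

    def forward(self, state):
        # `state` is a list of D tensors; entry d has shape (num_tables_on_d, num_table_features).
        device_reps = []
        for tables in state:
            if tables.shape[0] == 0:
                device_reps.append(torch.zeros(self.hidden))
            else:
                device_reps.append(self.table_mlp(tables).sum(dim=0))  # sum reduction over tables
        device_reps = torch.stack(device_reps)  # (D, hidden)
        cost_feats = torch.cat(
            [self.forward_head(device_reps),
             self.backward_head(device_reps),
             self.comm_head(device_reps)],
            dim=1,
        )  # predicted cost features q_{t,d}, shape (D, 3)
        overall = self.overall_head(device_reps.max(dim=0).values)  # max reduction, then scalar c(a)
        return cost_feats, overall.squeeze(-1)
```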
3.3 Training the Policy Network on the Estimated MDP
Generalizable policy network architecture. Let $\pi$ be the policy network. $\pi$ maps the augmented state $\tilde{s}_t = \{s_t, \{\mathbf{q}_{t,d}\}_{d=1}^{D}\}$ to an action $a_t$, i.e., $a_t = \pi(\tilde{s}_t)$. $\pi$ also adopts a generalizable design, shown on the right-hand side of Figure 4. Like $f_{\text{cost}}$, $\pi$ uses a shared MLP and sum reductions to produce a fixed-dimension representation, which is then concatenated with the cost features to obtain the device representation. To accommodate the potentially variable action space (i.e., the number of available devices may vary), a shared MLP processes each device representation separately to obtain a confidence score, followed by a Softmax layer that produces the action probabilities. This design allows $\pi$ to generalize across different numbers of devices. Appendix B.2 provides more details.
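The sketch below mirrors this design in PyTorch under the same assumptions as the cost-network sketch above (hypothetical layer sizes; cost features passed in as a $(D, 3)$ tensor, either measured or predicted by the cost network); the legal-action mask implements the memory constraint from Section 3.1.

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """Sketch of pi: shared table MLP -> per-device sum -> concat cost features -> shared scoring MLP -> softmax."""

    def __init__(self, num_table_features: int = 21, num_cost_features: int = 3, hidden: int = 128):
        super().__init__()
        self.hidden = hidden
        self.table_mlp = nn.Sequential(
            nn.Linear(num_table_features, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.score_mlp = nn.Sequential(
            nn.Linear(hidden + num_cost_features, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state, cost_features, legal_mask):
        # state: list of D tensors of raw table features; cost_features: (D, 3); legal_mask: (D,) bool tensor.
        device_reps = []
        for tables in state:
            if tables.shape[0] == 0:
                device_reps.append(torch.zeros(self.hidden))
            else:
                device_reps.append(self.table_mlp(tables).sum(dim=0))  # sum reduction over tables
        device_reps = torch.stack(device_reps)  # (D, hidden)
        scores = self.score_mlp(torch.cat([device_reps, cost_features], dim=1)).squeeze(-1)  # (D,)
        scores = scores.masked_fill(~legal_mask, float("-inf"))  # mask devices that would overflow memory
        return torch.softmax(scores, dim=0)  # action probabilities over the D devices
```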
Training and inference. Algorithm 1 summarizes the training procedure of DreamShard, which iteratively executes the following: 1) collect cost data from GPUs based on the placements generated by the current policy, 2) update the cost network with the previously collected cost data, and 3) update the policy network by interacting with the current estimated MDP. Throughout the training process, the estimated MDP gradually becomes more accurate, and the resulting policy network tends to generate better placements. Appendix B.4.2 provides more details of the training procedure. For inference, the trained cost network and policy network can be directly applied to unseen tasks to generate placements without GPU execution, as summarized in Algorithm 2 in Appendix B.4.3.
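For concreteness, the two update steps in Algorithm 1 (lines 11 and 16) might look roughly as follows. The exact losses are Eq. 1 and Eq. 2 in Appendix B.4.1, which are not reproduced here, so the MSE and REINFORCE-style objectives below are standard stand-ins rather than the paper's precise formulations (e.g., no baseline or entropy terms are shown).

```python
import torch
import torch.nn.functional as F


def update_cost_network(cost_net, optimizer, batch):
    """One cost-network update: MSE between predicted and measured costs (a stand-in for Eq. 1)."""
    loss = 0.0
    for state, measured_q, measured_overall in batch:  # measured values are tensors collected from GPUs
        pred_q, pred_overall = cost_net(state)
        loss = loss + F.mse_loss(pred_q, measured_q) + F.mse_loss(pred_overall, measured_overall)
    optimizer.zero_grad()
    (loss / len(batch)).backward()
    optimizer.step()


def update_policy_network(policy_net, optimizer, episodes):
    """One REINFORCE-style update on episodes from the estimated MDP (a stand-in for Eq. 2)."""
    loss = 0.0
    for log_probs, final_reward in episodes:  # intermediate rewards are 0, so the return is the final reward
        loss = loss - final_reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    (loss / len(episodes)).backward()
    optimizer.step()
```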