Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores

Diana Petrescu
EPFL
Lausanne, CH
diana.petrescu@epfl.ch
Arsany Guirguis
EPFL
Lausanne, CH
arsany.guirguis91@gmail.com
Do Le Quoc
Huawei Munich Research Center
Munich, DE
quoc.do.le@huawei.com
Javier Picorel
Huawei Munich Research Center
Munich, DE
javier.picorel@huawei.com
Rachid Guerraoui
EPFL
Lausanne, CH
rachid.guerraoui@epfl.ch
Florin Dinu
Huawei Munich Research Center
Munich, DE
florin.dinu@huawei.com
ABSTRACT
Storage disaggregation underlies today's cloud and is naturally complemented by pushing down some computation to storage, thus mitigating the potential network bottleneck between the storage and compute tiers. We show how ML training benefits from storage pushdowns by focusing on transfer learning (TL), the widespread technique that democratizes ML by reusing existing knowledge on related tasks. We propose HAPI, a new TL processing system centered around two complementary techniques that address challenges introduced by disaggregation. First, applications must carefully balance execution across tiers for performance. HAPI judiciously splits the TL computation during the feature extraction phase, yielding pushdowns that not only improve network time but also improve total TL training time by overlapping the execution of consecutive training iterations across tiers. Second, operators want resource efficiency from the storage-side computational resources. HAPI employs storage-side batch size adaptation, allowing increased storage-side pushdown concurrency without affecting training accuracy. HAPI yields up to 2.5x training speed-up while choosing, in 86.8% of cases, the best performing split point or one that is at most 5% off from the best.
1 INTRODUCTION
Storage disaggregation (i.e., the separation of the storage and compute tiers) powers today's cloud object stores (COS) (e.g., Amazon S3 [2], Google Cloud Storage [27], Azure Blob Storage [11]) as it reduces costs and simplifies management by allowing the two tiers to scale independently. Unfortunately, these benefits come at the cost of a potential network bottleneck [1, 73, 75] between the tiers, as network bandwidth growth is outpaced by storage bandwidth and compute throughput growth [23, 70]. Near-data computation techniques are the natural complement to storage disaggregation. These involve provisioning storage-side compute resources to run part of an application (called a pushdown) in order to mitigate the network bottleneck by reducing the amount of data transferred between tiers. These storage-side compute resources are limited by design as they are not meant to replace the compute tier. Following the initial success of pushdowns for a restricted set of workloads (e.g., SQL [7]), there is renewed interest in broadening the applicability of such pushdowns to new applications and to specialized hardware.
This paper shows how ML training can benefit from pushdowns to disaggregated storage by focusing on transfer learning (TL) [80], a widespread ML technique [15] that enables a generic model previously trained (pre-trained) on a large dataset to be efficiently customized (fine-tuned) for a related task. TL democratizes ML by lowering the entry bar, as fine-tuning existing models avoids the need for new, large datasets and the computational expense of training models from scratch. Thus, TL has become a cornerstone of modern cloud ML services [3, 4, 14, 16], enabling the use of pre-trained models and scalable fine-tuning capabilities across major platforms. In traditional TL fine-tuning, the initial DNN layers perform feature extraction while the rest perform re-training.
This paper proposes Hapi¹, a new TL fine-tuning system that spans the storage and compute tiers and judiciously pushes down to storage part of the TL DNN. Hapi leverages two new techniques that address challenges introduced by storage disaggregation for the benefit of both users and operators. The first challenge is that pushdowns make it harder for applications and users to optimize performance. Typically, pushdowns are chosen to minimize network time by having the pushdown's output be smaller than the job's input. Hapi builds on the insight that, for reducing TL fine-tuning time, pushing down only to minimize network time is useful but, unfortunately, sub-optimal. Instead, applications need to carefully balance the pushdown processing time, the network transfer time as well as the compute tier processing time. Hapi achieves this balance by splitting the TL DNN during its feature extraction phase, which contains some DNN layers with relatively small output sizes (for reducing network time) (§3), while also allowing the pushdown processing time for iteration N+1 to substantially overlap with the compute tier processing time for iteration N.

¹HAPI was the Egyptian god of the annual flooding of the Nile, often portrayed as binding two regions (splits) of Egypt (https://en.wikipedia.org/wiki/Hapi_(Nile_god)).
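To make this overlap concrete, the sketch below (in PyTorch-style Python) shows how a backbone split at a given layer index can pipeline storage-side feature extraction for batch N+1 with compute-tier work on batch N. This is a minimal illustration, not Hapi's actual implementation; the split_index, the depth-1 prefetch queue, and the prefix/suffix/head decomposition are assumptions made for the example.

```python
import threading
import queue
import torch
import torch.nn as nn

def split_backbone(backbone: nn.Sequential, split_index: int):
    """Split a sequential backbone into a storage-side prefix and a compute-side suffix."""
    prefix = nn.Sequential(*list(backbone.children())[:split_index])   # pushed down to the COS
    suffix = nn.Sequential(*list(backbone.children())[split_index:])   # stays in the compute tier
    return prefix, suffix

def run_pipelined(prefix, suffix, head, batches, optimizer, loss_fn):
    """Overlap storage-side feature extraction for batch N+1 with compute-tier work on batch N."""
    features = queue.Queue(maxsize=1)       # depth-1 queue models one-iteration overlap

    def storage_side():
        with torch.no_grad():               # the pushed-down prefix is frozen
            for x, y in batches:
                features.put((prefix(x), y))
        features.put(None)                  # signal end of epoch

    threading.Thread(target=storage_side, daemon=True).start()

    while (item := features.get()) is not None:
        emb, y = item
        out = head(suffix(emb))             # remaining frozen layers + trainable classifier
        loss = loss_fn(out, y)
        optimizer.zero_grad()
        loss.backward()                     # gradients flow only to the trainable head
        optimizer.step()
```

In this sketch the split point determines both the size of the tensor crossing the network (the prefix output) and how much work each tier performs per iteration, which is exactly the balance Hapi's splitting decision targets.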
The second challenge, particularly important to operators, consists in using the limited storage-side compute resources efficiently. Hapi addresses this challenge with our novel technique called storage-side batch size adaptation. Splitting the TL DNN naturally decouples the batch sizes used in each tier. The insight is that one can use a significantly smaller batch size in the storage-side pushdown compared to the compute tier, importantly, without affecting training accuracy. By design, Hapi maintains the same training accuracy as if the training was fully performed in the compute tier (i.e., no pushdowns to storage) by pushing down only parts of the TL feature extraction phase which uses frozen weights (fixed, not re-trained) and by ensuring that the size and contents of the training batch in the compute tier remain unchanged. The benefit of storage-side batch size adaptation is that it greatly reduces the amount of GPU memory used by pushdowns, as the initial layers of feature extraction are the most memory intensive (§3) due to their larger output sizes. This memory reduction has two crucial consequences. First, it enables several pushdowns to make progress concurrently in the COS, thus improving resource efficiency. Second, it avoids the out-of-memory (OOM) errors that plague practitioners.
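The sketch below illustrates the idea behind storage-side batch size adaptation, assuming a frozen PyTorch prefix running in the COS: a large compute-tier batch is processed as a sequence of smaller micro-batches and the outputs are concatenated, so the batch the compute tier trains on is unchanged. The micro-batch size of 128 is an arbitrary value for illustration, not Hapi's policy.

```python
import torch

@torch.no_grad()  # the pushed-down layers are frozen, so no autograd state is kept
def extract_features(prefix, batch, micro_batch_size=128):
    """Run the frozen prefix over a large batch in smaller chunks.

    Peak GPU memory is bounded by one micro-batch's activations instead of the
    full batch's, while the concatenated result matches processing the whole
    batch at once (the prefix runs in eval mode with frozen weights, so it has
    no batch-dependent state such as active batch-norm updates).
    """
    prefix.eval()
    outputs = [prefix(chunk) for chunk in torch.split(batch, micro_batch_size)]
    return torch.cat(outputs, dim=0)

# Hypothetical usage: the compute tier still sees the full 1024-sample batch.
# features = extract_features(frozen_prefix, big_batch_of_1024)
```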
Hapi spans both the COS and compute tiers, is transparent to the user, and relies on inexpensive profiling runs. Hapi's design uses a stateless storage-side component alongside lightweight requests between the two tiers to simplify practical concerns regarding load balancing, scalability and failure resilience.
Specically, the contributions of this paper are:
(1)
Identifying and demonstrating the benets of applying
near-data computation techniques to TL on top of the
disaggregated COS.
(2)
A measurement study of DNN layer characteristics
across 7 popular DNNs (§3), including modern archi-
tectures like the Vision Transformer (ViT) and widely-
used CNNs such as ResNet and DenseNet. These char-
acteristics play an important role in our system design.
(3)
Hapi, an end-to-end system comprising two key design
techniques: DNN splitting (§4.3) and storage-side batch
size adaptation (§4.4) which reduce network transfers,
improve training runtime and enable increased push-
down concurrency in the COS.
(4)
An extensive evaluation (§5) showing up to 2.5
×
speed-
up in training runtime while choosing in 86.8% of cases
the best performing split point or one that is at most
5% o from the best.
The paper is organized as follows. We provide background in §2 and present our measurement study in §3. §4 presents the design and implementation of Hapi. We show experimental results in §5 and discuss related work in §6.
2 BACKGROUND
2.1 Transfer learning
TL democratizes ML by allowing knowledge from a model pre-trained on a large dataset to be adapted and reused (fine-tuned) for a different but related task [69]. In doing so, the DNN training time and the generalization error are reduced [26]. As models grow in size and complexity, TL has become increasingly essential for efficiently adapting pre-trained models to specific tasks with fewer resources, making their deployment more practical across various domains. The intuition behind TL is that the pre-trained model, often referred to as a backbone, captures generalizable embeddings (representations of the input data) which can be adapted or fine-tuned for new tasks rather than requiring a model to be trained from scratch. Examples of backbones include convolutional neural networks (CNNs) and Vision Transformers (ViTs) for computer vision [66], as well as models like BERT [20] and GPT [8] for natural language processing (NLP). Recent advances in Vision Transformers [17, 28, 68] exemplify this approach by providing scalable and reusable backbones that can be fine-tuned for a wide range of tasks, including classification, detection, and segmentation.
Figure 1: Overview of TL fine-tuning.
The key to TL lies in fine-tuning the pre-trained model, as depicted in Figure 1. Pre-training is usually done on a different system and often by a different entity, and is outside the scope of this paper. Traditionally, fine-tuning is divided further into two phases: (i) feature extraction, where embeddings or high-level representations are extracted from new input data using (partially or entirely) the pre-trained model, and then (ii) training, to create a new classifier using the extracted embeddings [26]. Typically, the early layers of
the pre-trained model, which capture more general features (e.g., edges, textures in vision models), are frozen during fine-tuning, i.e., the model weights of these layers (in red in Figure 1) are not updated with backpropagation. Every iteration (i.e., input batch processed) involves feature extraction followed by training. We refer to the last frozen layer as the freeze layer (or index).
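As a concrete illustration of fine-tuning with a freeze layer, the sketch below (in PyTorch, assuming a recent torchvision and a ResNet50 backbone) freezes the early layers, attaches a new classifier head, and passes only the unfrozen parameters to the optimizer. The number of classes and the choice of freeze point are hypothetical, for illustration only.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load a backbone pre-trained on a large dataset (here, ImageNet weights).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the original classifier with a new head for the target task.
num_classes = 10  # hypothetical fine-tuning task
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Freeze everything up to the "freeze layer"; here we freeze all layers
# except the last residual stage and the new head (an illustrative choice).
for name, param in backbone.named_parameters():
    if not (name.startswith("layer4") or name.startswith("fc")):
        param.requires_grad = False  # frozen: used only for feature extraction

# Only the unfrozen parameters are re-trained.
optimizer = optim.SGD(
    (p for p in backbone.parameters() if p.requires_grad), lr=1e-3, momentum=0.9
)
```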
2.2 Cloud object stores
Cloud object stores (COS), such as Amazon S3 [2], Google Cloud Storage [27], and Azure Blob Storage [11], are a popular way to store large-scale unstructured data, providing ease of use, high availability, high scalability, and durability at a low cost [51]. COS are the prime example of storage disaggregation. The COS is connected to the compute tier by a network that, unfortunately, is a bottleneck [23, 46, 47, 60] even when the network is maxed out. The reason is hardware trends. The network bandwidth between the COS and the compute tier is lower than the internal storage bandwidth of COS servers and also lower than the computation throughput [23, 49]. The typical network bandwidth of a single cloud server is 25-400 Gbps (3.125-50 GBps) [23, 38, 61, 71]. A single modern NVMe SSD can read sequentially at well over 10 GBps [10, 48], so a couple of NVMe SSDs are sufficient to max out the network bandwidth. In practice, storage servers are provisioned with many SSDs; an array of PCIe 5.0 NVMe SSDs can exceed 100 GBps in read throughput [48]. Other storage media are faster than SSDs, further aggravating the network bottleneck. With sufficient thread parallelism, DRAM reads can exceed 100 GBps, and persistent memory reads can exceed 30 GBps [72]. Compute throughput also exceeds network bandwidth [23, 49]. Earlier studies [46, 60] have reported network read throughput as low as 100 MBps per connection from Amazon S3, but the trend is for network bandwidth to improve. More recently, up to 100 Gbps from general purpose instances to S3 has been reported [21].
In light of this enduring network bottleneck, an important trend has been to push down computation inside the COS, to reduce the amount of data sent over the network. Pushdowns were initially restricted to a subset of SQL (e.g., Amazon S3 Select [7]), but there is a renewed effort in the industry to support more complex pushdowns for computations such as image processing [13] or analytics [1]. There has been growing interest in pushing down parts of ML computations to storage [45, 67]. This trend goes hand-in-hand with another development, enabling pushdowns to use specialized hardware (§2.3).
Despite these trends, two challenges remain for COS users. The network may remain a bottleneck despite the pushdowns, and the COS computational resources are scarce and need to be used efficiently as they are only meant to mitigate the network bottleneck and not replace the compute tier.
2.3 Hardware-accelerated pushdowns
Pushdowns were initially restricted to a subset of SQL, including filtering, projecting, and aggregation (e.g., Amazon S3 Select [7]). The current, natural trend is to offer the benefits of pushing down to a wider range of applications. Unfortunately, restricting pushdowns to CPUs can lead to wasted resources and performance. First, for more complex operations, CPUs can become a bottleneck. Studies show that even with 32 cores, an SGD optimizer can fully utilize the CPU when using a 100 Gbps network [42]. Second, it is not sufficient for the CPU processing to be just faster than the network, because the output of a pushdown may be smaller than its input. For example, for a pushdown to generate output at 100 Gbps (12.5 GBps), assuming an input/output ratio of 2, it needs to process input at 25 GBps. Finally, the aggregate storage bandwidth of a storage server tends to increase faster than CPU capabilities [70].
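A back-of-the-envelope helper makes this sizing explicit; the 100 Gbps output rate and the 2:1 ratio below are the example values from the text, while the function itself is ours for illustration.

```python
def required_input_rate_gbyte_per_s(output_gbit_per_s: float, input_output_ratio: float) -> float:
    """Input throughput a pushdown must sustain to keep the network link busy."""
    output_gbyte_per_s = output_gbit_per_s / 8       # 100 Gbps -> 12.5 GBps
    return output_gbyte_per_s * input_output_ratio   # 2:1 ratio -> 25 GBps

print(required_input_rate_gbyte_per_s(100, 2))  # 25.0
```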
As a result, the current trend is to allow pushdowns to use specialized hardware such as GPUs. Several works [12, 31] have proposed to use them in storage systems to speed up erasure coding. Finally, there is a push to more closely integrate storage with GPUs, which further increases the appeal of next-to-storage GPUs. For example, IBM's Storage Scale System 6000 [39] integrates NVIDIA GPUDirect Storage [57], enabling a direct data path between GPU memory and storage to reduce latency and enhance performance for AI and data-intensive workloads.
3 MEASUREMENT STUDY
Next, we present a detailed measurement study of 7 DNNs. These include a state-of-the-art Vision Transformer [66] as well as several widely-adopted foundational models such as ResNet50 [33], DenseNet121 [36], and VGG19 [64]. These models cover a diverse range of architectural characteristics, making them well-suited for evaluating system-level performance in terms of speed and resource efficiency. We characterize the per-layer properties across three dimensions: output size, compute time, and maximum GPU memory used. These properties all play a role in Hapi's design (§4). Additional layer-related information for each DNN can be found in Table 2. For the DNNs structured as a sequence of blocks (e.g., ResNets) we count one block as one layer and we use only block boundaries as candidate split points. The input dataset is ImageNet. For readability purposes we group the models into 3 sub-groups in each figure.
Hardware setup. For this section we use two identical GPU-accelerated machines from a public cloud, one for the Hapi client and the other for the COS and the Hapi server.
Figure 2: A measurement study of 3 per-layer properties for 7 popular DNNs (AlexNet, Transformer, ResNet18, ResNet50, DenseNet, Vgg11, Vgg19; x-axis: layer index): (a) per-layer output sizes, (b) per-layer forward pass time, (c) per-layer max GPU memory usage.
Each machine has an Intel Xeon Gold 6278C CPU with 16
cores, 64 GB RAM, 1 Tesla T4 GPU with 16 GB RAM, a 300
GB SSD and runs Ubuntu 20.04 server 64 bit with CUDA 11.2.
The network bandwidth between VMs is 12 Gbps.
Per-layer output size. Figure 2a shows the per-layer output size. For reference, the size of a pre-processed ImageNet input tensor is shown with the horizontal dotted blue line. This output size is for a batch size of 1. One can accurately extrapolate from this by multiplying by a specific batch size. The important observation is that for most models, early on in the DNN structure, there exist layers with a comparatively small output size, often significantly smaller than subsequent layers (e.g., ResNet layer 4, AlexNet layers 3 and 6). These layers are good candidates for splitting the TL application between the COS and the compute tier. Overall, the layer output size generally increases in the beginning (with convolution layers) and then decreases (with pooling layers), but not in a monotonic fashion. While only a few split points have an output size smaller than the pre-processed tensor, we will show that this is not strictly necessary for performance gains because optimizing the network time is only part of the story. Hapi owes its benefits to balancing network time against the computation time in both tiers.
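A study like this can be reproduced with PyTorch forward hooks; the sketch below records each top-level layer's output size for a single pre-processed ImageNet-shaped input. It is an illustrative measurement harness, not the paper's exact methodology, and it assumes the block granularity used in the study (one top-level child per "layer").

```python
import torch
import torch.nn as nn
from torchvision import models

def per_layer_output_sizes(model: nn.Module, sample: torch.Tensor):
    """Return (layer_name, output_MB) for each top-level child of the model."""
    sizes, hooks = [], []

    def make_hook(name):
        def hook(_module, _inputs, output):
            out = output if isinstance(output, torch.Tensor) else output[0]
            sizes.append((name, out.element_size() * out.nelement() / 2**20))
        return hook

    for name, child in model.named_children():   # block granularity, as in the study
        hooks.append(child.register_forward_hook(make_hook(name)))
    model.eval()
    with torch.no_grad():
        model(sample)
    for h in hooks:
        h.remove()
    return sizes

# Example: batch size 1, pre-processed ImageNet tensor (3 x 224 x 224).
print(per_layer_output_sizes(models.resnet18(), torch.randn(1, 3, 224, 224)))
```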
Per-layer computation time. Figure 2b shows the per-layer computation time for the forward pass for a batch size of 128. The insights are that some models are more lightweight (e.g., AlexNet) than others and that some models (e.g., DenseNet) show significant variability across layers. However, the forward pass time remains relatively stable with an increase in layer index and gives Hapi the flexibility to balance computation time across the two tiers.
Per-layer maximum GPU memory usage. Figure 2c shows the maximum amount of GPU memory needed for the processing of each layer for a batch size of 128. Again, the variation is model dependent and for each model there is variability between layers. Generally, the first few layers use more memory, suggesting that these need to be the focal point if reducing memory consumption is desired.
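Peak memory per layer can be approximated with PyTorch's CUDA memory statistics; the sketch below measures the peak allocation while each top-level layer processes the running activation for a batch of 128. It is a simplified harness that assumes a CUDA device and a torchvision ResNet-style backbone whose top-level children can be chained manually (with a flatten before the final fc).

```python
import torch
from torchvision import models

def per_layer_peak_memory(model, batch, device="cuda"):
    """Return (layer_name, peak_GB) while each top-level layer runs in turn."""
    model = model.to(device).eval()
    x = batch.to(device)
    peaks = []
    with torch.no_grad():
        for name, layer in model.named_children():
            if name == "fc":                       # classifier expects flattened input
                x = torch.flatten(x, 1)
            torch.cuda.reset_peak_memory_stats(device)
            x = layer(x)                           # activation flows layer to layer
            torch.cuda.synchronize(device)
            peaks.append((name, torch.cuda.max_memory_allocated(device) / 2**30))
    return peaks

# Example: ResNet18 with a batch of 128 ImageNet-sized inputs.
# print(per_layer_peak_memory(models.resnet18(), torch.randn(128, 3, 224, 224)))
```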
The need for a dynamic splitting algorithm. We next motivate Hapi's dynamic splitting approach by showing the limitations of static splitting approaches.
Figure 3 shows 7 groups of bars. Each group represents a model (listed on the x-axis) and the 7 bars in each group represent different batch sizes (128, 256, 512, 1024, 1536, 2048, 3072 from left to right). The figure summarizes a performance sweep over all possible split points in the DNNs (after each layer). The figure shows per-epoch runtime normalized to splitting at the freezing layer (@Freeze). The points (blue rectangle, red circle) show two other sensible static splits: (1) with red circles, @Min, splitting at the layer with the smallest output (earliest such layer as a tiebreaker) and (2) with blue rectangles, NoSplit, i.e., sending the entire input to the compute tier. The lines show the speed-up range for all other splits, except the three mentioned above (@Freeze, @Min, NoSplit). If some of the static splits do not appear (e.g., Vgg11, Vgg19, Transformer) it is because they cause OOMs on the client. However, @Freeze (y-axis value = 1) never OOMs.
Figure 3: Speed-up when splitting at various layers, normalized to splitting at the freezing layer (y-axis: speed-up vs. splitting at the freeze layer; 7 batch sizes per model, left to right: 128, 256, 512, 1024, 1536, 2048, 3072; markers: @Min, @Freeze = 1, NoSplit).
Several insights can be derived from the figure, which point to the need for an intelligent dynamic splitting algorithm. First, no static splitting strategy wins in all cases. Second, even though one splitting strategy is the best option