Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores

Diana Petrescu
EPFL
Lausanne, CH
diana.petrescu@epfl.ch
Arsany Guirguis
EPFL
Lausanne, CH
arsany.guirguis91@gmail.com
Do Le Quoc
Huawei Munich Research Center
Munich, DE
quoc.do.le@huawei.com
Javier Picorel
Huawei Munich Research Center
Munich, DE
javier.picorel@huawei.com
Rachid Guerraoui
EPFL
Lausanne, CH
rachid.guerraoui@epfl.ch
Florin Dinu
Huawei Munich Research Center
Munich, DE
florin.dinu@huawei.com
ABSTRACT
Storage disaggregation underlies today's cloud and is naturally complemented by pushing down some computation to storage, thus mitigating the potential network bottleneck between the storage and compute tiers. We show how ML training benefits from storage pushdowns by focusing on transfer learning (TL), the widespread technique that democratizes ML by reusing existing knowledge on related tasks. We propose HAPI, a new TL processing system centered around two complementary techniques that address challenges introduced by disaggregation. First, applications must carefully balance execution across tiers for performance. HAPI judiciously splits the TL computation during the feature extraction phase, yielding pushdowns that not only improve network time but also improve total TL training time by overlapping the execution of consecutive training iterations across tiers. Second, operators want resource efficiency from the storage-side computational resources. HAPI employs storage-side batch size adaptation, allowing increased storage-side pushdown concurrency without affecting training accuracy. HAPI yields up to 2.5x training speed-up while choosing, in 86.8% of cases, the best performing split point or one that is at most 5% off from the best.
1 INTRODUCTION
Storage disaggregation (i.e., the separation of the storage and compute tiers) powers today's cloud object stores (COS) (e.g., Amazon S3 [2], Google Cloud Storage [27], Azure Blob Storage [11]) as it reduces costs and simplifies management by allowing the two tiers to scale independently. Unfortunately, these benefits come at the cost of a potential network bottleneck [1, 73, 75] between the tiers, as network bandwidth growth is outpaced by storage bandwidth and compute throughput growth [23, 70]. Near-data computation techniques are the natural complement to storage disaggregation. These involve provisioning storage-side compute resources to run part of an application (called a pushdown) in order to mitigate the network bottleneck by reducing the amount of data transferred between tiers. These storage-side compute resources are limited by design as they are not meant to replace the compute tier. Following the initial success of pushdowns for a restricted set of workloads (e.g., SQL [7]), there is renewed interest in broadening the applicability of such pushdowns to new applications and to specialized hardware.
This paper shows how ML training can benefit from pushdowns to disaggregated storage by focusing on transfer learning (TL) [80], a widespread ML technique [15] that enables a generic model previously trained (pre-trained) on a large dataset to be efficiently customized (fine-tuned) for a related task. TL democratizes ML by lowering the entry bar, as fine-tuning existing models avoids the need for new, large datasets and the computational expense of training models from scratch. Thus, TL has become a cornerstone of modern cloud ML services [3, 4, 14, 16], enabling the use of pre-trained models and scalable fine-tuning capabilities across major platforms. In traditional TL fine-tuning, the initial DNN layers perform feature extraction while the rest perform re-training.
This paper proposes Hapi¹, a new TL fine-tuning system that spans the storage and compute tiers and judiciously pushes down to storage part of the TL DNN. Hapi leverages two new techniques that address challenges introduced by storage disaggregation for the benefit of both users and operators. The first challenge is that pushdowns make it harder for applications and users to optimize performance. Typically, pushdowns are chosen to minimize network time by having the pushdown's output be smaller than the job's input. Hapi builds on the insight that, for reducing TL fine-tuning time, pushing down only to minimize network time is useful but, unfortunately, sub-optimal. Instead, applications need to carefully balance the pushdown processing time, the network transfer time as well as the compute tier processing time. Hapi achieves this balance by splitting the TL DNN during its feature extraction phase, which contains some DNN layers with relatively small output sizes (for reducing network time) (§3), while also allowing the pushdown processing time for iteration N+1 to substantially overlap with the compute tier processing time for iteration N.

¹HAPI was the Egyptian god of the annual flooding of the Nile, often portrayed as binding two regions (splits) of Egypt (https://en.wikipedia.org/wiki/Hapi_(Nile_god)).
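To make this overlap concrete, the sketch below (in PyTorch-style Python) shows how a backbone split at a given layer index can pipeline storage-side feature extraction for batch N+1 with compute-tier work on batch N. This is a minimal illustration, not Hapi's actual implementation; the split_index, the depth-1 prefetch queue, and the prefix/suffix/head decomposition are assumptions made for the example.

```python
import threading
import queue
import torch
import torch.nn as nn

def split_backbone(backbone: nn.Sequential, split_index: int):
    """Split a sequential backbone into a storage-side prefix and a compute-side suffix."""
    prefix = nn.Sequential(*list(backbone.children())[:split_index])   # pushed down to the COS
    suffix = nn.Sequential(*list(backbone.children())[split_index:])   # stays in the compute tier
    return prefix, suffix

def run_pipelined(prefix, suffix, head, batches, optimizer, loss_fn):
    """Overlap storage-side feature extraction for batch N+1 with compute-tier work on batch N."""
    features = queue.Queue(maxsize=1)       # depth-1 queue models one-iteration overlap

    def storage_side():
        with torch.no_grad():               # the pushed-down prefix is frozen
            for x, y in batches:
                features.put((prefix(x), y))
        features.put(None)                  # signal end of epoch

    threading.Thread(target=storage_side, daemon=True).start()

    while (item := features.get()) is not None:
        emb, y = item
        out = head(suffix(emb))             # remaining frozen layers + trainable classifier
        loss = loss_fn(out, y)
        optimizer.zero_grad()
        loss.backward()                     # gradients flow only to the trainable head
        optimizer.step()
```

In this sketch the split point determines both the size of the tensor crossing the network (the prefix output) and how much work each tier performs per iteration, which is exactly the balance Hapi's splitting decision targets.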
The second challenge, particularly important to operators, consists in using the limited storage-side compute resources efficiently. Hapi addresses this challenge with our novel technique called storage-side batch size adaptation. Splitting the TL DNN naturally decouples the batch sizes used in each tier. The insight is that one can use a significantly smaller batch size in the storage-side pushdown compared to the compute tier, importantly, without affecting training accuracy. By design, Hapi maintains the same training accuracy as if the training was fully performed in the compute tier (i.e., no pushdowns to storage) by pushing down only parts of the TL feature extraction phase which uses frozen weights (fixed, not re-trained) and by ensuring that the size and contents of the training batch in the compute tier remain unchanged. The benefit of storage-side batch size adaptation is that it greatly reduces the amount of GPU memory used by pushdowns, as the initial layers of feature extraction are the most memory intensive (§3) due to their larger output sizes. This memory reduction has two crucial consequences. First, it enables several pushdowns to make progress concurrently in the COS, thus improving resource efficiency. Second, it avoids the out-of-memory (OOM) errors that plague practitioners.
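The sketch below illustrates the idea behind storage-side batch size adaptation, assuming a frozen PyTorch prefix running in the COS: a large compute-tier batch is processed as a sequence of smaller micro-batches and the outputs are concatenated, so the batch the compute tier trains on is unchanged. The micro-batch size of 128 is an arbitrary value for illustration, not Hapi's policy.

```python
import torch

@torch.no_grad()  # the pushed-down layers are frozen, so no autograd state is kept
def extract_features(prefix, batch, micro_batch_size=128):
    """Run the frozen prefix over a large batch in smaller chunks.

    Peak GPU memory is bounded by one micro-batch's activations instead of the
    full batch's, while the concatenated result matches processing the whole
    batch at once (the prefix runs in eval mode with frozen weights, so it has
    no batch-dependent state such as active batch-norm updates).
    """
    prefix.eval()
    outputs = [prefix(chunk) for chunk in torch.split(batch, micro_batch_size)]
    return torch.cat(outputs, dim=0)

# Hypothetical usage: the compute tier still sees the full 1024-sample batch.
# features = extract_features(frozen_prefix, big_batch_of_1024)
```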
Hapi spans both the COS and compute tiers, is transparent to the user, and relies on inexpensive profiling runs. Hapi's design uses a stateless storage-side component alongside lightweight requests between the two tiers to simplify practical concerns regarding load balancing, scalability and failure resilience.
Specically, the contributions of this paper are:
(1)
Identifying and demonstrating the benets of applying
near-data computation techniques to TL on top of the
disaggregated COS.
(2)
A measurement study of DNN layer characteristics
across 7 popular DNNs (§3), including modern archi-
tectures like the Vision Transformer (ViT) and widely-
used CNNs such as ResNet and DenseNet. These char-
acteristics play an important role in our system design.
(3)
Hapi, an end-to-end system comprising two key design
techniques: DNN splitting (§4.3) and storage-side batch
size adaptation (§4.4) which reduce network transfers,
improve training runtime and enable increased push-
down concurrency in the COS.
(4)
An extensive evaluation (§5) showing up to 2.5
×
speed-
up in training runtime while choosing in 86.8% of cases
the best performing split point or one that is at most
5% o from the best.
The paper is organized as follows. We provide background in §2 and present our measurement study in §3. §4 presents the design and implementation of Hapi. We show experimental results in §5 and discuss related work in §6.
2 BACKGROUND
2.1 Transfer learning
TL democratizes ML by allowing knowledge from a model pre-trained on a large dataset to be adapted and reused (fine-tuned) for a different but related task [69]. In doing so, the DNN training time and the generalization error are reduced [26]. As models grow in size and complexity, TL has become increasingly essential for efficiently adapting pre-trained models to specific tasks with fewer resources, making their deployment more practical across various domains. The intuition behind TL is that the pre-trained model, often referred to as a backbone, captures generalizable embeddings (representations of the input data) which can be adapted or fine-tuned for new tasks rather than requiring a model to be trained from scratch. Examples of backbones include convolutional neural networks (CNNs) and Vision Transformers (ViTs) for computer vision [66], as well as models like BERT [20] and GPT [8] for natural language processing (NLP). Recent advances in Vision Transformers [17, 28, 68] exemplify this approach by providing scalable and reusable backbones that can be fine-tuned for a wide range of tasks, including classification, detection, and segmentation.
Figure 1: Overview of TL fine-tuning.
The key to TL lies in fine-tuning the pre-trained model, as depicted in Figure 1. Pre-training is usually done on a different system and often by a different entity, and is outside the scope of this paper. Traditionally, fine-tuning is divided further into two phases: (i) feature extraction, where embeddings or high-level representations are extracted from new input data using (partially or entirely) the pre-trained model, and then (ii) training, to create a new classifier using the extracted embeddings [26]. Typically, the early layers of
the pre-trained model, which capture more general features (e.g., edges, textures in vision models), are frozen during fine-tuning, i.e., the model weights of these layers (in red in Figure 1) are not updated with backpropagation. Every iteration (i.e., input batch processed) involves feature extraction followed by training. We refer to the last frozen layer as the freeze layer (or index).
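As a concrete illustration of fine-tuning with a freeze layer, the sketch below (in PyTorch, assuming a recent torchvision and a ResNet50 backbone) freezes the early layers, attaches a new classifier head, and passes only the unfrozen parameters to the optimizer. The number of classes and the choice of freeze point are hypothetical, for illustration only.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load a backbone pre-trained on a large dataset (here, ImageNet weights).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the original classifier with a new head for the target task.
num_classes = 10  # hypothetical fine-tuning task
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Freeze everything up to the "freeze layer"; here we freeze all layers
# except the last residual stage and the new head (an illustrative choice).
for name, param in backbone.named_parameters():
    if not (name.startswith("layer4") or name.startswith("fc")):
        param.requires_grad = False  # frozen: used only for feature extraction

# Only the unfrozen parameters are re-trained.
optimizer = optim.SGD(
    (p for p in backbone.parameters() if p.requires_grad), lr=1e-3, momentum=0.9
)
```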
2.2 Cloud object stores
Cloud object stores (COS), such as Amazon S3 [2], Google Cloud Storage [27], and Azure Blob Storage [11], are a popular way to store large-scale unstructured data, providing ease of use, high availability, high scalability, and durability at a low cost [51]. COS are the prime example of storage disaggregation. The COS is connected to the compute tier by a network that, unfortunately, is a bottleneck [23, 46, 47, 60] even when the network is maxed out. The reason is hardware trends. The network bandwidth between the COS and the compute tier is lower than the internal storage bandwidth of COS servers and also lower than the computation throughput [23, 49]. The typical network bandwidth of a single cloud server is 25-400 Gbps (3.125-50 GBps) [23, 38, 61, 71]. A single modern NVMe SSD can read sequentially at well over 10 GBps [10, 48], so a couple of NVMe SSDs are sufficient to max out the network bandwidth. In practice, storage servers are provisioned with many SSDs; an array of PCIe 5.0 NVMe SSDs can exceed 100 GBps in read throughput [48]. Other storage media are faster than SSDs, further aggravating the network bottleneck. With sufficient thread parallelism, DRAM reads can exceed 100 GBps, and persistent memory reads can exceed 30 GBps [72]. Compute throughput also exceeds network bandwidth [23, 49]. Earlier studies [46, 60] have reported network read throughput as low as 100 MBps per connection from Amazon S3, but the trend is for network bandwidth to improve. More recently, up to 100 Gbps from general purpose instances to S3 has been reported [21].
In light of this enduring network bottleneck, an important trend has been to push down computation inside the COS, to reduce the amount of data sent over the network. Pushdowns were initially restricted to a subset of SQL (e.g., Amazon S3 Select [7]), but there is a renewed effort in the industry to support more complex pushdowns for computations such as image processing [13] or analytics [1]. There has been growing interest in pushing down parts of ML computations to storage [45, 67]. This trend goes hand-in-hand with another development, enabling pushdowns to use specialized hardware (§2.3).
Despite these trends, two challenges remain for COS users. The network may remain a bottleneck despite the pushdowns, and the COS computational resources are scarce and need to be used efficiently as they are only meant to mitigate the network bottleneck and not replace the compute tier.
2.3 Hardware-accelerated pushdowns
Pushdowns were initially restricted to a subset of SQL, including filtering, projecting, and aggregation (e.g., Amazon S3 Select [7]). The current, natural trend is to offer the benefits of pushing down to a wider range of applications. Unfortunately, restricting pushdowns to CPUs can lead to wasted resources and performance. First, for more complex operations, CPUs can become a bottleneck. Studies show that even with 32 cores, an SGD optimizer can fully utilize the CPU when using a 100 Gbps network [42]. Second, it is not sufficient for the CPU processing to be just faster than the network, because the output of a pushdown may be smaller than its input. For example, for a pushdown to generate output at 100 Gbps (12.5 GBps), assuming an input/output ratio of 2, it needs to process input at 25 GBps. Finally, the aggregate storage bandwidth of a storage server tends to increase faster than CPU capabilities [70].
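A back-of-the-envelope helper makes this sizing explicit; the 100 Gbps output rate and the 2:1 ratio below are the example values from the text, while the function itself is ours for illustration.

```python
def required_input_rate_gbyte_per_s(output_gbit_per_s: float, input_output_ratio: float) -> float:
    """Input throughput a pushdown must sustain to keep the network link busy."""
    output_gbyte_per_s = output_gbit_per_s / 8       # 100 Gbps -> 12.5 GBps
    return output_gbyte_per_s * input_output_ratio   # 2:1 ratio -> 25 GBps

print(required_input_rate_gbyte_per_s(100, 2))  # 25.0
```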
As a result, the current trend is to allow pushdowns to use specialized hardware such as GPUs. Several works [12, 31] have proposed to use them in storage systems to speed up erasure coding. Finally, there is a push to more closely integrate storage with GPUs, which further increases the appeal of next-to-storage GPUs. For example, IBM's Storage Scale System 6000 [39] integrates NVIDIA GPUDirect Storage [57], enabling a direct data path between GPU memory and storage to reduce latency and enhance performance for AI and data-intensive workloads.
3 MEASUREMENT STUDY
Next, we present a detailed measurement study of 7 DNNs. These include a state-of-the-art Vision Transformer [66] as well as several widely-adopted foundational models such as ResNet50 [33], DenseNet121 [36], and VGG19 [64]. These models cover a diverse range of architectural characteristics, making them well-suited for evaluating system-level performance in terms of speed and resource efficiency. We characterize the per-layer properties across three dimensions: output size, compute time, and maximum GPU memory used. These properties all play a role in Hapi's design (§4). Additional layer-related information for each DNN can be found in Table 2. For the DNNs structured as a sequence of blocks (e.g., ResNets) we count one block as one layer and we use only block boundaries as candidate split points. The input dataset is ImageNet. For readability purposes we group the models into 3 sub-groups in each figure.
Hardware setup. For this section we use two identical GPU-accelerated machines from a public cloud, one for the Hapi client and the other for the COS and the Hapi server.
Figure 2: A measurement study of 3 per-layer properties for 7 popular DNNs (AlexNet, Transformer, ResNet18, ResNet50, DenseNet, Vgg11, Vgg19; x-axis: layer index): (a) per-layer output sizes, (b) per-layer forward pass time, (c) per-layer max GPU memory usage.
Each machine has an Intel Xeon Gold 6278C CPU with 16
cores, 64 GB RAM, 1 Tesla T4 GPU with 16 GB RAM, a 300
GB SSD and runs Ubuntu 20.04 server 64 bit with CUDA 11.2.
The network bandwidth between VMs is 12 Gbps.
Per-layer output size. Figure 2a shows the per-layer output size. For reference, the size of a pre-processed ImageNet input tensor is shown with the horizontal dotted blue line. This output size is for a batch size of 1. One can accurately extrapolate from this by multiplying by a specific batch size. The important observation is that for most models, early on in the DNN structure, there exist layers with a comparatively small output size, often significantly smaller than subsequent layers (e.g., ResNet layer 4, AlexNet layers 3 and 6). These layers are good candidates for splitting the TL application between the COS and the compute tier. Overall, the layer output size generally increases in the beginning (with convolution layers) and then decreases (with pooling layers), but not in a monotonic fashion. While only a few split points have an output size smaller than the pre-processed tensor, we will show that this is not strictly necessary for performance gains because optimizing the network time is only part of the story. Hapi owes its benefits to balancing network time against the computation time in both tiers.
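A study like this can be reproduced with PyTorch forward hooks; the sketch below records each top-level layer's output size for a single pre-processed ImageNet-shaped input. It is an illustrative measurement harness, not the paper's exact methodology, and it assumes the block granularity used in the study (one top-level child per "layer").

```python
import torch
import torch.nn as nn
from torchvision import models

def per_layer_output_sizes(model: nn.Module, sample: torch.Tensor):
    """Return (layer_name, output_MB) for each top-level child of the model."""
    sizes, hooks = [], []

    def make_hook(name):
        def hook(_module, _inputs, output):
            out = output if isinstance(output, torch.Tensor) else output[0]
            sizes.append((name, out.element_size() * out.nelement() / 2**20))
        return hook

    for name, child in model.named_children():   # block granularity, as in the study
        hooks.append(child.register_forward_hook(make_hook(name)))
    model.eval()
    with torch.no_grad():
        model(sample)
    for h in hooks:
        h.remove()
    return sizes

# Example: batch size 1, pre-processed ImageNet tensor (3 x 224 x 224).
print(per_layer_output_sizes(models.resnet18(), torch.randn(1, 3, 224, 224)))
```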
Per-layer computation time. Figure 2b shows the per-layer computation time for the forward pass for a batch size of 128. The insights are that some models are more lightweight (e.g., AlexNet) than others and that some models (e.g., DenseNet) show significant variability across layers. However, the forward pass time remains relatively stable with an increase in layer index and gives Hapi the flexibility to balance computation time across the two tiers.
Per-layer maximum GPU memory usage. Figure 2c shows the maximum amount of GPU memory needed for the processing of each layer for a batch size of 128. Again, the variation is model dependent and for each model there is variability between layers. Generally, the first few layers use more memory, suggesting that these need to be the focal point if reducing memory consumption is desired.
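Peak memory per layer can be approximated with PyTorch's CUDA memory statistics; the sketch below measures the peak allocation while each top-level layer processes the running activation for a batch of 128. It is a simplified harness that assumes a CUDA device and a torchvision ResNet-style backbone whose top-level children can be chained manually (with a flatten before the final fc).

```python
import torch
from torchvision import models

def per_layer_peak_memory(model, batch, device="cuda"):
    """Return (layer_name, peak_GB) while each top-level layer runs in turn."""
    model = model.to(device).eval()
    x = batch.to(device)
    peaks = []
    with torch.no_grad():
        for name, layer in model.named_children():
            if name == "fc":                       # classifier expects flattened input
                x = torch.flatten(x, 1)
            torch.cuda.reset_peak_memory_stats(device)
            x = layer(x)                           # activation flows layer to layer
            torch.cuda.synchronize(device)
            peaks.append((name, torch.cuda.max_memory_allocated(device) / 2**30))
    return peaks

# Example: ResNet18 with a batch of 128 ImageNet-sized inputs.
# print(per_layer_peak_memory(models.resnet18(), torch.randn(128, 3, 224, 224)))
```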
The need for a dynamic splitting algorithm. We next motivate Hapi's dynamic splitting approach by showing the limitations of static splitting approaches.
Figure 3 shows 7 groups of bars. Each group represents a model (listed on the x-axis) and the 7 bars in each group represent different batch sizes (128, 256, 512, 1024, 1536, 2048, 3072 from left to right). The figure summarizes a performance sweep over all possible split points in the DNNs (after each layer). The figure shows per-epoch runtime normalized to splitting at the freezing layer (@Freeze). The points (blue rectangle, red circle) show two other sensible static splits: (1) with red circles, @Min, splitting at the layer with the smallest output (earliest such layer as a tiebreaker) and (2) with blue rectangles, NoSplit, i.e., sending the entire input to the compute tier. The lines show the speed-up range for all other splits, except the three mentioned above (@Freeze, @Min, NoSplit). If some of the static splits do not appear (e.g., Vgg11, Vgg19, Transformer) it is because they cause OOMs on the client. However, @Freeze (y-axis value = 1) never OOMs.
Figure 3: Speed-up when splitting at various layers, normalized to splitting at the freezing layer (y-axis: speed-up vs. splitting at the freeze layer; 7 batch sizes per model, left to right: 128, 256, 512, 1024, 1536, 2048, 3072; markers: @Min, @Freeze = 1, NoSplit).
Several insights can be derived from the figure, which point to the need for an intelligent dynamic splitting algorithm. First, no static splitting strategy wins in all cases. Second, even though one splitting strategy is the best option