the pre-trained model, which capture more general features
(e.g., edges, textures in vision models), are frozen during
fine-tuning, i.e., the model weights of these layers (in red in
Figure 1) are not updated with backpropagation. Every itera-
tion (i.e., input batch processed) involves feature extraction
followed by training. We refer to the last frozen layer as the
freeze layer (or index).
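To make this setup concrete, the following is a minimal PyTorch sketch (our own, not Hapi's code) of freezing everything up to a hypothetical freeze index in a ResNet50 so that backpropagation only updates the remaining layers; the freeze index of 6 and the training hyperparameters are illustrative assumptions.

import torch
import torchvision

# Minimal sketch (not Hapi's code): freeze the first `freeze_idx` top-level
# blocks of a ResNet50 so that only the remaining layers are trained.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
blocks = list(model.children())            # conv1, bn1, ..., layer4, avgpool, fc
freeze_idx = 6                             # hypothetical freeze layer (last frozen block)

for block in blocks[:freeze_idx]:
    for p in block.parameters():
        p.requires_grad = False            # frozen: never updated by backpropagation

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

# One iteration: feature extraction through the frozen layers, then training.
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()                            # gradients flow only into the unfrozen layers
optimizer.step()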
2.2 Cloud object stores
Cloud object stores (COS), such as Amazon S3 [2], Google Cloud Storage [27], and Azure Blob Storage [11], are a popular way to store large-scale unstructured data, providing ease of use, high availability, high scalability, and durability at a low cost [51]. COS are the prime example of storage disaggregation. The COS is connected to the compute tier by a network that, unfortunately, is a bottleneck [23, 46, 47, 60]
even when the network is fully utilized. The reason lies in hardware trends. The network bandwidth between the COS and the compute tier is lower than the internal storage bandwidth of COS servers and also lower than the computation throughput [23, 49]. The typical network bandwidth of a single cloud server is 25-400 Gbps (3.125-50 GBps) [23, 38, 61, 71].
A single modern NVMe SSD can read sequentially at well over 10 GBps [10, 48], so a couple of NVMe SSDs are sufficient to max out the network bandwidth. In practice, storage servers are provisioned with many SSDs; an array of PCIe 5.0 NVMe SSDs can exceed 100 GBps in read throughput [48]. Other storage media are faster than SSDs, further aggravating the network bottleneck. With sufficient thread parallelism, DRAM reads can exceed 100 GBps, and persistent memory reads can exceed 30 GBps [72]. Compute throughput also exceeds network bandwidth [23, 49]. Earlier studies [46, 60] have reported network read throughput as low as 100 MBps per connection from Amazon S3, but the trend is for network bandwidth to improve. More recently, up to 100 Gbps from general-purpose instances to S3 has been reported [21].
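As a back-of-envelope illustration of this mismatch, the snippet below compares a 400 Gbps server link with NVMe sequential read bandwidth; the numbers are the illustrative figures quoted above, not new measurements.

# Back-of-envelope comparison using the numbers quoted above (illustrative only).
NETWORK_GBPS = 400                 # fast end of the 25-400 Gbps per-server range
SSD_READ_GBPS = 10 * 8             # one NVMe SSD: ~10 GBps sequential read

network_gbyte_per_s = NETWORK_GBPS / 8
ssds_to_saturate = NETWORK_GBPS / SSD_READ_GBPS

print(f"network link: {network_gbyte_per_s:.1f} GBps")
print(f"SSDs needed to saturate it: {ssds_to_saturate:.0f}")
# -> 50.0 GBps and 5 SSDs: a handful of drives max out even a 400 Gbps link,
#    while a server full of PCIe 5.0 SSDs can read at >100 GBps internally.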
In light of this enduring network bottleneck, an important trend has been to push down computation inside the COS, to reduce the amount of data sent over the network. Pushdowns were initially restricted to a subset of SQL (e.g., Amazon S3 Select [7]), but there is a renewed effort in the industry to support more complex pushdowns for computations such as image processing [13] or analytics [1]. There has been growing interest in pushing down parts of ML computations to storage [45, 67]. This trend goes hand-in-hand with another development, enabling pushdowns to use specialized hardware (§2.3).
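For concreteness, a SQL pushdown of this kind can be issued from the compute tier with a few lines of boto3; the sketch below assumes a CSV object, and the bucket, key, and column names are placeholders.

import boto3

# Sketch of a SQL pushdown with S3 Select; only the selected rows/columns
# cross the network. Bucket, key, and column names are placeholders.
s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="example-bucket",
    Key="data/records.csv",
    ExpressionType="SQL",
    Expression="SELECT s.id, s.label FROM S3Object s WHERE CAST(s.score AS FLOAT) > 0.9",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:              # event stream of result chunks
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")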
Despite these trends, two challenges remain for COS users.
The network may remain a bottleneck despite the push-
downs, and the COS computational resources are scarce and
need to be used efficiently, as they are only meant to mitigate
the network bottleneck and not replace the compute tier.
2.3 Hardware-accelerated pushdowns
Pushdowns were initially restricted to a subset of SQL, including filtering, projection, and aggregation (e.g., Amazon S3 Select [7]). The current, natural trend is to offer the benefits of pushing down to a wider range of applications. Unfortunately, restricting pushdowns to CPUs can waste resources and limit performance. First, for more complex operations, CPUs can become a bottleneck. Studies show that even with 32 cores, an SGD optimizer can fully utilize the CPU when using a 100 Gbps network [42]. Second, it is not sufficient for the CPU processing to merely keep pace with the network, because the output of a pushdown may be smaller than its input. For example, for a pushdown to generate output at 100 Gbps, assuming an input/output ratio of 2, it needs to process input at 200 Gbps, i.e., 25 GBps. Finally, the aggregate storage bandwidth of a storage server tends to increase faster than CPU capabilities [70].
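The arithmetic behind that example is simply the output rate scaled by the input/output ratio; the small helper below (names and numbers are ours, for illustration) makes this explicit.

# Required input processing rate for a pushdown, given its output rate and
# input/output (selectivity) ratio. Names and numbers are illustrative.
def required_input_gbyte_per_s(output_gbps: float, io_ratio: float) -> float:
    return (output_gbps * io_ratio) / 8    # Gbps -> GBps

print(required_input_gbyte_per_s(output_gbps=100, io_ratio=2))   # 25.0 GBps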
As a result, the current trend is to allow pushdowns to use specialized hardware such as GPUs. Several works [12, 31] have proposed to use them in storage systems to speed up erasure coding. Finally, there is a push to more closely integrate storage with GPUs, which further increases the appeal of next-to-storage GPUs. For example, IBM's Storage Scale System 6000 [39] integrates NVIDIA GPUDirect Storage [57], enabling a direct data path between GPU memory and storage to reduce latency and enhance performance for AI and data-intensive workloads.
3 MEASUREMENT STUDY
Next, we present a detailed measurement study of 7 DNNs. These include a state-of-the-art Vision Transformer [66] as well as several widely adopted foundational models such as ResNet50 [33], DenseNet121 [36], and VGG19 [64]. These models cover a diverse range of architectural characteristics, making them well suited for evaluating system-level performance in terms of speed and resource efficiency. We characterize the per-layer properties across three dimensions: output size, compute time, and maximum GPU memory used. These properties all play a role in Hapi's design (§4). Additional layer-related information for each DNN can be found in Table 2. For the DNNs structured as a sequence of blocks (e.g., ResNets), we count one block as one layer and use only block boundaries as candidate split points. The input dataset is ImageNet. For readability, we group the models into 3 sub-groups in each figure.
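As an illustration of how such per-layer numbers can be gathered, the sketch below (our own, not Hapi's profiler) uses PyTorch forward hooks to record output size, compute time, and peak GPU memory for each top-level block of a ResNet50; the batch size and model choice are arbitrary assumptions.

import time
import torch
import torchvision

# Hedged sketch (not Hapi's profiler): per-block output size, compute time,
# and peak GPU memory for a ResNet50, using forward hooks.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet50(weights=None).to(device).eval()
stats = []

def pre_hook(module, inputs):
    if device == "cuda":
        torch.cuda.synchronize()
    module._t0 = time.perf_counter()

def make_post_hook(name):
    def hook(module, inputs, output):
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - module._t0) * 1e3
        out_mib = output.numel() * output.element_size() / 2**20
        peak_mib = torch.cuda.max_memory_allocated() / 2**20 if device == "cuda" else 0.0
        stats.append((name, out_mib, elapsed_ms, peak_mib))
    return hook

# One top-level block (conv1, layer1..layer4, avgpool, fc) counts as one layer.
for name, block in model.named_children():
    block.register_forward_pre_hook(pre_hook)
    block.register_forward_hook(make_post_hook(name))

with torch.no_grad():
    model(torch.randn(16, 3, 224, 224, device=device))

for name, out_mib, ms, peak in stats:
    print(f"{name:10s} out={out_mib:8.2f} MiB  time={ms:6.2f} ms  peak_mem={peak:8.2f} MiB")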
Hardware setup. For this section we use two identical GPU-accelerated machines from a public cloud, one for the Hapi client and the other for the COS and the Hapi server.