Noise in the Clouds: Influence of Network
Performance Variability on Application Scalability
Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov,
Salvatore Di Girolamo, Tobias Rahn, and Torsten Hoefler
Department of Computer Science, ETH Zurich, Switzerland
{first-name.last-name}@inf.ethz.ch
Abstract—Cloud computing represents an appealing oppor-
tunity for cost-effective deployment of HPC workloads on the
best-fitting hardware. However, although cloud and on-premise
HPC systems offer similar computational resources, their net-
work architecture and performance may differ significantly. For
example, these systems use fundamentally different network
transport and routing protocols, which may introduce network
noise that can eventually limit the application scaling. This
work analyzes network performance, scalability, and cost of
running HPC workloads on cloud systems. First, we consider
latency, bandwidth, and collective communication patterns in
detailed small-scale measurements, and then we simulate network
performance at a larger scale. We validate our approach on
four popular cloud providers and three on-premise HPC systems,
showing that network (and also OS) noise can significantly impact
performance and cost both at small and large scale.
Index Terms—cloud; HPC; network noise; scalability;
I. INTRODUCTION
Due to flexibility and cost-effectiveness, running HPC ap-
plications in the cloud has become an appealing solution and a
potential alternative to on-premise systems [1], [2]. Scientific
applications from different domains already run on the cloud,
including multiphysics simulations [3], [4] and biomedical
applications [5], [6].
One of the main advantages of cloud computing is the
possibility to run an application on the most appropriate
computational resources in a cost-effective way. Instances
that can be deployed in the cloud come with a wide variety
of architectural characteristics in terms of memory, CPUs,
accelerators, and network bandwidth. On the CPU side, it is
possible to select between different processors, with different
numbers of cores, clock frequency, and architecture, ranging
from commercial off-the-shelf Intel and AMD processors to
custom ARM processors like the ARM Graviton processor
deployed by AWS [7]. Cloud providers also offer a wide
choice of accelerators that includes different types and gener-
ations of GPUs [8], TPUs (Tensor Processing Units) [9], and
FPGAs [10]. Similarly, different instances provide different
network bandwidths. Users can deploy instances with 100 Gb/s
networks on most major cloud providers and, in some cases,
even 200 Gb/s and 400 Gb/s instances (on Azure and AWS,
respectively). Finally, cloud vendors frequently deploy new
hardware, differently from on-premise HPC systems, where
compute resources have life-cycles spanning multiple years.
However, all this flexibility comes at a cost. Although we
can expect minor differences in the compute performance
between an HPC instance in the cloud and an equivalent
server in an on-premise HPC system [11], [12], the network
performance can significantly differ. Indeed, in some cases,
the network connecting those instances in the cloud signifi-
cantly differs from a traditional HPC network. For example,
packets might be routed using ECMP [13], [14], [15] in a
congestion-oblivious way, and can thus experience a higher
latency if multiple network flows are mapped on the same
paths [16], [17], [18]. On the contrary, HPC systems often
deploy adaptive routing to react more promptly to congestion
in the network [19], [20]. Also, differently from most HPC
systems, some providers do not use Remote Direct Memory
Access (RDMA), or run instances on tapered networks [15].
All these factors can contribute to increase network latency,
decrease network bandwidth, and increase network noise [21],
[22], [23], [24], [25] (i.e., performance variability induced by
the use of the network). This limits scalability and hampers
cost-effectiveness. Although HPC applications can scale up
to 42 million cores [26] on on-premise HPC systems, it is
still not clear how far HPC applications could scale on the
cloud. Traditionally, cloud environments have been considered
a good match for loosely coupled or embarrassingly parallel
workloads, but network performance has been seen as one
of the main bottlenecks preventing their adoption for tightly
coupled computations [27], [11], [28], [29], [30], [31].
Assessing the network performance and the impact of noise
on scalability is even more relevant if we consider that the
gap between compute and network performance is widening. For
example, from 2010 to 2018, the computational throughput of
the Top 500 HPC systems [32] increased by 65x, while the off-
node communication bandwidth only increased by 4.8x [33],
[34]. Thus we expect, in the future, network performance to
be even more relevant for HPC applications running on the
cloud.
In this work, we focus on network performance and noise,
assessing the impact on performance, scalability, and cost of
tightly-coupled HPC communication patterns at scale. Because
collecting statistically sound measurements at the scale of
thousands of HPC VMs would be too expensive (and on some
cloud providers not even feasible), we first perform detailed
network performance and noise measurement at small scale.
arXiv:2210.15315v2 [cs.DC] 1 Nov 2022
TABLE I
ANALYZED SYSTEMS: FOR EACH OF THEM WE DETAIL THE CPU, MEMORY, AND NETWORK CHARACTERISTICS. "C" INDICATES THE NUMBER OF PHYSICAL CORES. INSTANCE COSTS ARE AS OF JULY 18, 2022, FOR THE EAST US AVAILABILITY ZONE.

| System   | Instance        | Instance Type       | CPU                                         | Memory | Cost/hour (committed) | Cost/hour (on-demand) | Bandwidth | Network                              | Routing                  | Transport Protocol             |
|----------|-----------------|---------------------|---------------------------------------------|--------|-----------------------|-----------------------|-----------|--------------------------------------|--------------------------|--------------------------------|
| AWS      | Normal          | c5.18xlarge         | 2x18C Intel Xeon Platinum 8124M @ 3 GHz     | 144 GB | 1.34 USD              | 3.06 USD              | 25 Gb/s   | Fat Tree [14]                        | ECMP [14]                | SRD [14]                       |
| AWS      | HPC (Metal)     | c5n.metal           | 2x18C Intel Xeon Platinum 8124M @ 3 GHz     | 192 GB | 1.475 USD             | 3.88 USD              | 100 Gb/s  | Fat Tree [14]                        | ECMP [14]                | SRD [14]                       |
| AWS      | HPC             | c5n.18xlarge        | 2x18C Intel Xeon Platinum 8124M @ 3 GHz     | 192 GB | 1.475 USD             | 3.88 USD              | 100 Gb/s  | Fat Tree [14]                        | ECMP [14]                | SRD [14]                       |
| Azure    | Normal          | F72s v2             | 36C Intel Xeon Platinum 8370C/8272CL/8168   | 144 GB | 1.116 USD             | 3.045 USD             | 30 Gb/s   | Fat Tree [35]                        | ECMP [35]                | N.A.                           |
| Azure    | HPC             | HC44rs              | 2x22C Intel Xeon Platinum 8168 @ 2.70 GHz   | 352 GB | 2.218 USD             | 3.168 USD             | 100 Gb/s  | Non-Blocking Fat Tree [36]           | Static/Adaptive [37]     | InfiniBand [36]                |
| Azure    | HPC (200 Gb/s)  | HB120rs v2          | 2x60C AMD EPYC 7V12 @ 2.45 GHz              | 456 GB | 1.8 USD               | 3.6 USD               | 200 Gb/s  | Non-Blocking Fat Tree [36]           | Static/Adaptive [37]     | InfiniBand [36]                |
| GCP      | Normal          | c2-standard-60      | 2x15C Intel Cascade Lake @ 3.10 GHz         | 240 GB | 1.25 USD              | 3.1321 USD            | 32 Gb/s   | Jupiter (3:1 Blocking Fat Tree) [15] | ECMP [15]                | TCP/IP + Intel QuickData [38]  |
| GCP      | HPC             | c2-standard-60      | 2x15C Intel Cascade Lake @ 3.10 GHz         | 240 GB | 2.148 USD             | 4.03 USD              | 100 Gb/s  | Jupiter (3:1 Blocking Fat Tree) [15] | ECMP [15]                | TCP/IP + Intel QuickData [38]  |
| Oracle   | Normal          | VM.Optimized3.Flex  | 18C Intel Xeon Gold 6354 @ 3 GHz            | 256 GB | N.A.                  | 1.188 USD             | 40 Gb/s   | Non-Blocking Fat Tree [39]           | N.A.                     | N.A.                           |
| Oracle   | HPC (Metal)     | BM.Optimized3.36    | 2x18C Intel Xeon Gold 6354 @ 3 GHz          | 512 GB | N.A.                  | 2.712 USD             | 100 Gb/s  | Non-Blocking Fat Tree [39]           | N.A.                     | RoCEv2 [40]                    |
| Daint    | HPC (Metal)     | -                   | 2x18C Intel Xeon E5-2695 v4 @ 2.10 GHz      | 64 GB  | 1.02 USD [41]         | 1.73 USD [41]         | 82 Gb/s   | Cray Aries (Dragonfly) [42]          | Per-Packet Adaptive [42] | FMA [42]                       |
| Alps     | HPC (Metal)     | -                   | 2x64C AMD EPYC 7742 @ 2.25 GHz              | 256 GB | N.A.                  | N.A.                  | 100 Gb/s  | HPE Cray Slingshot (Dragonfly) [19]  | Per-Packet Adaptive [19] | RoCEv2 [19]                    |
| DEEP-EST | HPC (Metal)     | -                   | 2x12C Intel Xeon Gold 6146 @ 3.20 GHz       | 192 GB | N.A.                  | N.A.                  | 100 Gb/s  | Mellanox InfiniBand EDR (Fat Tree) [43] | Static/Adaptive       | InfiniBand [43]                |
On one side, we analyze this data to spotlight differences in
network performance and noise between different cloud and
on-premise HPC systems. On the other side, we use this data to
calibrate the LogGOPSim simulator [44], [45], and to simulate
the scalability and cost at a larger scale (up to 16K HPC VMs).
We define the concepts of latency noise and bandwidth
noise, and assess the network performance and its impact on
scalability of HPC and normal instances of four major cloud
providers and of three on-premise systems (with different net-
work technology). We also assess OS noise (i.e., performance
variability introduced by OS processes), and we show how
different types of noise impact application performance and cost
both at small and large scale, for both latency- and bandwidth-
dominated communication patterns.
We describe in Sec. II the main characteristics of HPC cloud
solutions, in Sec. III we analyze the network performance of
both cloud and on-premise HPC systems, with a focus on OS
and network noise in Sec. IV. Then, we simulate how noise
affects performance at scale in Sec. V, and discuss related
work in Sec. VII. Finally, Sec. VIII draws conclusions.
II. HPC IN THE CLOUD
In this section we measure and analyze the network per-
formance of HPC systems in the cloud at a small scale, to
understand better some peculiarities and limitations of those
systems. In this paper, we analyze four of the major cloud
providers: Amazon AWS [46], Google GCP [47], Microsoft
Azure [48], and Oracle Cloud [49]. We also analyze three on-
premise HPC systems: Piz Daint [50] (referred to as Daint in the
following) and Alps [51], both deployed at the Swiss National
Supercomputing Centre, and DEEP-EST [43], deployed at the
Jülich Supercomputing Centre. We analyze cloud instances
of different types, including HPC instances (with different
network bandwidth) and normal compute instances. We outline
the different analyzed systems, instance types, and their main
characteristics in Table I. In the following we analyze in detail
the different instance types (Sec. II-A), their network features
(Sec. II-B), and their cost (Sec. II-C).
A. Instances, CPUs, and OS
In the following, with the term HPC instances, we refer
to those instances providing at least 100 Gb/s networking.
For AWS, we evaluate both bare-metal and non bare-metal
HPC instances. Azure and GCP provide only non bare-metal
HPC instances, whereas Oracle only provides bare-metal HPC
instances. To have a fair comparison, we selected instance
types with similar CPUs when possible. We used Intel CPUs
on all the cloud instances except for the 200 Gb/s instances
of Azure, which only have AMD EPYC CPUs. For normal
instances, we selected those that provide a similar network
bandwidth and core count. For completeness, we also report
the amount of RAM on each instance type.
All four providers guarantee that HPC instances are run
on separate physical servers. For the normal instances we
selected CPUs with a high core count to have them allocated
on two separate servers. This is necessary to ensure that when
measuring network performance the two VMs are actually
using the network. For the cloud providers we report in the
Instance Type column the name of the instances we used.
On all cloud providers, we use the virtual machine (VM)
images and operating system suggested for the HPC instances.
These were: Amazon Linux 2 on AWS [52], CentOS 7.7 on
Azure [53], CentOS 7.9 on GCP [54], and Oracle Linux 7.9 on
Oracle. Daint and Alps run a Cray Linux Environment (CLE)
OS based on SUSE Linux Enterprise Server v15.2, and DEEP-
EST runs Rocky Linux v8.5.
B. Network
The four cloud providers and the DEEP-EST system deploy
a fat tree topology. According to the most recent documenta-
tion we found, Azure, Oracle, and DEEP-EST deploy a non-
blocking network [36], [55], GCP a 3:1 blocking network [15],
whereas we did not find any additional detail on network
over- or under-provisioning for AWS. Both AWS and GCP use
ECMP routing [13], Azure employs adaptive routing [37] for
HPC instances, and for Oracle we did not find any information
on routing. The routing protocol plays a crucial role in network
performance. For example, ECMP is congestion oblivious and
might suffer from flow collisions [16], [17], [18], increasing
the network bandwidth variability (see Sec. IV-C). Daint and
Alps deploy a dragonfly interconnect (Cray Aries [42] and
Slingshot [19] respectively) with adaptive routing.
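To illustrate why congestion-oblivious ECMP can cause flow collisions, consider the following sketch. The hash function, path count, and addresses are hypothetical stand-ins (real switch ASICs use their own hardware hash), but the mechanism is the same: the path is a pure function of the flow's 5-tuple, so concurrent flows can land on the same path regardless of load:

```python
import hashlib

NUM_PATHS = 4  # equal-cost paths between two leaf switches (assumed topology)

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Pick a path by hashing the flow 5-tuple (stand-in for a switch ASIC hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PATHS

# Four flows between the same pair of hosts, differing only in source port:
flows = [("10.0.0.1", "10.0.1.1", port, 4791) for port in (1000, 1001, 1002, 1003)]
paths = [ecmp_path(*f) for f in flows]
print(paths)  # several flows may hash onto the same path even though 4 exist
```

In this light, SRD's reaction to a rising RTT amounts to re-randomizing the hashed fields, hoping the flow lands on a less loaded path; it cannot deterministically pick the least congested one.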
Each of the evaluated cloud providers uses a different
transport protocol. AWS provides its proprietary RDMA-
like protocol called SRD (Scalable Reliable Datagram) [14],
which resembles in some aspects InfiniBand verbs [56]. It
provides reliable out-of-order delivery of packets and uses a
custom congestion control protocol. The AWS Nitro Card [57]
implements the reliability layer, and the Elastic Fabric Adapter
(EFA) provides OS-bypass capabilities. To react to congestion,
SRD monitors the round trip time (RTT) and forces packets to
be routed differently by changing some of the fields used by
ECMP to select the path. This approach is probabilistic and
might allow avoiding congested paths but, unlike truly adaptive
routing, it does not allow selecting the least congested path,
nor any specific path.
Azure and DEEP-EST use RDMA through InfiniBand [36],
Oracle uses RDMA over Converged Ethernet (RoCEv2) [40],
whereas GCP does not use RDMA and relies on TCP/IP.
To minimize data movement overheads, GCP uses Intel’s
QuickData DMA Engines [58] to offload payload copies of
larger packets. Daint uses a proprietary RDMA protocol [42]
(FMA), whereas Alps uses RoCEv2 [19].
C. Cost
Table I shows the per-hour cost charged to the user as
of July 18, 2022. For the cloud systems, we report the cost
for the East US availability zone. We consider both the cost
for a committed 3-year usage with upfront payment and the
on-demand cost without any minimum commitment. Note
that 3 years is the maximum commitment allowed on those
providers (and leads to the lowest per-hour cost), whereas no
commitment yields the highest per-hour cost. For Daint, we
report the cost for a minimum usage of 10,000 compute hours,
as well as the on-demand
cost, both for non-academic partners. Academic partners have
discounted rates and this would otherwise lead to an unfair
comparison. For Alps and DEEP-EST there is no publicly
available information on the per-hour cost.
On AWS, the main difference between the normal and
HPC instances we selected is the support for Elastic Fabric
Adapter (EFA), which provides the 100 Gb/s networking.
Thus, we can estimate the 3-year committed cost of the
high-performance network at around 0.135 USD per hour
per VM, and an on-demand cost of 0.82 USD. Similarly, we
selected the same instance type for normal and HPC instances
on GCP. The only difference is that we enabled the so-called
Tier 1 network on the HPC instance, which provides 100 Gb/s
network bandwidth. On GCP, we can thus estimate the cost of
the HPC network at around 0.9 USD per hour per VM [59]
(both for the committed and on-demand cost). Unfortunately,
Azure and Oracle do not provide the same instance in HPC
and non-HPC flavors, and it is thus not possible to isolate the
cost of the HPC network from the rest. Also, we observe that
whereas the on-demand cost of 100 Gb/s instances is lower
than that of 200 Gb/s instances, this is not true for the
3-year committed usage. Indeed, at the time of writing,
committing to a 3-year usage led to a 30% discount for 100
Gb/s instances, and to a 50% discount for the 200 Gb/s ones.
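The network-cost estimates above follow directly from the price differences in Table I; a minimal sketch of the arithmetic (prices hard-coded from the table):

```python
# AWS: c5n.18xlarge (HPC, 100 Gb/s EFA) minus c5.18xlarge (normal, 25 Gb/s)
# isolates the price of the high-performance network.
aws_committed_premium = 1.475 - 1.34  # USD/hour per VM
aws_ondemand_premium = 3.88 - 3.06

# Azure: discount obtained by committing for 3 years, per bandwidth tier.
azure_100g_discount = 1 - 2.218 / 3.168  # HC44rs, 100 Gb/s
azure_200g_discount = 1 - 1.8 / 3.6      # HB120rs v2, 200 Gb/s

print(round(aws_committed_premium, 3))  # 0.135
print(round(aws_ondemand_premium, 2))   # 0.82
print(round(azure_100g_discount, 2))    # 0.3
print(round(azure_200g_discount, 2))    # 0.5
```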
III. NETWORK PERFORMANCE
We measure network performance using the Netgauge
tool [60], which provides detailed, sample-by-sample measure-
ments (fundamental for estimating network noise in Sec. IV).
We used the Message Passing Interface (MPI) backend and, on
each system, the MPI library recommended by the provider1.
a) Methods: We created an account on each provider,
and used our own funding and/or academic credits, without
coordinating with the providers. The clusters have been cre-
ated and tuned following the guidelines publicly available
1On Azure we used HPC-X v2.8.3 on HPC instances and Open MPI
v4.1.2 on normal instances. We used Open MPI v4.1.1 on AWS, Intel MPI
v2018.4.274 on GCP, Open MPI v4.0.4 on Oracle, Cray MPICH v7.7.18 on
Daint, Cray MPICH v8.1.12 on Alps, and Open MPI v4.1.3 on DEEP-EST.
[Fig. 1: Bandwidth for HPC instances (AWS and GCP) as a function of message size and of the number of concurrent connections (1, 2, 4, 8, 16) between the two servers. Inner plots show RTT/2 for small messages.]
in the documentation of the cloud providers. After running
the benchmarks, we contacted the leads of the cloud busi-
ness of each of the providers, sharing a draft of the paper
with them. They assessed the correctness of our evalua-
tion, and we integrated their feedback in the paper. Only
in one case we improved the performance by applying a
technique not described in the publicly available documen-
tation, that we describe in the text (see the comment about
the FI_EFA_TX_MIN_CREDITS in Sec. III-A). On all the
providers, if not specified otherwise, we allocated the two VMs
(or the two servers) on the same rack. The only exception
is Oracle, where it is not possible to explicitly control the
allocation. We analyze in detail the impact of allocation on
performance and noise in Sec. IV.
A. Bandwidth Saturation
All four analyzed cloud providers claim a 100 Gb/s
bandwidth on Intel-based HPC instances. However, this is true
only under certain conditions. For example, AWS documents
a maximum per-message bandwidth of 25 Gb/s [61]. Even if
not explicitly documented, we observed similar limitations on
GCP. One possible explanation for this behavior is that even
though the instance exposes a single 100 Gb/s NIC, it might be
equipped with multiple 25 Gb/s NICs (or a multi-port NIC).
While some providers explicitly documented this for non-HPC
instances, the specific configuration is often unclear for HPC
ones.
For this reason, we can expect a higher bandwidth when
sending a message over multiple connections. To assess if this
is the case, we run a ping-pong benchmark between two nodes.
We establish multiple concurrent connections between the two
nodes, by running multiple processes per node and letting each
pair of processes send/receive disjoint parts of the message.
For example, a 16 MiB ping-pong with 16 processes per node
runs 16 concurrent ping-pongs between the 16 processes on the
first node and the 16 processes on the second node, each with a
1 MiB message.
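The message splitting described above can be sketched as follows. This is a simplified model of the benchmark's data partitioning (the function name and layout are ours, not the actual Netgauge code): each process pair receives a disjoint, contiguous slice of the buffer.

```python
def split_message(total_bytes, num_procs):
    """Assign each of num_procs process pairs a disjoint, contiguous slice."""
    base, rem = divmod(total_bytes, num_procs)
    offsets = []
    start = 0
    for rank in range(num_procs):
        size = base + (1 if rank < rem else 0)  # spread any remainder
        offsets.append((rank, start, size))
        start += size
    return offsets

# 16 MiB ping-pong with 16 process pairs -> each pair moves exactly 1 MiB.
chunks = split_message(16 * 1024 * 1024, 16)
assert all(size == 1024 * 1024 for _, _, size in chunks)
```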
We report the results of this experiment for AWS and GCP
in Figure 1. We report the bandwidth as the message size
divided by half the round trip time (RTT/2), and the inner plots
show the RTT/2 for small messages. Each point in the plot is
the average over 1000 runs, whereas the band around the point
represents the standard deviation. We do not report the results
for the other systems since they can saturate the bandwidth
even with a single connection (we show results in the next
section). On AWS we increased the bandwidth by increasing
the maximum number of in-flight packets to 1024 (by setting
the FI_EFA_TX_MIN_CREDITS environment variable).
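The reported bandwidth and the band around each point can be derived from per-sample RTTs as in this sketch (the RTT values below are synthetic placeholders, not measured data):

```python
from statistics import mean, stdev

def bandwidth_gbps(message_bytes, rtt_us):
    """Bandwidth as message size divided by half the round-trip time (RTT/2)."""
    one_way_s = (rtt_us / 2) * 1e-6
    return message_bytes * 8 / one_way_s / 1e9

# Synthetic RTT samples (microseconds) for a 1 MiB message; in the paper,
# each point averages 1000 such samples and the band is their standard deviation.
rtts = [180.0, 185.0, 178.0, 210.0, 182.0]
samples = [bandwidth_gbps(1024 * 1024, r) for r in rtts]
print(f"mean={mean(samples):.1f} Gb/s, stdev={stdev(samples):.1f} Gb/s")
```

The same per-sample data later feeds the noise analysis: the spread of these values, not just their mean, is what characterizes bandwidth noise.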
On both AWS and GCP, the bandwidth increases with the
number of concurrent communications (up to 80 Gb/s with 16
processes per node). Also, when using a single connection,
the bandwidth drops for messages larger than 4 MiB. This
is caused by a more-than-linear increase in last-level cache
(LLC) misses, which we measured using the perf tool. For
example, on AWS, we observe a 4× increase in LLC misses
when going from 1 MiB to 4 MiB messages, but an 8× increase
when moving from 4 MiB to 16 MiB messages.
This effect is not present when using more concurrent com-
munications because the message is split among the processes,
each transmitting a smaller message.
We also observe that having more processes per node
increases the RTT of small messages, due to additional
overhead and contention on the NIC access. For this reason,
only large messages should be sent with multiple concurrent
connections. Instead of having more processes sending a part
of the message each, we could have a single process sending
multiple smaller messages. For example, some MPI libraries
provide the possibility to stripe messages transparently over
multiple connections (e.g., by using the btl_tcp_links
command line flag on Open MPI [62]). However, we did not
observe any performance improvement compared to the single
connection case.
Observation 1: On AWS and GCP, the peak bandwidth
on a single connection is 50 Gb/s and 30 Gb/s, respectively.
A bandwidth of 80 Gb/s can only be reached by forcing
messages to be concurrently sent/received by multiple
processes over different connections.