Noise in the Clouds: Influence of Network
Performance Variability on Application Scalability
Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov,
Salvatore Di Girolamo, Tobias Rahn, and Torsten Hoefler
Department of Computer Science, ETH Zurich, Switzerland
{first-name.last-name}@inf.ethz.ch
Abstract—Cloud computing represents an appealing oppor-
tunity for cost-effective deployment of HPC workloads on the
best-fitting hardware. However, although cloud and on-premise
HPC systems offer similar computational resources, their net-
work architecture and performance may differ significantly. For
example, these systems use fundamentally different network
transport and routing protocols, which may introduce network
noise that can eventually limit the application scaling. This
work analyzes network performance, scalability, and cost of
running HPC workloads on cloud systems. First, we consider
latency, bandwidth, and collective communication patterns in
detailed small-scale measurements, and then we simulate network
performance at a larger scale. We validate our approach on
four popular cloud providers and three on-premise HPC systems,
showing that network (and also OS) noise can significantly impact
performance and cost both at small and large scale.
Index Terms—cloud; HPC; network noise; scalability;
I. INTRODUCTION
Due to flexibility and cost-effectiveness, running HPC ap-
plications in the cloud has become an appealing solution and a
potential alternative to on-premise systems [1], [2]. Scientific
applications from different domains already run on the cloud,
including multiphysics simulations [3], [4] and biomedical
applications [5], [6].
One of the main advantages of cloud computing is the
possibility to run an application on the most appropriate
computational resources in a cost-effective way. Instances
that can be deployed in the cloud come with a wide variety
of architectural characteristics in terms of memory, CPUs,
accelerators, and network bandwidth. On the CPU side, it is
possible to select between different processors, with different
numbers of cores, clock frequency, and architecture, ranging
from commercial off-the-shelf Intel and AMD processors to
custom ARM processors like the ARM Graviton processor
deployed by AWS [7]. Cloud providers also offer a wide
choice of accelerators that includes different types and gener-
ations of GPUs [8], TPUs (Tensor Processing Units) [9], and
FPGAs [10]. Similarly, different instances provide different
network bandwidths. Users can deploy instances with 100 Gb/s
networks on most major cloud providers and, in some cases,
even 200 Gb/s and 400 Gb/s instances (on Azure and AWS,
respectively). Finally, cloud vendors frequently deploy new
hardware, differently from on-premise HPC systems, where
compute resources have life-cycles spanning multiple years.
However, all this flexibility comes at a cost. Although we
can expect minor differences in the compute performance
between an HPC instance in the cloud and an equivalent
server in an on-premise HPC system [11], [12], the network
performance can significantly differ. Indeed, in some cases,
the network connecting those instances in the cloud signifi-
cantly differs from a traditional HPC network. For example,
packets might be routed using ECMP [13], [14], [15] in a
congestion-oblivious way, and can thus experience a higher
latency if multiple network flows are mapped on the same
paths [16], [17], [18]. On the contrary, HPC systems often
deploy adaptive routing to react more promptly to congestion
in the network [19], [20]. Also, differently from most HPC
systems, some providers do not use Remote Direct Memory
Access (RDMA), or run instances on tapered networks [15].
All these factors can contribute to increase network latency,
decrease network bandwidth, and increase network noise [21],
[22], [23], [24], [25] (i.e., performance variability induced by
the use of the network). This limits scalability and hampers
cost-effectiveness. Although HPC applications can scale up
to 42 million cores [26] on on-premise HPC systems, it is
still not clear how far HPC applications could scale on the
cloud. Traditionally, cloud environments have been considered
a good match for loosely coupled or embarrassingly parallel
workloads, but network performance has been seen as one
of the main bottlenecks preventing their adoption for tightly
coupled computations [27], [11], [28], [29], [30], [31].
Assessing the network performance and the impact of noise
on scalability is even more relevant if we consider that the
gap between compute and network performance is widening. For
example, from 2010 to 2018, the computational throughput of
the Top 500 HPC systems [32] increased by 65x, while the off-
node communication bandwidth only increased by 4.8x [33],
[34]. Thus we expect, in the future, network performance to
be even more relevant for HPC applications running on the
cloud.
In this work, we focus on network performance and noise,
assessing the impact on performance, scalability, and cost of
tightly-coupled HPC communication patterns at scale. Because
collecting statistically sound measurements at the scale of
thousands of HPC VMs would be too expensive (and on some
cloud providers not even feasible), we first perform detailed
network performance and noise measurement at small scale.
arXiv:2210.15315v2 [cs.DC] 1 Nov 2022
TABLE I
ANALYZED SYSTEMS: FOR EACH OF THEM WE DETAIL THE CPU, MEMORY, AND NETWORK CHARACTERISTICS. "C" INDICATES THE NUMBER OF PHYSICAL CORES. INSTANCE COSTS ARE AS OF JULY 18, 2022, FOR THE EAST US AVAILABILITY ZONE.

| System   | Instance        | Instance Type       | CPU                                         | Memory | Cost/hour (committed) | Cost/hour (on-demand) | Bandwidth | Network                              | Routing                  | Transport Protocol             |
|----------|-----------------|---------------------|---------------------------------------------|--------|-----------------------|-----------------------|-----------|--------------------------------------|--------------------------|--------------------------------|
| AWS      | Normal          | c5.18xlarge         | 2x18C Intel Xeon Platinum 8124M @ 3 GHz     | 144 GB | 1.34 USD              | 3.06 USD              | 25 Gb/s   | Fat Tree [14]                        | ECMP [14]                | SRD [14]                       |
| AWS      | HPC (Metal)     | c5n.metal           | 2x18C Intel Xeon Platinum 8124M @ 3 GHz     | 192 GB | 1.475 USD             | 3.88 USD              | 100 Gb/s  | Fat Tree [14]                        | ECMP [14]                | SRD [14]                       |
| AWS      | HPC             | c5n.18xlarge        | 2x18C Intel Xeon Platinum 8124M @ 3 GHz     | 192 GB | 1.475 USD             | 3.88 USD              | 100 Gb/s  | Fat Tree [14]                        | ECMP [14]                | SRD [14]                       |
| Azure    | Normal          | F72s v2             | 36C Intel Xeon Platinum 8370C/8272CL/8168   | 144 GB | 1.116 USD             | 3.045 USD             | 30 Gb/s   | Fat Tree [35]                        | ECMP [35]                | N.A.                           |
| Azure    | HPC             | HC44rs              | 2x22C Intel Xeon Platinum 8168 @ 2.70 GHz   | 352 GB | 2.218 USD             | 3.168 USD             | 100 Gb/s  | Non-Blocking Fat Tree [36]           | Static/Adaptive [37]     | InfiniBand [36]                |
| Azure    | HPC (200 Gb/s)  | HB120rs v2          | 2x60C AMD EPYC 7V12 @ 2.45 GHz              | 456 GB | 1.8 USD               | 3.6 USD               | 200 Gb/s  | Non-Blocking Fat Tree [36]           | Static/Adaptive [37]     | InfiniBand [36]                |
| GCP      | Normal          | c2-standard-60      | 2x15C Intel Cascade Lake @ 3.10 GHz         | 240 GB | 1.25 USD              | 3.1321 USD            | 32 Gb/s   | Jupiter (3:1 Blocking Fat Tree) [15] | ECMP [15]                | TCP/IP + Intel QuickData [38]  |
| GCP      | HPC             | c2-standard-60      | 2x15C Intel Cascade Lake @ 3.10 GHz         | 240 GB | 2.148 USD             | 4.03 USD              | 100 Gb/s  | Jupiter (3:1 Blocking Fat Tree) [15] | ECMP [15]                | TCP/IP + Intel QuickData [38]  |
| Oracle   | Normal          | VM.Optimized3.Flex  | 18C Intel Xeon Gold 6354 @ 3 GHz            | 256 GB | N.A.                  | 1.188 USD             | 40 Gb/s   | Non-Blocking Fat Tree [39]           | N.A.                     | N.A.                           |
| Oracle   | HPC (Metal)     | BM.Optimized3.36    | 2x18C Intel Xeon Gold 6354 @ 3 GHz          | 512 GB | N.A.                  | 2.712 USD             | 100 Gb/s  | Non-Blocking Fat Tree [39]           | N.A.                     | RoCEv2 [40]                    |
| Daint    | HPC (Metal)     | -                   | 2x18C Intel Xeon E5-2695 v4 @ 2.10 GHz      | 64 GB  | 1.02 USD [41]         | 1.73 USD [41]         | 82 Gb/s   | Cray Aries (Dragonfly) [42]          | Per-Packet Adaptive [42] | FMA [42]                       |
| Alps     | HPC (Metal)     | -                   | 2x64C AMD EPYC 7742 @ 2.25 GHz              | 256 GB | N.A.                  | N.A.                  | 100 Gb/s  | HPE Cray Slingshot (Dragonfly) [19]  | Per-Packet Adaptive [19] | RoCEv2 [19]                    |
| DEEP-EST | HPC (Metal)     | -                   | 2x12C Intel Xeon Gold 6146 @ 3.20 GHz       | 192 GB | N.A.                  | N.A.                  | 100 Gb/s  | Mellanox InfiniBand EDR (Fat Tree) [43] | Static/Adaptive       | InfiniBand [43]                |
On one side, we analyze this data to spotlight differences in
network performance and noise between different cloud and
on-premise HPC systems. On the other side, we use this data to
calibrate the LogGOPSim simulator [44], [45], and to simulate
the scalability and cost at a larger scale (up to 16K HPC VMs).
We define the concepts of latency noise and bandwidth
noise, and assess the network performance and its impact on
scalability of HPC and normal instances of four major cloud
providers and of three on-premise systems (with different net-
work technology). We also assess OS noise (i.e., performance
variability introduced by OS processes), and we show how
different types of noise impact application performance and cost
both at small and large scale, for both latency- and bandwidth-
dominated communication patterns.
We describe in Sec. II the main characteristics of HPC cloud
solutions, in Sec. III we analyze the network performance of
both cloud and on-premise HPC systems, with a focus on OS
and network noise in Sec. IV. Then, we simulate how noise
affects performance at scale in Sec. V, and discuss related
work in Sec. VII. Finally, Sec. VIII draws conclusions.
II. HPC IN THE CLOUD
In this section we measure and analyze the network per-
formance of HPC systems in the cloud at a small scale, to
understand better some peculiarities and limitations of those
systems. In this paper, we analyze four of the major cloud
providers: Amazon AWS [46], Google GCP [47], Microsoft
Azure [48], and Oracle Cloud [49]. We also analyze three on-
premise HPC systems: Piz Daint [50] (referred to as Daint in the
following) and Alps [51], both deployed at the Swiss National
Supercomputing Centre, and DEEP-EST [43], deployed at the
Jülich Supercomputing Centre. We analyze cloud instances
of different types, including HPC instances (with different
network bandwidth) and normal compute instances. We outline
the different analyzed systems, instance types, and their main
characteristics in Table I. In the following we analyze in detail
the different instance types (Sec. II-A), their network features
(Sec. II-B), and their cost (Sec. II-C).
A. Instances, CPUs, and OS
In the following, with the term HPC instances, we refer
to those instances providing at least 100 Gb/s networking.
For AWS, we evaluate both bare-metal and non bare-metal
HPC instances. Azure and GCP provide only non bare-metal
HPC instances, whereas Oracle only provides bare-metal HPC
instances. To have a fair comparison, we selected instance
types with similar CPUs when possible. We used Intel CPUs
on all the cloud instances except for the 200 Gb/s instances
of Azure, which only have AMD EPYC CPUs. For normal
instances, we selected those that provide a similar network
bandwidth and core count. For completeness, we also report
the amount of RAM on each instance type.
All four providers guarantee that HPC instances are run
on separate physical servers. For the normal instances we
selected CPUs with a high core count to have them allocated
on two separate servers. This is necessary to ensure that when
measuring network performance the two VMs are actually
using the network. For the cloud providers we report in the
Instance Type column the name of the instances we used.
On all cloud providers, we use the virtual machine (VM)
images and operating system suggested for the HPC instances.
These were: Amazon Linux 2 on AWS [52], CentOS 7.7 on
Azure [53], CentOS 7.9 on GCP [54], and Oracle Linux 7.9 on
Oracle. Daint and Alps run a Cray Linux Environment (CLE)
OS based on SUSE Linux Enterprise Server v15.2, and DEEP-
EST runs Rocky Linux v8.5.
B. Network
The four cloud providers and the DEEP-EST system deploy
a fat tree topology. According to the most recent documenta-
tion we found, Azure, Oracle, and DEEP-EST deploy a non-
blocking network [36], [55], GCP a 3:1 blocking network [15],
whereas we did not find any additional detail on network
over- or under-provisioning for AWS. Both AWS and GCP use
ECMP routing [13], Azure employs adaptive routing [37] for
HPC instances, and for Oracle we did not find any information
on routing. The routing protocol plays a crucial role in network
performance. For example, ECMP is congestion oblivious and
might suffer from flow collisions [16], [17], [18], increasing
the network bandwidth variability (see Sec. IV-C). Daint and
Alps deploy a dragonfly interconnect (Cray Aries [42] and
Slingshot [19] respectively) with adaptive routing.
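To illustrate why congestion-oblivious ECMP can cause flow collisions, consider the following sketch. The hash function, path count, and addresses are hypothetical stand-ins (real switch ASICs use their own hardware hash), but the mechanism is the same: the path is a pure function of the flow's 5-tuple, so concurrent flows can land on the same path regardless of load:

```python
import hashlib

NUM_PATHS = 4  # equal-cost paths between two leaf switches (assumed topology)

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Pick a path by hashing the flow 5-tuple (stand-in for a switch ASIC hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PATHS

# Four flows between the same pair of hosts, differing only in source port:
flows = [("10.0.0.1", "10.0.1.1", port, 4791) for port in (1000, 1001, 1002, 1003)]
paths = [ecmp_path(*f) for f in flows]
print(paths)  # several flows may hash onto the same path even though 4 exist
```

In this light, SRD's reaction to a rising RTT amounts to re-randomizing the hashed fields, hoping the flow lands on a less loaded path; it cannot deterministically pick the least congested one.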
Each of the evaluated cloud providers uses a different
transport protocol. AWS provides its proprietary RDMA-
like protocol called SRD (Scalable Reliable Datagram) [14],
which resembles in some aspects InfiniBand verbs [56]. It
provides reliable out-of-order delivery of packets and uses a
custom congestion control protocol. The AWS Nitro Card [57]
implements the reliability layer, and the Elastic Fabric Adapter
(EFA) provides OS-bypass capabilities. To react to congestion,
SRD monitors the round trip time (RTT) and forces packets to
be routed differently by changing some of the fields used by
ECMP to select the path. This approach is probabilistic and
might allow avoiding congested paths but, unlike truly adaptive
routing, it does not allow selecting the least congested path,
nor any specific path.
Azure and DEEP-EST use RDMA through InfiniBand [36],
Oracle uses RDMA over Converged Ethernet (RoCEv2) [40],
whereas GCP does not use RDMA and relies on TCP/IP.
To minimize data movement overheads, GCP uses Intel’s
QuickData DMA Engines [58] to offload payload copies of
larger packets. Daint uses a proprietary RDMA protocol [42]
(FMA), whereas Alps uses RoCEv2 [19].
C. Cost
Table I shows the per-hour cost charged to the user as
of July 18, 2022. For the cloud systems, we report the cost
for the East US availability zone. We consider both the cost
for a committed 3-year usage with upfront payment and the
on-demand cost without any minimum commitment. Note
that 3 years is the maximum commitment allowed on those
providers (and leads to the lowest per-hour cost), whereas no
commitment yields the highest per-hour cost. For Daint, we
report the cost for a minimum usage of 10,000 compute hours,
as well as the on-demand
cost, both for non-academic partners. Academic partners have
discounted rates and this would otherwise lead to an unfair
comparison. For Alps and DEEP-EST there is no publicly
available information on the per-hour cost.
On AWS, the main difference between the normal and
HPC instances we selected is the support for Elastic Fabric
Adapter (EFA), which provides the 100 Gb/s networking.
Thus, we can estimate the 3-year committed cost of the
high-performance network at around 0.135 USD per hour
per VM, and an on-demand cost of 0.82 USD. Similarly, we
selected the same instance type for normal and HPC instances
on GCP. The only difference is that we enabled the so-called
Tier 1 network on the HPC instance, which provides 100 Gb/s
network bandwidth. On GCP, we can thus estimate the cost of
the HPC network at around 0.9 USD per hour per VM [59]
(both for the committed and on-demand cost). Unfortunately,
Azure and Oracle do not provide the same instance in HPC
and non-HPC flavors, and it is thus not possible to isolate the
cost of the HPC network from the rest. Also, we observe that
whereas the on-demand cost of 100 Gb/s instances is lower
than that of 200 Gb/s instances, this is not true for the
3-year committed usage. Indeed, at the time of writing,
committing to a 3-year usage led to a 30% discount for 100
Gb/s instances, and to a 50% discount for the 200 Gb/s ones.
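The network-cost estimates above follow directly from the price differences in Table I; a minimal sketch of the arithmetic (prices hard-coded from the table):

```python
# AWS: c5n.18xlarge (HPC, 100 Gb/s EFA) minus c5.18xlarge (normal, 25 Gb/s)
# isolates the price of the high-performance network.
aws_committed_premium = 1.475 - 1.34  # USD/hour per VM
aws_ondemand_premium = 3.88 - 3.06

# Azure: discount obtained by committing for 3 years, per bandwidth tier.
azure_100g_discount = 1 - 2.218 / 3.168  # HC44rs, 100 Gb/s
azure_200g_discount = 1 - 1.8 / 3.6      # HB120rs v2, 200 Gb/s

print(round(aws_committed_premium, 3))  # 0.135
print(round(aws_ondemand_premium, 2))   # 0.82
print(round(azure_100g_discount, 2))    # 0.3
print(round(azure_200g_discount, 2))    # 0.5
```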
III. NETWORK PERFORMANCE
We measure network performance using the Netgauge
tool [60], which provides detailed, sample-by-sample measure-
ments (fundamental for estimating network noise in Sec. IV).
We used the Message Passing Interface (MPI) backend and, on
each system, the MPI library recommended by the provider1.
a) Methods: We created an account on each provider,
and used our own funding and/or academic credits, without
coordinating with the providers. The clusters have been cre-
ated and tuned following the guidelines publicly available
1On Azure we used HPC-X v2.8.3 on HPC instances and Open MPI
v4.1.2 on normal instances. We used Open MPI v4.1.1 on AWS, Intel MPI
v2018.4.274 on GCP, Open MPI v4.0.4 on Oracle, Cray MPICH v7.7.18 on
Daint, Cray MPICH v8.1.12 on Alps, and Open MPI v4.1.3 on DEEP-EST.
[Fig. 1: Bandwidth for HPC instances (AWS and GCP) as a function of message size and of the number of concurrent connections (1, 2, 4, 8, 16) between the two servers. Inner plots show RTT/2 for small messages.]
in the documentation of the cloud providers. After running
the benchmarks, we contacted the leads of the cloud busi-
ness of each of the providers, sharing a draft of the paper
with them. They assessed the correctness of our evalua-
tion, and we integrated their feedback in the paper. Only
in one case we improved the performance by applying a
technique not described in the publicly available documen-
tation, that we describe in the text (see the comment about
the FI_EFA_TX_MIN_CREDITS in Sec. III-A). On all the
providers, if not specified otherwise, we allocated the two VMs
(or the two servers) on the same rack. The only exception
is Oracle, where it is not possible to explicitly control the
allocation. We analyze in detail the impact of allocation on
performance and noise in Sec. IV.
A. Bandwidth Saturation
All four analyzed cloud providers claim a 100 Gb/s
bandwidth on Intel-based HPC instances. However, this is true
only under certain conditions. For example, AWS documents
a maximum per-message bandwidth of 25 Gb/s [61]. Even if
not explicitly documented, we observed similar limitations on
GCP. One possible explanation for this behavior is that even
though the instance exposes a single 100 Gb/s NIC, it might be
equipped with multiple 25 Gb/s NICs (or a multi-port NIC).
While some providers explicitly documented this for non-HPC
instances, the specific configuration is often unclear for HPC
ones.
For this reason, we can expect a higher bandwidth when
sending a message over multiple connections. To assess if this
is the case, we run a ping-pong benchmark between two nodes.
We establish multiple concurrent connections between the two
nodes, by running multiple processes per node and letting each
pair of processes send/receive disjoint parts of the message.
For example, a 16 MiB ping-pong with 16 processes per node
runs 16 concurrent ping-pongs between the 16 processes on the
first node and the 16 processes on the second node, each with a
1 MiB message.
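The message splitting described above can be sketched as follows. This is a simplified model of the benchmark's data partitioning (the function name and layout are ours, not the actual Netgauge code): each process pair receives a disjoint, contiguous slice of the buffer.

```python
def split_message(total_bytes, num_procs):
    """Assign each of num_procs process pairs a disjoint, contiguous slice."""
    base, rem = divmod(total_bytes, num_procs)
    offsets = []
    start = 0
    for rank in range(num_procs):
        size = base + (1 if rank < rem else 0)  # spread any remainder
        offsets.append((rank, start, size))
        start += size
    return offsets

# 16 MiB ping-pong with 16 process pairs -> each pair moves exactly 1 MiB.
chunks = split_message(16 * 1024 * 1024, 16)
assert all(size == 1024 * 1024 for _, _, size in chunks)
```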
We report the results of this experiment for AWS and GCP
in Figure 1. We report the bandwidth as the message size
divided by half the round trip time (RTT/2), and the inner plots
show the RTT/2 for small messages. Each point in the plot is
the average over 1000 runs, whereas the band around the point
represents the standard deviation. We do not report the results
for the other systems since they can saturate the bandwidth
even with a single connection (we show results in the next
section). On AWS we increased the bandwidth by increasing
the maximum number of in-flight packets to 1024 (by setting
the FI_EFA_TX_MIN_CREDITS environment variable).
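The reported bandwidth and the band around each point can be derived from per-sample RTTs as in this sketch (the RTT values below are synthetic placeholders, not measured data):

```python
from statistics import mean, stdev

def bandwidth_gbps(message_bytes, rtt_us):
    """Bandwidth as message size divided by half the round-trip time (RTT/2)."""
    one_way_s = (rtt_us / 2) * 1e-6
    return message_bytes * 8 / one_way_s / 1e9

# Synthetic RTT samples (microseconds) for a 1 MiB message; in the paper,
# each point averages 1000 such samples and the band is their standard deviation.
rtts = [180.0, 185.0, 178.0, 210.0, 182.0]
samples = [bandwidth_gbps(1024 * 1024, r) for r in rtts]
print(f"mean={mean(samples):.1f} Gb/s, stdev={stdev(samples):.1f} Gb/s")
```

The same per-sample data later feeds the noise analysis: the spread of these values, not just their mean, is what characterizes bandwidth noise.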
On both AWS and GCP, the bandwidth increases with the
number of concurrent communications (up to 80 Gb/s with 16
processes per node). Also, when using a single connection,
the bandwidth drops for messages larger than 4 MiB. This
is caused by a more-than-linear increase in last-level cache
(LLC) misses, which we measured using the perf tool. For
example, on AWS, we observe a 4× increase in LLC misses
when going from 1 MiB to 4 MiB messages, but an 8× increase
when moving from 4 MiB to 16 MiB messages.
This effect is not present when using more concurrent com-
munications because the message is split among the processes,
each transmitting a smaller message.
We also observe that having more processes per node
increases the RTT of small messages, due to additional
overhead and contention on the NIC access. For this reason,
only large messages should be sent with multiple concurrent
connections. Instead of having more processes sending a part
of the message each, we could have a single process sending
multiple smaller messages. For example, some MPI libraries
provide the possibility to stripe messages transparently over
multiple connections (e.g., by using the btl_tcp_links
command line flag on Open MPI [62]). However, we did not
observe any performance improvement compared to the single
connection case.
Observation 1: On AWS and GCP, the peak bandwidth
on a single connection is 50 Gb/s and 30 Gb/s, respectively.
A bandwidth of 80 Gb/s can only be reached by forcing
messages to be concurrently sent/received by multiple
processes over different connections.