Teal: Learning-Accelerated Optimization of WAN Traffic Engineering
Zhiying Xu, Harvard University
Francis Y. Yan, Microsoft Research
Rachee Singh, Cornell University
Justin T. Chiu, Cornell University
Alexander M. Rush, Cornell University
Minlan Yu, Harvard University
ABSTRACT
The rapid expansion of global cloud wide-area networks (WANs) has posed a challenge for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance.

We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links.

We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups.
CCS CONCEPTS
• Networks → Traffic engineering algorithms; • Computing methodologies → Machine learning.

KEYWORDS
Traffic Engineering, Wide-Area Networks, Network Optimization, Machine Learning
1 INTRODUCTION
Large cloud providers invest billions of dollars to provision and operate planet-scale wide-area networks (WANs) that interconnect geo-distributed cloud datacenters. Cloud WANs play a vital role in the operations of cloud providers as they enable low-latency and high-throughput applications in the cloud. Over the last decade, cloud providers have implemented centralized network traffic engineering (TE) systems based on SDN (software-defined networking) to efficiently utilize their cloud WANs [21, 22, 25, 33].

TE systems allocate demands between datacenters to achieve high link utilization [21, 25], fairness among network flows [21], and resilience to link failures in WANs [4, 38]. Traditionally, cloud WAN TE systems have approached traffic allocation as an optimization problem, with the objective of achieving a desired network property (e.g., minimum latency, maximum throughput). To this end, they implement a software-defined TE controller as illustrated in Figure 1. The TE controller periodically (e.g., every five minutes) receives traffic demands to allocate, gauged by a bandwidth broker, solves the TE optimization problem, and translates the traffic allocations into router configurations to deploy through SDN.

After a decade of operation, production WAN TE systems are facing two major challenges. First, the deployment of new edge sites and datacenters has increased the size of cloud WANs by an order of magnitude [21]. Larger WAN topologies have increased the complexity of TE optimization and the time required to solve it. During the computation of updated flow allocations (even when the five-minute time budget is not exceeded), stale routes will remain in use and lead to suboptimal network performance [2]. Second, WANs have evolved from carrying first-party discretionary traffic to real-time and user-facing traffic [34]. As a result, cloud TE systems must react to rapid changes in traffic volume, a hallmark of organic user-driven demands. Sudden topology changes due to link failures further exacerbate the negative effects of slow TE control on network performance. Therefore, fast computation of traffic allocations is critical for TE systems to retain performance on large WAN topologies.
While linear programming (LP) solvers used by TE systems can find optimal solutions, they struggle to scale with the growing network size. State-of-the-art algorithms designed for accelerating TE optimization address this challenge by decomposing the original problem into smaller subproblems (through the partition of WAN topology [2] or traffic demands [46]) and solving them in parallel using LP solvers. However, these algorithms face a fundamental tradeoff between speed and performance in the decomposition, restricting themselves to only a few dozen subproblems and thus limited parallelism (§2.1).

Our key insight is that deep learning-based TE schemes may unlock massive parallelism by utilizing thousands of GPU threads that are made readily accessible through modern deep learning frameworks [7, 48, 56]. This enormous parallelism owes to the well-known affinity between neural networks and GPUs (e.g., SIMD operations on GPUs speed up matrix multiplication), as well as the tremendous community effort spent optimizing neural network inference [24, 27, 57]. Meanwhile, by capitalizing on a wealth of historical data from production WANs and exploiting traffic patterns, learning-based algorithms are poised to simultaneously retain TE performance as well.

Figure 1: Control loop of WAN traffic engineering.
Unfortunately, off-the-shelf deep learning models do not directly apply to TE. First, standard fully-connected neural networks fail to take into account the effects of WAN connectivity on traffic allocations, yielding solutions that are far from optimal. Second, the escalating scale of the TE problem makes it intractable to train a monolithic model to navigate the high-dimensional solution space. Finally, neural networks are unable to enforce constraints, leading to unnecessary traffic drops due to exceeded link capacities.

To address these challenges, we present a learning-accelerated TE scheme named Teal. First, Teal constructs a flow-centric graph neural network (GNN) to capture WAN topology and extract informative features from traffic flows for the downstream allocation task. Next, Teal allocates each demand individually using a shared policy (neural) network based on the learned features. Doing so reduces the problem scale from global traffic allocation to per-demand tasks, making the learning process more tractable (in a low-dimensional space) and feasible (fit into GPU memory). To coordinate the independent allocations of demands and avoid contention for links, Teal leverages multi-agent reinforcement learning (RL) to train the end-to-end model—GNN and policy network—toward optimizing a central TE objective. Finally, Teal fine-tunes the model's output allocations using a highly parallelizable constrained optimization algorithm—ADMM (alternating direction method of multipliers), which is well suited for reducing constraint violations such as oversubscribed links and enhancing solution quality.

We evaluate Teal on traffic matrices collected over a 20-day period from Microsoft's WAN (§5). Our experimental results show that on large WAN topologies, Teal realizes near-optimal flow allocation while being several orders of magnitude faster than the production TE optimization engine using LP solvers. Compared with the state-of-the-art schemes for TE acceleration [2, 45, 46] on a large topology with >1,700 nodes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups. To aid further research and development, we have released Teal's source code at https://github.com/harvard-cns/teal.
2 BACKGROUND AND MOTIVATION
Production WANs rely on a centralized TE controller to allocate traffic demands between datacenters, which are gauged by a bandwidth broker periodically (e.g., every 5 minutes). The TE controller splits each traffic demand onto a handful of precomputed paths (e.g., 4 shortest paths [2, 46]) between the demand's source and destination, with the goal of maximizing a TE objective (e.g., overall throughput) while satisfying a set of constraints (e.g., link capacities). This path formulation of TE is widely adopted in production inter-datacenter WANs [21, 22, 25, 33]. At its core, TE optimization is a multi-commodity flow problem (formally defined in Appendix A), which is traditionally solved with linear programming (LP) solvers. In this section, we begin with the scalability crisis faced by today's TE systems, and then motivate the need for and challenges of a learning-accelerated TE solution.
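
For reference in the discussion that follows, the path formulation can be sketched as the linear program below, assuming a total-flow objective; this is a simplified rendering, and the formal definition in Appendix A may differ in objective and notation. Here x_{d,p} is the flow that demand d places on its preconfigured path p, vol(d) is the demand's volume, and c_e is the capacity of link e.

```latex
% Simplified path-based TE linear program (sketch; the paper's formal
% definition lives in Appendix A). D: demands, P_d: preconfigured paths
% of demand d, E: links, c_e: capacity of link e.
\begin{align*}
\max_{x} \quad & \textstyle\sum_{d \in D} \sum_{p \in P_d} x_{d,p}
    && \text{(total flow)} \\
\text{s.t.} \quad
  & \textstyle\sum_{p \in P_d} x_{d,p} \le \mathrm{vol}(d),
    && \forall d \in D \quad \text{(demand constraints)} \\
  & \textstyle\sum_{d \in D} \sum_{p \in P_d:\, e \in p} x_{d,p} \le c_e,
    && \forall e \in E \quad \text{(capacity constraints)} \\
  & x_{d,p} \ge 0, && \forall d \in D,\; p \in P_d
\end{align*}
```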
2.1 Scaling challenges of TE
In their early years, cloud WANs consisted of only tens of datacenters, so it was feasible for commercial LP solvers to compute traffic allocations frequently. However, the rapid deployment of new datacenters has rendered the TE task prohibitively slow at scale, requiring hours for commercial solvers to allocate traffic on WANs with thousands of nodes. Consequently, WAN operators are seeking to accelerate TE optimization to keep pace with the growing size of the WAN.

Parallelizing LP solvers. An intuitive way of accelerating TE optimization is to parallelize state-of-the-art LP solvers, such as Gurobi [17] and CPLEX [23]. Figure 2 evaluates the speedup of the Gurobi solver on a WAN topology with more than 1,700 nodes (ASN in Table 1). As more CPU threads are made available, we observe that the speedup is sublinear and marginal: e.g., using 16 threads only makes Gurobi 3.8× faster, which still requires 5.5 hours to complete a flow allocation. This is due to LP solvers' sequential nature; e.g., the conventional simplex method [47] takes one small step at a time toward the optimal solution along the edges of the feasible region, and requires thousands to millions of such steps to complete. To exploit multiple CPU threads, LP solvers often resort to concurrently running independent instances of different optimization algorithms [1], where each instance executes serially on a separate thread; the solution is yielded by whichever instance completes first. This is evidently not an efficient use of CPU capacity, resulting in marginal speedups on multiple CPUs.

Figure 2: On a topology with >1,700 nodes (ASN in Table 1), TE optimization using the Gurobi solver experiences only a marginal speedup as more CPU threads become available.
Approximation algorithms. Combinatorial algorithms, such as Fleischer's algorithm [10], are designed to compute approximate but asymptotically faster solutions to the underlying network flow problem of TE. Despite having a lower time complexity than LP solvers in theory, these approximation algorithms are found to be hardly faster in practice [2]. The reason is that these algorithms remain iterative in nature, incrementally allocating more flows until the solution quality is deemed adequate (yet suboptimal), which often requires an excessive number of iterations to terminate.
Decomposing TE into subproblems. Recently, NCFlow [2] and POP [46] introduced techniques to decompose TE optimization into subproblems, applying LP solvers simultaneously in each subproblem and merging their results at the end. NCFlow partitions the network spatially into k clusters, whereas POP creates k replicas of the network, each with 1/k of the link capacities, and randomly assigns traffic demands to these replicas. Although a larger k reduces the overall run time, it also fundamentally impairs the TE performance. Moreover, NCFlow requires nontrivial coordination during the merging process. Consequently, NCFlow and POP are forced to adopt small values of k (e.g., 64–81 on a network of 754 nodes). In §5, we show that NCFlow and POP are substantially slower than our learning-accelerated approach, while having notably worse allocation performance.
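
To make the demand-partitioning idea concrete, a POP-style split can be sketched in a few lines of Python; the data layout and function name here are hypothetical, and the real systems involve substantially more machinery (e.g., NCFlow's coordination during merging).

```python
import random

def pop_partition(demands, link_capacities, k):
    """Sketch of POP-style decomposition: build k replicas of the network,
    each with 1/k of every link's capacity and a random 1/k of the demands.
    Each replica then becomes an independent LP solved in parallel."""
    replicas = [
        {"capacities": {link: cap / k for link, cap in link_capacities.items()},
         "demands": []}
        for _ in range(k)
    ]
    for demand in demands:
        random.choice(replicas)["demands"].append(demand)
    return replicas

# A larger k yields more parallel subproblems but degrades allocation
# quality, which is why POP and NCFlow keep k small (e.g., 64-81 on
# a 754-node network).
```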
2.2 Accelerating TE optimization with ML
To cope with the growing scale of TE, we argue that with judicious design, machine learning (ML) can significantly accelerate TE optimization. By training on a vast amount of historical traffic data, ML-based TE schemes also have the potential to attain near-optimal allocation performance.

Unlocking massive parallelism. Encoding a TE algorithm in neural networks transforms the traditionally iterative TE optimization (LP solvers or combinatorial algorithms) into the inference process of neural networks, where the input data (e.g., traffic demands on a WAN topology) is propagated in the forward direction through the neural network to compute the output (e.g., traffic splits on the preconfigured paths). This inference process unlocks massive parallelism because it mainly consists of highly parallelizable operations such as matrix multiplications.

Leveraging hardware acceleration. Modern deep learning frameworks [48, 56, 57] have empowered neural networks to easily leverage thousands of threads on GPUs (or other specialized hardware [16]). They can greatly accelerate the computation of learning-based TE systems compared with state-of-the-art schemes [2, 46], which are fundamentally limited to tens of parallel workers. In addition, the deep learning community has integrated various optimization techniques [24, 27, 57] into these frameworks, further accelerating neural network inference.

Exploiting abundant training data. Operational WANs generate an abundance of traffic data that can be used to train neural networks. A carefully designed ML-based TE scheme is capable of discovering regularities in the training data—such as patterns in graph connectivity, link capacities, and traffic demands—and ultimately learns to optimize allocation performance with respect to operator-specified objectives.
2.3 Challenges of applying ML to TE
While holding the promise of near-optimal performance and substantial speedups relative to LP-based TE methods, deep learning is not a panacea. In fact, using ML for TE optimization is not as straightforward as it may appear.

Graph connectivity and network flows. Naively using vanilla fully-connected neural networks for TE optimization would ignore the connectivity of the WAN topology. While graph neural networks (GNNs) [50, 63], designed for graph-structured input data, can model traditional graph attributes such as nodes and edges, their unmodified form is inadequate to model network flows—the focal point of TE optimization.

High-dimensional solution space. In the path formulation of TE widely adopted in practice (details in Appendix A), the TE controller splits each demand across a handful of preconfigured paths, such as 4 shortest paths. Therefore, representing the flow allocation for a topology of N nodes requires O(N²) split ratios. To put this into context, on a topology with 1,000 nodes, the solution space would contain up to 4 million dimensions, exposing ML-based TE methods to the "curse of dimensionality" [32].

Constrained optimization. Unlike LP solvers, neural networks are known to lack the capability to enforce constraints on outputs [37]. As a result, the traffic allocations generated by neural networks may exceed certain link capacities when deployed directly, leading to network congestion and reduced TE performance.

To tackle the above challenges, we propose the following designs: i) a flow-centric GNN (called "FlowGNN") to capture WAN connectivity and model network flows (§3.2); ii) a multi-agent reinforcement learning (RL) algorithm that allocates each traffic demand independently to reduce the problem scale and make learning tractable (§3.3); iii) solution fine-tuning using the alternating direction method of multipliers (ADMM) to minimize constraint violations such as link overutilization (§3.4).
3 TEAL: LEARNING-ACCELERATED TE
In this section, we present the design of Teal (Traffic Engineering Accelerated with Learning). The goal of Teal is to train a fast and scalable TE scheme through deep learning while achieving near-optimal traffic allocation on large-scale topologies. The rationale behind using deep learning is to harness the massive parallelism and hardware acceleration unlocked by neural networks. Moreover, every component of Teal is carefully designed to be parallelizable (fast on GPUs) and scalable (performant on large WANs).
3.1 Overview
We begin by outlining the workflow of Teal during deployment (Figure 3). Upon the arrival of a new traffic demand matrix or a change in link capacity¹, Teal passes the updated traffic demands and current link capacities into a novel graph neural network (GNN) that we call FlowGNN (§3.2). FlowGNN learns to transform the demands into compact feature vectors known as "embeddings," which preserve the graph structure and encode the flow information required for the downstream allocation. These flow embeddings are extracted by FlowGNN in a parallel and parameter-efficient manner that scales well with the size of the WAN topology.

¹ We note that link failures can be viewed as an extreme scenario of capacity change, where the capacity of a failed link is reduced to zero.

Figure 3: Workflow of Teal. Teal inputs traffic demands and link capacities into FlowGNN to learn flow embeddings (§3.2), which are then mapped to initial traffic allocations through multi-agent RL (§3.3). ADMM subsequently fine-tunes the allocations and mitigates constraint violations (§3.4).

In the widely adopted path formulation of TE (details in Appendix A), each traffic demand is split into multiple flows over a set of preconfigured paths (e.g., 4 shortest paths [2, 46]). To determine the split ratios of a given demand, Teal aggregates the
embeddings learned for each flow of the demand and inputs them into a shared policy (neural) network. The policy network, which allocates demands independently, is trained offline to coordinate flow allocations toward optimizing a global TE objective (e.g., total flow), through a multi-agent reinforcement learning (RL) algorithm we customize for TE (§3.3). This design enables processing demands individually rather than the entire traffic matrix at once, making the policy network more compact (in terms of the parameters to learn) and oblivious to the WAN topology size.
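
To illustrate the idea (not Teal's exact architecture), the following PyTorch sketch shows a shared per-demand policy network that maps a demand's flow embeddings to split ratios over its preconfigured paths; the layer sizes and the softmax output head are our assumptions.

```python
import torch
import torch.nn as nn

class PerDemandPolicy(nn.Module):
    """A shared policy network applied to every demand independently.

    Hypothetical sketch: dimensions and architecture are illustrative,
    not Teal's actual design.
    """
    def __init__(self, embed_dim: int = 16, num_paths: int = 4):
        super().__init__()
        # Input: concatenated embeddings of the demand's `num_paths` flows.
        self.net = nn.Sequential(
            nn.Linear(num_paths * embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_paths),
        )

    def forward(self, flow_embeddings: torch.Tensor) -> torch.Tensor:
        # flow_embeddings: [num_demands, num_paths, embed_dim]
        logits = self.net(flow_embeddings.flatten(start_dim=1))
        # Softmax yields per-demand split ratios that sum to 1, so each
        # demand constraint is respected by construction.
        return torch.softmax(logits, dim=-1)

policy = PerDemandPolicy()
# All demands are allocated in one batched GPU pass, regardless of topology size.
split_ratios = policy(torch.randn(1000, 4, 16))  # -> [1000, 4]
```

Because the same parameters are applied to every demand, the network's size is independent of the topology, and adding demands only grows the batch dimension.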
At this point, the split ratios output by the policy network might still violate certain link capacity constraints, resulting in dropped traffic and suboptimal TE performance. To mitigate constraint violations and enhance the solution quality, Teal augments the neural networks (FlowGNN and policy network) with 2–5 rapid iterations of a classical constrained optimization algorithm—the alternating direction method of multipliers (ADMM) (§3.4). During each iteration, ADMM starts from a potentially infeasible TE solution with capacity violations and advances toward the feasible region, fine-tuning traffic splits to meet more constraints and improve the overall TE performance. Each iteration of ADMM is inherently parallel. When warm-started with a reasonably good solution, such as the one generated by the neural networks, ADMM can attain a noticeable improvement in performance within several fast iterations.
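
For intuition about this fine-tuning step, the following NumPy sketch runs a few ADMM iterations on a simplified stand-in problem: find path flows x close to the network's output x0 subject to link capacity constraints Ax ≤ c, where A is a (hypothetical) path-to-link incidence matrix. Teal's actual ADMM decomposition of the TE problem (§3.4) differs; this only illustrates the warm-started, few-iteration structure.

```python
import numpy as np

def admm_finetune(x0, A, c, rho=1.0, num_iters=5):
    """Toy fine-tuning: min ||x - x0||^2  s.t.  A @ x <= c, via ADMM with
    f(x) = ||x - x0||^2, g(z) = indicator(z <= c), and constraint A @ x = z."""
    n, m = A.shape[1], A.shape[0]
    z = np.minimum(A @ x0, c)            # start z inside the feasible set
    u = np.zeros(m)                      # scaled dual variable
    # The x-update solves the same linear system each iteration.
    M = 2.0 * np.eye(n) + rho * A.T @ A
    x = x0.copy()
    for _ in range(num_iters):
        x = np.linalg.solve(M, 2.0 * x0 + rho * A.T @ (z - u))
        z = np.minimum(A @ x + u, c)     # projection onto {z <= c}
        u = u + A @ x - z                # dual ascent on the residual
    return x

# Hypothetical example: 3 path flows sharing 2 links.
A = np.array([[1.0, 1.0, 0.0],           # link 1 carries paths 1 and 2
              [0.0, 1.0, 1.0]])          # link 2 carries paths 2 and 3
x0 = np.array([0.8, 0.9, 0.7])           # neural output; overloads both links
x = admm_finetune(x0, A, c=np.array([1.0, 1.0]))
print(x, A @ x)  # flows nudged so link loads approach the capacities
```

Note how the warm start matters: starting near a good solution lets a handful of iterations suffice, and the updates are simple matrix operations that parallelize well.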
For each WAN topology, Teal trains its "model"—FlowGNN and policy network—end to end to optimize an operator-provided TE objective; ADMM requires no training. All three key components of Teal (FlowGNN, multi-agent RL, and ADMM) are carefully designed to be highly parallelizable, enabling fast computation and scalable TE performance as the size of the WAN topology grows.
3.2 Feature learning with FlowGNN
In light of the graph-based structure of WAN topologies, Teal leverages graph neural networks (GNNs) for feature learning. GNNs are a family of neural networks designed to handle graph-structured data [50] and have found applications in various domains, including network planning [71], social networks [9, 42], and traffic prediction [36].

GNNs typically store information in graph attributes, commonly in nodes, using a compact vector representation known as embeddings [18]. To preserve graph connectivity in the embeddings, neighboring nodes in the GNN exchange information through "message passing" [15]: each node collects the embeddings from adjacent nodes, transforms the aggregated embeddings using a learned transformation function (e.g., encoded in a fully-connected neural network), and updates its own embedding with the result. GNNs are intrinsically parallel as message passing occurs simultaneously across nodes. Applying message passing once constitutes one GNN layer, and stacking multiple GNN layers allows information to be disseminated multiple hops away. It is noteworthy that GNNs are parameter-efficient because each layer shares the same transformation function, which operates in the low-dimensional embedding space and does not grow in proportion to the input graph size.
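
For illustration, the PyTorch sketch below implements one generic message-passing layer of the kind described above, using mean aggregation over neighbors; the aggregation function, dimensions, and names are our assumptions rather than FlowGNN's exact design.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One generic GNN layer: aggregate neighbor embeddings, then transform.

    Illustrative sketch; FlowGNN's actual layers operate on EdgeNodes and
    PathNodes (described next) and may aggregate differently.
    """
    def __init__(self, dim: int = 16):
        super().__init__()
        # A single transformation shared by all nodes keeps the layer
        # parameter-efficient: its size is independent of the graph size.
        self.transform = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: [num_nodes, dim] embeddings; adj: [num_nodes, num_nodes] 0/1 matrix.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = (adj @ h) / deg      # all nodes aggregate in parallel
        return self.transform(torch.cat([h, neighbor_mean], dim=-1))

# Stacking k layers propagates information k hops away.
layer = MessagePassingLayer()
h = torch.randn(5, 16)                       # 5 nodes, 16-dim embeddings
adj = torch.tensor([[0, 1, 1, 0, 0], [1, 0, 0, 1, 0], [1, 0, 0, 1, 0],
                    [0, 1, 1, 0, 1], [0, 0, 0, 1, 0]], dtype=torch.float)
h = layer(layer(h, adj), adj)                # two hops of message passing
```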
Despite the strengths of GNNs, the primary focus of TE is the assignment of flows, in which each flow originating from a demand follows a predefined path along a chain of network links (edges). TE is concerned with the interference between flows as they compete for link capacities. Hence, we put a spotlight on flows and explicitly represent flow-related entities—edges and paths—as the nodes in our TE-specific GNN, which we call FlowGNN.

Figure 4 exemplifies the construction of a FlowGNN. At a high level, FlowGNN alternates between GNN layers aimed at capturing capacity constraints, and DNN layers aimed at capturing demand constraints, which dictate that the total volume of all flows derived from a demand should not exceed the demand itself (formal formulation in Appendix A).

Figure 4: Illustration of a FlowGNN construction. FlowGNN alternates between GNN layers designed to capture capacity constraints and DNN layers intended to capture demand constraints.

The GNN layer in FlowGNN is structured as a bipartite graph. For each edge in the input topology, we create an "EdgeNode" (e.g., V₁₃ for the edge connecting nodes #1 and #3), and for each preconfigured path associated with a demand, we create a "PathNode" (e.g., V₁₃₄ for the path containing nodes #1, #3, and #4). An EdgeNode and a PathNode are connected in the GNN layer if and only if the edge lies on the path (e.g., V₁₃ is connected to V₁₃₄ but not to V₁₂₄). The intuition behind this setup is to allow EdgeNodes and PathNodes to interact and learn to respect capacity constraints during message passing. For example, when an edge serves as a bottleneck for competing flows, the EdgeNode's embedding will be influenced by its neighboring PathNodes. During initialization, the embedding of an EdgeNode is initialized with the capacity of the corresponding edge,