Teal: Learning-Accelerated Optimization of WAN Traffic Engineering
Zhiying Xu, Harvard University
Francis Y. Yan, Microsoft Research
Rachee Singh, Cornell University
Justin T. Chiu, Cornell University
Alexander M. Rush, Cornell University
Minlan Yu, Harvard University
ABSTRACT
The rapid expansion of global cloud wide-area networks (WANs) has posed a challenge for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance.

We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links.

We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups.
CCS CONCEPTS
• Networks → Traffic engineering algorithms; • Computing methodologies → Machine learning.

KEYWORDS
Traffic Engineering, Wide-Area Networks, Network Optimization, Machine Learning
1 INTRODUCTION
Large cloud providers invest billions of dollars to provision and operate planet-scale wide-area networks (WANs) that interconnect geo-distributed cloud datacenters. Cloud WANs play a vital role in the operations of cloud providers as they enable low-latency and high-throughput applications in the cloud. Over the last decade, cloud providers have implemented centralized network traffic engineering (TE) systems based on SDN (software-defined networking) to efficiently utilize their cloud WANs [21, 22, 25, 33].

TE systems allocate demands between datacenters to achieve high link utilization [21, 25], fairness among network flows [21], and resilience to link failures in WANs [4, 38]. Traditionally, cloud WAN TE systems have approached traffic allocation as an optimization problem, with the objective of achieving a desired network property (e.g., minimum latency, maximum throughput). To this end, they implement a software-defined TE controller as illustrated in Figure 1. The TE controller periodically (e.g., every five minutes) receives traffic demands to allocate, gauged by a bandwidth broker, solves the TE optimization problem, and translates the traffic allocations into router configurations to deploy through SDN.

After a decade of operation, production WAN TE systems are facing two major challenges. First, the deployment of new edge sites and datacenters has increased the size of cloud WANs by an order of magnitude [21]. Larger WAN topologies have increased the complexity of TE optimization and the time required to solve it. During the computation of updated flow allocations (even when the five-minute time budget is not exceeded), stale routes will remain in use and lead to suboptimal network performance [2]. Second, WANs have evolved from carrying first-party discretionary traffic to real-time and user-facing traffic [34]. As a result, cloud TE systems must react to rapid changes in traffic volume, a hallmark of organic user-driven demands. Sudden topology changes due to link failures further exacerbate the negative effects of slow TE control on network performance. Therefore, fast computation of traffic allocations is critical for TE systems to retain performance on large WAN topologies.
While linear programming (LP) solvers used by TE systems can find optimal solutions, they struggle to scale with the growing network size. State-of-the-art algorithms designed for accelerating TE optimization address this challenge by decomposing the original problem into smaller subproblems (through the partition of WAN topology [2] or traffic demands [46]) and solving them in parallel using LP solvers. However, these algorithms face a fundamental tradeoff between speed and performance in the decomposition, restricting themselves to only a few dozen subproblems and thus limited parallelism (§2.1).

Our key insight is that deep learning-based TE schemes may unlock massive parallelism by utilizing thousands of GPU threads that are made readily accessible through modern deep learning frameworks [7, 48, 56]. This enormous parallelism owes to the well-known affinity between neural networks and GPUs (e.g., SIMD operations on GPUs speed up matrix multiplication), as well as the tremendous community effort spent optimizing neural network inference [24, 27, 57]. Meanwhile, by capitalizing on a wealth of historical data from production WANs and exploiting traffic patterns, learning-based algorithms are poised to simultaneously retain TE performance as well.

Figure 1: Control loop of WAN traffic engineering.
Unfortunately, off-the-shelf deep learning models do not directly apply to TE. First, standard fully-connected neural networks fail to take into account the effects of WAN connectivity on traffic allocations, yielding solutions that are far from optimal. Second, the escalating scale of the TE problem makes it intractable to train a monolithic model to navigate the high-dimensional solution space. Finally, neural networks are unable to enforce constraints, leading to unnecessary traffic drops due to exceeded link capacities.

To address these challenges, we present a learning-accelerated TE scheme named Teal. First, Teal constructs a flow-centric graph neural network (GNN) to capture WAN topology and extract informative features from traffic flows for the downstream allocation task. Next, Teal allocates each demand individually using a shared policy (neural) network based on the learned features. Doing so reduces the problem scale from global traffic allocation to per-demand tasks, making the learning process more tractable (in a low-dimensional space) and feasible (fit into GPU memory). To coordinate the independent allocations of demands and avoid contention for links, Teal leverages multi-agent reinforcement learning (RL) to train the end-to-end model—GNN and policy network—toward optimizing a central TE objective. Finally, Teal fine-tunes the model's output allocations using a highly parallelizable constrained optimization algorithm—ADMM (alternating direction method of multipliers), which is well suited for reducing constraint violations such as oversubscribed links and enhancing solution quality.

We evaluate Teal on traffic matrices collected over a 20-day period from Microsoft's WAN (§5). Our experimental results show that on large WAN topologies, Teal realizes near-optimal flow allocation while being several orders of magnitude faster than the production TE optimization engine using LP solvers. Compared with the state-of-the-art schemes for TE acceleration [2, 45, 46] on a large topology with >1,700 nodes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups. To aid further research and development, we have released Teal's source code at https://github.com/harvard-cns/teal.
2 BACKGROUND AND MOTIVATION
Production WANs rely on a centralized TE controller to allocate traffic demands between datacenters, which are gauged by a bandwidth broker periodically (e.g., every 5 minutes). The TE controller splits each traffic demand onto a handful of precomputed paths (e.g., 4 shortest paths [2, 46]) between the demand's source and destination, with the goal of maximizing a TE objective (e.g., overall throughput) while satisfying a set of constraints (e.g., link capacities). This path formulation of TE is widely adopted in production inter-datacenter WANs [21, 22, 25, 33]. At its core, TE optimization is a multi-commodity flow problem (formally defined in Appendix A), which is traditionally solved with linear programming (LP) solvers. In this section, we begin with the scalability crisis faced by today's TE systems, and then motivate the need for and challenges of a learning-accelerated TE solution.
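
For reference in the discussion that follows, the path formulation can be sketched as the linear program below, assuming a total-flow objective; this is a simplified rendering, and the formal definition in Appendix A may differ in objective and notation. Here x_{d,p} is the flow that demand d places on its preconfigured path p, vol(d) is the demand's volume, and c_e is the capacity of link e.

```latex
% Simplified path-based TE linear program (sketch; the paper's formal
% definition lives in Appendix A). D: demands, P_d: preconfigured paths
% of demand d, E: links, c_e: capacity of link e.
\begin{align*}
\max_{x} \quad & \textstyle\sum_{d \in D} \sum_{p \in P_d} x_{d,p}
    && \text{(total flow)} \\
\text{s.t.} \quad
  & \textstyle\sum_{p \in P_d} x_{d,p} \le \mathrm{vol}(d),
    && \forall d \in D \quad \text{(demand constraints)} \\
  & \textstyle\sum_{d \in D} \sum_{p \in P_d:\, e \in p} x_{d,p} \le c_e,
    && \forall e \in E \quad \text{(capacity constraints)} \\
  & x_{d,p} \ge 0, && \forall d \in D,\; p \in P_d
\end{align*}
```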
2.1 Scaling challenges of TE
In their early years, cloud WANs consisted of only tens of datacenters, so it was feasible for commercial LP solvers to compute traffic allocations frequently. However, the rapid deployment of new datacenters has rendered the TE task prohibitively slow at scale, requiring hours for commercial solvers to allocate traffic on WANs with thousands of nodes. Consequently, WAN operators are seeking to accelerate TE optimization to keep pace with the growing size of the WAN.

Parallelizing LP solvers. An intuitive way of accelerating TE optimization is to parallelize state-of-the-art LP solvers, such as Gurobi [17] and CPLEX [23]. Figure 2 evaluates the speedup of the Gurobi solver on a WAN topology with more than 1,700 nodes (ASN in Table 1). As more CPU threads are made available, we observe that the speedup is sublinear and marginal: e.g., using 16 threads only makes Gurobi 3.8× faster, which still requires 5.5 hours to complete a flow allocation. This is due to LP solvers' sequential nature; e.g., the conventional simplex method [47] takes one small step at a time toward the optimal solution along the edges of the feasible region, and requires thousands to millions of such steps to complete. To exploit multiple CPU threads, LP solvers often resort to concurrently running independent instances of different optimization algorithms [1], where each instance executes serially on a separate thread; the solution is yielded by whichever instance completes first. This is evidently not an efficient use of CPU capacity, resulting in marginal speedups on multiple CPUs.

Figure 2: On a topology with >1,700 nodes (ASN in Table 1), TE optimization using the Gurobi solver experiences only a marginal speedup as more CPU threads become available.
Approximation algorithms. Combinatorial algorithms, such as Fleischer's algorithm [10], are designed to compute approximate but asymptotically faster solutions to the underlying network flow problem of TE. Despite having a lower time complexity than LP solvers in theory, these approximation algorithms are found to be hardly faster in practice [2]. The reason is that these algorithms remain iterative in nature, incrementally allocating more flows until the solution quality is deemed adequate (yet suboptimal), which often requires an excessive number of iterations to terminate.
Decomposing TE into subproblems. Recently, NCFlow [2] and POP [46] introduced techniques to decompose TE optimization into subproblems, applying LP solvers simultaneously in each subproblem and merging their results at the end. NCFlow partitions the network spatially into k clusters, whereas POP creates k replicas of the network, each with 1/k of the link capacities, and randomly assigns traffic demands to these replicas. Although a larger k reduces the overall run time, it also fundamentally impairs the TE performance. Moreover, NCFlow requires nontrivial coordination during the merging process. Consequently, NCFlow and POP are forced to adopt small values of k (e.g., 64–81 on a network of 754 nodes). In §5, we show that NCFlow and POP are substantially slower than our learning-accelerated approach, while having notably worse allocation performance.
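
To make the demand-partitioning idea concrete, a POP-style split can be sketched in a few lines of Python; the data layout and function name here are hypothetical, and the real systems involve substantially more machinery (e.g., NCFlow's coordination during merging).

```python
import random

def pop_partition(demands, link_capacities, k):
    """Sketch of POP-style decomposition: build k replicas of the network,
    each with 1/k of every link's capacity and a random 1/k of the demands.
    Each replica then becomes an independent LP solved in parallel."""
    replicas = [
        {"capacities": {link: cap / k for link, cap in link_capacities.items()},
         "demands": []}
        for _ in range(k)
    ]
    for demand in demands:
        random.choice(replicas)["demands"].append(demand)
    return replicas

# A larger k yields more parallel subproblems but degrades allocation
# quality, which is why POP and NCFlow keep k small (e.g., 64-81 on
# a 754-node network).
```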
2.2 Accelerating TE optimization with ML
To cope with the growing scale of TE, we argue that with judicious design, machine learning (ML) can significantly accelerate TE optimization. By training on a vast amount of historical traffic data, ML-based TE schemes also have the potential to attain near-optimal allocation performance.

Unlocking massive parallelism. Encoding a TE algorithm in neural networks transforms the traditionally iterative TE optimization (LP solvers or combinatorial algorithms) into the inference process of neural networks, where the input data (e.g., traffic demands on a WAN topology) is propagated in the forward direction through the neural network to compute the output (e.g., traffic splits on the preconfigured paths). This inference process unlocks massive parallelism because it mainly consists of highly parallelizable operations such as matrix multiplications.

Leveraging hardware acceleration. Modern deep learning frameworks [48, 56, 57] have empowered neural networks to easily leverage thousands of threads on GPUs (or other specialized hardware [16]). They can greatly accelerate the computation of learning-based TE systems compared with state-of-the-art schemes [2, 46], which are fundamentally limited to tens of parallel workers. In addition, the deep learning community has integrated various optimization techniques [24, 27, 57] into these frameworks, further accelerating neural network inference.

Exploiting abundant training data. Operational WANs generate an abundance of traffic data that can be used to train neural networks. A carefully designed ML-based TE scheme is capable of discovering regularities in the training data—such as patterns in graph connectivity, link capacities, and traffic demands—and ultimately learns to optimize allocation performance with respect to operator-specified objectives.
2.3 Challenges of applying ML to TE
While holding the promise of near-optimal performance and substantial speedups relative to LP-based TE methods, deep learning is not a panacea. In fact, using ML for TE optimization is not as straightforward as it may appear.

Graph connectivity and network flows. Naively using vanilla fully-connected neural networks for TE optimization would ignore the connectivity of the WAN topology. While graph neural networks (GNNs) [50, 63], designed for graph-structured input data, can model traditional graph attributes such as nodes and edges, their unmodified form is inadequate to model network flows—the focal point of TE optimization.

High-dimensional solution space. In the path formulation of TE widely adopted in practice (details in Appendix A), the TE controller splits each demand across a handful of preconfigured paths, such as 4 shortest paths. Therefore, representing the flow allocation for a topology of N nodes requires O(N²) split ratios. To put this into context, on a topology with 1,000 nodes, the solution space would contain up to 4 million dimensions, exposing ML-based TE methods to the "curse of dimensionality" [32].

Constrained optimization. Unlike LP solvers, neural networks are known to lack the capability to enforce constraints on outputs [37]. As a result, the traffic allocations generated by neural networks may exceed certain link capacities when deployed directly, leading to network congestion and reduced TE performance.

To tackle the above challenges, we propose the following designs: i) a flow-centric GNN (called "FlowGNN") to capture WAN connectivity and model network flows (§3.2); ii) a multi-agent reinforcement learning (RL) algorithm that allocates each traffic demand independently to reduce the problem scale and make learning tractable (§3.3); iii) solution fine-tuning using the alternating direction method of multipliers (ADMM) to minimize constraint violations such as link overutilization (§3.4).
3 TEAL: LEARNING-ACCELERATED TE
In this section, we present the design of Teal (Traffic Engineering Accelerated with Learning). The goal of Teal is to train a fast and scalable TE scheme through deep learning while achieving near-optimal traffic allocation on large-scale topologies. The rationale behind using deep learning is to harness the massive parallelism and hardware acceleration unlocked by neural networks. Moreover, every component of Teal is carefully designed to be parallelizable (fast on GPUs) and scalable (performant on large WANs).
3.1 Overview
We begin by outlining the workflow of Teal during deployment (Figure 3). Upon the arrival of a new traffic demand matrix or a change in link capacity¹, Teal passes the updated traffic demands and current link capacities into a novel graph neural network (GNN) that we call FlowGNN (§3.2). FlowGNN learns to transform the demands into compact feature vectors known as "embeddings," which preserve the graph structure and encode the flow information required for the downstream allocation. These flow embeddings are extracted by FlowGNN in a parallel and parameter-efficient manner that scales well with the size of the WAN topology.

¹ We note that link failures can be viewed as an extreme scenario of capacity change, where the capacity of a failed link is reduced to zero.

Figure 3: Workflow of Teal. Teal inputs traffic demands and link capacities into FlowGNN to learn flow embeddings (§3.2), which are then mapped to initial traffic allocations through multi-agent RL (§3.3). ADMM subsequently fine-tunes the allocations and mitigates constraint violations (§3.4).

In the widely adopted path formulation of TE (details in Appendix A), each traffic demand is split into multiple flows over a set of preconfigured paths (e.g., 4 shortest paths [2, 46]). To determine the split ratios of a given demand, Teal aggregates the
embeddings learned for each flow of the demand and inputs them into a shared policy (neural) network. The policy network, which allocates demands independently, is trained offline to coordinate flow allocations toward optimizing a global TE objective (e.g., total flow), through a multi-agent reinforcement learning (RL) algorithm we customize for TE (§3.3). This design enables processing demands individually rather than the entire traffic matrix at once, making the policy network more compact (in terms of the parameters to learn) and oblivious to the WAN topology size.
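
To illustrate the idea (not Teal's exact architecture), the following PyTorch sketch shows a shared per-demand policy network that maps a demand's flow embeddings to split ratios over its preconfigured paths; the layer sizes and the softmax output head are our assumptions.

```python
import torch
import torch.nn as nn

class PerDemandPolicy(nn.Module):
    """A shared policy network applied to every demand independently.

    Hypothetical sketch: dimensions and architecture are illustrative,
    not Teal's actual design.
    """
    def __init__(self, embed_dim: int = 16, num_paths: int = 4):
        super().__init__()
        # Input: concatenated embeddings of the demand's `num_paths` flows.
        self.net = nn.Sequential(
            nn.Linear(num_paths * embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_paths),
        )

    def forward(self, flow_embeddings: torch.Tensor) -> torch.Tensor:
        # flow_embeddings: [num_demands, num_paths, embed_dim]
        logits = self.net(flow_embeddings.flatten(start_dim=1))
        # Softmax yields per-demand split ratios that sum to 1, so each
        # demand constraint is respected by construction.
        return torch.softmax(logits, dim=-1)

policy = PerDemandPolicy()
# All demands are allocated in one batched GPU pass, regardless of topology size.
split_ratios = policy(torch.randn(1000, 4, 16))  # -> [1000, 4]
```

Because the same parameters are applied to every demand, the network's size is independent of the topology, and adding demands only grows the batch dimension.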
At this point, the split ratios output by the policy network might still violate certain link capacity constraints, resulting in dropped traffic and suboptimal TE performance. To mitigate constraint violations and enhance the solution quality, Teal augments the neural networks (FlowGNN and policy network) with 2–5 rapid iterations of a classical constrained optimization algorithm—the alternating direction method of multipliers (ADMM) (§3.4). During each iteration, ADMM starts from a potentially infeasible TE solution with capacity violations and advances toward the feasible region, fine-tuning traffic splits to meet more constraints and improve the overall TE performance. Each iteration of ADMM is inherently parallel. When warm-started with a reasonably good solution, such as the one generated by the neural networks, ADMM can attain a noticeable improvement in performance within several fast iterations.
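
For intuition about this fine-tuning step, the following NumPy sketch runs a few ADMM iterations on a simplified stand-in problem: find path flows x close to the network's output x0 subject to link capacity constraints Ax ≤ c, where A is a (hypothetical) path-to-link incidence matrix. Teal's actual ADMM decomposition of the TE problem (§3.4) differs; this only illustrates the warm-started, few-iteration structure.

```python
import numpy as np

def admm_finetune(x0, A, c, rho=1.0, num_iters=5):
    """Toy fine-tuning: min ||x - x0||^2  s.t.  A @ x <= c, via ADMM with
    f(x) = ||x - x0||^2, g(z) = indicator(z <= c), and constraint A @ x = z."""
    n, m = A.shape[1], A.shape[0]
    z = np.minimum(A @ x0, c)            # start z inside the feasible set
    u = np.zeros(m)                      # scaled dual variable
    # The x-update solves the same linear system each iteration.
    M = 2.0 * np.eye(n) + rho * A.T @ A
    x = x0.copy()
    for _ in range(num_iters):
        x = np.linalg.solve(M, 2.0 * x0 + rho * A.T @ (z - u))
        z = np.minimum(A @ x + u, c)     # projection onto {z <= c}
        u = u + A @ x - z                # dual ascent on the residual
    return x

# Hypothetical example: 3 path flows sharing 2 links.
A = np.array([[1.0, 1.0, 0.0],           # link 1 carries paths 1 and 2
              [0.0, 1.0, 1.0]])          # link 2 carries paths 2 and 3
x0 = np.array([0.8, 0.9, 0.7])           # neural output; overloads both links
x = admm_finetune(x0, A, c=np.array([1.0, 1.0]))
print(x, A @ x)  # flows nudged so link loads approach the capacities
```

Note how the warm start matters: starting near a good solution lets a handful of iterations suffice, and the updates are simple matrix operations that parallelize well.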
For each WAN topology, Teal trains its "model"—FlowGNN and policy network—end to end to optimize an operator-provided TE objective; ADMM requires no training. All three key components of Teal (FlowGNN, multi-agent RL, and ADMM) are carefully designed to be highly parallelizable, enabling fast computation and scalable TE performance as the size of the WAN topology grows.
3.2 Feature learning with FlowGNN
In light of the graph-based structure of WAN topologies, Teal leverages graph neural networks (GNNs) for feature learning. GNNs are a family of neural networks designed to handle graph-structured data [50] and have found applications in various domains, including network planning [71], social networks [9, 42], and traffic prediction [36].

GNNs typically store information in graph attributes, commonly in nodes, using a compact vector representation known as embeddings [18]. To preserve graph connectivity in the embeddings, neighboring nodes in the GNN exchange information through "message passing" [15]: each node collects the embeddings from adjacent nodes, transforms the aggregated embeddings using a learned transformation function (e.g., encoded in a fully-connected neural network), and updates its own embedding with the result. GNNs are intrinsically parallel as message passing occurs simultaneously across nodes. Applying message passing once constitutes one GNN layer, and stacking multiple GNN layers allows information to be disseminated multiple hops away. It is noteworthy that GNNs are parameter-efficient because each layer shares the same transformation function, which operates in the low-dimensional embedding space and does not grow in proportion to the input graph size.
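
For illustration, the PyTorch sketch below implements one generic message-passing layer of the kind described above, using mean aggregation over neighbors; the aggregation function, dimensions, and names are our assumptions rather than FlowGNN's exact design.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One generic GNN layer: aggregate neighbor embeddings, then transform.

    Illustrative sketch; FlowGNN's actual layers operate on EdgeNodes and
    PathNodes (described next) and may aggregate differently.
    """
    def __init__(self, dim: int = 16):
        super().__init__()
        # A single transformation shared by all nodes keeps the layer
        # parameter-efficient: its size is independent of the graph size.
        self.transform = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: [num_nodes, dim] embeddings; adj: [num_nodes, num_nodes] 0/1 matrix.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = (adj @ h) / deg      # all nodes aggregate in parallel
        return self.transform(torch.cat([h, neighbor_mean], dim=-1))

# Stacking k layers propagates information k hops away.
layer = MessagePassingLayer()
h = torch.randn(5, 16)                       # 5 nodes, 16-dim embeddings
adj = torch.tensor([[0, 1, 1, 0, 0], [1, 0, 0, 1, 0], [1, 0, 0, 1, 0],
                    [0, 1, 1, 0, 1], [0, 0, 0, 1, 0]], dtype=torch.float)
h = layer(layer(h, adj), adj)                # two hops of message passing
```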
Despite the strengths of GNNs, the primary focus of TE is the assignment of flows, in which each flow originating from a demand follows a predefined path along a chain of network links (edges). TE is concerned with the interference between flows as they compete for link capacities. Hence, we put a spotlight on flows and explicitly represent flow-related entities—edges and paths—as the nodes in our TE-specific GNN, which we call FlowGNN.

Figure 4 exemplifies the construction of a FlowGNN. At a high level, FlowGNN alternates between GNN layers aimed at capturing capacity constraints, and DNN layers aimed at capturing demand constraints, which dictate that the total volume of all flows derived from a demand should not exceed the demand itself (formal formulation in Appendix A).

Figure 4: Illustration of a FlowGNN construction. FlowGNN alternates between GNN layers designed to capture capacity constraints and DNN layers intended to capture demand constraints.

The GNN layer in FlowGNN is structured as a bipartite graph. For each edge in the input topology, we create an "EdgeNode" (e.g., V₁₃ for the edge connecting nodes #1 and #3), and for each preconfigured path associated with a demand, we create a "PathNode" (e.g., V₁₃₄ for the path containing nodes #1, #3, and #4). An EdgeNode and a PathNode are connected in the GNN layer if and only if the edge lies on the path (e.g., V₁₃ is connected to V₁₃₄ but not to V₁₂₄). The intuition behind this setup is to allow EdgeNodes and PathNodes to interact and learn to respect capacity constraints during message passing. For example, when an edge serves as a bottleneck for competing flows, the EdgeNode's embedding will be influenced by its neighboring PathNodes. During initialization, the embedding of an EdgeNode is initialized with the capacity of the corresponding edge,