
remain iterative in nature, incrementally allocating more flows until the solution quality is deemed adequate (yet suboptimal), which often requires an excessive number of iterations to terminate.
Decomposing TE into subproblems. Recently, NCFlow [2] and POP [46] introduced techniques to decompose TE optimization into subproblems, applying LP solvers simultaneously in each subproblem and merging their results at the end. NCFlow partitions the network spatially into 𝑘 clusters, whereas POP creates 𝑘 replicas of the network, each with 1/𝑘 link capacities, and randomly assigns traffic demands to these replicas. Although a larger 𝑘 reduces the overall run time, it also fundamentally impairs the TE performance. Moreover, NCFlow requires nontrivial coordination during the merging process. Consequently, NCFlow and POP are forced to adopt small values of 𝑘 (e.g., 64–81 on a network of 754 nodes). In §5, we show that NCFlow and POP are substantially slower than our learning-accelerated approach, while also delivering notably worse allocation performance.
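To make the decomposition concrete, here is a minimal sketch of POP-style splitting. The helper pop_split is illustrative only, not POP's actual API; the real systems involve more machinery (e.g., client splitting in POP and inter-cluster coordination in NCFlow):

```python
import random

def pop_split(demands, capacities, k):
    """POP-style decomposition (sketch): build k network replicas,
    each with 1/k of every link's capacity, and randomly assign
    each traffic demand to one replica."""
    replica_caps = {link: c / k for link, c in capacities.items()}
    subproblems = [[] for _ in range(k)]
    for demand in demands:  # demand = (src, dst, volume)
        subproblems[random.randrange(k)].append(demand)
    # Each (subproblems[i], replica_caps) pair can now be handed to an
    # independent LP solver; results are merged by summing the
    # per-replica allocations on each link.
    return [(sub, dict(replica_caps)) for sub in subproblems]
```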
2.2 Accelerating TE optimization with ML
To cope with the growing scale of TE, we argue that with judicious design, machine learning (ML) can significantly accelerate TE optimization. By training on a vast amount of historical traffic data, ML-based TE schemes also have the potential to attain near-optimal allocation performance.
Unlocking massive parallelism. Encoding a TE algorithm in neural networks transforms the traditionally iterative TE optimization (LP solvers or combinatorial algorithms) into the inference process of neural networks, where the input data (e.g., traffic demands on a WAN topology) is propagated in the forward direction through the neural network to compute the output (e.g., traffic splits on the preconfigured paths). This inference process unlocks massive parallelism because it mainly consists of highly parallelizable operations such as matrix multiplications.
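As a toy illustration of this point (a minimal sketch assuming PyTorch; the per-demand MLP and its sizes are invented and bear no relation to Teal's actual model), a batch of one million demands can be mapped to split ratios in a single forward pass dominated by matrix multiplications:

```python
import torch

# Hypothetical sizes: ~1e6 demands on a 1,000-node WAN, 4 paths each.
num_demands, hidden, num_paths = 1_000_000, 64, 4

model = torch.nn.Sequential(          # a toy per-demand MLP, not Teal
    torch.nn.Linear(1, hidden),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden, num_paths),
)

demands = torch.rand(num_demands, 1)        # traffic volumes
logits = model(demands)                     # one forward pass: matmuls
splits = torch.softmax(logits, dim=-1)      # per-demand split ratios
```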
Leveraging hardware acceleration. Modern deep learning frameworks [48, 56, 57] have empowered neural networks to easily leverage thousands of threads on GPUs (or other specialized hardware [16]). They can greatly accelerate the computation of learning-based TE systems compared with state-of-the-art schemes [2, 46], which are fundamentally limited to tens of parallel workers. In addition, the deep learning community has integrated various optimization techniques [24, 27, 57] into these frameworks, further accelerating neural network inference.
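In the same toy spirit as the sketch above (PyTorch is assumed here; any of the cited frameworks would serve), placing a model and its inputs on a GPU is a one-line change, and inference mode skips autograd bookkeeping:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy model and inputs (hypothetical sizes), placed on the accelerator.
model = torch.nn.Linear(64, 4).to(device)
demands = torch.rand(100_000, 64, device=device)

with torch.inference_mode():          # disable autograd bookkeeping
    splits = torch.softmax(model(demands), dim=-1)
```

PyTorch 2.x additionally offers graph-level optimizations such as torch.compile, which can further reduce inference latency.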
Exploiting abundant training data. Operational WANs generate an abundance of traffic data that can be used to train neural networks. A carefully designed ML-based TE scheme is capable of discovering regularities in the training data (such as patterns in graph connectivity, link capacities, and traffic demands) and ultimately learns to optimize allocation performance with respect to operator-specified objectives.
2.3 Challenges of applying ML to TE
While holding the promise of near-optimal performance and sub-
stantial speedup relative to LP-based TE methods, deep learning
is not a panacea. In fact, using ML for TE optimization is not as
straightforward as it may appear.
Graph connectivity and network flows. Naively using vanilla fully-connected neural networks for TE optimization would ignore the connectivity in the WAN topology. While graph neural networks (GNNs) [50, 63], designed for graph-structured input data, can model traditional graph attributes such as nodes and edges, their unmodified form is inadequate for modeling network flows, the focal point of TE optimization.
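A schematic of one round of standard message passing illustrates the gap (weights and dimensions below are arbitrary; this is generic GNN pseudocode, not any specific model from [50, 63]): nodes exchange and aggregate embeddings along edges, but a flow, i.e., a demand routed from its source to its destination over several hops, never appears as an entity of its own.

```python
import torch

def message_passing(node_emb, edges, W_self, W_msg):
    """One round of vanilla GNN message passing (schematic): every
    node aggregates its neighbors' embeddings. Nodes and edges are
    first-class citizens; end-to-end flows have no representation."""
    agg = torch.zeros_like(node_emb)
    for u, v in edges:               # sum incoming messages per node
        agg[v] += node_emb[u]
    return torch.relu(node_emb @ W_self + agg @ W_msg)

# Example: 4 nodes in a line topology, 8-dimensional embeddings.
emb = torch.randn(4, 8)
edges = [(0, 1), (1, 2), (2, 3), (1, 0), (2, 1), (3, 2)]
out = message_passing(emb, edges, torch.randn(8, 8), torch.randn(8, 8))
```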
High-dimensional solution space. In the path formulation of TE widely adopted in practice (details in Appendix A), the TE controller splits each demand across a handful of preconfigured paths, such as 4 shortest paths. Therefore, representing the flow allocation for a topology of 𝑁 nodes requires 𝑂(𝑁²) split ratios. To put this into context, a topology with 1,000 nodes has roughly one million source–destination demands; with 4 paths per demand, the solution space would contain up to 4 million dimensions, exposing ML-based TE methods to the “curse of dimensionality” [32].
Constrained optimization. Unlike LP solvers, neural networks are known to lack the capability to enforce constraints on their outputs [37]. As a result, the traffic allocations generated by neural networks may exceed certain link capacities when deployed directly, leading to network congestion and degraded TE performance.
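A small numerical sketch of the mismatch (toy topology and numbers, invented for illustration): a softmax output layer guarantees that each demand's split ratios sum to one, yet the load those splits induce on a link is unconstrained:

```python
import torch

# Toy example: 3 demands, 2 candidate paths each.
volumes = torch.tensor([5.0, 3.0, 4.0])              # demand volumes
splits = torch.softmax(torch.randn(3, 2), dim=-1)    # rows sum to 1
uses_link = torch.tensor([[1., 0.],                  # does path p of
                          [1., 1.],                  # demand d traverse
                          [0., 1.]])                 # the bottleneck link?

load = (volumes[:, None] * splits * uses_link).sum()
print(f"link load {load:.2f} vs. capacity 6.0")      # may well exceed 6.0
```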
To tackle the above challenges, we propose the following designs: i) a flow-centric GNN (called “FlowGNN”) to capture WAN connectivity and model network flows (§3.2); ii) a multi-agent reinforcement learning (RL) algorithm that allocates each traffic demand independently to reduce the problem scale and make learning tractable (§3.3); iii) solution fine-tuning using the alternating direction method of multipliers (ADMM) to minimize constraint violations such as link overutilization (§3.4; the sketch below gives the general flavor of ADMM).
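To convey the flavor of the third design point, below is a generic ADMM instance, a hedged sketch rather than Teal's formulation (which §3.4 presents): it nudges a tentative allocation x0 toward the feasible set {x : Ax ≤ b}, where rows of A and entries of b would encode link-capacity constraints, while staying close to x0.

```python
import numpy as np

def admm_project(x0, A, b, rho=1.0, iters=100):
    """Generic ADMM sketch for  min ||x - x0||^2  s.t.  Ax + s = b, s >= 0,
    i.e., move a tentative allocation x0 toward satisfying Ax <= b."""
    m, n = A.shape
    x, s, u = x0.copy(), np.zeros(m), np.zeros(m)
    # Precompute the x-update system: (2I + rho A^T A) x = rhs.
    M = 2 * np.eye(n) + rho * A.T @ A
    for _ in range(iters):
        rhs = 2 * x0 + rho * A.T @ (b - s - u)
        x = np.linalg.solve(M, rhs)                 # x-update
        s = np.maximum(0.0, -(A @ x - b + u))       # slack stays nonnegative
        u = u + A @ x - b + s                       # scaled dual update
    return x

# Example: two flows sharing one unit-capacity link; (0.8, 0.7) is
# infeasible and gets pulled back toward x1 + x2 <= 1.
x = admm_project(np.array([0.8, 0.7]), np.array([[1.0, 1.0]]), np.array([1.0]))
```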
3 TEAL: LEARNING-ACCELERATED TE
In this section, we present the design of Teal (Traffic Engineering Accelerated with Learning). The goal of Teal is to train a fast and scalable TE scheme through deep learning while achieving near-optimal traffic allocation on large-scale topologies. The rationale behind using deep learning is to harness the massive parallelism and hardware acceleration unlocked by neural networks. Moreover, every component of Teal is carefully designed to be parallelizable (fast on GPUs) and scalable (performant on large WANs).
3.1 Overview
We begin by outlining the workow of Teal during deployment
(Figure 3). Upon the arrival of a new trac demand matrix or a
change in link capacity
1
,Teal passes the updated trac demands
and current link capacities into a novel graph neural network (GNN)
that we call FlowGNN (§3.2). FlowGNN learns to transform the
demands into compact feature vectors known as “embeddings,”
which preserve the graph structure and encode the ow information
required for the downstream allocation. These ow embeddings are
extracted by FlowGNN in a parallel and parameter-ecient manner
that scales well with the size of the WAN topology.
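This deployment loop can be summarized in pseudocode; the sketch below uses hypothetical names (teal_step, flow_gnn, alloc_head), not Teal's actual API:

```python
import torch

def teal_step(demands, capacities, flow_gnn, alloc_head, num_paths=4):
    """Illustrative deployment step: map the updated demands and
    capacities to flow embeddings, then to per-demand split ratios."""
    flow_emb = flow_gnn(demands, capacities)     # embeddings (§3.2)
    logits = alloc_head(flow_emb)                # one score per path
    return torch.softmax(logits.view(-1, num_paths), dim=-1)
```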
In the widely adopted path formulation of TE (details in Appendix A), each traffic demand is split into multiple flows over a set of preconfigured paths (e.g., 4 shortest paths [2, 46]). To determine the split ratios of a given demand, Teal aggregates the
¹We note that link failures can be viewed as an extreme scenario of capacity change, where the capacity of a failed link is reduced to zero.
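For reference, the preconfigured path sets mentioned above are typically computed offline. A minimal sketch of one common choice, the k shortest paths, is shown below; preconfigured_paths is a hypothetical helper and networkx is an assumed dependency, not Teal's code:

```python
import itertools
import networkx as nx

def preconfigured_paths(G, src, dst, k=4):
    """The k shortest simple paths (Yen's algorithm via networkx),
    a common choice for a demand's preconfigured path set."""
    gen = nx.shortest_simple_paths(G, src, dst)
    return list(itertools.islice(gen, k))
```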