
Multi-Agent Path Finding via Tree LSTM
Yuhao Jiang*, Kunjie Zhang*, Qimai Li*,
Jiaxin Chen†, Xiaolong Zhu
Parametrix.AI
Shenzhen, Guangdong, China
{yuhaojiang, ericzhang, qimaili, jiaxinchen, xiaolongzhu}@chaocanshu.ai
Abstract
In recent years, Multi-Agent Path Finding (MAPF) has attracted attention from the fields of both Operations Research (OR) and Reinforcement Learning (RL). However, in the 2021 Flatland3 Challenge, a competition on MAPF, the best RL method scored only 27.9, far below the best OR method. This paper proposes a new RL solution to the Flatland3 Challenge, which scores 125.3, several times higher than the previous best RL solution. In our solution, we apply a tree-structured network architecture, TreeLSTM, to MAPF. Together with several other RL techniques, including reward shaping, multiple-phase training, and centralized control, our solution achieves performance comparable to the top 2-3 OR methods.
1 Introduction
Multi-agent path finding (MAPF), i.e., finding collision-free paths for multiple agents on a graph, is a long-standing combinatorial problem. On undirected graphs, a feasible solution can be found in polynomial time, but finding an optimal (fastest) solution is NP-hard. On directed graphs, even finding a feasible solution is NP-hard in the general case (Nebel 2020). Despite the great challenges of MAPF, it is of great social and economic value, because many real-life scheduling and planning problems can be formulated as MAPF problems. Many operations research (OR) algorithms have been proposed to efficiently find sub-optimal solutions (Cohen et al. 2019; Švancara et al. 2019; Ma et al. 2018; Ma, Kumar, and Koenig 2017).
As an important branch of decision theory, reinforcement learning has attracted a lot of attention in recent years due to its super-human performance in many complex games, as demonstrated by AlphaGo (Silver et al. 2017), AlphaStar (Vinyals et al. 2019), and OpenAI Five (Berner et al. 2019). Inspired by
the tremendous successes of RL on these complex games
and decision scenarios, multi-agent reinforcement learning
(MARL) is widely expected to work on MAPF problems
as well. We study MARL in MAPF problems and aim to
provide high-performance and scalable RL solutions. To de-
velop and test our RL solution, we focus on a specific MAPF
environment, Flatland.
*These authors contributed equally.
†Corresponding author
Flatland (Mohanty et al. 2020; Laurent et al. 2021) is a train schedule simulator developed by the Swiss Federal Railways (SBB). It simulates real-world trains and rail networks and serves as an excellent environment for testing different MAPF algorithms. Since 2019, SBB has successfully held three Flatland challenges, attracting more than 200 teams worldwide and receiving thousands of submissions and over one million views. The key reasons why we focus on this environment are as follows (a schematic interaction loop with the simulator is shown after the list):
• Support for Massive Numbers of Agents: On maps of the maximum size, up to hundreds of trains need to be planned.
• Directed Graphs and Conflicts between Agents: Trains CANNOT move backward, and no decision is revocable. Deadlocks occur if the trains are not well planned (Figure 5), which makes this problem very challenging.
• Lack of High-Performance RL Solutions: Existing RL solutions are at a significant disadvantage compared with OR algorithms (scores of 27.9 vs. 141.0).
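For concreteness, the following is a schematic interaction loop with the flatland-rl simulator. Constructor arguments, default generators, and return signatures differ slightly across Flatland versions, so the exact calls below (e.g., sparse_rail_generator with default arguments, the map size, and the agent count) should be read as an illustrative sketch rather than a verified snippet.

from flatland.envs.rail_env import RailEnv
from flatland.envs.rail_generators import sparse_rail_generator

# Build a small sparse rail network with 10 trains (sizes and counts are illustrative).
env = RailEnv(
    width=30,
    height=30,
    rail_generator=sparse_rail_generator(),
    number_of_agents=10,
)
obs, info = env.reset()

dones = {"__all__": False}
while not dones["__all__"]:
    # One action per agent handle. Trains cannot reverse; at switches the action
    # selects among outgoing rails: 0 = DO_NOTHING, 1 = MOVE_LEFT,
    # 2 = MOVE_FORWARD, 3 = MOVE_RIGHT, 4 = STOP_MOVING.
    actions = {handle: 2 for handle in range(env.get_num_agents())}
    obs, rewards, dones, info = env.step(actions)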
To solve Flatland, we propose an RL solution built on standard reinforcement learning algorithms applied at scale. The critical components of our RL solution are (1) the application of a TreeLSTM network architecture to process the tree-structured local observations of each agent, (2) a centralized control method that promotes cooperation between agents, and (3) our optimized feature parser, which is 20x faster.
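To illustrate component (1), the sketch below shows a minimal child-sum TreeLSTM cell (Tai et al. 2015) in PyTorch that recursively encodes a tree of per-node feature vectors into a single root embedding, which can then be fed to policy and value heads. The class names, feature layout, and dimensions are illustrative assumptions and do not mirror our exact implementation.

import torch
import torch.nn as nn

class TreeNode:
    """Illustrative tree-structured observation: a feature vector plus child nodes."""
    def __init__(self, features, children=()):
        self.features = features          # tensor of shape (input_dim,)
        self.children = list(children)

class ChildSumTreeLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.iou = nn.Linear(input_dim + hidden_dim, 3 * hidden_dim)  # input/output/update gates
        self.f_x = nn.Linear(input_dim, hidden_dim)                   # forget gate, input part
        self.f_h = nn.Linear(hidden_dim, hidden_dim)                  # forget gate, per-child part

    def forward(self, node):
        # Recursively encode children, then sum their hidden states (child-sum).
        child_states = [self.forward(child) for child in node.children]
        if child_states:
            child_h = torch.stack([h for h, _ in child_states])       # (num_children, hidden_dim)
            child_c = torch.stack([c for _, c in child_states])
        else:
            child_h = torch.zeros(0, self.hidden_dim)
            child_c = torch.zeros(0, self.hidden_dim)
        h_sum = child_h.sum(dim=0)

        x = node.features
        i, o, u = self.iou(torch.cat([x, h_sum])).chunk(3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        # One forget gate per child, conditioned on that child's own hidden state.
        f = torch.sigmoid(self.f_x(x) + self.f_h(child_h))
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c

# Example: encode a two-level tree of 8-dimensional node features into a 16-dim embedding.
root = TreeNode(torch.randn(8), children=[TreeNode(torch.randn(8)), TreeNode(torch.randn(8))])
h, c = ChildSumTreeLSTM(input_dim=8, hidden_dim=16)(root)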
Our contributions can be summarized as follows. (1) We propose an RL solution consisting of domain-specific feature extraction, a curriculum-style design of training phases, a TreeLSTM network to process the structured observations, and a 20x faster feature parser that improves sample efficiency. (2) Our observed strategies and performance show the potential of RL algorithms in MAPF problems. We find that standard RL methods coupled with domain-specific engineering can achieve performance comparable to OR algorithms (on par with the 2nd–3rd place OR solutions). Our solution provides implementation insights for MARL research on MAPF. (3) We open-sourced1 our solution and the optimized feature parser for further research on multi-agent reinforcement learning in MAPF problems.
1 https://github.com/liqimai/flatland-marl