Multi-Agent Path Finding via Tree LSTM
Yuhao Jiang*, Kunjie Zhang*, Qimai Li*,
Jiaxin Chen†, Xiaolong Zhu
Parametrix.AI
Shenzhen, Guangdong, China
{yuhaojiang, ericzhang, qimaili, jiaxinchen, xiaolongzhu}@chaocanshu.ai
Abstract
In recent years, Multi-Agent Path Finding (MAPF) has at-
tracted attention from the fields of both Operations Research
(OR) and Reinforcement Learning (RL). However, in the
2021 Flatland3 Challenge, a competition on MAPF, the best
RL method scored only 27.9, far less than the best OR
method. This paper proposes a new RL solution to the Flatland3
Challenge, which scores 125.3, several times higher than the
best RL solution before. We creatively apply a novel network
architecture, TreeLSTM, to MAPF in our solution. Together
with several other RL techniques, including reward shaping,
multiple-phase training, and centralized control, our solution
is comparable to the top 2-3 OR methods.
1 Introduction
Multi-agent path finding (MAPF), i.e., finding collision-
free paths for multiple agents on a graph, has been a long-
standing combinatorial problem. On undirected graphs, a
feasible solution can be found in polynomial time, but find-
ing the fastest solution is NP-hard. On directed graphs,
even finding a feasible solution is NP-hard in the general case
(Nebel 2020). Despite the great challenges of MAPF, it
is of great social and economic value because many real-
life scheduling and planning problems can be formulated
as MAPF problems. Many operations research (OR) algorithms
have been proposed to efficiently find sub-optimal solutions
(Cohen et al. 2019; Švancara et al. 2019; Ma et al. 2018;
Ma, Kumar, and Koenig 2017).
As an important branch in decision theory, reinforcement
learning has attracted a lot of attention these years due to
its super-human performance in many complex games, demonstrated
by agents such as AlphaGo (Silver et al. 2017), AlphaStar (Vinyals et al.
2019), and OpenAI Five (Berner et al. 2019). Inspired by
the tremendous successes of RL on these complex games
and decision scenarios, multi-agent reinforcement learning
(MARL) is widely expected to work on MAPF problems
as well. We study MARL in MAPF problems and aim to
provide high-performance and scalable RL solutions. To de-
velop and test our RL solution, we focus on a specific MAPF
environment, Flatland.
*These authors contributed equally.
†Corresponding author
Copyright © 2023, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Flatland (Mohanty et al. 2020; Laurent et al. 2021) is
a train schedule simulator developed by the Swiss Federal
Railway Company (SBB). It simulates trains and rail net-
works in the real world and serves as an excellent environ-
ment for testing different MAPF algorithms. Since 2019,
SBB has successfully held three flatland challenges, attract-
ing more than 200 teams worldwide, receiving thousands of
submissions and over one million views. The key reasons
why we focus on this environment are as follows:
Support for Massive Agents: On the largest maps, up to hundreds of trains need to be planned.
Directed Graphs and Conflicts between Agents: Trains CANNOT move backward, and no decision is revocable. Deadlocks occur if the trains are not well planned (Figure 5), which makes the problem very challenging.
Lack of High-Performance RL Solutions: Existing RL solutions show a significant disadvantage compared with OR algorithms (scores of 27.9 vs. 141.0).
To solve Flatland, we propose an RL solution that applies standard reinforcement learning algorithms at scale. The critical components of our RL solution are (1) a TreeLSTM network architecture that processes the tree-structured local observation of each agent, (2) a centralized control method that promotes cooperation between agents, and (3) an optimized feature parser that is 20x faster.
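To make the TreeLSTM component concrete, the sketch below implements one common formulation, a Child-Sum TreeLSTM cell, in PyTorch. It is only an illustration of how a node embedding can be computed from a variable number of children; the class name, dimensions, and exact gating variant here are illustrative and may differ from the network described later in the paper.

```python
# Illustrative Child-Sum TreeLSTM cell (a sketch, not the exact architecture
# used in our solution). Each tree node sums its children's hidden states and
# gates each child's memory cell with its own forget gate, so the cell accepts
# an arbitrary number of children -- convenient for tree-structured observations.
import torch
import torch.nn as nn


class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.ioux = nn.Linear(input_dim, 3 * hidden_dim)   # input/output/update gates from x
        self.iouh = nn.Linear(hidden_dim, 3 * hidden_dim)  # input/output/update gates from h
        self.fx = nn.Linear(input_dim, hidden_dim)          # forget gate contribution from x
        self.fh = nn.Linear(hidden_dim, hidden_dim)         # forget gate, one per child

    def forward(self, x, child_h, child_c):
        # x:       (input_dim,)                 feature vector of the current node
        # child_h: (num_children, hidden_dim)   hidden states of the children
        # child_c: (num_children, hidden_dim)   memory cells of the children
        # For a leaf node, pass empty (0, hidden_dim) tensors.
        h_sum = child_h.sum(dim=0)                                      # sum of children's hidden states
        i, o, u = torch.chunk(self.ioux(x) + self.iouh(h_sum), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.fx(x).unsqueeze(0) + self.fh(child_h))   # one forget gate per child
        c = i * u + (f * child_c).sum(dim=0)                            # new memory cell
        h = o * torch.tanh(c)                                           # new hidden state
        return h, c
```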
Our contributions can be summarized as follows. (1) We propose an RL solution consisting of domain-specific feature extraction, a curriculum-style design of training phases, a TreeLSTM network to process the structured observations, and a 20x faster feature parser to improve sample efficiency. (2) Our observed strategies and performance show the potential of RL algorithms in MAPF problems. We find that standard RL methods, coupled with domain-specific engineering, can achieve performance comparable to OR algorithms (on par with the 2nd and 3rd best OR methods). Our solution provides implementation insights for MARL in the MAPF research community. (3) We open-sourced¹ our solution and the optimized feature parser for further research on multi-agent reinforcement learning in MAPF problems.
¹https://github.com/liqimai/flatland-marl
Figure 1: A 30×30 flatland map. There are two stations and
two trains. The lines connecting trains and stations indicate
trains’ targets.
2 Flatland3 Environment
Flatland is a simplified world of rail networks in which sta-
tions are connected by rails. Players control trains to run
from one station to another. Its newest version, Flatland3,
is governed by the following rules.
The time is discretized into timestamps from $0$ to $T_{\max}$.
There are $N$ trains and several cities. Trains are numbered from $1$ to $N$.
Trains' action space consists of five actions: {do nothing, forward, stop, left, right}. Trains are not allowed to move backward and must go along the rails.
Trains have different speeds. Each train $i$ has its own speed $s_i$ and can move one step every $1/s_i$ turns. The time cost of one step, $1/s_i$, is guaranteed to be an integer, and the largest possible speed is $1$, i.e., one step per turn.
Each train $i$ has an earliest departure time $A_i$ and a latest arrival time $B_i$. It can depart from its initial station only after its earliest departure time $A_i$ and should try its best to arrive at its target station before its latest arrival time $B_i$.
Trains randomly break down (malfunction) while running or waiting for departure. After a breakdown, a train must stay still for a period of time before it can move again.
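For concreteness, the per-train quantities above can be summarized in a small data structure. The sketch below is illustrative only; the class, field, and action names are ours and are not part of the Flatland API.

```python
# Illustrative encoding of a train's schedule constraints and action space
# as defined by the rules above (hypothetical names, not the Flatland API).
from dataclasses import dataclass
from enum import IntEnum


class Action(IntEnum):
    DO_NOTHING = 0
    FORWARD = 1
    STOP = 2
    LEFT = 3
    RIGHT = 4


@dataclass
class TrainSpec:
    speed: float              # s_i; 1/s_i is guaranteed to be an integer, s_i <= 1
    earliest_departure: int   # A_i: may depart only at or after this timestamp
    latest_arrival: int       # B_i: should arrive no later than this timestamp

    @property
    def turns_per_step(self) -> int:
        # Number of turns needed to advance one cell, 1/s_i.
        return round(1 / self.speed)
```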
Reward.  The goal is to control all trains to reach their target stations before their latest arrival times $B_i$. Every train $i$ receives a reward $R_i$ in the end; $T_i$ denotes the arrival time of train $i$.
If a train arrives on time, then it scores 0.
If it arrives late, it gets a negative reward $B_i - T_i$ as a penalty, according to how late it is.
If a train does not manage to arrive at its target before the end time $T_{\max}$, then the penalty consists of two parts, a temporal penalty and a spatial penalty. The temporal penalty is $B_i - T_{\max}$, reflecting how late it is. The spatial penalty is decided by the shortest-path distance $d_i^{(T_{\max})}$ between its final location at time $T_{\max}$ and its target.

Figure 2: Different types of rail cells.

Formally, $R_i$ is defined as
$$
R_i =
\begin{cases}
0, & \text{if } T_i \le B_i;\\
B_i - T_i, & \text{if } B_i < T_i \le T_{\max};\\
B_i - T_{\max} - d_i^{(T_{\max})}, & \text{if } T_i > T_{\max},
\end{cases}
\tag{1}
$$
where $d_i^{(T_{\max})}$ is the distance between train $i$ and its target at time $T_{\max}$,
$$
d_i^{(t)} = d\!\left(\big(x_i^{(t)},\, y_i^{(t)}\big),\ \mathrm{target}_i\right). \tag{2}
$$
Our goal is to maximize the sum of individual rewards,
$$
R = \sum_{i=1}^{N} R_i. \tag{3}
$$
Clearly, $R$ is always non-positive, and $R = 0$ if and only if all trains reach their targets on time. $|R|$ can be arbitrarily large, as long as the map is sufficiently large and the algorithm performs sufficiently badly.
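The reward in Eqs. (1) and (3) translates directly into code. The sketch below assumes each train's arrival time, latest arrival time, and remaining shortest-path distance at $T_{\max}$ are already known; the function and argument names are ours.

```python
# Direct transcription of Eq. (1). T_i is the arrival time of train i
# (use float('inf') for a train that never arrives), B_i its latest arrival
# time, T_max the episode horizon, and d_i the shortest-path distance between
# the train's position at time T_max and its target (used only if it never arrives).
def train_reward(T_i: float, B_i: int, T_max: int, d_i: int) -> float:
    if T_i <= B_i:            # arrived on time
        return 0.0
    if T_i <= T_max:          # arrived late: temporal penalty B_i - T_i
        return B_i - T_i
    return B_i - T_max - d_i  # never arrived: temporal + spatial penalty


# Eq. (3): the total reward is the sum of the per-train rewards.
def total_reward(arrivals, latest, T_max, distances) -> float:
    return sum(train_reward(T_i, B_i, T_max, d_i)
               for T_i, B_i, d_i in zip(arrivals, latest, distances))
```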
Normalized Reward.  The magnitude of the total reward $R$ greatly relies on the problem scale, such as the number of trains, the number of cities, the speeds of trains, and the map size. To make rewards of different problem scales comparable, they are normalized as follows:
$$
\bar{R} = 1 + \frac{R}{N \, T_{\max}}, \tag{4}
$$
where $N$ is the number of trains. The environment-generating procedure guarantees $\bar{R} \in [0, 1]$ by adjusting $T_{\max}$. The normalized reward serves as the standard criterion for testing algorithms.
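Eq. (4) is likewise a one-line computation (again with names of our own choosing):

```python
# Eq. (4): normalize the total reward R by the problem scale. The environment
# generator adjusts T_max so that R >= -N * T_max, hence the result lies in [0, 1].
def normalized_reward(R: float, N: int, T_max: int) -> float:
    return 1 + R / (N * T_max)
```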
3 Our Approach
The RL solution we provide is cooperative multi-agent re-
inforcement learning. Each agent independently observes a
part of the map localized around itself and encodes the part