Multi-Agent Path Finding via Tree LSTM
Yuhao Jiang*, Kunjie Zhang*, Qimai Li*,
Jiaxin Chen†, Xiaolong Zhu
Parametrix.AI
Shenzhen, Guangdong, China
{yuhaojiang, ericzhang, qimaili, jiaxinchen, xiaolongzhu}@chaocanshu.ai
Abstract
In recent years, Multi-Agent Path Finding (MAPF) has at-
tracted attention from the fields of both Operations Research
(OR) and Reinforcement Learning (RL). However, in the
2021 Flatland3 Challenge, a competition on MAPF, the best
RL method scored only 27.9, far less than the best OR
method. This paper proposes a new RL solution to the Flatland3
Challenge, which scores 125.3, several times higher than the
best RL solution before. We creatively apply a novel network
architecture, TreeLSTM, to MAPF in our solution. Together
with several other RL techniques, including reward shaping,
multiple-phase training, and centralized control, our solution
is comparable to the top 2-3 OR methods.
1 Introduction
Multi-agent path finding (MAPF), i.e., finding collision-
free paths for multiple agents on a graph, has been a long-
standing combinatorial problem. On undirected graphs, a
feasible solution can be found in polynomial time, but find-
ing the fastest solution is NP-hard. On directed graphs,
even finding a feasible solution is NP-hard in the general case
(Nebel 2020). Despite the great challenges of MAPF, it
is of great social and economic value because many real-
life scheduling and planning problems can be formulated
as MAPF problems. Many operations research (OR) algorithms
have been proposed to efficiently find sub-optimal solutions
(Cohen et al. 2019; Švancara et al. 2019; Ma et al. 2018;
Ma, Kumar, and Koenig 2017).
As an important branch in decision theory, reinforcement
learning has attracted a lot of attention these years due to
its super-human performance in many complex games, demonstrated
by agents such as AlphaGo (Silver et al. 2017), AlphaStar (Vinyals et al.
2019), and OpenAI Five (Berner et al. 2019). Inspired by
the tremendous successes of RL on these complex games
and decision scenarios, multi-agent reinforcement learning
(MARL) is widely expected to work on MAPF problems
as well. We study MARL in MAPF problems and aim to
provide high-performance and scalable RL solutions. To de-
velop and test our RL solution, we focus on a specific MAPF
environment, Flatland.
*These authors contributed equally.
†Corresponding author
Copyright © 2023, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Flatland (Mohanty et al. 2020; Laurent et al. 2021) is
a train schedule simulator developed by the Swiss Federal
Railway Company (SBB). It simulates trains and rail net-
works in the real world and serves as an excellent environ-
ment for testing different MAPF algorithms. Since 2019,
SBB has successfully held three flatland challenges, attract-
ing more than 200 teams worldwide, receiving thousands of
submissions and over one million views. The key reasons
why we focus on this environment are as follows:
Support for Massive Agents: On the largest maps, up to hundreds of trains need to be planned.
Directed Graphs and Conflicts between Agents: Trains CANNOT move backward, and no decision is revocable. Deadlocks occur if the trains are not well planned (Figure 5), which makes the problem very challenging.
Lack of High-Performance RL Solutions: Existing RL solutions show a significant disadvantage compared with OR algorithms (scores of 27.9 vs. 141.0).
To solve Flatland, we propose an RL solution that applies standard reinforcement learning algorithms at scale. The critical components of our RL solution are (1) a TreeLSTM network architecture that processes the tree-structured local observation of each agent, (2) a centralized control method that promotes cooperation between agents, and (3) an optimized feature parser that is 20x faster.
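To make the TreeLSTM component concrete, the sketch below implements one common formulation, a Child-Sum TreeLSTM cell, in PyTorch. It is only an illustration of how a node embedding can be computed from a variable number of children; the class name, dimensions, and exact gating variant here are illustrative and may differ from the network described later in the paper.

```python
# Illustrative Child-Sum TreeLSTM cell (a sketch, not the exact architecture
# used in our solution). Each tree node sums its children's hidden states and
# gates each child's memory cell with its own forget gate, so the cell accepts
# an arbitrary number of children -- convenient for tree-structured observations.
import torch
import torch.nn as nn


class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.ioux = nn.Linear(input_dim, 3 * hidden_dim)   # input/output/update gates from x
        self.iouh = nn.Linear(hidden_dim, 3 * hidden_dim)  # input/output/update gates from h
        self.fx = nn.Linear(input_dim, hidden_dim)          # forget gate contribution from x
        self.fh = nn.Linear(hidden_dim, hidden_dim)         # forget gate, one per child

    def forward(self, x, child_h, child_c):
        # x:       (input_dim,)                 feature vector of the current node
        # child_h: (num_children, hidden_dim)   hidden states of the children
        # child_c: (num_children, hidden_dim)   memory cells of the children
        # For a leaf node, pass empty (0, hidden_dim) tensors.
        h_sum = child_h.sum(dim=0)                                      # sum of children's hidden states
        i, o, u = torch.chunk(self.ioux(x) + self.iouh(h_sum), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.fx(x).unsqueeze(0) + self.fh(child_h))   # one forget gate per child
        c = i * u + (f * child_c).sum(dim=0)                            # new memory cell
        h = o * torch.tanh(c)                                           # new hidden state
        return h, c
```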
Our contributions can be summarized as follows. (1) We propose an RL solution consisting of domain-specific feature extraction, a curriculum-style design of training phases, a TreeLSTM network to process the structured observations, and a 20x faster feature parser to improve sample efficiency. (2) Our observed strategies and performance show the potential of RL algorithms in MAPF problems. We find that standard RL methods, coupled with domain-specific engineering, can achieve performance comparable to OR algorithms (on par with the 2nd and 3rd best OR methods). Our solution provides implementation insights for MARL in the MAPF research community. (3) We open-sourced¹ our solution and the optimized feature parser for further research on multi-agent reinforcement learning in MAPF problems.
¹https://github.com/liqimai/flatland-marl
Figure 1: A 30×30 flatland map. There are two stations and
two trains. The lines connecting trains and stations indicate
trains’ targets.
2 Flatland3 Environment
Flatland is a simplified world of rail networks in which sta-
tions are connected by rails. Players control trains to run
from one station to another. Its newest version, Flatland3,
is governed by the following rules.
The time is discretized into timestamps from $0$ to $T_{\max}$.
There are $N$ trains and several cities. Trains are numbered from $1$ to $N$.
Trains' action space consists of five actions: {do nothing, forward, stop, left, right}. Trains are not allowed to move backward and must go along the rails.
Trains have different speeds. Each train $i$ has its own speed $s_i$ and can move one step every $1/s_i$ turns. The time cost of one step, $1/s_i$, is guaranteed to be an integer, and the largest possible speed is $1$, i.e., one step per turn.
Each train $i$ has an earliest departure time $A_i$ and a latest arrival time $B_i$. It can depart from its initial station only after its earliest departure time $A_i$ and should try its best to arrive at its target station before its latest arrival time $B_i$.
Trains randomly break down (malfunction) while running or waiting for departure. After a breakdown, a train must stay still for a period of time before it can move again.
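For concreteness, the per-train quantities above can be summarized in a small data structure. The sketch below is illustrative only; the class, field, and action names are ours and are not part of the Flatland API.

```python
# Illustrative encoding of a train's schedule constraints and action space
# as defined by the rules above (hypothetical names, not the Flatland API).
from dataclasses import dataclass
from enum import IntEnum


class Action(IntEnum):
    DO_NOTHING = 0
    FORWARD = 1
    STOP = 2
    LEFT = 3
    RIGHT = 4


@dataclass
class TrainSpec:
    speed: float              # s_i; 1/s_i is guaranteed to be an integer, s_i <= 1
    earliest_departure: int   # A_i: may depart only at or after this timestamp
    latest_arrival: int       # B_i: should arrive no later than this timestamp

    @property
    def turns_per_step(self) -> int:
        # Number of turns needed to advance one cell, 1/s_i.
        return round(1 / self.speed)
```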
Reward.  The goal is to control all trains to reach their target stations before their latest arrival times $B_i$. Every train $i$ receives a reward $R_i$ in the end; $T_i$ denotes the arrival time of train $i$.
If a train arrives on time, then it scores 0.
If it arrives late, it gets a negative reward $B_i - T_i$ as a penalty, according to how late it is.
If a train does not manage to arrive at its target before the end time $T_{\max}$, then the penalty consists of two parts, a temporal penalty and a spatial penalty. The temporal penalty is $B_i - T_{\max}$, reflecting how late it is. The spatial penalty is decided by the shortest-path distance $d_i^{(T_{\max})}$ between its final location at time $T_{\max}$ and its target.

Figure 2: Different types of rail cells.

Formally, $R_i$ is defined as
$$
R_i =
\begin{cases}
0, & \text{if } T_i \le B_i;\\
B_i - T_i, & \text{if } B_i < T_i \le T_{\max};\\
B_i - T_{\max} - d_i^{(T_{\max})}, & \text{if } T_i > T_{\max},
\end{cases}
\tag{1}
$$
where $d_i^{(T_{\max})}$ is the distance between train $i$ and its target at time $T_{\max}$,
$$
d_i^{(t)} = d\!\left(\big(x_i^{(t)},\, y_i^{(t)}\big),\ \mathrm{target}_i\right). \tag{2}
$$
Our goal is to maximize the sum of individual rewards,
$$
R = \sum_{i=1}^{N} R_i. \tag{3}
$$
Clearly, $R$ is always non-positive, and $R = 0$ if and only if all trains reach their targets on time. $|R|$ can be arbitrarily large, as long as the map is sufficiently large and the algorithm performs sufficiently badly.
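The reward in Eqs. (1) and (3) translates directly into code. The sketch below assumes each train's arrival time, latest arrival time, and remaining shortest-path distance at $T_{\max}$ are already known; the function and argument names are ours.

```python
# Direct transcription of Eq. (1). T_i is the arrival time of train i
# (use float('inf') for a train that never arrives), B_i its latest arrival
# time, T_max the episode horizon, and d_i the shortest-path distance between
# the train's position at time T_max and its target (used only if it never arrives).
def train_reward(T_i: float, B_i: int, T_max: int, d_i: int) -> float:
    if T_i <= B_i:            # arrived on time
        return 0.0
    if T_i <= T_max:          # arrived late: temporal penalty B_i - T_i
        return B_i - T_i
    return B_i - T_max - d_i  # never arrived: temporal + spatial penalty


# Eq. (3): the total reward is the sum of the per-train rewards.
def total_reward(arrivals, latest, T_max, distances) -> float:
    return sum(train_reward(T_i, B_i, T_max, d_i)
               for T_i, B_i, d_i in zip(arrivals, latest, distances))
```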
Normalized Reward.  The magnitude of the total reward $R$ greatly relies on the problem scale, such as the number of trains, the number of cities, the speeds of trains, and the map size. To make rewards of different problem scales comparable, they are normalized as follows:
$$
\bar{R} = 1 + \frac{R}{N \, T_{\max}}, \tag{4}
$$
where $N$ is the number of trains. The environment-generating procedure guarantees $\bar{R} \in [0, 1]$ by adjusting $T_{\max}$. The normalized reward serves as the standard criterion for testing algorithms.
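Eq. (4) is likewise a one-line computation (again with names of our own choosing):

```python
# Eq. (4): normalize the total reward R by the problem scale. The environment
# generator adjusts T_max so that R >= -N * T_max, hence the result lies in [0, 1].
def normalized_reward(R: float, N: int, T_max: int) -> float:
    return 1 + R / (N * T_max)
```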
3 Our Approach
The RL solution we provide is cooperative multi-agent re-
inforcement learning. Each agent independently observes a
part of the map localized around itself and encodes the part