Centralized Training with Hybrid Execution in
Multi-Agent Reinforcement Learning
Pedro P. Santos, Diogo S. Carvalho, Miguel Vasco,
Alberto Sardinha, Pedro A. Santos, Ana Paiva & Francisco S. Melo
INESC-ID & Instituto Superior Técnico, University of Lisbon
pedro.pinto.santos@tecnico.ulisboa.pt
Abstract
We introduce hybrid execution in multi-agent reinforcement learning (MARL),
a new paradigm in which agents aim to successfully complete cooperative tasks
with arbitrary communication levels at execution time by taking advantage of
information-sharing among the agents. Under hybrid execution, the communica-
tion level can range from a setting in which no communication is allowed between
agents (fully decentralized), to a setting featuring full communication (fully cen-
tralized), but the agents do not know beforehand which communication level they
will encounter at execution time. To formalize our setting, we define a new class
of multi-agent partially observable Markov decision processes (POMDPs) that we
name hybrid-POMDPs, which explicitly model a communication process between
the agents. We contribute MARO, an approach that makes use of an auto-regressive
predictive model, trained in a centralized manner, to estimate missing agents’ obser-
vations at execution time. We evaluate MARO on standard scenarios and extensions
of previous benchmarks tailored to emphasize the negative impact of partial ob-
servability in MARL. Experimental results show that our method consistently
outperforms relevant baselines, allowing agents to act with faulty communication
while successfully exploiting shared information.
1 Introduction
Multi-agent reinforcement learning (MARL) aims to learn utility-maximizing behavior in scenarios
involving multiple agents. In recent years, deep MARL methods have been successfully applied to
multi-agent tasks such as game-playing [22], traffic light control [34], or energy management [4].
Despite recent successes, the multi-agent setting is substantially harder than its single-agent counterpart [3]: multiple concurrent learners can create non-stationarity that hinders learning; the curse of dimensionality obstructs centralized approaches to MARL due to the exponential growth of state and action spaces with the number of agents; and agents seldom observe the true state of the environment.
To deal with the exponential growth of the state/action space and with environmental constraints on both perception and actuation, existing methods aim to learn decentralized policies that allow the agents to act based on local perceptions and partial information about other agents' intentions. The paradigm of centralized training with decentralized execution is undoubtedly at the core of recent research in the field [19, 26, 5]; this paradigm exploits the fact that additional information, available only at training time, can be used to learn decentralized policies in a way that alleviates the need for communication.
While in some settings partial observability and/or communication constraints require learning fully decentralized policies, the assumption that agents cannot communicate at execution time is often too strict for many real-world application domains, such as robotics, game-playing, or autonomous driving [9, 39]. In such domains, learning fully decentralized policies is too restrictive, since such policies do not take into account the possibility of communication between the agents. Other MARL strategies, which do take advantage of additional information shared among the agents, can surely be developed [42].
In this work, we propose RL agents that are able to exploit the benefits of centralized training
while, simultaneously, taking advantage of information-sharing at execution time. We introduce
the paradigm of hybrid execution, in which agents act in scenarios with arbitrary (but unknown)
communication levels that can range from no communication (fully decentralized) to full commu-
nication between the agents (fully centralized). In particular, we consider scenarios with faulty
communication during execution, in which agents passively share their local observations to perform
partially observable cooperative tasks. To formalize our setting, we start by defining the hybrid partially observable Markov decision process (H-POMDP), a new class of multi-agent POMDPs that explicitly
considers a communication process between the agents. We then propose a novel method that allows
agents to solve H-POMDPs regardless of the communication process encountered at execution time.
Specifically, we propose multi-agent observation sharing under communication dropout (MARO).
MARO can be easily integrated with current deep MARL methods and comprises an auto-regressive
model, trained in a centralized manner, that explicitly predicts non-shared information from past
observations of the agents.
We evaluate the performance of MARO across different communication levels, in different MARL benchmark environments, and using multiple RL algorithms. Furthermore, we introduce novel MARL environments, currently missing in the literature, that explicitly require communication during execution to successfully perform cooperative tasks. Experimental results show that our method consistently outperforms the baselines, allowing agents to exploit shared information during execution and to perform tasks under various communication levels.
In summary, our contributions are three-fold: (i) we propose and formalize the setting of hybrid
execution in MARL, in which agents must perform partially-observable cooperative tasks across
all possible communication levels; (ii) we propose MARO, an approach that makes use of an
autoregressive predictive model of agents’ observations; and (iii) we evaluate MARO in multiple
environments using different RL algorithms, showing that our approach consistently allows agents to
act with different communication levels.
2 Hybrid Execution in Multi-Agent Reinforcement Learning
A fully cooperative multi-agent system with Markovian dynamics can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP) [18]. A Dec-POMDP is a tuple $([n], \mathcal{X}, \mathcal{A}, P, r, \gamma, \mathcal{Z}, O)$, where $[n] = \{1, \ldots, n\}$ is the set of indexes of $n$ agents, $\mathcal{X}$ is the set of states of the environment, $\mathcal{A} = \times_i \mathcal{A}_i$ is the set of joint actions, where $\mathcal{A}_i$ is the set of individual actions of agent $i$, $P$ is the set of probability distributions over next states in $\mathcal{X}$, one for each state and action in $\mathcal{X} \times \mathcal{A}$, $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ maps states and actions to expected rewards, $\gamma \in [0, 1)$ is a discount factor, $\mathcal{Z} = \times_i \mathcal{Z}_i$ is the set of joint observations, where $\mathcal{Z}_i$ is the set of local observations of agent $i$, and $O$ is the set of probability distributions over joint observations in $\mathcal{Z}$, one for each state and action in $\mathcal{X} \times \mathcal{A}$. A decentralized policy for agent $i$ is $\pi_i : \mathcal{Z}_i \to \mathcal{A}_i$, and the joint decentralized policy is $\pi : \mathcal{Z} \to \mathcal{A}$ such that $\pi(z_1, \ldots, z_n) = (\pi_1(z_1), \ldots, \pi_n(z_n))$.
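To make the decentralized control structure concrete, the following minimal Python sketch composes per-agent policies into a joint policy, where each agent acts only on its own local observation (the policy objects and their interface are illustrative assumptions, not part of the formal definition above):

```python
from typing import Callable, List, Sequence

# A per-agent decentralized policy maps the agent's local observation to an action.
AgentPolicy = Callable[[Sequence[float]], int]

def joint_decentralized_policy(policies: List[AgentPolicy]):
    """Compose per-agent policies pi_i into a joint policy pi(z_1, ..., z_n)."""
    def pi(local_observations: List[Sequence[float]]) -> List[int]:
        # Each agent i acts on its own observation z_i only: no information sharing.
        return [pi_i(z_i) for pi_i, z_i in zip(policies, local_observations)]
    return pi

# Usage: two trivial policies, each choosing an action from its own observation.
pi = joint_decentralized_policy([lambda z: int(z[0] > 0), lambda z: int(z[1] > 0)])
print(pi([[0.3, -1.0], [0.0, 2.0]]))  # -> [1, 1]
```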
Fully decentralized approaches to MARL directly apply standard single-agent RL algorithms for learning each agent's policy $\pi_i$ in a decentralized manner. In independent $Q$-learning (IQL) [30], each agent treats other agents as being part of the environment, ignoring the influence of other agents' observations and actions. Similarly, independent proximal policy optimization (IPPO), an adaptation of the PPO algorithm [27], learns fully decentralized critic and actor networks, neglecting the influence of other agents. More recently, under the paradigm of centralized training with decentralized execution, QMIX [26] aims at learning decentralized policies with centralization at training time while fostering cooperation among the agents. Multi-agent PPO (MAPPO) [38] learns decentralized actors using a centralized critic during training. Finally, if we know that all agents can share their local observations among themselves at execution time, we can use any of the approaches above to learn fully centralized policies.
None of the aforementioned classes of methods assumes, however, that agents may sometimes have
access to other agents’ observations and sometimes not. Therefore, decentralized agents are unable to
take advantage of the additional information that they may receive from other agents at execution
time, and centralized agents are unable to act when the sharing of information fails. In this work, we introduce hybrid execution in MARL, a setting in which agents act regardless of the communication process while taking advantage of additional information they may receive during execution. To formalize this setting, we define a new class of multi-agent POMDPs that we name hybrid-POMDPs (H-POMDPs), which explicitly considers a specific communication process among the agents.

Figure 1: MARO approach for hybrid execution: (a) at training time, an autoregressive predictive model $M$ learns to estimate observation deltas $p(\Delta^{1:n}_t \mid o^{1:n}_t, h_t)$ from previous observations $o^{1:n}_t$ and a history variable $h_t$; and (b) at execution time, an agent-specific predictive model, $M^i$, predicts missing agents' observations. More details in the main text.
2.1 Hybrid Partially Observable Markov Decision Processes
We define a hybrid-POMDP (H-POMDP) as a tuple $([n], \mathcal{X}, \mathcal{A}, P, r, \gamma, \mathcal{Z}, O, C)$ where, in addition to the tuple that describes the Dec-POMDP, we consider an $n \times n$ communication matrix $C$ such that $[C]_{i,j} = p_{i,j}$ is the probability that, at a certain time step, agent $i$ has access to the local observation of agent $j$ in $\mathcal{Z}_j$. H-POMDPs generalize both the notion of decentralized execution and centralized execution in MARL. Specifically, for a given Dec-POMDP, we can consider $C$ as the identity matrix to capture fully decentralized execution or as a matrix of ones to capture fully centralized execution.
In our setting, we assume that at execution time agents will face an H-POMDP with an unknown communication matrix $C$, sampled from a set $\mathcal{C}$ according to an unknown probability distribution $\mu$. The performance of the agent is measured as $J_\mu(\pi) = \mathbb{E}_{C \sim \mu}[J(\pi; C)]$, where $J(\pi; C)$ denotes the expected discounted cumulative reward under an H-POMDP with communication matrix $C$. At training time, agents may have access to the fully centralized H-POMDP. Therefore, the setting we consider is one of centralized training with hybrid execution and an unknown communication process.
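To make the role of the communication matrix and of the objective $J_\mu$ concrete, the following NumPy sketch builds communication matrices, samples their per-step realizations, and Monte Carlo-estimates $J_\mu(\pi)$. The rollout function and the choice of $\mu$ (a shared off-diagonal level drawn uniformly from $[0, 1]$) are illustrative assumptions, not part of the formal definition:

```python
import numpy as np

def comm_matrix(n: int, p: float) -> np.ndarray:
    """Communication matrix C: agents always see their own observation;
    off-diagonal entries give the probability of receiving another agent's."""
    C = np.full((n, n), p)
    np.fill_diagonal(C, 1.0)
    return C

def sample_comm_mask(C: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Per-time-step Bernoulli realization: mask[i, j] = 1 iff agent i
    receives agent j's observation at this step."""
    return (rng.random(C.shape) < C).astype(np.float32)

# Special cases from the text: identity = fully decentralized, ones = fully centralized.
n = 3
C_decentralized = np.eye(n)
C_centralized = np.ones((n, n))

def estimate_J_mu(rollout_return, n_samples: int = 100, rng=None) -> float:
    """Monte Carlo estimate of J_mu(pi) = E_{C ~ mu}[J(pi; C)].
    `rollout_return(C)` is a hypothetical function returning the discounted
    return of a fixed policy under communication matrix C; mu is assumed
    here to draw a shared off-diagonal level p uniformly from [0, 1]."""
    rng = rng or np.random.default_rng(0)
    returns = [rollout_return(comm_matrix(n, rng.random())) for _ in range(n_samples)]
    return float(np.mean(returns))
```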
We note here that every H-POMDP has a corresponding Dec-POMDP, which can be obtained by adequately changing the observation space $\mathcal{Z}$ and the set of emission probability distributions $O$. Consequently, any reinforcement learning method can be trained to solve a specific H-POMDP, with a specific communication matrix $C$, by solving the corresponding Dec-POMDP. However, we seek to find a method that takes explicit advantage of the characteristics of hybrid execution to be able to act on H-POMDPs regardless of the matrix $C$ that models the communication process at execution time. To the best of our knowledge, there exists no method that addresses our problem.
3 Multi-Agent Observation Sharing under Communication Dropout
While acting on an H-POMDP, agents may not have access to the perceptual information of all
agents due to a faulty communication process. We propose MARO, a novel approach to exploit
shared information and overcome communication issues during task execution. MARO comprises an
autoregressive predictive model that estimates missing information from previous observations.
We set up the RL controller of each agent, i.e., the $Q$-network associated with each agent for the IQL and QMIX algorithms, and the actor network associated with each agent for the IPPO and MAPPO algorithms, to receive as input the joint observation $o^{1:n}_t = \{o^1_t, \ldots, o^n_t\}$, where $o^i_t$ is the observation of the $i$-th agent at timestep $t$. In order to overcome communication failures during execution, we train a predictive model $M$ to impute the non-shared observations $\tilde{o}^i_t,\ i \in [n]$.
Training time. We learn a transition model, $p(\Delta^{1:n}_t \mid o^{1:n}_t, h_t)$, depicted in Fig. 1a, that, given the current observations $o^{1:n}_t$ and some history variable $h_t$, is able to predict the next-step observations as $o^{1:n}_{t+1} = o^{1:n}_t + \Delta^{1:n}_t$, where $\Delta^{1:n}_t$ corresponds to the predicted deltas of the observations. We learn a single predictive model in a fully centralized and supervised fashion. We instantiate $p_\theta(\Delta^{1:n}_t \mid o^{1:n}_t, h_t)$ as an LSTM, parameterized by $\theta$, with

$$p_\theta(\Delta^{1:n}_t \mid o^{1:n}_t, h_t) = \prod_{i=1}^{n} p_\theta(\Delta^{i}_t \mid o^{1:n}_t, h_t), \qquad (1)$$

where $p_\theta(\Delta^{i}_t \mid o^{1:n}_t, h_t)$ is the Gaussian distribution of the predicted deltas for the $i$-th agent. We train the predictive model and RL controllers simultaneously: we consider single-step transitions $(o^{1:n}_t, \Delta^{1:n}_t)$, with $\Delta^{1:n}_t = o^{1:n}_{t+1} - o^{1:n}_t$, and evaluate the negative log-likelihood of the target next-step deltas $\Delta^{1:n}_t$, given the estimated next-step deltas distribution $p_\theta(\cdot \mid o^{1:n}_t, h_t)$:

$$\mathcal{L}_M(o^{1:n}_t, \Delta^{1:n}_t) = -\sum_{i=1}^{n} \log p_\theta(\Delta^{i}_t \mid o^{1:n}_t, h_t). \qquad (2)$$
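A minimal sketch of how such an LSTM delta model and the loss of Eqs. (1)-(2) could be implemented is given below (PyTorch; the layer sizes, the diagonal-Gaussian parameterization via a learned log-standard-deviation head, and all names are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class DeltaModel(nn.Module):
    """Autoregressive predictive model p_theta(Delta_t | o_t^{1:n}, h_t):
    an LSTM over the joint observation that outputs a diagonal Gaussian
    over the per-agent observation deltas."""
    def __init__(self, joint_obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(joint_obs_dim, hidden_dim, batch_first=True)
        self.mean = nn.Linear(hidden_dim, joint_obs_dim)
        self.log_std = nn.Linear(hidden_dim, joint_obs_dim)

    def forward(self, joint_obs, hidden=None):
        # joint_obs: (batch, time, joint_obs_dim); `hidden` carries h_t between calls.
        out, hidden = self.lstm(joint_obs, hidden)
        dist = torch.distributions.Normal(self.mean(out), self.log_std(out).exp())
        return dist, hidden

def delta_nll_loss(model, joint_obs, next_joint_obs):
    """Negative log-likelihood of the target deltas, in the spirit of Eq. (2).
    Targets are Delta_t = o_{t+1} - o_t, as in the text."""
    dist, _ = model(joint_obs)
    deltas = next_joint_obs - joint_obs
    return -dist.log_prob(deltas).sum(dim=-1).mean()

# Usage on a dummy batch of trajectories (batch=8, time=10, joint obs dim=12).
model = DeltaModel(joint_obs_dim=12)
obs = torch.randn(8, 11, 12)
loss = delta_nll_loss(model, obs[:, :-1], obs[:, 1:])
loss.backward()
```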
Execution time. We provide each agent with an independent instance of the predictive model, $M^i$, which updates the estimated joint observations from the perspective of the agent, $\tilde{o}^{1:n,i}_t = \{\tilde{o}^{1,i}_t, \ldots, \tilde{o}^{n,i}_t\}$, and maintains an agent-specific history state $h^i_t$. As depicted in Fig. 1b, we use the predictive model $M^i$ to impute missing observations.
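The following sketch illustrates one plausible execution-time loop from a single agent's perspective: the local model rolls the joint-observation estimate forward, observations that actually arrive overwrite the corresponding entries, and the (partly imputed) joint observation is fed to the RL controller. The `predict_next` and `controller` callables are hypothetical stand-ins, not MARO's actual interface:

```python
import numpy as np

def hybrid_execution_step(est_joint_obs, received, mask, predict_next, controller):
    """One execution-time step from the perspective of agent i.

    est_joint_obs: (n, obs_dim) current estimate of all agents' observations
    received:      (n, obs_dim) observations actually communicated this step
    mask:          (n,) booleans, mask[j] = True iff agent j's observation arrived
    predict_next:  hypothetical callable: estimate -> predicted deltas (n, obs_dim)
    controller:    hypothetical callable: flattened joint observation -> action
    """
    # Roll the estimate forward with the agent-local predictive model M^i.
    predicted = est_joint_obs + predict_next(est_joint_obs)
    # Wherever communication succeeded, trust the real observation instead.
    est_joint_obs = np.where(mask[:, None], received, predicted)
    # The RL controller always receives a full (partly imputed) joint observation.
    action = controller(est_joint_obs.reshape(-1))
    return action, est_joint_obs
```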
4 Evaluation
In this section, we evaluate our approach for hybrid execution against relevant baselines under
multiple MARL algorithms. We show that the core component of MARO, i.e., the predictive model,
allows the execution of tasks across multiple communication levels, outperforming baselines. We
start by describing our experimental scenarios and baselines in Sec. 4.1 and Sec. 4.2, respectively. In
Sec. 4.3, we present our main experimental results.
4.1 Experimental Scenarios
We focus our evaluation on multi-agent cooperative environments. As discussed by Papoudakis et al. [24], the main challenges in current MARL benchmark scenarios mainly involve coordination, large action spaces, sparse rewards, and non-stationarity. Thus, in order to emphasize the impact of information sharing among the agents, we contribute the following environments (adapted from [15]):
HearSee (HS): Two heterogeneous agents cover a single landmark in a 2D map. The “Hear” agent observes the absolute position of the landmark, but it does not have access to its own position in the environment. The “See” agent observes the positions and velocities of both agents, yet does not have access to the position of the landmark;

SpreadXY-2 (SXY-2): Two heterogeneous agents cover two designated landmarks in a 2D map while avoiding collisions. In this scenario, one of the agents has access to the X-axis position and velocity of both agents, while the other agent has access to the Y-axis position and velocity of both agents. Both agents observe the landmarks' absolute positions;

SpreadXY-4 (SXY-4): Similar to the scenario above, but with two teams of two agents;

SpreadBlindfold (SBF): Three agents cover three designated landmarks in a 2D map while avoiding collisions. Each agent's observation only includes its own position and velocity and the absolute positions of all landmarks.
In addition to the proposed environments, we evaluate our approach in the standard SpeakerListener (SL) environment from [15], as well as the Level-Based Foraging (Foraging-2s-15x15-2p-2f-coop-v2) (LBF) environment [24], which we modified to comprise the absolute positions of the agents. For some scenarios in standard benchmarks, such as the Multi-Agent Particle Environment [15] or Level-Based Foraging [24], we observed no advantage in allowing observation sharing between the agents, even without considering communication failures (more details in Appendix B.1). Thus, we did not consider such environments in this work. For a complete description of the scenarios, as well as additional details regarding the choice of the environments used, we refer to Appendix B.1.
Finally, we consider H-POMDPs with communication matrices such that each agent $i$ can always access its own local observation, i.e., $p_{i,i} = 1$, and the communication matrix is symmetric between agents $i$ and $j$, i.e., $p_{i,j} = p_{j,i}$. To simplify the exposition and the evaluation, we use the same $p_{i,j} = p$ for all pairs of different agents $i$, $j$. Therefore, we use $p$ to unambiguously denote the communication level of a given H-POMDP. Nevertheless, we perform a comparative study between different sampling schemes for the communication matrix in Sec. 4.3.2, highlighting the robustness of MARO under different communication settings.
4.2 Baselines and Experimental Methodology
We compare MARO against the following baselines, which do not make use of a predictive model
and perform constant imputation of missing observations:
Observation (Obs.): Agents only have access to their own observations and are unable to communicate with other agents during execution. This corresponds to standard MARL algorithms designed for decentralized execution.

Masked Joint-Observation (Masked j. obs.): During the centralized training phase, the RL controllers receive as input the concatenation of the observations of all agents. At execution time, missing observations are replaced with a vector of zeros.
Message-Dropout (MD): During the centralized training phase, the RL controllers receive as input the concatenation of the observations of all agents, but a dropout-based mechanism randomly drops some of the observations (i.e., replaces them with a vector of zeros) according to $p \sim \mathcal{U}(0, 1)$. At execution time, missing observations are replaced with a vector of zeros. This baseline is adapted from [13] (see the sketch after this list for one possible form of this zero-imputation scheme).
Message-Dropout w/ masks (MD w/ masks): This baseline is similar to the MD baseline,
but additionally appends to the input of the RL controllers a set of binary flags encoding
whether the observations of the agents are missing or not. The masks give additional context
to the RL agent regarding the validity of the entries in the vector of observations.
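As referenced in the Message-Dropout item above, the following NumPy sketch illustrates the zero-imputation and dropout masking used by the Masked j. obs. and MD baselines (the function names, the per-step uniform sampling of the drop level, and the flag layout are assumptions for illustration, not the paper's code):

```python
import numpy as np

def mask_joint_observation(joint_obs, keep_mask):
    """Constant imputation used by the baselines: observations of agents
    that did not communicate are replaced with zeros.

    joint_obs: (n, obs_dim) observations of all agents
    keep_mask: (n,) booleans, True iff that agent's observation is available
    """
    return joint_obs * keep_mask[:, None]

def message_dropout(joint_obs, own_index, rng, with_flags=False):
    """Training-time message dropout: drop each other agent's observation
    with probability p drawn from U(0, 1); the agent's own observation is kept.
    With `with_flags=True`, append the binary availability flags (MD w/ masks)."""
    p = rng.uniform()
    keep = rng.random(joint_obs.shape[0]) >= p
    keep[own_index] = True
    masked = mask_joint_observation(joint_obs, keep)
    if with_flags:
        return np.concatenate([masked.reshape(-1), keep.astype(np.float32)])
    return masked.reshape(-1)

# Usage: agent 0's controller input when some observations are dropped.
rng = np.random.default_rng(0)
obs = np.arange(9, dtype=np.float32).reshape(3, 3)
print(message_dropout(obs, own_index=0, rng=rng, with_flags=True))
```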
All baselines above can be used in the context of hybrid execution. Additionally, we consider an Oracle baseline under which all agents have access to the observations of all agents both during training and execution. The Oracle baseline corresponds to standard MARL algorithms designed for centralized execution; however, it is unable to perform when communication fails. We use the Oracle baseline to better contextualize the performance of the methods developed for hybrid execution against an ideal setting featuring no communication failures.
We employ the same RL controller networks across all evaluations. The RL networks include recurrent layers to mitigate the effects of partial observability. We consider four different MARL algorithms: IQL, QMIX, IPPO, and MAPPO. We perform 3 training runs for each experimental setting and 100 evaluation rollouts for each training run. We report, both in tables and plots, the 95% bootstrapped confidence interval alongside the corresponding scalar mean value. We assume that $p = 1$ at $t = 0$ for all algorithms. The algorithms are evaluated for $p \sim \mathcal{U}(0, 1)$ whenever the communication level is not explicitly stated, or for a given fixed communication level $p$ when explicitly specified. The Oracle baseline is always evaluated with $p = 1$. We refer to Appendix B.2 for a complete description of the experimental methodology, including the hyperparameters of the predictive model and the RL controllers, as well as the code used for this work.
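For reference, 95% bootstrapped confidence intervals of the kind reported in the tables and plots can be computed with a percentile bootstrap over evaluation returns, roughly as in the sketch below (a generic illustration, not the authors' evaluation script):

```python
import numpy as np

def bootstrap_ci(returns, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean evaluation return."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=np.float64)
    means = np.array([
        rng.choice(returns, size=returns.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    alpha = (1.0 - confidence) / 2.0
    low, high = np.quantile(means, [alpha, 1.0 - alpha])
    return returns.mean(), (low, high)

# Usage: mean and 95% CI over, e.g., 300 evaluation rollouts (3 runs x 100 rollouts).
mean, (low, high) = bootstrap_ci(np.random.default_rng(1).normal(-10.0, 2.0, 300))
print(f"{mean:.2f} (95% CI: [{low:.2f}, {high:.2f}])")
```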
4.3 Results
We present the main evaluation results in Tables 1 and 2 for the value-based and actor-critic-based algorithms, respectively. For each environment, RL algorithm, and method, we present the values of the accumulated rewards obtained for $p \sim \mathcal{U}(0, 1)$. The values that are not significantly different from the highest are presented in bold. The results show that MARO is the best-performing method overall. In particular, out of the 24 algorithm-environment combinations considered, MARO performed equal