Centralized Training with Hybrid Execution in
Multi-Agent Reinforcement Learning
Pedro P. Santos, Diogo S. Carvalho, Miguel Vasco,
Alberto Sardinha, Pedro A. Santos, Ana Paiva & Francisco S. Melo
INESC-ID & Instituto Superior Técnico, University of Lisbon
pedro.pinto.santos@tecnico.ulisboa.pt
Abstract
We introduce hybrid execution in multi-agent reinforcement learning (MARL),
a new paradigm in which agents aim to successfully complete cooperative tasks
with arbitrary communication levels at execution time by taking advantage of
information-sharing among the agents. Under hybrid execution, the communica-
tion level can range from a setting in which no communication is allowed between
agents (fully decentralized), to a setting featuring full communication (fully cen-
tralized), but the agents do not know beforehand which communication level they
will encounter at execution time. To formalize our setting, we define a new class
of multi-agent partially observable Markov decision processes (POMDPs) that we
name hybrid-POMDPs, which explicitly model a communication process between
the agents. We contribute MARO, an approach that makes use of an auto-regressive
predictive model, trained in a centralized manner, to estimate missing agents’ obser-
vations at execution time. We evaluate MARO on standard scenarios and extensions
of previous benchmarks tailored to emphasize the negative impact of partial ob-
servability in MARL. Experimental results show that our method consistently
outperforms relevant baselines, allowing agents to act with faulty communication
while successfully exploiting shared information.
1 Introduction
Multi-agent reinforcement learning (MARL) aims to learn utility-maximizing behavior in scenarios
involving multiple agents. In recent years, deep MARL methods have been successfully applied to
multi-agent tasks such as game-playing [22], traffic light control [34], or energy management [4].
Despite recent successes, the multi-agent setting is substantially harder than its single-agent counterpart [3]: multiple concurrent learners can create non-stationarity that hinders learning; the curse of dimensionality obstructs centralized approaches to MARL due to the exponential growth of state and action spaces with the number of agents; and agents seldom observe the true state of the environment.
To deal with the exponential growth of the state/action space and with environmental constraints on both perception and actuation, existing methods aim to learn decentralized policies that allow the agents to act based on local perceptions and partial information about other agents' intentions. The paradigm of centralized training with decentralized execution is undoubtedly at the core of recent research in the field [19, 26, 5]; this paradigm exploits the fact that additional information, available only at training time, can be used to learn decentralized policies in a way that alleviates the need for communication.
While in some settings partial observability and/or communication constraints require learning fully decentralized policies, the assumption that agents cannot communicate at execution time is often too strict for many real-world application domains, such as robotics, game-playing, or autonomous driving [9, 39]. In such domains, learning fully decentralized policies is too restrictive, since such policies do not take into account the possibility of communication between the agents. Other MARL strategies, which do take advantage of additional information shared among the agents, can surely be developed [42].
In this work, we propose RL agents that are able to exploit the benefits of centralized training
while, simultaneously, taking advantage of information-sharing at execution time. We introduce
the paradigm of hybrid execution, in which agents act in scenarios with arbitrary (but unknown)
communication levels that can range from no communication (fully decentralized) to full commu-
nication between the agents (fully centralized). In particular, we consider scenarios with faulty
communication during execution, in which agents passively share their local observations to perform
partially observable cooperative tasks. To formalize our setting, we start by defining the hybrid partially observable Markov decision process (H-POMDP), a new class of multi-agent POMDPs that explicitly
considers a communication process between the agents. We then propose a novel method that allows
agents to solve H-POMDPs regardless of the communication process encountered at execution time.
Specifically, we propose multi-agent observation sharing under communication dropout (MARO).
MARO can be easily integrated with current deep MARL methods and comprises an auto-regressive
model, trained in a centralized manner, that explicitly predicts non-shared information from past
observations of the agents.
We evaluate the performance of MARO across different communication levels, in different MARL benchmark environments, and using multiple RL algorithms. Furthermore, we introduce novel MARL environments, currently missing in the literature, that explicitly require communication during execution to successfully perform cooperative tasks. Experimental results show that our method consistently outperforms the baselines, allowing agents to exploit shared information during execution and to perform tasks under various communication levels.
In summary, our contributions are three-fold: (i) we propose and formalize the setting of hybrid
execution in MARL, in which agents must perform partially-observable cooperative tasks across
all possible communication levels; (ii) we propose MARO, an approach that makes use of an
autoregressive predictive model of agents’ observations; and (iii) we evaluate MARO in multiple
environments using different RL algorithms, showing that our approach consistently allows agents to
act with different communication levels.
2 Hybrid Execution in Multi-Agent Reinforcement Learning
A fully cooperative multi-agent system with Markovian dynamics can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP) [18]. A Dec-POMDP is a tuple $([n], \mathcal{X}, \mathcal{A}, P, r, \gamma, \mathcal{Z}, O)$, where $[n] = \{1, \ldots, n\}$ is the set of indexes of $n$ agents, $\mathcal{X}$ is the set of states of the environment, $\mathcal{A} = \times_i \mathcal{A}_i$ is the set of joint actions, where $\mathcal{A}_i$ is the set of individual actions of agent $i$, $P$ is the set of probability distributions over next states in $\mathcal{X}$, one for each state and action in $\mathcal{X} \times \mathcal{A}$, $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ maps states and actions to expected rewards, $\gamma \in [0, 1)$ is a discount factor, $\mathcal{Z} = \times_i \mathcal{Z}_i$ is the set of joint observations, where $\mathcal{Z}_i$ is the set of local observations of agent $i$, and $O$ is the set of probability distributions over joint observations in $\mathcal{Z}$, one for each state and action in $\mathcal{X} \times \mathcal{A}$. A decentralized policy for agent $i$ is $\pi_i : \mathcal{Z}_i \to \mathcal{A}_i$, and the joint decentralized policy is $\pi : \mathcal{Z} \to \mathcal{A}$ such that $\pi(z_1, \ldots, z_n) = (\pi_1(z_1), \ldots, \pi_n(z_n))$.
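To make the decentralized control structure concrete, the following minimal Python sketch composes per-agent policies into a joint policy, where each agent acts only on its own local observation (the policy objects and their interface are illustrative assumptions, not part of the formal definition above):

```python
from typing import Callable, List, Sequence

# A per-agent decentralized policy maps the agent's local observation to an action.
AgentPolicy = Callable[[Sequence[float]], int]

def joint_decentralized_policy(policies: List[AgentPolicy]):
    """Compose per-agent policies pi_i into a joint policy pi(z_1, ..., z_n)."""
    def pi(local_observations: List[Sequence[float]]) -> List[int]:
        # Each agent i acts on its own observation z_i only: no information sharing.
        return [pi_i(z_i) for pi_i, z_i in zip(policies, local_observations)]
    return pi

# Usage: two trivial policies, each choosing an action from its own observation.
pi = joint_decentralized_policy([lambda z: int(z[0] > 0), lambda z: int(z[1] > 0)])
print(pi([[0.3, -1.0], [0.0, 2.0]]))  # -> [1, 1]
```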
Fully decentralized approaches to MARL directly apply standard single-agent RL algorithms for learning each agent's policy $\pi_i$ in a decentralized manner. In independent $Q$-learning (IQL) [30], each agent treats other agents as being part of the environment, ignoring the influence of other agents' observations and actions. Similarly, independent proximal policy optimization (IPPO), an adaptation of the PPO algorithm [27], learns fully decentralized critic and actor networks, neglecting the influence of other agents. More recently, under the paradigm of centralized training with decentralized execution, QMIX [26] aims at learning decentralized policies with centralization at training time while fostering cooperation among the agents. Multi-agent PPO (MAPPO) [38] learns decentralized actors using a centralized critic during training. Finally, if we know that all agents can share their local observations among themselves at execution time, we can use any of the approaches above to learn fully centralized policies.
None of the aforementioned classes of methods assumes, however, that agents may sometimes have
access to other agents’ observations and sometimes not. Therefore, decentralized agents are unable to
take advantage of the additional information that they may receive from other agents at execution
time, and centralized agents are unable to act when the sharing of information fails. In this work, we introduce hybrid execution in MARL, a setting in which agents act regardless of the communication process while taking advantage of additional information they may receive during execution. To formalize this setting, we define a new class of multi-agent POMDPs that we name hybrid-POMDPs (H-POMDPs), which explicitly considers a specific communication process among the agents.

Figure 1: MARO approach for hybrid execution: (a) at training time, an autoregressive predictive model $M$ learns to estimate observation deltas $p(\Delta^{1:n}_t \mid o^{1:n}_t, h_t)$ from previous observations $o^{1:n}_t$ and a history variable $h_t$; and (b) at execution time, an agent-specific predictive model, $M^i$, predicts missing agents' observations. More details in the main text.
2.1 Hybrid Partially Observable Markov Decision Processes
We define a hybrid-POMDP (H-POMDP) as a tuple $([n], \mathcal{X}, \mathcal{A}, P, r, \gamma, \mathcal{Z}, O, C)$ where, in addition to the tuple that describes the Dec-POMDP, we consider an $n \times n$ communication matrix $C$ such that $[C]_{i,j} = p_{i,j}$ is the probability that, at a certain time step, agent $i$ has access to the local observation of agent $j$ in $\mathcal{Z}_j$. H-POMDPs generalize both the notion of decentralized execution and centralized execution in MARL. Specifically, for a given Dec-POMDP, we can consider $C$ as the identity matrix to capture fully decentralized execution or as a matrix of ones to capture fully centralized execution.
In our setting, we assume that at execution time agents will face an H-POMDP with an unknown communication matrix $C$, sampled from a set $\mathcal{C}$ according to an unknown probability distribution $\mu$. The performance of the agent is measured as $J_\mu(\pi) = \mathbb{E}_{C \sim \mu}[J(\pi; C)]$, where $J(\pi; C)$ denotes the expected discounted cumulative reward under an H-POMDP with communication matrix $C$. At training time, agents may have access to the fully centralized H-POMDP. Therefore, the setting we consider is one of centralized training with hybrid execution and an unknown communication process.
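To make the role of the communication matrix and of the objective $J_\mu$ concrete, the following NumPy sketch builds communication matrices, samples their per-step realizations, and Monte Carlo-estimates $J_\mu(\pi)$. The rollout function and the choice of $\mu$ (a shared off-diagonal level drawn uniformly from $[0, 1]$) are illustrative assumptions, not part of the formal definition:

```python
import numpy as np

def comm_matrix(n: int, p: float) -> np.ndarray:
    """Communication matrix C: agents always see their own observation;
    off-diagonal entries give the probability of receiving another agent's."""
    C = np.full((n, n), p)
    np.fill_diagonal(C, 1.0)
    return C

def sample_comm_mask(C: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Per-time-step Bernoulli realization: mask[i, j] = 1 iff agent i
    receives agent j's observation at this step."""
    return (rng.random(C.shape) < C).astype(np.float32)

# Special cases from the text: identity = fully decentralized, ones = fully centralized.
n = 3
C_decentralized = np.eye(n)
C_centralized = np.ones((n, n))

def estimate_J_mu(rollout_return, n_samples: int = 100, rng=None) -> float:
    """Monte Carlo estimate of J_mu(pi) = E_{C ~ mu}[J(pi; C)].
    `rollout_return(C)` is a hypothetical function returning the discounted
    return of a fixed policy under communication matrix C; mu is assumed
    here to draw a shared off-diagonal level p uniformly from [0, 1]."""
    rng = rng or np.random.default_rng(0)
    returns = [rollout_return(comm_matrix(n, rng.random())) for _ in range(n_samples)]
    return float(np.mean(returns))
```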
We note here that every H-POMDP has a corresponding Dec-POMDP, which can be obtained by adequately changing the observation space $\mathcal{Z}$ and the set of emission probability distributions $O$. Consequently, any reinforcement learning method can be trained to solve a specific H-POMDP, with a specific communication matrix $C$, by solving the corresponding Dec-POMDP. However, we seek to find a method that takes explicit advantage of the characteristics of hybrid execution to be able to act on H-POMDPs regardless of the matrix $C$ that models the communication process at execution time. To the best of our knowledge, there exists no method that addresses our problem.
3 Multi-Agent Observation Sharing under Communication Dropout
While acting on an H-POMDP, agents may not have access to the perceptual information of all
agents due to a faulty communication process. We propose MARO, a novel approach to exploit
shared information and overcome communication issues during task execution. MARO comprises an
autoregressive predictive model that estimates missing information from previous observations.
We set up the RL controller of each agent, i.e., the $Q$-network associated with each agent for the IQL and QMIX algorithms, and the actor network associated with each agent for the IPPO and MAPPO algorithms, to receive as input the joint observation $o^{1:n}_t = \{o^1_t, \ldots, o^n_t\}$, where $o^i_t$ is the observation of the $i$-th agent at timestep $t$. In order to overcome communication failures during execution, we train a predictive model $M$ to impute the non-shared observations $\tilde{o}^i_t,\ i \in [n]$.
Training time. We learn a transition model, $p(\Delta^{1:n}_t \mid o^{1:n}_t, h_t)$, depicted in Fig. 1a, that, given the current observations $o^{1:n}_t$ and some history variable $h_t$, is able to predict the next-step observations as $o^{1:n}_{t+1} = o^{1:n}_t + \Delta^{1:n}_t$, where $\Delta^{1:n}_t$ corresponds to the predicted deltas of the observations. We learn a single predictive model in a fully centralized and supervised fashion. We instantiate $p_\theta(\Delta^{1:n}_t \mid o^{1:n}_t, h_t)$ as an LSTM, parameterized by $\theta$, with

$$p_\theta(\Delta^{1:n}_t \mid o^{1:n}_t, h_t) = \prod_{i=1}^{n} p_\theta(\Delta^{i}_t \mid o^{1:n}_t, h_t), \qquad (1)$$

where $p_\theta(\Delta^{i}_t \mid o^{1:n}_t, h_t)$ is the Gaussian distribution of the predicted deltas for the $i$-th agent. We train the predictive model and RL controllers simultaneously: we consider single-step transitions $(o^{1:n}_t, \Delta^{1:n}_t)$, with $\Delta^{1:n}_t = o^{1:n}_{t+1} - o^{1:n}_t$, and evaluate the negative log-likelihood of the target next-step deltas $\Delta^{1:n}_t$, given the estimated next-step deltas distribution $p_\theta(\cdot \mid o^{1:n}_t, h_t)$:

$$\mathcal{L}_M(o^{1:n}_t, \Delta^{1:n}_t) = -\sum_{i=1}^{n} \log p_\theta(\Delta^{i}_t \mid o^{1:n}_t, h_t). \qquad (2)$$
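A minimal sketch of how such an LSTM delta model and the loss of Eqs. (1)-(2) could be implemented is given below (PyTorch; the layer sizes, the diagonal-Gaussian parameterization via a learned log-standard-deviation head, and all names are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class DeltaModel(nn.Module):
    """Autoregressive predictive model p_theta(Delta_t | o_t^{1:n}, h_t):
    an LSTM over the joint observation that outputs a diagonal Gaussian
    over the per-agent observation deltas."""
    def __init__(self, joint_obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(joint_obs_dim, hidden_dim, batch_first=True)
        self.mean = nn.Linear(hidden_dim, joint_obs_dim)
        self.log_std = nn.Linear(hidden_dim, joint_obs_dim)

    def forward(self, joint_obs, hidden=None):
        # joint_obs: (batch, time, joint_obs_dim); `hidden` carries h_t between calls.
        out, hidden = self.lstm(joint_obs, hidden)
        dist = torch.distributions.Normal(self.mean(out), self.log_std(out).exp())
        return dist, hidden

def delta_nll_loss(model, joint_obs, next_joint_obs):
    """Negative log-likelihood of the target deltas, in the spirit of Eq. (2).
    Targets are Delta_t = o_{t+1} - o_t, as in the text."""
    dist, _ = model(joint_obs)
    deltas = next_joint_obs - joint_obs
    return -dist.log_prob(deltas).sum(dim=-1).mean()

# Usage on a dummy batch of trajectories (batch=8, time=10, joint obs dim=12).
model = DeltaModel(joint_obs_dim=12)
obs = torch.randn(8, 11, 12)
loss = delta_nll_loss(model, obs[:, :-1], obs[:, 1:])
loss.backward()
```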
Execution time. We provide each agent with an independent instance of the predictive model, $M^i$, which updates the estimated joint observations from the perspective of the agent, $\tilde{o}^{1:n,i}_t = \{\tilde{o}^{1,i}_t, \ldots, \tilde{o}^{n,i}_t\}$, and maintains an agent-specific history state $h^i_t$. As depicted in Fig. 1b, we use the predictive model $M^i$ to impute missing observations.
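The following sketch illustrates one plausible execution-time loop from a single agent's perspective: the local model rolls the joint-observation estimate forward, observations that actually arrive overwrite the corresponding entries, and the (partly imputed) joint observation is fed to the RL controller. The `predict_next` and `controller` callables are hypothetical stand-ins, not MARO's actual interface:

```python
import numpy as np

def hybrid_execution_step(est_joint_obs, received, mask, predict_next, controller):
    """One execution-time step from the perspective of agent i.

    est_joint_obs: (n, obs_dim) current estimate of all agents' observations
    received:      (n, obs_dim) observations actually communicated this step
    mask:          (n,) booleans, mask[j] = True iff agent j's observation arrived
    predict_next:  hypothetical callable: estimate -> predicted deltas (n, obs_dim)
    controller:    hypothetical callable: flattened joint observation -> action
    """
    # Roll the estimate forward with the agent-local predictive model M^i.
    predicted = est_joint_obs + predict_next(est_joint_obs)
    # Wherever communication succeeded, trust the real observation instead.
    est_joint_obs = np.where(mask[:, None], received, predicted)
    # The RL controller always receives a full (partly imputed) joint observation.
    action = controller(est_joint_obs.reshape(-1))
    return action, est_joint_obs
```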
4 Evaluation
In this section, we evaluate our approach for hybrid execution against relevant baselines under
multiple MARL algorithms. We show that the core component of MARO, i.e., the predictive model,
allows the execution of tasks across multiple communication levels, outperforming baselines. We
start by describing our experimental scenarios and baselines in Sec. 4.1 and Sec. 4.2, respectively. In
Sec. 4.3, we present our main experimental results.
4.1 Experimental Scenarios
We focus our evaluation on multi-agent cooperative environments. As discussed by Papoudakis et al. [24], the main challenges in current MARL benchmark scenarios mainly involve coordination, large action spaces, sparse rewards, and non-stationarity. Thus, in order to emphasize the impact of information sharing among the agents, we contribute the following environments (adapted from [15]):
HearSee (HS): Two heterogeneous agents cover a single landmark in a 2D map. The “Hear” agent observes the absolute position of the landmark, but it does not have access to its own position in the environment. The “See” agent observes the positions and velocities of both agents, yet does not have access to the position of the landmark;

SpreadXY-2 (SXY-2): Two heterogeneous agents cover two designated landmarks in a 2D map while avoiding collisions. In this scenario, one of the agents has access to the X-axis position and velocity of both agents, while the other agent has access to the Y-axis position and velocity of both agents. Both agents observe the landmarks' absolute positions;

SpreadXY-4 (SXY-4): Similar to the scenario above, but with two teams of two agents;

SpreadBlindfold (SBF): Three agents cover three designated landmarks in a 2D map while avoiding collisions. Each agent's observation only includes its own position and velocity and the absolute positions of all landmarks.
In addition to the proposed environments, we evaluate our approach in the standard SpeakerListener (SL) environment from [15], as well as the Level-Based Foraging (Foraging-2s-15x15-2p-2f-coop-v2) (LBF) environment [24], which we modified to comprise the absolute positions of the agents. For some scenarios in standard benchmarks, such as the Multi-Agent Particle Environment [15] or Level-Based Foraging [24], we observed no advantage in allowing observation sharing between the agents, even without considering communication failures (more details in Appendix B.1). Thus, we did not consider such environments in this work. For a complete description of the scenarios, as well as additional details regarding the choice of the environments used, we refer to Appendix B.1.
Finally, we consider H-POMDPs with communication matrices such that each agent $i$ can always access its own local observation, i.e., $p_{i,i} = 1$, and the communication matrix is symmetric between agents $i$ and $j$, i.e., $p_{i,j} = p_{j,i}$. To simplify the exposition and the evaluation, we use the same $p_{i,j} = p$ for all pairs of different agents $i$, $j$. Therefore, we use $p$ to unambiguously denote the communication level of a given H-POMDP. Nevertheless, we perform a comparative study between different sampling schemes for the communication matrix in Sec. 4.3.2, highlighting the robustness of MARO under different communication settings.
4.2 Baselines and Experimental Methodology
We compare MARO against the following baselines, which do not make use of a predictive model
and perform constant imputation of missing observations:
Observation (Obs.): Agents only have access to their own observations and are unable to communicate with other agents during execution. This corresponds to standard MARL algorithms designed for decentralized execution.

Masked Joint-Observation (Masked j. obs.): During the centralized training phase, the RL controllers receive as input the concatenation of the observations of all agents. At execution time, missing observations are replaced with a vector of zeros.
Message-Dropout (MD): During the centralized training phase, the RL controllers receive as input the concatenation of the observations of all agents, but a dropout-based mechanism randomly drops some of the observations (i.e., replaces them with a vector of zeros) according to $p \sim \mathcal{U}(0, 1)$. At execution time, missing observations are replaced with a vector of zeros. This baseline is adapted from [13] (see the sketch after this list for one possible form of this zero-imputation scheme).
Message-Dropout w/ masks (MD w/ masks): This baseline is similar to the MD baseline,
but additionally appends to the input of the RL controllers a set of binary flags encoding
whether the observations of the agents are missing or not. The masks give additional context
to the RL agent regarding the validity of the entries in the vector of observations.
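As referenced in the Message-Dropout item above, the following NumPy sketch illustrates the zero-imputation and dropout masking used by the Masked j. obs. and MD baselines (the function names, the per-step uniform sampling of the drop level, and the flag layout are assumptions for illustration, not the paper's code):

```python
import numpy as np

def mask_joint_observation(joint_obs, keep_mask):
    """Constant imputation used by the baselines: observations of agents
    that did not communicate are replaced with zeros.

    joint_obs: (n, obs_dim) observations of all agents
    keep_mask: (n,) booleans, True iff that agent's observation is available
    """
    return joint_obs * keep_mask[:, None]

def message_dropout(joint_obs, own_index, rng, with_flags=False):
    """Training-time message dropout: drop each other agent's observation
    with probability p drawn from U(0, 1); the agent's own observation is kept.
    With `with_flags=True`, append the binary availability flags (MD w/ masks)."""
    p = rng.uniform()
    keep = rng.random(joint_obs.shape[0]) >= p
    keep[own_index] = True
    masked = mask_joint_observation(joint_obs, keep)
    if with_flags:
        return np.concatenate([masked.reshape(-1), keep.astype(np.float32)])
    return masked.reshape(-1)

# Usage: agent 0's controller input when some observations are dropped.
rng = np.random.default_rng(0)
obs = np.arange(9, dtype=np.float32).reshape(3, 3)
print(message_dropout(obs, own_index=0, rng=rng, with_flags=True))
```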
All baselines above can be used in the context of hybrid execution. Additionally, we consider an Oracle baseline under which all agents have access to the observations of all agents both during training and execution. The Oracle baseline corresponds to standard MARL algorithms designed for centralized execution; however, it is unable to perform when communication fails. We use the Oracle baseline to better contextualize the performance of the methods developed for hybrid execution against an ideal setting featuring no communication failures.
We employ the same RL controller networks across all evaluations. The RL networks include recurrent layers to mitigate the effects of partial observability. We consider four different MARL algorithms: IQL, QMIX, IPPO, and MAPPO. We perform 3 training runs for each experimental setting and 100 evaluation rollouts for each training run. We report, both in tables and plots, the 95% bootstrapped confidence interval alongside the corresponding scalar mean value. We assume that $p = 1$ at $t = 0$ for all algorithms. The algorithms are evaluated for $p \sim \mathcal{U}(0, 1)$ whenever the communication level is not explicitly stated, or for a given fixed communication level $p$ when explicitly specified. The Oracle baseline is always evaluated with $p = 1$. We refer to Appendix B.2 for a complete description of the experimental methodology, including the hyperparameters of the predictive model and the RL controllers, as well as the code used for this work.
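For reference, 95% bootstrapped confidence intervals of the kind reported in the tables and plots can be computed with a percentile bootstrap over evaluation returns, roughly as in the sketch below (a generic illustration, not the authors' evaluation script):

```python
import numpy as np

def bootstrap_ci(returns, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean evaluation return."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=np.float64)
    means = np.array([
        rng.choice(returns, size=returns.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    alpha = (1.0 - confidence) / 2.0
    low, high = np.quantile(means, [alpha, 1.0 - alpha])
    return returns.mean(), (low, high)

# Usage: mean and 95% CI over, e.g., 300 evaluation rollouts (3 runs x 100 rollouts).
mean, (low, high) = bootstrap_ci(np.random.default_rng(1).normal(-10.0, 2.0, 300))
print(f"{mean:.2f} (95% CI: [{low:.2f}, {high:.2f}])")
```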
4.3 Results
We present the main evaluation results in Tables 1 and 2 for the value-based and actor-critic-based algorithms, respectively. For each environment, RL algorithm, and method, we present the values of the accumulated rewards obtained for $p \sim \mathcal{U}(0, 1)$. The values that are not significantly different from the highest are presented in bold. The results show that MARO is the best-performing method overall. In particular, out of the 24 algorithm-environment combinations considered, MARO performed equal