too restrictive since such policies do not take into account the possibility of communication between
the agents. Other MARL strategies, which do take advantage of additional information shared among
the agents, can surely be developed [42].
In this work, we propose RL agents that are able to exploit the benefits of centralized training
while, simultaneously, taking advantage of information-sharing at execution time. We introduce
the paradigm of hybrid execution, in which agents act in scenarios with arbitrary (but unknown)
communication levels that can range from no communication (fully decentralized) to full commu-
nication between the agents (fully centralized). In particular, we consider scenarios with faulty
communication during execution, in which agents passively share their local observations to perform
partially observable cooperative tasks. To formalize our setting, we start by defining hybrid partially
observable Markov decision process (H-POMDP), a new class of multi-agent POMDPs that explicitly
considers a communication process between the agents. We then propose a novel method that allows
agents to solve H-POMDPs regardless of the communication process encountered at execution time.
Specifically, we propose multi-agent observation sharing under communication dropout (MARO).
MARO can be easily integrated with current deep MARL methods and comprises an auto-regressive
model, trained in a centralized manner, that explicitly predicts non-shared information from past
observations of the agents.
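As an illustration of the intended execution loop, the sketch below combines a random communication mask (one possible model of faulty communication) with a GRU-based autoregressive predictor that fills in the observations an agent did not receive; each agent then acts on the completed joint observation. This is a minimal sketch and not the actual MARO implementation: the network sizes, the Bernoulli masking scheme, and all names (e.g., ObservationPredictor, hybrid_execution_step, p_comm) are our assumptions.

```python
import torch
import torch.nn as nn

class ObservationPredictor(nn.Module):
    """Autoregressive model: predicts the current joint observation from the
    sequence of past joint observations (carried in its recurrent state)."""
    def __init__(self, n_agents, obs_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(n_agents * obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_agents * obs_dim)

    def forward(self, prev_joint_obs, h=None):
        # prev_joint_obs: (batch, 1, n_agents * obs_dim)
        out, h = self.gru(prev_joint_obs, h)
        return self.head(out), h


def hybrid_execution_step(policies, predictor, local_obs, prev_joint, h, p_comm):
    """One execution step under an (unknown) communication level p_comm."""
    n_agents, obs_dim = local_obs.shape
    # Predict the full joint observation from past information.
    pred, h = predictor(prev_joint.view(1, 1, -1), h)
    pred = pred.view(n_agents, obs_dim)
    # Communication mask: entry (i, j) = 1 if agent j's observation reaches agent i.
    mask = (torch.rand(n_agents, n_agents) < p_comm).float()
    mask.fill_diagonal_(1.0)  # each agent always has its own observation
    actions = []
    for i in range(n_agents):
        m = mask[i].unsqueeze(1)                    # (n_agents, 1)
        completed = m * local_obs + (1 - m) * pred  # fill missing entries with predictions
        actions.append(policies[i](completed.flatten()))
    return actions, h
```

During centralized training, such a predictor can be fit by regression against the true joint observations, which are available at that stage; at execution time, only the (possibly completed) observations each agent actually holds would be fed back into it.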
We evaluate the performance of MARO across different communication levels, in different MARL
benchmark environments and using multiple RL algorithms. Furthermore, we introduce novel
MARL environments, currently missing from the literature, that explicitly require communication during execution to successfully perform cooperative tasks. Experimental results show that our method
consistently outperforms the baselines, allowing agents to exploit shared information during execution
and perform tasks under various communication levels.
In summary, our contributions are three-fold: (i) we propose and formalize the setting of hybrid
execution in MARL, in which agents must perform partially-observable cooperative tasks across
all possible communication levels; (ii) we propose MARO, an approach that makes use of an
autoregressive predictive model of agents’ observations; and (iii) we evaluate MARO in multiple
environments using different RL algorithms, showing that our approach consistently allows agents to
act under different communication levels.
2 Hybrid Execution in Multi-Agent Reinforcement Learning
A fully cooperative multi-agent system with Markovian dynamics can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP) [18]. A Dec-POMDP is a tuple $([n], \mathcal{X}, \mathcal{A}, P, r, \gamma, \mathcal{Z}, O)$, where $[n] = \{1, \ldots, n\}$ is the set of indexes of $n$ agents, $\mathcal{X}$ is the set of states of the environment, $\mathcal{A} = \times_i \mathcal{A}_i$ is the set of joint actions, where $\mathcal{A}_i$ is the set of individual actions of agent $i$, $P$ is the set of probability distributions over next states in $\mathcal{X}$, one for each state and action in $\mathcal{X} \times \mathcal{A}$, $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ maps states and actions to expected rewards, $\gamma \in [0, 1[$ is a discount factor, $\mathcal{Z} = \times_i \mathcal{Z}_i$ is the set of joint observations, where $\mathcal{Z}_i$ is the set of local observations of agent $i$, and $O$ is the set of probability distributions over joint observations in $\mathcal{Z}$, one for each state and action in $\mathcal{X} \times \mathcal{A}$. A decentralized policy for agent $i$ is $\pi_i : \mathcal{Z}_i \to \mathcal{A}_i$, and the joint decentralized policy is $\pi : \mathcal{Z} \to \mathcal{A}$ such that $\pi(z_1, \ldots, z_n) = (\pi_1(z_1), \ldots, \pi_n(z_n))$.
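For concreteness, the factorization above can be written as a short sketch: a joint decentralized policy is simply the tuple of per-agent policies, each applied to its own local observation. The snippet below is illustrative only; the function name and the representation of policies as plain callables are our assumptions, not part of the formalism.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")

def joint_decentralized_policy(
    agent_policies: Sequence[Callable[[Obs], Act]]
) -> Callable[[Sequence[Obs]], Tuple[Act, ...]]:
    """Build pi from per-agent policies pi_i : Z_i -> A_i."""
    def pi(joint_obs: Sequence[Obs]) -> Tuple[Act, ...]:
        # pi(z_1, ..., z_n) = (pi_1(z_1), ..., pi_n(z_n))
        return tuple(pi_i(z_i) for pi_i, z_i in zip(agent_policies, joint_obs))
    return pi
```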
Fully decentralized approaches to MARL directly apply standard single-agent RL algorithms for
learning each agent's policy $\pi_i$ in a decentralized manner. In independent $Q$-learning (IQL) [30],
each agent treats other agents as being part of the environment, ignoring the influence of other
agents’ observations and actions. Similarly, independent proximal policy optimization (IPPO), an
adaptation of the PPO algorithm [27], learns fully decentralized critic and actor networks, neglecting
the influence of other agents. More recently, under the paradigm of centralized training with
decentralized execution, QMIX [26] aims at learning decentralized policies with centralization at
training time while fostering cooperation among the agents. Multi-agent PPO (MAPPO) [38] learns
decentralized actors using a centralized critic during training. Finally, if we know that all agents can
share their local observations among themselves at execution time, we can use any of the approaches
above to learn fully centralized policies.
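To make the contrast concrete, the sketch below shows a tabular independent Q-learning update in which an agent's Q-function depends only on its own local observation and action; the dictionary-based Q-table, the function name iql_update, and the hyperparameter values are illustrative assumptions rather than details from [30].

```python
from collections import defaultdict

def iql_update(Q_i, z, a, r, z_next, actions_i, alpha=0.1, gamma=0.99):
    """One Q-learning update for agent i; other agents never appear explicitly."""
    best_next = max(Q_i[(z_next, a_next)] for a_next in actions_i)
    Q_i[(z, a)] += alpha * (r + gamma * best_next - Q_i[(z, a)])

# One independent Q-table per agent, keyed only by local observations and actions.
n_agents, actions_i = 2, [0, 1]
Q_tables = [defaultdict(float) for _ in range(n_agents)]
```

Under full observation sharing, the same update could instead key each Q-table on the joint observation, recovering the fully centralized case mentioned above.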
None of the aforementioned classes of methods assumes, however, that agents may sometimes have
access to other agents’ observations and sometimes not. Therefore, decentralized agents are unable to
take advantage of the additional information that they may receive from other agents at execution