2 Related Work
Our formulation of expectation alignment, a task-agnostic intrinsic reward for multi-agent training,
draws inspiration from the self-organization principle in zoology, which posits that synchronized
group behavior is mediated by local behavioral rules (Couzin, 2007) rather than by a centralized
controller (Camazine et al., 2020). Group cohesion emerges as individuals predict and adjust their
behavior to match that of their near neighbors (Buhl et al., 2006). This principle underlies the
coordination found in multi-cellular organisms (Camazine et al., 2020), the migration of wingless
locusts (Collett et al., 1998), the collective swarms of bacteria (Ben-Jacob et al., 1994), the
construction of bridge structures by ants (Theraulaz & Bonabeau, 1995), and some human navigation
behaviors (Couzin, 2007).
Intrinsic motivation for single agents.
Although we draw inspiration from zoology for formalizing expectation alignment as an intrinsic
reward, there is a rich body of work on intrinsic rewards within the single-agent reinforcement
learning community. To incentivize exploration even when suboptimal yet successful trajectories are
discovered first, scholars have argued for the use of intrinsic motivation (Schmidhuber, 1991).
Single-agent intrinsic motivation has focused on exploring previously unencountered states (Pathak
et al., 2017; Burda et al., 2018a), which works particularly well in discrete domains. In continuous
domains, identifying unseen states requires keeping track of an intractable number of visited states;
instead, prior work learns a forward dynamics model to predict future states and identifies novel
states using the uncertainty of this model (Achiam & Sastry, 2017). Other formulations encourage
re-visiting states where the dynamics model's prediction of future states errs (Stadie et al., 2015;
Pathak et al., 2017). Follow-up work has improved how uncertainty (Kim et al., 2020) and model
errors (Burda et al., 2018b; Sekar et al., 2020) are estimated.
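The prediction-error formulation admits a compact sketch. The following minimal, illustrative
implementation (not taken from any of the cited papers; the class and function names are our own)
shows a forward dynamics model whose error serves as a curiosity bonus, in the spirit of Stadie
et al. (2015) and Pathak et al. (2017):

```python
import torch
import torch.nn as nn

class ForwardDynamicsModel(nn.Module):
    """Predicts the next state from the current state and action."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def curiosity_bonus(model, state, action, next_state):
    """Intrinsic reward = squared prediction error of the forward model.

    The model is separately trained to minimize this same error on visited
    transitions, so the bonus shrinks in familiar regions of the state space
    and stays high in novel ones.
    """
    with torch.no_grad():
        pred_next = model(state, action)
    return ((pred_next - next_state) ** 2).mean(dim=-1)
```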
Intrinsic motivation for multiple agents.
Most multi-agent intrinsic rewards have been adapted from single-agent curiosity-based incentives
(Böhmer et al., 2019; Schafer, 2019) and have primarily focused on cooperative tasks. These works
propose intrinsic rewards to improve coordination, collaboration, or deception: the rewards either
maximize the information conveyed by an agent's actions (Chitnis et al., 2020; Wang et al., 2019),
shape the influence of an agent (Jaques et al., 2019; Foerster et al., 2017), incentivize agents to
hide their intentions (Strouse et al., 2018), build accurate models of other agents' policies
(Hernandez-Leal et al., 2019; Jaques et al., 2019), or decompose extrinsic rewards for better credit
assignment (Du et al., 2019).
Several multi-agent intrinsic rewards (Hernandez-Leal et al., 2019; Jaques et al., 2019), including
ours, rely on the ability to model others’ dynamics in a shared environment. This ability is a key
component of coordination, closely related to Theory of Mind (Tomasello et al., 2005). Our work
can be interpreted as using a Theory of Mind model of others’ behaviors to calculate an intrinsic
motivation loss. Unlike existing Theory of Mind methods that learn a model per collaborator (Roy
et al., 2020), we learn a single dynamics model, allowing our method to scale as the number of
agents increases. Our proposal is related to model-based reinforcement learning (Jaderberg et al.,
2016; Wang et al., 2020a); however, instead of learning a dynamics model for control, we learn a
dynamics model as a source of reward. Our work is closely related to a recently proposed auxiliary
loss on predicting an agent’s own future states (Ndousse et al., 2021). However, there are three
key differences. First, their work predicts ego-agent observations, whereas our work additionally
predicts future observations from the other agents’ point of view. Second, their loss optimizes state
embeddings while ours optimizes agents’ policies. Third, their work focuses on cooperative tasks
whereas ours applies to both cooperative and competitive domains.
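To make the scaling argument concrete, the sketch below illustrates, under assumptions of our own
(it is not the paper's exact loss, and all names are hypothetical), how a single dynamics model
shared across all other agents can turn prediction error about neighbors' transitions into an
intrinsic reward without adding parameters per agent:

```python
import torch
import torch.nn as nn

class SharedNeighborModel(nn.Module):
    """One dynamics model reused for every other agent, so the parameter
    count does not grow with the number of agents."""
    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, neighbor_obs):
        # neighbor_obs: (n_neighbors, obs_dim) -> predicted next observations
        return self.net(neighbor_obs)

def alignment_style_reward(model, neighbor_obs, neighbor_next_obs):
    """Hypothetical intrinsic reward: higher when neighbors' observed
    transitions match the shared model's expectation of them."""
    with torch.no_grad():
        pred = model(neighbor_obs)
        err = ((pred - neighbor_next_obs) ** 2).mean()
    return -err.item()
```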
Multi-agent reinforcement learning algorithms.
Today, the predominant deep multi-agent frame-
work uses actor-critic methods with a centralized critic and decentralized execution (Lowe et al.,
2017; Foerster et al., 2018; Iqbal & Sha, 2019; Liu et al., 2020; Rashid et al., 2018). This framework
allows a critic to access the observations and actions of all agents to ease training. However, there are
several situations where centralized training may not be desirable or possible. Examples include
low-bandwidth communication constraints and human-robot tasks where observations cannot easily be
shared between agents (Ying & Dayong, 2005; Cao et al., 2012; Huang et al., 2015). In such settings,
decentralized training is the more practical paradigm, but it suffers from unstable training: the
environment is nonstationary from a single agent's perspective (Lowe et al., 2017). Our work uses a
decentralized training framework and tackles the nonstationarity challenge with an intrinsic reward
designed to improve an agent’s ability to model others. We also apply expectation alignment to
centralized training and observe that it still aids cooperative and some competitive tasks.
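For concreteness, the sketch below shows the interface difference between decentralized execution
and a centralized critic in the style of MADDPG (Lowe et al., 2017); the architectures and
dimensions are illustrative assumptions rather than the implementation used in this paper:

```python
import torch
import torch.nn as nn

class DecentralizedActor(nn.Module):
    """Each agent selects actions from its own observation only
    (decentralized execution)."""
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """During training the critic conditions on every agent's observation
    and action, which eases credit assignment but requires sharing them."""
    def __init__(self, n_agents: int, obs_dim: int, action_dim: int,
                 hidden: int = 64):
        super().__init__()
        joint_dim = n_agents * (obs_dim + action_dim)
        self.net = nn.Sequential(
            nn.Linear(joint_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_actions):
        # all_obs: (batch, n_agents, obs_dim)
        # all_actions: (batch, n_agents, action_dim)
        joint = torch.cat([all_obs.flatten(1), all_actions.flatten(1)], dim=-1)
        return self.net(joint)
```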