ELIGN: Expectation Alignment
as a Multi-Agent Intrinsic Reward
Zixian Ma1, Rose Wang1, Li Fei-Fei1, Michael Bernstein1, Ranjay Krishna1,2
Stanford University1, University of Washington2
{zixianma,rewang,feifeili,msb,ranjaykrishna}@cs.stanford.edu
Abstract
Modern multi-agent reinforcement learning frameworks rely on centralized training
and reward shaping to perform well. However, centralized training and dense
rewards are not readily available in the real world. Current multi-agent algorithms
struggle to learn in the alternative setup of decentralized training or sparse rewards.
To address these issues, we propose ELIGN, a self-supervised intrinsic reward based on
expectation alignment and inspired by the self-organization principle in Zoology.
Similar to how animals collaborate in a decentralized manner with those in their
vicinity, agents trained with expectation alignment learn behaviors that match their
neighbors’ expectations. This allows the agents to learn collaborative behaviors
without any external reward or centralized training. We demonstrate the efficacy
of our approach across 6 tasks in the multi-agent particle and the complex Google
Research football environments, comparing ELIGN to sparse and curiosity-based
intrinsic rewards. When the number of agents increases, ELIGN scales well in all
multi-agent tasks except for one where agents have different capabilities. We show
that agent coordination improves through expectation alignment because agents
learn to divide tasks amongst themselves, break coordination symmetries, and
confuse adversaries. These results identify tasks where expectation alignment is a
more useful strategy than curiosity-driven exploration for multi-agent coordination,
enabling agents to achieve zero-shot coordination.
1 Introduction
Many real world AI applications can be formulated as multi-agent systems, including autonomous
vehicles (Cao et al., 2012), resource management (Ying & Dayong, 2005), traffic control (Sunehag
et al., 2017), robot swarms (Swamy et al., 2020), and multi-player video games (Berner et al., 2019).
Agents must adapt their behaviors to each other in order to coordinate successfully in these systems.
However, adaptive coordination algorithms are challenging to develop because each agent is not privy
to other agents' intentions or future behaviors (Foerster et al., 2017).
These challenges are more acute in decentralized training under partial observability than in centralized
training or under full observability. In the real world, agents act under partial observability and learn in a
decentralized manner: they do not learn collaborative behaviors with a single centralized algorithm
that has complete knowledge of the environment (Iqbal & Sha, 2019; Liu et al., 2020). Unfortunately,
the most successful multi-agent algorithms train agents with a centralized critic, assuming access to
all agents' observations and actions (Foerster et al., 2018; Rashid et al., 2018; Sunehag et al., 2017;
Lowe et al., 2017). The most successful multi-agent algorithms for decentralized training and partial
observability assume task-specific reward shaping (Jain et al., 2020; Iqbal & Sha, 2019), which is
expensive to generate. These algorithms struggle to learn with a sparse reward structure.

Figure 1: We introduce ELIGN, i.e., expectation alignment, a task-agnostic intrinsic reward to improve multi-agent systems. Intuitively, ELIGN encourages agents to become more predictable to their neighbors. An agent (e.g., agent $i$ here) learns to behave in ways that match its neighbors' (e.g., agent $j$'s) predictions of its next observation. Here, agent $j$ expects agent $i$ to move up instead of down, moving closer to a point of interest above it. Agent $i$ attains (a) a higher reward when its action (e.g., upward) aligns with this expectation, or (b) a lower reward when its action (e.g., downward) is misaligned.

Consider a cooperative navigation task, where $N$ agents aim to simultaneously occupy $N$ goal
locations. A centralized algorithm with full observability is capable of optimally assigning the nearest
goal location to each respective agent. However, with partial observability, agents can see only a
handful of goal locations and other agents. With decentralized training, they are unaware of others'
observations, actions, and intentions. We observe that agents simultaneously occupy the same goal;
they fail to collaborate because they do not predict which goal each agent is expected to occupy. To
overcome instances of miscoordination, decentralized algorithms have adapted single-agent curiosity-
based intrinsic rewards (Pathak et al., 2017; Stadie et al., 2015). Multi-agent curiosity-based rewards
incentivize agents to explore novel states (Iqbal & Sha, 2020). Although curiosity helps agents
discover new goal locations, it doesn’t solve the challenge of coordination, such as assigning goals to
each agent. Only a few attempts explore other forms of multi-agent intrinsic rewards (Iqbal & Sha,
2020; Böhmer et al., 2019; Schafer, 2019).
In this work, we propose ELIGN as a novel multi-agent self-supervised intrinsic reward, enabling
decentralized training under partial observability. Intuitively, expectation alignment encourages
agents to behave in ways that decrease future uncertainty for their team: it encourages each agent to
choose actions that match its teammates' expectations. Going back to the cooperative navigation
task, expectation alignment encourages each agent to move to goals others expect it to occupy, like
goals that are either closest to the agent or goals that other agents aren’t moving towards (Figure 1).
We take inspiration from the self-organization principle in Zoology (Couzin, 2007). This principle
hypothesizes that collective animal intelligence emerges because groups synchronize their behaviors
using only their local environment; they do not rely on complete information about other agents and
can coordinate successfully by predicting the dynamics of agents within their field-of-view (Collett
et al., 1998; Theraulaz & Bonabeau, 1995; Ben-Jacob et al., 1994; Buhl et al., 2006). Similarly,
expectation alignment as an intrinsic reward is calculated based on the agent’s local observations and
its approximation of neighboring agents’ expectations. It does not require a centralized controller
nor full observability. ELIGN is task-agnostic and we apply it to both collaborative and competitive
multi-agent tasks.
We demonstrate the efficacy of our approach in the multi-agent particle and Google Research football
environments, two popular benchmarks for multi-agent reinforcement learning (Lowe et al., 2017;
Kurach et al., 2019). We evaluate ELIGN under partial and full observability, with decentralized and
centralized training, and in terms of scalability. We observe that expectation alignment outperforms
sparse and curiosity-based intrinsic rewards (Ndousse et al., 2021; Stadie et al., 2015; Iqbal &
Sha, 2020), especially under partial observability with decentralized training. We additionally test
expectation alignment as a way to perform zero-shot coordination with new agent partners, and
investigate why ELIGN improves coordination. We show that agent coordination improves through
expectation alignment because agents learn to divide tasks amongst themselves and break coordination
symmetries (Hu et al., 2020).
2 Related Work
Our formulation of expectation alignment, a task-agnostic intrinsic reward for multi-agent training,
draws inspiration from the self-organization principle in Zoology, which posits that synchronized
group behavior is mediated by local behavioral rules (Couzin, 2007) and not by a centralized
controller (Camazine et al., 2020). Group cohesion emerges by predicting and adjusting one’s
behavior to that of near neighbors (Buhl et al., 2006). This principle underlies the coordination found
in multi-cellular organisms (Camazine et al., 2020), the migration of wingless locusts (Collett et al.,
1998), the collective swarms of bacteria (Ben-Jacob et al., 1994), the construction of bridge structures
by ants (Theraulaz & Bonabeau, 1995), and some human navigation behaviors (Couzin, 2007).
Intrinsic motivation for single agents.
Although we draw inspiration from Zoology for formalizing expectation alignment as an intrinsic reward,
there is a rich body of work on intrinsic rewards within the single-agent reinforcement learning community.
To incentivize exploration, even when non-optimal successful trajectories are uncovered first, scholars have
argued for the use of intrinsic motivation (Schmidhuber, 1991). Single-agent intrinsic motivation has focused
on exploring previously unencountered states (Pathak et al., 2017; Burda et al., 2018a), which works
particularly well in discrete domains. In continuous domains, identifying unseen states requires keeping
track of an intractable number of visited states; instead, the literature has recommended learning a forward
dynamics model to predict future states and identify novel states using the uncertainty of this model
(Achiam & Sastry, 2017). Other formulations encourage re-visiting states where the dynamics model's
prediction of future states errs (Stadie et al., 2015; Pathak et al., 2017). Follow-up papers have improved
how uncertainty (Kim et al., 2020) and model errors (Burda et al., 2018b; Sekar et al., 2020) are calculated.
Intrinsic motivation for multiple agents.
Most multi-agent intrinsic rewards have been adapted from single-agent curiosity-based incentives
(Böhmer et al., 2019; Schafer, 2019) and have primarily focused on cooperative tasks. They propose
intrinsic rewards to improve coordination, collaboration, or deception: these rewards either maximize
the information conveyed by an agent's actions (Chitnis et al., 2020; Wang et al., 2019), shape the
influence of an agent (Jaques et al., 2019; Foerster et al., 2017), incentivize agents to hide their
intentions (Strouse et al., 2018), build accurate models of other agents' policies (Hernandez-Leal
et al., 2019; Jaques et al., 2019), or break down extrinsic rewards for better credit assignment
(Du et al., 2019).
Several multi-agent intrinsic rewards (Hernandez-Leal et al., 2019; Jaques et al., 2019), including
ours, rely on the ability to model others’ dynamics in a shared environment. This ability is a key
component to coordination, closely related to Theory of Mind (Tomasello et al., 2005). Our work
can be interpreted as using a Theory of Mind model of others’ behaviors to calculate an intrinsic
motivation loss. Unlike existing Theory of Mind methods that learn a model per collaborator (Roy
et al., 2020), we learn a single dynamics model, allowing our method to scale as the number of
agents increases. Our proposal is related to model-based reinforcement learning (Jaderberg et al.,
2016; Wang et al., 2020a); however, instead of learning a dynamics model for control, we learn a
dynamics model as a source of reward. Our work is closely related to a recently proposed auxiliary
loss on predicting an agent’s own future states (Ndousse et al., 2021). However, there are three
key differences. First, their work predicts ego-agent observations, whereas our work additionally
predicts future observations from the other agents’ point of view. Second, their loss optimizes state
embeddings while ours optimizes agents’ policies. Third, their work focuses on cooperative tasks
whereas ours applies to both cooperative and competitive domains.
Multi-agent reinforcement learning algorithms.
Today, the predominant deep multi-agent frame-
work uses actor-critic methods with a centralized critic and decentralized execution (Lowe et al.,
2017; Foerster et al., 2018; Iqbal & Sha, 2019; Liu et al., 2020; Rashid et al., 2018). This framework
allows a critic to access the observations and actions of all agents to ease training. However, there are
several situations where centralized training may not be desirable or possible. Examples include low
bandwidth communication restrictions or human-robot tasks where observations cannot be easily
shared between agents (Ying & Dayong, 2005; Cao et al., 2012; Huang et al., 2015). Decentralized
training is therefore the most practical training paradigm, but it suffers from unstable training: the
environment is nonstationary from a single agent's perspective (Lowe et al., 2017). Our work uses a
decentralized training framework and tackles the nonstationarity challenge with an intrinsic reward
designed to improve an agent’s ability to model others. We also apply expectation alignment to
centralized training and observe that it still aids cooperative and some competitive tasks.
3 Background
We formulate our setting as a partially observable Markov game $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, r_{ex}, N)$ (Littman, 1994).
A Markov game for $N$ agents is defined by a state space $\mathcal{S}$ describing the possible configurations
of the environment. The observation space for agents is $\mathcal{O} = (\mathcal{O}_1, \dots, \mathcal{O}_N)$ and the action space is
$\mathcal{A} = (\mathcal{A}_1, \dots, \mathcal{A}_N)$. Each agent $i$ observes $o_i \in \mathcal{O}_i$, a private partial view of the state, and performs
actions $a_i \in \mathcal{A}_i$. Each agent acts according to a stochastic policy $\pi_{\theta_i}: \mathcal{O}_i \times \mathcal{A}_i \rightarrow [0, 1]$,
where $\theta_i$ parameterizes the policy. The environment changes according to the state transition function,
which maps the current state and each agent's action to the next state, $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$.
The team of agents obtains a shared extrinsic reward as a function of the environment state and actions,
$r_{ex}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. The team's goal is to maximize the total expected return
$$R = \sum_{t=0}^{T} \gamma^t r_{ex}^t$$
where $0 \leq \gamma \leq 1$ is the discount factor, $t$ is the time step, and $T$ is the time horizon. The environment may
also contain adversarial agents who have their own reward structure.
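To ground this notation, the sketch below shows one way the tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, r_{ex}, N)$ could map onto a minimal multi-agent environment interface; the class and method names are illustrative assumptions, not tied to the benchmarks used later in the paper.

```python
from typing import List, Tuple

class MarkovGame:
    """A minimal interface for a partially observable Markov game with N agents
    and a shared extrinsic reward (names here are illustrative, not a real API)."""

    def __init__(self, n_agents: int):
        self.n_agents = n_agents  # N

    def reset(self) -> List[List[float]]:
        """Return the initial private observations (o_1, ..., o_N)."""
        raise NotImplementedError

    def step(self, actions: List[int]) -> Tuple[List[List[float]], float, bool]:
        """Apply the joint action (a_1, ..., a_N) through the transition function T
        and return the next observations, the shared extrinsic reward r_ex, and a
        done flag indicating the end of the horizon T."""
        raise NotImplementedError
```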
4 Expectation Alignment
To understand expectation alignment intuitively, let's revisit the cooperative navigation task, where
$N$ agents are rewarded for simultaneously occupying as many goal locations as possible. In Figure 1,
agent $i$ has a dynamics model trained on its past experiences. It predicts how future states will
evolve from the point of view of agent $j$, who is within $i$'s view. In this example, $j$ will expect $i$
to move towards the goal since $i$ is closer to it. ELIGN encourages $i$ to pursue the action that $j$
expects (Figure 1(a)). In turn, $j$ can now assume that the observed goal location will eventually be
occupied by $i$ and should therefore explore to find another goal. By aligning shared expectations,
agent behaviors become more predictable. Conversely, when neighbors behave opposite to an agent's
predictions, the agent can make inferences about the environment outside of its own receptive field
(Krause et al., 2002). For example, in Figure 1(b), if agent $j$ observes $i$ running away from a goal,
this surprising behavior might indicate the existence of an adversary outside $j$'s receptive field.
Our training algorithm consists of three interwoven phases: learning a dynamics model, calculating
the ELIGN reward, and optimizing the agent's policy (Algorithm 1).
4.1 Training the dynamics model
Similar to prior work (Wang et al., 2018; Kidambi et al., 2020), each agent $i$ learns a dynamics model
$f_{\theta_i}$ to predict the next observation $\hat{o}'_i$ given its current observation and action $(o_i, a_i)$, i.e.,
$$\hat{o}'_i = f_{\theta_i}(o_i, a_i).$$
We use a three-layer multi-layer perceptron with ReLU non-linearities as the dynamics model. We
minimize the mean squared error between its prediction and the ground-truth next observation $o'_i$.
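As a concrete illustration, the sketch below implements such a dynamics model and its MSE update in PyTorch; the hidden width, optimizer handling, and function names are our own assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Three-layer MLP f_theta_i: (o_i, a_i) -> predicted next observation o'_i."""
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, obs_dim),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Concatenate observation and action, then predict the next observation.
        return self.net(torch.cat([obs, act], dim=-1))

def dynamics_update(model, optimizer, obs, act, next_obs):
    """One MSE training step on a batch of (o_i, a_i, o'_i) transitions."""
    pred = model(obs, act)
    loss = nn.functional.mse_loss(pred, next_obs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```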
4.2 Calculating intrinsic reward
The intrinsic reward captures how well agent $i$ aligns to its neighbors' (e.g., agent $j$'s) expectations
of its next state. Calculating this reward requires $j$ to accurately predict $i$'s behavior, simulating a
Theory of Mind (Tomasello et al., 2005). As suggested by the self-organization principle, $i$ must
learn to align to $j$'s predictions. Ideally, the ELIGN intrinsic reward is calculated as:
$$r_{in}(o_i, a_i) = -\frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \| o'_i - f_{\theta_j}(o_i, a_i) \|$$
where $\mathcal{N}(i)$ is the set of neighbors within $i$'s receptive field, including $i$ itself. The ELIGN reward
is high when the average $L_2$ loss is small, i.e., when $i$'s actual next observation is close to agent $j$'s
predicted observation of $i$ for all neighbors $j$. In that case, $i$ has chosen an action that aligns
with $j$'s expectations of how $i$ should act.
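Read literally, this ideal form could be computed as in the sketch below, where `neighbor_models` holds the dynamics models $f_{\theta_j}$ of all neighbors $j \in \mathcal{N}(i)$ (e.g., instances of the `DynamicsModel` sketch above); the signature and batching are illustrative assumptions.

```python
import torch

def elign_reward_ideal(next_obs_i, obs_i, act_i, neighbor_models):
    """Ideal ELIGN reward: negative mean prediction error of the neighbors'
    dynamics models f_theta_j on agent i's transition. This form assumes access
    to the neighbors' models, which decentralized training does not provide."""
    with torch.no_grad():
        errors = [torch.norm(next_obs_i - f_j(obs_i, act_i)) for f_j in neighbor_models]
    return -torch.stack(errors).mean().item()
```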
In a decentralized training setup, however, $i$ doesn't have access to $j$'s dynamics model $f_{\theta_j}$, so $i$
approximates $j$'s dynamics model with a proxy: its own dynamics model $f_{\theta_i}$ and the knowledge of
agent $j$'s observation radius. Such an approximation is ecologically valid since we often approximate
others' behaviors using a second-order cognitive Theory of Mind (Morin, 2006). Additionally, $i$
doesn't have access to $j$'s entire observation, so we restrict the future prediction from $j$'s point
of view to the portion of $j$'s observation that $i$ can see: $o_{i \cap j} = o_i \cap o_j$. Agent $i$'s decentralized
intrinsic reward then becomes:
$$r_{in}(o_i, a_i) = -\frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \| o'_{i \cap j} - f_{\theta_i}(o_{i \cap j}, a_i) \|$$
We found that approximating $f_{\theta_j}$ with $f_{\theta_i}$ works well empirically: dynamics model losses
for all agents quickly decrease within 5-10 training epochs. We also validate its applicability in small-
scale heterogeneous multi-agent tasks where agents have variable capabilities, although we find the
methods perform similarly when more heterogeneous agents are added.
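To make the approximation concrete, here is a small PyTorch sketch of the decentralized reward. The `neighbor_masks` argument stands in for the restriction $o_{i \cap j}$, i.e., which entries of agent $i$'s observation each neighbor can also see; how these masks are constructed is environment-specific, and zeroing out unshared entries is only one simple choice.

```python
import torch

def elign_reward(dyn_model_i, obs_i, act_i, next_obs_i, neighbor_masks):
    """Decentralized ELIGN reward for agent i (a sketch).

    neighbor_masks: list of boolean tensors, one per neighbor j (including i itself),
    selecting the entries of agent i's observation that agent j can also see
    (a stand-in for o_{i∩j}; building this mask is environment-specific)."""
    if len(neighbor_masks) == 0:
        return 0.0
    errors = []
    with torch.no_grad():
        for mask in neighbor_masks:
            # Restrict observations to the shared portion o_{i∩j}. Zeroing out
            # unshared entries is purely illustrative; the paper restricts the
            # prediction to the portion of j's observation that i can see.
            masked_obs = obs_i * mask
            masked_next = next_obs_i * mask
            pred_next = dyn_model_i(masked_obs, act_i)
            errors.append(torch.norm(masked_next - pred_next * mask))
    # Negative mean prediction error: the reward is high when i's actual next
    # observation matches what its neighbors would predict.
    return -torch.stack(errors).mean().item()
```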
4.3 Policy learning
Algorithm 1 ELIGN: Expectation Alignment
1: Initialize replay buffers $D$ and $D'$
2: Initialize $N$ agents with random $\theta_i$ : $i \in [1, N]$
3: while not converged do
4:   for $b = 1 \dots B$ do
5:     Populate buffer $D$ with an episode using policies $(\pi_{\theta_1}, \dots, \pi_{\theta_N})$
6:   end for
7:   // TRAIN DYNAMICS MODEL
8:   for agent $i = 1 \dots N$ do
9:     Sample transitions $\{(o_i, a_i, r_{ex}, o'_i)\} \sim D_i$
10:    Predict $\hat{o}'_i = f_{\theta_i}(o_i, a_i)$
11:    Update dynamics $\theta_i$ using $o'_i$
12:  end for
13:  // CALCULATE ELIGN REWARD
14:  for agent $i = 1 \dots N$ do
15:    Sample $B$ transitions $\{(o_i, a_i, r_{ex}, o'_i)\} \sim D_i$
16:    Compute intrinsic rewards $r_{in}(o_i, a_i)$
17:    Add $\{(o_i, a_i, r_{ex} + \beta r_{in}, o'_i)\}$ to $D'_i$
18:  end for
19:  // POLICY LEARNING
20:  Update all $\theta_i$'s using transitions from $D'$
21: end while
Once the ELIGN rewards are calculated, the total reward at each step for each agent $i$ is
$r_i = r_{ex} + \beta \, r_{in}(o_i, a_i)$, where $r_{ex}$ is the extrinsic reward provided by the environment and
$\beta$ is a hyperparameter weighing the intrinsic reward in the agent's overall reward calculation. In
practice, we set $\beta$ to $\frac{1}{|\mathcal{O}_i|}$, where $|\mathcal{O}_i|$ is the observation dimension; we find this scale generalizes
well across tasks. Since our contribution is agnostic to any particular multi-agent training algorithm,
the team of agents can now be trained using any multi-agent training algorithm to maximize the
return $R = \sum_{t=0}^{T} \gamma^t r_t$.
Both centralized and decentralized training algorithms can make use of these rewards. We primarily
use the multi-agent decentralized variant of the soft actor-critic algorithm in our experiments
(Haarnoja et al., 2018; Iqbal & Sha, 2019). Compared to centralized joint-action training, whose
action space grows exponentially in the number of agents $N$, our decentralized method has linear
space complexity. Further, decentralized training can parallelize training time to be less than linear
with respect to $N$. Although we present results with one centralized training framework, studying
the impact of expectation alignment with all centralized-critic frameworks is out of scope for this
paper.
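For illustration, the reward-shaping step (lines 14-18 of Algorithm 1) might look like the sketch below, which reuses the `elign_reward` sketch above; the transition-dictionary keys are our own naming, not the paper's.

```python
def relabel_with_elign(transitions, dyn_model_i, obs_dim):
    """Relabel a batch of agent-i transitions with the shaped reward
    r = r_ex + beta * r_in, where beta = 1 / |O_i| (lines 14-18 of Algorithm 1).

    Each transition is assumed to be a dict with keys 'obs', 'act', 'r_ex',
    'next_obs', and 'neighbor_masks' (illustrative naming)."""
    beta = 1.0 / obs_dim
    shaped = []
    for tr in transitions:
        r_in = elign_reward(dyn_model_i, tr['obs'], tr['act'],
                            tr['next_obs'], tr['neighbor_masks'])
        shaped.append({**tr, 'reward': tr['r_ex'] + beta * r_in})
    return shaped
```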
4.4 Extending expectation alignment to competitive tasks
We extend the ELIGN formulation to competitive tasks where a team of agents competes against
adversaries. In this case, agents are encouraged to misalign with their adversaries' expectations, i.e.,
each agent is incentivized to be unpredictable to the adversaries within its receptive field $\mathcal{N}_{adv}(i)$:
$$r_{in}(o_i, a_i) = \frac{1}{|\mathcal{N}_{adv}(i)|} \sum_{k \in \mathcal{N}_{adv}(i)} \| o'_{i \cap k} - f_{\theta_i}(o_{i \cap k}, a_i) \|$$
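In code, the competitive variant simply flips the sign of the alignment term and averages over the adversaries within range, again reusing the `elign_reward` sketch above; a full implementation would combine teammate alignment and adversary misalignment terms as the task requires.

```python
def elign_adversarial_reward(dyn_model_i, obs_i, act_i, next_obs_i, adversary_masks):
    """Competitive ELIGN sketch: reward agent i for being hard to predict by the
    adversaries in its receptive field, i.e., the sign of the alignment term flips."""
    return -elign_reward(dyn_model_i, obs_i, act_i, next_obs_i, adversary_masks)
```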
5 Experiments
Our experiments explore the utility of using expectation alignment as an intrinsic reward compared
to sparse and curiosity-based intrinsic rewards. We primarily focus on decentralized training under
partial observability. However, we also demonstrate that ELIGN can easily augment centralized
methods and assist in fully observable tasks. We vary the number of agents in the multi-agent particle
tasks to test scalability. We end by investigating how and why ELIGN improves coordination by
designing three evaluation conditions. First, does expectation alignment improve coordination by
helping agents divide tasks amongst themselves?