ELIGN: Expectation Alignment
as a Multi-Agent Intrinsic Reward
Zixian Ma1, Rose Wang1, Li Fei-Fei1, Michael Bernstein1, Ranjay Krishna1,2
Stanford University1, University of Washington2
{zixianma,rewang,feifeili,msb,ranjaykrishna}@cs.stanford.edu
Abstract
Modern multi-agent reinforcement learning frameworks rely on centralized training
and reward shaping to perform well. However, centralized training and dense
rewards are not readily available in the real world. Current multi-agent algorithms
struggle to learn in the alternative setup of decentralized training or sparse rewards.
To address these issues, we propose ELIGN, a self-supervised intrinsic reward based on
expectation alignment and inspired by the self-organization principle in Zoology.
Similar to how animals collaborate in a decentralized manner with those in their
vicinity, agents trained with expectation alignment learn behaviors that match their
neighbors’ expectations. This allows the agents to learn collaborative behaviors
without any external reward or centralized training. We demonstrate the efficacy
of our approach across 6 tasks in the multi-agent particle and the complex Google
Research football environments, comparing ELIGN to sparse and curiosity-based
intrinsic rewards. When the number of agents increases, ELIGN scales well in all
multi-agent tasks except for one where agents have different capabilities. We show
that agent coordination improves through expectation alignment because agents
learn to divide tasks amongst themselves, break coordination symmetries, and
confuse adversaries. These results identify tasks where expectation alignment is a
more useful strategy than curiosity-driven exploration for multi-agent coordination,
enabling agents to achieve zero-shot coordination.
1 Introduction
Many real world AI applications can be formulated as multi-agent systems, including autonomous
vehicles (Cao et al., 2012), resource management (Ying & Dayong, 2005), traffic control (Sunehag
et al., 2017), robot swarms (Swamy et al., 2020), and multi-player video games (Berner et al., 2019).
Agents must adapt their behaviors to each other in order to coordinate successfully in these systems.
However, adaptive coordination algorithms are challenging to develop because each agent is not privy
to other agents' intentions or future behaviors (Foerster et al., 2017).
These challenges are more acute in decentralized training under partial observability than in centralized
training or under full observability. In the real world, agents act under partial observability and learn in a
decentralized manner: they do not learn collaborative behaviors with a single centralized algorithm
that has complete knowledge of the environment (Iqbal & Sha, 2019; Liu et al., 2020). Unfortunately,
the most successful multi-agent algorithms train agents with a centralized critic, assuming access to
all agents' observations and actions (Foerster et al., 2018; Rashid et al., 2018; Sunehag et al., 2017;
Lowe et al., 2017). The most successful multi-agent algorithms for decentralized training and partial
observability assume task-specific reward shaping (Jain et al., 2020; Iqbal & Sha, 2019), which is
expensive to generate. These algorithms struggle to learn with a sparse reward structure.

Figure 1: We introduce ELIGN, i.e., expectation alignment, a task-agnostic intrinsic reward to improve multi-agent systems. Intuitively, ELIGN encourages agents to become more predictable to their neighbors. An agent (e.g., agent $i$ here) learns to behave in ways that match its neighbors' (e.g., agent $j$'s) predictions of its next observation. Here, agent $j$ expects agent $i$ to move up instead of down, moving closer to a point of interest above it. Agent $i$ attains (a) a higher reward when its action (e.g., upward) aligns with this expectation, or (b) a lower reward when its action (e.g., downward) is misaligned.

Consider a cooperative navigation task, where $N$ agents aim to simultaneously occupy $N$ goal
locations. A centralized algorithm with full observability is capable of optimally assigning the nearest
goal location to each respective agent. However, with partial observability, agents can see only a
handful of goal locations and other agents. With decentralized training, they are unaware of others'
observations, actions, and intentions. We observe that agents simultaneously occupy the same goal;
they fail to collaborate because they do not predict which goal each agent is expected to occupy. To
overcome instances of miscoordination, decentralized algorithms have adapted single-agent curiosity-
based intrinsic rewards (Pathak et al., 2017; Stadie et al., 2015). Multi-agent curiosity-based rewards
incentivize agents to explore novel states (Iqbal & Sha, 2020). Although curiosity helps agents
discover new goal locations, it doesn’t solve the challenge of coordination, such as assigning goals to
each agent. Only a few attempts explore other forms of multi-agent intrinsic rewards (Iqbal & Sha,
2020; Böhmer et al., 2019; Schafer, 2019).
In this work, we propose ELIGN as a novel multi-agent self-supervised intrinsic reward, enabling
decentralized training under partial observability. Intuitively, expectation alignment encourages
agents to behave in ways that decrease future uncertainty for their team: it encourages each agent to
choose actions that match its teammates' expectations. Going back to the cooperative navigation
task, expectation alignment encourages each agent to move to goals others expect it to occupy, like
goals that are either closest to the agent or goals that other agents aren’t moving towards (Figure 1).
We take inspiration from the self-organization principle in Zoology (Couzin, 2007). This principle
hypothesizes that collective animal intelligence emerges because groups synchronize their behaviors
using only their local environment; they do not rely on complete information about other agents and
can coordinate successfully by predicting the dynamics of agents within their field-of-view (Collett
et al., 1998; Theraulaz & Bonabeau, 1995; Ben-Jacob et al., 1994; Buhl et al., 2006). Similarly,
expectation alignment as an intrinsic reward is calculated based on the agent’s local observations and
its approximation of neighboring agents’ expectations. It does not require a centralized controller
nor full observability. ELIGN is task-agnostic and we apply it to both collaborative and competitive
multi-agent tasks.
We demonstrate the efficacy of our approach in the multi-agent particle and Google Research football
environments, two popular benchmarks for multi-agent reinforcement learning (Lowe et al., 2017;
Kurach et al., 2019). We evaluate ELIGN under partial and full observability, with decentralized and
centralized training, and in terms of scalability. We observe that expectation alignment outperforms
sparse and curiosity-based intrinsic rewards (Ndousse et al., 2021; Stadie et al., 2015; Iqbal &
Sha, 2020), especially under partial observability with decentralized training. We additionally test
expectation alignment as a way to perform zero-shot coordination with new agent partners, and
investigate why ELIGN improves coordination. We show that agent coordination improves through
expectation alignment because agents learn to divide tasks amongst themselves and break coordination
symmetries (Hu et al., 2020).
2 Related Work
Our formulation of expectation alignment, a task-agnostic intrinsic reward for multi-agent training,
draws inspiration from the self-organization principle in Zoology, which posits that synchronized
group behavior is mediated by local behavioral rules (Couzin, 2007) and not by a centralized
controller (Camazine et al., 2020). Group cohesion emerges by predicting and adjusting one’s
behavior to that of near neighbors (Buhl et al., 2006). This principle underlies the coordination found
in multi-cellular organisms (Camazine et al., 2020), the migration of wingless locusts (Collett et al.,
1998), the collective swarms of bacteria (Ben-Jacob et al., 1994), the construction of bridge structures
by ants (Theraulaz & Bonabeau, 1995), and some human navigation behaviors (Couzin, 2007).
Intrinsic motivation for single agents.
Although we draw inspiration from Zoology for formalizing expectation alignment as an intrinsic reward,
there is a rich body of work on intrinsic rewards within the single-agent reinforcement learning community.
To incentivize exploration, even when non-optimal successful trajectories are uncovered first, scholars have
argued for the use of intrinsic motivation (Schmidhuber, 1991). Single-agent intrinsic motivation has focused
on exploring previously unencountered states (Pathak et al., 2017; Burda et al., 2018a), which works
particularly well in discrete domains. In continuous domains, identifying unseen states requires keeping
track of an intractable number of visited states; instead, the literature has recommended learning a forward
dynamics model to predict future states and identify novel states using the uncertainty of this model
(Achiam & Sastry, 2017). Other formulations encourage re-visiting states where the dynamics model's
prediction of future states errs (Stadie et al., 2015; Pathak et al., 2017). Follow-up papers have improved
how uncertainty (Kim et al., 2020) and model errors (Burda et al., 2018b; Sekar et al., 2020) are calculated.
Intrinsic motivation for multiple agents.
Most multi-agent intrinsic rewards have been adapted from single-agent curiosity-based incentives
(Böhmer et al., 2019; Schafer, 2019) and have primarily focused on cooperative tasks. They propose
intrinsic rewards to improve coordination, collaboration, or deception: these rewards either maximize
the information conveyed by an agent's actions (Chitnis et al., 2020; Wang et al., 2019), shape the
influence of an agent (Jaques et al., 2019; Foerster et al., 2017), incentivize agents to hide their
intentions (Strouse et al., 2018), build accurate models of other agents' policies (Hernandez-Leal
et al., 2019; Jaques et al., 2019), or break down extrinsic rewards for better credit assignment
(Du et al., 2019).
Several multi-agent intrinsic rewards (Hernandez-Leal et al., 2019; Jaques et al., 2019), including
ours, rely on the ability to model others’ dynamics in a shared environment. This ability is a key
component to coordination, closely related to Theory of Mind (Tomasello et al., 2005). Our work
can be interpreted as using a Theory of Mind model of others’ behaviors to calculate an intrinsic
motivation loss. Unlike existing Theory of Mind methods that learn a model per collaborator (Roy
et al., 2020), we learn a single dynamics model, allowing our method to scale as the number of
agents increases. Our proposal is related to model-based reinforcement learning (Jaderberg et al.,
2016; Wang et al., 2020a); however, instead of learning a dynamics model for control, we learn a
dynamics model as a source of reward. Our work is closely related to a recently proposed auxiliary
loss on predicting an agent’s own future states (Ndousse et al., 2021). However, there are three
key differences. First, their work predicts ego-agent observations, whereas our work additionally
predicts future observations from the other agents’ point of view. Second, their loss optimizes state
embeddings while ours optimizes agents’ policies. Third, their work focuses on cooperative tasks
whereas ours applies to both cooperative and competitive domains.
Multi-agent reinforcement learning algorithms.
Today, the predominant deep multi-agent frame-
work uses actor-critic methods with a centralized critic and decentralized execution (Lowe et al.,
2017; Foerster et al., 2018; Iqbal & Sha, 2019; Liu et al., 2020; Rashid et al., 2018). This framework
allows a critic to access the observations and actions of all agents to ease training. However, there are
several situations where centralized training may not be desirable or possible. Examples include low
bandwidth communication restrictions or human-robot tasks where observations cannot be easily
shared between agents (Ying & Dayong, 2005; Cao et al., 2012; Huang et al., 2015). Decentralized
training is therefore the most practical training paradigm, but it suffers from unstable training: the
environment is nonstationary from a single agent's perspective (Lowe et al., 2017). Our work uses a
decentralized training framework and tackles the nonstationarity challenge with an intrinsic reward
designed to improve an agent’s ability to model others. We also apply expectation alignment to
centralized training and observe that it still aids cooperative and some competitive tasks.
3 Background
We formulate our setting as a partially observable Markov game $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, r_{ex}, N)$ (Littman, 1994).
A Markov game for $N$ agents is defined by a state space $\mathcal{S}$ describing the possible configurations
of the environment. The observation space for agents is $\mathcal{O} = (\mathcal{O}_1, \dots, \mathcal{O}_N)$ and the action space is
$\mathcal{A} = (\mathcal{A}_1, \dots, \mathcal{A}_N)$. Each agent $i$ observes $o_i \in \mathcal{O}_i$, a private partial view of the state, and performs
actions $a_i \in \mathcal{A}_i$. Each agent acts according to a stochastic policy $\pi_{\theta_i}: \mathcal{O}_i \times \mathcal{A}_i \rightarrow [0, 1]$,
where $\theta_i$ parameterizes the policy. The environment changes according to the state transition function,
which maps the current state and each agent's action to the next state, $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$.
The team of agents obtains a shared extrinsic reward as a function of the environment state and actions,
$r_{ex}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. The team's goal is to maximize the total expected return
$$R = \sum_{t=0}^{T} \gamma^t r_{ex}^t$$
where $0 \leq \gamma \leq 1$ is the discount factor, $t$ is the time step, and $T$ is the time horizon. The environment may
also contain adversarial agents who have their own reward structure.
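To ground this notation, the sketch below shows one way the tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, r_{ex}, N)$ could map onto a minimal multi-agent environment interface; the class and method names are illustrative assumptions, not tied to the benchmarks used later in the paper.

```python
from typing import List, Tuple

class MarkovGame:
    """A minimal interface for a partially observable Markov game with N agents
    and a shared extrinsic reward (names here are illustrative, not a real API)."""

    def __init__(self, n_agents: int):
        self.n_agents = n_agents  # N

    def reset(self) -> List[List[float]]:
        """Return the initial private observations (o_1, ..., o_N)."""
        raise NotImplementedError

    def step(self, actions: List[int]) -> Tuple[List[List[float]], float, bool]:
        """Apply the joint action (a_1, ..., a_N) through the transition function T
        and return the next observations, the shared extrinsic reward r_ex, and a
        done flag indicating the end of the horizon T."""
        raise NotImplementedError
```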
4 Expectation Alignment
To understand expectation alignment intuitively, let's revisit the cooperative navigation task, where
$N$ agents are rewarded for simultaneously occupying as many goal locations as possible. In Figure 1,
agent $i$ has a dynamics model trained on its past experiences. It predicts how future states will
evolve from the point of view of agent $j$, who is within $i$'s view. In this example, $j$ will expect $i$
to move towards the goal since $i$ is closer to it. ELIGN encourages $i$ to pursue the action that $j$
expects (Figure 1(a)). In turn, $j$ can now assume that the observed goal location will eventually be
occupied by $i$ and should therefore explore to find another goal. By aligning shared expectations,
agent behaviors become more predictable. Conversely, when neighbors behave opposite to an agent's
predictions, the agent can make inferences about the environment outside of its own receptive field
(Krause et al., 2002). For example, in Figure 1(b), if agent $j$ observes $i$ running away from a goal,
this surprising behavior might indicate the existence of an adversary outside $j$'s receptive field.
Our training algorithm consists of three interwoven phases: learning a dynamics model, calculating
the ELIGN reward, and optimizing the agent's policy (Algorithm 1).
4.1 Training the dynamics model
Similar to prior work (Wang et al., 2018; Kidambi et al., 2020), each agent $i$ learns a dynamics model
$f_{\theta_i}$ to predict the next observation $\hat{o}'_i$ given its current observation and action $(o_i, a_i)$, i.e.,
$$\hat{o}'_i = f_{\theta_i}(o_i, a_i).$$
We use a three-layer multi-layer perceptron with ReLU non-linearities as the dynamics model. We
minimize the mean squared error between its prediction and the ground-truth next observation $o'_i$.
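As a concrete illustration, the sketch below implements such a dynamics model and its MSE update in PyTorch; the hidden width, optimizer handling, and function names are our own assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Three-layer MLP f_theta_i: (o_i, a_i) -> predicted next observation o'_i."""
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, obs_dim),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Concatenate observation and action, then predict the next observation.
        return self.net(torch.cat([obs, act], dim=-1))

def dynamics_update(model, optimizer, obs, act, next_obs):
    """One MSE training step on a batch of (o_i, a_i, o'_i) transitions."""
    pred = model(obs, act)
    loss = nn.functional.mse_loss(pred, next_obs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```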
4.2 Calculating intrinsic reward
The intrinsic reward captures how well agent $i$ aligns to its neighbors' (e.g., agent $j$'s) expectations
of its next state. Calculating this reward requires $j$ to accurately predict $i$'s behavior, simulating a
Theory of Mind (Tomasello et al., 2005). As suggested by the self-organization principle, $i$ must
learn to align to $j$'s predictions. Ideally, the ELIGN intrinsic reward is calculated as:
$$r_{in}(o_i, a_i) = -\frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \| o'_i - f_{\theta_j}(o_i, a_i) \|$$
where $\mathcal{N}(i)$ is the set of neighbors within $i$'s receptive field, including $i$ itself. The ELIGN reward
is high when the average $L_2$ loss is small, i.e., when $i$'s actual next observation is close to agent $j$'s
predicted observation of $i$ for all neighbors $j$. In that case, $i$ has chosen an action that aligns
with $j$'s expectations of how $i$ should act.
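Read literally, this ideal form could be computed as in the sketch below, where `neighbor_models` holds the dynamics models $f_{\theta_j}$ of all neighbors $j \in \mathcal{N}(i)$ (e.g., instances of the `DynamicsModel` sketch above); the signature and batching are illustrative assumptions.

```python
import torch

def elign_reward_ideal(next_obs_i, obs_i, act_i, neighbor_models):
    """Ideal ELIGN reward: negative mean prediction error of the neighbors'
    dynamics models f_theta_j on agent i's transition. This form assumes access
    to the neighbors' models, which decentralized training does not provide."""
    with torch.no_grad():
        errors = [torch.norm(next_obs_i - f_j(obs_i, act_i)) for f_j in neighbor_models]
    return -torch.stack(errors).mean().item()
```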
In a decentralized training setup, however, $i$ doesn't have access to $j$'s dynamics model $f_{\theta_j}$, so $i$
approximates $j$'s dynamics model with a proxy: its own dynamics model $f_{\theta_i}$ and the knowledge of
agent $j$'s observation radius. Such an approximation is ecologically valid since we often approximate
others' behaviors using a second-order cognitive Theory of Mind (Morin, 2006). Additionally, $i$
doesn't have access to $j$'s entire observation, so we restrict the future prediction from $j$'s point
of view to the portion of $j$'s observation that $i$ can see: $o_{i \cap j} = o_i \cap o_j$. Agent $i$'s decentralized
intrinsic reward then becomes:
$$r_{in}(o_i, a_i) = -\frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \| o'_{i \cap j} - f_{\theta_i}(o_{i \cap j}, a_i) \|$$
We found that approximating $f_{\theta_j}$ with $f_{\theta_i}$ works well empirically: dynamics model losses
for all agents quickly decrease within 5-10 training epochs. We also validate its applicability in small-
scale heterogeneous multi-agent tasks where agents have variable capabilities, although we find the
methods perform similarly when more heterogeneous agents are added.
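To make the approximation concrete, here is a small PyTorch sketch of the decentralized reward. The `neighbor_masks` argument stands in for the restriction $o_{i \cap j}$, i.e., which entries of agent $i$'s observation each neighbor can also see; how these masks are constructed is environment-specific, and zeroing out unshared entries is only one simple choice.

```python
import torch

def elign_reward(dyn_model_i, obs_i, act_i, next_obs_i, neighbor_masks):
    """Decentralized ELIGN reward for agent i (a sketch).

    neighbor_masks: list of boolean tensors, one per neighbor j (including i itself),
    selecting the entries of agent i's observation that agent j can also see
    (a stand-in for o_{i∩j}; building this mask is environment-specific)."""
    if len(neighbor_masks) == 0:
        return 0.0
    errors = []
    with torch.no_grad():
        for mask in neighbor_masks:
            # Restrict observations to the shared portion o_{i∩j}. Zeroing out
            # unshared entries is purely illustrative; the paper restricts the
            # prediction to the portion of j's observation that i can see.
            masked_obs = obs_i * mask
            masked_next = next_obs_i * mask
            pred_next = dyn_model_i(masked_obs, act_i)
            errors.append(torch.norm(masked_next - pred_next * mask))
    # Negative mean prediction error: the reward is high when i's actual next
    # observation matches what its neighbors would predict.
    return -torch.stack(errors).mean().item()
```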
4.3 Policy learning
Algorithm 1 ELIGN: Expectation Alignment
1: Initialize replay buffers $D$ and $D'$
2: Initialize $N$ agents with random $\theta_i$ : $i \in [1, N]$
3: while not converged do
4:   for $b = 1 \dots B$ do
5:     Populate buffer $D$ with an episode using policies $(\pi_{\theta_1}, \dots, \pi_{\theta_N})$
6:   end for
7:   // TRAIN DYNAMICS MODEL
8:   for agent $i = 1 \dots N$ do
9:     Sample transitions $\{(o_i, a_i, r_{ex}, o'_i)\} \sim D_i$
10:    Predict $\hat{o}'_i = f_{\theta_i}(o_i, a_i)$
11:    Update dynamics $\theta_i$ using $o'_i$
12:  end for
13:  // CALCULATE ELIGN REWARD
14:  for agent $i = 1 \dots N$ do
15:    Sample $B$ transitions $\{(o_i, a_i, r_{ex}, o'_i)\} \sim D_i$
16:    Compute intrinsic rewards $r_{in}(o_i, a_i)$
17:    Add $\{(o_i, a_i, r_{ex} + \beta r_{in}, o'_i)\}$ to $D'_i$
18:  end for
19:  // POLICY LEARNING
20:  Update all $\theta_i$'s using transitions from $D'$
21: end while
Once the ELIGN rewards are calculated, the total reward at each step for each agent $i$ is
$r_i = r_{ex} + \beta \, r_{in}(o_i, a_i)$, where $r_{ex}$ is the extrinsic reward provided by the environment and
$\beta$ is a hyperparameter weighing the intrinsic reward in the agent's overall reward calculation. In
practice, we set $\beta$ to $\frac{1}{|\mathcal{O}_i|}$, where $|\mathcal{O}_i|$ is the observation dimension; we find this scale generalizes
well across tasks. Since our contribution is agnostic to any particular multi-agent training algorithm,
the team of agents can now be trained using any multi-agent training algorithm to maximize the
return $R = \sum_{t=0}^{T} \gamma^t r_t$.
Both centralized and decentralized training algorithms can make use of these rewards. We primarily
use the multi-agent decentralized variant of the soft actor-critic algorithm in our experiments
(Haarnoja et al., 2018; Iqbal & Sha, 2019). Compared to centralized joint-action training, whose
action space grows exponentially in the number of agents $N$, our decentralized method has linear
space complexity. Further, decentralized training can parallelize training time to be less than linear
with respect to $N$. Although we present results with one centralized training framework, studying
the impact of expectation alignment with all centralized-critic frameworks is out of scope for this
paper.
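For illustration, the reward-shaping step (lines 14-18 of Algorithm 1) might look like the sketch below, which reuses the `elign_reward` sketch above; the transition-dictionary keys are our own naming, not the paper's.

```python
def relabel_with_elign(transitions, dyn_model_i, obs_dim):
    """Relabel a batch of agent-i transitions with the shaped reward
    r = r_ex + beta * r_in, where beta = 1 / |O_i| (lines 14-18 of Algorithm 1).

    Each transition is assumed to be a dict with keys 'obs', 'act', 'r_ex',
    'next_obs', and 'neighbor_masks' (illustrative naming)."""
    beta = 1.0 / obs_dim
    shaped = []
    for tr in transitions:
        r_in = elign_reward(dyn_model_i, tr['obs'], tr['act'],
                            tr['next_obs'], tr['neighbor_masks'])
        shaped.append({**tr, 'reward': tr['r_ex'] + beta * r_in})
    return shaped
```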
4.4 Extending expectation alignment to competitive tasks
We extend the ELIGN formulation to competitive tasks where a team of agents competes against
adversaries. In this case, agents are encouraged to misalign with their adversaries' expectations, i.e.,
each agent is incentivized to be unpredictable to the adversaries within its receptive field $\mathcal{N}_{adv}(i)$:
$$r_{in}(o_i, a_i) = \frac{1}{|\mathcal{N}_{adv}(i)|} \sum_{k \in \mathcal{N}_{adv}(i)} \| o'_{i \cap k} - f_{\theta_i}(o_{i \cap k}, a_i) \|$$
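In code, the competitive variant simply flips the sign of the alignment term and averages over the adversaries within range, again reusing the `elign_reward` sketch above; a full implementation would combine teammate alignment and adversary misalignment terms as the task requires.

```python
def elign_adversarial_reward(dyn_model_i, obs_i, act_i, next_obs_i, adversary_masks):
    """Competitive ELIGN sketch: reward agent i for being hard to predict by the
    adversaries in its receptive field, i.e., the sign of the alignment term flips."""
    return -elign_reward(dyn_model_i, obs_i, act_i, next_obs_i, adversary_masks)
```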
5 Experiments
Our experiments explore the utility of using expectation alignment as an intrinsic reward compared
to sparse and curiosity-based intrinsic rewards. We primarily focus on decentralized training under
partial observability. However, we also demonstrate that ELIGN can easily augment centralized
methods and assist in fully observable tasks. We vary the number of agents in the multi-agent particle
tasks to test scalability. We end by investigating how and why ELIGN improves coordination by
designing three evaluation conditions. First, does expectation alignment improve coordination by
helping agents divide tasks amongst themselves?