Equivariant Networks for Zero-Shot Coordination
Darius Muglich
University of Oxford
dariusm1997@yahoo.com
Christian Schroeder de Witt
FLAIR, University of Oxford
cs@robots.ox.ac.uk
Elise van der Pol
Microsoft Research AI4Science
evanderpol@microsoft.com
Shimon Whiteson
University of Oxford
shimon.whiteson@cs.ox.ac.uk
Jakob Foerster
FLAIR, University of Oxford
jakob.foerster@eng.ox.ac.uk
Abstract
Successful coordination in Dec-POMDPs requires agents to adopt robust strategies
and interpretable styles of play for their partner. A common failure mode is
symmetry breaking, when agents arbitrarily converge on one out of many equivalent
but mutually incompatible policies. Such failures commonly arise under partial
observability, e.g. waving your right hand vs. your left hand to convey a covert message.
In this paper, we present a novel equivariant network architecture for use in Dec-
POMDPs that leverages environmental symmetry to improve zero-shot coordination,
doing so more effectively than prior methods. Our method also
acts as a “coordination-improvement operator” for generic, pre-trained policies,
and thus may be applied at test-time in conjunction with any self-play algorithm.
We provide theoretical guarantees for our method and test it on the AI benchmark task of
Hanabi, where we show that it outperforms other symmetry-aware baselines in zero-shot
coordination and improves the coordination ability of a variety of pre-trained policies.
In particular, we show our method can
be used to improve on the state of the art for zero-shot coordination on the Hanabi
benchmark.
1 Introduction
A popular method for learning strategies in partially-observable cooperative Markov games is via
self-play (SP), where a joint policy controls all players during training and at test time [21, 37, 45, 53].
While SP can yield highly effective strategies, these strategies often fail in zero-shot coordination
(ZSC), where independently trained strategies are paired together for the first time at test time. A common
cause for this failure is mutually incompatible symmetry breaking (e.g. signalling 0 to refer to a “cat”
vs. a “dog”). Specifically, this symmetry breaking results in policies that act differently in situations
which are equivalent with respect to the symmetries of the environment [22, 38].
To address this, we are interested in devising equivariant strategies, which we illustrate in Figure
1. Equivariant policies are such that symmetric changes to their observation cause a corresponding
change to their output. In doing so, we fundamentally prevent the agent from breaking symmetries
over the course of training.
In earlier work, the Other-Play (OP) learning algorithm [22] addressed symmetry in this setting by
training agents to maximize expected return when randomly matched with symmetry-equivalent
policies of their training time partners. This learning rule is implemented by independently applying
a random symmetry permutation to the action-observation history of each agent during each episode.
OP, however, has some pitfalls, namely: 1) it does not guarantee equivariance; 2) it poses the burden
of learning environmental symmetry onto the agent; and 3) it is only applicable during training time
and cannot be used as a policy-improvement operator.

Figure 1: Illustration of different kinds of symmetry-robust policies, where the symmetries in this
example are colors. Policies choose actions based on the number of circles and the color. Invariant
policies act irrespective of the change in color (i.e. only the number of circles matters), while
equivariant policies act in correspondence with the change in color (i.e. changing the color causes
a corresponding change to the action).
To address all of this, our approach is to encode symmetry not in the data, but in the network itself.
There are many approaches to building equivariant networks in the literature. Most work focuses
on analytical derivations of equivariant layers [11, 49, 51, 52], which can prove time-consuming
to engineer for each problem instance and inhibit rapid prototyping. Algorithmic approaches to
constructing equivariant networks are restricted to layer-wise constraints [10, 17, 18, 35], which
require computational overhead to enforce. Furthermore, for complex problems such as ZSC in
challenging benchmarks such as Hanabi [5], these approaches are insufficient because 1) they are not
scalable; and 2) they usually do not allow for recurrent layers, which are the standard method for
encoding action-observation histories in partially observable settings.
In this paper, we propose the equivariant coordinator (EQC), which uses our novel, scalable approach
towards equivariant modelling in Dec-POMDPs. Specifically, our main contributions are:
1. Our method, EQC, which mathematically guarantees symmetry-equivariance of multi-agent
policies and can be applied solely at test time as a coordination-improvement operator; and
2. Showing that EQC outperforms prior symmetry-robust baselines on the AI benchmark
Hanabi.
2 Background
This section formalises the problem setting (2.1), provides a brief primer on some group-theoretic
notions (2.2), introduces equivariance (2.3), provides background on prior multi-agent work on
symmetry (2.4), and offers a small, concrete example to help tie these notions together and facilitate
intuition (2.5).
2.1 Dec-POMDPs
We formalise the cooperative multi-agent setting as a decentralized partially-observable Markov
decision process (Dec-POMDP) [32], which is a 9-tuple
$(\mathcal{S}, \mathcal{N}, \{\mathcal{A}^i\}_{i=1}^n, \{\mathcal{O}^i\}_{i=1}^n, \mathcal{T}, \mathcal{R}, \{\mathcal{U}^i\}_{i=1}^n, T, \gamma)$,
for finite sets $\mathcal{S}, \mathcal{N}, \{\mathcal{A}^i\}_{i=1}^n, \{\mathcal{O}^i\}_{i=1}^n$, denoting the sets of states, agents, actions and observations, respectively, where a superscript $i$ denotes the set pertaining to agent $i \in \mathcal{N} = \{1, \dots, n\}$ (i.e., $\mathcal{A}^i$ and $\mathcal{O}^i$ are the action and observation sets for agent $i$, and $a^i \in \mathcal{A}^i$ and $o^i \in \mathcal{O}^i$ are a specific action and observation of agent $i$). We also write $\mathcal{A} = \times_i \mathcal{A}^i$ and $\mathcal{O} = \times_i \mathcal{O}^i$ for the sets of joint actions and observations, respectively. $s_t \in \mathcal{S}$ is the state at time $t$, with $s_t = \{s_t^k\}_k$, where $s_t^k$ is state feature $k$ of $s_t$. $a_t \in \mathcal{A}$ is the joint action of all agents taken at time $t$, which changes the state according to the transition distribution $s_{t+1} \sim \mathcal{T}(s_{t+1} \mid s_t, a_t)$. The subsequent joint observation of the agents is $o_{t+1} \in \mathcal{O}$, distributed according to $o_{t+1} \sim \mathcal{U}(o_{t+1} \mid s_{t+1}, a_t)$, where $\mathcal{U} = \times_i \mathcal{U}^i$; observation features of $o^i_{t+1} \in \mathcal{O}^i$ are notated analogously to state features, that is, $o^i_{t+1} = \{o^{i,k}_{t+1}\}_k$. The reward $r_{t+1} \in \mathbb{R}$ is distributed according to $r_{t+1} \sim \mathcal{R}(r_{t+1} \mid s_{t+1}, a_t)$. $T$ is the horizon and $\gamma \in [0, 1]$ is the discount factor.

Notating $\tau^i_t = (a^i_0, o^i_1, \dots, a^i_{t-1}, o^i_t)$ for the action-observation history of agent $i$, agent $i$ acts according to a policy $a^i_t \sim \pi^i(a^i_t \mid \tau^i_t)$. The agents seek to maximize the return, i.e., the expected discounted sum of rewards:
$$J_T := \mathbb{E}_{p(\tau_T)}\Big[\sum_{t \le T} \gamma^{t-1} r_t\Big], \qquad (1)$$
where $\tau_t = (s_0, a_0, o_1, r_1, \dots, a_{t-1}, o_t, r_t, s_t)$ is the trajectory until time $t$.
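For illustration, the following minimal Python sketch (ours, not part of the original formulation) computes the discounted return in Equation 1 for a single sampled reward sequence $r_1, \dots, r_T$.

```python
def discounted_return(rewards, gamma):
    """Discounted return J_T = sum_{t <= T} gamma^(t-1) * r_t  (Equation 1)."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

# Example: a 4-step episode with discount factor gamma = 0.9.
rewards = [1.0, 0.0, 2.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 1.0 + 0.0 + 2*0.81 + 1*0.729 = 3.349
```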
2.2 Groups
A group $G$ is a set with an associative binary operation on $G$, such that, relative to the binary operation, the inverse of each element in the set is also in the set, and the set contains an identity element. For our purposes, each element $g \in G$ can be thought of as an automorphism over some space $X$ (i.e. an isomorphism from $X$ to $X$), and the binary operation is function composition. The cardinality of the group is referred to as its order. If a subset $K$ of $G$ is also a group under the binary operation of $G$ (a.k.a. a subgroup), we denote it $K < G$.
A group action (to disambiguate between actions of mathematical groups and actions of reinforcement learning agents, we always refer to the former as “group actions”) is a function $G \times X \to X$ satisfying $ex = x$ (where $e$ is the identity element) and $(g \cdot h)x = g \cdot (hx)$, for all $g, h \in G$. In particular, for a fixed $g$, we get a function $\alpha_g : X \to X$ given by $x \mapsto gx$. $g$ can be thought of as “relabeling” the elements of $X$, which aligns with our definition of Dec-POMDP symmetry introduced in Section 2.4.

Groups have specific mathematical properties, such as associativity, containing the inverse of each element, and closure (closure is implicit in the binary operation being over $G$ itself), which are necessary for proving Propositions 1 and 2 and ensuring equivariance.
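As a small illustration of these axioms (a sketch of ours, not part of the paper), the snippet below represents the cyclic group $C_3$ of rotations of three elements as permutation matrices, realises function composition as matrix multiplication, and checks closure, the presence of the identity, and the presence of inverses.

```python
import itertools
import numpy as np

def perm_matrix(perm):
    """Permutation matrix P such that (P @ x)[i] = x[perm[i]] (a relabeling of X)."""
    n = len(perm)
    P = np.zeros((n, n), dtype=int)
    P[np.arange(n), perm] = 1
    return P

# Cyclic group C3 acting on a 3-element space X: identity, rotate by one, rotate by two.
group = [perm_matrix(p) for p in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]]
as_key = lambda P: tuple(map(tuple, P))
elements = {as_key(P) for P in group}
identity = np.eye(3, dtype=int)

# Closure: composing any two elements stays inside the group.
assert all(as_key(A @ B) in elements for A, B in itertools.product(group, group))
# The identity element is in the group, and every element has an inverse in the group.
assert any(np.array_equal(P, identity) for P in group)
assert all(any(np.array_equal(A @ B, identity) for B in group) for A in group)
print("C3 satisfies closure, identity, and inverses")
```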
2.3 Equivariance
Let $G$ be a group. Consider the set of all neural networks $\Psi$ of a given architecture whose first and ultimate layers are linear; that is, $\Psi := \{\psi \mid \psi = h(\hat{\psi}(f))\}$, where $f, h$ are linear functions and $\hat{\psi}$ is a non-linear function. We say a network $\psi \in \Psi$ is equivariant (with respect to a group $G$, or $G$-equivariant) if
$$K_g \psi(x) = \psi(L_g x), \quad \text{for all } g \in G, \, x \in \mathbb{R}^d, \qquad (2)$$
where $L_g$ is the group action of $g \in G$ on inputs, $K_g$ is similarly the group action on outputs, and $d$ is the dimension of our data. If $\psi$ satisfies Equation 2, then we say $\psi \in \Psi_{\text{equiv}}$. In the context of reinforcement learning, one may consider $\psi$ as the neural parameterization of a policy $\pi$ (for this reason, we use $\psi$ and $\pi$ more or less interchangeably), so a policy is $G$-equivariant if
$$\pi(K_g(a) \mid \tau) = \pi(a \mid L_g(\tau)), \quad \text{for all } g \in G \text{ and for all } \tau. \qquad (3)$$
If for any choice of $g \in G$ we have that $K_g = I$, the identity function/matrix, then we say $\pi$ (or $\psi$) is invariant to $g$.
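To make Equation 2 concrete, here is a small numerical sketch (ours; it assumes the group actions $L_g$ and $K_g$ are the same permutation matrices). The linear layers use parameter sharing of the form $aI + bJ$, with $J$ the all-ones matrix, which commutes with every permutation, so the resulting $\psi$ is $S_4$-equivariant by construction; the final assertion checks Equation 2 for every group element.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 4

def shared_linear(a, b):
    """Permutation-commuting linear map a*I + b*J (J is the all-ones matrix)."""
    return a * np.eye(d) + b * np.ones((d, d))

f = shared_linear(*rng.normal(size=2))  # first (linear) layer
h = shared_linear(*rng.normal(size=2))  # ultimate (linear) layer
psi = lambda x: h @ np.tanh(f @ x)      # psi = h(psi_hat(f(x))) with psi_hat = tanh

def perm_matrix(perm):
    P = np.zeros((d, d))
    P[np.arange(d), perm] = 1.0
    return P

# G = S_4, acting identically on inputs and outputs: L_g = K_g = P_g.
group = [perm_matrix(p) for p in itertools.permutations(range(d))]

x = rng.normal(size=d)
# Equation 2: K_g psi(x) == psi(L_g x) for all g in G.
assert all(np.allclose(P @ psi(x), psi(P @ x)) for P in group)
print("psi is S_4-equivariant")
```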
2.4 Other-Play and Dec-POMDP Symmetry
The OP learning rule [22] of a joint policy, $\pi = (\pi^1, \pi^2)$, is formally the maximization of the following objective (which we state for the two-player case for ease of notation):
$$\pi^* = \arg\max_{\pi} \, \mathbb{E}_{\phi \sim \Phi} \, J(\pi^1, \phi(\pi^2)), \qquad (4)$$
where $\Phi$ is the class of equivalence mappings (symmetries) for the given Dec-POMDP, so that each $\phi \in \Phi$ is an automorphism of $\mathcal{S}, \mathcal{A}, \mathcal{O}$ whose application leaves the Dec-POMDP unchanged up to relabeling ($\phi$ is shorthand for $\phi = \{\phi_S, \phi_A, \phi_O\}$). The expectation in Equation 4 is taken with respect to the uniform distribution on $\Phi$.

Each $\phi \in \Phi$ acts on the action-observation history as
$$\phi(\tau^i_t) = (\phi(a^i_0), \phi(o^i_1), \dots, \phi(a^i_{t-1}), \phi(o^i_t)), \qquad (5)$$
and acts on policies as
$$\hat{\pi} = \phi(\pi) \iff \hat{\pi}(\phi(a) \mid \phi(\tau)) = \pi(a \mid \tau). \qquad (6)$$
Policies $\pi, \hat{\pi}$ in Equation 6 are said to be symmetry-equivalent to one another with respect to $\phi$. To correspond with Equation 3, we have $L_\phi = \{\phi_A, \phi_O\}$ and $K_\phi = \phi_A$.
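As a minimal sketch of how the OP rule is typically realised in practice (our simplification: we assume the symmetries in $\Phi$ are permutations of a shared discrete alphabet of action/observation symbols), the snippet below samples a random $\phi$ and relabels an action-observation history as in Equation 5.

```python
import numpy as np

rng = np.random.default_rng(0)
num_symbols = 5  # shared alphabet of discrete action/observation symbols (assumed)

def sample_symmetry():
    """Draw phi uniformly from Phi; here, a random permutation of the symbol alphabet."""
    return rng.permutation(num_symbols)

def apply_symmetry(phi, history):
    """Relabel an action-observation history (a_0, o_1, ..., a_{t-1}, o_t), cf. Equation 5."""
    return [int(phi[x]) for x in history]

phi = sample_symmetry()
tau = [0, 3, 1, 4]  # (a_0, o_1, a_1, o_2) as symbol indices
print(phi, tau, apply_symmetry(phi, tau))
# During OP training, one agent's experience is relabeled with a freshly sampled phi each
# episode; one common implementation choice also maps that agent's chosen action back
# through the inverse permutation before executing it in the (unrelabeled) environment.
```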
2.5 Idealized Example
Consider a game where you coordinate with an unknown stranger. Each of you must pick a lever from a set of 10 levers. Nine of the levers have 1 written on them, and one has 0.9 written on it. If you and the stranger pick the same lever, then you are paid out the amount written on the lever; otherwise you are paid out nothing.
We have $G := S_9$, the symmetric group over $9$ elements (all possible ways to permute $9$ elements), which here coincides with the symmetry class $\Phi$, since the nine levers with $1$ written on them are symmetric. If you (using policy $\pi^1$) pick the $1$-point lever $l_1$ and the stranger (using policy $\pi^2$) picks the $1$-point lever $l_2$, then $\pi^1$ and $\pi^2$ would be symmetry-equivalent with respect to any $\phi \in G$ that permutes the lever choices (e.g. $\phi(l_1) = l_2$, or equivalently $L_\phi(l_1) = K_\phi(l_1) = l_2$, as the levers are both observations and actions). If, as is likely, $l_1 \neq l_2$, you and the stranger leave empty-handed, illustrating symmetry breaking.

If during training you and the stranger constrained yourselves to the space of equivariant policies whilst trying to maximize payoff, you would both converge to the policy that picks the $0.9$-point lever (which is not only equivariant, but invariant to all $\phi \in G$), and so the choice of using equivariant policies allows symmetry to be maintained and a payout to be guaranteed.
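A quick simulation (an illustrative sketch of ours) makes the payoff gap concrete: two independently trained agents that each commit to an arbitrary $1$-point lever only succeed when their arbitrary choices happen to coincide, whereas the invariant policy that picks the $0.9$-point lever is guaranteed its payout against any like-minded partner.

```python
import numpy as np

rng = np.random.default_rng(0)
payoffs = np.array([1.0] * 9 + [0.9])  # levers 0-8 pay 1.0, lever 9 pays 0.9

def crossplay_value(pick_a, pick_b):
    """Payoff when two independently trained agents are paired at test time."""
    return payoffs[pick_a] if pick_a == pick_b else 0.0

# Symmetry breaking: each self-play agent converges to some arbitrary 1-point lever.
trials = 10_000
sp = [crossplay_value(rng.integers(9), rng.integers(9)) for _ in range(trials)]
print("symmetry-breaking cross-play value ~", np.mean(sp))  # ~ 1/9 (about 0.11)

# Equivariant/invariant policy: both agents pick the unique 0.9-point lever.
print("invariant policy cross-play value  =", crossplay_value(9, 9))  # 0.9
```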
3 Method
We now present our methodology and architectural choices, as well as several theoretical results.
3.1 Architecture
We first consider the symmetrizer introduced in [35], which is a functional that transforms linear maps into equivariant linear maps. Inspired by this, we present a generalisation of the symmetrizer that maps from $\Psi$ to the equivariant subspace $\Psi_{\text{equiv}}$, namely $S : \Psi \to \Psi_{\text{equiv}}$, which we define as
$$S(\psi) = \frac{1}{|G|} \sum_{g \in G} K_g^{-1}\, \psi(L_g), \qquad (7)$$
so that the first layer of $\psi$, denoted $f$, being linear, is composed with each $L_g$, and the ultimate layer, denoted $h$, is similarly composed with $K_g^{-1}$ (i.e. $K_g^{-1}\psi(L_g) = K_g^{-1} h(\hat{\psi}(f \circ L_g))$, where $\psi = h(\hat{\psi}(f))$).
Below we establish properties of $S$ that were analogously shown for the linear symmetrizer case [35]:

Proposition 1 (Symmetric Property). $S(\psi) \in \Psi_{\text{equiv}}$ for all $\psi \in \Psi$; that is, $S$ maps neural networks to equivariant neural networks.

Proposition 2 (Fixing Property). $\Psi_{\text{equiv}} = \mathrm{Ran}(S)$; that is, the range of $S$ covers the entire equivariant subspace.

The proofs of these propositions can be found in the Appendix. The proof of Proposition 1 relies on properties of the group structure (such as closure of the group); it is thus critical for the symmetrizer to sum over permutations that together form a group structure, and not any random collection of permutations.