Equivariant Networks for Zero-Shot Coordination
Darius Muglich
University of Oxford
dariusm1997@yahoo.com
Christian Schroeder de Witt
FLAIR, University of Oxford
cs@robots.ox.ac.uk
Elise van der Pol
Microsoft Research AI4Science
evanderpol@microsoft.com
Shimon Whiteson
University of Oxford
shimon.whiteson@cs.ox.ac.uk
Jakob Foerster
FLAIR, University of Oxford
jakob.foerster@eng.ox.ac.uk
Abstract
Successful coordination in Dec-POMDPs requires agents to adopt robust strategies
and interpretable styles of play for their partner. A common failure mode is
symmetry breaking, when agents arbitrarily converge on one out of many equivalent
but mutually incompatible policies. Such failures commonly arise under partial
observability, e.g. waving your right hand vs. your left hand to convey a covert message.
In this paper, we present a novel equivariant network architecture for use in Dec-
POMDPs that leverages environmental symmetry to improve zero-shot coordination,
doing so more effectively than prior methods. Our method also
acts as a “coordination-improvement operator” for generic, pre-trained policies,
and thus may be applied at test-time in conjunction with any self-play algorithm.
We provide theoretical guarantees for our method and test it on the AI benchmark task of
Hanabi, where we show that it outperforms other symmetry-aware baselines in zero-shot
coordination and improves the coordination ability of a variety of pre-trained policies.
In particular, we show our method can
be used to improve on the state of the art for zero-shot coordination on the Hanabi
benchmark.
1 Introduction
A popular method for learning strategies in partially-observable cooperative Markov games is via
self-play (SP), where a joint policy controls all players during training and at test time [21, 37, 45, 53].
While SP can yield highly effective strategies, these strategies often fail in zero-shot coordination
(ZSC), where independently trained strategies are paired together for the first time at test time. A common
cause for this failure is mutually incompatible symmetry breaking (e.g. signalling 0 to refer to a “cat”
vs. a “dog”). Specifically, this symmetry breaking results in policies that act differently in situations
which are equivalent with respect to the symmetries of the environment [22, 38].
To address this, we are interested in devising equivariant strategies, which we illustrate in Figure
1. Equivariant policies are such that symmetric changes to their observation cause a corresponding
change to their output. In doing so, we fundamentally prevent the agent from breaking symmetries
over the course of training.
In earlier work, the Other-Play (OP) learning algorithm [22] addressed symmetry in this setting by
training agents to maximize expected return when randomly matched with symmetry-equivalent
policies of their training time partners. This learning rule is implemented by independently applying
a random symmetry permutation to the action-observation history of each agent during each episode.
OP, however, has some pitfalls, namely: 1) it does not guarantee equivariance; 2) it poses the burden
of learning environmental symmetry onto the agent; and 3) it is only applicable during training time
and cannot be used as a policy-improvement operator.

Figure 1: Illustration of different kinds of symmetry-robust policies, where the symmetries in this
example are colors. Policies choose actions based on the number of circles and the color. Invariant
policies act irrespective of the change in color (i.e. only the number of circles matters), while
equivariant policies act in correspondence with the change in color (i.e. changing the color causes
a corresponding change to the action).
To address all of this, our approach is to encode symmetry not in the data, but in the network itself.
There are many approaches to building equivariant networks in the literature. Most work focuses
on analytical derivations of equivariant layers [11, 49, 51, 52], which can prove time-consuming
to engineer for each problem instance and inhibit rapid prototyping. Algorithmic approaches to
constructing equivariant networks are restricted to layer-wise constraints [10, 17, 18, 35], which
require computational overhead to enforce. Furthermore, for complex problems such as ZSC in
challenging benchmarks such as Hanabi [5], these approaches are insufficient because 1) they are not
scalable; and 2) they usually do not allow for recurrent layers, which are the standard method for
encoding action-observation histories in partially observable settings.
In this paper, we propose the equivariant coordinator (EQC), which uses our novel, scalable approach
towards equivariant modelling in Dec-POMDPs. Specifically, our main contributions are:
1. Our method, EQC, which mathematically guarantees symmetry-equivariance of multi-agent
policies and can be applied solely at test time as a coordination-improvement operator; and
2. Showing that EQC outperforms prior symmetry-robust baselines on the AI benchmark
Hanabi.
2 Background
This section formalises the problem setting (2.1), provides a brief primer on some group-theoretic
notions (2.2), introduces equivariance (2.3), provides background on prior multi-agent work on
symmetry (2.4), and offers a small, concrete example to help tie these notions together and facilitate
intuition (2.5).
2.1 Dec-POMDPs
We formalise the cooperative multi-agent setting as a decentralized partially-observable Markov
decision process (Dec-POMDP) [32], which is a 9-tuple
$(\mathcal{S}, \mathcal{N}, \{\mathcal{A}^i\}_{i=1}^n, \{\mathcal{O}^i\}_{i=1}^n, \mathcal{T}, \mathcal{R}, \{\mathcal{U}^i\}_{i=1}^n, T, \gamma)$,
for finite sets $\mathcal{S}, \mathcal{N}, \{\mathcal{A}^i\}_{i=1}^n, \{\mathcal{O}^i\}_{i=1}^n$, denoting the sets of states, agents, actions and observations, respectively, where a superscript $i$ denotes the set pertaining to agent $i \in \mathcal{N} = \{1, \dots, n\}$ (i.e., $\mathcal{A}^i$ and $\mathcal{O}^i$ are the action and observation sets for agent $i$, and $a^i \in \mathcal{A}^i$ and $o^i \in \mathcal{O}^i$ are a specific action and observation of agent $i$). We also write $\mathcal{A} = \times_i \mathcal{A}^i$ and $\mathcal{O} = \times_i \mathcal{O}^i$ for the sets of joint actions and observations, respectively. $s_t \in \mathcal{S}$ is the state at time $t$, with $s_t = \{s_t^k\}_k$, where $s_t^k$ is state feature $k$ of $s_t$. $a_t \in \mathcal{A}$ is the joint action of all agents taken at time $t$, which changes the state according to the transition distribution $s_{t+1} \sim \mathcal{T}(s_{t+1} \mid s_t, a_t)$. The subsequent joint observation of the agents is $o_{t+1} \in \mathcal{O}$, distributed according to $o_{t+1} \sim \mathcal{U}(o_{t+1} \mid s_{t+1}, a_t)$, where $\mathcal{U} = \times_i \mathcal{U}^i$; observation features of $o^i_{t+1} \in \mathcal{O}^i$ are notated analogously to state features, that is, $o^i_{t+1} = \{o^{i,k}_{t+1}\}_k$. The reward $r_{t+1} \in \mathbb{R}$ is distributed according to $r_{t+1} \sim \mathcal{R}(r_{t+1} \mid s_{t+1}, a_t)$. $T$ is the horizon and $\gamma \in [0, 1]$ is the discount factor.

Notating $\tau^i_t = (a^i_0, o^i_1, \dots, a^i_{t-1}, o^i_t)$ for the action-observation history of agent $i$, agent $i$ acts according to a policy $a^i_t \sim \pi^i(a^i_t \mid \tau^i_t)$. The agents seek to maximize the return, i.e., the expected discounted sum of rewards:
$$J_T := \mathbb{E}_{p(\tau_T)}\Big[\sum_{t \le T} \gamma^{t-1} r_t\Big], \qquad (1)$$
where $\tau_t = (s_0, a_0, o_1, r_1, \dots, a_{t-1}, o_t, r_t, s_t)$ is the trajectory until time $t$.
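For illustration, the following minimal Python sketch (ours, not part of the original formulation) computes the discounted return in Equation 1 for a single sampled reward sequence $r_1, \dots, r_T$.

```python
def discounted_return(rewards, gamma):
    """Discounted return J_T = sum_{t <= T} gamma^(t-1) * r_t  (Equation 1)."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

# Example: a 4-step episode with discount factor gamma = 0.9.
rewards = [1.0, 0.0, 2.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 1.0 + 0.0 + 2*0.81 + 1*0.729 = 3.349
```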
2.2 Groups
A group $G$ is a set with an associative binary operation on $G$, such that, relative to the binary operation, the inverse of each element in the set is also in the set, and the set contains an identity element. For our purposes, each element $g \in G$ can be thought of as an automorphism over some space $X$ (i.e. an isomorphism from $X$ to $X$), and the binary operation is function composition. The cardinality of the group is referred to as its order. If a subset $K$ of $G$ is also a group under the binary operation of $G$ (a.k.a. a subgroup), we denote it $K < G$.
A group action (to disambiguate between actions of mathematical groups and actions of reinforcement learning agents, we always refer to the former as “group actions”) is a function $G \times X \to X$ satisfying $ex = x$ (where $e$ is the identity element) and $(g \cdot h)x = g \cdot (hx)$, for all $g, h \in G$. In particular, for a fixed $g$, we get a function $\alpha_g : X \to X$ given by $x \mapsto gx$. $g$ can be thought of as “relabeling” the elements of $X$, which aligns with our definition of Dec-POMDP symmetry introduced in Section 2.4.

Groups have specific mathematical properties, such as associativity, containing the inverse of each element, and closure (closure is implicit in the binary operation being over $G$ itself), which are necessary for proving Propositions 1 and 2 and ensuring equivariance.
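As a small illustration of these axioms (a sketch of ours, not part of the paper), the snippet below represents the cyclic group $C_3$ of rotations of three elements as permutation matrices, realises function composition as matrix multiplication, and checks closure, the presence of the identity, and the presence of inverses.

```python
import itertools
import numpy as np

def perm_matrix(perm):
    """Permutation matrix P such that (P @ x)[i] = x[perm[i]] (a relabeling of X)."""
    n = len(perm)
    P = np.zeros((n, n), dtype=int)
    P[np.arange(n), perm] = 1
    return P

# Cyclic group C3 acting on a 3-element space X: identity, rotate by one, rotate by two.
group = [perm_matrix(p) for p in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]]
as_key = lambda P: tuple(map(tuple, P))
elements = {as_key(P) for P in group}
identity = np.eye(3, dtype=int)

# Closure: composing any two elements stays inside the group.
assert all(as_key(A @ B) in elements for A, B in itertools.product(group, group))
# The identity element is in the group, and every element has an inverse in the group.
assert any(np.array_equal(P, identity) for P in group)
assert all(any(np.array_equal(A @ B, identity) for B in group) for A in group)
print("C3 satisfies closure, identity, and inverses")
```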
2.3 Equivariance
Let $G$ be a group. Consider the set of all neural networks $\Psi$ of a given architecture whose first and ultimate layers are linear; that is, $\Psi := \{\psi \mid \psi = h(\hat{\psi}(f))\}$, where $f, h$ are linear functions and $\hat{\psi}$ is a non-linear function. We say a network $\psi \in \Psi$ is equivariant (with respect to a group $G$, or $G$-equivariant) if
$$K_g \psi(x) = \psi(L_g x), \quad \text{for all } g \in G, \, x \in \mathbb{R}^d, \qquad (2)$$
where $L_g$ is the group action of $g \in G$ on inputs, $K_g$ is similarly the group action on outputs, and $d$ is the dimension of our data. If $\psi$ satisfies Equation 2, then we say $\psi \in \Psi_{\text{equiv}}$. In the context of reinforcement learning, one may consider $\psi$ as the neural parameterization of a policy $\pi$ (for this reason, we use $\psi$ and $\pi$ more or less interchangeably), so a policy is $G$-equivariant if
$$\pi(K_g(a) \mid \tau) = \pi(a \mid L_g(\tau)), \quad \text{for all } g \in G \text{ and for all } \tau. \qquad (3)$$
If for any choice of $g \in G$ we have that $K_g = I$, the identity function/matrix, then we say $\pi$ (or $\psi$) is invariant to $g$.
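To make Equation 2 concrete, here is a small numerical sketch (ours; it assumes the group actions $L_g$ and $K_g$ are the same permutation matrices). The linear layers use parameter sharing of the form $aI + bJ$, with $J$ the all-ones matrix, which commutes with every permutation, so the resulting $\psi$ is $S_4$-equivariant by construction; the final assertion checks Equation 2 for every group element.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 4

def shared_linear(a, b):
    """Permutation-commuting linear map a*I + b*J (J is the all-ones matrix)."""
    return a * np.eye(d) + b * np.ones((d, d))

f = shared_linear(*rng.normal(size=2))  # first (linear) layer
h = shared_linear(*rng.normal(size=2))  # ultimate (linear) layer
psi = lambda x: h @ np.tanh(f @ x)      # psi = h(psi_hat(f(x))) with psi_hat = tanh

def perm_matrix(perm):
    P = np.zeros((d, d))
    P[np.arange(d), perm] = 1.0
    return P

# G = S_4, acting identically on inputs and outputs: L_g = K_g = P_g.
group = [perm_matrix(p) for p in itertools.permutations(range(d))]

x = rng.normal(size=d)
# Equation 2: K_g psi(x) == psi(L_g x) for all g in G.
assert all(np.allclose(P @ psi(x), psi(P @ x)) for P in group)
print("psi is S_4-equivariant")
```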
2.4 Other-Play and Dec-POMDP Symmetry
The OP learning rule [22] of a joint policy, $\pi = (\pi^1, \pi^2)$, is formally the maximization of the following objective (which we state for the two-player case for ease of notation):
$$\pi^* = \arg\max_{\pi} \, \mathbb{E}_{\phi \sim \Phi} \, J(\pi^1, \phi(\pi^2)), \qquad (4)$$
where $\Phi$ is the class of equivalence mappings (symmetries) for the given Dec-POMDP, so that each $\phi \in \Phi$ is an automorphism of $\mathcal{S}, \mathcal{A}, \mathcal{O}$ whose application leaves the Dec-POMDP unchanged up to relabeling ($\phi$ is shorthand for $\phi = \{\phi_S, \phi_A, \phi_O\}$). The expectation in Equation 4 is taken with respect to the uniform distribution on $\Phi$.

Each $\phi \in \Phi$ acts on the action-observation history as
$$\phi(\tau^i_t) = (\phi(a^i_0), \phi(o^i_1), \dots, \phi(a^i_{t-1}), \phi(o^i_t)), \qquad (5)$$
and acts on policies as
$$\hat{\pi} = \phi(\pi) \iff \hat{\pi}(\phi(a) \mid \phi(\tau)) = \pi(a \mid \tau). \qquad (6)$$
Policies $\pi, \hat{\pi}$ in Equation 6 are said to be symmetry-equivalent to one another with respect to $\phi$. To correspond with Equation 3, we have $L_\phi = \{\phi_A, \phi_O\}$ and $K_\phi = \phi_A$.
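As a minimal sketch of how the OP rule is typically realised in practice (our simplification: we assume the symmetries in $\Phi$ are permutations of a shared discrete alphabet of action/observation symbols), the snippet below samples a random $\phi$ and relabels an action-observation history as in Equation 5.

```python
import numpy as np

rng = np.random.default_rng(0)
num_symbols = 5  # shared alphabet of discrete action/observation symbols (assumed)

def sample_symmetry():
    """Draw phi uniformly from Phi; here, a random permutation of the symbol alphabet."""
    return rng.permutation(num_symbols)

def apply_symmetry(phi, history):
    """Relabel an action-observation history (a_0, o_1, ..., a_{t-1}, o_t), cf. Equation 5."""
    return [int(phi[x]) for x in history]

phi = sample_symmetry()
tau = [0, 3, 1, 4]  # (a_0, o_1, a_1, o_2) as symbol indices
print(phi, tau, apply_symmetry(phi, tau))
# During OP training, one agent's experience is relabeled with a freshly sampled phi each
# episode; one common implementation choice also maps that agent's chosen action back
# through the inverse permutation before executing it in the (unrelabeled) environment.
```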
2.5 Idealized Example
Consider a game where you coordinate with an unknown stranger. Each of you must pick a lever from a set of 10 levers. Nine of the levers have 1 written on them, and one has 0.9 written on it. If you and the stranger pick the same lever, then you are paid out the amount written on the lever; otherwise you are paid out nothing.
We have $G := S_9$, the symmetric group over $9$ elements (all possible ways to permute $9$ elements), which here coincides with the symmetry class $\Phi$, since the nine levers with $1$ written on them are symmetric. If you (using policy $\pi^1$) pick the $1$-point lever $l_1$ and the stranger (using policy $\pi^2$) picks the $1$-point lever $l_2$, then $\pi^1$ and $\pi^2$ would be symmetry-equivalent with respect to any $\phi \in G$ that permutes the lever choices (e.g. $\phi(l_1) = l_2$, or equivalently $L_\phi(l_1) = K_\phi(l_1) = l_2$, as the levers are both observations and actions). If, as is likely, $l_1 \neq l_2$, you and the stranger leave empty-handed, illustrating symmetry breaking.

If during training you and the stranger constrained yourselves to the space of equivariant policies whilst trying to maximize payoff, you would both converge to the policy that picks the $0.9$-point lever (which is not only equivariant, but invariant to all $\phi \in G$), and so the choice of using equivariant policies allows symmetry to be maintained and a payout to be guaranteed.
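A quick simulation (an illustrative sketch of ours) makes the payoff gap concrete: two independently trained agents that each commit to an arbitrary $1$-point lever only succeed when their arbitrary choices happen to coincide, whereas the invariant policy that picks the $0.9$-point lever is guaranteed its payout against any like-minded partner.

```python
import numpy as np

rng = np.random.default_rng(0)
payoffs = np.array([1.0] * 9 + [0.9])  # levers 0-8 pay 1.0, lever 9 pays 0.9

def crossplay_value(pick_a, pick_b):
    """Payoff when two independently trained agents are paired at test time."""
    return payoffs[pick_a] if pick_a == pick_b else 0.0

# Symmetry breaking: each self-play agent converges to some arbitrary 1-point lever.
trials = 10_000
sp = [crossplay_value(rng.integers(9), rng.integers(9)) for _ in range(trials)]
print("symmetry-breaking cross-play value ~", np.mean(sp))  # ~ 1/9 (about 0.11)

# Equivariant/invariant policy: both agents pick the unique 0.9-point lever.
print("invariant policy cross-play value  =", crossplay_value(9, 9))  # 0.9
```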
3 Method
We now present our methodology and architectural choices, as well as several theoretical results.
3.1 Architecture
We first consider the symmetrizer introduced in [35], which is a functional that transforms linear maps into equivariant linear maps. Inspired by this, we present a generalisation of the symmetrizer that maps from $\Psi$ to the equivariant subspace $\Psi_{\text{equiv}}$, namely $S : \Psi \to \Psi_{\text{equiv}}$, which we define as
$$S(\psi) = \frac{1}{|G|} \sum_{g \in G} K_g^{-1}\, \psi(L_g), \qquad (7)$$
so that the first layer of $\psi$, denoted $f$, being linear, is composed with each $L_g$, and the ultimate layer, denoted $h$, is similarly composed with $K_g^{-1}$ (i.e. $K_g^{-1}\psi(L_g) = K_g^{-1} h(\hat{\psi}(f \circ L_g))$, where $\psi = h(\hat{\psi}(f))$).
Below we establish properties of $S$ that were analogously shown for the linear symmetrizer case [35]:

Proposition 1 (Symmetric Property). $S(\psi) \in \Psi_{\text{equiv}}$ for all $\psi \in \Psi$; that is, $S$ maps neural networks to equivariant neural networks.

Proposition 2 (Fixing Property). $\Psi_{\text{equiv}} = \mathrm{Ran}(S)$; that is, the range of $S$ covers the entire equivariant subspace.

The proofs of these propositions can be found in the Appendix. The proof of Proposition 1 relies on properties of the group structure (such as closure of the group); it is thus critical for the symmetrizer to sum over permutations that together form a group structure, and not any random collection of permutations.