Learning Explicit Credit Assignment for Cooperative Multi-Agent Reinforcement
Learning via Polarization Policy Gradient
Wubing Chen1, Wenbin Li1*, Xiao Liu1, Shangdong Yang2,1, Yang Gao1
1State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
2Nanjing University of Posts and Telecommunications, Nanjing 210023, China
wuzbingchen@gmail.com, liwenbin@nju.edu.cn, liuxiao730@outlook.com, sdyang@njupt.edu.cn, gaoy@nju.edu.cn
*Corresponding authors.
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract
Cooperative multi-agent policy gradient (MAPG) algorithms have recently attracted wide attention and are regarded as a general scheme for multi-agent systems. Credit assignment plays an important role in MAPG and can induce cooperation among multiple agents. However, most MAPG algorithms cannot achieve good credit assignment because of the game-theoretic pathology known as centralized-decentralized mismatch. To address this issue, this paper presents a novel method, Multi-Agent Polarization Policy Gradient (MAPPG). MAPPG uses a simple but efficient polarization function to transform the optimal consistency of joint and individual actions into easily realized constraints, thus enabling efficient credit assignment in MAPG. Theoretically, we prove that the individual policies of MAPPG converge to the global optimum. Empirically, we evaluate MAPPG on the well-known matrix game and differential game, and verify that MAPPG converges to the global optimum for both discrete and continuous action spaces. We also evaluate MAPPG on a set of StarCraft II micromanagement tasks and demonstrate that MAPPG outperforms state-of-the-art MAPG algorithms.
1 Introduction
Multi-agent reinforcement learning (MARL) is a key technique for solving sequential decision problems with multiple agents. Recent developments in MARL have heightened the need for fully cooperative MARL, which maximizes a reward shared by all agents. Cooperative MARL has made remarkable advances in many domains, including autonomous driving (Cao et al. 2021) and cooperative transport (Shibata, Jimbo, and Matsubara 2021). To mitigate the combinatorial nature (Hernandez-Leal, Kartal, and Taylor 2019) and partial observability (Omidshafiei et al. 2017) of MARL, centralized training with decentralized execution (CTDE) (Oliehoek, Spaan, and Vlassis 2008; Kraemer and Banerjee 2016) has become one of the mainstream settings for MARL: global information is provided to promote collaboration during training, while the learned policies are executed based only on local observations.
Multi-agent credit assignment is a crucial challenge in MARL under the CTDE setting; it refers to attributing a global environmental reward to the individual agents' actions (Zhou et al. 2020). With proper credit assignment, multiple independent agents can learn effective collaboration policies and accomplish challenging tasks. MARL algorithms can be divided into value-based and policy-based methods. Cooperative multi-agent policy gradient (MAPG) algorithms, the focus of our study, can handle both discrete and continuous action spaces. Different MAPG algorithms adopt different credit assignment paradigms, which can be divided into implicit and explicit credit assignment (Zhou et al. 2020). Solving the credit assignment problem implicitly requires representing the joint action value as a function of individual policies (Lowe et al. 2017; Zhou et al. 2020; Wang et al. 2021b; Zhang et al. 2021; Peng et al. 2021). Current state-of-the-art MAPG algorithms (Wang et al. 2021b; Zhang et al. 2021; Peng et al. 2021) impose a monotonic constraint between the joint action value and the individual policies. While some algorithms allow more expressive value function classes, the capacity of the value mixing network is still limited by the monotonic constraints (Son et al. 2019; Wang et al. 2021a). Other algorithms achieve explicit credit assignment, mainly by providing a shaped reward for each individual agent's action (Proper and Tumer 2012; Foerster et al. 2018; Su, Adams, and Beling 2021). However, there is a large performance discrepancy between algorithms with explicit credit assignment and algorithms with implicit credit assignment.
In this paper, we analyze this discrepancy and pinpoint that the centralized-decentralized mismatch hinders the performance of MAPG algorithms with explicit credit assignment. The centralized-decentralized mismatch arises when the sub-optimal policies of some agents negatively affect the assessment of other agents' actions, which leads to catastrophic miscoordination. Note that the issue of centralized-decentralized mismatch was raised by DOP (Wang et al. 2021b); however, the linearly decomposed critic adopted by DOP limits its representational expressiveness for the value function.
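To make the mismatch concrete, consider a hypothetical numeric illustration (our own, not taken from the paper) based on the non-monotonic matrix game of Son et al. (2019) that is also used in our experiments; the payoff values below follow the commonly cited version of that game and are for illustration only. When a teammate's policy rarely selects its optimal action, the expected value of an agent's own optimal action, evaluated under that teammate's policy, can fall below the value of a sub-optimal action, pushing the agent away from the joint optimum.

```python
import numpy as np

# Illustrative non-monotonic payoff matrix in the style of Son et al. (2019):
# rows = agent 1's action, columns = agent 2's action.
payoff = np.array([[  8., -12., -12.],
                   [-12.,   0.,   0.],
                   [-12.,   0.,   0.]])

# Suppose agent 2's (sub-optimal) policy rarely picks its optimal first action.
pi_2 = np.array([0.1, 0.45, 0.45])

# Agent 1's expected value for each of its own actions under agent 2's policy.
expected_value_agent1 = payoff @ pi_2
print(expected_value_agent1)
# ~[-10.0, -1.2, -1.2]: the jointly optimal action looks worst to agent 1,
# so its evaluation is corrupted by the teammate's sub-optimal policy.
```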
Inspired by Polarized-VAE (Balasubramanian et al. 2021) and Weighted QMIX (Rashid et al. 2020), we propose a policy-based algorithm called Multi-Agent Polarization Policy Gradient (MAPPG) for learning explicit credit assignment and addressing the centralized-decentralized mismatch. Via its polarization policy gradient, MAPPG increases the distance between the globally optimal joint action value and the non-optimal joint action values while shortening the distances among the non-optimal joint action values. MAPPG facilitates large-scale multi-agent cooperation and presents a new multi-agent credit assignment paradigm, enabling multi-agent policy learning to proceed like single-agent policy learning (Wei and Luke 2016). Theoretically, we prove that the individual policies of MAPPG converge to the global optimum. Empirically, we verify that MAPPG converges to the global optimum, in contrast to existing MAPG algorithms, in the well-known matrix game (Son et al. 2019) and differential game (Wei et al. 2018). We also show that MAPPG outperforms state-of-the-art MAPG algorithms on StarCraft II unit micromanagement tasks (Samvelyan et al. 2019), demonstrating its scalability in complex scenarios. Finally, the results of our ablation experiments match our theoretical predictions.
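The polarization function itself is defined later in the paper; as a purely illustrative stand-in (the softmax-style transform below is our own example, not MAPPG's function), the sketch shows the general effect being described: a sharp convex transform of joint action values stretches the optimal value away from the rest while squeezing the non-optimal values toward one another.

```python
import numpy as np

# Illustrative joint action values: one optimal (5.0) and several non-optimal ones.
q = np.array([5.0, 3.0, 2.5, 2.0])

def polarize(values, beta=3.0):
    """Stand-in polarization transform (softmax with a large inverse temperature),
    NOT the paper's polarization function: after normalization, the optimal entry
    dominates and the non-optimal entries collapse toward each other."""
    w = np.exp(beta * (values - values.max()))
    return w / w.sum()

print(polarize(q))
# ~[0.9969, 0.0025, 0.0006, 0.0001]: the optimal-vs-rest gap grows,
# while the gaps among the non-optimal entries shrink to near zero.
```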
2 Related Work
Implicit Credit Assignment
In general, implicit MAPG algorithms use a learned function between the individual policies and the joint action values for credit assignment. MADDPG (Lowe et al. 2017) and LICA (Zhou et al. 2020) learn the individual policies by directly ascending the approximate joint-action-value gradients. The state-of-the-art MAPG algorithms (Wang et al. 2021b; Zhang et al. 2021; Peng et al. 2021; Su, Adams, and Beling 2021) introduce the idea of value function decomposition (Sunehag et al. 2018; Rashid et al. 2018; Son et al. 2019; Wang et al. 2021a; Rashid et al. 2020) into the multi-agent actor-critic framework. DOP (Wang et al. 2021b) decomposes the centralized critic into a weighted linear summation of individual critics conditioned on local actions. FOP (Zhang et al. 2021) imposes a multiplicative form between the optimal joint policy and the individual optimal policies, and optimizes both based on maximum-entropy reinforcement learning objectives. FACMAC (Peng et al. 2021) proposes a new credit-assignment actor-critic framework that factors the joint action value into individual action values and uses a centralized gradient estimator for credit assignment. VDAC (Su, Adams, and Beling 2021) achieves credit assignment by enforcing a monotonic relationship between the joint action values and the shaped individual action values. Although these algorithms allow more expressive value function classes, the capacity of the value mixing network is still limited by the monotonic constraints, a claim we verify in our experiments.
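For background on that limitation, here is a minimal sketch of a QMIX-style monotonic mixing step (our own simplification, not any of the cited implementations): the joint value is a mixture of individual values whose mixing weights are forced to be non-negative, so $\partial Q_{tot} / \partial Q_a \ge 0$ for every agent, which is exactly the constraint that caps the representable class of joint value functions.

```python
import torch

def monotonic_mix(q_individual: torch.Tensor, w_raw: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """QMIX-style monotonic mixing (simplified, single layer).

    q_individual: (batch, n_agents) per-agent action values.
    w_raw:        (batch, n_agents) unconstrained weights, e.g. from a hypernetwork.
    b:            (batch, 1) state-dependent bias.
    Taking |w_raw| enforces dQ_tot/dQ_a >= 0, i.e. the monotonic constraint.
    """
    w = torch.abs(w_raw)  # non-negative mixing weights
    return (q_individual * w).sum(dim=1, keepdim=True) + b

# Tiny usage example with random tensors standing in for network outputs.
q_a = torch.randn(4, 3)   # 4 samples, 3 agents
w = torch.randn(4, 3)
b = torch.randn(4, 1)
q_tot = monotonic_mix(q_a, w, b)   # shape (4, 1)
```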
Explicit Credit Assignment
In contrast to implicit algorithms, explicit MAPG algorithms quantify the contribution of each individual agent's action, and the individual actor is updated by following policy gradients tailored to that contribution. COMA (Foerster et al. 2018) evaluates the contribution of individual agents' actions by using the centralized critic to compute an agent-specific advantage function. SQDDPG (Wang et al. 2020) proposes a local-reward algorithm, the Shapley Q-value, which takes the expectation of the marginal contributions over all possible coalitions. Although explicit algorithms provide valuable insights into assessing the contribution of individual agents' actions to the global reward, and thus can significantly facilitate policy optimization, the issue of centralized-decentralized mismatch hinders their performance in complex scenarios. Compared to existing explicit algorithms, the proposed MAPPG theoretically tackles the challenge of centralized-decentralized mismatch and experimentally outperforms existing MAPG algorithms in both convergence speed and final performance in challenging environments.
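For reference, a short sketch of the COMA-style counterfactual advantage mentioned above (a simplified discrete-action version with our own tensor shapes, not COMA's actual code): the baseline marginalizes agent $a$'s action out of the centralized critic while holding the other agents' actions fixed.

```python
import torch

def coma_advantage(q_values: torch.Tensor, pi_a: torch.Tensor, u_a: torch.Tensor) -> torch.Tensor:
    """Counterfactual advantage for a single agent (simplified).

    q_values: (batch, n_actions) centralized Q(s, (u_{-a}, .)) for each of agent a's
              candidate actions, with the other agents' actions held fixed.
    pi_a:     (batch, n_actions) agent a's current policy.
    u_a:      (batch,) index of the action agent a actually took.
    """
    q_taken = q_values.gather(1, u_a.unsqueeze(1)).squeeze(1)   # Q(s, u)
    baseline = (pi_a * q_values).sum(dim=1)                     # expected Q under agent a's policy
    return q_taken - baseline

# Usage with dummy tensors.
q = torch.randn(8, 5)
pi = torch.softmax(torch.randn(8, 5), dim=1)
u = torch.randint(0, 5, (8,))
adv = coma_advantage(q, pi, u)   # shape (8,)
```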
3 Background
Dec-POMDP
A decentralized partially observable Markov decision process (Dec-POMDP) is a tuple $\langle S, U, r, P, Z, O, n, \gamma \rangle$, where $n$ agents identified by $a \in A \equiv \{1, \ldots, n\}$ choose sequential actions and $s \in S$ is the state. At each time step, each agent chooses an action $u_a \in U$, forming a joint action $\mathbf{u} \in \mathbf{U} \equiv U^n$, which induces a transition in the environment according to the state transition function $P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \to [0, 1]$. Agents receive the same reward according to the reward function $r(s, \mathbf{u}) : S \times \mathbf{U} \to \mathbb{R}$. Each agent has an observation function $O(s, a) : S \times A \to Z$, from which a partial observation $z_a \in Z$ is drawn. $\gamma \in [0, 1)$ is the discount factor. Throughout this paper, we denote joint quantities over agents in bold, quantities with the subscript $a$ denote quantities of agent $a$, and joint quantities over agents other than a given agent $a$ carry the subscript $-a$.

Each agent learns a stochastic policy for action selection, $\pi_a : T \times U \to [0, 1]$, where $\tau_a \in T \equiv (Z \times U)^*$ is an action-observation history for agent $a$. MARL agents try to maximize the cumulative return, $R_t = \sum_{t=1}^{\infty} \gamma^{t-1} r_t$, where $r_t$ is the reward obtained from the environment by all agents at step $t$.
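To make this notation concrete, here is a minimal sketch of the corresponding environment interface in code (our own hypothetical typing, not from the paper): the environment steps on a joint action and returns a shared reward plus per-agent partial observations, and the return is the discounted sum of those shared rewards.

```python
from typing import List, Protocol, Tuple

class DecPOMDPEnv(Protocol):
    """Hypothetical interface matching the tuple <S, U, r, P, Z, O, n, gamma>."""
    n_agents: int

    def reset(self) -> List[int]:
        """Return one (encoded) partial observation z_a per agent."""
        ...

    def step(self, joint_action: Tuple[int, ...]) -> Tuple[List[int], float, bool]:
        """Apply the joint action u; return per-agent observations,
        the shared reward r(s, u), and a terminal flag."""
        ...

def discounted_return(rewards: List[float], gamma: float) -> float:
    """Cumulative return R_t = sum_t gamma^(t-1) * r_t over one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```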
Multi-Agent Policy Gradient
We first provide background on single-agent policy gradient algorithms and then introduce multi-agent policy gradient algorithms. In single-agent continuous control tasks, policy gradient algorithms (Sutton et al. 1999) optimize a single agent's policy, parameterized by $\theta$, by performing gradient ascent on an estimator of the expected discounted total reward, $\nabla_{\theta} J(\pi) = \mathbb{E}_{\pi}\left[ \nabla_{\theta} \log \pi(u \mid s)\, R_0 \right]$, where the gradient is estimated from trajectories sampled from the environment. Actor-critic algorithms (Sutton et al. 1999; Konda and Tsitsiklis 1999; Schulman et al. 2016) use an estimated action value instead of the discounted return to reduce the high variance caused by the likelihood-ratio trick in the above formula. The gradient of the policy in the single-agent setting can be defined as:

$$\nabla_{\theta} J(\pi) = \mathbb{E}_{\pi}\left[ \nabla_{\theta} \log \pi(u \mid s)\, Q(s, u) \right]. \tag{1}$$
A natural extension to multi-agent settings leads to the multi-agent stochastic policy gradient theorem, with agent $a$'s policy parameterized by $\theta_a$ (Foerster et al. 2018; Wei et al. 2018).
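For reference, a commonly used form of this theorem, stated in the notation above (this follows, e.g., Foerster et al. 2018, and is not necessarily verbatim the equation used in this paper), is

$$\nabla_{\theta_a} J(\pi) = \mathbb{E}_{\pi}\left[ \nabla_{\theta_a} \log \pi_a(u_a \mid \tau_a)\, Q_{\pi}(s, \mathbf{u}) \right],$$

where $Q_{\pi}(s, \mathbf{u})$ is the centralized joint action value.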