global optimal joint action value and the non-optimal joint
action values while shortening the distance between multiple
non-optimal joint action values via polarization policy gradi-
ent. MAPPG facilitates large-scale multi-agent cooperation
and presents a new multi-agent credit assignment paradigm,
enabling multi-agent policy learning to proceed like single-agent
policy learning (Wei and Luke 2016). Theoretically, we prove that
the individual policies of MAPPG converge to the global
optimum. Empirically, we verify that, unlike existing MAPG
algorithms, MAPPG converges to the global optimum in the
well-known matrix games (Son et al. 2019) and
differential games (Wei et al. 2018). We also show that
MAPPG outperforms the state-of-the-art MAPG algorithms
on StarCraft II unit micromanagement tasks (Samvelyan
et al. 2019), demonstrating its scalability in complex scenar-
ios. Finally, the results of ablation experiments match our
theoretical predictions.
2 Related Work
Implicit Credit Assignment
In general, implicit MAPG algorithms rely on a learned
mapping between the individual policies and the joint action
value to perform credit assignment. MADDPG (Lowe et al. 2017)
and LICA (Zhou et al. 2020) learn the individual policies by
directly ascending the approximate joint action value gra-
dients. The state-of-the-art MAPG algorithms (Wang et al.
2021b; Zhang et al. 2021; Peng et al. 2021; Su, Adams,
and Beling 2021) introduce the idea of value function de-
composition (Sunehag et al. 2018; Rashid et al. 2018; Son
et al. 2019; Wang et al. 2021a; Rashid et al. 2020) into
the multi-agent actor-critic framework. DOP (Wang et al.
2021b) decomposes the centralized critic into a weighted linear
summation of individual critics that condition on local actions.
FOP (Zhang et al. 2021) imposes a multiplicative factorization
of the optimal joint policy into individual optimal policies, and
optimizes both under maximum-entropy reinforcement learning
objectives. FACMAC (Peng
et al. 2021) proposes a new credit-assignment actor-critic
framework that factors the joint action value into individ-
ual action values and uses the centralized gradient estimator
for credit assignment. VDAC (Su, Adams, and Beling 2021)
achieves credit assignment by enforcing a monotonic
relationship between the joint action value and the shaped
individual action values. Although these algorithms admit
more expressive value function classes, the capacity of the
value mixing network is still limited by the monotonic
constraints, a claim we verify in our experiments.
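To make this restriction concrete, the decompositions used by these algorithms can be written schematically (in our notation; the exact forms differ across the cited papers) as
$$Q_{tot}(s, \mathbf{u}) = \sum_{a} k_a(s)\, Q_a(\tau_a, u_a) + b(s), \quad k_a(s) \ge 0 \quad \text{(DOP, linear mixing)},$$
$$\frac{\partial Q_{tot}(s, \mathbf{u})}{\partial Q_a(\tau_a, u_a)} \ge 0, \;\; \forall a \quad \text{(VDAC, monotonic mixing)},$$
where $Q_a$ is the individual critic of agent $a$ and $k_a(s)$, $b(s)$ are state-dependent mixing coefficients; in both cases the mixing restricts which joint action values are representable.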
Explicit Credit Assignment
In contrast to implicit algorithms, explicit MAPG algorithms
quantify the contribution of each individual agent's action,
and each individual actor is updated by following a policy
gradient tailored to that contribution. COMA (Foerster et al.
2018) evaluates the contribution of an individual agent's action
by using the centralized critic to compute an agent-specific
advantage function, written out below for reference. SQDDPG
(Wang et al. 2020) proposes a local reward algorithm, the
Shapley Q-value, which takes the expectation of the marginal
contributions over all possible coalitions. Although explicit
algorithms offer valuable insight into how much each agent's
action contributes to the global reward, and can thus
significantly facilitate policy optimization, the issue of
centralized-decentralized mismatch hinders their performance
in complex scenarios. In contrast, the proposed MAPPG
theoretically tackles the centralized-decentralized mismatch
and empirically outperforms existing MAPG algorithms in both
convergence speed and final performance in challenging
environments.
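For reference, COMA's agent-specific advantage (in our notation, following Foerster et al. 2018) marginalizes agent $a$'s own action out of the centralized critic while keeping the other agents' actions fixed:
$$A_a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'_a} \pi_a(u'_a \mid \tau_a)\, Q\big(s, (\mathbf{u}_{-a}, u'_a)\big),$$
and each actor then ascends the gradient of its log-policy weighted by this counterfactual advantage.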
3 Background
Dec-POMDP
A decentralized partially observable Markov decision process
(Dec-POMDP) is a tuple $\langle S, U, r, P, Z, O, n, \gamma \rangle$,
where $n$ agents identified by $a \in A \equiv \{1, \dots, n\}$ choose
sequential actions and $s \in S$ is the state. At each time step, each
agent chooses an action $u_a \in U$, forming a joint action
$\mathbf{u} \in \mathbf{U} \equiv U^n$ which induces a transition in the
environment according to the state transition function
$P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \to [0, 1]$.
Agents receive the same reward according to the reward function
$r(s, \mathbf{u}) : S \times \mathbf{U} \to \mathbb{R}$. Each agent has an
observation function $O(s, a) : S \times A \to Z$, from which a partial
observation $z_a \in Z$ is drawn. $\gamma \in [0, 1)$ is the discount
factor. Throughout this paper, we denote joint quantities over agents
in bold, quantities of agent $a$ with the subscript $a$, and joint
quantities over agents other than a given agent $a$ with the subscript $-a$.
Each agent learns a stochastic policy for action selection,
$\pi_a : T \times U \to [0, 1]$, where $\tau_a \in T \equiv (Z \times U)^*$
is the action-observation history of agent $a$. MARL agents aim to
maximize the cumulative return $R_t = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l}$,
where $r_t$ is the reward obtained from the environment by all
agents at step $t$.
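Purely as an illustration of this interaction protocol, the following sketch shows a Dec-POMDP-style environment interface and the discounted team return defined above; the class and method names are hypothetical and not taken from any existing library.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepResult:
    observations: List[List[float]]  # one partial observation z_a per agent
    reward: float                    # shared team reward r(s, u)
    done: bool

class DecPOMDPEnv:
    """Hypothetical interface for the Dec-POMDP tuple <S, U, r, P, Z, O, n, gamma>."""
    n_agents: int

    def reset(self) -> List[List[float]]:
        """Return initial partial observations O(s, a) for all agents."""
        raise NotImplementedError

    def step(self, joint_action: List[int]) -> StepResult:
        """Apply the joint action u, sample s' ~ P(.|s, u), and return (z, r, done)."""
        raise NotImplementedError

def discounted_return(env: DecPOMDPEnv, policies, gamma: float = 0.99) -> float:
    """Accumulate R_t = sum_l gamma^l r_{t+l} for one episode (shared by all agents)."""
    obs = env.reset()
    ret, discount, done = 0.0, 1.0, False
    while not done:
        # Each agent a selects u_a using only its own local observation z_a.
        joint_action = [policies[a].act(obs[a]) for a in range(env.n_agents)]
        result = env.step(joint_action)
        obs, done = result.observations, result.done
        ret += discount * result.reward
        discount *= gamma
    return ret
```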
Multi-Agent Policy Gradient
We first provide the background on single-agent policy gra-
dient algorithms, and then introduce multi-agent policy gra-
dient algorithms. In single-agent continuous control tasks,
policy gradient algorithms (Sutton et al. 1999) optimise a
single agent's policy, parameterised by $\theta$, by performing
gradient ascent on an estimator of the expected discounted total
reward, $\nabla_\theta J(\pi) = \mathbb{E}_\pi\left[\nabla_\theta \log \pi(u \mid s)\, R_0\right]$,
where the gradient is estimated from trajectories sampled from the
environment. Actor-critic algorithms (Sutton et al. 1999; Konda and
Tsitsiklis 1999; Schulman et al. 2016) use an estimated action value
instead of the discounted return to mitigate the high variance caused
by the likelihood-ratio trick in the above formula. The gradient of
the policy in the single-agent setting can be written as
$$\nabla_\theta J(\pi) = \mathbb{E}_\pi\left[\nabla_\theta \log \pi(u \mid s)\, Q(s, u)\right]. \tag{1}$$
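To make the single-agent form in Eq. (1) concrete, here is a minimal PyTorch sketch of one actor update under the sampled estimator of Eq. (1); the network sizes, the external critic providing $Q(s,u)$, and the names are illustrative assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Small categorical policy pi(u|s); sizes are illustrative."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return F.softmax(self.net(state), dim=-1)  # pi(.|s)

def policy_gradient_step(actor, optimizer, states, actions, q_values):
    """One gradient-ascent step on the sampled estimator of Eq. (1).

    states:   (B, state_dim) float tensor
    actions:  (B,) long tensor of sampled actions u ~ pi(.|s)
    q_values: (B,) tensor of critic estimates Q(s, u), treated as constants
    """
    probs = actor(states)                                          # (B, n_actions)
    log_pi_u = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1) + 1e-8)
    # Negate because optimizers minimize; detach Q so gradients flow only
    # through log pi, matching the likelihood-ratio form of Eq. (1).
    loss = -(log_pi_u * q_values.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```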
A natural extension to multi-agent settings leads to the
multi-agent stochastic policy gradient theorem, with agent
$a$'s policy parameterized by $\theta_a$ (Foerster et al. 2018; Wei