global optimal joint action value and the non-optimal joint
action values while shortening the distance between multiple
non-optimal joint action values via polarization policy gradi-
ent. MAPPG facilitates large-scale multi-agent cooperation
and presents a new multi-agent credit assignment paradigm,
enabling multi-agent policy learning to proceed like single-agent
policy learning (Wei and Luke 2016). Theoretically, we prove that
the individual policies of MAPPG converge to the global
optimum. Empirically, we verify that, unlike existing MAPG
algorithms, MAPPG converges to the global optimum in the
well-known matrix games (Son et al. 2019) and
differential games (Wei et al. 2018). We also show that
MAPPG outperforms the state-of-the-art MAPG algorithms
on StarCraft II unit micromanagement tasks (Samvelyan
et al. 2019), demonstrating its scalability in complex scenar-
ios. Finally, the results of ablation experiments match our
theoretical predictions.
2 Related Work
Implicit Credit Assignment
In general, implicit MAPG algorithms rely on a learned
mapping between the individual policies and the joint action
value to perform credit assignment. MADDPG (Lowe et al. 2017)
and LICA (Zhou et al. 2020) learn the individual policies by
directly ascending the approximate joint action value gra-
dients. The state-of-the-art MAPG algorithms (Wang et al.
2021b; Zhang et al. 2021; Peng et al. 2021; Su, Adams,
and Beling 2021) introduce the idea of value function de-
composition (Sunehag et al. 2018; Rashid et al. 2018; Son
et al. 2019; Wang et al. 2021a; Rashid et al. 2020) into
the multi-agent actor-critic framework. DOP (Wang et al.
2021b) decomposes the centralized critic into a weighted linear
summation of individual critics that condition on local actions.
FOP (Zhang et al. 2021) imposes a multiplicative factorization
of the optimal joint policy into individual optimal policies, and
optimizes both under maximum-entropy reinforcement learning
objectives. FACMAC (Peng
et al. 2021) proposes a new credit-assignment actor-critic
framework that factors the joint action value into individ-
ual action values and uses the centralized gradient estimator
for credit assignment. VDAC (Su, Adams, and Beling 2021)
achieves credit assignment by enforcing a monotonic
relationship between the joint action value and the shaped
individual action values. Although these algorithms admit
more expressive value function classes, the capacity of the
value mixing network is still limited by the monotonic
constraints, a claim we verify in our experiments.
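To make this restriction concrete, the decompositions used by these algorithms can be written schematically (in our notation; the exact forms differ across the cited papers) as
$$Q_{tot}(s, \mathbf{u}) = \sum_{a} k_a(s)\, Q_a(\tau_a, u_a) + b(s), \quad k_a(s) \ge 0 \quad \text{(DOP, linear mixing)},$$
$$\frac{\partial Q_{tot}(s, \mathbf{u})}{\partial Q_a(\tau_a, u_a)} \ge 0, \;\; \forall a \quad \text{(VDAC, monotonic mixing)},$$
where $Q_a$ is the individual critic of agent $a$ and $k_a(s)$, $b(s)$ are state-dependent mixing coefficients; in both cases the mixing restricts which joint action values are representable.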
Explicit Credit Assignment
In contrast to implicit algorithms, explicit MAPG algorithms
quantify the contribution of each individual agent's action,
and each individual actor is updated by following a policy
gradient tailored to that contribution. COMA (Foerster et al.
2018) evaluates the contribution of an individual agent's action
by using the centralized critic to compute an agent-specific
advantage function, written out below for reference. SQDDPG
(Wang et al. 2020) proposes a local reward algorithm, the
Shapley Q-value, which takes the expectation of the marginal
contributions over all possible coalitions. Although explicit
algorithms offer valuable insight into how much each agent's
action contributes to the global reward, and can thus
significantly facilitate policy optimization, the issue of
centralized-decentralized mismatch hinders their performance
in complex scenarios. In contrast, the proposed MAPPG
theoretically tackles the centralized-decentralized mismatch
and empirically outperforms existing MAPG algorithms in both
convergence speed and final performance in challenging
environments.
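For reference, COMA's agent-specific advantage (in our notation, following Foerster et al. 2018) marginalizes agent $a$'s own action out of the centralized critic while keeping the other agents' actions fixed:
$$A_a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'_a} \pi_a(u'_a \mid \tau_a)\, Q\big(s, (\mathbf{u}_{-a}, u'_a)\big),$$
and each actor then ascends the gradient of its log-policy weighted by this counterfactual advantage.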
3 Background
Dec-POMDP
A decentralized partially observable Markov decision process
(Dec-POMDP) is a tuple $\langle S, U, r, P, Z, O, n, \gamma \rangle$,
where $n$ agents identified by $a \in A \equiv \{1, \dots, n\}$ choose
sequential actions and $s \in S$ is the state. At each time step, each
agent chooses an action $u_a \in U$, forming a joint action
$\mathbf{u} \in \mathbf{U} \equiv U^n$ which induces a transition in the
environment according to the state transition function
$P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \to [0, 1]$.
Agents receive the same reward according to the reward function
$r(s, \mathbf{u}) : S \times \mathbf{U} \to \mathbb{R}$. Each agent has an
observation function $O(s, a) : S \times A \to Z$, from which a partial
observation $z_a \in Z$ is drawn. $\gamma \in [0, 1)$ is the discount
factor. Throughout this paper, we denote joint quantities over agents
in bold, quantities of agent $a$ with the subscript $a$, and joint
quantities over agents other than a given agent $a$ with the subscript $-a$.
Each agent learns a stochastic policy for action selection,
$\pi_a : T \times U \to [0, 1]$, where $\tau_a \in T \equiv (Z \times U)^*$
is the action-observation history of agent $a$. MARL agents aim to
maximize the cumulative return $R_t = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l}$,
where $r_t$ is the reward obtained from the environment by all
agents at step $t$.
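Purely as an illustration of this interaction protocol, the following sketch shows a Dec-POMDP-style environment interface and the discounted team return defined above; the class and method names are hypothetical and not taken from any existing library.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepResult:
    observations: List[List[float]]  # one partial observation z_a per agent
    reward: float                    # shared team reward r(s, u)
    done: bool

class DecPOMDPEnv:
    """Hypothetical interface for the Dec-POMDP tuple <S, U, r, P, Z, O, n, gamma>."""
    n_agents: int

    def reset(self) -> List[List[float]]:
        """Return initial partial observations O(s, a) for all agents."""
        raise NotImplementedError

    def step(self, joint_action: List[int]) -> StepResult:
        """Apply the joint action u, sample s' ~ P(.|s, u), and return (z, r, done)."""
        raise NotImplementedError

def discounted_return(env: DecPOMDPEnv, policies, gamma: float = 0.99) -> float:
    """Accumulate R_t = sum_l gamma^l r_{t+l} for one episode (shared by all agents)."""
    obs = env.reset()
    ret, discount, done = 0.0, 1.0, False
    while not done:
        # Each agent a selects u_a using only its own local observation z_a.
        joint_action = [policies[a].act(obs[a]) for a in range(env.n_agents)]
        result = env.step(joint_action)
        obs, done = result.observations, result.done
        ret += discount * result.reward
        discount *= gamma
    return ret
```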
Multi-Agent Policy Gradient
We first provide the background on single-agent policy gra-
dient algorithms, and then introduce multi-agent policy gra-
dient algorithms. In single-agent continuous control tasks,
policy gradient algorithms (Sutton et al. 1999) optimise a
single agent's policy, parameterised by $\theta$, by performing
gradient ascent on an estimator of the expected discounted total
reward, $\nabla_\theta J(\pi) = \mathbb{E}_\pi\left[\nabla_\theta \log \pi(u \mid s)\, R_0\right]$,
where the gradient is estimated from trajectories sampled from the
environment. Actor-critic algorithms (Sutton et al. 1999; Konda and
Tsitsiklis 1999; Schulman et al. 2016) use an estimated action value
instead of the discounted return to mitigate the high variance caused
by the likelihood-ratio trick in the above formula. The gradient of
the policy in the single-agent setting can be written as
$$\nabla_\theta J(\pi) = \mathbb{E}_\pi\left[\nabla_\theta \log \pi(u \mid s)\, Q(s, u)\right]. \tag{1}$$
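To make the single-agent form in Eq. (1) concrete, here is a minimal PyTorch sketch of one actor update under the sampled estimator of Eq. (1); the network sizes, the external critic providing $Q(s,u)$, and the names are illustrative assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Small categorical policy pi(u|s); sizes are illustrative."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return F.softmax(self.net(state), dim=-1)  # pi(.|s)

def policy_gradient_step(actor, optimizer, states, actions, q_values):
    """One gradient-ascent step on the sampled estimator of Eq. (1).

    states:   (B, state_dim) float tensor
    actions:  (B,) long tensor of sampled actions u ~ pi(.|s)
    q_values: (B,) tensor of critic estimates Q(s, u), treated as constants
    """
    probs = actor(states)                                          # (B, n_actions)
    log_pi_u = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1) + 1e-8)
    # Negate because optimizers minimize; detach Q so gradients flow only
    # through log pi, matching the likelihood-ratio form of Eq. (1).
    loss = -(log_pi_u * q_values.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```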
A natural extension to multi-agent settings leads to the
multi-agent stochastic policy gradient theorem, with agent
$a$'s policy parameterized by $\theta_a$ (Foerster et al. 2018; Wei