MASTERING THE GAME OF NO-PRESS
DIPLOMACY VIA HUMAN-REGULARIZED
REINFORCEMENT LEARNING AND PLANNING
Anton Bakhtin
Meta AI
David J Wu
Meta AI
Adam Lerer
Meta AI
Jonathan Gray
Meta AI
Athul Paul Jacob
MIT
Gabriele Farina
Meta AI
Alexander H Miller
Meta AI
Noam Brown
Meta AI
ABSTRACT
No-press Diplomacy is a complex strategy game involving both cooperation and
competition that has served as a benchmark for multi-agent AI research. While
self-play reinforcement learning has resulted in numerous successes in purely
adversarial games like chess, Go, and poker, self-play alone is insufficient for
achieving optimal performance in domains involving cooperation with humans.
We address this shortcoming by first introducing a planning algorithm we call
DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-
learned policy. We prove that this is a no-regret learning algorithm under a modi-
fied utility function. We then show that DiL-piKL can be extended into a self-play
reinforcement learning algorithm we call RL-DiL-piKL that provides a model of
human play while simultaneously training an agent that responds well to this hu-
man model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a
200-game no-press Diplomacy tournament involving 62 human participants span-
ning skill levels from beginner to expert, two Diplodocus agents both achieved a
higher average score than all other participants who played more than two games,
and ranked first and third according to an Elo ratings model.
1 INTRODUCTION
In two-player zero-sum (2p0s) settings, principled self-play algorithms converge to a minimax equi-
librium, which in a balanced game ensures that a player will not lose in expectation regardless of
the opponent’s strategy (Neumann, 1928). This fact has allowed self-play, even without human
data, to achieve remarkable success in 2p0s games like chess (Silver et al., 2018), Go (Silver et al.,
2017), poker (Bowling et al., 2015; Brown & Sandholm, 2017), and Dota 2 (Berner et al., 2019).1
In principle, any finite 2p0s game can be solved via self-play given sufficient compute and memory.
However, in games involving cooperation, self-play alone no longer guarantees good performance
when playing with humans, even with infinite compute and memory. This is because in complex
domains there may be arbitrarily many conventions and expectations for how to cooperate, of which
humans may use only a small subset (Lerer & Peysakhovich, 2019). The clearest example of this
is language. A self-play agent trained from scratch without human data in a cooperative game in-
volving free-form communication channels would almost certainly not converge to using English
as the medium of communication. Obviously, such an agent would perform poorly when paired
with a human English speaker. Indeed, prior work has shown that naïve extensions of self-play from
scratch without human data perform poorly when playing with humans or human-like agents even in
dialogue-free domains that involve cooperation rather than just competition, such as the benchmark
games no-press Diplomacy (Bakhtin et al., 2021) and Hanabi (Siu et al., 2021; Cui et al., 2021).
Equal first author contribution.
1Dota 2 is a two-team zero-sum game, but the presence of full information sharing between teammates
makes it equivalent to 2p0s. Beyond 2p0s settings, self-play algorithms have also proven successful in highly
adversarial games like six-player poker (Brown & Sandholm, 2019).
Recently, Jacob et al. (2022) introduced piKL, which models human behavior in many games better
than pure behavioral cloning (BC) on human data by regularizing inference-time planning toward a
BC policy. In this work, we introduce an extension of piKL, called DiL-piKL, that replaces piKL's
single fixed regularization parameter λ with a probability distribution over λ parameters. We then
show how DiL-piKL can be combined with self-play reinforcement learning, allowing us to train a
strong agent that performs well with humans. We call this algorithm RL-DiL-piKL.
Using RL-DiL-piKL we trained an agent, Diplodocus, to play no-press Diplomacy, a difficult bench-
mark for multi-agent AI that has been actively studied in recent years (Paquette et al., 2019; Anthony
et al., 2020; Gray et al., 2020; Bakhtin et al., 2021; Jacob et al., 2022). We conducted a 200-game
no-press Diplomacy tournament with a diverse pool of human players, including expert humans, in
which we tested two versions of Diplodocus using different RL-DiL-piKL settings, and other base-
line agents. All games consisted of one bot and six humans, with all players being anonymous for
the duration of the game. These two versions of Diplodocus achieved the top two average scores
in the tournament among all 48 participants who played more than two games, and ranked first and
third overall among all participants according to an Elo ratings model.
2 BACKGROUND AND PRIOR WORK
Diplomacy is a benchmark 7-player mixed cooperative/competitive game featuring simultaneous
moves and a heavy emphasis on negotiation and coordination. In the no-press variant of the game,
there is no cheap talk communication. Instead, players only implicitly communicate through moves.
In the game, seven players compete for majority control of 34 “supply centers” (SCs) on a map.
On each turn, players simultaneously choose actions consisting of an order for each of their units to
hold, move, support or convoy another unit. If no player controls a majority of SCs and all remaining
players agree to a draw or a turn limit is reached then the game ends in a draw. In this case, we use
a common scoring system in which the score of player i is C_i^2 / Σ_{i'} C_{i'}^2, where C_i is the number of
SCs player i owns. A more detailed description of the rules is provided in Appendix B.
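To make the scoring rule concrete, here is a minimal Python sketch; the function name and the example center counts are ours, not from the paper.

```python
import numpy as np

def sum_of_squares_scores(center_counts):
    """Score a drawn game under the sum-of-squares system.

    center_counts: supply-center counts C_i for the seven players.
    Returns each player's score C_i^2 / sum_j C_j^2.
    """
    counts = np.asarray(center_counts, dtype=float)
    squared = counts ** 2
    return squared / squared.sum()

# Example: a three-way draw with 12, 11, and 11 centers (other players eliminated).
print(sum_of_squares_scores([12, 11, 11, 0, 0, 0, 0]))
```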
Most recent successes in no-press Diplomacy use deep learning to imitate human behavior given a
corpus of human games. The first Diplomacy agent to leverage deep imitation learning was Paquette
et al. (2019). Subsequent work on no-press Diplomacy has mostly relied on a similar architecture
with some modeling improvements (Gray et al., 2020; Anthony et al., 2020; Bakhtin et al., 2021).
Gray et al. (2020) proposed an agent that plays an improved policy via one-ply search. It uses policy
and value functions trained on human data to conduct search using regret minimization.
Several works explored applying self-play to compute improved policies. Paquette et al. (2019)
applied an actor-critic approach and found that while the agent plays stronger in populations of
other self-play agents, it plays worse against a population of human-imitation agents. Anthony
et al. (2020) used a self-play approach based on a modification of fictitious play in order to reduce
drift from human conventions. The resulting policy is stronger than pure imitation learning in both
1vs6 and 6vs1 settings but weaker than agents that use search. Most recently, Bakhtin et al. (2021)
combined one-ply search based on equilibrium computation with value iteration to produce an agent
called DORA. DORA achieved superhuman performance in a 2p0s version of Diplomacy without
human data, but in the full 7-player game plays poorly with agents other than itself.
Jacob et al. (2022) showed that regularizing inference-time search techniques can produce agents
that are not only strong but can also model human behaviour well. In the domain of no-press Diplo-
macy, they show that regularizing hedge (an equilibrium-finding algorithm) with a KL-divergence
penalty towards a human imitation learning policy can match or exceed the human action prediction
accuracy of imitation learning while being substantially stronger. KL-regularization toward human
behavioral policies has previously been proposed in various forms in single- and multi-agent RL
algorithms (Nair et al., 2018; Siegel et al., 2020; Nair et al., 2020), and was notably employed in
AlphaStar (Vinyals et al., 2019), but this has typically been used to improve sample efficiency and
aid exploration rather than to better model and coordinate with human play.
An alternative line of research has attempted to build human-compatible agents without relying
on human data (Hu et al., 2020; 2021; Strouse et al., 2021). These techniques have shown some
success in simplified settings but have not been shown to be competitive with humans in large-scale
collaborative environments.
2.1 MARKOV GAMES
In this work, we focus on multiplayer Markov games (Shapley, 1953).
Definition. An n-player Markov game is a tuple ⟨S, A_1, . . . , A_n, r_1, . . . , r_n, f⟩ where S is the
state space, A_i is the action space of player i (i = 1, . . . , n), r_i : S × A_1 × ··· × A_n → ℝ is the
reward function for player i, and f : S × A_1 × ··· × A_n → S is the transition function.
The goal of each player i is to choose a policy π_i(s) : S → A_i that maximizes the expected
reward for that player, given the policies of all other players. In the case of n = 1, a Markov game
reduces to a Markov Decision Process (MDP) in which an agent interacts with a fixed environment.

At each state s, each player i simultaneously chooses an action a_i from a set of actions A_i. We
denote the actions of all players other than i as a_{-i}. Players may also choose a probability distribution
over actions, where the probability of action a_i is denoted π_i(s, a_i) or σ_i(a_i) and the vector of
probabilities is denoted π_i(s) or π_i.
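As an illustration only, the tuple above can be written down directly as a container of callables; the class and field names below are ours and are not part of the formalism.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

State = Any
Action = Any

@dataclass
class MarkovGame:
    """An n-player Markov game <S, A_1, ..., A_n, r_1, ..., r_n, f>."""
    num_players: int
    action_spaces: Sequence[Sequence[Action]]   # A_i for each player i
    reward_fns: Sequence[Callable[..., float]]  # r_i(s, a_1, ..., a_n)
    transition_fn: Callable[..., State]         # f(s, a_1, ..., a_n) -> next state
```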
2.2 HEDGE
Hedge (Littlestone & Warmuth, 1994; Freund & Schapire, 1997) is an iterative algorithm that con-
verges to an equilibrium. We use variants of hedge for planning by using them to compute an
equilibrium policy on each turn of the game and then playing that policy.
Assume that after player i chooses an action a_i and all other players choose actions a_{-i}, player i
receives a reward of u_i(a_i, a_{-i}), where u_i will come from our RL-trained value function. We denote
the average reward in hindsight for action a_i up to iteration t as Q^t(a_i) = (1/t) Σ_{t' ≤ t} u_i(a_i, a_{-i}^{t'}).

On each iteration t of hedge, the policy π_i^t(a_i) is set according to π_i^t(a_i) ∝ exp(Q^{t−1}(a_i) / κ_{t−1}),
where κ_t is a temperature parameter.²

²We use κ_t rather than the η used in Jacob et al. (2022) in order to clean up notation; κ_t = 1/(η · t).
It is proven that if κ_t is set to 1/√t then as t → ∞ the average policy over all iterations converges to a
coarse correlated equilibrium, though in practice it often comes close to a Nash equilibrium as well.
In all experiments we set κ_t = 3S_t / (10√t) on iteration t, where S_t is the observed standard deviation of
the player's utility up to iteration t, based on a heuristic from Brown et al. (2017). A simpler choice
is to set κ_t = 0, which makes the algorithm equivalent to fictitious play (Brown, 1951).
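As a concrete illustration of the update above, the following self-contained sketch runs hedge in self-play on a small zero-sum matrix game, using the 1/√t temperature schedule; the rock-paper-scissors example is ours and the indexing of κ is slightly simplified (κ_t rather than κ_{t−1}).

```python
import numpy as np

def hedge_selfplay(payoffs, iters=20_000):
    """Both players of a 2-player zero-sum matrix game run hedge against each
    other; their average policies approach an equilibrium.

    payoffs[i, j]: row player's reward for row action i vs. column action j
    (the column player receives -payoffs[i, j]).
    """
    n_rows, n_cols = payoffs.shape
    Q_row, Q_col = np.zeros(n_rows), np.zeros(n_cols)
    avg_row, avg_col = np.zeros(n_rows), np.zeros(n_cols)

    def softmax(q, kappa):
        z = q / kappa
        p = np.exp(z - z.max())      # subtract max for numerical stability
        return p / p.sum()

    for t in range(1, iters + 1):
        kappa = 1.0 / np.sqrt(t)     # temperature schedule kappa_t = 1/sqrt(t)
        pi_row = softmax(Q_row, kappa)
        pi_col = softmax(Q_col, kappa)
        avg_row += pi_row
        avg_col += pi_col
        a_row = np.random.choice(n_rows, p=pi_row)
        a_col = np.random.choice(n_cols, p=pi_col)
        # Average reward in hindsight: Q^t = ((t-1) Q^{t-1} + u(., a^t_{-i})) / t
        Q_row += (payoffs[:, a_col] - Q_row) / t
        Q_col += (-payoffs[a_row, :] - Q_col) / t
    return avg_row / iters, avg_col / iters

# Rock-paper-scissors: the average policies converge toward uniform.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
print(hedge_selfplay(rps))
```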
Regret matching (RM) (Blackwell et al., 1956; Hart & Mas-Colell, 2000) is an alternative
equilibrium-finding algorithm that has similar theoretical guarantees to hedge and was used in previous
work on Diplomacy (Gray et al., 2020; Bakhtin et al., 2021). We do not use this algorithm
but we do evaluate baseline agents that use RM.
2.3 DORA: SELF-PLAY LEARNING IN MARKOV GAMES
Our approach draws significantly from DORA (Bakhtin et al., 2021), which we describe in more
detail here. In this approach, the authors run an algorithm that is similar to past model-based
reinforcement-learning methods such as AlphaZero (Silver et al., 2018), except in place of Monte
Carlo tree search, which is unsound in simultaneous-action games such as Diplomacy or other im-
perfect information games, it instead uses an equilibrium-finding algorithm such as hedge or RM
to iteratively approximate a Nash equilibrium for the current state (i.e., one-step lookahead search).
A deep neural net trained to predict the policy is used to sample plausible actions for all players to
reduce the large action space in Diplomacy down to a tractable subset for the equilibrium-finding
procedure, and a deep neural net trained to predict state values is used to evaluate the results of
joint actions sampled by this procedure. Beginning with a policy and value network randomly ini-
tialized from scratch, a large number of self-play games are played and the resulting equilibrium
policies and the improved 1-step value estimates computed on every turn from equilibrium-finding
are added to a replay buffer used for subsequently improving the policy and value. Additionally, a
double-oracle (McMahan et al., 2003) method was used to allow the policy to explore and discover
additional actions, and the same equilibrium-finding procedure was also used at test time.
For the core update step, Bakhtin et al. (2021) propose Deep Nash Value Iteration (DNVI), a value
iteration procedure similar to Nash Q-Learning (Hu & Wellman, 2003), which is a generalization
of Q-learning (Watkins, 1989) from MDPs to stochastic games. The idea of Nash-Q is to compute
equilibrium policies σ in a subgame whose actions correspond to the possible actions in the current
state and whose payoffs are defined using the current approximation of the value function. Bakhtin et al.
(2021) propose an equivalent update that uses a state value function V(s) instead of a state-action
value function Q(s, a):

V(s) ← (1 − α) V(s) + α ( r + γ Σ_{a'} σ(a') V(f(s, a')) )    (1)

where α is the learning rate, σ(·) is the probability of a joint action in the equilibrium, a' is a joint action,
and f is the transition function. For 2p0s games and certain other game classes, this algorithm converges
to a Nash equilibrium in the original stochastic game under the assumption that an exploration
policy is used such that each state is visited infinitely often.
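A tabular sketch of the update in Equation 1, assuming the equilibrium σ over a set of joint actions has already been computed for state s; all names here are illustrative.

```python
def nash_value_update(V, s, sigma, joint_actions, reward, transition,
                      alpha=0.1, gamma=1.0):
    """One tabular update of V(s) following Equation 1.

    V:            dict mapping states to current value estimates.
    sigma[k]:     equilibrium probability of joint_actions[k] in state s.
    reward:       the player's reward r in state s.
    transition:   callable f(s, a) returning the successor state.
    """
    expected_next = sum(p * V[transition(s, a)]
                        for p, a in zip(sigma, joint_actions))
    V[s] = (1 - alpha) * V[s] + alpha * (reward + gamma * expected_next)
    return V
```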
The tabular approach of Nash-Q does not scale to large games such as Diplomacy. DNVI replaces
the explicit value function table and the update rule in Equation 1 with a value function parameterized by
a neural network, V(s; θ_v), and uses gradient descent to update it using the following loss:

ValueLoss(θ_v) = (1/2) ( V(s; θ_v) − r(s) − γ Σ_{a'} σ(a') V(f(s, a'); θ̂_v) )²    (2)
The summation in Equation 2 is not feasible in games with large action spaces, as the number of joint
actions grows exponentially with the number of players. Bakhtin et al. (2021) address this issue by
considering only a subset of actions at each step. An auxiliary function, a policy proposal network
π_i(s, a_i; θ_π), models the probability that an action a_i of player i is in the support of the equilibrium
σ. Only the top-k sampled actions from this distribution are considered when solving for the
equilibrium policy σ and computing the above value loss. Once the equilibrium is computed, the
equilibrium policy is also used to further train the policy proposal network using a cross-entropy loss:
PolicyLoss(θ_π) = −Σ_i Σ_{a_i ∈ A_i} σ_i(a_i) log π_i(s, a_i; θ_π).    (3)
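A minimal PyTorch-style sketch of the two losses over a sampled subset of joint actions; the network interfaces, tensor shapes, single-player restriction of the policy loss, and renormalization over the sampled subset are our simplifications, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dnvi_losses(value_net, target_value_net, state, next_states, sigma_joint,
                reward, sigma_i, proposal_logits_i, gamma=1.0):
    """Sketch of Equations 2 and 3 restricted to sampled actions.

    next_states:       [K, ...] encodings of f(s, a') for K sampled joint actions a'.
    sigma_joint:       [K] equilibrium probabilities of those joint actions.
    sigma_i:           [k] equilibrium probabilities of player i's sampled actions.
    proposal_logits_i: [k] policy-proposal logits for those same actions.
    """
    with torch.no_grad():  # the target parameters (theta_v hat) are held fixed
        next_values = target_value_net(next_states).squeeze(-1)       # [K]
        target = reward + gamma * (sigma_joint * next_values).sum()
    value_loss = 0.5 * (value_net(state).squeeze(-1) - target) ** 2   # Equation 2
    log_probs = F.log_softmax(proposal_logits_i, dim=-1)              # renormalized over the subset
    policy_loss = -(sigma_i * log_probs).sum()                        # Equation 3, one player
    return value_loss, policy_loss
```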
Bakhtin et al. (2021) report that the resulting agent DORA does very well when playing with other
copies of itself. However, DORA performs poorly in games with six human-like agents.
2.4 PIKL: MODELING HUMANS WITH IMITATION-ANCHORED PLANNING
Behavioral cloning (BC) is the standard approach for modeling human behaviors given data. Behav-
ioral cloning learns a policy that maximizes the likelihood of the human data by gradient descent on
a cross-entropy loss. However, as observed and discussed in Jacob et al. (2022), BC often falls short
of accurately modeling or matching human-level performance, with BC models underperforming the
human players they are trained to imitate in games such as Chess, Go, and Diplomacy. Intuitively,
it might seem that initializing self-play with an imitation-learned policy would result in an agent
that is both strong and human-like. Indeed, Bakhtin et al. (2021) showed improved performance
against human-like agents when initializing the DORA training procedure from a human imitation
policy and value, rather than starting from scratch. However, we show in subsection 5.3 that such an
approach still results in policies that deviate from human-compatible equilibria.
Jacob et al. (2022) found that an effective solution was to perform search with a regularization
penalty proportional to the KL divergence from a human imitation policy. This algorithm is referred
to as piKL. The form of piKL we focus on in this paper is a variant of hedge called piKL-hedge, in
which each player i seeks to maximize expected reward while at the same time playing “close” to
a fixed anchor policy τ_i. The two goals can be reconciled by defining a composite utility function
that adds a penalty based on the “distance” between the player policy and their anchor policy, with
coefficient λ_i ∈ [0, ∞) scaling the penalty.
For each player i, we define i's utility as a function of the agent policy π_i ∈ Δ(A_i), given the policies
π_{-i} of all other agents:

ũ_{i,λ_i}(π_i, π_{-i}) := u_i(π_i, π_{-i}) − λ_i D_KL(π_i ‖ τ_i)    (4)
Algorithm 1: DiL-piKL (for Player i)
Data: A_i, the set of actions for Player i; u_i, the reward function for Player i; Λ_i, a set of λ values to consider for Player i; β_i, a belief distribution over λ values for Player i.

1  function INITIALIZE()
2    t ← 0
3    for each action a_i ∈ A_i do
4      Q_i^0(a_i) ← 0

5  function PLAY()
6    t ← t + 1
7    sample λ ∼ β_i
8    let π_{i,λ}^t be the policy such that π_{i,λ}^t(a_i) ∝ exp( (Q^{t−1}(a_i) + λ log τ_i(a_i)) / (κ_{t−1} + λ) )
9    sample an action a_i^t ∼ π_{i,λ}^t
10   play a_i^t ∈ A_i and observe the actions a_{-i}^t played by the opponents
11   for each a_i ∈ A_i do
12     Q^t(a_i) ← ((t−1)/t) Q^{t−1}(a_i) + (1/t) u_i(a_i, a_{-i}^t)

Figure 1: The DiL-piKL algorithm. The highlighted lines (sampling λ ∼ β_i on each iteration and using the sampled λ in the policy update) show the main differences between this algorithm and the piKL-hedge algorithm proposed in Jacob et al. (2022).
Figure 2: λ_pop represents the common-knowledge belief about the λ parameter or distribution used by all players. λ_agent represents the λ value actually used by the agent to determine its policy. By having λ_agent differ from λ_pop, DiL-piKL interpolates between an equilibrium under the utility function u_i, behavioral cloning, and a best response to behavioral cloning policies. piKL assumed a common λ, which moved it along one axis of the space. Our agent models and coordinates with high-λ players while playing a lower λ itself.
This results in a modification of hedge such that on each iteration t, π_i^t(a_i) is set according to

π_i^t(a_i) ∝ exp( (Q^{t−1}(a_i) + λ log τ_i(a_i)) / (κ_{t−1} + λ) )    (5)

When λ_i is large, the utility function is dominated by the KL-divergence term λ_i D_KL(π_i ‖ τ_i), and
so the agent will naturally tend to play a policy π_i close to the anchor policy τ_i. When λ_i is small, the
dominating term is the reward u_i(π_i, π_{-i}), and so the agent will tend to maximize reward without
as closely matching the anchor policy τ_i.
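For a single player, the regularized update in Equation 5 can be computed directly from the hindsight averages Q and the anchor policy τ; a minimal sketch follows, with array names of our choosing.

```python
import numpy as np

def pikl_hedge_policy(Q, anchor, lam, kappa):
    """Compute pi_i^t(a_i) from Equation 5.

    Q:      hindsight average rewards Q^{t-1}(a_i), one entry per action.
    anchor: anchor policy tau_i over the same actions.
    lam:    regularization strength lambda (large -> stay near the anchor).
    kappa:  hedge temperature kappa_{t-1}.
    """
    logits = (Q + lam * np.log(anchor + 1e-40)) / (kappa + lam)  # small constant avoids log(0)
    policy = np.exp(logits - logits.max())                       # subtract max for stability
    return policy / policy.sum()
```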
3 DISTRIBUTIONAL LAMBDA PIKL (DIL-PIKL)
piKL trades off between the strength of the agent and the closeness to the anchor policy using a
single fixed λ parameter. In practice, we find that sampling λ from a probability distribution on each
iteration produces better performance. In this section, we introduce distributional lambda piKL
(DiL-piKL), which replaces the single λ parameter in piKL with a probability distribution β over λ
values. On each iteration, each player i samples a λ value from β_i and then chooses a policy based
on Equation 5 using that sampled λ. Figure 1 highlights the difference between piKL and DiL-piKL.
One interpretation of DiL-piKL is that each choice of λ is an agent type, where agent types with high
λ choose policies closer to τ while agent types with low λ choose policies that are more “optimal”
and less constrained to a common-knowledge anchor policy. A priori, each player is randomly sampled
from this population of agent types, and the distribution β_i represents the common-knowledge
uncertainty about which of the agent types player i may be. Another interpretation is that piKL
assumed an exponential relation between action EV and likelihood, whereas DiL-piKL results in a
fatter-tailed distribution that may more robustly model different playing styles or game situations.
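Putting the pieces together, a compact Python rendering of Algorithm 1 for a single player follows; the λ grid, belief distribution, environment callbacks, and output choice are placeholders, and the temperature indexing is simplified relative to the paper.

```python
import numpy as np

def dil_pikl_player(actions, anchor, lambdas, beliefs, utility_fn,
                    observe_opponents, iters=1000):
    """One player's DiL-piKL loop, sketching Algorithm 1.

    actions:          list of candidate actions A_i for player i.
    anchor:           anchor policy tau_i over `actions` (array of probabilities).
    lambdas, beliefs: support and probabilities of the belief distribution beta_i.
    utility_fn(a, a_minus_i): reward u_i for playing a against joint action a_minus_i.
    observe_opponents(a):     plays a and returns the opponents' joint action.
    """
    anchor = np.asarray(anchor, dtype=float)
    Q = np.zeros(len(actions))
    avg_policy = np.zeros(len(actions))
    utilities = []                                            # realized utilities, for the kappa heuristic
    for t in range(1, iters + 1):
        lam = np.random.choice(lambdas, p=beliefs)            # sample lambda ~ beta_i
        s_t = np.std(utilities) if len(utilities) > 1 else 1.0
        kappa = max(3.0 * s_t / (10.0 * np.sqrt(t)), 1e-6)    # heuristic schedule from Section 2.2
        logits = (Q + lam * np.log(anchor + 1e-40)) / (kappa + lam)
        policy = np.exp(logits - logits.max())
        policy /= policy.sum()                                # Equation 5
        avg_policy += policy
        idx = np.random.choice(len(actions), p=policy)        # sample and play a_i
        a_minus_i = observe_opponents(actions[idx])
        rewards = np.array([utility_fn(a, a_minus_i) for a in actions])
        utilities.append(rewards[idx])
        Q += (rewards - Q) / t                                # hindsight averages Q^t(a_i)
    return avg_policy / iters                                 # one common choice of output policy
```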