MASTERING THE GAME OF NO-PRESS
DIPLOMACY VIA HUMAN-REGULARIZED
REINFORCEMENT LEARNING AND PLANNING
Anton Bakhtin
Meta AI
David J Wu
Meta AI
Adam Lerer
Meta AI
Jonathan Gray
Meta AI
Athul Paul Jacob
MIT
Gabriele Farina
Meta AI
Alexander H Miller
Meta AI
Noam Brown
Meta AI
ABSTRACT
No-press Diplomacy is a complex strategy game involving both cooperation and
competition that has served as a benchmark for multi-agent AI research. While
self-play reinforcement learning has resulted in numerous successes in purely
adversarial games like chess, Go, and poker, self-play alone is insufficient for
achieving optimal performance in domains involving cooperation with humans.
We address this shortcoming by first introducing a planning algorithm we call
DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-
learned policy. We prove that this is a no-regret learning algorithm under a modi-
fied utility function. We then show that DiL-piKL can be extended into a self-play
reinforcement learning algorithm we call RL-DiL-piKL that provides a model of
human play while simultaneously training an agent that responds well to this hu-
man model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a
200-game no-press Diplomacy tournament involving 62 human participants span-
ning skill levels from beginner to expert, two Diplodocus agents both achieved a
higher average score than all other participants who played more than two games,
and ranked first and third according to an Elo ratings model.
1 INTRODUCTION
In two-player zero-sum (2p0s) settings, principled self-play algorithms converge to a minimax equi-
librium, which in a balanced game ensures that a player will not lose in expectation regardless of
the opponent’s strategy (Neumann, 1928). This fact has allowed self-play, even without human
data, to achieve remarkable success in 2p0s games like chess (Silver et al., 2018), Go (Silver et al.,
2017), poker (Bowling et al., 2015; Brown & Sandholm, 2017), and Dota 2 (Berner et al., 2019).1
In principle, any finite 2p0s game can be solved via self-play given sufficient compute and memory.
However, in games involving cooperation, self-play alone no longer guarantees good performance
when playing with humans, even with infinite compute and memory. This is because in complex
domains there may be arbitrarily many conventions and expectations for how to cooperate, of which
humans may use only a small subset (Lerer & Peysakhovich, 2019). The clearest example of this
is language. A self-play agent trained from scratch without human data in a cooperative game in-
volving free-form communication channels would almost certainly not converge to using English
as the medium of communication. Obviously, such an agent would perform poorly when paired
with a human English speaker. Indeed, prior work has shown that naïve extensions of self-play from
scratch without human data perform poorly when playing with humans or human-like agents even in
dialogue-free domains that involve cooperation rather than just competition, such as the benchmark
games no-press Diplomacy (Bakhtin et al., 2021) and Hanabi (Siu et al., 2021; Cui et al., 2021).
Equal first author contribution.
1Dota 2 is a two-team zero-sum game, but the presence of full information sharing between teammates
makes it equivalent to 2p0s. Beyond 2p0s settings, self-play algorithms have also proven successful in highly
adversarial games like six-player poker (Brown & Sandholm, 2019).
Recently, Jacob et al. (2022) introduced piKL, which models human behavior in many games better
than pure behavioral cloning (BC) on human data by regularizing inference-time planning toward a
BC policy. In this work, we introduce an extension of piKL, called DiL-piKL, that replaces piKL's
single fixed regularization parameter λ with a probability distribution over λ parameters. We then
show how DiL-piKL can be combined with self-play reinforcement learning, allowing us to train a
strong agent that performs well with humans. We call this algorithm RL-DiL-piKL.
Using RL-DiL-piKL we trained an agent, Diplodocus, to play no-press Diplomacy, a difficult bench-
mark for multi-agent AI that has been actively studied in recent years (Paquette et al., 2019; Anthony
et al., 2020; Gray et al., 2020; Bakhtin et al., 2021; Jacob et al., 2022). We conducted a 200-game
no-press Diplomacy tournament with a diverse pool of human players, including expert humans, in
which we tested two versions of Diplodocus using different RL-DiL-piKL settings, and other base-
line agents. All games consisted of one bot and six humans, with all players being anonymous for
the duration of the game. These two versions of Diplodocus achieved the top two average scores
in the tournament among all 48 participants who played more than two games, and ranked first and
third overall among all participants according to an Elo ratings model.
2 BACKGROUND AND PRIOR WORK
Diplomacy is a benchmark 7-player mixed cooperative/competitive game featuring simultaneous
moves and a heavy emphasis on negotiation and coordination. In the no-press variant of the game,
there is no cheap talk communication. Instead, players only implicitly communicate through moves.
In the game, seven players compete for majority control of 34 “supply centers” (SCs) on a map.
On each turn, players simultaneously choose actions consisting of an order for each of their units to
hold, move, support or convoy another unit. If no player controls a majority of SCs and all remaining
players agree to a draw or a turn limit is reached then the game ends in a draw. In this case, we use
a common scoring system in which the score of player i is C_i^2 / Σ_{i'} C_{i'}^2, where C_i is the number of
SCs player i owns. A more detailed description of the rules is provided in Appendix B.
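To make the scoring rule concrete, here is a minimal Python sketch; the function name and the example center counts are ours, not from the paper.

```python
import numpy as np

def sum_of_squares_scores(center_counts):
    """Score a drawn game under the sum-of-squares system.

    center_counts: supply-center counts C_i for the seven players.
    Returns each player's score C_i^2 / sum_j C_j^2.
    """
    counts = np.asarray(center_counts, dtype=float)
    squared = counts ** 2
    return squared / squared.sum()

# Example: a three-way draw with 12, 11, and 11 centers (other players eliminated).
print(sum_of_squares_scores([12, 11, 11, 0, 0, 0, 0]))
```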
Most recent successes in no-press Diplomacy use deep learning to imitate human behavior given a
corpus of human games. The first Diplomacy agent to leverage deep imitation learning was Paquette
et al. (2019). Subsequent work on no-press Diplomacy has mostly relied on a similar architecture
with some modeling improvements (Gray et al., 2020; Anthony et al., 2020; Bakhtin et al., 2021).
Gray et al. (2020) proposed an agent that plays an improved policy via one-ply search. It uses policy
and value functions trained on human data to conduct search using regret minimization.
Several works explored applying self-play to compute improved policies. Paquette et al. (2019)
applied an actor-critic approach and found that while the agent plays stronger in populations of
other self-play agents, it plays worse against a population of human-imitation agents. Anthony
et al. (2020) used a self-play approach based on a modification of fictitious play in order to reduce
drift from human conventions. The resulting policy is stronger than pure imitation learning in both
1vs6 and 6vs1 settings but weaker than agents that use search. Most recently, Bakhtin et al. (2021)
combined one-ply search based on equilibrium computation with value iteration to produce an agent
called DORA. DORA achieved superhuman performance in a 2p0s version of Diplomacy without
human data, but in the full 7-player game plays poorly with agents other than itself.
Jacob et al. (2022) showed that regularizing inference-time search techniques can produce agents
that are not only strong but can also model human behaviour well. In the domain of no-press Diplo-
macy, they show that regularizing hedge (an equilibrium-finding algorithm) with a KL-divergence
penalty towards a human imitation learning policy can match or exceed the human action prediction
accuracy of imitation learning while being substantially stronger. KL-regularization toward human
behavioral policies has previously been proposed in various forms in single- and multi-agent RL
algorithms (Nair et al., 2018; Siegel et al., 2020; Nair et al., 2020), and was notably employed in
AlphaStar (Vinyals et al., 2019), but this has typically been used to improve sample efficiency and
aid exploration rather than to better model and coordinate with human play.
An alternative line of research has attempted to build human-compatible agents without relying
on human data (Hu et al., 2020; 2021; Strouse et al., 2021). These techniques have shown some
success in simplified settings but have not been shown to be competitive with humans in large-scale
collaborative environments.
2.1 MARKOV GAMES
In this work, we focus on multiplayer Markov games (Shapley, 1953).
Definition. An n-player Markov game is a tuple ⟨S, A_1, . . . , A_n, r_1, . . . , r_n, f⟩ where S is the
state space, A_i is the action space of player i (i = 1, . . . , n), r_i : S × A_1 × ··· × A_n → ℝ is the
reward function for player i, and f : S × A_1 × ··· × A_n → S is the transition function.
The goal of each player i is to choose a policy π_i(s) : S → A_i that maximizes the expected
reward for that player, given the policies of all other players. In the case of n = 1, a Markov game
reduces to a Markov Decision Process (MDP) in which an agent interacts with a fixed environment.

At each state s, each player i simultaneously chooses an action a_i from a set of actions A_i. We
denote the actions of all players other than i as a_{-i}. Players may also choose a probability distribution
over actions, where the probability of action a_i is denoted π_i(s, a_i) or σ_i(a_i) and the vector of
probabilities is denoted π_i(s) or π_i.
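As an illustration only, the tuple above can be written down directly as a container of callables; the class and field names below are ours and are not part of the formalism.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

State = Any
Action = Any

@dataclass
class MarkovGame:
    """An n-player Markov game <S, A_1, ..., A_n, r_1, ..., r_n, f>."""
    num_players: int
    action_spaces: Sequence[Sequence[Action]]   # A_i for each player i
    reward_fns: Sequence[Callable[..., float]]  # r_i(s, a_1, ..., a_n)
    transition_fn: Callable[..., State]         # f(s, a_1, ..., a_n) -> next state
```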
2.2 HEDGE
Hedge (Littlestone & Warmuth, 1994; Freund & Schapire, 1997) is an iterative algorithm that con-
verges to an equilibrium. We use variants of hedge for planning by using them to compute an
equilibrium policy on each turn of the game and then playing that policy.
Assume that after player i chooses an action a_i and all other players choose actions a_{-i}, player i
receives a reward of u_i(a_i, a_{-i}), where u_i will come from our RL-trained value function. We denote
the average reward in hindsight for action a_i up to iteration t as Q^t(a_i) = (1/t) Σ_{t' ≤ t} u_i(a_i, a_{-i}^{t'}).

On each iteration t of hedge, the policy π_i^t(a_i) is set according to π_i^t(a_i) ∝ exp(Q^{t−1}(a_i) / κ_{t−1}),
where κ_t is a temperature parameter.²

²We use κ_t rather than the η used in Jacob et al. (2022) in order to clean up notation; κ_t = 1/(η · t).
It is proven that if κ_t is set to 1/√t then as t → ∞ the average policy over all iterations converges to a
coarse correlated equilibrium, though in practice it often comes close to a Nash equilibrium as well.
In all experiments we set κ_t = 3S_t / (10√t) on iteration t, where S_t is the observed standard deviation of
the player's utility up to iteration t, based on a heuristic from Brown et al. (2017). A simpler choice
is to set κ_t = 0, which makes the algorithm equivalent to fictitious play (Brown, 1951).
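As a concrete illustration of the update above, the following self-contained sketch runs hedge in self-play on a small zero-sum matrix game, using the 1/√t temperature schedule; the rock-paper-scissors example is ours and the indexing of κ is slightly simplified (κ_t rather than κ_{t−1}).

```python
import numpy as np

def hedge_selfplay(payoffs, iters=20_000):
    """Both players of a 2-player zero-sum matrix game run hedge against each
    other; their average policies approach an equilibrium.

    payoffs[i, j]: row player's reward for row action i vs. column action j
    (the column player receives -payoffs[i, j]).
    """
    n_rows, n_cols = payoffs.shape
    Q_row, Q_col = np.zeros(n_rows), np.zeros(n_cols)
    avg_row, avg_col = np.zeros(n_rows), np.zeros(n_cols)

    def softmax(q, kappa):
        z = q / kappa
        p = np.exp(z - z.max())      # subtract max for numerical stability
        return p / p.sum()

    for t in range(1, iters + 1):
        kappa = 1.0 / np.sqrt(t)     # temperature schedule kappa_t = 1/sqrt(t)
        pi_row = softmax(Q_row, kappa)
        pi_col = softmax(Q_col, kappa)
        avg_row += pi_row
        avg_col += pi_col
        a_row = np.random.choice(n_rows, p=pi_row)
        a_col = np.random.choice(n_cols, p=pi_col)
        # Average reward in hindsight: Q^t = ((t-1) Q^{t-1} + u(., a^t_{-i})) / t
        Q_row += (payoffs[:, a_col] - Q_row) / t
        Q_col += (-payoffs[a_row, :] - Q_col) / t
    return avg_row / iters, avg_col / iters

# Rock-paper-scissors: the average policies converge toward uniform.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
print(hedge_selfplay(rps))
```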
Regret matching (RM) (Blackwell et al., 1956; Hart & Mas-Colell, 2000) is an alternative
equilibrium-finding algorithm that has similar theoretical guarantees to hedge and was used in previous
work on Diplomacy (Gray et al., 2020; Bakhtin et al., 2021). We do not use this algorithm
but we do evaluate baseline agents that use RM.
2.3 DORA: SELF-PLAY LEARNING IN MARKOV GAMES
Our approach draws significantly from DORA (Bakhtin et al., 2021), which we describe in more
detail here. In this approach, the authors run an algorithm that is similar to past model-based
reinforcement-learning methods such as AlphaZero (Silver et al., 2018), except in place of Monte
Carlo tree search, which is unsound in simultaneous-action games such as Diplomacy or other im-
perfect information games, it instead uses an equilibrium-finding algorithm such as hedge or RM
to iteratively approximate a Nash equilibrium for the current state (i.e., one-step lookahead search).
A deep neural net trained to predict the policy is used to sample plausible actions for all players to
reduce the large action space in Diplomacy down to a tractable subset for the equilibrium-finding
procedure, and a deep neural net trained to predict state values is used to evaluate the results of
joint actions sampled by this procedure. Beginning with a policy and value network randomly ini-
tialized from scratch, a large number of self-play games are played and the resulting equilibrium
policies and the improved 1-step value estimates computed on every turn from equilibrium-finding
are added to a replay buffer used for subsequently improving the policy and value. Additionally, a
double-oracle (McMahan et al., 2003) method was used to allow the policy to explore and discover
additional actions, and the same equilibrium-finding procedure was also used at test time.
For the core update step, Bakhtin et al. (2021) propose Deep Nash Value Iteration (DNVI), a value
iteration procedure similar to Nash Q-Learning (Hu & Wellman, 2003), which is a generalization
of Q-learning (Watkins, 1989) from MDPs to stochastic games. The idea of Nash-Q is to compute
equilibrium policies σ in a subgame whose actions correspond to the possible actions in the current
state and whose payoffs are defined using the current approximation of the value function. Bakhtin et al.
(2021) propose an equivalent update that uses a state value function V(s) instead of a state-action
value function Q(s, a):

V(s) ← (1 − α) V(s) + α ( r + γ Σ_{a'} σ(a') V(f(s, a')) )    (1)

where α is the learning rate, σ(·) is the probability of a joint action in the equilibrium, a' is a joint action,
and f is the transition function. For 2p0s games and certain other game classes, this algorithm converges
to a Nash equilibrium in the original stochastic game under the assumption that an exploration
policy is used such that each state is visited infinitely often.
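A tabular sketch of the update in Equation 1, assuming the equilibrium σ over a set of joint actions has already been computed for state s; all names here are illustrative.

```python
def nash_value_update(V, s, sigma, joint_actions, reward, transition,
                      alpha=0.1, gamma=1.0):
    """One tabular update of V(s) following Equation 1.

    V:            dict mapping states to current value estimates.
    sigma[k]:     equilibrium probability of joint_actions[k] in state s.
    reward:       the player's reward r in state s.
    transition:   callable f(s, a) returning the successor state.
    """
    expected_next = sum(p * V[transition(s, a)]
                        for p, a in zip(sigma, joint_actions))
    V[s] = (1 - alpha) * V[s] + alpha * (reward + gamma * expected_next)
    return V
```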
The tabular approach of Nash-Q does not scale to large games such as Diplomacy. DNVI replaces
the explicit value function table and the update rule in Equation 1 with a value function parameterized by
a neural network, V(s; θ_v), and uses gradient descent to update it using the following loss:

ValueLoss(θ_v) = (1/2) ( V(s; θ_v) − r(s) − γ Σ_{a'} σ(a') V(f(s, a'); θ̂_v) )²    (2)
The summation in Equation 2 is not feasible in games with large action spaces, as the number of joint
actions grows exponentially with the number of players. Bakhtin et al. (2021) address this issue by
considering only a subset of actions at each step. An auxiliary function, a policy proposal network
π_i(s, a_i; θ_π), models the probability that an action a_i of player i is in the support of the equilibrium
σ. Only the top-k sampled actions from this distribution are considered when solving for the
equilibrium policy σ and computing the above value loss. Once the equilibrium is computed, the
equilibrium policy is also used to further train the policy proposal network using a cross-entropy loss:
PolicyLoss(θ_π) = −Σ_i Σ_{a_i ∈ A_i} σ_i(a_i) log π_i(s, a_i; θ_π).    (3)
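A minimal PyTorch-style sketch of the two losses over a sampled subset of joint actions; the network interfaces, tensor shapes, single-player restriction of the policy loss, and renormalization over the sampled subset are our simplifications, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dnvi_losses(value_net, target_value_net, state, next_states, sigma_joint,
                reward, sigma_i, proposal_logits_i, gamma=1.0):
    """Sketch of Equations 2 and 3 restricted to sampled actions.

    next_states:       [K, ...] encodings of f(s, a') for K sampled joint actions a'.
    sigma_joint:       [K] equilibrium probabilities of those joint actions.
    sigma_i:           [k] equilibrium probabilities of player i's sampled actions.
    proposal_logits_i: [k] policy-proposal logits for those same actions.
    """
    with torch.no_grad():  # the target parameters (theta_v hat) are held fixed
        next_values = target_value_net(next_states).squeeze(-1)       # [K]
        target = reward + gamma * (sigma_joint * next_values).sum()
    value_loss = 0.5 * (value_net(state).squeeze(-1) - target) ** 2   # Equation 2
    log_probs = F.log_softmax(proposal_logits_i, dim=-1)              # renormalized over the subset
    policy_loss = -(sigma_i * log_probs).sum()                        # Equation 3, one player
    return value_loss, policy_loss
```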
Bakhtin et al. (2021) report that the resulting agent DORA does very well when playing with other
copies of itself. However, DORA performs poorly in games with six human-like agents.
2.4 PIKL: MODELING HUMANS WITH IMITATION-ANCHORED PLANNING
Behavioral cloning (BC) is the standard approach for modeling human behaviors given data. Behav-
ioral cloning learns a policy that maximizes the likelihood of the human data by gradient descent on
a cross-entropy loss. However, as observed and discussed in Jacob et al. (2022), BC often falls short
of accurately modeling or matching human-level performance, with BC models underperforming the
human players they are trained to imitate in games such as Chess, Go, and Diplomacy. Intuitively,
it might seem that initializing self-play with an imitation-learned policy would result in an agent
that is both strong and human-like. Indeed, Bakhtin et al. (2021) showed improved performance
against human-like agents when initializing the DORA training procedure from a human imitation
policy and value, rather than starting from scratch. However, we show in subsection 5.3 that such an
approach still results in policies that deviate from human-compatible equilibria.
Jacob et al. (2022) found that an effective solution was to perform search with a regularization
penalty proportional to the KL divergence from a human imitation policy. This algorithm is referred
to as piKL. The form of piKL we focus on in this paper is a variant of hedge called piKL-hedge, in
which each player i seeks to maximize expected reward while at the same time playing “close” to
a fixed anchor policy τ_i. The two goals can be reconciled by defining a composite utility function
that adds a penalty based on the “distance” between the player policy and their anchor policy, with
coefficient λ_i ∈ [0, ∞) scaling the penalty.
For each player i, we define i's utility as a function of the agent policy π_i ∈ Δ(A_i), given the policies
π_{-i} of all other agents:

ũ_{i,λ_i}(π_i, π_{-i}) := u_i(π_i, π_{-i}) − λ_i D_KL(π_i ‖ τ_i)    (4)
Algorithm 1: DiL-piKL (for Player i)
Data: A_i, the set of actions for Player i; u_i, the reward function for Player i; Λ_i, a set of λ values to consider for Player i; β_i, a belief distribution over λ values for Player i.

1  function INITIALIZE()
2    t ← 0
3    for each action a_i ∈ A_i do
4      Q_i^0(a_i) ← 0

5  function PLAY()
6    t ← t + 1
7    sample λ ∼ β_i
8    let π_{i,λ}^t be the policy such that π_{i,λ}^t(a_i) ∝ exp( (Q^{t−1}(a_i) + λ log τ_i(a_i)) / (κ_{t−1} + λ) )
9    sample an action a_i^t ∼ π_{i,λ}^t
10   play a_i^t ∈ A_i and observe the actions a_{-i}^t played by the opponents
11   for each a_i ∈ A_i do
12     Q^t(a_i) ← ((t−1)/t) Q^{t−1}(a_i) + (1/t) u_i(a_i, a_{-i}^t)

Figure 1: The DiL-piKL algorithm. The highlighted lines (sampling λ ∼ β_i on each iteration and using the sampled λ in the policy update) show the main differences between this algorithm and the piKL-hedge algorithm proposed in Jacob et al. (2022).
Figure 2: λ_pop represents the common-knowledge belief about the λ parameter or distribution used by all players. λ_agent represents the λ value actually used by the agent to determine its policy. By having λ_agent differ from λ_pop, DiL-piKL interpolates between an equilibrium under the utility function u_i, behavioral cloning, and a best response to behavioral cloning policies. piKL assumed a common λ, which moved it along one axis of the space. Our agent models and coordinates with high-λ players while playing a lower λ itself.
This results in a modification of hedge such that on each iteration t, π_i^t(a_i) is set according to

π_i^t(a_i) ∝ exp( (Q^{t−1}(a_i) + λ log τ_i(a_i)) / (κ_{t−1} + λ) )    (5)

When λ_i is large, the utility function is dominated by the KL-divergence term λ_i D_KL(π_i ‖ τ_i), and
so the agent will naturally tend to play a policy π_i close to the anchor policy τ_i. When λ_i is small, the
dominating term is the reward u_i(π_i, π_{-i}), and so the agent will tend to maximize reward without
as closely matching the anchor policy τ_i.
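For a single player, the regularized update in Equation 5 can be computed directly from the hindsight averages Q and the anchor policy τ; a minimal sketch follows, with array names of our choosing.

```python
import numpy as np

def pikl_hedge_policy(Q, anchor, lam, kappa):
    """Compute pi_i^t(a_i) from Equation 5.

    Q:      hindsight average rewards Q^{t-1}(a_i), one entry per action.
    anchor: anchor policy tau_i over the same actions.
    lam:    regularization strength lambda (large -> stay near the anchor).
    kappa:  hedge temperature kappa_{t-1}.
    """
    logits = (Q + lam * np.log(anchor + 1e-40)) / (kappa + lam)  # small constant avoids log(0)
    policy = np.exp(logits - logits.max())                       # subtract max for stability
    return policy / policy.sum()
```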
3 DISTRIBUTIONAL LAMBDA PIKL (DIL-PIKL)
piKL trades off between the strength of the agent and the closeness to the anchor policy using a
single fixed λ parameter. In practice, we find that sampling λ from a probability distribution on each
iteration produces better performance. In this section, we introduce distributional lambda piKL
(DiL-piKL), which replaces the single λ parameter in piKL with a probability distribution β over λ
values. On each iteration, each player i samples a λ value from β_i and then chooses a policy based
on Equation 5 using that sampled λ. Figure 1 highlights the difference between piKL and DiL-piKL.
One interpretation of DiL-piKL is that each choice of λ is an agent type, where agent types with high
λ choose policies closer to τ while agent types with low λ choose policies that are more “optimal”
and less constrained to a common-knowledge anchor policy. A priori, each player is randomly sampled
from this population of agent types, and the distribution β_i represents the common-knowledge
uncertainty about which of the agent types player i may be. Another interpretation is that piKL
assumed an exponential relation between action EV and likelihood, whereas DiL-piKL results in a
fatter-tailed distribution that may more robustly model different playing styles or game situations.
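Putting the pieces together, a compact Python rendering of Algorithm 1 for a single player follows; the λ grid, belief distribution, environment callbacks, and output choice are placeholders, and the temperature indexing is simplified relative to the paper.

```python
import numpy as np

def dil_pikl_player(actions, anchor, lambdas, beliefs, utility_fn,
                    observe_opponents, iters=1000):
    """One player's DiL-piKL loop, sketching Algorithm 1.

    actions:          list of candidate actions A_i for player i.
    anchor:           anchor policy tau_i over `actions` (array of probabilities).
    lambdas, beliefs: support and probabilities of the belief distribution beta_i.
    utility_fn(a, a_minus_i): reward u_i for playing a against joint action a_minus_i.
    observe_opponents(a):     plays a and returns the opponents' joint action.
    """
    anchor = np.asarray(anchor, dtype=float)
    Q = np.zeros(len(actions))
    avg_policy = np.zeros(len(actions))
    utilities = []                                            # realized utilities, for the kappa heuristic
    for t in range(1, iters + 1):
        lam = np.random.choice(lambdas, p=beliefs)            # sample lambda ~ beta_i
        s_t = np.std(utilities) if len(utilities) > 1 else 1.0
        kappa = max(3.0 * s_t / (10.0 * np.sqrt(t)), 1e-6)    # heuristic schedule from Section 2.2
        logits = (Q + lam * np.log(anchor + 1e-40)) / (kappa + lam)
        policy = np.exp(logits - logits.max())
        policy /= policy.sum()                                # Equation 5
        avg_policy += policy
        idx = np.random.choice(len(actions), p=policy)        # sample and play a_i
        a_minus_i = observe_opponents(actions[idx])
        rewards = np.array([utility_fn(a, a_minus_i) for a in actions])
        utilities.append(rewards[idx])
        Q += (rewards - Q) / t                                # hindsight averages Q^t(a_i)
    return avg_policy / iters                                 # one common choice of output policy
```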