
Recently, Jacob et al. (2022) introduced piKL, which models human behavior in many games better
than pure behavioral cloning (BC) on human data by regularizing inference-time planning toward a
BC policy. In this work, we introduce an extension of piKL, called DiL-piKL, that replaces piKL’s
single fixed regularization parameter λ with a probability distribution over λ parameters. We then
show how DiL-piKL can be combined with self-play reinforcement learning, allowing us to train a
strong agent that performs well with humans. We call this algorithm RL-DiL-piKL.
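To make this concrete, the sketch below illustrates the core idea in simplified form: each candidate λ induces its own regularized policy that interpolates between the BC anchor and a greedy response to the action values, and the distribution over λ yields a mixture of such policies. The anchored-softmax update used here is a generic simplification and is not claimed to match the exact DiL-piKL update; all names and numbers are illustrative.

import numpy as np

def anchored_policy(q_values, anchor, lam, temperature=1.0):
    # Policy proportional to anchor^(lam/(lam+T)) * exp(q/(lam+T)):
    # a large lam stays close to the anchor (BC) policy, a small lam
    # moves toward the highest-value actions.
    logits = (lam * np.log(anchor + 1e-32) + q_values) / (lam + temperature)
    logits -= logits.max()  # numerical stability
    policy = np.exp(logits)
    return policy / policy.sum()

def lambda_mixture_policy(q_values, anchor, lambdas, lambda_probs):
    # DiL-piKL idea (simplified): instead of one fixed lam, average the
    # anchored policies under a probability distribution over lam values.
    mix = np.zeros_like(anchor)
    for lam, prob in zip(lambdas, lambda_probs):
        mix += prob * anchored_policy(q_values, anchor, lam)
    return mix

# Toy example with three candidate actions.
q = np.array([1.0, 0.5, -0.2])       # estimated action values
anchor = np.array([0.2, 0.7, 0.1])   # BC (human-imitation) policy
print(lambda_mixture_policy(q, anchor, lambdas=[0.1, 1.0, 10.0],
                            lambda_probs=[0.3, 0.4, 0.3]))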
Using RL-DiL-piKL we trained an agent, Diplodocus, to play no-press Diplomacy, a difficult bench-
mark for multi-agent AI that has been actively studied in recent years (Paquette et al., 2019; Anthony
et al., 2020; Gray et al., 2020; Bakhtin et al., 2021; Jacob et al., 2022). We conducted a 200-game
no-press Diplomacy tournament with a diverse pool of human players, including expert humans, in
which we tested two versions of Diplodocus using different RL-DiL-piKL settings, and other base-
line agents. All games consisted of one bot and six humans, with all players being anonymous for
the duration of the game. These two versions of Diplodocus achieved the top two average scores
in the tournament among all 48 participants who played more than two games, and ranked first and
third overall among all participants according to an Elo ratings model.
2 BACKGROUND AND PRIOR WORK
Diplomacy is a benchmark 7-player mixed cooperative/competitive game featuring simultaneous
moves and a heavy emphasis on negotiation and coordination. In the no-press variant of the game,
there is no cheap-talk communication; instead, players communicate only implicitly through their moves.
In the game, seven players compete for majority control of 34 “supply centers” (SCs) on a map.
On each turn, players simultaneously choose actions consisting of an order for each of their units to
hold, move, support, or convoy another unit. If no player controls a majority of SCs and either all
remaining players agree to a draw or a turn limit is reached, then the game ends in a draw. In this case, we use
a common scoring system in which the score of player $i$ is $C_i^2 / \sum_{i'} C_{i'}^2$, where $C_i$ is the number of
SCs player $i$ owns. A more detailed description of the rules is provided in Appendix B.
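For concreteness, this scoring rule can be computed directly from the final SC counts; the short snippet below (with illustrative variable names, not taken from any existing codebase) shows the computation.

def sum_of_squares_scores(sc_counts):
    # Score of each player in a drawn game: C_i^2 / sum_j C_j^2,
    # where C_i is the number of supply centers player i owns.
    squared = [c * c for c in sc_counts]
    total = sum(squared)
    return [s / total for s in squared]

# Example: final SC counts for the seven powers in a drawn game.
print(sum_of_squares_scores([10, 8, 6, 5, 3, 2, 0]))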
Most recent successes in no-press Diplomacy use deep learning to imitate human behavior given a
corpus of human games. The first Diplomacy agent to leverage deep imitation learning was Paquette
et al. (2019). Subsequent work on no-press Diplomacy has mostly relied on a similar architecture
with some modeling improvements (Gray et al., 2020; Anthony et al., 2020; Bakhtin et al., 2021).
Gray et al. (2020) proposed an agent that plays an improved policy via one-ply search. It uses policy
and value functions trained on human data to conduct search using regret minimization.
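As a rough illustration of equilibrium search via regret minimization (this is plain regret matching on a toy two-player matrix game, not the specific algorithm or architecture of Gray et al. (2020), where the payoffs would instead come from a learned value function evaluated on candidate action profiles):

import numpy as np

def regret_matching(payoffs, iters=1000):
    # payoffs[i][a0, a1] is player i's payoff when the players pick a0, a1.
    # Returns each player's average policy over the iterations, which
    # approximates an equilibrium in two-player zero-sum games.
    n_actions = payoffs[0].shape
    regrets = [np.zeros(n) for n in n_actions]
    avg_policies = [np.zeros(n) for n in n_actions]

    for _ in range(iters):
        # Play in proportion to positive regret (uniform if none is positive).
        policies = []
        for r in regrets:
            pos = np.maximum(r, 0.0)
            policies.append(pos / pos.sum() if pos.sum() > 0
                            else np.full(len(r), 1.0 / len(r)))

        # Expected value of each action against the opponent's current policy.
        values = [payoffs[0] @ policies[1], payoffs[1].T @ policies[0]]
        for i in range(2):
            regrets[i] += values[i] - policies[i] @ values[i]
            avg_policies[i] += policies[i]

    return [p / iters for p in avg_policies]

# Toy example: matching pennies, whose equilibrium is 50/50 for both players.
payoff0 = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(regret_matching([payoff0, -payoff0]))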
Several works explored applying self-play to compute improved policies. Paquette et al. (2019)
applied an actor-critic approach and found that while the agent plays stronger in populations of
other self-play agents, it plays worse against a population of human-imitation agents. Anthony
et al. (2020) used a self-play approach based on a modification of fictitious play in order to reduce
drift from human conventions. The resulting policy is stronger than pure imitation learning in both
1vs6 and 6vs1 settings but weaker than agents that use search. Most recently, Bakhtin et al. (2021)
combined one-ply search based on equilibrium computation with value iteration to produce an agent
called DORA. DORA achieved superhuman performance in a two-player zero-sum (2p0s) version of Diplomacy without
human data, but in the full 7-player game it plays poorly with agents other than itself.
Jacob et al. (2022) showed that regularizing inference-time search techniques can produce agents
that are not only strong but can also model human behavior well. In the domain of no-press Diplo-
macy, they show that regularizing hedge (an equilibrium-finding algorithm) with a KL-divergence
penalty towards a human imitation learning policy can match or exceed the human action prediction
accuracy of imitation learning while being substantially stronger. KL-regularization toward human
behavioral policies has previously been proposed in various forms in single- and multi-agent RL
algorithms (Nair et al., 2018; Siegel et al., 2020; Nair et al., 2020), and was notably employed in
AlphaStar (Vinyals et al., 2019), but this has typically been used to improve sample efficiency and
aid exploration rather than to better model and coordinate with human play.
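As a rough sketch of what such KL-regularized equilibrium search can look like (not the authors' implementation: the payoff matrices, anchor policies, and the specific anchored update below are our own simplified choices), the following loop runs a hedge-style update on a toy two-player game in which each player's policy is pulled toward its anchor by the regularization strength λ:

import numpy as np

def kl_regularized_hedge(payoffs, anchors, lam=1.0, eta=1.0, iters=500):
    # payoffs[i][a0, a1] is player i's payoff; anchors[i] is player i's
    # anchor (e.g., human-imitation) policy. Each iteration re-estimates
    # average action values against the opponent and applies an anchored
    # softmax update: the larger lam is, the closer play stays to the anchor.
    policies = [np.copy(a) for a in anchors]
    sum_values = [np.zeros_like(a) for a in anchors]
    avg_policies = [np.zeros_like(a) for a in anchors]

    for t in range(1, iters + 1):
        values = [payoffs[0] @ policies[1], payoffs[1].T @ policies[0]]
        for i in range(2):
            sum_values[i] += values[i]
            mu = 1.0 / (eta * t)  # hedge "temperature" that shrinks over time
            logits = (lam * np.log(anchors[i] + 1e-32)
                      + sum_values[i] / t) / (lam + mu)
            logits -= logits.max()
            policies[i] = np.exp(logits) / np.exp(logits).sum()
            avg_policies[i] += policies[i]

    return [p / iters for p in avg_policies]

# Toy example: a 2x2 game with uniform anchors for both players.
payoff0 = np.array([[3.0, 0.0], [5.0, 1.0]])
anchors = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
print(kl_regularized_hedge([payoff0, payoff0.T], anchors, lam=0.5))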
An alternative line of research has attempted to build human-compatible agents without relying
on human data (Hu et al., 2020; 2021; Strouse et al., 2021). These techniques have shown some
success in simplified settings but have not been shown to be competitive with humans in large-scale
collaborative environments.