Whilst this work (MDAM) shares our architecture and goal of training a population, our approach
to enforcing diversity differs substantially. MDAM explicitly trades off performance against diversity
by jointly optimizing the policies and the KL divergence between them. Moreover, as computing the KL
divergence over whole trajectories is intractable, MDAM is restricted to using it to drive diversity at the
first timestep only. In contrast, Poppy drives diversity by maximizing population-level performance (i.e.,
without any explicit diversity metric), uses the whole trajectory, and scales better with the population
size (we use up to 32 agents, compared to only 5 for MDAM).
Additionally, ML approaches usually rely on mechanisms to generate multiple candidate solutions
[Mazyavkina et al., 2021]. One such mechanism consists of applying improvement methods to an initial
solution: de O. da Costa et al. [2020] use policy gradients to learn a policy that selects local operators
(2-opt) given a current solution in TSP, while Lu et al. [2020] and Wu et al. [2021] extend this method
to CVRP. This idea has been extended to enable searching a learned latent space of solutions [Hottung
et al., 2021]. However, these approaches have two limitations: they are environment-specific, and the
search procedure is inherently biased by the initial solution.
An alternative exploration mechanism is to generate a diverse set of trajectories by stochastically
sampling a learned policy, potentially with additional beam search [Joshi et al., 2019], Monte Carlo
tree search [Fu et al., 2021], dynamic programming [Kool et al., 2021], active search [Hottung et al.,
2022], or simulation-guided search [Choo et al., 2022]. However, intuitively, the generated solutions
tend to remain close to the underlying deterministic policy, implying that the benefits of additional
sampled candidates diminish quickly.
Population-Based RL Populations have already been used in RL to learn diverse behaviors. In a
different context, Gupta et al. [2018], Eysenbach et al. [2019], Hartikainen et al. [2020] and Pong et al.
[2020] use a single policy conditioned on a set of goals as an implicit population for unsupervised
skill discovery. Closer to our approach, another line of work revolves around explicitly storing a
set of distinct policy parameters. Hong et al. [2018], Doan et al. [2020], Jung et al. [2020] and
Parker-Holder et al. [2020] use a population to achieve better coverage of the policy space. However,
they enforce explicit attraction-repulsion mechanisms, a major difference from our
approach, where diversity is a pure byproduct of performance optimization.
Our method is also related to approaches combining RL with evolutionary algorithms [EA; Khadka
and Tumer, 2018, Khadka et al., 2019, Pourchot and Sigaud, 2019], which benefit from sample-
efficient RL policy updates while enjoying evolutionary population-level exploration. However, in these
methods the population is a means to learn a single strong policy, whereas Poppy learns a set of complementary
strategies. More closely related, Quality-Diversity [QD; Pugh et al., 2016, Cully and Demiris, 2018]
is a popular EA framework that maintains a portfolio of diverse policies. Pierrot et al. [2022] have
recently combined RL with a QD algorithm, Map-Elites [Mouret and Clune, 2015]; however, unlike Poppy,
QD methods rely on handcrafted behavioral markers, which are not easily defined in the CO context.
One drawback of population-based RL is its high computational cost. However, recent work has
shown that modern hardware, together with targeted frameworks, enables efficient vectorized population
training [Flajolet et al., 2022], opening the door to a wider range of applications.
3 Methods
3.1 Background and Motivation
RL Formulation A CO problem instance $\rho$ sampled from some distribution $\mathcal{D}$ consists of a discrete
set of $N$ variables (e.g. city locations in TSP). We model a CO problem as a Markov decision process
(MDP) defined by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function $T$, and a reward function
$R$. A state is a trajectory through the problem instance, $\tau_t = (x_1, \dots, x_t) \in \mathcal{S}$ where $x_i \in \rho$, and thus
consists of an ordered list of variables (not necessarily of length $N$). An action, $a \in \mathcal{A} \subseteq \rho$, consists
of choosing the next variable to add; thus, given state $\tau_t = (x_1, \dots, x_t)$ and action $a$, the next state is
$\tau_{t+1} = T(\tau_t, a) = (x_1, \dots, x_t, a)$. Let $\mathcal{S}^* \subseteq \mathcal{S}$ be the set of solutions, i.e. states that comply with
the problem's constraints (e.g., a sequence of cities such that each city is visited once and ends with
the starting city in TSP). The reward function $R : \mathcal{S}^* \to \mathbb{R}$ maps solutions into scalars. We assume
the reward is maximized by the optimal solution (e.g. $R$ returns the negative tour length in TSP).
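To make this formulation concrete, below is a minimal Python sketch of the TSP instantiation of this MDP. It is an illustrative toy rather than the paper's implementation: the helper names (sample_instance, transition, is_solution, reward, rollout) and the uniform rollout policy are assumptions made for the example; a learned policy would replace the uniform distribution over unvisited cities.

```python
# Minimal sketch (not the paper's implementation) of the TSP instance of the MDP above:
# an instance rho is a set of N city coordinates, a state is the ordered list of cities
# visited so far, an action appends an unvisited city, and the reward of a complete
# tour is its negative length.
import numpy as np


def sample_instance(num_cities: int, seed: int = 0) -> np.ndarray:
    """Sample a problem instance rho ~ D: N city locations in the unit square."""
    rng = np.random.default_rng(seed)
    return rng.random((num_cities, 2))


def transition(state: list[int], action: int) -> list[int]:
    """T(tau_t, a): append the chosen city index to the partial trajectory."""
    return state + [action]


def is_solution(state: list[int], num_cities: int) -> bool:
    """A state is in S* once every city has been visited exactly once."""
    return len(state) == num_cities and len(set(state)) == num_cities


def reward(instance: np.ndarray, state: list[int]) -> float:
    """R(tau): negative length of the closed tour (return to the start city)."""
    tour = instance[state + [state[0]]]  # close the loop
    return -float(np.linalg.norm(np.diff(tour, axis=0), axis=1).sum())


def rollout(instance: np.ndarray, policy, seed: int = 0) -> tuple[list[int], float]:
    """Roll out a (possibly stochastic) policy until a full solution is built."""
    rng = np.random.default_rng(seed)
    num_cities = len(instance)
    state = [0]  # start from city 0
    while not is_solution(state, num_cities):
        candidates = [a for a in range(num_cities) if a not in state]
        probs = policy(instance, state, candidates)
        action = int(rng.choice(candidates, p=probs))
        state = transition(state, action)
    return state, reward(instance, state)


if __name__ == "__main__":
    instance = sample_instance(num_cities=10)
    uniform = lambda inst, s, cand: np.full(len(cand), 1.0 / len(cand))
    tour, r = rollout(instance, uniform)
    print(f"tour: {tour}, reward (negative length): {r:.3f}")
```

Note that, consistent with $R$ being defined on $\mathcal{S}^*$ only, the sketch assigns a reward only to complete tours; partial trajectories receive no reward signal.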