Winner Takes It All: Training Performant RL
Populations for Combinatorial Optimization
Nathan Grinsztajn
InstaDeep
n.grinsztajn@instadeep.com
Daniel Furelos-Blanco
Imperial College London (work completed during an internship at InstaDeep)
Shikha Surana
InstaDeep
Clément Bonnet
InstaDeep
Thomas D. Barrett
InstaDeep
Abstract
Applying reinforcement learning (RL) to combinatorial optimization problems is attractive as it removes the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity. Thus, leading approaches often implement additional search strategies, from stochastic sampling and beam search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the population. We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on four popular NP-hard problems: traveling salesman, capacitated vehicle routing, 0-1 knapsack, and job-shop scheduling.
1 Introduction
In recent years, machine learning (ML) approaches have overtaken algorithms that use handcrafted features and strategies across a variety of challenging tasks [Mnih et al., 2015, van den Oord et al., 2016, Silver et al., 2018, Brown et al., 2020]. In particular, solving combinatorial optimization (CO) problems – where the maximum or minimum of an objective function over a finite set of discrete variables is sought – has attracted significant interest [Bengio et al., 2021] due to both their (often NP-)hard nature and their numerous practical applications across domains ranging from logistics [Sbihi and Eglese, 2007] to fundamental science [Wagner, 2020].
As the search space of feasible solutions typically grows exponentially with the problem size, exact solvers can be challenging to scale; hence, CO problems are often also tackled with handcrafted heuristics that encode expert knowledge. Whilst a diversity of ML-based heuristics has been proposed, reinforcement learning [RL; Sutton and Barto, 2018] is a promising paradigm as it does not require pre-solved examples of these hard problems. Indeed, algorithmic improvements to RL-based CO solvers, coupled with their low inference cost and the fact that they are by design targeted at specific problem distributions, have progressively narrowed the gap with traditional solvers.
RL methods frame CO as sequential decision-making problems, and can be divided into two families [Mazyavkina et al., 2021]. First, improvement methods start from a feasible solution and iteratively
improve it through small modifications (actions). However, such incremental search cannot quickly access very different solutions, and requires handcrafted procedures to define a sensible action space. Second, construction methods incrementally build a solution by selecting one element at a time. In practice, it is often unrealistic for a learned heuristic to solve NP-hard problems in a single shot; therefore, these methods are typically combined with search strategies, such as stochastic sampling or beam search. However, just as improvement methods are biased by the initial starting solution, construction methods are biased by the single underlying policy. Thus, a balance must be struck between the exploitation of the learned policy (which may be ill-suited for a given problem instance) and the exploration of different solutions (where the extreme case of a purely random policy will likely be highly inefficient).
In this work, we propose Poppy, a construction method that uses a population of agents with
suitably diverse policies to improve the exploration of the solution space of hard CO problems.
Whereas a single agent aims to perform well across the entire problem distribution, and thus has
to make compromises, a population can learn a set of heuristics such that only one of these has to
be performant on any given problem instance. However, realizing this intuition presents several
challenges: (i) naïvely training a population of agents is expensive and challenging to scale, (ii) the
trained population should have complementary policies that propose different solutions, and (iii) the
training approach should not impose any handcrafted notion of diversity within the set of policies
given the absence of clear behavioral markers aligned with performance for typical CO problems.
Challenge (i) can be addressed by sharing a large fraction of the computations across the population, specializing only lightweight policy heads to realize the diversity of agents. Moreover, this can be done on top of a pre-trained model, which we clone to produce the population. Challenges (ii) and (iii) are jointly addressed by introducing an RL objective aimed at specializing agents on distinct subsets of the problem distribution. Concretely, we derive a policy gradient method for the population-level objective, which corresponds to training only the agent that performs best on each problem. This is intuitively justified as the performance of the population on a given problem is not improved by training an agent on an instance where another agent already has better performance. Strikingly, we find that judicious application of this conceptually simple objective gives rise to a population where the diversity of policies is obtained without explicit supervision (and hence is applicable across a range of problems without modification) and is essential for strong performance.
Our contributions are summarized as follows:
1. We motivate the use of populations for CO problems as an efficient way to explore environments that are not reliably solved by single-shot inference.
2. We derive a new training objective and present a practical training procedure that encourages performance-driven diversity (i.e. effective diversity without the use of explicit behavioral markers or other external supervision).
3. We evaluate Poppy on four CO problems: traveling salesman (TSP), capacitated vehicle routing (CVRP), 0-1 knapsack (KP), and job-shop scheduling (JSSP). In these four problems, Poppy consistently outperforms all other RL-based approaches.
2 Related Work
ML for Combinatorial Optimization The first attempt to solve TSP with neural networks is due to Hopfield and Tank [1985], whose approach only scaled up to 30 cities. Recent developments in bespoke neural architectures [Vinyals et al., 2015, Vaswani et al., 2017] and performant hardware have made ML approaches increasingly efficient. Indeed, several architectures have been used to address CO problems, such as graph neural networks [Dai et al., 2017], recurrent neural networks [Nazari et al., 2018], and attention mechanisms [Deudon et al., 2018]. Kool et al. [2019] proposed an encoder-decoder architecture that we employ for TSP, CVRP and KP. The costly encoder is run once per problem instance, and the resulting embeddings are fed to a small decoder that is iteratively rolled out to obtain the whole trajectory, which enables efficient inference. This approach was furthered by Kwon et al. [2020] and Kim et al. [2022], who leveraged the underlying symmetries of typical CO problems (e.g. of starting positions and rotations) to realize improved training and inference performance using instance augmentations. Kim et al. [2021] also draw on Kool et al. and use a hierarchical strategy where a seeder proposes solution candidates, which are refined bit-by-bit by a reviser. Closer to our work, Xin et al. [2021] train multiple policies using a shared encoder and separate decoders.
Whilst this work (MDAM) shares our architecture and goal of training a population, our approach to enforcing diversity differs substantially. MDAM explicitly trades off performance with diversity by jointly optimizing policies and their KL divergence. Moreover, as computing the KL divergence over whole trajectories is intractable, MDAM is restricted to using it to drive diversity only at the first timestep. In contrast, Poppy drives diversity by maximizing population-level performance (i.e. without any explicit diversity metric), uses the whole trajectory, and scales better with the population size (we have used up to 32 agents instead of only 5).
Additionally, ML approaches usually rely on mechanisms to generate multiple candidate solutions [Mazyavkina et al., 2021]. One such mechanism consists of applying improvement methods to an initial solution: de O. da Costa et al. [2020] use policy gradients to learn a policy that selects local operators (2-opt) given a current solution in TSP, while Lu et al. [2020] and Wu et al. [2021] extend this method to CVRP. This idea has been further extended to enable searching a learned latent space of solutions [Hottung et al., 2021]. However, these approaches have two limitations: they are environment-specific, and the search procedure is inherently biased by the initial solution.
An alternative exploration mechanism is to generate a diverse set of trajectories by stochastically sampling a learned policy, potentially with additional beam search [Joshi et al., 2019], Monte Carlo tree search [Fu et al., 2021], dynamic programming [Kool et al., 2021], active search [Hottung et al., 2022], or simulation-guided search [Choo et al., 2022]. However, intuitively, the generated solutions tend to remain close to the underlying deterministic policy, implying that the benefits of additional sampled candidates diminish quickly.
Population-Based RL Populations have already been used in RL to learn diverse behaviors. In a different context, Gupta et al. [2018], Eysenbach et al. [2019], Hartikainen et al. [2020] and Pong et al. [2020] use a single policy conditioned on a set of goals as an implicit population for unsupervised skill discovery. Closer to our approach, another line of work revolves around explicitly storing a set of distinct policy parameters. Hong et al. [2018], Doan et al. [2020], Jung et al. [2020] and Parker-Holder et al. [2020] use a population to achieve a better coverage of the policy space. However, they enforce explicit attraction-repulsion mechanisms, a major difference from our approach, where diversity is a pure byproduct of performance optimization.
Our method is also related to approaches combining RL with evolutionary algorithms [EA; Khadka and Tumer, 2018, Khadka et al., 2019, Pourchot and Sigaud, 2019], which benefit from sample-efficient RL policy updates while enjoying evolutionary population-level exploration. However, in these methods the population is a means to learn a single strong policy, whereas Poppy learns a set of complementary strategies. More closely related, Quality-Diversity [QD; Pugh et al., 2016, Cully and Demiris, 2018] is a popular EA framework that maintains a portfolio of diverse policies. Pierrot et al. [2022] recently combined RL with a QD algorithm, MAP-Elites [Mouret and Clune, 2015]; unlike Poppy, QD methods rely on handcrafted behavioral markers, which are not easily applicable in the CO context.
One of the drawbacks of population-based RL is its high computational cost. However, recent work has shown that modern hardware, together with targeted frameworks, enables efficient vectorized population training [Flajolet et al., 2022], opening the door to a wider range of applications.
3 Methods
3.1 Background and Motivation
RL Formulation A CO problem instance $\rho$ sampled from some distribution $\mathcal{D}$ consists of a discrete set of $N$ variables (e.g. city locations in TSP). We model a CO problem as a Markov decision process (MDP) defined by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function $T$, and a reward function $R$. A state is a trajectory through the problem instance, $\tau_t = (x_1, \dots, x_t) \in \mathcal{S}$ where $x_i \in \rho$, and thus consists of an ordered list of variables (not necessarily of length $N$). An action $a \in \mathcal{A} \subseteq \rho$ consists of choosing the next variable to add; thus, given state $\tau_t = (x_1, \dots, x_t)$ and action $a$, the next state is $\tau_{t+1} = T(\tau_t, a) = (x_1, \dots, x_t, a)$. Let $\mathcal{S}^* \subseteq \mathcal{S}$ be the set of solutions, i.e. states that comply with the problem's constraints (e.g., in TSP, a sequence of cities such that each city is visited once and the sequence ends at the starting city). The reward function $R : \mathcal{S}^* \to \mathbb{R}$ maps solutions to scalars. We assume the reward is maximized by the optimal solution (e.g. $R$ returns the negative tour length in TSP).
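To make this formulation concrete, the following minimal Python sketch instantiates the MDP for TSP. The function names and the uniform-square instance distribution are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_instance(n_cities: int, rng: np.random.Generator) -> np.ndarray:
    """A TSP instance rho: n_cities city locations in the unit square."""
    return rng.random((n_cities, 2))

def transition(state: list, action: int) -> list:
    """T(tau_t, a): append the chosen city index to the partial tour."""
    assert action not in state, "each city may be visited only once"
    return state + [action]

def reward(instance: np.ndarray, tour: list) -> float:
    """R: negative length of the closed tour, defined on complete solutions."""
    assert len(tour) == len(instance), "reward is only defined for full tours"
    cities = instance[tour + tour[:1]]  # close the loop back to the start
    return -float(np.linalg.norm(np.diff(cities, axis=0), axis=1).sum())

rng = np.random.default_rng(0)
rho = sample_instance(5, rng)
tau = [0]
for a in [2, 4, 1, 3]:  # one arbitrary feasible construction
    tau = transition(tau, a)
print(reward(rho, tau))  # negative tour length of this solution
```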
A policy $\pi_\theta$ parameterized by $\theta$ can be used to generate solutions for any instance $\rho \sim \mathcal{D}$ by iteratively sampling the next action $a \in \mathcal{A}$ according to the probability distribution $\pi_\theta(\cdot \mid \rho, \tau_t)$. We learn $\pi_\theta$ using REINFORCE [Williams, 1992]. This method aims at maximizing the RL objective $J(\theta) \doteq \mathbb{E}_{\rho \sim \mathcal{D}} \, \mathbb{E}_{\tau \sim \pi_\theta} R(\tau)$ by adjusting $\theta$ such that good trajectories are more likely to be sampled in the future. Formally, the policy parameters $\theta$ are updated by gradient ascent using $\nabla_\theta J(\theta) = \mathbb{E}_{\rho \sim \mathcal{D}} \, \mathbb{E}_{\tau \sim \pi_\theta} \left[ (R(\tau) - b_\rho) \nabla_\theta \log p_\theta(\tau) \right]$, where $p_\theta(\tau) = \prod_t \pi_\theta(a_{t+1} \mid \rho, \tau_t)$ and $b_\rho$ is a baseline. The gradient of the objective, $\nabla_\theta J$, can be estimated empirically using Monte Carlo simulations.
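As a rough illustration of this estimator (a sketch, not the authors' code), the snippet below assumes a hypothetical `policy` object whose `sample(rho)` method returns a rollout carrying its reward and the gradient of its log-probability; the per-instance mean reward serves as the baseline $b_\rho$.

```python
import numpy as np

def reinforce_gradient(policy, instances, n_samples=16):
    """Monte Carlo estimate of grad J(theta) over a batch of instances.

    `policy.sample(rho)` is an assumed interface returning an object with
    `.reward` (= R(tau)) and `.grad_log_prob` (= grad_theta log p_theta(tau)).
    """
    grad = 0.0
    for rho in instances:
        rollouts = [policy.sample(rho) for _ in range(n_samples)]
        rewards = np.array([r.reward for r in rollouts])
        baseline = rewards.mean()  # b_rho: mean sampled reward on this instance
        for r in rollouts:
            grad = grad + (r.reward - baseline) * r.grad_log_prob
    return grad / (len(instances) * n_samples)
```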
Figure 1: In this environment, the upward path always leads to a medium reward, while the left and
right paths are intricate such that either one may lead to a low reward or high reward with equal
probability. Left: An agent trained to maximize its expected reward converges to taking the safe
upward road since acting optimally is too computationally intensive, as it requires solving the maze.
Right: A 2-agent population can always take the left and right paths and thus get the largest reward.
Motivating Example We argue for the benefits of training a population using the example in
Figure 1. In this environment, there are three actions: Left, Right, and Up. Up leads to a medium reward, while Left/Right lead to low/high or high/low rewards (the configuration is determined with
equal probability at the start of each episode). Crucially, the left and right paths are intricate, so the
agent cannot easily infer from its observation which one leads to a higher reward. Then, the best
strategy for a computationally limited agent is to always go Up, as the guaranteed medium reward
(2 scoops) is higher than the expected reward of guessing left or right (1.5 scoops). In contrast, two
agents in a population can go in opposite directions and always find the maximum reward. There
are two striking observations: (i) the agents do not need to perform optimally for the population
performance to be optimal (one agent gets the maximum reward), and (ii) the performance of each
individual agent is worse than in the single-agent case.
The discussed phenomenon can occur when (i) some optimal actions are too difficult to infer from
observations and (ii) choices are irreversible (i.e. it is not possible to recover from a sub-optimal
decision). This problem setting can be seen as a toy model for the challenge of solving an NP-hard
CO problem in a sequential decision-making setting. In the case of TSP, for example, the number of
possible unique tours that could follow from each action is exponentially large and, for any reasonably
finite agent capacity, essentially provides the same obfuscation over the final returns. In this situation,
as shown above, maximizing the performance of a population will require agents to specialize and
likely yield better results than in the single-agent case.
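The arithmetic behind this example can be made explicit. The short snippet below assumes rewards of 0, 2, and 3 scoops for the low, medium, and high outcomes; these exact values are not stated in the text but are consistent with the quoted expectations (a guaranteed 2 scoops upward versus an expected 1.5 for a blind guess).

```python
# Expected returns in the toy maze of Figure 1 (assumed reward values).
LOW, MEDIUM, HIGH = 0.0, 2.0, 3.0

def single_agent(action: str) -> float:
    """Expected reward of a lone agent committing to a fixed action."""
    if action == "up":
        return MEDIUM                  # guaranteed medium reward
    return 0.5 * HIGH + 0.5 * LOW      # left or right: a coin flip

def two_agent_population() -> float:
    """One agent goes left, the other right; whichever side hides the high
    reward, the best of the two trajectories always reaches it."""
    return 0.5 * max(HIGH, LOW) + 0.5 * max(LOW, HIGH)

print(single_agent("up"), single_agent("left"), two_agent_population())
# -> 2.0 1.5 3.0
```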
3.2 Poppy
We present the components of Poppy: an RL objective encouraging agent specialization, and an
efficient training procedure taking advantage of a pre-trained policy.
Population-Based Objective At inference, reinforcement learning methods usually sample several candidates to find better solutions. This process, though, is not anticipated during training, which optimizes the 1-shot performance with the usual RL objective $J(\theta) = \mathbb{E}_{\rho \sim \mathcal{D}} \, \mathbb{E}_{\tau \sim \pi_\theta} R(\tau)$ presented in Section 3.1. Intuitively, given $K$ trials, we would like to find the best set of policies $\{\pi_1, \dots, \pi_K\}$ to roll out once on a given problem. This gives the following population objective:
$$J_{\mathrm{pop}}(\theta_1, \dots, \theta_K) \doteq \mathbb{E}_{\rho \sim \mathcal{D}} \, \mathbb{E}_{\tau_1 \sim \pi_{\theta_1}, \dots, \tau_K \sim \pi_{\theta_K}} \max \left[ R(\tau_1), \dots, R(\tau_K) \right],$$
where each trajectory $\tau_i$ is sampled according to the policy $\pi_{\theta_i}$. Maximizing $J_{\mathrm{pop}}$ leads to finding the best set of $K$ agents which can be rolled out in parallel for any problem.
Theorem 1 (Policy gradient for populations). The gradient of the population objective is:
$$\nabla J_{\mathrm{pop}}(\theta_1, \theta_2, \dots, \theta_K) = \mathbb{E}_{\rho \sim \mathcal{D}} \, \mathbb{E}_{\tau_1 \sim \pi_{\theta_1}, \dots, \tau_K \sim \pi_{\theta_K}} \left[ \left( R(\tau_{i^*}) - R(\tau_{i^{**}}) \right) \nabla \log p_{\theta_{i^*}}(\tau_{i^*}) \right],$$
where $i^* = \arg\max_{i \in \{1, \dots, K\}} R(\tau_i)$ (the index of the agent that obtained the highest reward) and $i^{**} = \arg\max_{i \neq i^*} R(\tau_i)$ (the index of the agent that obtained the second-highest reward).
The proof is provided in Appendix B.1. Remarkably, the resulting update corresponds to rolling out every agent and, on each problem, training only the one that obtained the highest reward. This formulation applies across various problems and directly optimizes population-level performance without explicit supervision or handcrafted behavioral markers.
The theorem trivially holds when, instead of a population, one single agent is sampled $K$ times, which falls back to the specific case where $\theta_1 = \theta_2 = \dots = \theta_K$. In this case, the gradient becomes:
$$\nabla J_{\mathrm{pop}}(\theta) = \mathbb{E}_{\rho \sim \mathcal{D}} \, \mathbb{E}_{\tau_1, \dots, \tau_K \sim \pi_\theta} \left[ \left( R(\tau_{i^*}) - R(\tau_{i^{**}}) \right) \nabla \log p_\theta(\tau_{i^*}) \right].$$
Remark. The occurrence of $R(\tau_{i^{**}})$ in the new gradient formulation can be surprising. However, Theorem 1 can be intuitively understood as training each agent on its true contribution to the population's performance: if, for any problem instance, the best agent $i^*$ were removed, the population performance would fall back to the level of the second-best agent $i^{**}$; hence its contribution is indeed $R(\tau_{i^*}) - R(\tau_{i^{**}})$.
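In practice, the estimator of Theorem 1 boils down to a simple per-instance selection. The sketch below (illustrative, not the released code) takes a matrix of rewards for $K$ agents on a batch of $B$ instances and returns, for each instance, the winning agent and its advantage over the runner-up.

```python
import numpy as np

def winner_and_advantage(rewards: np.ndarray):
    """rewards has shape (B, K): rewards[i, k] = R(tau_i^k).

    Returns, for each instance i, the index of the best agent i* and the
    advantage R(tau_{i*}) - R(tau_{i**}) used to weight its gradient."""
    order = np.argsort(rewards, axis=1)      # ascending per instance
    best = order[:, -1]                      # i*
    second_best = order[:, -2]               # i**
    rows = np.arange(rewards.shape[0])
    advantage = rewards[rows, best] - rewards[rows, second_best]
    return best, advantage

rewards = np.array([[-5.1, -4.8, -5.0],      # e.g. negative tour lengths
                    [-3.2, -3.9, -3.0]])
print(winner_and_advantage(rewards))
# best agents per instance: [1, 2]; advantages approximately [0.2, 0.2]
```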
Optimizing the presented objective does not provide any strict diversity guarantee. However, note that
diversity maximizes our objective in the highly probable case that, within the bounds of finite capacity
and training, a single agent does not perform optimally on all subsets of the training distribution.
Therefore, as we argue intuitively here and show empirically later, diversity emerges over training in the pursuit of maximizing the objective.
Algorithm 1 Poppy training
1: Input: problem distribution $\mathcal{D}$, number of agents $K$, batch size $B$, number of training steps $H$, pre-trained parameters $\theta$.
2: $\theta_1, \theta_2, \dots, \theta_K \leftarrow \mathrm{CLONE}(\theta)$  {Clone the pre-trained agent $K$ times.}
3: for step 1 to $H$ do
4:   $\rho_i \leftarrow \mathrm{Sample}(\mathcal{D}) \quad \forall i \in 1, \dots, B$
5:   $\tau_i^k \leftarrow \mathrm{Rollout}(\rho_i, \theta_k) \quad \forall i \in 1, \dots, B, \; k \in 1, \dots, K$
6:   $k_i^* \leftarrow \arg\max_{k \in K} R(\tau_i^k) \quad \forall i \in 1, \dots, B$  {Select the best agent for each problem $\rho_i$.}
7:   $L(\theta_1, \dots, \theta_K) \leftarrow \frac{1}{B} \sum_{i \in B} \mathrm{REINFORCE}(\tau_i^{k_i^*})$  {Backpropagate through these only.}
8:   $(\theta_1, \dots, \theta_K) \leftarrow (\theta_1, \dots, \theta_K) - \alpha \nabla L(\theta_1, \dots, \theta_K)$
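For concreteness, one possible Python rendering of Algorithm 1 is sketched below. Here `sample_instance`, `rollout`, and `reinforce_update` are assumed helpers rather than the paper's actual code, a practical implementation would vectorize these loops, and the winning agent's update is weighted by its margin over the runner-up following Theorem 1.

```python
import copy
import numpy as np

def train_poppy(pretrained_params, sample_instance, rollout, reinforce_update,
                num_agents=4, batch_size=64, num_steps=1000, lr=1e-4):
    """Illustrative rendering of Algorithm 1 with an assumed helper interface:
      sample_instance()                                  -> instance rho ~ D
      rollout(rho, params)                               -> trajectory with .reward
      reinforce_update(params, rho, traj, advantage, lr) -> updated params
    """
    # Clone the pre-trained agent K times (line 2).
    params = [copy.deepcopy(pretrained_params) for _ in range(num_agents)]

    for _ in range(num_steps):
        instances = [sample_instance() for _ in range(batch_size)]

        # Roll out every agent on every instance (lines 4-5).
        trajectories = [[rollout(rho, p) for p in params] for rho in instances]
        rewards = np.array([[t.reward for t in row] for row in trajectories])

        # Per-instance winner and runner-up (line 6 / Theorem 1).
        order = np.argsort(rewards, axis=1)
        best, runner_up = order[:, -1], order[:, -2]

        # Train only the winning agent on each instance (lines 7-8).
        for i, rho in enumerate(instances):
            k = int(best[i])
            advantage = rewards[i, k] - rewards[i, runner_up[i]]
            params[k] = reinforce_update(params[k], rho, trajectories[i][k],
                                         advantage, lr / batch_size)
    return params
```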
Training Procedure The training procedure consists of two phases:
1. We train (or reuse) a single agent using an architecture suitable for solving the CO problem at hand. We later outline the architectures used for the different problems.
2. The agent trained in Phase 1 is cloned $K$ times to form a $K$-agent population. The population is trained as described in Algorithm 1: only the best agent is trained on any problem. Agents implicitly specialize in different types of problem instances during this phase.
Phase 1 enables training the model without the computational overhead of a population. Moreover,
we informally note that applying the Poppy objective directly to a population of untrained agents can
be unstable. Randomly initialized agents are often ill-distributed; hence, a single agent (or a few agents) dominates performance across all instances. In this case, only the initially dominating agents receive a training signal, further widening the performance gap. Whilst directly training a population