Whilst this work (MDAM) shares our architecture and goal of training a population, our approach
to enforcing diversity differs substantially. MDAM explicitly trades off performance against diversity
by jointly optimizing the policies and the KL divergence between them. Moreover, as computing the KL
divergence over whole trajectories is intractable, MDAM is restricted to using it to drive diversity at the
first timestep only. In contrast, Poppy drives diversity by maximizing population-level performance (i.e.,
without any explicit diversity metric), uses the whole trajectory, and scales better with the population
size (we use up to 32 agents, compared to only 5 for MDAM).
Additionally, ML approaches usually rely on mechanisms to generate multiple candidate solutions
[Mazyavkina et al., 2021]. One such mechanism consists of applying improvement methods to an initial
solution: de O. da Costa et al. [2020] use policy gradients to learn a policy that selects local operators
(2-opt) given a current solution in TSP, while Lu et al. [2020] and Wu et al. [2021] extend this method
to CVRP. This idea has been extended to enable searching a learned latent space of solutions [Hottung
et al., 2021]. However, these approaches have two limitations: they are environment-specific, and the
search procedure is inherently biased by the initial solution.
An alternative exploration mechanism is to generate a diverse set of trajectories by stochastically
sampling a learned policy, potentially with additional beam search [Joshi et al., 2019], Monte Carlo
tree search [Fu et al., 2021], dynamic programming [Kool et al., 2021], active search [Hottung et al.,
2022], or simulation-guided search [Choo et al., 2022]. However, intuitively, the generated solutions
tend to remain close to the underlying deterministic policy, implying that the benefits of additional
sampled candidates diminish quickly.
Population-Based RL Populations have already been used in RL to learn diverse behaviors. In a
different context, Gupta et al. [2018], Eysenbach et al. [2019], Hartikainen et al. [2020] and Pong et al.
[2020] use a single policy conditioned on a set of goals as an implicit population for unsupervised
skill discovery. Closer to our approach, another line of work revolves around explicitly storing a
set of distinct policy parameters. Hong et al. [2018], Doan et al. [2020], Jung et al. [2020] and
Parker-Holder et al. [2020] use a population to achieve better coverage of the policy space. However,
they enforce explicit attraction-repulsion mechanisms, a major difference from our
approach, where diversity is a pure byproduct of performance optimization.
Our method is also related to approaches combining RL with evolutionary algorithms [EA; Khadka
and Tumer, 2018, Khadka et al., 2019, Pourchot and Sigaud, 2019], which benefit from sample-
efficient RL policy updates while enjoying evolutionary population-level exploration. However, in these
methods the population is a means to learn a single strong policy, whereas Poppy learns a set of complementary
strategies. More closely related, Quality-Diversity [QD; Pugh et al., 2016, Cully and Demiris, 2018]
is a popular EA framework that maintains a portfolio of diverse policies. Pierrot et al. [2022] have
recently combined RL with a QD algorithm, Map-Elites [Mouret and Clune, 2015]; however, unlike Poppy,
QD methods rely on handcrafted behavioral markers, which are not easily defined in the CO context.
One drawback of population-based RL is its high computational cost. However, recent work has
shown that modern hardware, together with targeted frameworks, enables efficient vectorized population
training [Flajolet et al., 2022], opening the door to a wider range of applications.
3 Methods
3.1 Background and Motivation
RL Formulation A CO problem instance $\rho$ sampled from some distribution $\mathcal{D}$ consists of a discrete
set of $N$ variables (e.g. city locations in TSP). We model a CO problem as a Markov decision process
(MDP) defined by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function $T$, and a reward function
$R$. A state is a trajectory through the problem instance, $\tau_t = (x_1, \dots, x_t) \in \mathcal{S}$ where $x_i \in \rho$, and thus
consists of an ordered list of variables (not necessarily of length $N$). An action, $a \in \mathcal{A} \subseteq \rho$, consists
of choosing the next variable to add; thus, given state $\tau_t = (x_1, \dots, x_t)$ and action $a$, the next state is
$\tau_{t+1} = T(\tau_t, a) = (x_1, \dots, x_t, a)$. Let $\mathcal{S}^* \subseteq \mathcal{S}$ be the set of solutions, i.e. states that comply with
the problem's constraints (e.g., a sequence of cities such that each city is visited once and ends with
the starting city in TSP). The reward function $R : \mathcal{S}^* \to \mathbb{R}$ maps solutions into scalars. We assume
the reward is maximized by the optimal solution (e.g. $R$ returns the negative tour length in TSP).
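To make this formulation concrete, below is a minimal Python sketch of the TSP instantiation of this MDP. It is an illustrative toy rather than the paper's implementation: the helper names (sample_instance, transition, is_solution, reward, rollout) and the uniform rollout policy are assumptions made for the example; a learned policy would replace the uniform distribution over unvisited cities.

```python
# Minimal sketch (not the paper's implementation) of the TSP instance of the MDP above:
# an instance rho is a set of N city coordinates, a state is the ordered list of cities
# visited so far, an action appends an unvisited city, and the reward of a complete
# tour is its negative length.
import numpy as np


def sample_instance(num_cities: int, seed: int = 0) -> np.ndarray:
    """Sample a problem instance rho ~ D: N city locations in the unit square."""
    rng = np.random.default_rng(seed)
    return rng.random((num_cities, 2))


def transition(state: list[int], action: int) -> list[int]:
    """T(tau_t, a): append the chosen city index to the partial trajectory."""
    return state + [action]


def is_solution(state: list[int], num_cities: int) -> bool:
    """A state is in S* once every city has been visited exactly once."""
    return len(state) == num_cities and len(set(state)) == num_cities


def reward(instance: np.ndarray, state: list[int]) -> float:
    """R(tau): negative length of the closed tour (return to the start city)."""
    tour = instance[state + [state[0]]]  # close the loop
    return -float(np.linalg.norm(np.diff(tour, axis=0), axis=1).sum())


def rollout(instance: np.ndarray, policy, seed: int = 0) -> tuple[list[int], float]:
    """Roll out a (possibly stochastic) policy until a full solution is built."""
    rng = np.random.default_rng(seed)
    num_cities = len(instance)
    state = [0]  # start from city 0
    while not is_solution(state, num_cities):
        candidates = [a for a in range(num_cities) if a not in state]
        probs = policy(instance, state, candidates)
        action = int(rng.choice(candidates, p=probs))
        state = transition(state, action)
    return state, reward(instance, state)


if __name__ == "__main__":
    instance = sample_instance(num_cities=10)
    uniform = lambda inst, s, cand: np.full(len(cand), 1.0 / len(cand))
    tour, r = rollout(instance, uniform)
    print(f"tour: {tour}, reward (negative length): {r:.3f}")
```

Note that, consistent with $R$ being defined on $\mathcal{S}^*$ only, the sketch assigns a reward only to complete tours; partial trajectories receive no reward signal.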