
The poor scalability of these DRL methods lies in the costly decoding of CO solutions, which is typically linear in the number of nodes of the input graph. Since the reward in reinforcement learning is determined only after a complete solution has been decoded (via either a chain-rule factorization or iterative refinements), both construction and improvement heuristic learners encounter the sparse reward problem when dealing with large graphs [42, 33, 37]. While such an overhead can be partially alleviated by constructing several parts of the solution in parallel [1] for locally decomposable CO problems³, such as the maximum independent set (MIS) problem [54], how to scale up neural solvers for CO problems in general, including locally non-decomposable ones (such as TSP), remains an open challenge.
In this paper, we address the scalability challenge by proposing a novel framework, DIMES (DIfferentiable MEta Solver), for solving combinatorial optimization problems. Unlike previous DRL-based CO solvers that rely on construction or improvement heuristics, we introduce a compact continuous space to parameterize the underlying distribution of candidate solutions, which allows massively parallel on-policy sampling without the costly decoding process and effectively reduces the variance of the gradients estimated by the REINFORCE algorithm [71] during both the training and fine-tuning phases. We further propose a meta-learning framework over CO problem instances to enable effective initialization of model parameters in the fine-tuning stage. To our knowledge, we are the first to apply meta-learning over a collection of CO problem instances, where each instance graph is treated as one task in a unified framework.
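To make the core idea concrete, the following is a minimal sketch of REINFORCE over a continuous parameterization. It uses a product-of-Bernoullis distribution over binary solutions with a mean-cost baseline for variance reduction; this is a deliberate simplification for illustration, not the paper's actual parameterization, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(theta, cost_fn, n_samples=256, lr=0.1):
    """One REINFORCE update on a product-of-Bernoullis distribution.

    theta: (n,) real-valued logits parameterizing p(x_i = 1) = sigmoid(theta_i).
    cost_fn: maps a batch of binary solutions (n_samples, n) to costs (n_samples,).
    """
    p = 1.0 / (1.0 + np.exp(-theta))
    # Massively parallel on-policy sampling: the whole batch is drawn at
    # once, with no sequential decoding loop over nodes.
    x = (rng.random((n_samples, theta.size)) < p).astype(float)
    cost = cost_fn(x)
    # Mean-cost baseline reduces the variance of the gradient estimate.
    adv = cost - cost.mean()
    # Gradient of log-probability w.r.t. a Bernoulli logit is (x - p).
    grad = (adv[:, None] * (x - p)).mean(axis=0)
    return theta - lr * grad
```

With a toy cost that simply counts selected bits, repeated updates drive all sampling probabilities toward zero; the baseline is what keeps the estimate stable at modest sample counts.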
We should point out that the idea of designing a continuous space for combinatorial optimization problems has been explored by the heatmap approaches in the literature [48, 31, 19, 14, 44]. However,
there are major distinctions between the existing methods and our DIMES. For instance, Fu et al. [19] learn to generate heatmaps via supervised learning (i.e., each training instance is paired with its best solution) [4, 21], which is very costly to obtain on large graphs. DIMES is instead optimized directly with gradients estimated by the REINFORCE algorithm, without any supervision, so it can be trained on large graphs directly. As a result, DIMES scales to graphs with up to tens of thousands of nodes and predicts (nearly) optimal solutions without costly generation of supervised training data or human specification of problem-specific heuristics.
In our experiments, we show that DIMES outperforms strong DRL-based baseline solvers on TSP benchmark datasets and successfully scales up to graphs with tens of thousands of nodes. As a sanity check, we also evaluate our framework on locally decomposable combinatorial optimization problems, namely the Maximal Independent Set (MIS) problem on synthetic graphs and on graphs reduced from satisfiability (SAT) problems. Our experimental results show that DIMES achieves competitive performance compared to neural solvers specially designed for locally decomposable CO problems.
2 Related Work
2.1 DRL-Based Construction Heuristics Learners
Construction heuristics methods create a solution to a CO problem instance in one shot, without further modifications. Bello et al. [7] were the first to tackle combinatorial optimization problems using neural networks and reinforcement learning: they used a Pointer Network (PtrNet) [68] as the policy network and the actor-critic algorithm [41] for training on TSP and Knapsack instances. Improved models have been developed since [13, 42, 63, 14, 46], such as attention models [66] and better DRL algorithms [36, 52, 43, 45, 59, 73, 69], for an extended scope of CO problems such as the Capacitated Vehicle Routing Problem (CVRP) [57], the Job Shop Scheduling Problem (JSSP) [76], the Maximal Independent Set (MIS) problem [36, 1], and the Boolean satisfiability (SAT) problem [75].
Our proposed method belongs to the category of construction heuristics learners in the sense of producing a one-shot solution per problem instance. However, there are major distinctions between previous methods and ours, one of which is how solutions are constructed. Unlike previous methods, which generate solutions via a constructive Markov decision process (MDP) with rather costly decoding steps (adding one unvisited node per step to a partial solution), we introduce a compact continuous space to parameterize the underlying distribution of discrete candidate solutions, allowing efficient sampling from that distribution without costly neural-network decoding.
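For contrast, the sequential decoding loop that DIMES avoids can be sketched as below. Here `policy_logits_fn` is a hypothetical stand-in for a learned policy network; the point is that a tour of n nodes requires n − 1 sequential, non-parallelizable network calls:

```python
import numpy as np

def autoregressive_decode(policy_logits_fn, n_nodes):
    """Schematic of constructive-MDP decoding: one policy call per step,
    each step greedily appending one unvisited node to a partial tour."""
    tour, visited = [0], {0}
    for _ in range(n_nodes - 1):
        logits = policy_logits_fn(tour)  # costly network call every step
        # Mask out already-visited nodes before picking the next one.
        masked = np.array([l if i not in visited else -np.inf
                           for i, l in enumerate(logits)])
        nxt = int(np.argmax(masked))     # greedy choice
        tour.append(nxt)
        visited.add(nxt)
    return tour
```

Each iteration depends on the partial tour produced by the previous one, which is exactly the per-node decoding overhead described above.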
³ A locally decomposable problem is one whose feasibility constraint and objective can be decomposed over locally connected variables (in a graph) [1]. For example, in MIS the objective is a sum of per-node variables and the feasibility constraint is a conjunction of per-edge constraints.