DIMES A Differentiable Meta Solver for Combinatorial Optimization Problems Ruizhong Qiuy

2025-04-27 0 0 637.39KB 22 页 10玖币
侵权投诉
DIMES: A Differentiable Meta Solver for
Combinatorial Optimization Problems
Ruizhong Qiu∗†
Department of Computer Science
University of Illinois Urbana–Champaign
rq5@illinois.edu
Zhiqing Sun
, Yiming Yang
Language Technologies Institute
Carnegie Mellon University
{zhiqings,yiming}@cs.cmu.edu
Abstract
Recently, deep reinforcement learning (DRL) models have shown promising results
in solving NP-hard Combinatorial Optimization (CO) problems. However, most
DRL solvers can only scale to a few hundreds of nodes for combinatorial opti-
mization problems on graphs, such as the Traveling Salesman Problem (TSP). This
paper addresses the scalability challenge in large-scale combinatorial optimization
by proposing a novel approach, namely, DIMES. Unlike previous DRL methods
which suffer from costly autoregressive decoding or iterative refinements of discrete
solutions, DIMES introduces a compact continuous space for parameterizing the
underlying distribution of candidate solutions. Such a continuous space allows
stable REINFORCE-based training and fine-tuning via massively parallel sampling.
We further propose a meta-learning framework to enable effective initialization
of model parameters in the fine-tuning stage. Extensive experiments show that
DIMES outperforms recent DRL-based methods on large benchmark datasets for
Traveling Salesman Problems and Maximal Independent Set problems.
1 Introduction
Combinatorial Optimization (CO) is a fundamental problem in computer science. It has important
real-world applications such as shipment planning, transportation, robots routing, biology, circuit
design, and more [
67
]. However, due to NP-hardness, a significant portion of the CO problems suffer
from an exponential computational cost when using traditional algorithms. As a well-known example,
the Traveling Salesman Problem (TSP) has been intensively studied [
35
,
60
] for finding the most
cost-effective tour over an input graph where each node is visited exactly once before finally returning
to the start node. Over the past decades, significant effort has been made for designing more efficient
heuristic solvers [5, 20] to approximate near-optimal solutions in a reduced search space.
Recent development in deep reinforcement learning (DRL) has shown promises in solving CO
problems without manual injection of domain-specific expert knowledge [
7
,
42
,
45
]. The appeal of
neural methods is because they can learn useful patterns (such as graph motifs) from data, which
might be difficult to discover by hand. A typical category of DRL solvers, namely construction
heuristics learners, [
7
,
42
] uses a Markov decision process (MDP) to grow partial solutions by adding
one new node per step, with a trained strategy which assigns higher probabilities to better solutions.
Another category of DRL-based solvers, namely improvement heuristics learners [
10
,
72
], iteratively
refines a feasible solution with neural network-guided local Operations Research (OR) operations
[
64
]. A major limitation of these DRL solvers lies in their scalability on large instances. For example,
current DRL solvers for TSP can only scale to graphs with up to hundreds of nodes.
Equal contribution.
Work was done during internship at CMU.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04123v2 [cs.LG] 25 Oct 2022
The bad scalability of these DRL methods lies in the fact that they suffer from costly decoding of
CO solutions, which is typically linear in the number of nodes in the input graph. Since the reward
of reinforcement learning is determined after decoding a complete solution (with either a chain rule
factorization or iterative refinements), either construction or improvement heuristic learners would
encounter the sparse reward problem when dealing with large graphs [
42
,
33
,
37
]. While such an
overhead can be partially alleviated by constructing several parts of the solution in parallel [
1
] for
locally decomposable CO problems
3
, such as for maximum independent set (MIS) problems [
54
],
how to scale up neural solvers for CO problems in general, including the locally non-decomposable
ones (such as TSP) is still an open challenge.
In this paper, we address the scalability challenge by proposing a novel framework, namely DIMES
(DIfferentiable MEta Solver), for solving combinatorial optimization problems. Unlike previous
DRL-based CO solvers that rely on construction or improvement heuristics, we introduce a compact
continuous space to parameterize the underlying distribution of candidate solutions, which allows
massively parallel on-policy sampling without the costly decoding process, and effectively reduces
the variance of the gradients by the REINFORCE algorithm [
71
] during both training and fine-tuning
phases. We further propose a meta-learning framework for CO over problem instances to enable
effective initialization of model parameters in the fine-tuning stage. To our knowledge, we are the
first to apply meta-learning over a collection of CO problem instances, where each instance graph is
treated as one of a collection tasks in a unified framework.
We need to point out that the idea of designing a continuous space for combinatorial optimization
problems has been tried by the heatmaps approaches in the literature [
48
,
31
,
19
,
14
,
44
]. However,
there are major distinctions between the existing methods and our DIMES. For instance, Fu et al.
[19]
learn to generate heatmaps via supervised learning (i.e., each training instance is paired with its
best solution) [
4
,
21
], which is very costly to obtain on large graphs. DIMES is directly optimized
with gradients estimated by the REINFORCE algorithm without any supervision, so it can be trained
on large graphs directly. As a result, DIMES can scale to large graphs with up to tens of thousands of
nodes, and predict (nearly) optimal solutions without the need for costly generation of supervised
training data or human specification of problem-specific heuristics.
In our experiments, we show that DIMES outperforms strong baselines among DRL-based solvers on
TSP benchmark datasets, and can successfully scale up to graphs with tens of thousands of nodes. As
a sanity check, we also evaluate our framework with locally decomposable combinatorial optimization
problems, including Maximal Independent Set (MIS) problem for synthetic graphs and graphs reduced
from satisfiability (SAT) problems. Our experimental results show that DIMES achieve competitive
performance compared to neural solvers specially designed for locally decomposable CO problems.
2 Related Work
2.1 DRL-Based Construction Heuristics Learners
Construction heuristics methods create a solution of CO problem instance in one shot without further
modifications. Bello et al.
[7]
are the first to tackle combinatorial optimization problems using
neural networks and reinforcement learning. They used a Pointer Network (PtrNet) [
68
] as the policy
network and used the actor-critic algorithm [
41
] for training on TSP and KnapSack instances. Further
improved models have been developed afterwards [
13
,
42
,
63
,
14
,
46
], such as attention models
[
66
], better DRL algorithms [
36
,
52
,
43
,
45
,
59
,
73
,
69
], for an extended scope of CO problems such
as Capacitated Vehicle Routing Problem (CVRP) [
57
], Job Shop Scheduling Problem (JSSP) [
76
],
Maximal Independent Set (MIS) problem [36, 1], and boolean satisfiability problem (SAT) [75].
Our proposed method in this paper belongs to the category of construction heuristics learners in the
sense of producing a one-shot solution per problem instance. However, there are major distinctions
between previous methods and ours. One distinction is how to construct solutions. Unlike previous
methods which generate the solutions via a constructive Markov decision process (MDP) with rather
costly decoding steps (adding one un-visited node per step to a partial solution), we introduce a
compact continuous space to parameterize the underlying distribution of discrete candidate solutions,
and to allow efficient sampling from that distribution without costly neural network-involved decoding.
3
Locally decomposable problem refers to the problem where the feasibility constraint and the objective can
be decomposed by locally connected variables (in a graph) [1].
2
Another distinction is about the training framework. For instance, Drori et al.
[14]
proposes a similar
solution decoding scheme but employs a DRL framework to train the model. Instead, we propose a
much more effective meta-learning framework to train our model, enabling DIMES to be trained on
large graphs directly.
2.2 DRL-Based Improvement Heuristics Learners
In contrast to construction heuristics, DRL-based improvement heuristics methods train a neural
network to iteratively improve the quality of the current solution until computational budget runs
out. Such DRL-based improvement heuristics methods are usually inspired by classical local
search algorithms such as 2-opt [
11
] and the large neighborhood search (LNS) [
65
], and have been
demonstrated with outstanding results by many previous work [
72
,
51
,
70
,
12
,
10
,
26
,
74
,
53
,
29
,
37
].
Improvement heuristics methods generally show better performance than construction heuristics
methods but are slower in computation in return.
2.3 Supervised Learners for CO Problems
Vinyals et al.
[68]
trained a Pointer Network to predict a TSP solution based on supervision signals
from the Held–Karp algorithm [
6
] or approximate algorithms. Li et al.
[48]
and Joshi et al.
[31]
trained a graph convolutional network to predict the possibility of each node or edge to be included
in the optimal solutions of MIS and TSP problems, respectively. Recently, Joshi et al.
[32]
showed
that unsupervised reinforcement learning leads to better emergent generalization over various sized
graphs than supervised learning. Our work in this paper provides further evidence for the benefits of
the unsupervised training, or more specifically, unsupervised generation of heatmaps [
48
,
31
,
19
,
44
],
for combinatorial optimization problems.
3 Proposed Method
3.1 Formal Definitions
Following a conventional notation [
61
] we define
Fs
as the set of discrete feasible solutions for a
CO problem instance
s
, and
cs:FsR
as the cost function for feasible solutions
f∈ Fs
. The
objective is to find the optimal solution for a given instance s:
f
s= argmin
f∈Fs
cs(f).(1)
For the Traveling Salesman Problem (TSP),
Fs
is the set of all the tours that visit each node exactly
once and returns to the starting node at the end, and
cs
calculates the cost for each tour
f∈ Fs
by
summing up the edge weights in the tour. The size of
Fs
for TSP is
n!
for a graph with
n
nodes.
For the Maximal Independent Set (MIS) problem,
Fs
is a subset of the power set
Ss={0,1}n
and
consists of all the independent subsets where each node of a subset has no connection to any other
node in the same subset, and cscalculates the negation of the size of each independent subset.
We parameterize the solution space with a continuous and differentiable vector
θR|Vs|
, where
Vs
denotes the variables in the problem instance
s
(e.g., edges in TSP and nodes in MIS), and estimates
the probability of each feasible solution fas:
pθ(f|s)exp |Vs|
X
i=1
fi·θisubject to f∈ Fs.(2)
where
pθ
is an energy function over the discrete feasible solution space,
f
is a
|Vs|
-dimensional
vector with element
fi∈ {0,1}
indicating whether the
ith
variable is included in feasible solution
f
,
and the higher value of θimeans a higher probability for the ith variable produced by pθ(f|s).
3.2 Gradient-Based Optimization
When the combinatorial problem is locally decomposable, such a MIS, a penalty loss [
34
,
2
] can be
added to suppress the unfeasible solutions, e.g.:
`Erd˝
os(θ|s) = X
f∈Ss
[pθ(f|s)·(cs(f) + β·(f6∈ Fs))] .(3)
3
where
β > maxf∈Fscs(f)
. The objective function
`Erd˝
os
can thus be calculated analytically and
enable end-to-end training. However, this is not always possible for general structured combinatorial
problems such as TSP
4
. Therefore, we propose to directly optimize the expected cost over the
underlying population of feasible solutions, which is defined as:
`p(θ|s) = Efpθ[cs(f)] .(4)
Optimizing this objective requires efficient sampling, with which REINFORCE-based [
71
] gradient
estimation can be calculate. Nevertheless, a common practice to sample from the energy
pθ
functions
requires MCMC [
47
], which is not efficient enough. Hence we propose to design an auxiliary
distribution
qθ
over the feasible solutions
Fs
, such that the following conditions hold: 1) sampling
from
qθ
is efficient, and 2)
qθ
and
pθ
should convergence to the same optimal
θ
. Then, we can
replace pθby qθin our objective function as:
`q(θ|s) = Efqθ[cs(f)] ,(5)
and get the REINFORCE-based update rule as:
θEfqθ[cs(f)] = Efqθ[(cs(f)b(s))θlog qθ(f)],(6)
where
b(s)
denotes a baseline function that does not depend on
f
and estimates the expected cost
to reduce the variance of the gradients. In this paper, we use a sampling-based baseline function
proposed by Kool et al. [43].
Next, we specify the auxiliary distributions for TSP and MIS, respectively. For brevity, we omit the
conditional notations of sfor all probability formulas in the rest of the paper.
3.2.1 Auxiliary Distribution for TSP
For TSP on an
n
-node graph, each feasible solution
f
consists of
n
edges forming a tour, which
can be specified as a permutation
πf
of
n
nodes, where
πf(0) = πf(n)
is the start/end node, and
πf(i)6=πf(j)
for any
i, j
with
0i, j < n
and
i6=j
. Note that for a single solution
f
,
n
different
choices of the start node
πf(0)
correspond to
n
different permutations
πf
. In this paper, we choose
the start node πf(0) randomly with a uniform distribution:
qTSP(πf(0) = j) := 1
nfor any node j;(7)
qTSP
θ(f) :=
n1
X
j=0
1
n·qTSP(πf|πf(0) = j).(8)
Given the start node πf(0), we factorize the probability via chain rule in the visiting order:
qTSP(πf|πf(0)) :=
n1
Y
i=1
qTSP(πf(i)|πf(< i)).(9)
Since the variables in TSP are edges, we let
θi,j
denote the
θ
value of edge from node
i
to node
j
for
notational simplicity, i.e., we use a matrix
θRn×n
to parameterize the probabilistic distribution of
n!discrete feasible solutions. We define:
qTSP(πf(i)|πf(< i)) := exp(θπf(i1)f(i))
Pn
j=iexp(θπf(i1)f(j)).(10)
Here a higher valued
θi,j
corresponds to a higher probability for the edge from node
i
to node
j
to
be sampled. The compact, continuous and differentiable space of
θ
allows us to leverage gradient-
based optimization without costly MDP-based construction of feasible solutions, which has been a
bottleneck for scaling up in representative DRL solvers so far. In other words, we also no longer
need costly MCMC-based sampling for optimizing our model due to the chain-rule decomposition.
Instead, we use autoregressive factorization for sampling from the auxiliary distribution, which is
faster than sampling with MCMC from the distribution defined by the energy function.
4TSP has a global constraint of forming a Hamiltonian cycle.
4
3.2.2 Auxiliary Distribution for MIS
For the Maximal Independent Set (MIS) problem, the feasible solution is a set of independent nodes,
which means that none of the node has any link to any other node in the same set. To ease the analysis,
we further impose a constraint to the MIS solutions such that each set is not a proper subset of any
other independent set in the feasible domain.
To enable the chain-rule decomposition in probability estimation, we introduce
a
as an ordering of
the independent nodes in solution
f
, and
{a}f
as the set of all possible orderings of the nodes in
f
.
The chain rule applied to acan thus be defined as:
qMIS
θ(f) = X
a∈{a}f
qMIS(a),(11)
qMIS(a) =
|a|
Y
i=1
qMIS(ai|a<i) =
|a|
Y
i=1
exp(θai)
Pj∈G(a<i)exp(θj).
where
G(a<i)
denotes the set of available nodes for growing partial solution
(a1, . . . , ai1)
, i.e., the
nodes that have no edge to any nodes in
{a1, . . . , ai1}
. Notice again that the parameterization space
for MIS
θRn
(where
n
denotes the number of nodes in the graph) is compact, continuous and
differentiable, which allows efficient gradient-driven optimization.
Due to the space limit, we leave the proof of the convergence between
pθ
and
qθ
(i.e.,
qTSP
θ
and
qMIS
θ
)
to the appendix.
3.3 Meta-Learning Framework
Model-Agnostic Meta-Learning (MAML) [
18
] is originally proposed for few-shot learning. In the
MAML framework, a model is first trained on a collection of tasks simultaneously, and then adapts
its model parameters to each task. The standard MAML uses second-order derivatives in training,
which are costly to compute. To reduce computation burden, the authors also propose first-order
approximation that does not require second-order derivatives.
Inspired by MAML, we train a graph neural network (GNN) over a collection of problem instances
in a way that the it can capture the common nature across all the instances, and adapt its distribution
parameters effectively to each instance based on the features/structure of each input graph. Let
FΦ
be the graph neural network with parameter
Φ
, and denote by
κs
the input features of an instance
graph
s
in collection
C
, by
As
the adjacency matrix of the input graph, and by
θs:= FΦ(κs,As)
the instance-specific initialization of distribution parameters. The vanilla loss function is defined as
the expected cost of the solution for any graph in the collection as:
L(Φ| C) = Es∈C`q(θs) = Es∈C`q(FΦ(κs,As)).(12)
The gradient-based updates can thus be written as:
ΦL(Φ| C) = Es∈C [Φθs· ∇θs`q(θs)]
=Es∈C [ΦFΦ(κs,As)· ∇θs`q(θs)] .(13)
where
θs`q(θs)
is estimated using the REINFORCE algorithm (Equation 6). Since
`q
does not
depend on the ground-truth labels, we can further fine-tune neural network parameters on each single
test instance with REINFORCE-based updates, which is referred to as active search [7, 28].
Specifically, the fine-tuned parameters
Φ(T)
s
is computed using one or more gradient updates for each
graph instance
s
. For example, when adapting to a problem instance
s
using
T
gradient updates with
learning rate α, we have:
Φ(0)
s=Φ,Φ(t)
s=Φ(t1)
sαΦ(t1)
sL(Φ(t1)
s| {s})for 1tT, (14)
θ(T)
s=FΦ(T)
s(κs,As).(15)
Here we use AdamW [
50
] in our experiments. Next, we optimize the performance of the graph neural
network with updated parameters (i.e., Φ(T)
s) with respect to Φ, with a meta-objective:
Lmeta(Φ| C) = Es∈C `q(θ(T)
s|s),(16)
5
摘要:

DIMES:ADifferentiableMetaSolverforCombinatorialOptimizationProblemsRuizhongQiuyDepartmentofComputerScienceUniversityofIllinoisUrbana–Champaignrq5@illinois.eduZhiqingSun,YimingYangLanguageTechnologiesInstituteCarnegieMellonUniversity{zhiqings,yiming}@cs.cmu.eduAbstractRecently,deepreinforcementlear...

展开>> 收起<<
DIMES A Differentiable Meta Solver for Combinatorial Optimization Problems Ruizhong Qiuy.pdf

共22页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:22 页 大小:637.39KB 格式:PDF 时间:2025-04-27

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 22
客服
关注