
The poor scalability of these DRL methods lies in the costly decoding of CO solutions, which is typically linear in the number of nodes of the input graph. Since the reward in reinforcement learning is determined only after a complete solution has been decoded (via either a chain-rule factorization or iterative refinements), both construction and improvement heuristic learners encounter the sparse reward problem when dealing with large graphs [42, 33, 37]. While such an overhead can be partially alleviated by constructing several parts of the solution in parallel [1] for locally decomposable CO problems³, such as the maximum independent set (MIS) problem [54], how to scale up neural solvers for CO problems in general, including locally non-decomposable ones (such as TSP), remains an open challenge.
In this paper, we address the scalability challenge by proposing a novel framework, DIMES (DIfferentiable MEta Solver), for solving combinatorial optimization problems. Unlike previous DRL-based CO solvers that rely on construction or improvement heuristics, we introduce a compact continuous space to parameterize the underlying distribution of candidate solutions, which allows massively parallel on-policy sampling without the costly decoding process and effectively reduces the variance of the gradients estimated by the REINFORCE algorithm [71] during both the training and fine-tuning phases. We further propose a meta-learning framework over CO problem instances to enable effective initialization of model parameters in the fine-tuning stage. To our knowledge, we are the first to apply meta-learning over a collection of CO problem instances, where each instance graph is treated as one task in a unified framework.
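To make the core idea concrete, the following is a minimal sketch of REINFORCE over a continuous parameterization. It uses a product-of-Bernoullis distribution over binary solutions with a mean-cost baseline for variance reduction; this is a deliberate simplification for illustration, not the paper's actual parameterization, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(theta, cost_fn, n_samples=256, lr=0.1):
    """One REINFORCE update on a product-of-Bernoullis distribution.

    theta: (n,) real-valued logits parameterizing p(x_i = 1) = sigmoid(theta_i).
    cost_fn: maps a batch of binary solutions (n_samples, n) to costs (n_samples,).
    """
    p = 1.0 / (1.0 + np.exp(-theta))
    # Massively parallel on-policy sampling: the whole batch is drawn at
    # once, with no sequential decoding loop over nodes.
    x = (rng.random((n_samples, theta.size)) < p).astype(float)
    cost = cost_fn(x)
    # Mean-cost baseline reduces the variance of the gradient estimate.
    adv = cost - cost.mean()
    # Gradient of log-probability w.r.t. a Bernoulli logit is (x - p).
    grad = (adv[:, None] * (x - p)).mean(axis=0)
    return theta - lr * grad
```

With a toy cost that simply counts selected bits, repeated updates drive all sampling probabilities toward zero; the baseline is what keeps the estimate stable at modest sample counts.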
We should point out that the idea of designing a continuous space for combinatorial optimization problems has been explored by the heatmap approaches in the literature [48, 31, 19, 14, 44]. However,
there are major distinctions between the existing methods and our DIMES. For instance, Fu et al. [19] learn to generate heatmaps via supervised learning (i.e., each training instance is paired with its best solution) [4, 21], which is very costly to obtain on large graphs. DIMES is instead optimized directly with gradients estimated by the REINFORCE algorithm, without any supervision, so it can be trained on large graphs directly. As a result, DIMES scales to graphs with up to tens of thousands of nodes and predicts (nearly) optimal solutions without costly generation of supervised training data or human specification of problem-specific heuristics.
In our experiments, we show that DIMES outperforms strong DRL-based baseline solvers on TSP benchmark datasets and successfully scales up to graphs with tens of thousands of nodes. As a sanity check, we also evaluate our framework on locally decomposable combinatorial optimization problems, namely the Maximal Independent Set (MIS) problem on synthetic graphs and on graphs reduced from satisfiability (SAT) problems. Our experimental results show that DIMES achieves competitive performance compared to neural solvers specially designed for locally decomposable CO problems.
2 Related Work
2.1 DRL-Based Construction Heuristics Learners
Construction heuristics methods create a solution to a CO problem instance in one shot, without further modifications. Bello et al. [7] were the first to tackle combinatorial optimization problems using neural networks and reinforcement learning: they used a Pointer Network (PtrNet) [68] as the policy network and the actor-critic algorithm [41] for training on TSP and Knapsack instances. Improved models have been developed since [13, 42, 63, 14, 46], such as attention models [66] and better DRL algorithms [36, 52, 43, 45, 59, 73, 69], for an extended scope of CO problems such as the Capacitated Vehicle Routing Problem (CVRP) [57], the Job Shop Scheduling Problem (JSSP) [76], the Maximal Independent Set (MIS) problem [36, 1], and the Boolean satisfiability (SAT) problem [75].
Our proposed method belongs to the category of construction heuristics learners in the sense of producing a one-shot solution per problem instance. However, there are major distinctions between previous methods and ours, one of which is how solutions are constructed. Unlike previous methods, which generate solutions via a constructive Markov decision process (MDP) with rather costly decoding steps (adding one unvisited node per step to a partial solution), we introduce a compact continuous space to parameterize the underlying distribution of discrete candidate solutions, allowing efficient sampling from that distribution without costly neural-network decoding.
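For contrast, the sequential decoding loop that DIMES avoids can be sketched as below. Here `policy_logits_fn` is a hypothetical stand-in for a learned policy network; the point is that a tour of n nodes requires n − 1 sequential, non-parallelizable network calls:

```python
import numpy as np

def autoregressive_decode(policy_logits_fn, n_nodes):
    """Schematic of constructive-MDP decoding: one policy call per step,
    each step greedily appending one unvisited node to a partial tour."""
    tour, visited = [0], {0}
    for _ in range(n_nodes - 1):
        logits = policy_logits_fn(tour)  # costly network call every step
        # Mask out already-visited nodes before picking the next one.
        masked = np.array([l if i not in visited else -np.inf
                           for i, l in enumerate(logits)])
        nxt = int(np.argmax(masked))     # greedy choice
        tour.append(nxt)
        visited.add(nxt)
    return tour
```

Each iteration depends on the partial tour produced by the previous one, which is exactly the per-node decoding overhead described above.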
³ A locally decomposable problem is one whose feasibility constraint and objective can be decomposed over locally connected variables (in a graph) [1]. For example, in MIS the objective is a sum of per-node variables and the feasibility constraint is a conjunction of per-edge constraints.