Travel the Same Path: A Novel TSP Solving Strategy
Pingbang Hu
Department of Computer Science
University of Michigan
pbb@umich.edu
Abstract
In this paper, we provide a novel strategy for solving the Traveling Salesman Problem (TSP), a famous combinatorial optimization problem studied intensively in the TCS community. In particular, we consider the imitation learning framework, which helps a deterministic algorithm make good choices whenever it needs to, resulting in a speedup while maintaining the exactness of the solution and without suffering from unpredictability or potentially large deviations.
Furthermore, we demonstrate a strong generalization ability of a graph neural network trained under the imitation learning framework. Specifically, the model is capable of solving large TSP instances faster than the baseline, despite having only seen small TSP instances during training.
Keywords: Traveling salesman problem, Graph Neural Network, Imitation Training, Reinforcement Learning, Integer Programming, Embedding learning, Combinatorial Optimization, Exact solver.
1 Introduction
The traveling salesman problem (TSP) can be described as follows: given a list of cities and the distances between each pair of cities, find the shortest possible route that visits each city exactly once and then returns to the origin city. Formally, we are given an undirected weighted graph $G = (\mathcal{V}, \mathcal{E})$, i.e., an ordered pair of a node set $\mathcal{V}$ and an edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$, where $G$ is equipped with a spatial structure: each edge has its own weight and each node has its own coordinates. We want to find a simple cycle that visits every node exactly once while having the smallest cost.
We will utilize GCNN (Graph Convolutional Neural Network), a particular kind of GNN, together
with imitation learning to solve TSP in an interesting and inspiring way. In particular, we focus on
the generalization ability of models trained on small-sized problem instances.1
2 Related Works
There has already been extensive work done to optimize TSP solvers both theoretically and practically.
We have done extensive research into other solvers; the papers most relevant to our project are
summarized below.
Transformer Network for TSP [6]. The main focus of this paper is to detail the application of deep reinforcement learning to a Transformer architecture originally created for Natural Language Processing (NLP). Unlike our proposed model, this solver does not solve TSP exactly but instead learns heuristics that have very low error rates (0.004% for TSP50 and 0.39% for TSP100).
These heuristics can run over a TSP problem much faster than a traditional solver while still achieving similar results.

1 The code is available at https://github.com/sleepymalc/Travel-the-Same-Path.
Exact Combinatorial Optimization with GCNNs [11]. This paper serves as one of the backbones of our research; its main focus is to detail how MILPs can potentially be solved much faster than by a traditional solver using GNNs (specifically GCNNs). It did this by training its model with imitation learning (using the strong branching expert rule) and was able to effectively produce outputs for problem instances much larger than those it was trained on.
State of the Art Exact Solver. There has been a lot of progress on the symmetric TSP in the last century. As the number of nodes increases, the number of potential solutions explodes super-polynomially (at least exponentially). This makes TSP difficult along two dimensions: finding a globally shortest route, and reducing the computational cost of finding this route. Concorde [9], written in ANSI C, is widely recognized as the fastest state-of-the-art (SOTA) exact TSP solver for large instances.
3 Preliminary
3.1 Integer Linear Programming Formulation of TSP
We first formulate TSP as an Integer Linear Program (ILP). Given an undirected weighted graph $G = (\mathcal{V}, \mathcal{E})$, we label the nodes $1, \dots, n$ and define
$$x_{ij} := \begin{cases} 1, & \text{if } (i, j) \in \mathcal{E}' \\ 0, & \text{if } (i, j) \in \mathcal{E} \setminus \mathcal{E}', \end{cases}$$
where $\mathcal{E}' \subseteq \mathcal{E}$ is a variable which can be viewed as a compact representation of all variables $x_{ij}$, $\forall i, j$. Furthermore, denoting the weight on edge $(i, j)$ by $c_{ij}$, a particular TSP problem instance can be formulated as follows.
$$\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} c_{ij} x_{ij} \\
\text{s.t.} \quad & \sum_{i=1,\, i \neq j}^{n} x_{ij} = 1, && j = 1, \dots, n; \\
& \sum_{j=1,\, j \neq i}^{n} x_{ij} = 1, && i = 1, \dots, n; \\
& u_i - u_j + n x_{ij} \leq n - 1, && 2 \leq i \neq j \leq n; \\
& 1 \leq u_i \leq n - 1, && 2 \leq i \leq n; \\
& x_{ij} \in \{0, 1\}, && i, j = 1, \dots, n; \\
& u_i \in \mathbb{Z}, && i = 2, \dots, n.
\end{aligned} \tag{1}$$
This is the Miller-Tucker-Zemlin formulation [20]. Note that in our case, since we are solving TSP exactly, all variables are integers. This type of integer linear programming is sometimes known as pure integer programming.
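As an illustration, the following is a minimal sketch of how Equation 1 could be built for a Euclidean instance. It uses the PuLP modeling library for brevity; the paper's own pipeline exports the model via CPLEX instead, so the function name and library choice here are assumptions, not the authors' implementation.

```python
import itertools
import math
import pulp  # assumed available; any ILP modeling library works similarly

def build_mtz_model(coords):
    """Construct the Miller-Tucker-Zemlin ILP of Equation 1 for one TSP instance.

    `coords` is a list of (x, y) node coordinates; edge weights c_ij are
    Euclidean distances."""
    n = len(coords)
    arcs = [(i, j) for i, j in itertools.permutations(range(n), 2)]
    c = {(i, j): math.dist(coords[i], coords[j]) for (i, j) in arcs}

    prob = pulp.LpProblem("TSP_MTZ", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", arcs, cat="Binary")                  # x_ij in {0, 1}
    u = pulp.LpVariable.dicts("u", range(1, n), lowBound=1, upBound=n - 1,
                              cat="Integer")                            # MTZ order variables

    prob += pulp.lpSum(c[a] * x[a] for a in arcs)                       # objective
    for j in range(n):                                                  # exactly one incoming arc
        prob += pulp.lpSum(x[i, j] for i in range(n) if i != j) == 1
    for i in range(n):                                                  # exactly one outgoing arc
        prob += pulp.lpSum(x[i, j] for j in range(n) if j != i) == 1
    for i in range(1, n):                                               # subtour elimination
        for j in range(1, n):
            if i != j:
                prob += u[i] - u[j] + n * x[i, j] <= n - 1
    return prob
```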
3.2 Solving the Integer Linear Program
Since integer programming is NP-hard, no polynomial-time algorithm is known that solves it exactly. Hence, the modern approach is to relax the integrality constraint, which turns Equation 1 into a continuous linear program (LP). The LP solution provides a lower bound on Equation 1, since it is a relaxation and we are minimizing. Since an LP is a convex optimization problem, many polynomial-time algorithms can solve the relaxed version. After obtaining a relaxed solution, if it respects the integrality constraint, it is indeed a solution to Equation 1. If not, we can simply split the original relaxed LP into two sub-problems by partitioning the feasible region according to a variable that does not respect integrality in the current relaxed LP solution $x^\ast$,
$$x_i \leq \lfloor x_i^\ast \rfloor \;\lor\; x_i \geq \lceil x_i^\ast \rceil, \quad \exists\, i \text{ s.t. } x_i^\ast \notin \mathbb{Z}. \tag{2}$$
We see that by adding these additional constraints to the two sub-problems respectively, we obtain a recursive algorithm called Branch-and-Bound [26]. The branch-and-bound algorithm is widely used to solve integer programming problems. The key step in branch-and-bound is selecting a non-integer variable to branch on in Equation 2. As one can expect, some choices may reduce the recursive search tree significantly [2]; hence branching rules are at the core of modern combinatorial optimization solvers, and they have been the focus of extensive research [18, 22, 10, 1].
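The recursion can be summarized in a short sketch. This is not the solver used in the paper; `solve_lp_relaxation`, `pick_branching_variable`, and `lp.with_bound` are hypothetical placeholders standing in for an LP oracle, a branching rule (Section 3.3), and constraint addition as in Equation 2.

```python
import math

def branch_and_bound(lp, incumbent=(math.inf, None), eps=1e-6):
    """Schematic LP-based branch-and-bound over the relaxation of Equation 1."""
    value, x = solve_lp_relaxation(lp)             # relax integrality, solve the LP
    if x is None or value >= incumbent[0]:
        return incumbent                           # infeasible, or pruned by the bound
    fractional = [i for i, v in enumerate(x) if abs(v - round(v)) > eps]
    if not fractional:
        return (value, x)                          # integral solution: new incumbent
    i = pick_branching_variable(fractional, lp, x)  # the branching rule to be learned
    # Split the feasible region as in Equation 2: x_i <= floor(x*_i)  or  x_i >= ceil(x*_i)
    incumbent = branch_and_bound(lp.with_bound(i, upper=math.floor(x[i])), incumbent, eps)
    incumbent = branch_and_bound(lp.with_bound(i, lower=math.ceil(x[i])), incumbent, eps)
    return incumbent
```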
3.3 Branching Strategy
There are several popular strategies [3] used in modern solvers.
Strong branching. Strong branching is guaranteed to result in the smallest recursive tree: before branching, it computes the expected bound improvement for each candidate variable by solving the two child LPs of every candidate. However, this is extremely computationally expensive [4].
Hybrid branching. Hybrid branching computes strong branching scores at the beginning of the solving process, but gradually switches to other methods such as the conflict score, most-infeasible branching, or other hand-crafted combinations of the above [3, 1].
Pseudocost branching. This is the default branching strategy used in SCIP. It keeps track, for each variable $x_i$, of the change in the objective function observed whenever that variable was previously chosen as the branching variable; it then selects the variable predicted to yield the largest change in the objective based on these past observations [22].
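To make the bookkeeping concrete, here is a toy sketch of a pseudocost tracker. The exact score SCIP uses (and its initialization) is more involved; this class and its method names are illustrative assumptions only.

```python
from collections import defaultdict

class PseudocostBrancher:
    """Toy pseudocost bookkeeping: per variable, the average objective gain
    per unit of fractionality observed on past branchings."""

    def __init__(self):
        self.gain = defaultdict(float)
        self.count = defaultdict(int)

    def record(self, var, fractionality, objective_gain):
        # update statistics after branching on `var` and re-solving the child LP
        self.gain[var] += objective_gain / max(fractionality, 1e-9)
        self.count[var] += 1

    def score(self, var):
        return self.gain[var] / self.count[var] if self.count[var] else 0.0

    def choose(self, candidates):
        # branch on the fractional variable with the largest predicted improvement
        return max(candidates, key=self.score)
```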
4 Problem Formulation
To solve TSP via ILP efficiently, we use the branch-and-bound algorithm. Specifically, we want to take advantage of the fast inference time and the learning ability of a model, so we choose to learn the most powerful branching strategy known: strong branching. Our objective is then to learn this branching strategy without its expensive evaluation. Since this is a discrete-time control process, we model the problem as a Markov Decision Process (MDP) [15].
4.1 Markov Decision Process (MDP)
A regular Markov decision process is a tuple $\mathcal{M} := (\mathcal{S}, \mathcal{A}, p_{\text{init}}, p_{\text{trans}}, R)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $p_{\text{init}} \colon \mathcal{S} \to \mathbb{R}_{\geq 0}$ the initial state distribution, $p_{\text{trans}} \colon \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$ the state transition distribution, and $R \colon \mathcal{S} \to \mathbb{R}$ the reward function. One thing to note is that the reward function $R$ need not be deterministic. In other words, we can define $R$ as a random function that takes a value based on a particular state in $\mathcal{S}$ with some randomness. Note that if $R$ in $\mathcal{M}$ is equipped with any kind of randomness, we can write the reward $r_t$ at time $t$ as $r_t \sim p_{\text{reward}}(r_t \mid s_{t-1}, a_{t-1}, s_t)$. This can be converted into an equivalent Markov decision process $\mathcal{M}'$ with a deterministic reward function $R'$, where the randomness is integrated into parts of the states. With an action policy $\pi \colon \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$ such that the action $a_t$ taken at time $t$ is determined by $a_t \sim \pi(a_t \mid s_t)$, an MDP can be unrolled to produce a trajectory of state-action pairs $\tau = (s_0, a_0, s_1, a_1, \dots)$ which obeys the joint distribution
$$\tau \sim \underbrace{p_{\text{init}}(s_0)}_{\text{initial state}} \prod_{t=0}^{\infty} \underbrace{\pi(a_t \mid s_t)}_{\text{next action}} \, \underbrace{p_{\text{trans}}(s_{t+1} \mid a_t, s_t)}_{\text{next state}}.$$
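The factorization above corresponds to a simple sampling loop. The sketch below assumes `p_init`, `p_trans`, `reward`, and `policy` are sampler callables (not explicit densities) and truncates the infinite product at a finite horizon; it is illustrative rather than part of the paper's pipeline.

```python
def rollout(p_init, p_trans, reward, policy, horizon=100):
    """Unroll an MDP into a trajectory tau = (s_0, a_0, s_1, a_1, ...)."""
    s = p_init()                        # s_0 ~ p_init
    trajectory, total_reward = [], 0.0
    for _ in range(horizon):            # finite-horizon truncation of the product above
        a = policy(s)                   # a_t ~ pi(. | s_t)
        s_next = p_trans(s, a)          # s_{t+1} ~ p_trans(. | a_t, s_t)
        total_reward += reward(s_next)  # r_{t+1} = R(s_{t+1})
        trajectory.append((s, a))
        s = s_next
    return trajectory, total_reward
```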
4.2 Partially Observable Markov Decision Process (PO-MDP)
Following the same idea as the MDP, the PO-MDP setting deals with the case where complete information about the current MDP state $s_t$ is unavailable or not necessary for decision-making [27]. Instead, in our case, only a partial observation $o \in \Omega$ is available, where $\Omega$ is called the partial state space. We can take an active perspective on the above model; namely, we are merely applying an observation function $O \colon \mathcal{S} \to \Omega$ to the current state $s_t$ at each time step $t$. Hence, we define a PO-MDP $\widetilde{\mathcal{M}}$ as a tuple $\widetilde{\mathcal{M}} := (\mathcal{S}, \mathcal{A}, p_{\text{init}}, p_{\text{trans}}, R, O)$. Within this setup, a trajectory of the PO-MDP takes the form $\tau = (o_0, r_0, a_0, o_1, r_1, a_1, \dots)$, where $o_t := O(s_t)$ and $r_t := R(s_t)$. It is important to note that $r_t$ still depends on the state of the PO-MDP, not the observation. We introduce a convenience variable $h_t := (o_0, r_0, a_0, \dots, o_t, r_t) \in \mathcal{H}$, which represents the PO-MDP history at time step $t$ without the action $a_t$. Due to the non-Markovian nature of the trajectories, $(o_{t+1}, r_{t+1}) \not\perp h_{t-1} \mid o_t, r_t, a_t$, the decision-maker must take the whole history of observations, rewards, and actions into account to decide on an optimal action at the current time step $t$. The action policy for a PO-MDP therefore takes the form $\widetilde{\pi} \colon \mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}$ such that $a_t \sim \widetilde{\pi}(a_t \mid h_t)$.
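The only difference from the MDP rollout above is that the policy conditions on the growing history $h_t$ rather than on $s_t$. A minimal sketch, again with all arguments assumed to be sampler callables:

```python
def pomdp_rollout(p_init, p_trans, reward, observe, policy, horizon=100):
    """PO-MDP rollout: the policy never sees s_t, only the history
    h_t = (o_0, r_0, a_0, ..., o_t, r_t)."""
    s = p_init()
    history = []
    for _ in range(horizon):
        o, r = observe(s), reward(s)    # o_t = O(s_t), r_t = R(s_t)
        history += [o, r]
        a = policy(tuple(history))      # a_t ~ pi~(. | h_t)
        history.append(a)
        s = p_trans(s, a)               # s_{t+1} ~ p_trans(. | a_t, s_t)
    return history
```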
4.3 Markov Control Problem
We define the MDP control problem as that of finding a policy $\pi \colon \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$ which is optimal with respect to the expected total reward. That is,
$$\pi^\ast = \operatorname*{arg\,max}_{\pi} \lim_{T \to \infty} \mathbb{E}_\tau\!\left[ \sum_{t=0}^{T} r_t \right],$$
where $r_t := R(s_t)$. To generalize this to the PO-MDP control problem, the objective is similarly to find a policy $\widetilde{\pi} \colon \mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}$ that maximizes the expected total reward. Slightly abusing notation, we simply denote this learned policy by $\widetilde{\pi}$, where the objective function is exactly the same as in the MDP case.
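In practice the expectation over $\tau$ can only be estimated by sampling. A hedged sketch of a Monte Carlo estimate of this objective, assuming a rollout function such as the one sketched in Section 4.1:

```python
import statistics

def estimate_return(rollout_fn, policy, episodes=100):
    """Monte Carlo estimate of E_tau[sum_t r_t], the quantity pi* maximizes.

    `rollout_fn(policy)` is assumed to return (trajectory, total_reward),
    e.g. the finite-horizon MDP rollout sketch above."""
    returns = [rollout_fn(policy)[1] for _ in range(episodes)]
    return statistics.mean(returns), statistics.stdev(returns)
```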
5 Methodology
Since the branch-and-bound variable selection problem can be naturally formulated as a Markov decision process, a natural machine learning approach is reinforcement learning [25]. However, since there are SOTA integer programming solvers available, such as Gurobi [14] and SCIP [5], we decided to try imitation learning [16], learning directly from an expert branching rule. There is related work taking this approach [11], aiming to tackle mixed integer linear programming (MILP), where only a portion of the variables have integrality constraints while the others can be real-valued. Our approach extends this further: we focus on TSP, which is not only a pure integer program but one whose variables can take values only from $\{0, 1\}$.
5.1 Learning Pipeline
Our learning pipeline is as follows: we first create some random TSP instances and turn them into ILPs. Then, we use imitation learning to learn how to choose the branching target at each branching step. Our GNN model produces a probability for each possible action (in our case, which variable to branch on). We then use the cross-entropy loss to compare our prediction to the decision produced by SCIP, completing one iteration.
Instances Generation. For each TSP instance, we randomly generate coordinates for every node, formulate it using the Miller-Tucker-Zemlin formulation [20], and record it in the linear programming format as instances_*.lp via CPLEX [8].
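A minimal sketch of this step, reusing the hypothetical `build_mtz_model` sketch from Section 3.1 and PuLP's .lp export rather than the CPLEX export actually used in the paper:

```python
import random

def generate_instance(n, path, seed=0):
    """Sample a random Euclidean TSP instance and export it in .lp format."""
    random.seed(seed)
    coords = [(random.random(), random.random()) for _ in range(n)]
    prob = build_mtz_model(coords)      # Miller-Tucker-Zemlin ILP of Equation 1
    prob.writeLP(path)                  # e.g. path = "instances_0001.lp"
    return coords
```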
Samples Generation. By passing every instances_*.lp file to SCIP, we can record the branching decisions the solver made while solving it. Modern solvers usually use a mixed branching strategy to balance running time, but since we want to learn the best branching strategy, we ask SCIP to use strong branching with some probability when branching, and we only record the state and branching decision (the state-action pairs $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{N}$) when SCIP uses strong branching.
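A sketch of this collection loop is shown below. Here `env` and `expert` are hypothetical wrappers around SCIP's branching loop and the strong-branching expert (libraries such as ecole expose a similar interface); the file naming and probability are illustrative.

```python
import pickle
import random

def collect_samples(instance_paths, env, expert, out_prefix, p_expert=0.05):
    """Roll the solver out on each instance; with probability p_expert query the
    strong-branching expert and save the resulting (state, action) pair."""
    k = 0
    for path in instance_paths:
        state, candidates, done = env.reset(path)
        while not done:
            if random.random() < p_expert:
                action = expert.best_candidate(state, candidates)   # strong branching
                with open(f"{out_prefix}_{k}.pkl", "wb") as f:
                    pickle.dump((state, action), f)                  # one sample in D
                k += 1
            else:
                action = env.default_rule(candidates)                # cheap branching rule
            state, candidates, done = env.step(action)
    return k
```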
Imitation Learning. We learn our policy $\widetilde{\pi}$ by minimizing the cross-entropy loss
$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{(s, a) \in \mathcal{D}} \log \widetilde{\pi}_\theta(a \mid s),$$
training by behavioral cloning [23] from the recorded state-action pairs.
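A behavioral cloning loop in PyTorch could look like the sketch below. It assumes `policy` is a GNN that returns one logit per branching candidate and `loader` yields batches of recorded (state, expert action) pairs; this is an illustrative sketch, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def behavioral_cloning(policy, loader, epochs=10, lr=1e-3):
    """Minimize the cross-entropy loss L(theta) over the recorded state-action pairs."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for state, expert_action in loader:
            logits = policy(state)                         # scores over branching candidates
            loss = F.cross_entropy(logits, expert_action)  # -log pi_theta(a | s), averaged
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```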