
Instead, in our case, only a partial observation $o \in \Omega$ is available, where $\Omega$ is called the partial state space. We can take an active perspective on the above model; namely, we merely apply an observation function $O: \mathcal{S} \to \Omega$ to the current state $s_t$ at each time step $t$. Hence, we define a PO-MDP $\widetilde{\mathcal{M}}$ as a tuple $\widetilde{\mathcal{M}} := (\mathcal{S}, \mathcal{A}, p_{\mathrm{init}}, p_{\mathrm{trans}}, R, O)$. Within this setup, a trajectory of the PO-MDP takes the form $\tau = (o_0, r_0, a_0, o_1, r_1, a_1, \ldots)$, where $o_t := O(s_t)$ and $r_t := R(s_t)$. It is important to note that $r_t$ still depends on the state of the PO-MDP, not on the observation. We introduce a convenience variable $h_t := (o_0, r_0, a_0, \ldots, o_t, r_t) \in \mathcal{H}$, which represents the PO-MDP history at time step $t$ without the action $a_t$. Due to the non-Markovian nature of the trajectories, $o_{t+1}, r_{t+1} \not\perp h_{t-1} \mid o_t, r_t, a_t$, the decision-maker must take the whole history of observations, rewards, and actions into account to decide on an optimal action at the current time step $t$. The action policy for the PO-MDP therefore takes the form $\widetilde{\pi}: \mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}$ such that $a_t \sim \widetilde{\pi}(a_t \mid h_t)$.
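For intuition only, the objects above can be read procedurally as the following minimal Python sketch of a PO-MDP rollout with a history-conditioned policy; the env and policy interfaces are hypothetical stand-ins for $(\mathcal{S}, \mathcal{A}, p_{\mathrm{init}}, p_{\mathrm{trans}}, R, O)$ and $\widetilde{\pi}$, not part of our formal development.

    # Minimal sketch of a PO-MDP rollout; `env` and `policy` are hypothetical
    # objects standing in for the tuple (S, A, p_init, p_trans, R, O) and pi-tilde.
    def rollout(env, policy, T):
        history = []                      # h_t = (o_0, r_0, a_0, ..., o_t, r_t)
        s = env.sample_initial_state()    # s_0 ~ p_init
        for t in range(T):
            o = env.observe(s)            # o_t = O(s_t)
            r = env.reward(s)             # r_t = R(s_t): depends on the state, not the observation
            history.extend([o, r])
            a = policy.sample(history)    # a_t ~ pi(a_t | h_t): conditioned on the whole history
            history.append(a)
            s = env.transition(s, a)      # s_{t+1} ~ p_trans(. | s_t, a_t)
        return history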
4.3 Markov Control Problem
We define the MDP control problem as that of finding a policy $\pi^*: \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$ which is optimal with respect to the expected total reward. That is,
\[
\pi^* = \arg\max_{\pi}\; \lim_{T \to \infty}\; \mathbb{E}_{\tau}\!\left[\sum_{t=0}^{T} r_t\right],
\]
where $r_t := R(s_t)$. To generalize this to the PO-MDP control problem, the objective, as in the MDP case, is to find a policy $\widetilde{\pi}^*: \mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}$ that maximizes the expected total reward. Slightly abusing notation, we simply denote this learned policy by $\widetilde{\pi}^*$, where the objective function is exactly the same as in the MDP case.
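Spelled out under the notation above, this reads as follows; the only difference from the MDP objective is that the trajectory is induced by a history-conditioned policy:
\[
\widetilde{\pi}^* = \arg\max_{\widetilde{\pi}}\; \lim_{T \to \infty}\; \mathbb{E}_{\tau \sim \widetilde{\pi}}\!\left[\sum_{t=0}^{T} r_t\right],
\qquad a_t \sim \widetilde{\pi}(a_t \mid h_t), \quad r_t := R(s_t).
\]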
5 Methodology
Since the branch-and-bound variable selection problem can be naturally formulated as a Markov decision process, reinforcement learning [25] is a natural machine learning approach. However, since state-of-the-art integer programming solvers such as Gurobi [14] and SCIP [5] already exist, we decided to use imitation learning [16] and learn directly from an expert branching rule. Related work in this direction [11] targets mixed integer linear programming (MILP), where only a portion of the variables carry integrality constraints while the others may take real values. Our approach goes further: we focus on the TSP, which is not only a pure integer program but also one whose variables take values only in {0,1}.
5.1 Learning Pipeline
Our learning pipeline is as follows: we first create random TSP instances and convert them into ILPs. Then, we use imitation learning to learn how to choose the branching target at each branching step. Our GNN model outputs a probability for each possible action (in our case, which variable to branch on). We then use the cross-entropy loss to compare our prediction with the decision produced by SCIP, which completes one training iteration.
Instances Generation.
For each TSP instance, we randomly generate the coordinates of every node, formulate the problem using the Miller-Tucker-Zemlin formulation [20], and record it in the linear programming format as instances_*.lp via CPLEX [8].
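As an illustration only, a minimal sketch of this step might look as follows, assuming the CPLEX Python modeling API (docplex) is available; the function name, the unit-square coordinates, and the Euclidean distances are our own illustrative choices, not a specification of our generator.

    # Sketch: generate one random TSP instance and export an MTZ-formulated ILP.
    # Assumes docplex (the CPLEX Python modeling API); names are illustrative.
    import random
    from docplex.mp.model import Model

    def make_tsp_lp(n, path):
        pts = [(random.random(), random.random()) for _ in range(n)]
        dist = {(i, j): ((pts[i][0] - pts[j][0])**2 + (pts[i][1] - pts[j][1])**2) ** 0.5
                for i in range(n) for j in range(n) if i != j}

        m = Model(name="tsp_mtz")
        x = {(i, j): m.binary_var(name=f"x_{i}_{j}") for (i, j) in dist}             # edge (i,j) used?
        u = {i: m.continuous_var(lb=1, ub=n - 1, name=f"u_{i}") for i in range(1, n)}  # MTZ ordering vars

        for i in range(n):  # each node has exactly one outgoing and one incoming edge
            m.add_constraint(m.sum(x[i, j] for j in range(n) if j != i) == 1)
            m.add_constraint(m.sum(x[j, i] for j in range(n) if j != i) == 1)
        for i in range(1, n):  # MTZ subtour-elimination constraints
            for j in range(1, n):
                if i != j:
                    m.add_constraint(u[i] - u[j] + (n - 1) * x[i, j] <= n - 2)

        m.minimize(m.sum(dist[i, j] * x[i, j] for (i, j) in dist))
        m.export_as_lp(path)  # e.g. "instances_0.lp"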
Samples Generation.
By passing each instances_*.lp to SCIP, we record the branching decisions the solver makes while solving it. Modern solvers usually rely on a mixed branching strategy to balance running time, but since we want to learn the best branching strategy, we ask SCIP to use strong branching with some probability at each branching step, and we record the state and branching decision (state-action pairs) $\mathcal{D} = \{(s_i, a_i^*)\}_{i=1}^{N}$ only when SCIP uses strong branching.
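A sketch of this sampling scheme, assuming PySCIPOpt, could look as follows; extract_state and expert_score are hypothetical placeholders for our feature extraction and the strong-branching expert, and the probability 0.05 is purely illustrative.

    # Sketch: record (state, action) pairs when the expert is queried;
    # `extract_state` and `expert_score` are hypothetical placeholders.
    import random
    from pyscipopt import Model, Branchrule, SCIP_RESULT

    class SamplingBranchrule(Branchrule):
        def __init__(self, expert_prob, samples):
            self.expert_prob = expert_prob   # probability of querying the expert
            self.samples = samples           # shared list of (state, action) pairs

        def branchexeclp(self, allowaddcons):
            cands, *_ = self.model.getLPBranchCands()   # fractional branching candidates
            if random.random() < self.expert_prob:
                scores = [expert_score(self.model, v) for v in cands]   # hypothetical expert
                best = max(range(len(cands)), key=scores.__getitem__)
                self.samples.append((extract_state(self.model), cands[best].name))  # record (s_i, a_i*)
                self.model.branchVar(cands[best])
                return {"result": SCIP_RESULT.BRANCHED}
            return {"result": SCIP_RESULT.DIDNOTRUN}    # otherwise defer to SCIP's own rules

    samples = []
    m = Model()
    m.readProblem("instances_0.lp")
    m.includeBranchrule(SamplingBranchrule(0.05, samples), "sampler",
                        "records strong-branching decisions",
                        priority=100000, maxdepth=-1, maxbounddist=1.0)
    m.optimize()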
Imitation Learning. We learn our policy $\widetilde{\pi}^*$ by behavioral cloning [23] from the recorded state-action pairs, i.e., by minimizing the cross-entropy loss
\[
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{(s, a^*) \in \mathcal{D}} \log \widetilde{\pi}_\theta(a^* \mid s).
\]
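In PyTorch terms, one training step of this loss reduces to a standard cross-entropy computation; the sketch below assumes a hypothetical policy network that maps a batch of states to one logit per candidate variable and a batch of recorded expert action indices.

    # Sketch of one behavioral-cloning step; `policy`, `states`, and
    # `expert_actions` are hypothetical placeholders for our GNN, a batch of
    # recorded states, and the indices of SCIP's strong-branching choices.
    import torch
    import torch.nn.functional as F

    def bc_step(policy, optimizer, states, expert_actions):
        logits = policy(states)                         # shape: (batch, num_candidate_variables)
        loss = F.cross_entropy(logits, expert_actions)  # = -1/N * sum log pi_theta(a* | s)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()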