
Instead, in our case, only a partial observation $o \in \Omega$ is available, where $\Omega$ is called the partial state space. We can take an active perspective on the above model; namely, we merely apply an observation function $O: \mathcal{S} \to \Omega$ to the current state $s_t$ at each time step $t$. Hence, we define a PO-MDP $\widetilde{\mathcal{M}}$ as a tuple $\widetilde{\mathcal{M}} := (\mathcal{S}, \mathcal{A}, p_{\mathrm{init}}, p_{\mathrm{trans}}, R, O)$. Within this setup, a trajectory of the PO-MDP takes the form $\tau = (o_0, r_0, a_0, o_1, r_1, a_1, \ldots)$, where $o_t := O(s_t)$ and $r_t := R(s_t)$. It is important to note that $r_t$ still depends on the state of the PO-MDP, not on the observation. We introduce a convenience variable $h_t := (o_0, r_0, a_0, \ldots, o_t, r_t) \in \mathcal{H}$, which represents the PO-MDP history at time step $t$ without the action $a_t$. Due to the non-Markovian nature of the trajectories, $o_{t+1}, r_{t+1} \not\perp h_{t-1} \mid o_t, r_t, a_t$, the decision-maker must take the whole history of observations, rewards, and actions into account to decide on an optimal action at the current time step $t$. The action policy for the PO-MDP therefore takes the form $\widetilde{\pi}: \mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}$ such that $a_t \sim \widetilde{\pi}(a_t \mid h_t)$.
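For intuition only, the objects above can be read procedurally as the following minimal Python sketch of a PO-MDP rollout with a history-conditioned policy; the env and policy interfaces are hypothetical stand-ins for $(\mathcal{S}, \mathcal{A}, p_{\mathrm{init}}, p_{\mathrm{trans}}, R, O)$ and $\widetilde{\pi}$, not part of our formal development.

    # Minimal sketch of a PO-MDP rollout; `env` and `policy` are hypothetical
    # objects standing in for the tuple (S, A, p_init, p_trans, R, O) and pi-tilde.
    def rollout(env, policy, T):
        history = []                      # h_t = (o_0, r_0, a_0, ..., o_t, r_t)
        s = env.sample_initial_state()    # s_0 ~ p_init
        for t in range(T):
            o = env.observe(s)            # o_t = O(s_t)
            r = env.reward(s)             # r_t = R(s_t): depends on the state, not the observation
            history.extend([o, r])
            a = policy.sample(history)    # a_t ~ pi(a_t | h_t): conditioned on the whole history
            history.append(a)
            s = env.transition(s, a)      # s_{t+1} ~ p_trans(. | s_t, a_t)
        return history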
4.3 Markov Control Problem
We define the MDP control problem as that of finding a policy $\pi^*: \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$ which is optimal with respect to the expected total reward. That is,
\[
\pi^* = \arg\max_{\pi}\; \lim_{T \to \infty}\; \mathbb{E}_{\tau}\!\left[\sum_{t=0}^{T} r_t\right],
\]
where $r_t := R(s_t)$. To generalize this to the PO-MDP control problem, the objective, as in the MDP case, is to find a policy $\widetilde{\pi}^*: \mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}$ that maximizes the expected total reward. Slightly abusing notation, we simply denote this learned policy by $\widetilde{\pi}^*$, where the objective function is exactly the same as in the MDP case.
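Spelled out under the notation above, this reads as follows; the only difference from the MDP objective is that the trajectory is induced by a history-conditioned policy:
\[
\widetilde{\pi}^* = \arg\max_{\widetilde{\pi}}\; \lim_{T \to \infty}\; \mathbb{E}_{\tau \sim \widetilde{\pi}}\!\left[\sum_{t=0}^{T} r_t\right],
\qquad a_t \sim \widetilde{\pi}(a_t \mid h_t), \quad r_t := R(s_t).
\]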
5 Methodology
Since the branch-and-bound variable selection problem can be naturally formulated as a Markov decision process, reinforcement learning [25] is a natural machine learning approach. However, since state-of-the-art integer programming solvers such as Gurobi [14] and SCIP [5] already exist, we decided to use imitation learning [16] and learn directly from an expert branching rule. Related work in this direction [11] targets mixed integer linear programming (MILP), where only a portion of the variables carry integrality constraints while the others may take real values. Our approach goes further: we focus on the TSP, which is not only a pure integer program but also one whose variables take values only in {0,1}.
5.1 Learning Pipeline
Our learning pipeline is as follows: we first create random TSP instances and convert them into ILPs. Then, we use imitation learning to learn how to choose the branching target at each branching step. Our GNN model outputs a probability for each possible action (in our case, which variable to branch on). We then use the cross-entropy loss to compare our prediction with the decision produced by SCIP, which completes one training iteration.
Instances Generation.
For each TSP instance, we randomly generate the coordinates of every node, formulate the problem using the Miller-Tucker-Zemlin formulation [20], and record it in the linear programming format as instances_*.lp via CPLEX [8].
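As an illustration only, a minimal sketch of this step might look as follows, assuming the CPLEX Python modeling API (docplex) is available; the function name, the unit-square coordinates, and the Euclidean distances are our own illustrative choices, not a specification of our generator.

    # Sketch: generate one random TSP instance and export an MTZ-formulated ILP.
    # Assumes docplex (the CPLEX Python modeling API); names are illustrative.
    import random
    from docplex.mp.model import Model

    def make_tsp_lp(n, path):
        pts = [(random.random(), random.random()) for _ in range(n)]
        dist = {(i, j): ((pts[i][0] - pts[j][0])**2 + (pts[i][1] - pts[j][1])**2) ** 0.5
                for i in range(n) for j in range(n) if i != j}

        m = Model(name="tsp_mtz")
        x = {(i, j): m.binary_var(name=f"x_{i}_{j}") for (i, j) in dist}             # edge (i,j) used?
        u = {i: m.continuous_var(lb=1, ub=n - 1, name=f"u_{i}") for i in range(1, n)}  # MTZ ordering vars

        for i in range(n):  # each node has exactly one outgoing and one incoming edge
            m.add_constraint(m.sum(x[i, j] for j in range(n) if j != i) == 1)
            m.add_constraint(m.sum(x[j, i] for j in range(n) if j != i) == 1)
        for i in range(1, n):  # MTZ subtour-elimination constraints
            for j in range(1, n):
                if i != j:
                    m.add_constraint(u[i] - u[j] + (n - 1) * x[i, j] <= n - 2)

        m.minimize(m.sum(dist[i, j] * x[i, j] for (i, j) in dist))
        m.export_as_lp(path)  # e.g. "instances_0.lp"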
Samples Generation.
By passing each instances_*.lp to SCIP, we record the branching decisions the solver makes while solving it. Modern solvers usually rely on a mixed branching strategy to balance running time, but since we want to learn the best branching strategy, we ask SCIP to use strong branching with some probability at each branching step, and we record the state and branching decision (state-action pairs) $\mathcal{D} = \{(s_i, a_i^*)\}_{i=1}^{N}$ only when SCIP uses strong branching.
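A sketch of this sampling scheme, assuming PySCIPOpt, could look as follows; extract_state and expert_score are hypothetical placeholders for our feature extraction and the strong-branching expert, and the probability 0.05 is purely illustrative.

    # Sketch: record (state, action) pairs when the expert is queried;
    # `extract_state` and `expert_score` are hypothetical placeholders.
    import random
    from pyscipopt import Model, Branchrule, SCIP_RESULT

    class SamplingBranchrule(Branchrule):
        def __init__(self, expert_prob, samples):
            self.expert_prob = expert_prob   # probability of querying the expert
            self.samples = samples           # shared list of (state, action) pairs

        def branchexeclp(self, allowaddcons):
            cands, *_ = self.model.getLPBranchCands()   # fractional branching candidates
            if random.random() < self.expert_prob:
                scores = [expert_score(self.model, v) for v in cands]   # hypothetical expert
                best = max(range(len(cands)), key=scores.__getitem__)
                self.samples.append((extract_state(self.model), cands[best].name))  # record (s_i, a_i*)
                self.model.branchVar(cands[best])
                return {"result": SCIP_RESULT.BRANCHED}
            return {"result": SCIP_RESULT.DIDNOTRUN}    # otherwise defer to SCIP's own rules

    samples = []
    m = Model()
    m.readProblem("instances_0.lp")
    m.includeBranchrule(SamplingBranchrule(0.05, samples), "sampler",
                        "records strong-branching decisions",
                        priority=100000, maxdepth=-1, maxbounddist=1.0)
    m.optimize()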
Imitation Learning. We learn our policy $\widetilde{\pi}^*$ by behavioral cloning [23] from the recorded state-action pairs, i.e., by minimizing the cross-entropy loss
\[
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{(s, a^*) \in \mathcal{D}} \log \widetilde{\pi}_\theta(a^* \mid s).
\]
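In PyTorch terms, one training step of this loss reduces to a standard cross-entropy computation; the sketch below assumes a hypothetical policy network that maps a batch of states to one logit per candidate variable and a batch of recorded expert action indices.

    # Sketch of one behavioral-cloning step; `policy`, `states`, and
    # `expert_actions` are hypothetical placeholders for our GNN, a batch of
    # recorded states, and the indices of SCIP's strong-branching choices.
    import torch
    import torch.nn.functional as F

    def bc_step(policy, optimizer, states, expert_actions):
        logits = policy(states)                         # shape: (batch, num_candidate_variables)
        loss = F.cross_entropy(logits, expert_actions)  # = -1/N * sum log pi_theta(a* | s)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()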