
SurCo: Learning Linear SURrogates for COmbinatorial Nonlinear Optimization Problems
Learning a good nonlinear cost model $f$ is non-trivial for practical applications (e.g., AlphaFold (Jumper et al., 2021), Density Functional Theory (Nagai et al., 2020), cost models for embedding tables (Zha et al., 2022a)) and is beyond the scope of this work.
Evaluation Metric. We mainly focus on two aspects: the solution quality, evaluated by $f(\hat{x}; y)$, and the number of queries of $f$ made during optimization to reach the solution $\hat{x}$. For both, smaller measurements are favorable, i.e., fewer queries of $f$ to get solutions closer to the global optimum.
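In practice, the query count can be tracked by wrapping the expensive nonlinear objective $f$ in a counting closure; the following is a minimal sketch (the wrapper name and the toy objective are illustrative, not part of the paper's setup):

```python
def count_queries(f):
    # Wrap the expensive nonlinear objective f so that every evaluation
    # (query) made during optimization is counted.
    def wrapped(*args, **kwargs):
        wrapped.calls += 1
        return f(*args, **kwargs)
    wrapped.calls = 0
    return wrapped

# Usage: wrap a toy objective and check how often the optimizer queried it.
g = count_queries(lambda x: x * x)
results = [g(i) for i in range(3)]   # three queries of f
```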
When $f(x; y)$ is linear w.r.t. $x$, and the feasible region $\Omega(y)$ can be encoded using mixed integer programs or other mathematical programs, the problem can be solved efficiently using existing scalable optimization solvers. When $f(x; y)$ is nonlinear, we propose SurCo, which learns a surrogate linear objective function. This allows us to leverage these existing scalable optimization solvers, and yields a solution of high quality with respect to the original hard-to-encode objective function $f(x; y)$.
3. SurCo: Learning Linear Surrogates
3.1. SurCo-zero: on-the-fly optimization
We start from the simplest case in which we focus on a single instance with $f(x) = f(x; y)$ and $\Omega = \Omega(y)$. SurCo-zero aims to optimize the following objective:

$$\text{(SurCo-zero):} \quad \min_{c} \; L_{\text{zero}}(c) := f(g_\Omega(c)) \tag{2}$$
where the surrogate optimizer $g_\Omega: \mathbb{R}^n \mapsto \mathbb{R}^n$ is the output of certain combinatorial solvers with linear cost weight $c \in \mathbb{R}^n$ and feasible region $\Omega \subseteq \mathbb{R}^n$. For example, $g_\Omega$ can be the following:

$$g_\Omega(c) := \arg\min_{x} \; c^\top x \quad \text{s.t.} \quad x \in \Omega := \{Ax \le b,\; x \in \mathbb{Z}^n\} \tag{3}$$
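For intuition, the solver call in Eqn. 3 can be mimicked by brute-force enumeration on tiny binary problems; this is only an illustrative stand-in (at realistic scale one would call an actual MILP solver such as SCIP or Gurobi):

```python
import itertools
import numpy as np

def g_omega(c, A, b):
    # g_Omega(c) = argmin_x c^T x  s.t.  A x <= b, x in {0, 1}^n
    # (brute force over the 2^n binary vertices; toy problems only)
    n = len(c)
    best_x, best_val = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits, dtype=float)
        if np.all(A @ x <= b):          # feasibility check
            val = c @ x
            if val < best_val:
                best_x, best_val = x, val
    return best_x

# Usage: pick at most 2 of 3 items, cost favors items with lower c.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([2.0])
c = np.array([-1.0, -2.0, -3.0])
x_star = g_omega(c, A, b)               # -> [0., 1., 1.]
```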
which is the output of a MILP solver. Thanks to previous works (Ferber et al., 2020; Pogančić et al., 2019), we can efficiently compute the partial derivative $\partial g_\Omega(c)/\partial c$. Intuitively, this means that $g_\Omega(c)$ can be backpropagated through. Since $f$ is also differentiable with respect to the solution it is evaluating, we can thus optimize Eqn. 2 in an end-to-end manner using any gradient-based optimizer:

$$c(t+1) = c(t) - \alpha \frac{\partial g_\Omega}{\partial c} \frac{\partial f}{\partial x}, \tag{4}$$
where $\alpha$ is the learning rate. The procedure starts from a randomly initialized $c(0)$ and converges to a locally optimal solution $c$. While Eqn. 2 is still a nonlinear optimization problem and there is no guarantee about the quality of the final solution $c$, we argue that optimizing Eqn. 2 is better than optimizing the original nonlinear cost $\min_{x \in \Omega} f(x)$. Furthermore, while we cannot guarantee optimality, we guarantee feasibility by leveraging a linear combinatorial solver.
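The update in Eqn. 4 can be sketched end to end on a toy problem. In the sketch below, the solver, the nonlinear objective, its analytic gradient, and the finite-difference gradient estimator (in the spirit of the blackbox differentiation of Pogančić et al., 2019) are all simplified stand-ins, not the paper's actual components:

```python
import numpy as np

def solver(c):
    # Stand-in for g_Omega: argmin_{x in {0,1}^n} c^T x over the unit
    # hypercube's vertices. A real MILP solver would be called here.
    return (c < 0).astype(float)

def f(x):
    # Toy nonlinear objective: prefer exactly 3 active entries, low indices.
    return (x.sum() - 3.0) ** 2 + 0.1 * (x * np.arange(len(x))).sum()

def grad_f(x):
    # Analytic df/dx for the toy objective above.
    return 2.0 * (x.sum() - 3.0) * np.ones_like(x) + 0.1 * np.arange(len(x))

def surco_zero(n=8, steps=300, alpha=0.05, lam=10.0, seed=0):
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)                  # randomly initialized c(0)
    best_x = solver(c)
    f0 = best_f = f(best_x)
    for _ in range(steps):
        x = solver(c)                       # always feasible by construction
        if f(x) < best_f:
            best_x, best_f = x, f(x)
        # Blackbox solver differentiation (Pogancic et al., 2019):
        # dL/dc ~= (g(c + lam * df/dx) - g(c)) / lam
        grad_c = (solver(c + lam * grad_f(x)) - x) / lam
        c = c - alpha * grad_c              # gradient step as in Eqn. 4
    return best_x, best_f, f0

best_x, best_f, f0 = surco_zero()
```

Note that every iterate is a vertex returned by the solver, so feasibility holds throughout even though optimality is not guaranteed.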
Intuitively, instead of optimizing directly over the solution space $x$, we optimize over the space of surrogate costs $c$, and delegate the combinatorial feasibility requirements of the nonlinear problem to SoTA combinatorial solvers. Compared to naive approaches that directly optimize $f(x)$ via general optimization techniques, our method readily handles complex constraints of the feasible regions, and thus makes the optimization procedure easier. Furthermore, it also helps escape from local minima, thanks to the embedded search component of existing combinatorial solvers (e.g., branch-and-bound (Land & Doig, 2010) in MILP solvers). As we see in the experiments, this is particularly important when the problem becomes large-scale with more local optima.
This approach works well when we are optimizing individual instances and either have no access to offline training data or find the training time cost-prohibitive.
Limitation. Note that due to the linear surrogate, our approach always returns a vertex of the feasible region, while the solution to the original nonlinear objective may lie in the interior. We leave this limitation for future work. In many real-world settings, such as the three domains we tested, the solutions are indeed at vertices of the feasible region.
3.2. SurCo-prior: offline surrogate training
We now consider a more general case where we have $N$ optimization instances, each parameterized by an instance description $y_i$, $i = 1 \ldots N$, and we want to find their solutions to a collection of nonlinear loss functions $f(x; y_i)$ simultaneously. Here we write $D_{\text{train}} := \{y_i\}_{i=1}^{N}$ as the training set. A naive approach is just to apply SurCo-zero $N$ times, which leads to $N$ independent surrogate costs $\{c_i\}_{i=1}^{N}$. However, this approach does not consider two important characteristics. First, it fails to leverage possible relationships between the instance descriptor $y_i$ and its associated surrogate cost $c_i$, since every surrogate cost is independently estimated. Second, it fails to learn any useful knowledge from the $N$ instances after optimization. As a result, for an unseen instance, the entire optimization process needs to be conducted again, which is slow. This motivates us to add a surrogate cost model $\hat{c}(y; \theta)$ into the optimization as a regularizer:
$$\text{(SurCo-prior-}\lambda\text{):} \quad \min_{\theta, \{c_i\}} \; L_{\text{prior}}(\theta, \{c_i\}; \lambda) := \sum_{i=1}^{N} f(g_{\Omega(y_i)}(c_i); y_i) + \lambda \|c_i - \hat{c}(y_i; \theta)\|^2 \tag{5}$$
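The objective in Eqn. 5 can be evaluated with a few lines of numpy; in this sketch the solver, the instance-dependent cost, and the linear regressor are hypothetical toy stand-ins chosen only to make the two terms of the loss concrete:

```python
import numpy as np

def solver(c):
    # Toy surrogate solver g_Omega: argmin_{x in {0,1}^n} c^T x.
    return (c < 0).astype(float)

def f(x, y):
    # Toy instance-dependent nonlinear cost: y is a target active count.
    return (x.sum() - y) ** 2

def c_hat(y, theta):
    # Hypothetical linear regressor c_hat(y; theta).
    return theta * y

def l_prior(theta, cs, ys, lam):
    # Eqn. 5: sum_i f(g_Omega(c_i); y_i) + lam * ||c_i - c_hat(y_i; theta)||^2
    total = 0.0
    for c, y in zip(cs, ys):
        total += f(solver(c), y) + lam * np.sum((c - c_hat(y, theta)) ** 2)
    return total

# Usage: two instances whose surrogate costs already solve their f terms.
theta = np.zeros(4)
cs = [np.array([-1.0, 1.0, 1.0, 1.0]), np.array([-1.0, -1.0, 1.0, 1.0])]
ys = [1.0, 2.0]
loss_free = l_prior(theta, cs, ys, 0.0)   # lam = 0: N independent f terms
loss_reg = l_prior(theta, cs, ys, 1.0)    # lam > 0: regressor couples the c_i
```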
The regressor model $\hat{c}(y; \theta)$ directly predicts the surrogate cost from the instance description. The regressor can be, e.g., a neural network, in which case $\theta$ denotes its parameters. Note that when $\lambda = 0$, Eqn. 5 reduces to $N$ independent optimizations, while when $\lambda > 0$, the surrogate costs $\{c_i\}$ interact with each other. With the regressor, we distill knowledge gained