SurCo Learning Linear SURrogates for COmbinatorial Nonlinear Optimization Problems

2025-05-02 0 0 1.18MB 19 页 10玖币
侵权投诉
SurCo: Learning Linear SURrogates
for COmbinatorial Nonlinear Optimization Problems
Aaron Ferber 1Taoan Huang 1Daochen Zha 2
Martin Schubert 3Benoit Steiner 4Bistra Dilkina 1Yuandong Tian 3
Abstract
Optimization problems with nonlinear cost func-
tions and combinatorial constraints appear in
many real-world applications but remain chal-
lenging to solve efficiently compared to their lin-
ear counterparts. To bridge this gap, we propose
SurCo
that learns linear
Sur
rogate costs which
can be used in existing
Co
mbinatorial solvers to
output good solutions to the original nonlinear
combinatorial optimization problem. The surro-
gate costs are learned end-to-end with nonlinear
loss by differentiating through the linear surrogate
solver, combining the flexibility of gradient-based
methods with the structure of linear combinato-
rial optimization. We propose three
SurCo
vari-
ants:
SurCo zero
for individual nonlinear
problems,
SurCo prior
for problem distribu-
tions, and
SurCohybrid
to combine both dis-
tribution and problem-specific information. We
give theoretical intuition motivating
SurCo
, and
evaluate it empirically. Experiments show that
SurCo
finds better solutions faster than state-
of-the-art and domain expert approaches in real-
world optimization problems such as embedding
table sharding, inverse photonic design, and non-
linear route planning.
1. Introduction
Combinatorial optimization problems with linear objec-
tive functions such as mixed integer linear programming
(MILP) (Wolsey,2007), and occasionally linear program-
Work done during Aaron and Taoan’s internship in Meta AI.
Project page at
https://sites.google.com/usc.edu/
surco/ 1
Center for AI in Society, University of Southern Califor-
nia
2
Rice University
3
Meta AI, FAIR
4
Anthropic. Correspondence
to: Aaron Ferber
<
aferber@usc.edu
>
, Yuandong Tian
<
yuan-
dong@meta.com>.
Proceedings of the
40 th
International Conference on Machine
Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright
2023 by the author(s).
ming (LP) (Chvatal et al.,1983), have been extensively
studied in operations research (OR). The resulting high-
performance solvers like Gurobi (Gurobi Optimization,
LLC,2022) can solve industrial-scale optimization prob-
lems with tens of thousands of variables in a few minutes.
However, even with perfect solvers, one issue remains: the
cost functions
f(x)
in many practical problems are nonlin-
ear, and the highly-optimized solvers mainly handle linear
or convex formulations while real-world problems have less
constrained objectives. For example, in embedding table
sharding (Zha et al.,2022a) one needs to distribute embed-
ding tables to multiple GPUs for the deployment of recom-
mendation systems. Due to the batching behaviors within a
single GPU and communication cost among different GPUs,
the overall latency (cost function) in this application de-
pends on interactions of multiple tables and thus can be
highly nonlinear (Zha et al.,2022a).
To obtain useful solutions to real-world problems, one may
choose to directly optimize the nonlinear cost, which can
be the black-box output of a simulator (Gosavi et al.,2015;
Ye et al.,2019), or the output of a cost estimator learned
by machine learning techniques (e.g., deep models) from
offline data (Steiner et al.,2021;Koziel et al.,2021;Wang
et al.,2021b;Cozad et al.,2014). However, many of these
direct optimization approaches either rely on human-defined
heuristics (e.g., greedy (Korte & Hausmann,1978;Reingold
& Tarjan,1981;Wolsey,1982), local improvement (V
et al.,2012;Li et al.,2021)), or resort to general nonlin-
ear optimization techniques like gradient descent (Ruder,
2016), reinforcement learning (Mazyavkina et al.,2021),
or evolutionary algorithms (Simon,2013). While these ap-
proaches can work in certain settings, they may lead to
a slow optimization process, in particular when the cost
function is expensive to evaluate, and they often ignore the
combinatorial nature of most real-world applications.
In this work, we propose a systematic framework
SurCo
that leverages existing efficient combinatorial solvers to find
solutions to nonlinear combinatorial optimization problems
arising in real-world scenarios. When only one nonlinear
differentiable cost
f(x)
needs to be minimized, we propose
SurCo
-
zero
that optimizes a linear surrogate cost
ˆ
c
so
1
arXiv:2210.12547v2 [cs.LG] 19 Jul 2023
SurCo: Learning Linear SURrogates for COmbinatorial Nonlinear Optimization Problems
Figure 1: Overview of our proposed framework SurCo.
that the surrogate optimizer (SO)
minxˆ
cx
outputs a
solution that is expected to be optimal w.r.t. the original
nonlinear cost
f(x)
. Due to its linear nature, SO can be
solved efficiently with existing solvers, and the surrogate
cost
ˆ
c
can be optimized in an end-to-end manner by back-
propagating through the solver via methods proposed in
previous work (Pogan
ˇ
ci
´
c et al.,2019;Niepert et al.,2021;
Berthet et al.,2020).
Thus,
SurCo
is a general-purpose method for solving com-
binatorial nonlinear optimization. Off-the-shelf nonlinear
optimizers are often not directly applicable to these prob-
lem domains and often require domain-specific solution
methodologies to give high-quality solutions in a reasonable
amount of time, and solution prediction methods fail to give
combinatorially feasible solutions without problem-specific
intervention. Here, learning a linear surrogate problem en-
sures that the surrogate solver is practically efficient, yields
gradient information for offline training, and generates solu-
tions that are combinatorially feasible.
When solving a family of nonlinear differentiable functions
f(x;y)
parameterized by instance description
y
, the sur-
rogate coefficients
ˆ
c(y;θ)
are learned on a set of optimiza-
tion instances (called the training set
{yi}
), by optimizing
the parameters
θ
. For an unseen held-out instance
y
, we
propose
SurCo
-
prior
that directly optimizes linear SO:
ˆ
x(y) := arg minxΩ(y)ˆ
c(y;θ)x
to get the solution,
avoiding optimizing the cost
f(x;y)
from scratch. Based
on the solution predicted by
SurCo
-
prior
, we also pro-
pose
SurCo
-
hybrid
that fine-tunes the surrogate costs
ˆ
c
with
SurCo
-
zero
to leverage both domain knowledge
synthesized offline and information about the specific in-
stance. We provide a comprehensive description of
SurCo
in Section 3.
We evaluate
SurCo
in three settings: embedding table
sharding (Zha et al.,2022a), photonic inverse design (Schu-
bert et al.,2022), and nonlinear route planning (Fan et al.,
2005). In the on-the-fly setting,
SurCo
-
zero
achieves
higher quality solutions in comparable or less runtime,
thanks to the help of an efficient combinatorial solver. in
SurCo
-
prior
, our method obtains better solutions in held-
out problems compared to other methods that require train-
ing (e.g., reinforcement learning).
Symbol Description
yParametric description of a specific instance.
xA solution to an instance.
f(x;y)The nonlinear objective (w.r.t x) for an instance y.
Ω(y)The feasible region of an instance y.
ˆ
x(y)The optimal SO solution to an instance y.
c(y)The surrogate coefficients for instance y.
Table 1: Notations used in this work.
We compare
SurCo
at a high level with related work inte-
grating learning and optimization at the end of our paper.
We additionally present theoretical intuition that helps moti-
vate why training a model to predict surrogate linear coeffi-
cients may exhibit better sample complexity than previous
approaches that directly predict the optimal solution (Li
et al.,2018;Ban & Rudin,2019).
2. Problem Specification
Our goal is to solve the following nonlinear optimization
problem describe by y:
min
x
f(x;y) s.t.xΩ(y)(1)
where
xRn
are the
n
variables to be optimized,
f(x;y)
is the nonlinear differentiable cost function to be minimized,
Ω(y)
is the feasible region, typically specified by linear
(in)equalities and integer constraints, and
yY
are the
problem instance parameters drawn from a distribution
D
over
Y
. For example, in the traveling salesman problem,
y
can be the distance matrix among cities.
Differentiable cost function. The nonlinear cost function
f(x;y)
can either be given analytically, or the result of a
simulator made differentiable via finite differencing (e.g.,
JAX (Bradbury et al.,2018)). If the cost function
f(x;y)
is
not differentiable as in one of our experimental settings, we
can use a cost model that is learned from an offline dataset,
often generated via sampling multiple feasible solutions
within
Ω(y)
, and recording their costs. In this work, we
assume the following property of f(x;y):
Assumption 2.1 (Differentiable cost function).During opti-
mization, the cost function
f(x;y)
and its partial derivative
f/∂xare accessible.
2
SurCo: Learning Linear SURrogates for COmbinatorial Nonlinear Optimization Problems
Learning a good nonlinear cost model
f
is non-trivial for
practical applications (e.g., AlphaFold (Jumper et al.,2021),
Density Functional Theory (Nagai et al.,2020), cost model
for embedding tables (Zha et al.,2022a)) and is beyond the
scope of this work.
Evaluation Metric. We mainly focus on two aspects: the
solution quality evaluated by
f(ˆ
x;y)
, and the number of
queries of
f
during optimization to achieve the solution
ˆ
x
.
For both, smaller measurements are favorable, i.e., fewer
query of fto get solutions closer to global optimum.
When
f(x;y)
is linear w.r.t
x
, and the feasible region
Ω(y)
can be encoded using mixed integer programs or other math-
ematical programs, the problem can be solved efficiently
using existing scalable optimization solvers. When
f(x;y)
is nonlinear, we propose
SurCo
that learns a surrogate lin-
ear objective function, which allow us to leverage these
existing scalable optimization solvers, and which results in
a solution that has high quality with respect to the original
hard-to-encode objective function f(x;y).
3. SurCo: Learning Linear Surrogates
3.1. SurCo-zero: on-the-fly optimization
We start from the simplest case in which we focus on a single
instance with
f(x) = f(x;y)
and
Ω = Ω(y)
.
SurCo
-
zero aims to optimize the following objective:
(SurCo-zero): min
c
Lzero(c) := f(g(c)) (2)
where the surrogate optimizer
g:Rn7→ Rn
is the output
of certain combinatorial solvers with linear cost weight
cRn
and feasible region
Rn
. For example,
g
can
be the following:
g(c) := arg min
x
cxs.t.xΩ := {Axb,xZn}
(3)
which is the output of a MILP solver. Thanks to previ-
ous works (Ferber et al.,2020;Pogan
ˇ
ci
´
c et al.,2019), we
can efficiently compute the partial derivative
g(c)/∂c
.
Intuitively, this means that
g(c)
can be backpropagated
through. Since
f
is also differentiable with respect to the
solution it is evaluating, we thus can optimize Eqn. 2in an
end-to-end manner using any gradient-based optimizer:
c(t+ 1) = c(t)αg
c
f
x,(4)
where
α
is the learning rate. The procedure starts from a ran-
domly initialized
c(0)
and converges at a local optimal so-
lution of
c
. While Eqn. 2is still nonlinear optimization and
there is no guarantee about the quality of the final solution
c
,
we argue that optimizing Eqn. 2is better than optimizing the
original nonlinear cost
minxf(x)
. Furthermore, while
we cannot guarantee optimality, we guarantee feasibility by
leveraging a linear combinatorial solver.
Intuitively, instead of optimizing directly over the solution
space
x
, we optimize over the space of surrogate costs
c
,
and delegate the combinatorial feasibility requirements of
the nonlinear problem to SoTA combinatorial solvers. Com-
pared to naive approaches that directly optimize
f(x)
via
general optimization techniques, our method readily handles
complex constraints of the feasible regions, and thus makes
the optimization procedure easier. Furthermore, it also helps
escape from local minima, thanks to the embedded search
component of existing combinatorial solvers (e.g., branch-
and-bound (Land & Doig,2010) in MILP solvers). As we
see in the experiments, this is particularly important when
the problem becomes large-scale with more local optima.
This approach works well when we are optimizing individ-
ual instances and may not have access to offline training
data or the training time is cost-prohibitive.
Limitation. Note that due to linear surrogate, our approach
will always return a vertex in the feasible region, while the
solution to the original nonlinear objective may be in the
interior. We leave this limitation for future work. In many
real-world settings, such as in the three domains we tested,
the solutions are indeed on the vertices of feasible regions.
3.2. SurCo-prior: offline surrogate training
We now consider a more general case where we have
N
optimization instances, each parameterized by an instance
description
yi
,
i= 1 . . . N
, and we want to find their so-
lutions to a collection of nonlinear loss functions
f(x;yi)
simultaneously. Here we write
Dtrain := {yi}N
i=1
as the
training set. A naive approach is just to apply
SurCo
-
zero N
times, which leads to
N
independent surrogate
costs
{ci}N
i=1
. However, this approach does not consider
two important characteristics. First, it fails to leverage pos-
sible relationship between the instance descriptor
yi
and
its associated surrogate cost
ci
, since every surrogate cost
is independently estimated. Second, it fails to learn any
useful knowledge from the
N
instances after optimization.
As a result, for an unseen instance, the entire optimization
process needs to be conducted again, which is slow. This
motivates us to add a surrogate cost model
ˆ
c(y;θ)
into the
optimization as a regularizer:
(SurCo-prior-λ): min
θ,{ci}Lprior(θ,{ci};λ)
:=
N
X
i=1
f(gΩ(yi)(ci); yi) + λciˆ
c(yi;θ))2(5)
The regressor model
ˆ
c(y;θ)
directly predicts the surrogate
cost from the instance description. The form of the regressor
can be a neural network, in which
θ
is its parameters. Note
that when
λ= 0
, it reduces to
N
independent optimizations,
while when
λ > 0
, the surrogate costs
{ci}
interact with
each other. With the regressor, we distill knowledge gained
3
SurCo: Learning Linear SURrogates for COmbinatorial Nonlinear Optimization Problems
Methods Applicable to Objective can be Training Set Generalize to Built-in handling of
nonlinear objective free form unseen instances combinatorial constraints
Gradient Descent Yes Yes N/A No No
Evolutionary Algorithm Yes Yes N/A No No
Nonlinear combinatorial solvers Yes No N/A No Yes
Learning direct mapping Yes Yes {yi,x
i}Yes No
Predict-then-optimize Limited No {yi,x
i}Yes Yes
SurCo (proposed) Yes Yes {yi}Yes Yes
Table 2: Conceptual comparison of optimizers (both traditional and ML-guided). Our approach (SurCo) can handle nonlinear objective
without a predefined analytical form, does not require pre-computed optimal solutions in its training set, can handle combinatorial
constraints (via commercial solvers it incorporates), and can generalize to unseen instances.
from the optimization procedure into θ, which can be used
for an unseen instance
y
. Indeed, we use the learned regres-
sor model to predict the surrogate cost
c=ˆ
c(y;θ)
, and
directly solve the surrogate optimization (SO):
ˆ
x(y) = arg min
xΩ(y)ˆ
c(y;θ)x(6)
A special case is when
λ+
, we directly learn the
network parameters
θ
instead of individual surrogate costs:
(SurCo-prior): min
θLprior(θ)
:=
N
X
i=1
f(gΩ(yi)(ˆ
c(yi;θ)); yi)(7)
This approach is useful when the goal is to find high-quality
solutions for unseen instances of a problem distribution
when the upfront cost of offline training is acceptable but
the cost of optimizing on-the-fly is prohibitive. Here, we
require access to a distribution of training optimization prob-
lems, but at test time only require the feasible region and
not the nonlinear objective. Different from predict-then-
optimize (Elmachtoub & Grigas,2022a;Ferber et al.,2020)
or ML optimizers (Ban & Rudin,2019), we do not require
the optimal solution {x
i}N
i=1 as part of the training set.
3.3. SurCo-hybrid: fine-tuning a predicted surrogate
Naturally, we consider
SurCo
-
hybrid
, a hybrid approach
which initializes the coefficients of
SurCo
-
zero
with
the coefficients predicted from
SurCo
-
prior
which was
trained on offline data. This allows
SurCo
-
hybrid
to
start out optimization from an initial prediction that has
good performance for the distribution at large but which
is then fine-tuned for the specific instance. Formally, we
initialize
c(0) = ˆ
c(yi;θ)
and then continue optimizing
c
based on the update from
SurCo
-
zero
. This approach is
geared towards optimizing the nonlinear objective using a
high-quality initial prediction that is based on the problem
distribution and then fine-tuning the objective coefficients
based on the specific problem instance at test time. Here,
high performance comes at the runtime cost of both hav-
ing to train offline on a problem distribution as well as
performing fine-tuning steps on-the-fly. However, this ad-
ditional cost is often worthwhile when the main goal is to
find the best possible solutions by leveraging synthesized
domain knowledge in combination with individual problem
instances as arises in chip design (Mirhoseini et al.,2021)
and compiler optimization (Zhou et al.,2020).
4. Is Predicting Surrogate Cost better than
Predicting Solution? A Theoretical Analysis
One of the key ingredient of our proposed methods (
SurCo
-
prior
and
SurCo
-
hybrid
) is to learn a model to predict
surrogate cost
c
from instance description
y
, which is in
contrast with previous solution regression approaches that
directly learn a mapping from problem description
y
to the
solution
x(y)
(Ban & Rudin,2019). A natural question
arise: which one is better?
In this section, we give theoretical intuition to compare
the two approaches using a simple 1-nearest-neighbor (1-
NN) solution regressor (Fix,1985). We first relate the
number of samples needed to learn any mapping to its
Lipschitz constant
L
, and then show that for the direct
mapping
y7→ x(y)
,
L
can be very large. Therefore,
there exist fundamental difficulties to learn such a mapping.
When this happens, we can still find surrogate cost mapping
y7→ c(y)
with finite
L
that leads to the optimal solution
x(y)of the original nonlinear problems.
4.1. Lipschitz constant and sample complexity
Formally, consider fitting any mapping
ϕ:RdY7→ Rm
with a dataset
C:= {yi,ϕi}
. Here
Y
is a compact region
with finite volume
vol(Y)
. The Lipschitz constant
L
is the
smallest number so that
ϕ(y1)ϕ(y2)2Ly1y22
holds for any
y1,y2Y
. The following theorem shows
that if the dataset covers the space
Y
, we could achieve high
accuracy prediction: ϕ(y)ˆ
ϕ(y)2ϵfor any yY.
Definition 4.1 (
δ
-cover).A dataset
C:= {(yi,ϕi)}N
i=1 δ
-
covers the space
Y
, if for any
yY
, there exists at least
one yiso that yyi2δ.
Lemma 4.2 (Sufficient condition of prediction with
ϵ
-accuracy).If the dataset
C
can
(ϵ/L)
-cover
Y
, then for
any
yY
, a 1-nearest-neighbor regressor
ˆ
ϕ
leads to
ˆ
ϕ(y)ϕ(y)2ϵ.
4
摘要:

SurCo:LearningLinearSURrogatesforCOmbinatorialNonlinearOptimizationProblemsAaronFerber1TaoanHuang1DaochenZha2MartinSchubert3BenoitSteiner4BistraDilkina1YuandongTian3AbstractOptimizationproblemswithnonlinearcostfunc-tionsandcombinatorialconstraintsappearinmanyreal-worldapplicationsbutremainchal-lengi...

展开>> 收起<<
SurCo Learning Linear SURrogates for COmbinatorial Nonlinear Optimization Problems.pdf

共19页,预览4页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:19 页 大小:1.18MB 格式:PDF 时间:2025-05-02

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 19
客服
关注