
SurCo: Learning Linear SURrogates for COmbinatorial Nonlinear Optimization Problems
Learning a good nonlinear cost model $f$ is non-trivial for practical applications (e.g., AlphaFold (Jumper et al., 2021), Density Functional Theory (Nagai et al., 2020), cost models for embedding tables (Zha et al., 2022a)) and is beyond the scope of this work.
Evaluation Metric. We mainly focus on two aspects: the solution quality, evaluated by $f(\hat{x}; y)$, and the number of queries of $f$ made during optimization to reach the solution $\hat{x}$. For both, smaller measurements are favorable, i.e., fewer queries of $f$ to get solutions closer to the global optimum.
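In practice, the query count can be tracked by wrapping the expensive nonlinear objective $f$ in a counting closure; the following is a minimal sketch (the wrapper name and the toy objective are illustrative, not part of the paper's setup):

```python
def count_queries(f):
    # Wrap the expensive nonlinear objective f so that every evaluation
    # (query) made during optimization is counted.
    def wrapped(*args, **kwargs):
        wrapped.calls += 1
        return f(*args, **kwargs)
    wrapped.calls = 0
    return wrapped

# Usage: wrap a toy objective and check how often the optimizer queried it.
g = count_queries(lambda x: x * x)
results = [g(i) for i in range(3)]   # three queries of f
```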
When $f(x; y)$ is linear w.r.t. $x$, and the feasible region $\Omega(y)$ can be encoded using mixed integer programs or other mathematical programs, the problem can be solved efficiently using existing scalable optimization solvers. When $f(x; y)$ is nonlinear, we propose SurCo, which learns a surrogate linear objective function. This allows us to leverage these existing scalable optimization solvers, and yields a solution of high quality with respect to the original hard-to-encode objective function $f(x; y)$.
3. SurCo: Learning Linear Surrogates
3.1. SurCo-zero: on-the-fly optimization
We start from the simplest case in which we focus on a single instance with $f(x) = f(x; y)$ and $\Omega = \Omega(y)$. SurCo-zero aims to optimize the following objective:

$$\text{(SurCo-zero):} \quad \min_{c} \; L_{\text{zero}}(c) := f(g_\Omega(c)) \tag{2}$$
where the surrogate optimizer $g_\Omega: \mathbb{R}^n \mapsto \mathbb{R}^n$ is the output of certain combinatorial solvers with linear cost weight $c \in \mathbb{R}^n$ and feasible region $\Omega \subseteq \mathbb{R}^n$. For example, $g_\Omega$ can be the following:

$$g_\Omega(c) := \arg\min_{x} \; c^\top x \quad \text{s.t.} \quad x \in \Omega := \{Ax \le b,\; x \in \mathbb{Z}^n\} \tag{3}$$
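For intuition, the solver call in Eqn. 3 can be mimicked by brute-force enumeration on tiny binary problems; this is only an illustrative stand-in (at realistic scale one would call an actual MILP solver such as SCIP or Gurobi):

```python
import itertools
import numpy as np

def g_omega(c, A, b):
    # g_Omega(c) = argmin_x c^T x  s.t.  A x <= b, x in {0, 1}^n
    # (brute force over the 2^n binary vertices; toy problems only)
    n = len(c)
    best_x, best_val = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits, dtype=float)
        if np.all(A @ x <= b):          # feasibility check
            val = c @ x
            if val < best_val:
                best_x, best_val = x, val
    return best_x

# Usage: pick at most 2 of 3 items, cost favors items with lower c.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([2.0])
c = np.array([-1.0, -2.0, -3.0])
x_star = g_omega(c, A, b)               # -> [0., 1., 1.]
```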
which is the output of a MILP solver. Thanks to previous works (Ferber et al., 2020; Pogančić et al., 2019), we can efficiently compute the partial derivative $\partial g_\Omega(c)/\partial c$. Intuitively, this means that $g_\Omega(c)$ can be backpropagated through. Since $f$ is also differentiable with respect to the solution it is evaluating, we can thus optimize Eqn. 2 in an end-to-end manner using any gradient-based optimizer:

$$c(t+1) = c(t) - \alpha \frac{\partial g_\Omega}{\partial c} \frac{\partial f}{\partial x}, \tag{4}$$
where $\alpha$ is the learning rate. The procedure starts from a randomly initialized $c(0)$ and converges to a locally optimal solution $c$. While Eqn. 2 is still a nonlinear optimization problem and there is no guarantee about the quality of the final solution $c$, we argue that optimizing Eqn. 2 is better than optimizing the original nonlinear cost $\min_{x \in \Omega} f(x)$. Furthermore, while we cannot guarantee optimality, we guarantee feasibility by leveraging a linear combinatorial solver.
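The update in Eqn. 4 can be sketched end to end on a toy problem. In the sketch below, the solver, the nonlinear objective, its analytic gradient, and the finite-difference gradient estimator (in the spirit of the blackbox differentiation of Pogančić et al., 2019) are all simplified stand-ins, not the paper's actual components:

```python
import numpy as np

def solver(c):
    # Stand-in for g_Omega: argmin_{x in {0,1}^n} c^T x over the unit
    # hypercube's vertices. A real MILP solver would be called here.
    return (c < 0).astype(float)

def f(x):
    # Toy nonlinear objective: prefer exactly 3 active entries, low indices.
    return (x.sum() - 3.0) ** 2 + 0.1 * (x * np.arange(len(x))).sum()

def grad_f(x):
    # Analytic df/dx for the toy objective above.
    return 2.0 * (x.sum() - 3.0) * np.ones_like(x) + 0.1 * np.arange(len(x))

def surco_zero(n=8, steps=300, alpha=0.05, lam=10.0, seed=0):
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)                  # randomly initialized c(0)
    best_x = solver(c)
    f0 = best_f = f(best_x)
    for _ in range(steps):
        x = solver(c)                       # always feasible by construction
        if f(x) < best_f:
            best_x, best_f = x, f(x)
        # Blackbox solver differentiation (Pogancic et al., 2019):
        # dL/dc ~= (g(c + lam * df/dx) - g(c)) / lam
        grad_c = (solver(c + lam * grad_f(x)) - x) / lam
        c = c - alpha * grad_c              # gradient step as in Eqn. 4
    return best_x, best_f, f0

best_x, best_f, f0 = surco_zero()
```

Note that every iterate is a vertex returned by the solver, so feasibility holds throughout even though optimality is not guaranteed.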
Intuitively, instead of optimizing directly over the solution space $x$, we optimize over the space of surrogate costs $c$, and delegate the combinatorial feasibility requirements of the nonlinear problem to SoTA combinatorial solvers. Compared to naive approaches that directly optimize $f(x)$ via general optimization techniques, our method readily handles complex constraints of the feasible regions, and thus makes the optimization procedure easier. Furthermore, it also helps escape from local minima, thanks to the embedded search component of existing combinatorial solvers (e.g., branch-and-bound (Land & Doig, 2010) in MILP solvers). As we see in the experiments, this is particularly important when the problem becomes large-scale with more local optima.
This approach works well when we are optimizing individual instances and either have no access to offline training data or find the training time cost-prohibitive.
Limitation. Note that due to the linear surrogate, our approach always returns a vertex of the feasible region, while the solution to the original nonlinear objective may lie in the interior. We leave this limitation for future work. In many real-world settings, such as the three domains we tested, the solutions are indeed at vertices of the feasible region.
3.2. SurCo-prior: offline surrogate training
We now consider a more general case where we have $N$ optimization instances, each parameterized by an instance description $y_i$, $i = 1 \ldots N$, and we want to find their solutions to a collection of nonlinear loss functions $f(x; y_i)$ simultaneously. Here we write $D_{\text{train}} := \{y_i\}_{i=1}^{N}$ as the training set. A naive approach is just to apply SurCo-zero $N$ times, which leads to $N$ independent surrogate costs $\{c_i\}_{i=1}^{N}$. However, this approach does not consider two important characteristics. First, it fails to leverage possible relationships between the instance descriptor $y_i$ and its associated surrogate cost $c_i$, since every surrogate cost is independently estimated. Second, it fails to learn any useful knowledge from the $N$ instances after optimization. As a result, for an unseen instance, the entire optimization process needs to be conducted again, which is slow. This motivates us to add a surrogate cost model $\hat{c}(y; \theta)$ into the optimization as a regularizer:
$$\text{(SurCo-prior-}\lambda\text{):} \quad \min_{\theta, \{c_i\}} \; L_{\text{prior}}(\theta, \{c_i\}; \lambda) := \sum_{i=1}^{N} f(g_{\Omega(y_i)}(c_i); y_i) + \lambda \|c_i - \hat{c}(y_i; \theta)\|^2 \tag{5}$$
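The objective in Eqn. 5 can be evaluated with a few lines of numpy; in this sketch the solver, the instance-dependent cost, and the linear regressor are hypothetical toy stand-ins chosen only to make the two terms of the loss concrete:

```python
import numpy as np

def solver(c):
    # Toy surrogate solver g_Omega: argmin_{x in {0,1}^n} c^T x.
    return (c < 0).astype(float)

def f(x, y):
    # Toy instance-dependent nonlinear cost: y is a target active count.
    return (x.sum() - y) ** 2

def c_hat(y, theta):
    # Hypothetical linear regressor c_hat(y; theta).
    return theta * y

def l_prior(theta, cs, ys, lam):
    # Eqn. 5: sum_i f(g_Omega(c_i); y_i) + lam * ||c_i - c_hat(y_i; theta)||^2
    total = 0.0
    for c, y in zip(cs, ys):
        total += f(solver(c), y) + lam * np.sum((c - c_hat(y, theta)) ** 2)
    return total

# Usage: two instances whose surrogate costs already solve their f terms.
theta = np.zeros(4)
cs = [np.array([-1.0, 1.0, 1.0, 1.0]), np.array([-1.0, -1.0, 1.0, 1.0])]
ys = [1.0, 2.0]
loss_free = l_prior(theta, cs, ys, 0.0)   # lam = 0: N independent f terms
loss_reg = l_prior(theta, cs, ys, 1.0)    # lam > 0: regressor couples the c_i
```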
The regressor model $\hat{c}(y; \theta)$ directly predicts the surrogate cost from the instance description. The regressor can be, e.g., a neural network, in which case $\theta$ denotes its parameters. Note that when $\lambda = 0$, Eqn. 5 reduces to $N$ independent optimizations, while when $\lambda > 0$, the surrogate costs $\{c_i\}$ interact with each other. With the regressor, we distill knowledge gained