Alternating Differentiation for Optimization Layers
Haixiang Sun1, Ye Shi∗1, Jingya Wang1, Hoang Duong Tuan2, H. Vincent Poor3, and
Dacheng Tao4,5
1School of Information Science and Technology, ShanghaiTech University, China
2School of Electrical and Data Engineering, University of Technology Sydney, Australia
3Department of Electrical and Computer Engineering, Princeton University, USA
4School of Computer Science, The University of Sydney, Australia
5JD Explore Academy, China
∗Corresponding author: shiye@shanghaitech.edu.cn
April 2023
Abstract
The idea of embedding optimization problems into deep neural networks as optimization layers
to encode constraints and inductive priors has taken hold in recent years. Most existing methods
focus on implicitly differentiating Karush–Kuhn–Tucker (KKT) conditions in a way that requires
expensive computations on the Jacobian matrix, which can be slow and memory-intensive. In
this paper, we develop a new framework, named Alternating Differentiation (Alt-Diff), that
differentiates optimization problems (here, specifically in the form of convex optimization problems
with polyhedral constraints) in a fast and recursive way. Alt-Diff decouples the differentiation
procedure into a primal update and a dual update in an alternating way. Accordingly, Alt-Diff
substantially decreases the dimensions of the Jacobian matrix, especially for optimization problems with
large-scale constraints, and thus increases the computational speed of implicit differentiation. We show
that the gradients obtained by Alt-Diff are consistent with those obtained by differentiating KKT
conditions. In addition, we propose to truncate Alt-Diff to further accelerate the computational
speed. Under some standard assumptions, we show that the truncation error of the gradients is bounded
above by the same order as the estimation error of the variables. Therefore, Alt-Diff can be truncated to
further increase computational speed without sacrificing much accuracy. A series of comprehensive
experiments validate the superiority of Alt-Diff.
1 Introduction
Recent years have seen a variety of applications in machine learning that consider optimization
as a tool for inference learning [2, 4, 5, 8, 9, 17]. Embedding optimization problems as optimization
layers in deep neural networks can capture useful inductive bias, such as domain-specific knowledge and
priors. Unlike conventional neural networks, which are defined by an explicit formulation in each layer,
optimization layers are defined implicitly by solving optimization problems. They can be treated as
implicit functions where inputs are mapped to optimal solutions. However, training optimization layers
together with explicit layers is not easy since explicit closed-form solutions typically do not exist for
the optimization layers.
Generally, methods for computing the gradients of optimization layers fall into two main categories:
differentiating the optimality conditions implicitly and applying unrolling methods. The ideas
of differentiating optimality conditions have been used in bilevel optimization [28, 36] and sensitivity
analysis [12]. Recently, OptNet [4] and CvxpyLayer [2] have extended this method to optimization
layers so as to enable end-to-end learning within the deep learning structure. However, these methods
inevitably require expensive computation on the Jacobian matrix. Thus they are prone to instability
and are often intractable, especially for large-scale optimization layers. Another direction to obtain the
gradients of optimization layers is based on the unrolling methods [17, 50], where an iterative first-order
gradient method is applied. However, they are memory-intensive since all the intermediate results have
to be recorded. Besides, unrolling methods are not well suited to constrained optimization problems,
since expensive projection operators are needed.
In this paper, we aim to develop a new method that significantly increases the computational speed of
the differentiation procedure for convex optimization problems with polyhedral constraints. Motivated
by the scalability of the operator splitting method for optimization problems [27, 48], we develop a
new framework, namely Alternating Differentiation (Alt-Diff), that differentiates optimization layers in
a fast and recursive way. Alt-Diff first decouples the constrained optimization problem into multiple
subproblems based on the well-known alternating direction method of multipliers (ADMM) [13]. Then,
the differentiation operators for obtaining the derivatives of the primal and dual variables w.r.t. the
parameters are implemented on these subproblems in an alternating manner. Accordingly, Alt-Diff
substantially decreases the dimensions of the Jacobian matrix, significantly increasing the computational
speed of implicit differentiation, especially for large-scale problems. Unlike most existing methods that
directly differentiate KKT conditions after obtaining the optimal solution for an optimization problem,
Alt-Diff performs the forward and backward procedures simultaneously. Both the forward and backward
procedures can be truncated without sacrificing much accuracy in terms of the gradients of optimization
layers. Overall, our contributions are three-fold:
• We develop a new differentiation framework, Alt-Diff, that decouples the optimization layers in an
alternating way. Alt-Diff significantly reduces the dimension of the KKT matrix, especially for
optimization with large-scale constraints, and thus increases the computational speed of implicit
differentiation.
• We show that: 1) the gradients obtained by Alt-Diff are consistent with those obtained by differentiating
the KKT conditions; 2) the truncation error of the gradients is bounded above by the same order as the
estimation error of the variables under some standard assumptions. Therefore, Alt-Diff can be
truncated to accelerate the computational speed without sacrificing much accuracy.
• We conduct a series of experiments and show that Alt-Diff can achieve results comparable to
state-of-the-art methods in much less time, especially for large-scale optimization problems. The
fast performance of Alt-Diff comes from the dimension reduction of the KKT matrix and the
truncation capability of Alt-Diff.
2 Related work
Differentiation for optimization layers Embedding optimization problems into deep learning
architectures as optimization layers to integrate domain knowledge has received increasing attention
in recent years [2, 4, 28, 29]. For example, [28] collected some general techniques for differentiating
argmin and argmax optimization problems based on the Implicit Function Theorem, while [4] proposed
a network architecture, namely OptNet, that integrates quadratic optimization into deep networks.
This work was further extended for end-to-end learning with stochastic programming [19]. Recently, an
efficient differentiable quadratic optimization layer [15] based on ADMM was proposed to accelerate the
differentiation procedure. However, these methods are restricted to the differentiation of quadratic
problems. For more general convex optimization, [3] proposed CvxpyLayer, which differentiates through
disciplined convex programs and uses sparse operations and LSQR [46] to speed up the differentiation
procedure. Although these have been worthwhile attempts to accelerate the differentiation
procedure, they still largely focus on implicitly differentiating the KKT conditions in a way that requires
expensive operations on a large-scale Jacobian matrix. Recently, a differentiation solver named JaxOpt
[10] has also been put forward that is based on an implicit automatic differentiation mechanism under
the Jax framework.
Unrolling methods Another direction for differentiating optimization problems is the unrolling
methods [18, 45]. These approximate the objective function with a first-order gradient method and then
incorporate the gradient of the inner loop optimization into the training procedures [5, 8, 9, 43]. This is
an iterative procedure, which is usually truncated to a fixed number of iterations. Although unrolling
methods are easy to implement, most of their applications are limited to unconstrained problems. If
constraints are added, the unrolling solutions have to be projected into the feasible region, significantly
increasing the computational burden. By contrast, Alt-Diff only requires a very simple operation that
projects the slack variable to the nonnegative orthant. This substantially improves the efficiency of
subsequent updates and reduces the method’s computational complexity.
Implicit models There has been growing interest in implicit models in recent years [11, 34, 51].
Implicit layers replace the traditional feed-forward layers of neural networks with fixed point iterations to
compute inferences. They have been responsible for substantial advances in many applications, including
Neural ODE [16, 20], Deep Equilibrium Models [6, 7, 32], nonconvex optimization problems [49] and
implicit 3D surface layers [44, 47], etc. Implicit layers have a similar computational complexity to
optimization layers, as they also require solving a costly Jacobian-based equation based on the Implicit
Function Theorem. Recently, [23] proposed a Jacobian-free backpropagation method to accelerate the
speed of training implicit layers. However, this method is not suitable for the optimization layers with
complex constraints. In terms of training implicit models, [25] also proposed a novel gradient estimate,
called the phantom gradient, which relies on fixed-point unrolling and a Neumann series to provide a new
update direction while forgoing computation of the exact gradient. Implicit models have also been extended
to more complex learning frameworks, such as attention mechanisms [24] and Graph Neural Networks
[30].
3 Preliminary: Differentiable optimization layers
We consider a parameterized convex optimization problem with polyhedral constraints:
$$\min_{x}\ f(x;\theta) \quad \text{s.t.}\quad x \in \mathcal{C}(\theta), \tag{1}$$
where $x \in \mathbb{R}^n$ is the decision variable, the objective function $f: \mathbb{R}^n \to \mathbb{R}$ is convex, and the constraint set
$\mathcal{C}(\theta) := \{x \mid Ax = b,\ Gx \leq h\}$ is a polyhedron. For simplicity, we use $\theta$ to collect the parameters
in the objective function and constraints. For any given $\theta$, a solution of optimization problem (1) is
$x^\star \in \mathbb{R}^n$ that minimizes $f(x;\theta)$ while satisfying the constraints, i.e., $x^\star \in \mathcal{C}(\theta)$. The optimization
problem (1) can then be viewed as a mapping from the parameters $\theta$ to the solution $x^\star$. Here, we
focus on convex optimization with affine constraints due to its wide applications in control systems [31],
signal processing [42], communication networks [14, 37], etc.
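To make this mapping concrete, the following sketch solves a toy instance of problem (1) with a strongly convex quadratic objective using CVXPY. The data ($q$, $A$, $b$, $G$, $h$) and dimensions are hypothetical placeholders standing in for $\theta$, not an instance taken from this paper.

```python
import cvxpy as cp
import numpy as np

# Toy instance of problem (1): a strongly convex quadratic objective over the
# polyhedron C(theta) = {x : Ax = b, Gx <= h}. All data below are hypothetical
# placeholders for the parameters theta.
n, p, m = 10, 3, 20
rng = np.random.default_rng(0)
q = rng.standard_normal(n)
A = rng.standard_normal((p, n))
G = rng.standard_normal((m, n))
z = rng.standard_normal(n)          # used only to construct feasible b and h
b = A @ z
h = G @ z + 1.0

x = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(x) + q @ x)
problem = cp.Problem(objective, [A @ x == b, G @ x <= h])
problem.solve()                     # the forward pass: theta -> x_star
x_star = x.value                    # the layer's output for this theta
```

Changing any entry of $q$, $A$, $b$, $G$, or $h$ and re-solving realizes the mapping $\theta \mapsto x^\star$ that the optimization layer implements; differentiating this mapping is the subject of the rest of the paper.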
Since optimization problems can capture well-defined specific domain knowledge in a model-driven
way, embedding such optimization problems into neural networks within an end-to-end learning frame-
work can simultaneously leverage the strengths of both the model-driven and the data-driven methods.
Unlike conventional neural networks, which are defined through explicit expressions of each layer, an
optimization layer embedding the optimization problem (1) is defined as follows:

Definition 3.1 (Optimization Layer). A layer in a neural network is defined as an optimization layer
if its input is the optimization parameters $\theta \in \mathbb{R}^m$ and its output $x^\star \in \mathbb{R}^n$ is the solution of the
optimization problem (1).
The optimization layer can be treated as an implicit function $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ with $F(x^\star, \theta) = 0$.
In a deep learning architecture, optimization layers are implemented together with other explicit layers
in an end-to-end framework. During the training procedure, the chain rule is used to backpropagate
the gradient through the optimization layer. Given a loss function $\mathcal{R}$, the derivative of the loss w.r.t.
the parameter $\theta$ of the optimization layer is
$$\frac{\partial \mathcal{R}}{\partial \theta} = \frac{\partial \mathcal{R}}{\partial x^\star}\,\frac{\partial x^\star}{\partial \theta}. \tag{2}$$
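As an illustration of how (2) is used during training, the sketch below wires an optimization layer into PyTorch's autograd. Here `solve_opt` is a hypothetical helper (not part of this paper or any released library) that returns the solution and its Jacobian, e.g. computed via (3) or the recursion of Section 4, and $\theta$ is assumed to be a vector.

```python
import torch

# Minimal sketch: an optimization layer as a custom autograd Function. The
# backward pass applies the chain rule (2), multiplying the incoming gradient
# dR/dx_star by the Jacobian dx_star/dtheta. `solve_opt` is a hypothetical
# helper returning (x_star, jac) as tensors of shapes (n,) and (n, m).
class OptLayerFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, theta):
        x_star, jac = solve_opt(theta)
        ctx.save_for_backward(jac)
        return x_star

    @staticmethod
    def backward(ctx, grad_x_star):
        (jac,) = ctx.saved_tensors
        # Chain rule (2): dR/dtheta = (dR/dx_star) @ (dx_star/dtheta).
        return grad_x_star @ jac

# Usage inside a network: x_star = OptLayerFn.apply(theta)
```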
Obviously, the derivative $\partial \mathcal{R}/\partial x^\star$ can be easily obtained by automatic differentiation techniques on explicit
layers, such as fully connected layers and convolutional layers. However, since explicit closed-form
solutions typically do not exist for optimization layers, computing $\partial x^\star/\partial \theta$ brings additional computational
difficulties during implementation. Recent work, such as OptNet [4] and CvxpyLayer [2], has shown that the
Jacobian $\partial x^\star/\partial \theta$ can be derived by differentiating the KKT conditions of the optimization problem based
on the Implicit Function Theorem [35]. This is briefly recalled in Lemma 3.2.
Lemma 3.2 (Implicit Function Theorem). Let $F(x, \theta)$ denote a continuously differentiable function
with $F(x^\star, \theta) = 0$. Suppose the Jacobian of $F(x, \theta)$ with respect to $x$ is invertible at $(x^\star, \theta)$; then the
derivative of the solution $x^\star$ with respect to $\theta$ is
$$\frac{\partial x^\star}{\partial \theta} = -\left[J_{F;x}\right]^{-1} J_{F;\theta}, \tag{3}$$
where $J_{F;x} := \nabla_x F(x^\star, \theta)$ and $J_{F;\theta} := \nabla_\theta F(x^\star, \theta)$ are the Jacobians of $F(x, \theta)$ w.r.t. $x$ and $\theta$,
respectively.
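A minimal scalar illustration of Lemma 3.2, on the made-up implicit relation $F(x,\theta) = x - \cos(\theta x) = 0$ (chosen only because its Jacobians are easy to write down, not drawn from the paper), checks the implicit gradient (3) against a finite-difference estimate:

```python
import numpy as np

# Toy scalar check of equation (3): F(x, theta) = x - cos(theta * x) = 0.
def solve_forward(theta, x0=0.5, iters=200):
    x = x0
    for _ in range(iters):          # fixed-point iteration x <- cos(theta * x)
        x = np.cos(theta * x)
    return x

def implicit_grad(theta):
    x_star = solve_forward(theta)
    J_x = 1.0 + theta * np.sin(theta * x_star)      # J_{F;x} at (x_star, theta)
    J_theta = x_star * np.sin(theta * x_star)       # J_{F;theta} at (x_star, theta)
    return -J_theta / J_x                           # equation (3), scalar case

theta, eps = 1.0, 1e-6
fd = (solve_forward(theta + eps) - solve_forward(theta - eps)) / (2 * eps)
print(implicit_grad(theta), fd)     # the two estimates should agree closely
```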
It is worth noting that the differentiation procedure needs to calculate the optimal solution $x^\star$ in the
forward pass and involves solving a linear system with the Jacobian matrix $J_{F;x}$ in the backward pass. Both
the forward and backward passes are generally very computationally expensive, especially for large-scale
problems. This means that differentiating the KKT conditions directly is not scalable to large optimization
problems. Existing solvers have made some attempts to alleviate this issue. Specifically, CvxpyLayer
adopts an LSQR technique to accelerate implicit differentiation for sparse optimization problems.
However, this is not necessarily efficient for more general cases, which may not be sparse.
OptNet uses a primal-dual interior point method in the forward pass, and its backward pass
is relatively cheap, but it is suitable only for quadratic optimization problems. In this paper, our
main target is to develop a new method that increases the computational speed of the differentiation
procedure, especially for large-scale constrained optimization problems.
4 Alternating differentiation for optimization layers
This section provides the details of the Alt-Diff algorithm for optimization layers. Alt-Diff decomposes
the differentiation of a large-scale KKT system into smaller problems that are solved in a
primal-dual alternating way (see Section 4.1). Alt-Diff reduces the computational complexity of the
model and significantly improves the computational efficiency of the optimization layers. In Section 4.2,
we theoretically analyze the truncation capability of Alt-Diff. Notably, Alt-Diff can be truncated for
inexact solutions to further increase computational speed without sacrificing much accuracy.
4.1 Alternating differentiation
Motivated by the scalability of the operator splitting method [27, 48], Alt-Diff first decouples a
constrained optimization problem into multiple subproblems based on ADMM. Each split operator
is then differentiated to establish the derivatives of the primal and dual variables w.r.t. the parameters
in an alternating fashion. The augmented Lagrangian of problem (1) with a quadratic penalty
term is
$$\max_{\lambda,\nu}\ \min_{x,\,s\geq 0}\ L(x, s, \lambda, \nu;\theta) = f(x;\theta) + \langle \lambda, Ax - b\rangle + \langle \nu, Gx + s - h\rangle + \frac{\rho}{2}\left(\|Ax - b\|^2 + \|Gx + s - h\|^2\right), \tag{4}$$
where $s \geq 0$ is a non-negative slack variable and $\rho > 0$ is a hyperparameter associated with the penalty
term. Accordingly, the following ADMM procedure is used to alternately update the primal, slack
and dual variables:
$$x^{k+1} = \arg\min_{x}\ L(x, s^{k}, \lambda^{k}, \nu^{k}; \theta), \tag{5a}$$
$$s^{k+1} = \arg\min_{s \geq 0}\ L(x^{k+1}, s, \lambda^{k}, \nu^{k}; \theta), \tag{5b}$$
$$\lambda^{k+1} = \lambda^{k} + \rho\,(Ax^{k+1} - b), \tag{5c}$$
$$\nu^{k+1} = \nu^{k} + \rho\,(Gx^{k+1} + s^{k+1} - h). \tag{5d}$$
Note that the primal variable $x^{k+1}$ is updated by solving an unconstrained optimization problem.
The update of the slack variable $s^{k+1}$ is easily obtained by a closed-form solution via a Rectified Linear
Unit (ReLU) as
$$s^{k+1} = \mathrm{ReLU}\!\left(-\frac{1}{\rho}\nu^{k} - \left(Gx^{k+1} - h\right)\right). \tag{6}$$
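As a concrete illustration, the sketch below implements one forward sweep of (5a)–(5d) together with the ReLU slack update (6), under the assumption of a quadratic objective $f(x;\theta) = \tfrac{1}{2}x^{T}Qx + q^{T}x$ so that the $x$-update reduces to a single linear solve. The function and variable names are illustrative, not the authors' released implementation.

```python
import numpy as np

# One forward ADMM sweep (5a)-(5d) with the closed-form slack update (6) for a
# quadratic objective. H = Q + rho*A.T@A + rho*G.T@G is precomputed by the caller.
def admm_forward_step(H, q, A, b, G, h, s, lam, nu, rho):
    # (5a): minimize the augmented Lagrangian over x (set its x-gradient to zero).
    rhs = -(q + A.T @ lam + G.T @ nu - rho * A.T @ b - rho * G.T @ (h - s))
    x_new = np.linalg.solve(H, rhs)
    # (5b)/(6): closed-form slack update, i.e. projection onto s >= 0 via ReLU.
    s_new = np.maximum(0.0, -nu / rho - (G @ x_new - h))
    # (5c)-(5d): dual ascent on the equality and inequality multipliers.
    lam_new = lam + rho * (A @ x_new - b)
    nu_new = nu + rho * (G @ x_new + s_new - h)
    return x_new, s_new, lam_new, nu_new
```

For a general convex $f$, the $x$-update (5a) would instead call an inner solver for the unconstrained subproblem, but the overall structure of the sweep is unchanged.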
The differentiations of the slack and dual variables are trivial and can be done using automatic
differentiation techniques. Therefore, the computational difficulty is now concentrated on differentiating
the primal variable, which can be done using the Implicit Function Theorem. Applying differentiation
to procedure (5) leads to
$$\frac{\partial x^{k+1}}{\partial \theta} = -\left[\nabla_x^2 L(x^{k+1})\right]^{-1} \nabla_{x,\theta} L(x^{k+1}), \tag{7a}$$
$$\frac{\partial s^{k+1}}{\partial \theta} = -\frac{1}{\rho}\,\mathrm{sgn}(s^{k+1})\cdot\mathbf{1}^{T} \odot \left(\frac{\partial \nu^{k}}{\partial \theta} + \rho\,\frac{\partial (Gx^{k+1} - h)}{\partial \theta}\right), \tag{7b}$$
$$\frac{\partial \lambda^{k+1}}{\partial \theta} = \frac{\partial \lambda^{k}}{\partial \theta} + \rho\,\frac{\partial (Ax^{k+1} - b)}{\partial \theta}, \tag{7c}$$
$$\frac{\partial \nu^{k+1}}{\partial \theta} = \frac{\partial \nu^{k}}{\partial \theta} + \rho\,\frac{\partial (Gx^{k+1} + s^{k+1} - h)}{\partial \theta}, \tag{7d}$$
where $\nabla_x^2 L(x^{k+1}) = \nabla_x^2 f(x^{k+1})^{T} + \rho A^{T}A + \rho G^{T}G$, $\odot$ denotes the Hadamard product, and $\mathrm{sgn}(s^{k+1})$
denotes a function such that $\mathrm{sgn}(s_i^{k+1}) = 1$ if $s_i^{k+1} \geq 0$ and $\mathrm{sgn}(s_i^{k+1}) = 0$ otherwise.
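The sketch below carries out one step of the recursion (7a)–(7d) for the same quadratic setting as the forward sweep above, differentiating with respect to the equality right-hand side $b$ only (so $\theta = b$ here) to keep the Jacobian shapes small. The restricted choice of $\theta$, the helper signature, and the use of $s_i > 0$ to select the active ReLU entries are assumptions made for illustration.

```python
import numpy as np

# One alternating gradient step (7a)-(7d) for a quadratic objective with theta = b.
# s_new comes from the current forward sweep; ds, dlam, dnu are the Jacobians of
# s^k, lambda^k, nu^k w.r.t. b from the previous step; H = Q + rho*A.T@A + rho*G.T@G.
def alt_diff_step_wrt_b(H, A, G, s_new, ds, dlam, dnu, rho):
    p = A.shape[0]
    # (7a): differentiate the stationarity condition of the x-update w.r.t. b.
    grad_rhs = A.T @ dlam + G.T @ dnu + rho * G.T @ ds - rho * A.T
    dx_new = -np.linalg.solve(H, grad_rhs)
    # (7b): differentiate the ReLU slack update; D selects the active entries.
    D = (s_new > 0).astype(float)[:, None]
    ds_new = D * (-dnu / rho - G @ dx_new)
    # (7c)-(7d): differentiate the dual ascent steps (the -I term comes from -b).
    dlam_new = dlam + rho * (A @ dx_new - np.eye(p))
    dnu_new = dnu + rho * (G @ dx_new + ds_new)
    return dx_new, ds_new, dlam_new, dnu_new
```

Differentiating with respect to other parameters (e.g. $q$, $h$, or the entries of $G$) follows the same pattern; only the explicit derivative terms change.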
Algorithm 1 Alt-Diff
Parameter: $\theta$ from the previous layer
Initialize the slack variable and the dual variables
while $\|x^{k+1} - x^{k}\| / \|x^{k}\| \geq \epsilon$ do
    Forward update by (5)
    Primal update $\partial x^{k+1}/\partial\theta$ by (7a)
    Slack update $\partial s^{k+1}/\partial\theta$ by (7b)
    Dual update $\partial \lambda^{k+1}/\partial\theta$ and $\partial \nu^{k+1}/\partial\theta$ by (7c) and (7d)
    $k := k + 1$
end while
return $x^\star$ and its gradient $\partial x^\star/\partial\theta$
We summarize the procedure of Alt-Diff in Algorithm 1 and provide a framework of Alt-Diff in
Appendix A. The convergence behavior of Alt-Diff can be inspected by checking $\|x^{k+1} - x^{k}\| / \|x^{k}\| < \epsilon$,
where $\epsilon$ is a predefined threshold.

Notably, Alt-Diff is somewhat similar to unrolling methods such as those in [18, 22]. However, these
unrolling methods were designed for unconstrained optimization. If constraints are added, the unrolling
solutions have to be projected onto a feasible region, and such projection operators are generally very
computationally expensive. By contrast, Alt-Diff decouples the constraints from the optimization and only
involves a very simple operation that projects the slack variable $s$ onto the nonnegative orthant $s \geq 0$.
This significantly improves the efficiency of subsequent updates and reduces the overall computational
complexity of Alt-Diff. Besides, conventional unrolling methods usually consume more memory, since all
the intermediate computational results have to be recorded. By updating (7) continuously, Alt-Diff does
not need to save the results of previous rounds; it only keeps the results of the last round, i.e., the previous
$\partial x^{k}/\partial\theta$ is replaced by $\partial x^{k+1}/\partial\theta$ iteratively. Besides, we show that the gradient obtained by Alt-Diff is
consistent with the one obtained by implicitly differentiating the optimality conditions. A detailed proof,
as well as the procedure for differentiating the KKT conditions, is presented in Appendix E.
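For completeness, a compact rendering of Algorithm 1 under the same quadratic, $\theta = b$ assumptions is sketched below; it chains the forward step sketched after (6) with the gradient step sketched after (7), keeps only the most recent Jacobians (matching the memory discussion above), and stops on the relative change of $x$. The helper names are assumptions, not the authors' implementation.

```python
import numpy as np

# A compact sketch of Algorithm 1 for a quadratic objective, differentiating
# w.r.t. the equality right-hand side b. Uses admm_forward_step and
# alt_diff_step_wrt_b from the earlier sketches; Jacobians are overwritten
# each round rather than stored for all iterations.
def alt_diff_wrt_b(Q, q, A, b, G, h, rho=1.0, eps=1e-6, max_iter=1000):
    n, p, m = Q.shape[0], A.shape[0], G.shape[0]
    H = Q + rho * A.T @ A + rho * G.T @ G
    x, s, lam, nu = np.zeros(n), np.zeros(m), np.zeros(p), np.zeros(m)
    dx = np.zeros((n, p))
    ds, dlam, dnu = np.zeros((m, p)), np.zeros((p, p)), np.zeros((m, p))
    for _ in range(max_iter):
        x_prev = x
        x, s, lam, nu = admm_forward_step(H, q, A, b, G, h, s, lam, nu, rho)
        dx, ds, dlam, dnu = alt_diff_step_wrt_b(H, A, G, s, ds, dlam, dnu, rho)
        # Stopping rule of Algorithm 1: relative change of the primal iterate.
        if np.linalg.norm(x - x_prev) < eps * max(np.linalg.norm(x_prev), 1e-12):
            break
    return x, dx          # x_star and dx_star/db
```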