
Unrolling methods Another direction for differentiating through optimization problems is unrolling [18, 45]. These methods approximate the solution of the inner problem with a first-order gradient method and then incorporate the gradients of the inner-loop optimization into the training procedure [5, 8, 9, 43]. This is an iterative procedure, which is usually truncated to a fixed number of iterations. Although unrolling methods are easy to implement, most of their applications are limited to unconstrained problems. If constraints are added, the unrolled solutions have to be projected onto the feasible region, which significantly increases the computational burden. By contrast, Alt-Diff requires only a very simple operation that projects the slack variable onto the nonnegative orthant. This substantially improves the efficiency of subsequent updates and reduces the method's computational complexity.
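To make the unrolling idea concrete, here is a minimal sketch (an assumed toy problem, not Alt-Diff itself): the inner objective f(x; θ) = ½(x − θ)² is minimized by K truncated gradient steps, and the derivative of the iterate with respect to θ is propagated alongside the iterate by the chain rule. For a constrained problem, each step would additionally project the iterate onto the feasible region.

```python
import numpy as np

def unrolled_solve(theta, K=50, lr=0.1):
    """Unroll K gradient steps on f(x; theta) = 0.5*(x - theta)**2
    and differentiate the iterates w.r.t. theta along the way."""
    x, dx_dtheta = 0.0, 0.0
    for _ in range(K):
        grad = x - theta                                 # df/dx at the current iterate
        x = x - lr * grad                                # truncated gradient step
        dx_dtheta = dx_dtheta - lr * (dx_dtheta - 1.0)   # chain rule through the step
        # a constrained variant would insert a projection here, e.g. x = max(x, 0)
    return x, dx_dtheta

x_K, g = unrolled_solve(2.0)
# x_K approaches the minimizer theta = 2.0, and dx*/dtheta approaches 1
```

After enough iterations the unrolled derivative matches the true sensitivity dx*/dθ = 1 of the inner minimizer; truncating earlier trades accuracy for compute, which is the core trade-off of unrolling.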
Implicit models There has been growing interest in implicit models in recent years [11, 34, 51]. Implicit layers replace the traditional feed-forward layers of neural networks with fixed-point iterations to compute inferences. They have driven substantial advances in many applications, including Neural ODEs [16, 20], Deep Equilibrium Models [6, 7, 32], nonconvex optimization problems [49], and implicit 3D surface layers [44, 47]. Implicit layers have a computational complexity similar to that of optimization layers, as they also require solving a costly Jacobian-based equation derived from the Implicit Function Theorem. Recently, [23] proposed a Jacobian-free backpropagation method to accelerate the training of implicit layers; however, this method is not suitable for optimization layers with complex constraints. For training implicit models, [25] proposed a novel gradient estimate called the phantom gradient, which relies on fixed-point unrolling and a Neumann series to provide a new update direction while forgoing computation of the precise gradient. Implicit models have also been extended to more complex learning frameworks, such as attention mechanisms [24] and Graph Neural Networks [30].
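The fixed-point forward pass and the Jacobian-based backward pass of an implicit layer can be sketched in a scalar toy example (an assumption for illustration, not any of the cited architectures): the layer output z* satisfies z* = tanh(w·z* + x), and the Implicit Function Theorem gives dz*/dx by solving a (here scalar) linear Jacobian equation at the fixed point.

```python
import numpy as np

def forward(x, w=0.5, iters=100):
    """Forward pass: fixed-point iteration z <- tanh(w*z + x)."""
    z = 0.0
    for _ in range(iters):
        z = np.tanh(w * z + x)
    return z

def backward_ift(x, z, w=0.5):
    """Backward pass via the Implicit Function Theorem:
    F(z, x) = z - tanh(w*z + x) = 0  =>  dz*/dx = s / (1 - w*s),
    where s = 1 - tanh(w*z* + x)**2 is tanh's derivative at the fixed point."""
    s = 1.0 - np.tanh(w * z + x) ** 2
    return s / (1.0 - w * s)

x = 0.3
z_star = forward(x)
g = backward_ift(x, z_star)

# finite-difference check of the implicit gradient
eps = 1e-6
fd = (forward(x + eps) - forward(x - eps)) / (2 * eps)
```

In higher dimensions the denominator becomes the matrix (I − w·J), and solving that linear system is the costly Jacobian-based step that Jacobian-free and phantom-gradient methods aim to avoid.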
3 Preliminary: Differentiable optimization layers
We consider a parameterized convex optimization problem with polyhedral constraints:
$$\min_{x} \; f(x;\theta) \quad \text{s.t.} \quad x \in \mathcal{C}(\theta) \tag{1}$$
where $x \in \mathbb{R}^n$ is the decision variable, the objective function $f:\mathbb{R}^n \to \mathbb{R}$ is convex, and the constraint set $\mathcal{C}(\theta) := \{x \mid Ax = b,\; Gx \le h\}$ is a polyhedron. For simplicity, we use $\theta$ to collect the parameters in the objective function and constraints. For any given $\theta$, a solution of optimization problem (1) is an $x^\star \in \mathbb{R}^n$ that minimizes $f(x;\theta)$ while satisfying the constraints, i.e., $x^\star \in \mathcal{C}(\theta)$. The optimization problem (1) can thus be viewed as a mapping from the parameters $\theta$ to the solution $x^\star$. Here, we focus on convex optimization with affine constraints due to its wide applications in control systems [31], signal processing [42], communication networks [14, 37], etc.
Since optimization problems can capture well-defined domain-specific knowledge in a model-driven way, embedding such optimization problems into neural networks within an end-to-end learning framework can simultaneously leverage the strengths of both model-driven and data-driven methods. Unlike conventional neural network layers, which are defined through explicit expressions, an optimization layer that embeds optimization problem (1) is defined as follows:
Definition 3.1. (Optimization Layer) A layer in a neural network is called an optimization layer if its input is the optimization parameter $\theta \in \mathbb{R}^m$ and its output $x^\star \in \mathbb{R}^n$ is the solution of the optimization problem (1).
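A minimal instance of Definition 3.1 (an assumed toy example, not from the paper): with $f(x;\theta) = \tfrac{1}{2}\|x-\theta\|^2$, $G = -I$, $h = 0$, and no equality constraints, problem (1) is projection onto the nonnegative orthant, whose solution has the closed form $x^\star = \max(\theta, 0)$, so the layer's input-output map can be written directly.

```python
import numpy as np

def opt_layer(theta):
    """Toy optimization layer: argmin_x 0.5*||x - theta||^2  s.t.  x >= 0.
    This instance of problem (1) has the closed-form solution max(theta, 0)."""
    return np.maximum(theta, 0.0)

theta = np.array([-1.0, 0.5, 2.0])
x_star = opt_layer(theta)   # the layer's output x*(theta)
```

For general $f$, $A$, $b$, $G$, $h$ no closed form exists and the layer must call a solver in the forward pass, which is what makes differentiating the map $\theta \mapsto x^\star$ nontrivial.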
The optimization layer can be treated as an implicit function $F:\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ with $F(x^\star, \theta) = 0$. In a deep learning architecture, optimization layers are implemented together with other explicit layers in an end-to-end framework. During training, the chain rule is used to backpropagate the gradient through the optimization layer. Given a loss function $R$, the derivative of the loss w.r.t. the parameter $\theta$ of the optimization layer is
$$\frac{\partial R}{\partial \theta} = \frac{\partial R}{\partial x^\star} \frac{\partial x^\star}{\partial \theta}. \tag{2}$$
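The chain rule in Eq. (2) can be carried out by hand for an assumed toy layer $x^\star(\theta) = \max(\theta, 0)$ with loss $R(x) = \tfrac{1}{2}\|x\|^2$: here $\partial R/\partial x^\star = x^\star$ and $\partial x^\star/\partial \theta = \mathrm{diag}(\mathbb{1}[\theta > 0])$, so the two factors multiply elementwise.

```python
import numpy as np

# Chain rule of Eq. (2) for the toy layer x*(theta) = max(theta, 0)
# and loss R(x) = 0.5 * ||x||^2 (illustrative assumption, not the paper's setup).
theta = np.array([-1.0, 0.5, 2.0])
x_star = np.maximum(theta, 0.0)

dR_dx = x_star                            # dR/dx* for R(x) = 0.5*||x||^2
dx_dtheta = (theta > 0).astype(float)     # diagonal of the Jacobian dx*/dtheta
dR_dtheta = dR_dx * dx_dtheta             # Eq. (2), elementwise for a diagonal Jacobian
```

For a general layer the Jacobian $\partial x^\star/\partial \theta$ is dense and implicitly defined, and computing it (or its product with $\partial R/\partial x^\star$) is exactly the problem that differentiable optimization methods address.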