
update for a pre-determined number of iterations, fixing the overall computational budget. Note that the
network is only implicitly learning the regularizer. In practice, it is actually learning an update step, which
can be thought of as de-noising or as a projection onto the manifold of the data. LU is typically trained end-to-end, i.e., when fed some initialization x_0, the network outputs the final estimate x_K, a loss is computed with respect to the ground truth, and the error is back-propagated across the entire computational graph to update the network parameters. While end-to-end training is easier to perform and encourages faster convergence, it requires all intermediate activations to be stored in memory. Thus, the maximum number of iterations is typically kept small compared to classical iterative inverse problem solvers.
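To make the memory cost concrete, the following is a minimal PyTorch sketch of a fixed-budget loop-unrolled network. The operators A/At, the budget K, the step size, and the small CNN regularizer are illustrative assumptions, not the architecture used in this work.

```python
import torch
import torch.nn as nn

class LoopUnrolled(nn.Module):
    """Minimal sketch of fixed-budget loop unrolling (all choices illustrative)."""

    def __init__(self, A, At, K=8, step=0.1):
        super().__init__()
        self.A, self.At, self.K, self.step = A, At, K, step
        # Learned update step, shared across iterations for simplicity.
        self.reg = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, x0, y):
        x = x0
        for _ in range(self.K):  # fixed computational budget
            x = x - self.step * self.At(self.A(x) - y)  # data-fidelity gradient step
            x = x + self.reg(x)                          # learned update ("de-noising")
        # End-to-end training computes a loss between x (= x_K) and the ground
        # truth and back-propagates through all K iterations, so every
        # intermediate activation must be kept in memory.
        return x
```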
Due to this memory limitation, there is a trade-off between the depth of an LU and the richness of each regularization network. Intuitively, one can improve performance by increasing the number of loop-unrolled iterations. For example, [2] extends the LU model to a potentially infinite number of iterations using an implicit network, and [7] allows deeper unrolling using invertible networks, at the cost of recomputing the intermediate results from the output during training. This approach can be computationally intensive for large-scale inverse problems or when the forward operator is nonlinear and computationally expensive to apply. For example, the forward operator may involve solving differential equations, such as the wave equation for seismic wave propagation [8] or the Lorenz equations for atmospheric modeling [9].
Alternatively, one can design a richer regularization network. For example, [10] uses a transformer as the regularization network and achieves extremely competitive results in the fastMRI challenge [11], but requires multiple 24 GB GPUs to train with a batch size of 1, which is often impractical, especially for large systems. Our proposed design strikes a balance between the expressiveness of the regularization network and memory efficiency, offering an alternative way to achieve a rich regularization network without the additional memory costs during training.
2.2 Deep Equilibrium Models for Inverse Problems (DEQ4IP)
Deep equilibrium (DEQ) models introduce an alternative to traditional feed-forward networks [3–6]. Rather
than feed an input through a fixed (relatively small) number of layers, DEQ models solve for the “fixed point” of a network given some input. More formally, given a network f_θ, an initialization x^{(0)}, and an input y, we recursively apply the network via

    x^{(k+1)} = f_θ(x^{(k)}, y),        (6)

until convergence.¹ In this instance, y acts as an input injection that determines the final output. This is termed the forward process. The weights θ of the model can be trained via implicit differentiation, removing the need to store the intermediate activations from recursively applying the network. This allows for deeper, more expressive models without the associated memory footprint during training.
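As a concrete illustration, the following is a minimal PyTorch sketch of a DEQ layer in the style of (6): the forward fixed-point solve runs without building a computational graph, and the backward pass uses implicit differentiation via a gradient hook. The map f, the iteration counts, and the tolerance are illustrative assumptions, not the exact implementation of [2].

```python
import torch
import torch.nn as nn

class DEQ(nn.Module):
    """Minimal sketch of a deep equilibrium layer (all choices illustrative)."""

    def __init__(self, f, fwd_iters=50, bwd_iters=50, tol=1e-4):
        super().__init__()
        self.f = f  # f(x, y) -> x, same shape in and out
        self.fwd_iters, self.bwd_iters, self.tol = fwd_iters, bwd_iters, tol

    def forward(self, x0, y):
        # Forward process: iterate x <- f(x, y) to a fixed point, graph-free.
        with torch.no_grad():
            x = x0
            for _ in range(self.fwd_iters):
                x_new = self.f(x, y)
                if torch.norm(x_new - x) < self.tol * (torch.norm(x) + 1e-8):
                    x = x_new
                    break
                x = x_new
        # One extra application with autograd enabled attaches theta and y
        # to the graph at the equilibrium point x*.
        x = self.f(x, y)
        if x.requires_grad:
            # Detached copy used to take vector-Jacobian products of f at x*.
            x_star = x.detach().requires_grad_()
            f_star = self.f(x_star, y)

            def backward_hook(grad):
                # Implicit differentiation: solve g = grad + J_f(x*)^T g by
                # fixed-point iteration (assumes the iteration is contractive),
                # instead of back-propagating through all forward iterations.
                g = grad
                for _ in range(self.bwd_iters):
                    g = torch.autograd.grad(f_star, x_star, g,
                                            retain_graph=True)[0] + grad
                return g

            x.register_hook(backward_hook)
        return x
```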
[2] demonstrates an application of one such model, applying similar principles to a single iteration of an LU
architecture. Such an idea is a natural extension, as it allows the model to “iterate until convergence” rather than rely on a “fixed budget”: the model loops over (4) and (5) many times (in practice, usually around 50 iterations) until x_k converges. However, such a model can be unstable to train and often performs best with pre-training of the learned portion of the model (which typically acts as a learned regularizer/de-noiser). It is also important to note that such a model must apply the forward operator (and potentially the adjoint) many times during the forward process. Although the fixed-point solve can be accelerated to reduce the number of applications, it is still often more than the number required by an equivalent LU model. This can be an issue if the forward operator is computationally expensive to apply, an issue LU methods somewhat mitigate by fixing the total number of iterations.
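To illustrate the operator-count trade-off, the sketch below packages one LU-style update, in the spirit of (4) and (5), as the fixed-point map for a DEQ wrapper like the one sketched above. All names are illustrative assumptions; the point is that each fixed-point iteration applies A and its adjoint once, so roughly 50 iterations to convergence means roughly 50 applications of each, versus exactly K for a fixed-budget LU.

```python
import torch
import torch.nn as nn

class LUStep(nn.Module):
    """One LU-style update used as the DEQ map f(x, y) (illustrative sketch)."""

    def __init__(self, A, At, reg, step=0.1):
        super().__init__()
        self.A, self.At, self.reg, self.step = A, At, reg, step

    def forward(self, x, y):
        x = x - self.step * self.At(self.A(x) - y)  # one call each to A and A^T
        return x + self.reg(x)                       # learned update

# e.g., deq = DEQ(LUStep(A, At, reg)): ~50 forward iterations to converge
# implies ~50 applications of A and A^T per reconstruction.
```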
2.3 Alternative Approaches to Tackle Memory Issues
I-RIM [7] is a deep invertible network that addresses the memory issue by recomputing the intermediate results from the output. However, it is not ideal when the forward model is computationally expensive. Gradient checkpointing [12] is another practical technique for reducing the memory costs of deep neural networks. It saves memory by storing only a subset of the intermediate activations and recomputing the rest during the backward pass, trading extra computation for memory.
¹Note that, since our approach will ultimately use both methods, to aid in a clearer presentation we use subscripts, i.e., x_k, to denote the LU iterations, and parenthesized superscripts, i.e., r^{(i)}, to denote the iterations in the deep equilibrium model.