(He et al. 2016a,b) and similar models such as ResNeXts (Xie
et al. 2017) and Wide ResNets (Zagoruyko and Komodakis
2016), but is easily usable on other models, and ResNets
themselves remain competitive (Wightman, Touvron, and
Jégou 2021). The method leverages the analogy between
ResNets and the Euler scheme for ODEs (Weinan 2017) to
penalize the kinetic energy of the network. Intuitively, if the
kinetic energy is penalized enough, the current module will
barely move the points, thus at least preserving the accuracy
of the previous module and avoiding its collapse.
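For illustration, the following minimal sketch (our own block definition and a hypothetical regularization weight tau, not the paper's code) reads a residual block x_{k+1} = x_k + f_k(x_k) as an Euler step and adds the squared norm of the residual, a discrete kinetic energy, to the module's auxiliary loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: the block architecture, names and the weight tau
# are our assumptions. A residual block x_{k+1} = x_k + f_k(x_k) is read as an
# explicit Euler step, so the squared norm of the residual f_k(x_k) plays the
# role of a discrete kinetic energy that we add to the module's loss.

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        v = self.f(x)                              # displacement ("velocity") of the block
        return x + v, (v ** 2).sum(dim=1).mean()   # Euler step + kinetic energy of the step

def module_loss(blocks, aux_head, x, y, tau=0.1):
    """Auxiliary classification loss of one module plus its kinetic energy."""
    kinetic = 0.0
    for block in blocks:
        x, k = block(x)
        kinetic = kinetic + k
    return F.cross_entropy(aux_head(x), y) + tau * kinetic
```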
After a discussion of related work in Section 2, we present
the method in Section 3 and show that it amounts to a trans-
port regularization of each module, which we prove forces the
solution module to be an optimal transport map and makes it
regular. This also suggests a simple principled extension of
the method to non-residual networks. In Section 3, we link the
method with gradient flows and the minimizing movement
scheme in the Wasserstein space, which allows us to invoke
results on convergence to a minimizer under additional hypotheses
on the loss. Section 4 discusses different practical implemen-
tations and introduces a new variant of layer-wise training that we
call multi-lap sequential training (sketched below). It is a slight
variation on sequential layer-wise training that retains its advantages
and, in many cases, offers a non-negligible improvement over it for
the same computational and memory
costs. Experiments with different architectures and on differ-
ent classification datasets in Section 5 show that our method
consistently improves the test accuracy of block-wise and
module-wise trained residual networks, particularly in small
data regimes, whether the block-wise training is carried out
sequentially or in parallel.
Related work
Layer-wise training of neural networks has been considered
as a pre-training and initialization method (Bengio et al.
2006; Marquez, Hare, and Niranjan 2018) and was shown
recently to perform competitively with end-to-end training
(Belilovsky, Eickenberg, and Oyallon 2019; Nøkland and
Eidnes 2019). This has led to it being considered in practical
settings with limited resources such as embedded training
(Teng et al. 2020; Tang et al. 2021). For layer-wise train-
ing, many papers consider using a different auxiliary loss,
instead of or in addition to the classification loss: kernel sim-
ilarity (Mandar Kulkarni 2016), information-theory-inspired
losses (Sindy Löwe 2019; Nguyen and Choi 2019; Ma, Lewis,
and Kleijn 2020; Wang et al. 2021) and biologically plau-
sible losses (Sindy Löwe 2019; Nøkland and Eidnes 2019;
Gupta 2020; Bernd Illing 2020; Yuwen Xiong 2020). The paper
(Belilovsky, Eickenberg, and Oyallon 2019) reports the best
experimental results when solving the layer-wise problems
sequentially. The methods PredSim (Nøkland and Eidnes 2019),
DGL (Belilovsky, Eickenberg, and Oyallon 2020), Sedona
(Pyeon et al. 2021) and InfoPro (Wang et al. 2021) report the
best results when solving the layer-wise problems in paral-
lel, albeit each in a somewhat different setting. (Belilovsky,
Eickenberg, and Oyallon 2019, 2020) achieve this simply through
architectural choices, mostly regarding the auxiliary
networks. However, (Belilovsky, Eickenberg, and Oyallon
2019) do not consider ResNets, and the authors of PredSim state that their
method does not perform well on ResNets, specifically be-
cause of the skip connections. DGL only considers a ResNet
architecture by splitting it in the middle and training the two
halves without backpropagating between them. All three fo-
cus on VGG architectures and networks that are not deep.
Sedona applies architecture search to decide where to split
the network into 2 or 4 modules and what auxiliary classi-
fier to use before module-wise training. BoostResNet
(Huang et al. 2018) is the only other method that proposes a
block-wise training scheme geared towards ResNets. However,
its results only show better early performance in limited
experiments, and end-to-end fine-tuning is required to be
competitive. ResIST (Dun et al. 2021), a method similar to
block-wise training of ResNets, randomly assigns residual
blocks to one of up to 8 sub-networks that are trained
independently and reassembled before another random partition.
However, only blocks in the third section of the ResNet are
partitioned, the same block can appear in several sub-networks,
and the sub-networks are not necessarily made up of successive
blocks. Considered
as a distributed training method, it is only compared with
local SGD (Stich 2019). These methods can all be combined
with our regularization, and we use the auxiliary network
architecture from (Belilovsky, Eickenberg, and Oyallon 2019,
2020). We also show the benefits of our method both with
full layer-wise training and when the network is split into a
few modules.
Besides layer-wise training, methods such as DNI (Jader-
berg et al. 2017; Czarnecki et al. 2017), DDG (Huo et al.
2018) and Features Replay (Huo, Gu, and Huang 2018),
solve the update locking problem and the backward locking
problem with an eye towards parallelization by using delayed
or synthetic predicted gradients, or even predicted inputs to
address forward locking. However, they only fully apply this to
quite shallow networks, only split deeper ones into a small
number of sub-modules (fewer than five) that do not backpropagate
to each other, and observe training issues with more splits
(Huo, Gu, and Huang 2018). This makes them compare unfa-
vorably to layer-wise training (Belilovsky, Eickenberg, and
Oyallon 2020). The high dimension of the predicted gradient,
which scales with the size of the network, renders (Jaderberg
et al. 2017; Czarnecki et al. 2017) challenging in practice.
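For concreteness, the following is a minimal sketch of the synthetic-gradient idea behind DNI (our simplified illustration, not the authors' implementation); it also makes apparent that the predictor outputs a tensor of the same, potentially large, dimension as the activations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified two-module illustration of synthetic gradients: a small predictor
# estimates dLoss/dh at the cut point so the lower module can update without
# waiting for the true backward pass. Shapes and optimizers are placeholders.

lower = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
upper = nn.Linear(256, 10)
grad_predictor = nn.Linear(256, 256)            # outputs have the same dimension as h

opt_lower = torch.optim.SGD(lower.parameters(), lr=0.1)
opt_rest = torch.optim.SGD(list(upper.parameters()) + list(grad_predictor.parameters()), lr=0.1)

def step(x, y):
    h = lower(x)
    # 1) update the lower module immediately with the predicted gradient
    opt_lower.zero_grad()
    h.backward(grad_predictor(h.detach()).detach())
    opt_lower.step()
    # 2) update the upper module and fit the predictor to the true gradient
    h_in = h.detach().requires_grad_(True)
    task_loss = F.cross_entropy(upper(h_in), y)
    opt_rest.zero_grad()
    task_loss.backward()
    pred_loss = F.mse_loss(grad_predictor(h.detach()), h_in.grad.detach())
    pred_loss.backward()
    opt_rest.step()
    return task_loss.item()
```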
Therefore, despite its simplicity, greedy layer-wise training
is more appealing when working in a constrained setting.
Viewing residual networks as dynamical transport systems
(de Bézenac, Ayed, and Gallinari 2019; Karkar et al. 2020)
followed from their view as a discretization of differential
equations (Weinan 2017; Lu et al. 2018). Transport regular-
ization was also used in (Finlay et al. 2020) to accelerate the
training of the NeuralODE model (Chen et al. 2018). Trans-
port regularization of ResNets in particular is motivated by
the observation that they are naturally biased towards mini-
mally modifying their input (Jastrzebski et al. 2018; Hauser
2019; Karkar et al. 2020). We further link this transport view-
point with gradient flows in the Wasserstein space to apply it
in a principled way to module-wise training. Gradient flows
in the Wasserstein space operating on the data space appeared
recently in deep learning. In (Alvarez-Melis and Fusi 2021),
the focus is on functionals of measures whose first variations
are known in closed form and used, through their gradients,