Block-wise Training of Residual Networks via the Minimizing Movement Scheme

Skander Karkar,1,2 Ibrahim Ayed,1,3 Emmanuel de Bézenac,1 Patrick Gallinari1,2
1LIP6, Sorbonne Université, France
2Criteo AI Lab, Criteo, France
3Theresis lab, Thales, France
{skander.karkar, ibrahim.ayed, emmanuel.de-bezenac, patrick.gallinari}@lip6.fr
Abstract
End-to-end backpropagation has a few shortcomings: it re-
quires loading the entire model during training, which can be
impossible in constrained settings, and suffers from three lock-
ing problems (forward locking, update locking and backward
locking), which prohibit training the layers in parallel. Solving
layer-wise optimization problems can address these problems
and has been used in on-device training of neural networks.
We develop a layer-wise training method, particularly well-
adapted to ResNets, inspired by the minimizing movement
scheme for gradient flows in distribution space. The method
amounts to a kinetic energy regularization of each block that
makes the blocks optimal transport maps and endows them
with regularity. It works by alleviating the stagnation problem
observed in layer-wise training, whereby greedily-trained early
layers overfit and deeper layers stop increasing test accuracy
after a certain depth. We show on classification tasks that the
test accuracy of block-wise trained ResNets is improved when
using our method, whether the blocks are trained sequentially
or in parallel.
Introduction
End-to-end backpropagation is the standard training method
of neural nets. But there are reasons to look for alternatives.
It is considered biologically unrealistic (Mostafa, Ramesh,
and Cauwenberghs 2018) and it requires loading the whole
model during training which can be impossible in constrained
settings such as training on mobile devices (Teng et al. 2020;
Tang et al. 2021). It also prohibits training layers in parallel
as it suffers from three locking problems (forward locking:
each layer must wait for the previous layers to process its
input, update locking: each layer must wait for the end of
the forward pass to be updated, and backward locking: each
layer must wait for errors to backpropagate from the last
layer to be updated) (Jaderberg et al. 2017). These locking
problems force the training and deployment of networks to be
sequential and synchronous, and breaking them would allow
for more flexibility when using networks that are distributed
between a central agent and clients and that operate at differ-
ent rates (Jaderberg et al. 2017). Greedily solving layer-wise
optimization problems, sequentially (i.e. one after the other)
or in parallel (i.e. batch-wise), solves update locking (and so
also backward locking). When combined with buffers, paral-
lel layer-wise training solves all three problems (Belilovsky,
Eickenberg, and Oyallon 2020) and allows distributed train-
ing of the layers. Layer-wise training is appealing in memory-
constrained settings as it works without most gradients and
activations needed in end-to-end training, and when done
sequentially, only requires loading and training one layer at
a time. Despite its simplicity, layer-wise training has been
shown (Belilovsky, Eickenberg, and Oyallon 2019, 2020) to
scale well. It outperforms more complicated ideas developed
to address the locking problems such as synthetic (Jaderberg
et al. 2017; Czarnecki et al. 2017) and delayed (Huo et al.
2018; Huo, Gu, and Huang 2018) gradients. We can also
deduce theoretical results about a network of greedily-trained
shallow sub-modules from the theoretical results about shal-
low networks (Belilovsky, Eickenberg, and Oyallon 2019,
2020). Module-wise training, where the network is split into
modules that are trained greedily, may not offer the same
computational gains as full layer-wise training, but it gets
closer to the accuracy of end-to-end training and is explored
often (Pyeon et al. 2021; Wang et al. 2021).
The typical setting of (sequential) module-wise training
for minimizing a loss L is, given a dataset ρ̃_0, to solve one after the other, for 0 ≤ k ≤ K, the problems
\[
(T_k, F_k) \in \operatorname*{arg\,min}_{T,F} \sum_{x \in \tilde\rho_0} L\big(F, T(G_{k-1}(x))\big) \tag{1}
\]
where G_k = T_k ∘ ... ∘ T_0 for 0 ≤ k ≤ K and G_{-1} = id.
Here, T_k is the module (a single layer or a group of layers) and F_k is an auxiliary network (a classifier if the task is classification) that processes the outputs of T_k so that the loss can be computed. In a generation task, auxiliary networks might not be needed. In all cases, module T_{k+1} receives the output of module T_k. The final network trained this way is F_K ∘ G_K, but we can stop at any previous depth k and use F_k ∘ G_k
if it performs better. In fact, and especially when modules
are shallow, module-wise training suffers from a stagnation
problem, whereby greedily-trained early modules overfit and
deeper modules stop improving the test accuracy after a cer-
tain depth, or even degrade it (Marquez, Hare, and Niranjan
2018; Wang et al. 2021). We observe this experimentally (Fig-
ure 1) and propose a regularization for module-wise training
that addresses this problem by increasing training stability.
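As a concrete illustration of the greedy problems (1), here is a minimal PyTorch sketch of sequential module-wise training; the module list, auxiliary classifiers, optimizer settings and data loader are hypothetical placeholders rather than the exact setup used in the paper.

```python
import torch
import torch.nn as nn

def train_module_wise(modules, auxiliaries, loader, epochs=10, lr=1e-3, device="cpu"):
    """Greedy sequential module-wise training, following problem (1):
    each module T_k (with its auxiliary classifier F_k) is trained on the
    frozen outputs of the previously trained modules G_{k-1}."""
    trained = []  # the already-trained (frozen) modules forming G_{k-1}
    for T_k, F_k in zip(modules, auxiliaries):
        T_k, F_k = T_k.to(device), F_k.to(device)
        opt = torch.optim.SGD(list(T_k.parameters()) + list(F_k.parameters()), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                with torch.no_grad():          # G_{k-1}(x): no gradient flows to earlier modules
                    for T_prev in trained:
                        x = T_prev(x)
                loss = nn.functional.cross_entropy(F_k(T_k(x)), y)
                opt.zero_grad()
                loss.backward()                # backpropagates only through T_k and F_k
                opt.step()
        T_k.requires_grad_(False)
        trained.append(T_k.eval())
    return trained, auxiliaries                # the final network is F_K composed with G_K
```

A parallel (batch-wise) variant would instead update every module at each step on the current output of its predecessor, possibly through buffers, which is what removes the locking constraints discussed above.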
The regularization is particularly well-adapted to ResNets
(He et al. 2016a,b) and similar models such as ResNeXts (Xie
et al. 2017) and Wide ResNets (Zagoruyko and Komodakis
2016), but is easily usable on other models, and ResNets
themselves remain competitive (Wightman, Touvron, and
Jégou 2021). The method leverages the analogy between
ResNets and the Euler scheme for ODEs (Weinan 2017) to
penalize the kinetic energy of the network. Intuitively, if the
kinetic energy is penalized enough, the current module will
barely move the points, thus at least preserving the accuracy
of the previous module and avoiding its collapse.
After a discussion of related work in Section 2, we present
the method in Section 3 and show that it amounts to a trans-
port regularization of each module, which we prove forces the
solution module to be an optimal transport map and makes it
regular. This also suggests a simple principled extension of
the method to non-residual networks. In Section 3, we link the
method with gradient flows and the minimizing movement
scheme in the Wasserstein space, which allows us to invoke con-
vergence results to a minimizer under additional hypotheses
on the loss. Section 4 discusses different practical implemen-
tations and introduces a new variant of layer-wise training we
call multi-lap sequential training. It is a slight variation on
sequential layer-wise training that has the same advantages
and offers a non-negligible improvement in many cases over
sequential training for the same computational and memory
costs. Experiments using different architectures and on differ-
ent classification datasets in Section 5 show that our method
consistently improves the test accuracy of block-wise and
module-wise trained residual networks, particularly in small
data regimes, whether the block-wise training is carried out
sequentially or in parallel.
Related work
Layer-wise training of neural networks has been considered
as a pre-training and initialization method (Bengio et al.
2006; Marquez, Hare, and Niranjan 2018) and was shown
recently to perform competitively with end-to-end training
(Belilovsky, Eickenberg, and Oyallon 2019; Nøkland and
Eidnes 2019). This has led to it being considered in practical
settings with limited resources such as embedded training
(Teng et al. 2020; Tang et al. 2021). For layer-wise train-
ing, many papers consider using a different auxiliary loss,
instead of or in addition to the classification loss: kernel sim-
ilarity (Mandar Kulkarni 2016), information-theory-inspired
losses (Sindy Löwe 2019; Nguyen and Choi 2019; Ma, Lewis, and Kleijn 2020; Wang et al. 2021) and biologically plausible losses (Sindy Löwe 2019; Nøkland and Eidnes 2019; Gupta 2020; Bernd Illing 2020; Yuwen Xiong 2020). The paper
(Belilovsky, Eickenberg, and Oyallon 2019) reports the best
experimental results when solving the layer-wise problems
sequentially. Methods PredSim (Nøkland and Eidnes 2019),
DGL (Belilovsky, Eickenberg, and Oyallon 2020), Sedona
(Pyeon et al. 2021) and InfoPro (Wang et al. 2021) report the
best results when solving the layer-wise problems in paral-
lel, albeit each in a somewhat different setting. (Belilovsky,
Eickenberg, and Oyallon 2019, 2020) do it simply through
architectural considerations mostly regarding the auxiliary
networks. However, (Belilovsky, Eickenberg, and Oyallon
2019) do not consider ResNets and PredSim state that their
method does not perform well on ResNets, specifically be-
cause of the skip connections. DGL only considers a ResNet
architecture by splitting it in the middle and training the two
halves without backpropagating between them. All three fo-
cus on VGG architectures and networks that are not deep.
Sedona applies architecture search to decide on where to split
the network into 2 or 4 modules and what auxiliary classi-
fier to use before module-wise training. Only BoostResNet
(Huang et al. 2018) also proposes a block-wise training idea
geared for ResNets. However, their results only show better
early performance on limited experiments and end-to-end
fine-tuning is required to be competitive. A method called
ResIST (Dun et al. 2021) that is similar to block-wise training
of ResNets randomly assigns residual blocks to one of up to
8 sub-networks that are trained independently and reassem-
bled before another random partition. But only blocks in the
third section of the ResNet are partitioned, the same block
can appear in many sub-networks and the sub-networks are
not necessarily made up of successive blocks. Considered
as a distributed training method, it is only compared with
local SGD (Stich 2019). These methods can all be combined
with our regularization, and we use the auxiliary network
architecture from (Belilovsky, Eickenberg, and Oyallon 2019,
2020). We also show the benefits of our method both with
full layer-wise training and when the network is split into a
few modules.
Besides layer-wise training, methods such as DNI (Jader-
berg et al. 2017; Czarnecki et al. 2017), DDG (Huo et al.
2018) and Features Replay (Huo, Gu, and Huang 2018),
solve the update locking problem and the backward locking
problem with an eye towards parallelization by using delayed
or synthetic predicted gradients, or even predicted inputs to
address forward locking. But they only fully apply this to
quite shallow networks and only split deeper ones into a small
number of sub-modules (fewer than five) that do not backpropa-
gate to each other and observe training issues with more splits
(Huo, Gu, and Huang 2018). This makes them compare unfa-
vorably to layer-wise training (Belilovsky, Eickenberg, and
Oyallon 2020). The high dimension of the predicted gradient, which scales with the size of the network, renders (Jaderberg et al. 2017; Czarnecki et al. 2017) challenging in practice.
Therefore, despite its simplicity, greedy layer-wise training
is more appealing when working in a constrained setting.
Viewing residual networks as dynamical transport systems
(de Bézenac, Ayed, and Gallinari 2019; Karkar et al. 2020)
followed from their view as a discretization of differential
equations (Weinan 2017; Lu et al. 2018). Transport regular-
ization was also used in (Finlay et al. 2020) to accelerate the
training of the NeuralODE model (Chen et al. 2018). Trans-
port regularization of ResNets in particular is motivated by
the observation that they are naturally biased towards mini-
mally modifying their input (Jastrzebski et al. 2018; Hauser
2019; Karkar et al. 2020). We further link this transport view-
point with gradient flows in the Wasserstein space to apply it
in a principled way to module-wise training. Gradient flows
in the Wasserstein space operating on the data space appeared
recently in deep learning. In (Alvarez-Melis and Fusi 2021),
the focus is on functionals of measures whose first variations
are known in closed form and used, through their gradients,
in the algorithm. This limits the scope of their applications
to transfer learning and similar tasks. Likewise, (Gao et al.
2019; Liutkus et al. 2019; Arbel et al. 2019; Ansari, Ang, and
Soh 2021) use the explicit gradient flow of f-divergences
and other distances between measures for generation and
generator refinement. In contrast, we use the minimizing
movement scheme, which does not require computing the first variation and allows us to consider classification.
Regularized block-wise training of ResNets
In this section we state the module-wise problems we solve
and show that the modules they induce are regular as they
approximate optimal transport maps. We show that solving
these problems sequentially means following a minimizing
movement that approximates the Wasserstein gradient flow
that minimizes the loss, which offers hints as to why it works
well in practice outside the theoretical hypotheses.
Method statement
In a ResNet, a data point x_0 is transported by applying x_{m+1} = x_m + g_m(x_m) for M ResBlocks (a ResBlock is a function id + g_m), and x_M is then classified. To keep the greedily-trained modules from overfitting and destroying information needed by deeper modules, we propose to penalize their kinetic energy to force them to preserve the geometry of the problem as much as possible. The total discrete kinetic energy of a ResNet (for a single point x_0) is Σ_m ∥g_m(x_m)∥², since a ResNet can be seen as an Euler scheme for an ODE with velocity field g (Weinan 2017):
\[
x_{m+1} = x_m + g_m(x_m) \;\longleftrightarrow\; \partial_t x_t = g_t(x_t) \tag{2}
\]
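To make the correspondence in (2) concrete, here is a minimal PyTorch sketch of a residual block viewed as one Euler step of the ODE; returning the squared norm of the residual g_m(x_m) alongside the output gives the block's contribution to the discrete kinetic energy. The two-layer form of g is an illustrative assumption, not the exact block used in the paper.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One Euler step x_{m+1} = x_m + g_m(x_m) of the ODE dx_t/dt = g_t(x_t)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        # g_m: a small illustrative residual branch (the velocity field at step m)
        self.g = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        v = self.g(x)                        # g_m(x_m), the displacement of this block
        kinetic = v.pow(2).sum(dim=-1)       # ||g_m(x_m)||^2 per sample
        return x + v, kinetic                # Euler step and its kinetic energy
```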
That ResNets are already biased towards small displacements
and therefore low kinetic energy and that this bias is desirable
and should be encouraged has been observed in many works
(Jastrzebski et al. 2018; Zhang et al. 2019; Hauser 2019; De
and Smith 2020; Karkar et al. 2020). Using the notations
from (1), if each module T_k is made up of M ResBlocks, i.e. has the form (id + g_{M-1}) ∘ ... ∘ (id + g_0), we propose to penalize its kinetic energy over its input points by adding it to the loss L in the target of the greedy problems (1). We denote ψ^x_m the position of an input x after m ResBlocks. Given τ > 0 used to weight the regularization, we solve, for 0 ≤ k ≤ K, the problems
\[
(T^\tau_k, F^\tau_k) \in \operatorname*{arg\,min}_{T,F} \sum_{x \in \tilde\rho_0} \Big( L\big(F, T(G^\tau_{k-1}(x))\big) + \frac{1}{2\tau} \sum_{m=0}^{M-1} \|g_m(\psi^x_m)\|^2 \Big) \tag{3}
\]
\[
\text{s.t.}\quad T = (\mathrm{id} + g_{M-1}) \circ \dots \circ (\mathrm{id} + g_0), \qquad \psi^x_0 = G^\tau_{k-1}(x), \quad \psi^x_{m+1} = \psi^x_m + g_m(\psi^x_m)
\]
where G^τ_k = T^τ_k ∘ ... ∘ T^τ_0 for 0 ≤ k ≤ K and G^τ_{-1} = id. The final network is now F^τ_K ∘ G^τ_K. Intuitively, we can think that this biases the modules towards moving the points as little as possible, thus at least keeping the performance of the previous module. In our experiments, we will mostly focus on block-wise training, i.e. the case M = 1 where each T^τ_k is a single residual block, as it is more challenging.
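To make (3) concrete in the block-wise case M = 1, the sketch below trains a single residual block together with its auxiliary classifier, adding the kinetic term weighted by 1/(2τ) to the classification loss. It assumes blocks expose the (output, kinetic-energy) interface of the ResBlock sketch after (2); the optimizer, epoch count and data loader are illustrative assumptions, not the paper's exact training setup.

```python
import torch
import torch.nn as nn

def train_block_k(block_k, classifier_k, frozen_prev, loader, tau=1.0,
                  epochs=10, lr=1e-3, device="cpu"):
    """Train one residual block T^tau_k = id + g_k with the transport penalty of (3):
    loss = L(F_k(T_k(z)), y) + (1/(2*tau)) * ||g_k(z)||^2, where z = G^tau_{k-1}(x)."""
    block_k, classifier_k = block_k.to(device), classifier_k.to(device)
    opt = torch.optim.SGD(list(block_k.parameters()) + list(classifier_k.parameters()), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():              # push the input through the frozen modules G^tau_{k-1}
                for prev in frozen_prev:
                    x, _ = prev(x)
            out, kinetic = block_k(x)          # T^tau_k step and ||g_k(psi^x_0)||^2 per sample
            loss = nn.functional.cross_entropy(classifier_k(out), y) + kinetic.mean() / (2.0 * tau)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return block_k, classifier_k
```

Larger τ weakens the penalty and lets the block move points more; smaller τ keeps it closer to the identity.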
Regularity result
The Appendix gives the necessary background on optimal
transport (OT) theory to prove a regularity result for our
method. We start by moving to a continuous viewpoint. We
denote ρ_0 = ρ^τ_0 the data distribution of which ρ̃_0 is a sample, and 𝓛 the distribution-wide loss that arises from the point-wise loss L. As expressed in (2), a residual network can be seen as an Euler discretization of a differential equation. Problem (3) is then the discretization of the problem
\[
(T^\tau_k, F^\tau_k) \in \operatorname*{arg\,min}_{T,F} \; \mathcal{L}(F, T_\sharp \rho^\tau_k) + \frac{1}{2\tau} \int_0^1 \|v_t\|^2_{L^2((\phi^\cdot_t)_\sharp \rho^\tau_k)} \, dt \tag{4}
\]
\[
\text{s.t.}\quad T = \phi^\cdot_1, \qquad \partial_t \phi^x_t = v_t(\phi^x_t), \qquad \phi^\cdot_0 = \mathrm{id}
\]
where ρ^τ_{k+1} = (T^τ_k)_♯ ρ^τ_k and g_m is the discretization of the vector field v_t at time t = m/M. Here, the data distributions ρ^τ_k are pushed forward through the maps T^τ_k, which correspond to the flow at t = 1 of the kinetically-regularized velocity field v_t.
Given the equivalence between the Monge OT problem (12) and the OT problem in dynamic form (14) in the Appendix, problem (4) is equivalent to
\[
(T^\tau_k, F^\tau_k) \in \operatorname*{arg\,min}_{T,F} \; \mathcal{L}(F, T_\sharp \rho^\tau_k) + \frac{1}{2\tau} \int \|T(x) - x\|^2 \, d\rho^\tau_k(x) \tag{5}
\]
where points are moved instantly through T instead of infinitesimally through the velocity field v. This equivalent formulation leads to another discretization (10) and implementation of the method that is more easily applicable in practice to non-residual architectures, by simply penalizing the difference between the module's output and its input.
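As an illustration of this non-residual variant, the sketch below penalizes the squared distance between a module's output and its input, i.e. the discretized transport cost of (5). The paper's exact discretization (10) is given elsewhere, so this should be read as an assumed, simplified form; the module and classifier are hypothetical and the module's output is assumed to have the same shape as its input.

```python
import torch
import torch.nn as nn

def transport_regularized_loss(module, classifier, x, y, tau=1.0):
    """Greedy loss for one (not necessarily residual) module T, following (5):
    classification loss plus (1/(2*tau)) * mean ||T(x) - x||^2 over the batch."""
    out = module(x)
    transport = (out - x).pow(2).flatten(1).sum(dim=1).mean()   # mean of ||T(x) - x||^2
    return nn.functional.cross_entropy(classifier(out), y) + transport / (2.0 * tau)
```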
We can show that problem (5) indeed has a solution and that T^τ_k is necessarily an optimal transport between its input and output distributions, which means that it comes with some regularity. We assume that the minimization in F is over a compact set 𝓕, that ρ^τ_k is absolutely continuous, that L is continuous and non-negative, and that the underlying domain is compact.
Theorem 1. Problems (5) and (4) have a minimizer (T^τ_k, F^τ_k) such that T^τ_k is an optimal transport map. And for any minimizer (T^τ_k, F^τ_k), T^τ_k is an optimal transport map.
The proof is in the Appendix. Optimal transport maps have regularity properties under some boundedness assumptions. Given Theorem 2 in the Appendix, taken from (Figalli 2017), T^τ_k is η-Hölder continuous almost everywhere, and if the optimization algorithm we use to solve the discretized problem (3) returns an approximate solution pair (F̃^τ_k, T̃^τ_k) such that T̃^τ_k is an ε-optimal transport map, i.e. ∥T̃^τ_k − T^τ_k∥ ≤ ε, then we have (using the triangle inequality) the following stability property of the neural module T̃^τ_k:
\[
\|\tilde T^\tau_k(x) - \tilde T^\tau_k(y)\| \le 2\epsilon + C\|x - y\|^{\eta} \tag{6}
\]
for almost every x, y ∈ supp(ρ^τ_k) and a constant C > 0.
The experimental advantages of using such networks have been shown in (Karkar et al. 2020). Naively composing these stability bounds on T^τ_k and T̃^τ_k allows us to get stability bounds for the composition networks G^τ_K and G̃^τ_K = T̃^τ_K ∘ ... ∘ T̃^τ_0.
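As a hedged illustration, for two modules that both satisfy (6) with the same ε, C and η, and assuming the bound for T̃^τ_1 holds at the points T̃^τ_0(x) and T̃^τ_0(y), the naive composition gives
\[
\|\tilde T^\tau_1(\tilde T^\tau_0(x)) - \tilde T^\tau_1(\tilde T^\tau_0(y))\|
\;\le\; 2\epsilon + C\,\|\tilde T^\tau_0(x) - \tilde T^\tau_0(y)\|^{\eta}
\;\le\; 2\epsilon + C\,\bigl(2\epsilon + C\,\|x - y\|^{\eta}\bigr)^{\eta}.
\]
Iterating this argument over the K + 1 modules yields a (progressively looser) stability bound for G̃^τ_K.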