(He et al. 2016a,b) and similar models such as ResNeXts (Xie
et al. 2017) and Wide ResNets (Zagoruyko and Komodakis
2016), but is easily usable on other models, and ResNets
themselves remain competitive (Wightman, Touvron, and
Jégou 2021). The method leverages the analogy between
ResNets and the Euler scheme for ODEs (Weinan 2017) to
penalize the kinetic energy of the network. Intuitively, if the
kinetic energy is penalized enough, the current module will
barely move the points, thus at least preserving the accuracy
of the previous module and avoiding its collapse.
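For illustration, the following minimal sketch (our own block definition and a hypothetical regularization weight tau, not the paper's code) reads a residual block x_{k+1} = x_k + f_k(x_k) as an Euler step and adds the squared norm of the residual, a discrete kinetic energy, to the module's auxiliary loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: the block architecture, names and the weight tau
# are our assumptions. A residual block x_{k+1} = x_k + f_k(x_k) is read as an
# explicit Euler step, so the squared norm of the residual f_k(x_k) plays the
# role of a discrete kinetic energy that we add to the module's loss.

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        v = self.f(x)                              # displacement ("velocity") of the block
        return x + v, (v ** 2).sum(dim=1).mean()   # Euler step + kinetic energy of the step

def module_loss(blocks, aux_head, x, y, tau=0.1):
    """Auxiliary classification loss of one module plus its kinetic energy."""
    kinetic = 0.0
    for block in blocks:
        x, k = block(x)
        kinetic = kinetic + k
    return F.cross_entropy(aux_head(x), y) + tau * kinetic
```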
After a discussion of related work in Section 2, we present
the method in Section 3 and show that it amounts to a trans-
port regularization of each module, which we prove forces the
solution module to be an optimal transport map and makes it
regular. This also suggests a simple principled extension of
the method to non-residual networks. In Section 3, we link the
method with gradient flows and the minimizing movement
scheme in the Wasserstein space, which allows us to invoke
results on convergence to a minimizer under additional hypotheses
on the loss. Section 4 discusses different practical implemen-
tations and introduces a new variant of layer-wise training that we
call multi-lap sequential training (sketched below). It is a slight
variation on sequential layer-wise training that retains its advantages
and, in many cases, offers a non-negligible improvement over it for
the same computational and memory
costs. Experiments with different architectures and on differ-
ent classification datasets in Section 5 show that our method
consistently improves the test accuracy of block-wise and
module-wise trained residual networks, particularly in small
data regimes, whether the block-wise training is carried out
sequentially or in parallel.
Related work
Layer-wise training of neural networks has been considered
as a pre-training and initialization method (Bengio et al.
2006; Marquez, Hare, and Niranjan 2018) and was shown
recently to perform competitively with end-to-end training
(Belilovsky, Eickenberg, and Oyallon 2019; Nøkland and
Eidnes 2019). This has led to it being considered in practical
settings with limited resources such as embedded training
(Teng et al. 2020; Tang et al. 2021). For layer-wise train-
ing, many papers consider using a different auxiliary loss,
instead of or in addition to the classification loss: kernel sim-
ilarity (Mandar Kulkarni 2016), information-theory-inspired
losses (Sindy Löwe 2019; Nguyen and Choi 2019; Ma, Lewis,
and Kleijn 2020; Wang et al. 2021) and biologically plau-
sible losses (Sindy Löwe 2019; Nøkland and Eidnes 2019;
Gupta 2020; Bernd Illing 2020; Yuwen Xiong 2020). The paper
(Belilovsky, Eickenberg, and Oyallon 2019) reports the best
experimental results when solving the layer-wise problems
sequentially. The methods PredSim (Nøkland and Eidnes 2019),
DGL (Belilovsky, Eickenberg, and Oyallon 2020), Sedona
(Pyeon et al. 2021) and InfoPro (Wang et al. 2021) report the
best results when solving the layer-wise problems in paral-
lel, albeit each in a somewhat different setting. (Belilovsky,
Eickenberg, and Oyallon 2019, 2020) achieve this simply through
architectural choices, mostly regarding the auxiliary
networks. However, (Belilovsky, Eickenberg, and Oyallon
2019) do not consider ResNets, and the authors of PredSim state that their
method does not perform well on ResNets, specifically be-
cause of the skip connections. DGL only considers a ResNet
architecture by splitting it in the middle and training the two
halves without backpropagating between them. All three fo-
cus on VGG architectures and networks that are not deep.
Sedona applies architecture search to decide where to split
the network into 2 or 4 modules and what auxiliary classi-
fier to use before module-wise training. BoostResNet
(Huang et al. 2018) is the only other method that proposes a
block-wise training scheme geared towards ResNets. However,
its results only show better early performance in limited
experiments, and end-to-end fine-tuning is required to be
competitive. ResIST (Dun et al. 2021), a method similar to
block-wise training of ResNets, randomly assigns residual
blocks to one of up to 8 sub-networks that are trained
independently and reassembled before another random partition.
However, only blocks in the third section of the ResNet are
partitioned, the same block can appear in several sub-networks,
and the sub-networks are not necessarily made up of successive
blocks. Considered
as a distributed training method, it is only compared with
local SGD (Stich 2019). These methods can all be combined
with our regularization, and we use the auxiliary network
architecture from (Belilovsky, Eickenberg, and Oyallon 2019,
2020). We also show the benefits of our method both with
full layer-wise training and when the network is split into a
few modules.
Besides layer-wise training, methods such as DNI (Jader-
berg et al. 2017; Czarnecki et al. 2017), DDG (Huo et al.
2018) and Features Replay (Huo, Gu, and Huang 2018),
solve the update locking problem and the backward locking
problem with an eye towards parallelization by using delayed
or synthetic predicted gradients, or even predicted inputs to
address forward locking. However, they only fully apply this to
quite shallow networks, only split deeper ones into a small
number of sub-modules (fewer than five) that do not backpropagate
to each other, and observe training issues with more splits
(Huo, Gu, and Huang 2018). This makes them compare unfa-
vorably to layer-wise training (Belilovsky, Eickenberg, and
Oyallon 2020). The high dimension of the predicted gradient,
which scales with the size of the network, renders (Jaderberg
et al. 2017; Czarnecki et al. 2017) challenging in practice.
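For concreteness, the following is a minimal sketch of the synthetic-gradient idea behind DNI (our simplified illustration, not the authors' implementation); it also makes apparent that the predictor outputs a tensor of the same, potentially large, dimension as the activations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified two-module illustration of synthetic gradients: a small predictor
# estimates dLoss/dh at the cut point so the lower module can update without
# waiting for the true backward pass. Shapes and optimizers are placeholders.

lower = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
upper = nn.Linear(256, 10)
grad_predictor = nn.Linear(256, 256)            # outputs have the same dimension as h

opt_lower = torch.optim.SGD(lower.parameters(), lr=0.1)
opt_rest = torch.optim.SGD(list(upper.parameters()) + list(grad_predictor.parameters()), lr=0.1)

def step(x, y):
    h = lower(x)
    # 1) update the lower module immediately with the predicted gradient
    opt_lower.zero_grad()
    h.backward(grad_predictor(h.detach()).detach())
    opt_lower.step()
    # 2) update the upper module and fit the predictor to the true gradient
    h_in = h.detach().requires_grad_(True)
    task_loss = F.cross_entropy(upper(h_in), y)
    opt_rest.zero_grad()
    task_loss.backward()
    pred_loss = F.mse_loss(grad_predictor(h.detach()), h_in.grad.detach())
    pred_loss.backward()
    opt_rest.step()
    return task_loss.item()
```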
Therefore, despite its simplicity, greedy layer-wise training
is more appealing when working in a constrained setting.
Viewing residual networks as dynamical transport systems
(de Bézenac, Ayed, and Gallinari 2019; Karkar et al. 2020)
followed from their view as a discretization of differential
equations (Weinan 2017; Lu et al. 2018). Transport regular-
ization was also used in (Finlay et al. 2020) to accelerate the
training of the NeuralODE model (Chen et al. 2018). Trans-
port regularization of ResNets in particular is motivated by
the observation that they are naturally biased towards mini-
mally modifying their input (Jastrzebski et al. 2018; Hauser
2019; Karkar et al. 2020). We further link this transport view-
point with gradient flows in the Wasserstein space to apply it
in a principled way to module-wise training. Gradient flows
in the Wasserstein space operating on the data space appeared
recently in deep learning. In (Alvarez-Melis and Fusi 2021),
the focus is on functionals of measures whose first variations
are known in closed form and used, through their gradients,