Potential outcomes. We identify four potential outcomes for our study, illustrated in Figure 1:
(A) DFA always outperforms BP. Thanks to a better offset or to improvements in scaling, the increased compute-efficiency of DFA makes it always preferable to BP.
(B) DFA outperforms BP at scale. Thanks to improvements in scaling (e.g., increased efficiency compared to BP at scale), DFA eventually makes the training of large models more efficient.
(C) BP outperforms DFA at scale. DFA may exhibit poor scaling behavior [11] and may not scale to larger models, leading to BP eventually outperforming DFA.
(D) BP always outperforms DFA. The degradation in performance observed with DFA may never be worth the potential gains in compute-efficiency.
Both (A) and (B) may be viewed as validating our hypothesis, as both potentially motivate the use of DFA over BP. (C) and (D) would however be negative outcomes, either restricting the efficient applicability of DFA to small models or indicating that DFA is never competitive with BP.
3 Methods
Direct Feedback Alignment. Direct Feedback Alignment (DFA) [25] is an extension of Feedback Alignment (FA) [5] which uses a random projection of the global error to directly train each layer. We introduce at layer $i$: $W_i$ the weights, $a_i$ the pre-activations, $f$ the non-linearity and $f'$ its derivative, $h_i$ the activations, $\delta x$ the derivative of the loss with respect to $x$, and $\odot$ the Hadamard product. DFA replaces the backpropagated signal from the $(i+1)$-th layer, $W_{i+1}^T \delta a_{i+1}$, with a random projection of the global error, $Be$. For most common losses, this error $e$ is simply the difference between targets and predictions. Accordingly, the update at layer $i$ is now $\delta W_i = -\left[(Be) \odot f'(a_i)\right] h_{i-1}^T$. The fixed random Gaussian matrix $B$ can be shared across all layers [37], significantly reducing memory and compute costs, as a single $Be$ is then calculated and used for all layers. With DFA, the update no longer depends on the backward pass of other layers; thus, once the forward pass is complete, all of the layers can be updated concurrently, achieving so-called backward-unlocking [7].
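To make the update rule concrete, the following NumPy sketch applies one DFA step to a toy two-layer MLP. The layer sizes, learning rate, tanh non-linearity, and function names are illustrative assumptions, not the configuration used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 32, 64, 10
lr = 1e-2

# Forward weights, and a fixed random Gaussian feedback matrix (never trained).
W1 = rng.normal(0.0, 0.1, (d_hidden, d_in))
W2 = rng.normal(0.0, 0.1, (d_out, d_hidden))
B1 = rng.normal(0.0, 0.1, (d_hidden, d_out))  # projects the global error to layer 1

f = np.tanh
f_prime = lambda a: 1.0 - np.tanh(a) ** 2

def dfa_step(x, y):
    """One DFA update on a two-layer MLP with a linear output layer."""
    global W1, W2
    # Forward pass.
    a1 = W1 @ x    # pre-activations, layer 1
    h1 = f(a1)     # activations, layer 1
    a2 = W2 @ h1   # predictions (linear output)
    e = a2 - y     # global error: predictions minus targets

    # DFA updates, delta_W_i = -[(Be) * f'(a_i)] h_{i-1}^T: each layer needs
    # only the forward pass and e, so all layers can be updated concurrently.
    dW1 = -np.outer((B1 @ e) * f_prime(a1), x)
    dW2 = -np.outer(e, h1)  # the output layer receives the true error directly

    W1 += lr * dW1
    W2 += lr * dW2
```

With equal layer widths, a single $B$ (and hence a single $Be$) can be shared across all layers as in [37]; the differing widths of this toy example force a per-layer $B_1$.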
Learning with DFA is made possible through a process called alignment. During training, the forward weights eventually align with the fixed backward weights, enabling updates which approximate backpropagation [28]. This is best illustrated in the simpler case of FA [5]. For FA, the learning signal still comes from the $(i+1)$-th layer: $B_{i+1} \delta a_{i+1}$. For this to approximate BP, we only need $W_{i+1}^T \sim B_{i+1}$. Although we do not report on it in this work, this is a valuable diagnostic tool when experimenting: at any step, it is possible to measure the angle (cosine similarity) between the gradients predicted by backpropagation and the ones approximated by DFA. Higher alignment values are usually correlated with networks which achieve better end-task performance [12, 37].
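This diagnostic amounts to a cosine similarity between flattened updates. A minimal sketch, assuming the DFA update and the backpropagated gradient of a layer are both available as arrays (the helper name is ours):

```python
import numpy as np

def gradient_alignment(dw_dfa: np.ndarray, dw_bp: np.ndarray) -> float:
    """Cosine similarity between a DFA update and the BP gradient of one layer.

    Values close to 1 indicate that the forward weights have aligned with the
    fixed feedback matrix, so the DFA update approximates backpropagation.
    """
    u, v = dw_dfa.ravel(), dw_bp.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```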
Scaling laws for compute-efficiency. We are interested in scaling according to the compute budget, $L(C)$. We split $C = C_F + C_B + C_U$, for the forward pass, the backpropagation of the error, and the weight updates respectively. For causal decoder-only models, each phase costs roughly $2ND$ (in FLOP), with $N$ the number of model parameters and $D$ the dataset size in tokens [19]; the factor 2 comes from the multiply-accumulate. Hence, $C_{\mathrm{BP}} = 6ND$. When using DFA, the backpropagation of the error is not necessary; instead, a single random projection $Be$ is shared across all layers. Accordingly, $C_B^{\mathrm{DFA}} = 2 d_{\mathrm{model}}^2 D$, as $B$ is of shape $(d_{\mathrm{model}}, d_{\mathrm{model}})$. Because $N \simeq 12 n_{\mathrm{layer}} d_{\mathrm{model}}^2$, we have $C_B^{\mathrm{DFA}} \ll C_B^{\mathrm{BP}}$. We therefore neglect it and consider $C_{\mathrm{DFA}} = 4ND$, a $\sim$33% reduction in FLOP.
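As a worked example of these budgets (the model size, token count, and width are illustrative, not our experimental setup):

```python
# Hypothetical model: N = 1.3B parameters, D = 26B tokens, d_model = 2048,
# consistent with N ~ 12 * n_layer * d_model^2 for n_layer = 26.
N, D, d_model = 1.3e9, 26e9, 2048

C_F = 2 * N * D                # forward pass
C_B_BP = 2 * N * D             # backpropagation of the error
C_B_DFA = 2 * d_model**2 * D   # single random projection Be, shared by all layers
C_U = 2 * N * D                # weight updates

C_BP = C_F + C_B_BP + C_U      # = 6ND, ~2.0e20 FLOP
C_DFA = C_F + C_B_DFA + C_U    # ~= 4ND, ~1.4e20 FLOP, since C_B_DFA << C_B_BP
print(f"FLOP saved: {1 - C_DFA / C_BP:.1%}")  # ~33%
```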
Finally, note that this only takes into account improvements in the FLOP compute budget. However, practitioners are usually constrained by the actual compute budget in dollars, which is best represented by the number of GPU-hours required. $C_{\mathrm{FLOP}}$ and $C_{\text{GPU-hours}}$ can be linked through the throughput $T$ achieved per GPU, in TFLOPS. Alternative training methods like DFA may improve this throughput by enabling increased parallelization and reducing communication bottlenecks. Nevertheless, state-of-the-art methods already achieve hardware utilization of $\sim$50% at scale [38]: at best, a 2x improvement can be expected. We thus also introduce $\tilde{C}_{\mathrm{DFA}} = 2ND$, a (very) optimistic estimate which supposes DFA would enable a doubling in effective throughput. In practice, current implementations of DFA are not optimized, and it is unrealistic for DFA to lift all bottlenecks currently encountered in distributed training; we use this estimate as an absolute lower bound on what is possible.
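For illustration, the conversion from $C_{\mathrm{FLOP}}$ to GPU-hours through $T$; the peak throughput and utilization below are assumptions (an A100-class bf16 peak at the $\sim$50% utilization cited above):

```python
# Linking C_FLOP to C_GPU-hours through the per-GPU throughput T.
PEAK_TFLOPS = 312                     # assumed A100-class bf16 peak
T = PEAK_TFLOPS * 1e12 * 0.5          # ~50% hardware utilization at scale [38]

C_FLOP = 2.0e20                       # C_BP from the example above
gpu_hours = C_FLOP / T / 3600
print(f"{gpu_hours:,.0f} GPU-hours")  # ~356 GPU-hours
```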