Scaling Laws Beyond Backpropagation
Matthew J. Filipovich1,2Alessandro Cappelli1Daniel Hesslow1Julien Launay1,3
1LightOn 2Queen’s University 3LPENS, École Normale Supérieure
{firstname}@lighton.ai
Abstract
Alternatives to backpropagation have long been studied to better understand how
biological brains may learn. Recently, they have also garnered interest as a way
to train neural networks more efficiently. By relaxing constraints inherent to
backpropagation (e.g., symmetric feedforward and feedback weights, sequential
updates), these methods enable promising prospects, such as local learning. How-
ever, the tradeoffs between different methods in terms of final task performance,
convergence speed, and ultimately compute and data requirements are rarely out-
lined. In this work, we use scaling laws to study the ability of Direct Feedback
Alignment (DFA) to train causal decoder-only Transformers efficiently. Scaling
laws provide an overview of the tradeoffs implied by a modeling decision, up to
extrapolating how it might transfer to increasingly large models. We find that DFA
fails to offer more efficient scaling than backpropagation: there is never a regime
for which the degradation in loss incurred by using DFA is worth the potential
reduction in compute budget. Our finding comes at variance with previous beliefs
in the alternative training methods community, and highlights the need for holistic
empirical approaches to better understand modeling decisions.
1 Introduction
Backpropagation (BP) [1, 2] is just one of many ways to solve the credit assignment problem, and hence to train neural networks. BP estimates the individual contribution of each parameter to the error, but approximate methods can be employed: either through fundamentally different approaches (e.g., Hebbian learning) [3, 4], or by relaxing constraints of BP [5-7]. Beyond-backpropagation methods have a history of being studied to understand how biological brains may learn [8, 9]. Indeed, key features of backpropagation are not possible to implement under biological constraints. For instance, using the transpose of the weights W^T in the feedback path is not possible, as the feedforward and feedback pathways are physically distinct; this is known as the weight transport problem [10].
Once relegated to toy problems [11], alternatives to BP have now been demonstrated to achieve competitive performance on challenging tasks across a variety of architectures [12]. This could result in more efficient training: for instance, local learning may enable easier parallelization of computations at scale [7, 13]. Alternative methods may even be co-designed with hardware [14, 15]: either for novel systems such as photonic co-processors [16] and memristors [17], or to circumvent distributed communication bottlenecks in the large clusters used to train state-of-the-art models [18].
However, the tradeoffs between BP and alternatives are not always clear. If a method enables 25%
faster training, is it worth a 5% decrease in end-task performance? Or would a model trained with
BP using a 25% smaller compute budget still be better? As most works usually only offer a few
cherry-picked datapoints, this is a difficult question to answer. Instead, deriving scaling laws [19]
may provide a more complete picture: by obtaining the full power-law relationship between compute
spent and task performance, one may easily identify regimes in which the alternative method is
competitive with BP, and even extrapolate results to larger scales.
I Can’t Believe It’s Not Better Workshop, NeurIPS 2022
arXiv:2210.14593v1 [cs.LG] 26 Oct 2022
Contributions.
We use scaling laws to study an alternative to BP, with the following contributions:
• Drawing inspiration from work using scaling laws to evaluate modeling decisions [20-22], we are the first to use scaling laws to study an alternative to backpropagation.
• At variance with previous beliefs [12], we find that the gains in compute efficiency from using the alternative method studied are never worth the degradation in performance. This holds even if we consider the use of exotic hardware, such as optical co-processors [23], which would offload some computations and effectively make the alternative method "free".
2 Framing
Can alternative training methods accelerate neural network training?
Surveying the current
state-of-the-art, one may find numerous claims of alternative training methods achieving competitive
performance with BP across a variety of settings and tasks (e.g., [12, 18, 24]).
We seek to study this claim, with three restrictions in scope:
1. We focus on Direct Feedback Alignment [25], due to its simplicity and wide applicability [12], as well as its broad hardware prospects [14, 26, 27] and theoretical background [28].
2. We study compute-efficiency specifically (i.e., the best performance achievable for a given compute budget), as this is usually a significant bottleneck for scaling up models.
3. We conduct our study on "GPT-like" [29] causal decoder-only Transformers trained on English data. These models are known to possess smooth scaling laws [19, 30]. Because of their unique abilities [31], they also command some of the largest training budgets in machine learning [32], making them a prime target for more compute-efficient training.
These restrictions lead us to test the following hypothesis:
Hypothesis.
Direct Feedback Alignment can train causal decoder-only models more efficiently
than backpropagation, achieving better performance for a given compute budget.
Scaling laws as a holistic empirical tool. Scaling laws have been proposed as an empirical approach to connect hyperparameters of neural networks (e.g., parameter count, training dataset size) to their performance. They have been derived both on specific downstream tasks [33, 34] and on upstream modeling loss [19]. Scaling laws can characterize the influence of data & modeling decisions [21, 22], or even unveil new, more optimal training practices [35, 36].

As illustrated in Figure 1, it is possible to derive a so-called compute-optimal frontier for a class of models: this defines L(C), the best performance L achievable for a compute budget C. We fit a power law L(C) = (C_c / C)^{α_C} over the Pareto front of multiple runs, as proposed in [19]. C_c is a constant offsetting the frontier, while α_C controls the slope. Improvements in α_C are rare [20, 22], but valuable as they would point to modeling decisions leading to increased gains at scale.
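As an illustration, fitting L(C) = (C_c / C)^{α_C} reduces to a linear fit in log-log space. The sketch below uses made-up (compute, loss) frontier points, not measurements from this paper:

```python
import numpy as np

# Hypothetical compute-optimal frontier points: (FLOP budget, best loss achieved).
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19, 3e19])
loss = np.array([4.10, 3.72, 3.35, 3.06, 2.78, 2.55])

# L(C) = (C_c / C)^alpha_C  is linear in log-log space:
#   log L = alpha_C * log C_c - alpha_C * log C
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha_C = -slope                   # slope of the frontier (positive)
C_c = np.exp(intercept / alpha_C)  # constant offsetting the frontier

# Extrapolate the frontier to a larger compute budget.
predicted = (C_c / 1e20) ** alpha_C
```

This is how regimes can be compared: fitting one such law per training method and finding the budget (if any) at which the two frontiers cross.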
Figure 1: Scaling laws provide optimal compute frontiers. Left: compute-optimal frontier. Right: scenarios for DFA (red) & BP (green) scaling laws; shading indicates the best method at a given compute budget.
Potential outcomes. We identify four potential outcomes for our study, illustrated in Figure 1:

(A) DFA always outperforms BP. Thanks to a better offset or improvements in scaling, the increased compute-efficiency from DFA makes it always favorable to BP.
(B) DFA outperforms BP at scale. Thanks to improvements in scaling (e.g., increased efficiency compared to BP at scale), DFA eventually makes the training of large models more efficient.
(C) BP outperforms DFA at scale. DFA may exhibit poor scaling behavior [11] and may not scale to larger models, leading to BP eventually outperforming DFA.
(D) BP always outperforms DFA. The degradation in performance observed with DFA may never be worth the potential gains in compute-efficiency.

Both (A) and (B) may be viewed as validating our hypothesis, as they both potentially motivate the use of DFA over BP. (C) and (D) would however be negative outcomes, either restricting the efficient applicability of DFA to small models or indicating that DFA is never competitive with BP.
3 Methods
Direct Feedback Alignment. Direct Feedback Alignment (DFA) [25] is an extension of Feedback Alignment (FA) [5] which uses a random projection of the global error to directly train each layer. We introduce at layer i: W_i the weights, a_i the pre-activations, f the non-linearity and its derivative f', h_i the activations, δx the derivative of the loss with respect to x, and ⊙ the Hadamard product. DFA replaces the backpropagated signal from the (i+1)-th layer, W_{i+1}^T δa_{i+1}, by a random projection of the global error, Be. For most common losses, this error e is simply the difference between targets and predictions. Accordingly, the update at layer i is now δW_i = [(Be) ⊙ f'(a_i)] h_{i-1}^T. B, the fixed random Gaussian matrix, can be shared across all layers [37], reducing memory and compute costs significantly, as a single Be is now calculated and used for all layers. With DFA, the update no longer depends on the backward pass through other layers; thus, once the forward pass is complete, all of the layers can be updated concurrently, achieving so-called backward-unlocking [7].
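The layer-local update above can be sketched as follows. This is a minimal NumPy sketch of a two-layer network; the layer sizes, tanh non-linearity, MSE loss, and learning rate are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-layer MLP: input -> hidden -> output.
d_in, d_hidden, d_out = 8, 16, 4
W1 = rng.standard_normal((d_hidden, d_in)) * 0.1
W2 = rng.standard_normal((d_out, d_hidden)) * 0.1

# Fixed random Gaussian feedback matrix projecting the global error to layer 1.
B1 = rng.standard_normal((d_hidden, d_out))

def tanh_prime(a):
    return 1.0 - np.tanh(a) ** 2

x = rng.standard_normal((d_in, 1))   # h_0: the input
y = rng.standard_normal((d_out, 1))  # target

# Forward pass.
a1 = W1 @ x          # pre-activations of layer 1
h1 = np.tanh(a1)     # activations of layer 1
a2 = W2 @ h1         # linear output layer
e = a2 - y           # global error (targets vs. predictions, MSE)

# DFA updates: each layer uses only the global error e, never the
# backward signal from the layer above, so all updates are independent.
lr = 0.01
dW2 = e @ h1.T                            # output layer sees e directly
dW1 = ((B1 @ e) * tanh_prime(a1)) @ x.T   # delta_W1 = [(B e) * f'(a1)] h_0^T
W2 -= lr * dW2
W1 -= lr * dW1
```

Note that dW1 depends only on the stored forward quantities (a1, x) and the global error, which is what makes concurrent, backward-unlocked updates possible.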
Learning with DFA is made possible through a process called alignment. During training, the forward weights eventually align with the fixed backward weights, enabling updates which approximate backpropagation [28]. This is best illustrated in the simpler case of FA [5]. For FA, the learning signal still comes from the (i+1)-th layer: B_{i+1} δa_{i+1}. For this to approximate BP, we only need W_{i+1}^T ≈ B_{i+1}. Although we do not report on it in this work, this is a valuable diagnostic tool when experimenting: at any step, it is possible to measure the angle (cosine similarity) between the gradients prescribed by backpropagation and the ones approximated by DFA. Higher alignment values are usually correlated with networks which achieve better end-task performance [12, 37].
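The alignment diagnostic amounts to a cosine similarity between the flattened BP gradient and the corresponding DFA update for a layer. A minimal sketch (the gradient arrays stand in for whatever a training loop would produce):

```python
import numpy as np

def alignment(grad_bp, grad_dfa):
    """Cosine similarity between a BP gradient and its DFA approximation."""
    g1 = grad_bp.ravel()
    g2 = grad_dfa.ravel()
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))

# Identical updates align perfectly; a sign flip gives perfect anti-alignment.
g = np.array([[1.0, -2.0], [0.5, 3.0]])
same = alignment(g, g)       # close to 1.0
flipped = alignment(g, -g)   # close to -1.0
```

During training, values consistently above zero indicate the forward weights are aligning with the fixed feedback matrix.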
Scaling laws for compute-efficiency. We are interested in scaling according to compute budget, L(C). We split C = C_F + C_B + C_U, for computing the forward pass, backpropagation of the error, and weight updates respectively. For causal decoder-only models, each phase costs roughly 2ND (in FLOP), with N the number of model parameters and D the dataset size in tokens [19], the factor 2 coming from the multiply-accumulate. Hence, C_BP = 6ND. When using DFA, the backpropagation of the error is not necessary; instead, a single random projection Be is shared across all layers. Accordingly, C_B^DFA = 2 d_model^2 D, as B is of shape (d_model, d_model). Because N ≈ 12 n_layer d_model^2, C_B^DFA ≪ C_B^BP. We will neglect it and consider C_DFA = 4ND, a 33% reduction.
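The budget arithmetic above can be checked directly. The parameter count, token count, and shapes below are arbitrary examples, chosen only to satisfy N ≈ 12 n_layer d_model^2:

```python
# FLOP budgets for causal decoder-only models, using the 2ND-per-phase estimate.
N = 1.3e9                    # example parameter count
D = 26e9                     # example dataset size, in tokens
n_layer, d_model = 24, 2048  # example shapes: 12 * 24 * 2048**2 ~ 1.2e9 ~ N

C_BP = 6 * N * D              # forward + backward + update, each ~2ND
C_B_DFA = 2 * d_model**2 * D  # single shared random projection B e
C_DFA = 4 * N * D             # forward + update, neglecting C_B_DFA

reduction = 1 - C_DFA / C_BP  # fraction of FLOP saved: 1/3
```

The check confirms that C_B_DFA is a negligible fraction of the total budget, justifying dropping it from C_DFA.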
Finally, note that this only takes into account improvements in the FLOP compute budget. However, practitioners are usually constrained by the actual compute budget in dollars, which is best represented by the number of GPU-hours required. C_FLOP and C_GPU-hours can be linked through the throughput T achieved per GPU, in TFLOPS. Alternative training methods like DFA may improve this throughput, by enabling increased parallelization and reducing communication bottlenecks. Nevertheless, state-of-the-art methods already achieve hardware utilization of ~50% at scale [38]: at best, a 2x improvement can be expected. We thus also introduce C̃_DFA = 2ND, a (very) optimistic estimate which supposes DFA would enable a doubling in effective throughput. In practice, current implementations of DFA are not optimized, and it is unrealistic for DFA to be able to lift all bottlenecks currently encountered in distributed training; we use this estimate as an absolute lower bound of what is possible.
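The FLOP-to-GPU-hours conversion described above can be sketched as follows; the throughput figure and model/data sizes are illustrative assumptions:

```python
# Convert a FLOP budget into GPU-hours at a given sustained throughput.
def gpu_hours(c_flop, tflops):
    """GPU-hours needed to spend c_flop FLOP at a sustained rate in TFLOPS."""
    return c_flop / (tflops * 1e12) / 3600

N, D = 1.3e9, 26e9  # example parameter and token counts
T = 150             # example sustained throughput per GPU, in TFLOPS

hours_bp = gpu_hours(6 * N * D, T)           # C_BP = 6ND
hours_dfa = gpu_hours(4 * N * D, T)          # C_DFA = 4ND, same throughput
hours_dfa_opt = gpu_hours(4 * N * D, 2 * T)  # doubled throughput: C~_DFA = 2ND
```

The optimistic scenario halves the wall-clock cost on top of the FLOP savings, which is why it serves as an absolute lower bound in the comparison.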