Potential outcomes. We identify four potential outcomes for our study, illustrated in Figure 1:
(A) DFA always outperforms BP. Thanks to a better offset or to improvements in scaling, the increased compute-efficiency of DFA makes it always preferable to BP.
(B) DFA outperforms BP at scale. Thanks to improvements in scaling (e.g., increased efficiency compared to BP at scale), DFA eventually makes the training of large models more efficient.
(C) BP outperforms DFA at scale. DFA may exhibit poor scaling behavior [11] and may not scale to larger models, leading to BP eventually outperforming DFA.
(D) BP always outperforms DFA. The degradation in performance observed with DFA may never be worth the potential gains in compute-efficiency.
Both (A) and (B) may be viewed as validating our hypothesis, as both potentially motivate the use of DFA over BP. (C) and (D) would however be negative outcomes, either restricting the efficient applicability of DFA to small models or indicating that DFA is never competitive with BP.
3 Methods
Direct Feedback Alignment. Direct Feedback Alignment (DFA) [25] is an extension of Feedback Alignment (FA) [5] which uses a random projection of the global error to directly train each layer. We introduce at layer $i$: $W_i$ the weights, $a_i$ the pre-activations, $f$ the non-linearity and $f'$ its derivative, $h_i$ the activations, $\delta x$ the derivative of the loss with respect to $x$, and $\odot$ the Hadamard product. DFA replaces the backpropagated signal from the $(i+1)$-th layer, $W_{i+1}^T \delta a_{i+1}$, with a random projection of the global error, $Be$. For most common losses, this error $e$ is simply the difference between targets and predictions. Accordingly, the update at layer $i$ is now $\delta W_i = -\left[(Be) \odot f'(a_i)\right] h_{i-1}^T$. The fixed random Gaussian matrix $B$ can be shared across all layers [37], significantly reducing memory and compute costs, as a single $Be$ is then calculated and used for all layers. With DFA, the update no longer depends on the backward pass of other layers; thus, once the forward pass is complete, all of the layers can be updated concurrently, achieving so-called backward-unlocking [7].
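To make the update rule concrete, the following NumPy sketch applies one DFA step to a toy two-layer MLP. The layer sizes, learning rate, tanh non-linearity, and function names are illustrative assumptions, not the configuration used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 32, 64, 10
lr = 1e-2

# Forward weights, and a fixed random Gaussian feedback matrix (never trained).
W1 = rng.normal(0.0, 0.1, (d_hidden, d_in))
W2 = rng.normal(0.0, 0.1, (d_out, d_hidden))
B1 = rng.normal(0.0, 0.1, (d_hidden, d_out))  # projects the global error to layer 1

f = np.tanh
f_prime = lambda a: 1.0 - np.tanh(a) ** 2

def dfa_step(x, y):
    """One DFA update on a two-layer MLP with a linear output layer."""
    global W1, W2
    # Forward pass.
    a1 = W1 @ x    # pre-activations, layer 1
    h1 = f(a1)     # activations, layer 1
    a2 = W2 @ h1   # predictions (linear output)
    e = a2 - y     # global error: predictions minus targets

    # DFA updates, delta_W_i = -[(Be) * f'(a_i)] h_{i-1}^T: each layer needs
    # only the forward pass and e, so all layers can be updated concurrently.
    dW1 = -np.outer((B1 @ e) * f_prime(a1), x)
    dW2 = -np.outer(e, h1)  # the output layer receives the true error directly

    W1 += lr * dW1
    W2 += lr * dW2
```

With equal layer widths, a single $B$ (and hence a single $Be$) can be shared across all layers as in [37]; the differing widths of this toy example force a per-layer $B_1$.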
Learning with DFA is made possible through a process called alignment. During training, the forward weights eventually align with the fixed backward weights, enabling updates which approximate backpropagation [28]. This is best illustrated in the simpler case of FA [5]. For FA, the learning signal still comes from the $(i+1)$-th layer: $B_{i+1} \delta a_{i+1}$. For this to approximate BP, we only need $W_{i+1}^T \sim B_{i+1}$. Although we do not report on it in this work, this is a valuable diagnostic tool when experimenting: at any step, it is possible to measure the angle (cosine similarity) between the gradients predicted by backpropagation and the ones approximated by DFA. Higher alignment values are usually correlated with networks which achieve better end-task performance [12, 37].
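This diagnostic amounts to a cosine similarity between flattened updates. A minimal sketch, assuming the DFA update and the backpropagated gradient of a layer are both available as arrays (the helper name is ours):

```python
import numpy as np

def gradient_alignment(dw_dfa: np.ndarray, dw_bp: np.ndarray) -> float:
    """Cosine similarity between a DFA update and the BP gradient of one layer.

    Values close to 1 indicate that the forward weights have aligned with the
    fixed feedback matrix, so the DFA update approximates backpropagation.
    """
    u, v = dw_dfa.ravel(), dw_bp.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```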
Scaling laws for compute-efficiency. We are interested in scaling according to the compute budget, $L(C)$. We split $C = C_F + C_B + C_U$, for the forward pass, the backpropagation of the error, and the weight updates respectively. For causal decoder-only models, each phase costs roughly $2ND$ (in FLOP), with $N$ the number of model parameters and $D$ the dataset size in tokens [19]; the factor 2 comes from the multiply-accumulate. Hence, $C_{\mathrm{BP}} = 6ND$. When using DFA, the backpropagation of the error is not necessary; instead, a single random projection $Be$ is shared across all layers. Accordingly, $C_B^{\mathrm{DFA}} = 2 d_{\mathrm{model}}^2 D$, as $B$ is of shape $(d_{\mathrm{model}}, d_{\mathrm{model}})$. Because $N \simeq 12 n_{\mathrm{layer}} d_{\mathrm{model}}^2$, we have $C_B^{\mathrm{DFA}} \ll C_B^{\mathrm{BP}}$. We therefore neglect it and consider $C_{\mathrm{DFA}} = 4ND$, a $\sim$33% reduction in FLOP.
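As a worked example of these budgets (the model size, token count, and width are illustrative, not our experimental setup):

```python
# Hypothetical model: N = 1.3B parameters, D = 26B tokens, d_model = 2048,
# consistent with N ~ 12 * n_layer * d_model^2 for n_layer = 26.
N, D, d_model = 1.3e9, 26e9, 2048

C_F = 2 * N * D                # forward pass
C_B_BP = 2 * N * D             # backpropagation of the error
C_B_DFA = 2 * d_model**2 * D   # single random projection Be, shared by all layers
C_U = 2 * N * D                # weight updates

C_BP = C_F + C_B_BP + C_U      # = 6ND, ~2.0e20 FLOP
C_DFA = C_F + C_B_DFA + C_U    # ~= 4ND, ~1.4e20 FLOP, since C_B_DFA << C_B_BP
print(f"FLOP saved: {1 - C_DFA / C_BP:.1%}")  # ~33%
```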
Finally, note that this only takes into account improvements in the FLOP compute budget. However, practitioners are usually constrained by the actual compute budget in dollars, which is best represented by the number of GPU-hours required. $C_{\mathrm{FLOP}}$ and $C_{\text{GPU-hours}}$ can be linked through the throughput $T$ achieved per GPU, in TFLOPS. Alternative training methods like DFA may improve this throughput by enabling increased parallelization and reducing communication bottlenecks. Nevertheless, state-of-the-art methods already achieve hardware utilization of $\sim$50% at scale [38]: at best, a 2x improvement can be expected. We thus also introduce $\tilde{C}_{\mathrm{DFA}} = 2ND$, a (very) optimistic estimate which supposes DFA would enable a doubling in effective throughput. In practice, current implementations of DFA are not optimized, and it is unrealistic for DFA to lift all bottlenecks currently encountered in distributed training; we use this estimate as an absolute lower bound on what is possible.
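For illustration, the conversion from $C_{\mathrm{FLOP}}$ to GPU-hours through $T$; the peak throughput and utilization below are assumptions (an A100-class bf16 peak at the $\sim$50% utilization cited above):

```python
# Linking C_FLOP to C_GPU-hours through the per-GPU throughput T.
PEAK_TFLOPS = 312                     # assumed A100-class bf16 peak
T = PEAK_TFLOPS * 1e12 * 0.5          # ~50% hardware utilization at scale [38]

C_FLOP = 2.0e20                       # C_BP from the example above
gpu_hours = C_FLOP / T / 3600
print(f"{gpu_hours:,.0f} GPU-hours")  # ~356 GPU-hours
```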