• We find that the inconsistency between training and inference leads to the failure of BN in NLP, supported by our extensive experiments, including image classification, neural machine translation, language modeling, sequence labeling, and text classification tasks.
• We define Training Inference Discrepancy (TID) to quantitatively measure this inconsistency and show that TID can serve as an indicator of BN's performance. In particular, BN reaches much better test performance than LN when TID remains small throughout training, e.g., in image recognition and language modeling tasks.
• We propose Regularized BN (RBN), which adds a regularization term to BN that penalizes and reduces TID when the TID of BN is large (see the illustrative sketch after this list). We reveal the optimization advantages of RBN over LN by exploring the layer-wise training dynamics of Transformers.
• We empirically show that RBN can exceed or match the performance of LN, sometimes by a large margin, on 17 out of 20 settings involving ten datasets and two common variants of Transformer. Moreover, RBN introduces no extra computation at inference compared with LN.
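To make the two notions above concrete, the following PyTorch-style sketch computes a scale-invariant discrepancy between a BN layer's mini-batch and population statistics and adds it to the training loss as a penalty. The function name, the weighting, and the exact form of the discrepancy are our illustrative assumptions; the precise definitions of TID and the RBN regularizer appear in later sections.

    import torch

    def batch_population_discrepancy(bn: torch.nn.BatchNorm1d, x: torch.Tensor) -> torch.Tensor:
        # Assumed scale-invariant gap between the mini-batch statistics of x and the
        # running (population) statistics of one BN layer; not the paper's exact TID.
        mu_b = x.mean(dim=0)                           # mini-batch mean per feature
        var_b = x.var(dim=0, unbiased=False)           # mini-batch variance per feature
        mu_p, var_p = bn.running_mean, bn.running_var  # population estimates used at inference
        mean_gap = ((mu_b - mu_p) ** 2 / (var_p + bn.eps)).mean()
        std_gap = ((var_b.sqrt() - var_p.sqrt()) ** 2 / (var_p + bn.eps)).mean()
        return mean_gap + std_gap

    # A hypothetical RBN-style objective then penalizes this gap during training,
    #   loss = task_loss + reg_weight * batch_population_discrepancy(bn, features)
    # pulling mini-batch statistics toward the running estimates used at inference.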
2 Related Work
Analyses of BN’s Success
As BN has become an indispensable component of deep neural networks deployed in CV tasks, a number of works have explored the theoretical reasons behind its success. From the view of optimization, the original BN paper [17] argues that BN can reduce internal covariate shift and thus stabilize training, while Santurkar et al. [36] argue instead that BN smooths the loss landscape and thus enables training neural networks with larger learning rates [4]. Daneshmand et al. [8, 9] prove that a stack of randomized linear layers and BN layers endows the intermediate features of a neural network with sufficient numerical rank as depth increases, which is beneficial for optimization and for learning discriminative hierarchical features. Huang et al. [13] show that BN can improve the layer-wise conditioning of neural network optimization by exploring the spectrum of the Hessian matrix under a block-diagonal approximation [28]. From the view of generalization, Ioffe and Szegedy [17], Luo et al. [25], Li et al. [22], and Wu and Johnson [43] argue that BN serves as a regularizer that reduces over-fitting when its stochasticity is small but may have a detrimental effect when it is large [43]. Huang et al. [12] further propose Stochastic Normalization Disturbance (SND) to measure such stochasticity and show that large SND hinders the training of neural networks.
Training Inference Inconsistency of BN
Normalizing along the batch dimension usually introduces training inference inconsistency, since mini-batch data is neither necessary nor desirable during inference. BN therefore uses population statistics, estimated by a running average over mini-batch statistics, for inference. This training inference inconsistency usually harms the performance of BN for small-batch-size training, since the estimation of population statistics can be inaccurate [42]. One way to reduce the inconsistency between training and inference is to exploit the estimated population statistics for normalization during training [16, 6, 47, 50, 49]. These works may outperform BN when the batch size is small, where inaccurate estimation may be the main issue [17, 18], but they usually perform worse than BN under moderate-batch-size training [24]. Another way to reduce the inconsistency is to estimate corrected normalization statistics during inference only, either for domain adaptation [23], corruption robustness [37, 31, 2], or small-batch-size training [39, 40]. We note that a recent work [14] investigates the estimation shift problem of BN. Unlike that work, which addresses the accumulated estimation shift caused by stacked BNs in CNNs for CV tasks, our work focuses on how the training inference inconsistency of BN correlates with its performance for Transformers in NLP tasks. Besides, the estimation shift of BN defined in [14], which concerns the differences between the estimated population statistics and the expected statistics, differs from our TID of BN, which concerns the differences between the mini-batch statistics and the population statistics.
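For concreteness, the sketch below shows where this inconsistency enters a standard BN layer: mini-batch statistics are used to normalize during training, while running estimates of the population statistics are used at inference. This is generic BN behavior written in PyTorch-style Python; the function and variable names are ours.

    import torch

    def bn_forward(x, gamma, beta, running_mean, running_var,
                   training: bool, momentum: float = 0.1, eps: float = 1e-5):
        if training:
            mu = x.mean(dim=0)                    # mini-batch statistics
            var = x.var(dim=0, unbiased=False)
            # Running averages are accumulated for use at inference only.
            running_mean.mul_(1.0 - momentum).add_(momentum * mu.detach())
            running_var.mul_(1.0 - momentum).add_(momentum * var.detach())
        else:
            mu, var = running_mean, running_var   # population statistics
        x_hat = (x - mu) / torch.sqrt(var + eps)  # the two branches generally disagree
        return gamma * x_hat + beta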
Exploring the Failure of BN in Transformer
Similar to our work, Power Normalization (PN) [38] also investigates the reason behind the failure of BN in Transformers. Our work differs significantly from PN [38] in the following aspects. PN attributes the failure of BN to unstable training caused by fluctuating forward and backward batch statistics with outlier values, whereas we observe that the training of BN is as good as that of LN and that the inconsistency between training and inference of BN matters more. Based on this observation, we propose a regularization term to reduce the TID of BN. Compared with PN, which incorporates a layer-scale layer (root mean square layer normalization [51] without affine transformation [45]), our method introduces no extra computation at inference. Besides, we use a more suitable index to measure the inconsistency, one that is invariant to the scale of the data. Furthermore, we show that our RBN improves upon the layer-wise training dynamics of LN, which reveals the optimization advantages of RBN.
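For reference, the layer-scale operation mentioned above (root mean square layer normalization without the affine transform) amounts to the following sketch, assuming normalization over the feature dimension; this per-token normalization also runs at inference, which is the extra cost that RBN avoids.

    import torch

    def rms_layer_scale(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # RMS normalization over the last (feature) dimension with no learnable
        # affine parameters; an assumed sketch of the layer-scale used by PN.
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)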