Foundation Transformers
Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu,
Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun Liu, Vishrav Chaudhary,
Xia Song, Furu Wei
Microsoft
https://github.com/microsoft/unilm
Abstract
A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named MAGNETO, to fulfill the goal. Specifically, we propose Sub-LayerNorm for good expressivity, and the initialization strategy theoretically derived from DeepNet (Wang et al., 2022a) for stable scaling up. Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
Models                                          Previous    This work
Vision       Encoder          ViT/BEiT           Pre-LN      Sub-LN
Language     Encoder          BERT               Post-LN     Sub-LN
Language     Decoder          GPT                Pre-LN      Sub-LN
Language     Encoder-Decoder  NMT/BART           Post-LN     Sub-LN
Speech       Encoder          T-T                Pre-LN      Sub-LN
Multimodal   Encoder          BEiT-3             Pre-LN      Sub-LN
[Figure 1, bottom: block diagrams of (a) Post-LN, (b) Pre-LN, and (c) the proposed Sub-LN sub-layers, with the Sub-LN weights initialized as W ~ N(0, γ).]
Figure 1: Top: the architectures of SOTA models across language, vision, speech, and multimodal. Bottom: the proposed Foundation Transformer uses Sub-LN and theoretically derived initialization.
Equal contribution. Corresponding author.
arXiv:2210.06423v2 [cs.LG] 19 Oct 2022
1 Introduction
Recent years have witnessed a big convergence of model architectures across language, vision, speech, and multimodal. Specifically, starting from natural language processing, Transformers (Vaswani et al., 2017) have become the de facto standard for various areas, including computer vision (Dosovitskiy et al., 2021), speech (Zhang et al., 2020b), and multimodal (Kim et al., 2021; Wang et al., 2022b). Transformers fully leverage the parallelism advantage of GPU hardware and large-scale data. It is appealing that we can use the same network architecture for a broad range of applications, so that pretrained models can be seamlessly reused with shared implementations and hardware optimizations. Moreover, general-purpose modeling is important to multimodal models, as different modalities can be jointly encoded and fused by one model.
However, despite sharing the name "Transformers", the architectures are implemented quite differently across tasks. Figure 1 summarizes the architectures of state-of-the-art models that are widely used in various communities. For instance, some models (e.g., GPT, and ViT) adopt Pre-LayerNorm (Pre-LN) Transformers, while others use Post-LayerNorm (Post-LN) variants (e.g., BERT, and machine translation) for better performance. Rather than directly using the same architecture, we have to compare the two Transformer variants on each specific task or modality to determine the backbone, which is inefficient for model development. More importantly, for multimodal models, the optimal Transformer variant is usually different for different input modalities. Taking BEiT-3 (Wang et al., 2022b) vision-language pretraining as an example, using Post-LN is sub-optimal for vision encoding while Pre-LN is sub-optimal for the language part. The true convergence of multimodal pretraining requires a unified architecture that performs well across tasks and modalities. In addition, a pain point of Transformer architectures is training stability, especially for large-scale models. We usually need significant effort to tune hyperparameters or babysit training processes.
As a result, we call for developing Foundation Transformers for true general-purpose modeling. First, the desired architecture should serve as a go-to backbone for various tasks and modalities, so that we can use the same backbone without trial and error. The general-purpose design principle also greatly supports the development of multimodal foundation models, as we can use one unified Transformer for various modalities without performance degradation. Second, the architecture should provide guaranteed training stability. This property can significantly mitigate the difficulty of large-scale pretraining of foundation models.
In this work, we introduce MAGNETO as an implementation of Foundation Transformers to fulfill the above goals. Specifically, we introduce Sub-LayerNorm (Sub-LN), which adds an extra LayerNorm to each sublayer (i.e., multi-head self-attention, and feed-forward network). Moreover, MAGNETO has a novel initialization method with a theoretical guarantee that fundamentally improves training stability, allowing the models to be scaled up without pain. We evaluate MAGNETO on a wide range of tasks and modalities, namely masked language modeling (i.e., BERT), causal language modeling (i.e., GPT), machine translation, masked image modeling (i.e., BEiT), speech recognition, and vision-language pretraining (i.e., BEiT-3). Experimental results show that MAGNETO significantly outperforms the de facto Transformer variants on these downstream tasks. In addition, MAGNETO is more stable in terms of optimization, which allows larger learning rates to improve results without training divergence.
2 TL;DR for Practitioners
Figure 1 illustrates the overview of the MAGNETO architecture. There are two key improvements in terms of modeling. First, compared to the Pre-LN variant, Sub-LN introduces another LayerNorm inside each sublayer (i.e., multi-head self-attention, and feed-forward network): one before the input projection, and the other before the output projection. Second, we use the initialization theoretically derived from DeepNet (Wang et al., 2022a), which fundamentally improves the training stability, allowing the model to be scaled up to massive sizes without pain.
Figure 2 presents the implementation of MAGNETO. There are only a few lines of code changes on top of the vanilla Transformer architecture. Notably, following the derivation from DeepNet, the weights of the query projection and the key projection are not scaled during initialization.
def subln(x):
    # Sub-LN: LayerNorm before the input projection fin and before the output projection fout
    return x + fout(LN(fin(LN(x))))

def subln_init(w):
    # w is identified by the name of its projection; gamma is the constant given in the table (Figure 2, top right)
    if w in ['ffn', 'v_proj', 'out_proj']:
        nn.init.xavier_normal_(w, gain=gamma)
    elif w in ['q_proj', 'k_proj']:
        nn.init.xavier_normal_(w, gain=1)
Architectures                        Encoder γ                        Decoder γ
Encoder-only (e.g., BERT, ViT)       √(log 2N)                        -
Decoder-only (e.g., GPT)             -                                √(log 2M)
Encoder-decoder (e.g., NMT, BART)    √((1/3) · log 3M · log 2N)       √(log 3M)
[Figure 2, bottom: the layout of Sub-LN for (a) an encoder or decoder stack (× N) and (b) an encoder-decoder model (× N encoder layers, × M decoder layers), with weights initialized as W ~ N(0, γ).]
Figure 2: Top left: pseudocode of Sub-LN. We take Xavier initialization (Glorot and Bengio, 2010) as an example, and it can be replaced with other standard initialization. Notice that γ is a constant. Top right: parameters of Sub-LN for different architectures (N-layer encoder, M-layer decoder). Bottom: the layout of Sub-LN for different architectures.
Besides, there is only one LayerNorm inside the cross-attention for the encoder-decoder architecture
and we do not scale the initialized weights of cross-attention.
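To make this recipe concrete, the following is a minimal PyTorch sketch of a Sub-LN feed-forward sublayer together with the γ-scaled initialization from Figure 2. The helper and class names (subln_gain, SubLNFeedForward) and the module decomposition are our own illustration, not the released MAGNETO implementation; dropout and the attention sublayer are omitted here.

import math
import torch
import torch.nn as nn

def subln_gain(arch, N=0, M=0):
    # gamma values from the table in Figure 2 (top right);
    # N = number of encoder layers, M = number of decoder layers.
    # Returns (encoder gamma, decoder gamma).
    if arch == 'encoder-only':      # e.g., BERT, ViT
        return math.sqrt(math.log(2 * N)), None
    if arch == 'decoder-only':      # e.g., GPT
        return None, math.sqrt(math.log(2 * M))
    if arch == 'encoder-decoder':   # e.g., NMT, BART
        return (math.sqrt(math.log(3 * M) * math.log(2 * N) / 3),
                math.sqrt(math.log(3 * M)))
    raise ValueError(arch)

class SubLNFeedForward(nn.Module):
    # Feed-forward sublayer with Sub-LN: LayerNorm before the input projection
    # and before the output projection, i.e. x + W2 LN(relu(W1 LN(x))).
    def __init__(self, d_model, d_ffn, gamma):
        super().__init__()
        self.ln_in = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.ln_mid = nn.LayerNorm(d_ffn)
        self.fc2 = nn.Linear(d_ffn, d_model)
        # FFN weights are scaled by gamma at initialization; query/key projections
        # elsewhere in the block keep gain=1, as in subln_init above.
        nn.init.xavier_normal_(self.fc1.weight, gain=gamma)
        nn.init.xavier_normal_(self.fc2.weight, gain=gamma)

    def forward(self, x):
        return x + self.fc2(self.ln_mid(torch.relu(self.fc1(self.ln_in(x)))))

# Example: the feed-forward sublayer of a 24-layer encoder-only model.
enc_gamma, _ = subln_gain('encoder-only', N=24)
ffn = SubLNFeedForward(d_model=1024, d_ffn=4096, gamma=enc_gamma)

For an encoder-decoder model, subln_gain returns separate encoder and decoder gains, and the cross-attention weights are left unscaled, as noted above.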
3 MAGNETO: A Foundation Transformer
3.1 Architecture: Sub-LayerNorm
Vanilla Transformers are based on either Pre-LayerNorm (Pre-LN) or Post-LayerNorm (Post-LN) structures. Different from them, MAGNETO is built on Sub-LayerNorm (Sub-LN). It inherits the multi-head attention and the feed-forward network from Transformers, and introduces two layer normalization modules inside each sublayer (except the cross-attention).
For the multi-head attention, the layer normalization modules are placed before the qkv projection and the output projection, which can be formulated as:

$$Q, K, V = W^Q \mathrm{LN}(x),\; W^K \mathrm{LN}(x),\; W^V \mathrm{LN}(x) \quad (1)$$
$$\mathrm{MSA}(x) = x + W^O \mathrm{LN}(\mathrm{Attention}(Q, K, V)) \quad (2)$$

where $W^Q$, $W^K$, $W^V$, and $W^O$ are the parameters of the multi-head self-attention. Similarly, for the feed-forward network, the layer normalization modules are placed before the input projection and the output projection, which are written as:

$$\mathrm{FC}_1(x) = W^1 \mathrm{LN}(x) \quad (3)$$
$$\mathrm{FC}_2(x) = W^2 \mathrm{LN}(x) \quad (4)$$
$$\mathrm{FFN}(x) = \mathrm{FC}_2(\phi(\mathrm{FC}_1(x))) \quad (5)$$
where $W^1$ and $W^2$ are the parameters of the feed-forward layers, and $\phi$ is the non-linear activation function.
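To complement the feed-forward sketch in Section 2, here is a minimal PyTorch rendering of the Sub-LN attention sublayer in Equations (1)-(2). The class name and layout are again our own illustrative assumptions rather than the official implementation; attention masking and dropout are omitted.

import torch
import torch.nn as nn

class SubLNSelfAttention(nn.Module):
    # Multi-head self-attention with Sub-LN: LN before the qkv projection and
    # LN before the output projection (Equations 1-2).
    def __init__(self, d_model, num_heads, gamma):
        super().__init__()
        self.num_heads = num_heads
        self.ln_in = nn.LayerNorm(d_model)    # LN before the qkv projection
        self.ln_out = nn.LayerNorm(d_model)   # LN before the output projection
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # gamma-scaled initialization for value/output projections; q/k keep gain=1.
        nn.init.xavier_normal_(self.v_proj.weight, gain=gamma)
        nn.init.xavier_normal_(self.out_proj.weight, gain=gamma)
        nn.init.xavier_normal_(self.q_proj.weight, gain=1.0)
        nn.init.xavier_normal_(self.k_proj.weight, gain=1.0)

    def forward(self, x):
        b, t, d = x.shape
        h = self.num_heads
        xn = self.ln_in(x)
        # Equation (1): Q, K, V = W^Q LN(x), W^K LN(x), W^V LN(x)
        q = self.q_proj(xn).view(b, t, h, d // h).transpose(1, 2)
        k = self.k_proj(xn).view(b, t, h, d // h).transpose(1, 2)
        v = self.v_proj(xn).view(b, t, h, d // h).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (d // h) ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        # Equation (2): MSA(x) = x + W^O LN(Attention(Q, K, V))
        return x + self.out_proj(self.ln_out(out))

One MAGNETO layer is then the composition of this attention sublayer and the Sub-LN feed-forward sublayer.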
3.2 Initialization: Theoretical Derivation from DeepNet
We adopt the theoretical derivation from DeepNet (Wang et al., 2022a) to improve the training stability. DeepNet estimates the expected model update for Post-LN and introduces DeepNorm to bound the model update to a constant. Following DeepNet, we first estimate the expected model update of Sub-LN and then demonstrate how to bound it with a proper initialization.
Expected Model Update for Pre-LN
We start with the expected model update for Pre-LN. The forward propagation for an $N$-layer Pre-LN Transformer with $N$ attention sub-layers and $N$ feed-forward sub-layers can be formulated as:

$$F(x; \theta) = W_{vocab}\, x_e \quad (6)$$
$$x_e = \mathrm{LN}\Big(x + \sum_{l=1}^{L} G_l(x_{l-1}, \theta_{el})\Big), \quad x_l = G_l(x_{l-1}, \theta_{el}) \ \text{and} \ x_0 = x \quad (7)$$

where $x_{l-1}$ and $x_l$ denote the input and output of the $l$-th sub-layer $G_l$. If $l$ is odd, $G_l$ refers to the self-attention MSA; if $l$ is even, $G_l$ refers to the FFN. $x_e$ is the output of the backbone. $\theta$ denotes the parameters of the output projection $W_{vocab}$ and the backbone $\{\theta_{el}\}_{l=1}^{L}$. $W_{vocab} \in \mathbb{R}^{V \times d}$, where $d$ is the hidden dimension and $V$ is the dictionary size. $L$ equals $2N$ for simplicity. Without loss of generality, we set the intermediate dimension of the feed-forward layers equal to the hidden dimension.
Following Wang et al. (2022a), the magnitude of the attention output only depends on the value and output projections: $\mathrm{MSA}(X) \overset{\Theta}{=} W^O W^V \mathrm{LN}(X)$. Similarly, we have $\mathrm{FFN}(x) = W^2 \phi(W^1 \mathrm{LN}(x))$. Therefore, for vanilla Pre-LN, the forward computation of the $l$-th sub-layer can be formulated as:

$$x_l = x_{l-1} + W_{l,2}\, \phi(W_{l,1}\, \mathrm{LN}(x_{l-1})) \quad (8)$$
We introduce two constants $v_l, w_l$ to represent the scales of $W_{l,2}$ and $W_{l,1}$ respectively. For example, the entry in the $i$-th row and $j$-th column of $W_{l,2}$ satisfies:

$$W_{l,2}^{ij} \sim \mathcal{N}\Big(0, \frac{v_l^2}{d}\Big) \quad (9)$$
We define the model update as $\Delta F = \|\gamma^T (F(x; \theta^*) - F(x; \theta))\|$, where $\gamma, F(x) \in \mathbb{R}^{V \times 1}$ and $\theta^*$ denotes the parameters after one update. $x$ and $F(x)$ denote the input and output of the model respectively. $\gamma$ is the label of $x$, which is a one-hot vector with a single entry as 1 and all the others as 0. With the above analysis, we have the following theorem to characterize $\Delta F_{pre}$ for an $N$-layer, encoder-only Pre-LN Transformer under SGD update.
Theorem 3.1. Given an $N$-layer Pre-LN Transformer $F(x, \theta)$, the $l$-th sub-layer is formulated as $x_l = x_{l-1} + W_{l,2}\,\phi(W_{l,1}\,\mathrm{LN}(x_{l-1}))$. Under SGD update, $\Delta F_{pre}$ satisfies:

$$\Delta F_{pre} \le \eta d \left( \sum_{l=1}^{L} \frac{v_l^2 + w_l^2}{\sum_{n=1}^{L} v_n^2 w_n^2} + \sum_{l=1}^{L} \sum_{k=2}^{L} \frac{v_l^2 + w_l^2}{\sum_{n=1}^{L} v_n^2 w_n^2} \cdot \frac{v_k^2 w_k^2}{\sum_{n=1}^{k-1} v_n^2 w_n^2} \right) \quad (10)$$

where $\eta$ is the learning rate and $L$ equals $2N$.
Based on Theorem 3.1, with $v_l = w_l = 1$ (i.e., standard initialization) for vanilla Pre-LN, we have $\Delta F_{pre} = \mathcal{O}(\eta d \log L)$, which shows that the magnitude of the model update grows logarithmically as the depth increases. This is also verified by Liu et al. (2020). Wang et al. (2022a) prove that under SGD update, the model update of vanilla Post-LN $\Delta F_{post}$ is $\mathcal{O}(\sum_{l=1}^{L} (v_l^2 + w_l^2))$. $\Delta F_{pre}$ is much smaller than $\Delta F_{post}$ at the same model depth $L$, which indicates that the loss landscape of vanilla Pre-LN is smoother than that of vanilla Post-LN, leading to faster and more stable optimization.
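As a quick check of the $\mathcal{O}(\eta d \log L)$ claim, substituting $v_l = w_l = 1$ into the bound of Theorem 3.1 (so every numerator becomes 2 and every full-sum denominator becomes $L$) gives

$$\Delta F_{pre} \le \eta d \left( \sum_{l=1}^{L} \frac{2}{L} + \sum_{l=1}^{L}\sum_{k=2}^{L} \frac{2}{L} \cdot \frac{1}{k-1} \right) = \eta d \left( 2 + 2\sum_{k=2}^{L} \frac{1}{k-1} \right) \le \eta d\, \big( 4 + 2\log(L-1) \big) = \mathcal{O}(\eta d \log L).$$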
Expected Model Update for MAGNETO
Based on the analysis of Pre-LN, we further estimate the expected model update of Sub-LN. With Sub-LN, the forward signal propagation of the $l$-th sub-layer can be formulated as:

$$x_l = x_{l-1} + W_{l,2}\,\mathrm{LN}(\phi(W_{l,1}\,\mathrm{LN}(x_{l-1}))) \quad (11)$$

We then give the expected bound of the magnitude of the model update $\Delta F_{sub}$ for an $N$-layer, encoder-only MAGNETO.
Theorem 3.2. Given an $N$-layer MAGNETO $F(x, \theta)$, the $l$-th sub-layer is formulated as $x_l = x_{l-1} + W_{l,2}\,\mathrm{LN}(\phi(W_{l,1}\,\mathrm{LN}(x_{l-1})))$. Under SGD update, $\Delta F_{sub}$ satisfies:

$$\Delta F_{sub} \le \eta d \left( \sum_{l=1}^{L} \frac{1 + v_l^2/w_l^2}{\sum_{n=1}^{L} v_n^2} + \sum_{l=1}^{L}\sum_{k=2}^{L} \frac{1 + v_l^2/w_l^2}{\sum_{n=1}^{L} v_n^2} \cdot \frac{v_k^2}{\sum_{n=1}^{k-1} v_n^2} \right) \quad (12)$$

where $\eta$ is the learning rate and $L$ equals $2N$.
When the activation of the $l$-th sub-layer explodes, it leads to $w_l \gg w_i,\ i \ne l$. Equation (13) proves that the model update of MAGNETO is smaller than that of vanilla Pre-LN in this case.

$$\frac{1 + v_l^2 / w_l^2}{\sum_{n=1}^{L} v_n^2} = \frac{v_l^2 + w_l^2}{w_l^2 \sum_{n=1}^{L} v_n^2} \le \frac{v_l^2 + w_l^2}{\sum_{n=1}^{L} v_n^2 w_n^2}, \qquad w_l \gg w_i,\ i \ne l \quad (13)$$
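The inequality in Equation (13) holds because, when $w_l$ dominates the other scales, replacing each $w_n^2$ in the denominator with $w_l^2$ can only enlarge it:

$$\sum_{n=1}^{L} v_n^2 w_n^2 \;\le\; w_l^2 \sum_{n=1}^{L} v_n^2,$$

and the right-hand side of Equation (13) is exactly the per-sub-layer factor in the Pre-LN bound of Theorem 3.1.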
Furthermore, we study the magnitude of the model update for MAGNETO with the encoder-decoder architecture. $\theta_e$ follows the same definition as in Theorem 3.2, and $\theta_d$ denotes the parameters of the decoder. Theorem 3.3 bounds the magnitude of the model update under SGD update, $\Delta F_{ed} = \|\gamma^T (F_{ed}(x, y, \theta_e^*, \theta_d^*) - F_{ed}(x, y, \theta_e, \theta_d))\|$, where $x$ and $y$ denote the inputs of the encoder and the decoder respectively.
Theorem 3.3. Given an encoder-decoder MAGNETO $F_{ed}(x, y, \theta_e, \theta_d)$ with $N$ encoder layers and $M$ decoder layers, the $l$-th sub-layer is formulated as $x_l = x_{l-1} + W_{l,2}\,\mathrm{LN}(\phi(W_{l,1}\,\mathrm{LN}(x_{l-1})))$. Under SGD update, $\Delta F_{ed}$ satisfies:

$$\Delta F_{ed} \le \Delta F_d + \sum_{\substack{l=1 \\ l \bmod 3 = 1}}^{L_d} \frac{v_{dl}^2}{\sum_{n=1}^{L_d} v_{dn}^2} \left( 1 + \sum_{k=2}^{L_d} \frac{v_{dk}^2}{\sum_{n=1}^{k-1} v_{dn}^2} \right) \Delta F_e \quad (14)$$

$$\Delta F_d \overset{\Theta}{=} \eta d \left( \sum_{l=1}^{L_d} \frac{1 + v_{dl}^2/w_{dl}^2}{\sum_{n=1}^{L_d} v_{dn}^2} + \frac{1}{\sum_{n=1}^{L_d} v_{dn}^2} \sum_{l=1}^{L_d}\sum_{k=2}^{L_d} \frac{(1 + v_{dl}^2/w_{dl}^2)\, v_{dk}^2}{\sum_{n=1}^{k-1} v_{dn}^2} \right) \quad (15)$$

$$\Delta F_e \overset{\Theta}{=} \eta d \left( \sum_{l=1}^{L_e} \frac{1 + v_{el}^2/w_{el}^2}{\sum_{n=1}^{L_e} v_{en}^2} + \frac{1}{\sum_{n=1}^{L_e} v_{en}^2} \sum_{l=1}^{L_e}\sum_{k=2}^{L_e} \frac{(1 + v_{el}^2/w_{el}^2)\, v_{ek}^2}{\sum_{n=1}^{k-1} v_{en}^2} \right) \quad (16)$$

where $\eta$ is the learning rate, $L_d$ equals $3M$, and $L_e$ equals $2N$.
Derivation and Implementation
We then demonstrate that the expected model update of MAGNETO above can be bounded with proper initialization. We provide the analysis for the encoder-only architecture, which can be naturally extended to encoder-decoder models in the same way. Analogous to Zhang et al. (2019) and Wang et al. (2022a), we set our goal for the model update as follows:

GOAL: $F(x, \theta)$ is updated by $\Theta(\eta)$ per SGD step after initialization as $\eta \to 0$. That is, $\Delta F_{sub} = \Theta(\eta d)$, where $\Delta F_{sub} = F\big(x, \theta - \eta \frac{\partial \mathcal{L}}{\partial \theta}\big) - F(x, \theta)$.
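As a condensed check of how the γ in Figure 2 meets this goal (assuming, for simplicity, that every sub-layer uses the same scales $v_l = w_l = \gamma$), Theorem 3.2 gives

$$\Delta F_{sub} \le \eta d \left( \sum_{l=1}^{L} \frac{2}{L\gamma^2} + \sum_{l=1}^{L}\sum_{k=2}^{L} \frac{2}{L\gamma^2} \cdot \frac{1}{k-1} \right) = \frac{2\eta d}{\gamma^2} \Big( 1 + \sum_{k=2}^{L} \frac{1}{k-1} \Big) = \mathcal{O}\Big( \frac{\eta d \log L}{\gamma^2} \Big),$$

so choosing $\gamma^2 = \Theta(\log L) = \Theta(\log 2N)$ bounds the update to $\Theta(\eta d)$, consistent with the encoder-only entry $\gamma = \sqrt{\log 2N}$ in Figure 2.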