1 Introduction
Recent years have witnessed a major convergence of model architectures across language, vision, speech, and multimodal domains. Specifically, starting from natural language processing, Transformers (Vaswani et al., 2017) have become the de facto standard for various areas, including computer vision (Dosovitskiy et al., 2021), speech (Zhang et al., 2020b), and multimodal modeling (Kim et al., 2021; Wang et al., 2022b). Transformers fully leverage the parallelism of GPU hardware and large-scale data. It is appealing that we can use the same network architecture for a broad range of applications, so that pretrained models can be seamlessly reused with shared implementations and hardware optimizations. Moreover, general-purpose modeling is important for multimodal models, as different modalities can be jointly encoded and fused by one model.
However, despite sharing the name “Transformers”, the architectures used for different tasks differ significantly in their implementations. Figure 1 summarizes the architectures of state-of-the-art models that are widely used in various communities. For instance, some models (e.g., GPT and ViT) adopt Pre-LayerNorm (Pre-LN) Transformers, while others use Post-LayerNorm (Post-LN) variants (e.g., BERT and machine translation models) for better performance. Rather than directly reusing the same architecture, we need to compare the two Transformer variants on the specific task or modality to determine the backbone, which is inefficient for model development. More importantly, for multimodal models, the optimal Transformer variant usually differs across input modalities. In BEiT-3 (Wang et al., 2022b) vision-language pretraining, for example, Post-LN is sub-optimal for vision encoding while Pre-LN is sub-optimal for the language part. The true convergence of multimodal pretraining requires a unified architecture that performs well across tasks and modalities. In addition, a pain point of Transformer architectures is training stability, especially for large-scale models. We usually need considerable effort to tune hyperparameters or babysit training processes.
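To make the Pre-LN versus Post-LN distinction concrete, the following sketch contrasts where the two variants place LayerNorm around a sublayer's residual connection. It is a minimal PyTorch illustration, not the reference code of any of the cited models; class names and the `sublayer` argument are ours.

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm is applied after the residual addition (e.g., BERT)."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer          # self-attention or feed-forward module
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied to the sublayer input, inside the residual branch (e.g., GPT, ViT)."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```

Post-LN normalizes the residual stream itself, which tends to help final performance but can destabilize deep models, whereas Pre-LN keeps the residual path identity-like and trains more stably at the cost of accuracy on some tasks.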
As a result, we call for developing Foundation Transformers for true general-purpose modeling.
First, the desired architecture should serve as a go-to backbone for various tasks and modalities, so that we can use the same architecture without trial and error. This general-purpose design principle also greatly supports the development of multimodal foundation models, as one unified Transformer can be used for various modalities without performance degradation. Second, the architecture should provide guaranteed training stability. This property significantly mitigates the difficulty of large-scale pretraining of foundation models.
In this work, we introduce MAGNETO as an implementation of Foundation Transformers that fulfills the above goals. Specifically, we introduce Sub-LayerNorm (Sub-LN), which adds an extra LayerNorm to each sublayer (i.e., the multi-head self-attention and the feed-forward network). Moreover, MAGNETO uses a novel initialization method with a theoretical guarantee of improved training stability, which allows the models to be scaled up without pain. We evaluate MAGNETO on a wide range of tasks and modalities, namely, masked language modeling (i.e., BERT), causal language modeling (i.e., GPT), machine translation, masked image modeling (i.e., BEiT), speech recognition, and vision-language pretraining (i.e., BEiT-3). Experimental results show that MAGNETO significantly outperforms the de facto Transformer variants on downstream tasks. In addition, MAGNETO is more stable in terms of optimization, which allows larger learning rates to improve results without training divergence.
2 TL;DR for Practitioners
Figure 1 illustrates the overview of the MAGNETO architecture. There are two key improvements in terms of modeling. First, compared to the Pre-LN variant, Sub-LN introduces another LayerNorm inside each sublayer (i.e., the multi-head self-attention and the feed-forward network): one before the input projection, and the other before the output projection. Second, we use the initialization derived theoretically in DeepNet (Wang et al., 2022a), which fundamentally improves training stability and allows the model to be scaled up to massive sizes without pain.
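As a rough illustration of where the two LayerNorms sit, the sketch below renders a Sub-LN feed-forward sublayer in minimal PyTorch. Module names and the choice of activation are ours rather than taken from the reference code; the self-attention sublayer is treated analogously, with the extra LayerNorm placed before its output projection.

```python
import torch.nn as nn

class SubLNFeedForward(nn.Module):
    """Feed-forward sublayer with Sub-LN: one LayerNorm before the input
    projection and an extra LayerNorm before the output projection."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)          # before the input projection (as in Pre-LN)
        self.fc_in = nn.Linear(dim, hidden_dim)
        self.activation = nn.GELU()
        self.norm_out = nn.LayerNorm(hidden_dim)  # extra LayerNorm before the output projection
        self.fc_out = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        # residual connection wraps the whole sublayer
        h = self.fc_in(self.norm_in(x))
        h = self.fc_out(self.norm_out(self.activation(h)))
        return x + h
```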
Figure 2 presents the implementation of MAGNETO. It requires only a few lines of code changes on top of the vanilla Transformer architecture. Notably, following the derivation from DeepNet, the weights of the query and key projections are not scaled during initialization.
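The sketch below is a hypothetical helper that only illustrates which projections receive the scaled initialization and which are left untouched. The module names (q_proj, k_proj) and the Xavier-then-scale formulation are assumptions for exposition; the scaling factor gamma is derived from the model depth in the paper and is simply passed in here.

```python
import torch.nn as nn

def init_magneto_style(layer, gamma):
    """Initialize a Transformer layer's linear projections, scaling all of them
    by gamma except the query and key projections (left unscaled, per DeepNet)."""
    for name, sub in layer.named_modules():
        if isinstance(sub, nn.Linear):
            nn.init.xavier_normal_(sub.weight)
            if sub.bias is not None:
                nn.init.zeros_(sub.bias)
            if "q_proj" in name or "k_proj" in name:
                continue  # query/key projections keep the standard initialization
            sub.weight.data.mul_(gamma)  # value/output projections and FFN weights are scaled
```

In practice, gamma would be computed once from the number of encoder and decoder layers following the DeepNet-style derivation, and the helper applied to every Transformer layer before training.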