Transformers meet Stochastic Block Models:
Attention with Data-Adaptive Sparsity and Cost
Sungjun Cho^1, Seonwoo Min^1, Jinwoo Kim^2,
Moontae Lee^{1,3}, Honglak Lee^1, Seunghoon Hong^{2,1}
^1 LG AI Research   ^2 KAIST   ^3 University of Illinois Chicago
Abstract
To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention with hand-crafted patterns and 2) full attention followed by a sparse variant of softmax such as $\alpha$-entmax. Unfortunately, the first group lacks adaptability to data while the second still requires quadratic cost in training. In this work, we propose SBM-Transformer, a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model (SBM). Then, each attention head data-adaptively samples a bipartite graph, the adjacency matrix of which is used as an attention mask for each input. During backpropagation, a straight-through estimator is used to flow gradients beyond the discrete sampling step and adjust the probabilities of sampled edges based on the predictive loss. The forward and backward costs are thus linear in the number of edges, which each attention head can also choose flexibly based on the input. By assessing the distribution of graphs, we theoretically show that SBM-Transformer is a universal approximator for arbitrary sequence-to-sequence functions in expectation. Empirical evaluations under the LRA and GLUE benchmarks demonstrate that our model outperforms previous efficient variants as well as the original Transformer with full attention. Our implementation can be found at https://github.com/sc782/SBM-Transformer.
1 Introduction
The Transformer [40] architecture has been the go-to method for encoding sequential data, due to its superior performance in various tasks such as machine translation [30], image classification [15], and protein language modeling [34]. Its key strength stems from the multi-head attention module, where a so-called attention score matrix computes how contextually important one token is to another for all possible token pairs. Each Transformer layer simultaneously pools the token representations based on the attention scores, eventually returning contextualized features without sequentially traversing the input sequence as its recurrent neural network-based predecessors [18] do.
A well-known drawback of the original Transformer is its high computational cost in time and memory, which increases quadratically with sequence length. This is due to the full pairwise computation of attention scores, which prohibits applying it to tasks involving long-range dependencies such as document summarization [19] or high-resolution image processing [53]. Many works have thus focused on developing more efficient alternatives by exploiting fixed or learnable attention sparsity patterns [9, 51, 22, 13], low-rank approximations [45, 48], or kernelized attention modules [21, 10].
Even though these efficient alternatives hold theoretical expressibility guarantees [50], they are far from sufficient, still failing to convince practitioners to replace the original Transformer. We believe this is mostly due to their lack of adaptability: they apply the same modifications to uniformly sparsify all attention modules across layers, without considering the task at hand.
Figure 1: The attention module in SBM-Transformer. In multi-head attention, each attention head samples a bipartite graph connecting queries to keys from an underlying SBM. The adjacency matrix of the sampled graph is used as an attention mask to compute the dot products only for the sampled edges.
Such a strategy imposes inductive bias too strongly and often leads to sub-optimal cost vs. performance trade-offs in downstream tasks [29]. In this work, we argue that to retain the full potential of Transformers,
each attention module should have the ability to flexibly choose between sparse and full attention.
This is especially evident when considering that many state-of-the-art systems suggest the need for a mixture of dense and sparse attention layers. For example, a qualitative analysis of pretrained BERT showed that lower layers exhibit broad, dense attention while upper layers perform focused, sparse attention [11]. In the case of GPT-3 [7], the Transformer blocks are manually arranged to alternate between dense and sparse attention.
To contribute to the efficient Transformers lineage, we propose SBM-Transformer, which is capable of adjusting its attention sparsity data-adaptively without fully computing the attention score matrix (Figure 1). Leveraging a mixed-membership Stochastic Block Model (SBM) [2], each attention head samples a bipartite graph connecting queries to keys. Then, the adjacency matrix of the sampled graph is used as an attention mask so that only attention scores corresponding to sampled edges are computed. The overall computational cost is linear in the number of edges, which can range from linear to quadratic in sequence length depending on the data and task under concern. Each attention head is equipped with its own underlying SBM, enabling the model to diversify the attention sparsity across heads and layers. By incorporating a straight-through estimator [4] in the discrete graph-sampling step, SBM-Transformer enjoys end-to-end differentiability and can find the proper attention sparsity based solely upon minimizing the predictive loss. The model can also easily be further regularized by penalizing the number of sampled edges, which results in a lighter model using less computational resources during inference. To the best of our knowledge, our method is the first Transformer architecture that can data-adaptively choose between linear and full attention, with the respective computational costs. To summarize, our main contributions are as follows:
• We present SBM-Transformer, a novel Transformer in which each attention head can adaptively adjust its attention sparsity, as well as its computational cost, based on the input data.
• To demonstrate the benefit of this flexibility, we theoretically prove that SBM-Transformer retains universal approximability, and also stress-test the model under a synthetic task where full attention is required to achieve 100% accuracy.
• Evaluations on the LRA and GLUE benchmarks show that SBM-Transformer outperforms previous efficient Transformer models as well as the vanilla Transformer with dense attention.
2 Related Work
In this section we discuss previous efficient Transformer variants and several works similar to ours
with respect to adaptively learning sparse attention patterns. We also review several works on SBMs.
Efficient Transformers. Many efficient Transformers aim to reduce the quadratic cost of multi-head attention with different approaches. While we discuss only a handful of representative approaches, a much more comprehensive survey can be found in [39]. The Linear Transformer [21] achieves linear complexity by replacing the softmax with a low-rank kernelized function. Linformer [45] and Nyströmformer [48] use a similar approach by low-rank approximating the attention score matrix. Performer [10] uses positive orthogonal random features to approximate the softmax kernel. Reformer [22] gathers similar tokens together through locality-sensitive hashing (LSH) and performs attention amongst tokens within the same bucket. Of all the methods above, our method is
most similar to Reformer, in the sense that we adaptively assign queries and keys into clusters and form a low-rank sparse attention pattern. However, our method performs soft clustering with far fewer structural constraints, allowing each attention head to represent a wider variety of dependency structures and to adjust its sparsity towards full attention if needed.
Adaptive Sparsity. With respect to flexible training between sparse and dense attention, there exist some works that parameterize how sparse the attention pattern should be based on the input. The Adaptive Sparse Transformer [12] proposed replacing the usual softmax activation with $\alpha$-entmax, in which the $\alpha$ parameter can be differentiably trained to adjust the activation between softmax and sparsemax [27]. SparseBERT [36] uses a differentiable masking technique where each attention mask is sampled from a Gumbel-sigmoid distribution using data-independent mask probability parameters. While these methods possess the flexibility to adjust between sparse and full attention based on data, they still require full computation of the attention score matrix before sparsification, and hence are unable to leverage the learned sparsity towards better model efficiency. To the best of our knowledge, ours is the first work able to adaptively tune its attention sparsity from sparse to full attention without requiring explicit computation of the attention score matrix, thereby avoiding quadratic cost when possible.
Stochastic Block Models. The Stochastic Block Model (SBM) is a generative model that encodes the latent structure of graphs by grouping nodes into clusters. By modeling the cluster membership of each node as well as inter-cluster relationships, SBMs can represent a wide variety of graph structures, a feature especially useful for generating new graphs or predicting missing edges in noisy data [1]. The standard SBM assigns each node to a single cluster, and the probability of an edge between two nodes strictly depends on the corresponding clusters. Several structural extensions include the overlapping SBM [24] and the mixed-membership SBM [2], which allow each node to be assigned to multiple clusters. The underlying SBM used by our framework mostly resembles these two variants, while the edge probability is modeled by a nonlinear function of two node embeddings rather than a bilinear one. There exist many other extensions, including the degree-corrected SBM [20] for multigraphs and the hierarchical SBM [31] for multiplex graphs. Further details can be found in a recent survey [16].
3 Preliminaries: Sparse Transformers
We first introduce the full attention mechanism used in the original Transformer [40] as well as masked attention, which will serve as the backbone of our approach.
3.1 Full Attention
In the vanilla Transformer [40], each attention head takes a sequence of token features $X \in \mathbb{R}^{n \times d}$ as input, where $n$ is the sequence length and $d$ the embedding dimension. Weight parameters $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_h}$ with head dimension $d_h$ first map the input features $X$ into query $Q$, key $K$, and value $V$, respectively. Then, the attention score matrix is computed as the scaled dot-product of queries and keys followed by a row-wise softmax activation $\sigma(\cdot)$. Note that explicit computation of this matrix is the main bottleneck of full attention, incurring $O(n^2)$ asymptotic cost in both time and memory. The value features $V$ are then pooled based on the attention scores, returning the output token representations. Altogether, the operation performed by each attention head can be written as

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V \tag{1}$$

$$\mathrm{Attn}(X) = \sigma\!\left(\frac{QK^T}{\sqrt{d_h}}\right)V. \tag{2}$$
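For reference, and to contrast with the masked variant below, the following is a minimal sketch of a single full-attention head implementing Eqs. (1)-(2). It assumes PyTorch; the class and parameter names are illustrative, and multi-head concatenation, dropout, and the output projection are omitted.

```python
import math
import torch
import torch.nn as nn

class FullAttentionHead(nn.Module):
    """Single attention head: Attn(X) = softmax(QK^T / sqrt(d_h)) V, as in Eqs. (1)-(2)."""

    def __init__(self, d: int, d_h: int):
        super().__init__()
        # W^Q, W^K, W^V in R^{d x d_h}
        self.W_q = nn.Linear(d, d_h, bias=False)
        self.W_k = nn.Linear(d, d_h, bias=False)
        self.W_v = nn.Linear(d, d_h, bias=False)
        self.d_h = d_h

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (n, d) token features
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)          # each (n, d_h)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_h)   # (n, n) -- the O(n^2) bottleneck
        return torch.softmax(scores, dim=-1) @ V                 # (n, d_h)

# Usage: head = FullAttentionHead(d=64, d_h=16); out = head(torch.randn(128, 64))
```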
3.2 Masked Attention
One way to remove the quadratic bottleneck from the attention score matrix is to apply a binary mask $M \in \{0,1\}^{n \times n}$ and compute the scaled dot-products $Q_i K_j^T / \sqrt{d_h}$ only if $M_{ij} = 1$. In the presence of an attention mask, the operation is modified to

$$\mathrm{Attn}_{\mathrm{mask}}(X, M) = \sigma_M\!\left(M \odot \frac{QK^T}{\sqrt{d_h}}\right)V \tag{3}$$

$$\sigma_M(A)_{ij} := \begin{cases} \dfrac{\exp(A_{ij})}{\sum_{k \in \{k' \mid M_{ik'}=1\}} \exp(A_{ik})} & \text{if } M_{ij} = 1 \\[4pt] 0 & \text{otherwise} \end{cases} \tag{4}$$

where $\odot$ denotes entry-wise multiplication.
Figure 2: An illustration of the attention mechanism in SBM-Transformer. Each head first maps
queries and keys to the node representation space through a shared MLP. The graph sampling module
samples an attention mask from a Stochastic Block Model (SBM) parameterized by the node and
cluster embeddings. The discrete sampling step is differentiable via a Straight-Through Estimator
(STE). Given the mask, the output is computed via masked attention.
Note that the masked-softmax operator $\sigma_M(\cdot)$ only computes unmasked terms, ensuring that the $(i, j)$-th attention score survives as nonzero if and only if $M_{ij} = 1$. This is thus equivalent to filling in the $(i, j)$-th attention score with $-\infty$ if $M_{ij} = 0$, then applying the standard softmax operator. Most sparsity-based efficient Transformers fall under this formulation, while using different methods to either manually fix or learn the mask $M$. For instance, local attention [9, 3, 51] with a sliding window sets $M_{ij} = 1$ if $|i - j| < c$ for some context window size $c$, while Reformer [22] sets $M_{ij} = 1$ if $Q_i$ and $K_j$ are hashed into the same bucket.
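To make the equivalence above concrete, the sketch below (assuming PyTorch; the function name and the sliding-window example are ours, not the paper's) applies Eq. (4) by filling masked-out scores with $-\infty$ before a standard softmax. Note that this dense reference still materializes the full $n \times n$ score matrix, which is exactly the cost SBM-Transformer avoids by computing dot products only for sampled edges.

```python
import math
import torch

def masked_attention(Q, K, V, M):
    """Dense reference for Eqs. (3)-(4).
    Q, K, V: (n, d_h) tensors; M: (n, n) binary mask.
    Scores with M_ij = 0 are set to -inf, so their softmax weights are exactly zero."""
    d_h = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_h)       # (n, n)
    scores = scores.masked_fill(M == 0, float("-inf"))      # equivalent to the masked softmax sigma_M
    weights = torch.softmax(scores, dim=-1)
    # Guard rows whose mask is entirely zero (softmax over all -inf yields NaN).
    weights = torch.nan_to_num(weights, nan=0.0)
    return weights @ V

# Example: a sliding-window (local attention) mask, M_ij = 1 iff |i - j| < c.
n, c = 128, 8
idx = torch.arange(n)
M = (idx[:, None] - idx[None, :]).abs() < c
```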
4 Our Method: SBM-Transformer
Here we discuss the details of SBM-Transformer (Figure 2). We first illustrate the forward step of our attention module and how the underlying SBM [2] of each head, from which we sample our attention masks, is parameterized by the input tensors. We then discuss how the model enables end-to-end differentiability despite the discrete graph-sampling step.
4.1 Forward step with the Stochastic Block Model
In our framework, we view the attention mask $M$ as the adjacency matrix of a bipartite graph that connects queries to keys, and let each attention head sample an adjacency matrix that best represents the contextual dependencies amongst input tokens. In order to efficiently sample adjacency matrices while avoiding the quadratic cost, the distribution of graphs must first be parameterized with a sub-quadratic number of latent variables. Stochastic Block Models fit our purpose perfectly, as they model graphs that are low-rank structured with $k$ latent clusters, allowing full parameterization using $O(nk)$ memory. More concretely, the SBM distribution is defined by two nonnegative node-to-cluster memberships $Y, Z \in \mathbb{R}^{n \times k}_{+}$ and a so-called block matrix $B \in \mathbb{R}^{k \times k}_{+}$ that stores the inter-cluster connection probabilities. The probability of node $i$ being connected to node $j$ is computed as $p(i, j) = Y_i B Z_j^T$. Equivalently, the expectation of an adjacency matrix sampled from $A \sim \mathrm{SBM}(Y, B, Z)$ can be written as $\mathbb{E}[A] = YBZ^T$.
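As a small illustration of this generative view (not the sampler used by the model, which avoids forming the full probability matrix), one can draw a bipartite adjacency by sampling each edge independently from a Bernoulli distribution with mean $\mathbb{E}[A] = YBZ^T$:

```python
import torch

def sample_sbm_adjacency(Y, B, Z):
    """Naive SBM sampler for illustration only.
    Y: (n, k) and Z: (m, k) nonnegative memberships, B: (k, k) block matrix.
    Edge (i, j) is drawn as Bernoulli(Y_i B Z_j^T); the clamp guards against products above 1.
    This forms the full n x m probability matrix, i.e. quadratic cost, which SBM-Transformer avoids."""
    probs = (Y @ B @ Z.transpose(-2, -1)).clamp(0.0, 1.0)   # E[A] = Y B Z^T
    return torch.bernoulli(probs)                           # binary adjacency A

# Usage (hypothetical shapes): Y, Z = torch.rand(6, 3), torch.rand(8, 3); B = torch.rand(3, 3)
# A = sample_sbm_adjacency(Y, B, Z)   # (6, 8) bipartite adjacency
```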
For proper parameterization of the SBM, we must infer the nonnegative node memberships and block matrix from the queries and keys. To do so, we equip each attention head with a 2-layer $\mathrm{MLP}_{d_h \to d_h}$ with ReLU activation and a set of $k$ trainable cluster embeddings $C \in \mathbb{R}^{k \times d_h}$. First, our model computes the block matrix $\hat{S} \in \mathbb{R}^{k \times k}_{+}$ by taking dot products amongst the cluster embeddings $C$, followed by a 2-dimensional softmax activation. The node embeddings are obtained by processing each query and key through the $\mathrm{MLP}_{d_h \to d_h}$, mapping token representations into the node representation space. The memberships of query and key nodes, which we denote by $\hat{Q}$ and $\hat{K}$, are then inferred by taking dot products of node and cluster embeddings, followed by a sigmoid function. The block matrix $\hat{S}$, query node-memberships $\hat{Q}$, and key node-memberships $\hat{K}$ altogether provide a well-defined parameterization of the SBM. Thus, a bipartite graph adjacency $M \in \{0,1\}^{n \times m}$ can be sampled from $M \sim \mathrm{SBM}(\hat{Q}, \hat{S}, \hat{K})$ with expectation $\mathbb{E}[M] = \hat{Q}\hat{S}\hat{K}^T$: the probability of connecting query $Q_i$ to key $K_j$ equals $p(i, j) = \hat{Q}_i \hat{S} \hat{K}_j^T$.
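A minimal sketch of the parameterization just described, assuming PyTorch: the module and layer names are hypothetical, and the dense Bernoulli sampling at the end is for illustration only, since the actual model samples edges without materializing the full expectation matrix. The straight-through trick on the last line shows one common way to let gradients flow through the discrete sample.

```python
import torch
import torch.nn as nn

class SBMMaskSampler(nn.Module):
    """Infers SBM parameters (Q_hat, S_hat, K_hat) from queries/keys, then samples a mask M."""

    def __init__(self, d_h: int, k: int):
        super().__init__()
        # Shared 2-layer MLP_{d_h -> d_h} mapping tokens into the node representation space.
        self.mlp = nn.Sequential(nn.Linear(d_h, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
        self.C = nn.Parameter(torch.randn(k, d_h))  # k trainable cluster embeddings

    def forward(self, Q, K):
        k = self.C.shape[0]
        # Block matrix S_hat in R+^{k x k}: cluster-cluster dot products + softmax over all k^2 entries.
        S_hat = torch.softmax((self.C @ self.C.T).flatten(), dim=0).view(k, k)
        # Node memberships: sigmoid of node-cluster dot products.
        Q_hat = torch.sigmoid(self.mlp(Q) @ self.C.T)   # (n, k)
        K_hat = torch.sigmoid(self.mlp(K) @ self.C.T)   # (m, k)
        # Expectation E[M] = Q_hat S_hat K_hat^T; dense Bernoulli sampling shown for clarity only.
        probs = (Q_hat @ S_hat @ K_hat.T).clamp(0.0, 1.0)   # (n, m), each entry in [0, 1]
        M = torch.bernoulli(probs)
        # Straight-through estimator: forward pass sees the discrete M,
        # backward pass propagates gradients through the edge probabilities.
        return M + probs - probs.detach()
```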