Transformers meet Stochastic Block Models:
Attention with Data-Adaptive Sparsity and Cost
Sungjun Cho^1, Seonwoo Min^1, Jinwoo Kim^2,
Moontae Lee^{1,3}, Honglak Lee^1, Seunghoon Hong^{2,1}
^1 LG AI Research   ^2 KAIST   ^3 University of Illinois Chicago
Abstract
To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention with hand-crafted patterns and 2) full attention followed by a sparse variant of softmax such as $\alpha$-entmax. Unfortunately, the first group lacks adaptability to data while the second still requires quadratic cost in training. In this work, we propose SBM-Transformer, a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model (SBM). Then, each attention head data-adaptively samples a bipartite graph, the adjacency matrix of which is used as an attention mask for each input. During backpropagation, a straight-through estimator is used to flow gradients beyond the discrete sampling step and adjust the probabilities of sampled edges based on the predictive loss. The forward and backward costs are thus linear in the number of edges, which each attention head can also choose flexibly based on the input. By assessing the distribution of graphs, we theoretically show that SBM-Transformer is a universal approximator for arbitrary sequence-to-sequence functions in expectation. Empirical evaluations under the LRA and GLUE benchmarks demonstrate that our model outperforms previous efficient variants as well as the original Transformer with full attention. Our implementation can be found at https://github.com/sc782/SBM-Transformer.
1 Introduction
The Transformer [40] architecture has been the go-to method for encoding sequential data, due to its superior performance in various tasks such as machine translation [30], image classification [15], and protein language modeling [34]. Its key strength stems from the multi-head attention module, where a so-called attention score matrix computes how contextually important one token is to another for all possible token pairs. Each Transformer layer simultaneously pools the token representations based on the attention scores, eventually returning contextualized features without sequentially traversing the input sequence as its recurrent neural network-based predecessors [18] do.
A well-known drawback of the original Transformer is its high computational cost in time and memory, which increases quadratically with sequence length. This is due to the full pairwise computation of attention scores, which prohibits applying it to tasks involving long-range dependencies such as document summarization [19] or high-resolution image processing [53]. Many works have thus focused on developing more efficient alternatives by exploiting fixed or learnable attention sparsity patterns [9, 51, 22, 13], low-rank approximations [45, 48], or kernelized attention modules [21, 10].
Even though these efficient alternatives hold theoretical expressibility guarantees [50], they are far from sufficient, still failing to convince practitioners to replace the original Transformer. We believe this is mostly due to their lack of adaptability: they apply the same modifications to uniformly sparsify all attention modules across layers, without considering the task at hand.
Figure 1: The attention module in SBM-Transformer. In multi-head attention, each attention head samples a bipartite graph connecting queries to keys from an underlying SBM. The adjacency matrix of the sampled graph is used as an attention mask to compute the dot products only for the sampled edges.
Such a strategy imposes inductive bias too strongly and often leads to sub-optimal cost vs. performance trade-offs in downstream tasks [29]. In this work, we argue that to retain the full potential of Transformers,
each attention module should have the ability to flexibly choose between sparse and full attention.
This is especially evident when considering that many state-of-the-art systems suggest the need for a mixture of dense and sparse attention layers. For example, a qualitative analysis of pretrained BERT showed that lower layers exhibit broad, dense attention while upper layers perform focused, sparse attention [11]. In the case of GPT-3 [7], the Transformer blocks are manually arranged to alternate between dense and sparse attention.
To contribute to the efficient Transformers lineage, we propose SBM-Transformer, which is capable of adjusting its attention sparsity data-adaptively without fully computing the attention score matrix (Figure 1). Leveraging a mixed-membership Stochastic Block Model (SBM) [2], each attention head samples a bipartite graph connecting queries to keys. Then, the adjacency matrix of the sampled graph is used as an attention mask so that only attention scores corresponding to sampled edges are computed. The overall computational cost is linear in the number of edges, which can range from linear to quadratic in sequence length depending on the data and task under concern. Each attention head is equipped with its own underlying SBM, enabling the model to diversify the attention sparsity across heads and layers. By incorporating a straight-through estimator [4] in the discrete graph-sampling step, SBM-Transformer enjoys end-to-end differentiability and can find the proper attention sparsity based solely upon minimizing the predictive loss. The model can also easily be further regularized by penalizing the number of sampled edges, which results in a lighter model using less computational resources during inference. To the best of our knowledge, our method is the first Transformer architecture that can data-adaptively choose between linear and full attention, with the respective computational costs. To summarize, our main contributions are as follows:
• We present SBM-Transformer, a novel Transformer in which each attention head can adaptively adjust its attention sparsity, as well as its computational cost, based on the input data.
• To demonstrate the benefit of this flexibility, we theoretically prove that SBM-Transformer retains universal approximability, and also stress-test the model under a synthetic task where full attention is required to achieve 100% accuracy.
• Evaluations on the LRA and GLUE benchmarks show that SBM-Transformer outperforms previous efficient Transformer models as well as the vanilla Transformer with dense attention.
2 Related Work
In this section we discuss previous efficient Transformer variants and several works similar to ours
with respect to adaptively learning sparse attention patterns. We also review several works on SBMs.
Efficient Transformers. Many efficient Transformers aim to reduce the quadratic cost of multi-head attention with different approaches. While we discuss only a handful of representative approaches, a much more comprehensive survey can be found in [39]. The Linear Transformer [21] achieves linear complexity by replacing the softmax with a low-rank kernelized function. Linformer [45] and Nyströmformer [48] use a similar approach by low-rank approximating the attention score matrix. Performer [10] uses positive orthogonal random features to approximate the softmax kernel. Reformer [22] gathers similar tokens together through locality-sensitive hashing (LSH) and performs attention amongst tokens within the same bucket. Of all the methods above, our method is
most similar to Reformer, in the sense that we adaptively assign queries and keys into clusters and form a low-rank sparse attention pattern. However, our method performs soft clustering with far fewer structural constraints, allowing each attention head to represent a wider variety of dependency structures and to adjust its sparsity towards full attention if needed.
Adaptive Sparsity. With respect to flexible training between sparse and dense attention, there exist some works that parameterize how sparse the attention pattern should be based on the input. The Adaptive Sparse Transformer [12] proposed replacing the usual softmax activation with $\alpha$-entmax, in which the $\alpha$ parameter can be differentiably trained to adjust the activation between softmax and sparsemax [27]. SparseBERT [36] uses a differentiable masking technique where each attention mask is sampled from a Gumbel-sigmoid distribution using data-independent mask probability parameters. While these methods possess the flexibility to adjust between sparse and full attention based on data, they still require full computation of the attention score matrix before sparsification, and hence are unable to leverage the learned sparsity towards better model efficiency. To the best of our knowledge, ours is the first work able to adaptively tune its attention sparsity from sparse to full attention without requiring explicit computation of the attention score matrix, thereby avoiding quadratic cost when possible.
Stochastic Block Models. The Stochastic Block Model (SBM) is a generative model that encodes the latent structure of graphs by grouping nodes into clusters. By modeling the cluster membership of each node as well as inter-cluster relationships, SBMs can represent a wide variety of graph structures, a feature especially useful for generating new graphs or predicting missing edges in noisy data [1]. The standard SBM assigns each node to a single cluster, and the probability of an edge between two nodes strictly depends on the corresponding clusters. Several structural extensions include the overlapping SBM [24] and the mixed-membership SBM [2], which allow each node to be assigned to multiple clusters. The underlying SBM used by our framework mostly resembles these two variants, while the edge probability is modeled by a nonlinear function of two node embeddings rather than a bilinear one. There exist many other extensions, including the degree-corrected SBM [20] for multigraphs and the hierarchical SBM [31] for multiplex graphs. Further details can be found in a recent survey [16].
3 Preliminaries: Sparse Transformers
We first introduce the full attention mechanism used in the original Transformer [40] as well as masked attention, which will serve as the backbone of our approach.
3.1 Full Attention
In the vanilla Transformer [40], each attention head takes a sequence of token features $X \in \mathbb{R}^{n \times d}$ as input, where $n$ is the sequence length and $d$ the embedding dimension. Weight parameters $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_h}$ with head dimension $d_h$ first map the input features $X$ into query $Q$, key $K$, and value $V$, respectively. Then, the attention score matrix is computed as the scaled dot-product of queries and keys followed by a row-wise softmax activation $\sigma(\cdot)$. Note that explicit computation of this matrix is the main bottleneck of full attention, incurring $O(n^2)$ asymptotic cost in both time and memory. The value features $V$ are then pooled based on the attention scores, returning the output token representations. Altogether, the operation performed by each attention head can be written as

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V \tag{1}$$

$$\mathrm{Attn}(X) = \sigma\!\left(\frac{QK^T}{\sqrt{d_h}}\right)V. \tag{2}$$
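For reference, and to contrast with the masked variant below, the following is a minimal sketch of a single full-attention head implementing Eqs. (1)-(2). It assumes PyTorch; the class and parameter names are illustrative, and multi-head concatenation, dropout, and the output projection are omitted.

```python
import math
import torch
import torch.nn as nn

class FullAttentionHead(nn.Module):
    """Single attention head: Attn(X) = softmax(QK^T / sqrt(d_h)) V, as in Eqs. (1)-(2)."""

    def __init__(self, d: int, d_h: int):
        super().__init__()
        # W^Q, W^K, W^V in R^{d x d_h}
        self.W_q = nn.Linear(d, d_h, bias=False)
        self.W_k = nn.Linear(d, d_h, bias=False)
        self.W_v = nn.Linear(d, d_h, bias=False)
        self.d_h = d_h

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (n, d) token features
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)          # each (n, d_h)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_h)   # (n, n) -- the O(n^2) bottleneck
        return torch.softmax(scores, dim=-1) @ V                 # (n, d_h)

# Usage: head = FullAttentionHead(d=64, d_h=16); out = head(torch.randn(128, 64))
```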
3.2 Masked Attention
One way to remove the quadratic bottleneck from the attention score matrix is to apply a binary mask $M \in \{0,1\}^{n \times n}$ and compute the scaled dot-products $Q_i K_j^T / \sqrt{d_h}$ only if $M_{ij} = 1$. In the presence of an attention mask, the operation is modified to

$$\mathrm{Attn}_{\mathrm{mask}}(X, M) = \sigma_M\!\left(M \odot \frac{QK^T}{\sqrt{d_h}}\right)V \tag{3}$$

$$\sigma_M(A)_{ij} := \begin{cases} \dfrac{\exp(A_{ij})}{\sum_{k \in \{k' \mid M_{ik'}=1\}} \exp(A_{ik})} & \text{if } M_{ij} = 1 \\[4pt] 0 & \text{otherwise} \end{cases} \tag{4}$$

where $\odot$ denotes entry-wise multiplication.
Figure 2: An illustration of the attention mechanism in SBM-Transformer. Each head first maps
queries and keys to the node representation space through a shared MLP. The graph sampling module
samples an attention mask from a Stochastic Block Model (SBM) parameterized by the node and
cluster embeddings. The discrete sampling step is differentiable via a Straight-Through Estimator
(STE). Given the mask, the output is computed via masked attention.
Note that the masked-softmax operator $\sigma_M(\cdot)$ only computes unmasked terms, ensuring that the $(i, j)$-th attention score survives as nonzero if and only if $M_{ij} = 1$. This is thus equivalent to filling in the $(i, j)$-th attention score with $-\infty$ if $M_{ij} = 0$, then applying the standard softmax operator. Most sparsity-based efficient Transformers fall under this formulation, while using different methods to either manually fix or learn the mask $M$. For instance, local attention [9, 3, 51] with a sliding window sets $M_{ij} = 1$ if $|i - j| < c$ for some context window size $c$, while Reformer [22] sets $M_{ij} = 1$ if $Q_i$ and $K_j$ are hashed into the same bucket.
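To make the equivalence above concrete, the sketch below (assuming PyTorch; the function name and the sliding-window example are ours, not the paper's) applies Eq. (4) by filling masked-out scores with $-\infty$ before a standard softmax. Note that this dense reference still materializes the full $n \times n$ score matrix, which is exactly the cost SBM-Transformer avoids by computing dot products only for sampled edges.

```python
import math
import torch

def masked_attention(Q, K, V, M):
    """Dense reference for Eqs. (3)-(4).
    Q, K, V: (n, d_h) tensors; M: (n, n) binary mask.
    Scores with M_ij = 0 are set to -inf, so their softmax weights are exactly zero."""
    d_h = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_h)       # (n, n)
    scores = scores.masked_fill(M == 0, float("-inf"))      # equivalent to the masked softmax sigma_M
    weights = torch.softmax(scores, dim=-1)
    # Guard rows whose mask is entirely zero (softmax over all -inf yields NaN).
    weights = torch.nan_to_num(weights, nan=0.0)
    return weights @ V

# Example: a sliding-window (local attention) mask, M_ij = 1 iff |i - j| < c.
n, c = 128, 8
idx = torch.arange(n)
M = (idx[:, None] - idx[None, :]).abs() < c
```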
4 Our Method: SBM-Transformer
Here we discuss the details of SBM-Transformer (Figure 2). We first illustrate the forward step of our attention module and how the underlying SBM [2] of each head, from which we sample our attention masks, is parameterized by the input tensors. We then discuss how the model enables end-to-end differentiability despite the discrete graph-sampling step.
4.1 Forward step with the Stochastic Block Model
In our framework, we view the attention mask $M$ as the adjacency matrix of a bipartite graph that connects queries to keys, and let each attention head sample an adjacency matrix that best represents the contextual dependencies amongst input tokens. In order to efficiently sample adjacency matrices while avoiding the quadratic cost, the distribution of graphs must first be parameterized with a sub-quadratic number of latent variables. Stochastic Block Models fit our purpose perfectly, as they model graphs that are low-rank structured with $k$ latent clusters, allowing full parameterization using $O(nk)$ memory. More concretely, the SBM distribution is defined by two nonnegative node-to-cluster memberships $Y, Z \in \mathbb{R}^{n \times k}_{+}$ and a so-called block matrix $B \in \mathbb{R}^{k \times k}_{+}$ that stores the inter-cluster connection probabilities. The probability of node $i$ being connected to node $j$ is computed as $p(i, j) = Y_i B Z_j^T$. Equivalently, the expectation of an adjacency matrix sampled from $A \sim \mathrm{SBM}(Y, B, Z)$ can be written as $\mathbb{E}[A] = YBZ^T$.
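As a small illustration of this generative view (not the sampler used by the model, which avoids forming the full probability matrix), one can draw a bipartite adjacency by sampling each edge independently from a Bernoulli distribution with mean $\mathbb{E}[A] = YBZ^T$:

```python
import torch

def sample_sbm_adjacency(Y, B, Z):
    """Naive SBM sampler for illustration only.
    Y: (n, k) and Z: (m, k) nonnegative memberships, B: (k, k) block matrix.
    Edge (i, j) is drawn as Bernoulli(Y_i B Z_j^T); the clamp guards against products above 1.
    This forms the full n x m probability matrix, i.e. quadratic cost, which SBM-Transformer avoids."""
    probs = (Y @ B @ Z.transpose(-2, -1)).clamp(0.0, 1.0)   # E[A] = Y B Z^T
    return torch.bernoulli(probs)                           # binary adjacency A

# Usage (hypothetical shapes): Y, Z = torch.rand(6, 3), torch.rand(8, 3); B = torch.rand(3, 3)
# A = sample_sbm_adjacency(Y, B, Z)   # (6, 8) bipartite adjacency
```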
For proper parameterization of the SBM, we must infer the nonnegative node memberships and block matrix from the queries and keys. To do so, we equip each attention head with a 2-layer $\mathrm{MLP}_{d_h \to d_h}$ with ReLU activation and a set of $k$ trainable cluster embeddings $C \in \mathbb{R}^{k \times d_h}$. First, our model computes the block matrix $\hat{S} \in \mathbb{R}^{k \times k}_{+}$ by taking dot products amongst the cluster embeddings $C$, followed by a 2-dimensional softmax activation. The node embeddings are obtained by processing each query and key through the $\mathrm{MLP}_{d_h \to d_h}$, mapping token representations into the node representation space. The memberships of query and key nodes, which we denote by $\hat{Q}$ and $\hat{K}$, are then inferred by taking dot products of node and cluster embeddings, followed by a sigmoid function. The block matrix $\hat{S}$, query node-memberships $\hat{Q}$, and key node-memberships $\hat{K}$ altogether provide a well-defined parameterization of the SBM. Thus, a bipartite graph adjacency $M \in \{0,1\}^{n \times m}$ can be sampled from $M \sim \mathrm{SBM}(\hat{Q}, \hat{S}, \hat{K})$ with expectation $\mathbb{E}[M] = \hat{Q}\hat{S}\hat{K}^T$: the probability of connecting query $Q_i$ to key $K_j$ equals $p(i, j) = \hat{Q}_i \hat{S} \hat{K}_j^T$.
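A minimal sketch of the parameterization just described, assuming PyTorch: the module and layer names are hypothetical, and the dense Bernoulli sampling at the end is for illustration only, since the actual model samples edges without materializing the full expectation matrix. The straight-through trick on the last line shows one common way to let gradients flow through the discrete sample.

```python
import torch
import torch.nn as nn

class SBMMaskSampler(nn.Module):
    """Infers SBM parameters (Q_hat, S_hat, K_hat) from queries/keys, then samples a mask M."""

    def __init__(self, d_h: int, k: int):
        super().__init__()
        # Shared 2-layer MLP_{d_h -> d_h} mapping tokens into the node representation space.
        self.mlp = nn.Sequential(nn.Linear(d_h, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
        self.C = nn.Parameter(torch.randn(k, d_h))  # k trainable cluster embeddings

    def forward(self, Q, K):
        k = self.C.shape[0]
        # Block matrix S_hat in R+^{k x k}: cluster-cluster dot products + softmax over all k^2 entries.
        S_hat = torch.softmax((self.C @ self.C.T).flatten(), dim=0).view(k, k)
        # Node memberships: sigmoid of node-cluster dot products.
        Q_hat = torch.sigmoid(self.mlp(Q) @ self.C.T)   # (n, k)
        K_hat = torch.sigmoid(self.mlp(K) @ self.C.T)   # (m, k)
        # Expectation E[M] = Q_hat S_hat K_hat^T; dense Bernoulli sampling shown for clarity only.
        probs = (Q_hat @ S_hat @ K_hat.T).clamp(0.0, 1.0)   # (n, m), each entry in [0, 1]
        M = torch.bernoulli(probs)
        # Straight-through estimator: forward pass sees the discrete M,
        # backward pass propagates gradients through the edge probabilities.
        return M + probs - probs.detach()
```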