Once is Enough: A Lightweight Cross-Attention for Fast Sentence Pair Modeling
Yuanhang Yang1, Shiyi Qi1, Chuanyi Liu1, Qifan Wang2, Cuiyun Gao1, Zenglin Xu1
1Harbin Institute of Technology, Shenzhen, China
2Meta AI, CA, USA
{ysngkil, syqi12138}@gmail.com, liuchuanyi@hit.edu.cn, wqfcr@fb.com, {gaocuiyun, xuzenglin}@hit.edu.cn
Abstract

Transformer-based models have achieved great success on sentence pair modeling tasks, such as answer selection and natural language inference (NLI). These models generally perform cross-attention over input pairs, leading to prohibitive computational costs. Recent studies propose dual-encoder and late interaction architectures for faster computation. However, the balance between the expressiveness of cross-attention and the computation speedup still needs better coordination. To this end, this paper introduces MixEncoder, a novel paradigm for efficient sentence pair modeling. MixEncoder involves a lightweight cross-attention mechanism. It avoids repeatedly encoding the same query for different candidates, thus allowing the query-candidate interactions to be modeled in parallel. Extensive experiments conducted on four tasks demonstrate that MixEncoder can speed up sentence pair modeling by over 113x while achieving performance comparable to the more expensive cross-attention models. The source code is available at https://github.com/ysngki/MixEncoder.
1 Introduction

Sentence pair modeling, such as natural language inference, question answering, and information retrieval, is an essential task in natural language processing (Nogueira and Cho, 2020; Qu et al., 2021; Zhao et al., 2021). These tasks can be depicted as a procedure of scoring candidates given a query. Recently, Transformer-based models (Vaswani et al., 2017; Devlin et al., 2019) have shown promising performance on sentence pair modeling tasks due to the expressiveness of the pre-trained cross-encoder. As shown in Figure 1(a), the cross-encoder takes a query-candidate pair as input and models the interaction between them at each layer via an input-wide self-attention mechanism. Despite this representational power, the cross-encoder incurs exhaustive computation costs, especially when the number of candidates is large (e.g., the interaction is calculated $N$ times for $N$ candidates). This computation cost therefore restricts the use of cross-encoder models in many real-world applications (Chen et al., 2020).
To tackle this issue, we propose a lightweight cross-attention mechanism, called MixEncoder, that speeds up inference while maintaining the expressiveness of cross-attention. Specifically, MixEncoder accelerates cross-attention by performing attention only from the candidates to the query, involving few tokens and only a few layers. This lightweight cross-attention avoids repetitive query encoding, supports processing multiple candidates in parallel, and thus reduces computation costs. Additionally, MixEncoder allows the candidates to be pre-computed into several dense context embeddings and stored offline to further accelerate inference.

We evaluate MixEncoder for sentence pair modeling on four benchmark datasets covering natural language inference, dialogue, and information retrieval. The results demonstrate that MixEncoder strikes a better balance between effectiveness and efficiency: for example, it achieves a substantial speedup of more than 113x over the cross-encoder while providing competitive performance.
2 Background

Extensive studies, including dual-encoders (Reimers and Gurevych, 2019) and late interaction models (MacAvaney et al., 2020; Gao et al., 2020; Chen et al., 2020; Khattab and Zaharia, 2020), have been proposed to accelerate Transformer inference on sentence pair modeling tasks.
As shown in Figure 1, dual-encoders process the query and the candidates separately, which allows the candidates to be pre-computed and thus yields fast online inference. However, this speedup is built upon sacrificing the expressiveness of cross-attention (Luan et al., 2021; Hu et al., 2021; Zhang et al., 2021). Alternatively, late-interaction models adjust dual-encoders by appending an interaction component, such as a stack of Transformer layers (Cao et al., 2020; Nie et al., 2020), for modeling the interaction between the query and the cached candidates. These approaches still suffer from the high cost of the interaction component (Chen et al., 2020).

Figure 1: Illustration of three popular sentence pair approaches: (a) the cross-encoder, executed $N$ times; (b) the dual-encoder and (c) late interaction, each executed once over cached candidate embeddings. $N$ denotes the number of candidates and $s$ denotes the relevance score of a query-candidate pair; the cache stores the pre-computed embeddings.

Figure 2: Overview of the proposed MixEncoder: (a) candidate pre-computation; (b) online encoding and classification/ranking.
3 Method

In this section, we introduce the details of the proposed MixEncoder, which simplifies cross-attention by enabling pre-computation, reducing the number of query encodings, and reducing the number of involved tokens and layers.
3.1 Candidate Pre-computation

Given a candidate that is a sequence of tokens $T_i = [t_1, \cdots, t_l]$, we experiment with two strategies to encode these tokens into $k$ context embeddings in advance, where $k \ll l$: (1) prepending $k$ special tokens $\{S_i\}_{i=1}^{k}$ to $T_i$ before feeding $T_i$ into the Transformer encoder (Vaswani et al., 2017; Devlin et al., 2019), and using the outputs at these special tokens as context embeddings ($S$-strategy); (2) maintaining $k$ context codes (Humeau et al., 2020) that extract global features from the output of the encoder via an attention mechanism ($C$-strategy). The default configuration is the $S$-strategy as it provides slightly better performance. The pre-computed context embeddings $E \in \mathbb{R}^{N \times k \times d}$ are cached for online inference, where $N$ is the number of candidates.
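For illustration, below is a minimal sketch of the $S$-strategy in PyTorch with Hugging Face Transformers. This is our reconstruction rather than the authors' released implementation; the backbone name, $k = 2$, and the batching details are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # illustrative choice of backbone
K = 2                             # number of context embeddings per candidate

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

# Register k special tokens S_1..S_k and resize the embedding matrix accordingly.
special_tokens = [f"[S{i}]" for i in range(1, K + 1)]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
encoder.resize_token_embeddings(len(tokenizer))

@torch.no_grad()
def precompute_candidates(texts):
    """Return context embeddings E of shape (N, k, d) for offline caching."""
    # Prepend the k special tokens to every candidate sequence T_i.
    prefixed = [" ".join(special_tokens) + " " + t for t in texts]
    batch = tokenizer(prefixed, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state   # (N, seq_len, d)
    # Positions 1..k hold the special tokens ([CLS] occupies position 0),
    # so their output states serve as the k context embeddings.
    return hidden[:, 1 : K + 1, :]                # (N, k, d)

E = precompute_candidates(["first candidate text", "second candidate text"])
print(E.shape)  # torch.Size([2, 2, 768])
```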
3.1.1 Query Encoding

Since the cross-encoder encodes the query $N$ times, which contributes to the inefficiency, a straightforward way to accelerate inference is to reduce the number of query encodings. Here we encode the query without taking its candidates into account, so the query needs to be encoded only once.

To preserve the expressiveness of cross-attention, a simplified cross-attention is performed at several interaction layers. As shown in Figure 2, the context embeddings $E^{j-1}$ of the candidates are allowed to attend over the intermediate token embeddings of the query, thus obtaining context-aware representations $E^j$ and $H^j$ for the candidates and the query, respectively.

Concretely, at each interaction layer, the key and value matrices of the query, denoted $\hat{K}$ and $\hat{V}$, are utilized by the candidates in two ways. (1) Producing contextualized representations for the candidates:

$$E^j = \mathrm{Attn}\big(Q, [K; \hat{K}], [V; \hat{V}]\big), \qquad (1)$$

where $Q$, $K$, $V$ are derived from $E^{j-1}$ with a linear transformation. $E^j$ is supposed to contain semantics from both the query and the candidates. (2) Compressing the semantics of the query into a vector for each candidate:

$$H^j = \mathrm{Gate}\big(\mathrm{Attn}(\bar{Q}, \hat{K}, \hat{V}),\, H^{j-1}\big), \qquad (2)$$

where $\bar{Q} \in \mathbb{R}^{N \times d}$ is derived from $E^{j-1}$ by a pooling operation, $H^j \in \mathbb{R}^{N \times d}$ stands for the candidate-aware query states, and $H^0$ is initialized as a zero matrix.
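To make Equations (1) and (2) concrete, the following is a minimal single-layer sketch in PyTorch. It is a simplified reconstruction under stated assumptions: a single attention head, mean pooling for $\bar{Q}$, and a sigmoid gate for $\mathrm{Gate}$, none of which are fully specified in the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionLayer(nn.Module):
    """Sketch of one MixEncoder interaction layer (Eqs. 1-2).

    Assumptions: a single attention head, mean pooling for Q_bar,
    and a sigmoid gate; the paper leaves these details open here.
    """

    def __init__(self, d: int):
        super().__init__()
        self.q_proj = nn.Linear(d, d)    # Q from E^{j-1}
        self.k_proj = nn.Linear(d, d)    # K from E^{j-1}
        self.v_proj = nn.Linear(d, d)    # V from E^{j-1}
        self.gate = nn.Linear(2 * d, d)  # gate over [summary; H^{j-1}]

    @staticmethod
    def attn(q, k, v):
        # Plain scaled dot-product attention.
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ v

    def forward(self, E_prev, H_prev, K_hat, V_hat):
        """E_prev: (N, k, d) candidate context embeddings; H_prev: (N, d)
        query states; K_hat, V_hat: (L, d) key/value matrices of the query."""
        N = E_prev.size(0)
        Q, K, V = self.q_proj(E_prev), self.k_proj(E_prev), self.v_proj(E_prev)
        # Eq. (1): each candidate attends to its own keys/values concatenated
        # with the (shared, broadcast) keys/values of the query.
        K_cat = torch.cat([K, K_hat.expand(N, -1, -1)], dim=1)  # (N, k+L, d)
        V_cat = torch.cat([V, V_hat.expand(N, -1, -1)], dim=1)
        E_new = self.attn(Q, K_cat, V_cat)                      # (N, k, d)
        # Eq. (2): pool each candidate into one vector (Q_bar), attend over the
        # query, and gate the result with the previous query state H^{j-1}.
        Q_bar = E_prev.mean(dim=1, keepdim=True)                # (N, 1, d)
        summary = self.attn(Q_bar, K_hat.expand(N, -1, -1),
                            V_hat.expand(N, -1, -1)).squeeze(1)  # (N, d)
        g = torch.sigmoid(self.gate(torch.cat([summary, H_prev], dim=-1)))
        H_new = g * summary + (1 - g) * H_prev                  # (N, d)
        return E_new, H_new
```

Since all $N$ candidates are batched along the first dimension, one pass through this layer models the query-candidate interaction for every candidate in parallel, which is the source of the speedup described above.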
3.2 Prediction

Let $H$ and $E$ denote the query states and the candidate context embeddings generated by the last interaction layer, respectively. For the $i$-th candidate, its representation is the mean of the $i$-th row of $E$, denoted as $e_i$. The representation of the query with respect to this candidate is the $i$-th row of $H$, denoted as $h_i$. The cosine similarity between $e_i$ and $h_i$ is used as the semantic similarity. Additionally, we can pass $e_i$ and $h_i$ to a classifier for classification tasks.
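Continuing the shapes used above ($E$: $(N, k, d)$, $H$: $(N, d)$), a small sketch of this scoring step follows; the classifier head is an illustrative assumption rather than a detail given in the text.

```python
import torch
import torch.nn.functional as F

def score_candidates(E: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Cosine similarity s_i between e_i (mean of E[i]) and h_i = H[i]."""
    e = E.mean(dim=1)                         # (N, d): mean over k context embeddings
    return F.cosine_similarity(e, H, dim=-1)  # (N,) relevance scores for ranking

# For classification tasks, e_i and h_i can instead feed a small head;
# this 3-way NLI-style head is a hypothetical example.
classifier = torch.nn.Linear(2 * 768, 3)

def classify(E: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    e = E.mean(dim=1)
    return classifier(torch.cat([e, H], dim=-1))  # (N, num_labels) logits
```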