Once is Enough: A Lightweight Cross-Attention for Fast Sentence Pair Modeling
Yuanhang Yang1, Shiyi Qi1, Chuanyi Liu1, Qifan Wang2, Cuiyun Gao1, Zenglin Xu1
1Harbin Institute of Technology, Shenzhen, China
2Meta AI, CA, USA
{ysngkil, syqi12138}@gmail.com, liuchuanyi@hit.edu.cn, wqfcr@fb.com, {gaocuiyun, xuzenglin}@hit.edu.cn
Abstract

Transformer-based models have achieved great success on sentence pair modeling tasks, such as answer selection and natural language inference (NLI). These models generally perform cross-attention over input pairs, leading to prohibitive computational costs. Recent studies propose dual-encoder and late interaction architectures for faster computation. However, the balance between the expressiveness of cross-attention and the computation speedup still needs better coordination. To this end, this paper introduces MixEncoder, a novel paradigm for efficient sentence pair modeling. MixEncoder involves a lightweight cross-attention mechanism. It avoids repeatedly encoding the same query for different candidates, thus allowing the query-candidate interactions to be modeled in parallel. Extensive experiments conducted on four tasks demonstrate that MixEncoder can speed up sentence pair modeling by over 113x while achieving performance comparable to the more expensive cross-attention models. The source code is available at https://github.com/ysngki/MixEncoder.
1 Introduction

Sentence pair modeling, such as natural language inference, question answering, and information retrieval, is an essential task in natural language processing (Nogueira and Cho, 2020; Qu et al., 2021; Zhao et al., 2021). These tasks can be depicted as a procedure of scoring candidates given a query. Recently, Transformer-based models (Vaswani et al., 2017; Devlin et al., 2019) have shown promising performance on sentence pair modeling tasks due to the expressiveness of the pre-trained cross-encoder. As shown in Figure 1(a), the cross-encoder takes a query-candidate pair as input and models the interaction between them at each layer via an input-wide self-attention mechanism. Despite this representational power, the cross-encoder incurs exhaustive computation costs, especially when the number of candidates is large (e.g., the interaction is calculated $N$ times for $N$ candidates). This computation cost therefore restricts the use of cross-encoder models in many real-world applications (Chen et al., 2020).
To tackle this issue, we propose a lightweight cross-attention mechanism, called MixEncoder, that speeds up inference while maintaining the expressiveness of cross-attention. Specifically, MixEncoder accelerates cross-attention by performing attention only from the candidates to the query, involving few tokens and only a few layers. This lightweight cross-attention avoids repetitive query encoding, supports processing multiple candidates in parallel, and thus reduces computation costs. Additionally, MixEncoder allows the candidates to be pre-computed into several dense context embeddings and stored offline to further accelerate inference.

We evaluate MixEncoder for sentence pair modeling on four benchmark datasets covering natural language inference, dialogue, and information retrieval. The results demonstrate that MixEncoder strikes a better balance between effectiveness and efficiency: for example, it achieves a substantial speedup of more than 113x over the cross-encoder while providing competitive performance.
2 Background

Extensive studies, including dual-encoders (Reimers and Gurevych, 2019) and late interaction models (MacAvaney et al., 2020; Gao et al., 2020; Chen et al., 2020; Khattab and Zaharia, 2020), have been proposed to accelerate Transformer inference on sentence pair modeling tasks.
As shown in Figure 1, dual-encoders process the query and the candidates separately, which allows the candidates to be pre-computed and thus yields fast online inference. However, this speedup is built upon sacrificing the expressiveness of cross-attention (Luan et al., 2021; Hu et al., 2021; Zhang et al., 2021). Alternatively, late-interaction models adjust dual-encoders by appending an interaction component, such as a stack of Transformer layers (Cao et al., 2020; Nie et al., 2020), for modeling the interaction between the query and the cached candidates. These approaches still suffer from the high cost of the interaction component (Chen et al., 2020).

Figure 1: Illustration of three popular sentence pair approaches: (a) the cross-encoder, executed $N$ times; (b) the dual-encoder and (c) late interaction, each executed once over cached candidate embeddings. $N$ denotes the number of candidates and $s$ denotes the relevance score of a query-candidate pair; the cache stores the pre-computed embeddings.

Figure 2: Overview of the proposed MixEncoder: (a) candidate pre-computation; (b) online encoding and classification/ranking.
3 Method

In this section, we introduce the details of the proposed MixEncoder, which simplifies cross-attention by enabling pre-computation, reducing the number of query encodings, and reducing the number of involved tokens and layers.
3.1 Candidate Pre-computation

Given a candidate that is a sequence of tokens $T_i = [t_1, \cdots, t_l]$, we experiment with two strategies to encode these tokens into $k$ context embeddings in advance, where $k \ll l$: (1) prepending $k$ special tokens $\{S_i\}_{i=1}^{k}$ to $T_i$ before feeding $T_i$ into the Transformer encoder (Vaswani et al., 2017; Devlin et al., 2019), and using the outputs at these special tokens as context embeddings ($S$-strategy); (2) maintaining $k$ context codes (Humeau et al., 2020) that extract global features from the output of the encoder via an attention mechanism ($C$-strategy). The default configuration is the $S$-strategy as it provides slightly better performance. The pre-computed context embeddings $E \in \mathbb{R}^{N \times k \times d}$ are cached for online inference, where $N$ is the number of candidates.
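For illustration, below is a minimal sketch of the $S$-strategy in PyTorch with Hugging Face Transformers. This is our reconstruction rather than the authors' released implementation; the backbone name, $k = 2$, and the batching details are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # illustrative choice of backbone
K = 2                             # number of context embeddings per candidate

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

# Register k special tokens S_1..S_k and resize the embedding matrix accordingly.
special_tokens = [f"[S{i}]" for i in range(1, K + 1)]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
encoder.resize_token_embeddings(len(tokenizer))

@torch.no_grad()
def precompute_candidates(texts):
    """Return context embeddings E of shape (N, k, d) for offline caching."""
    # Prepend the k special tokens to every candidate sequence T_i.
    prefixed = [" ".join(special_tokens) + " " + t for t in texts]
    batch = tokenizer(prefixed, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state   # (N, seq_len, d)
    # Positions 1..k hold the special tokens ([CLS] occupies position 0),
    # so their output states serve as the k context embeddings.
    return hidden[:, 1 : K + 1, :]                # (N, k, d)

E = precompute_candidates(["first candidate text", "second candidate text"])
print(E.shape)  # torch.Size([2, 2, 768])
```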
3.1.1 Query Encoding

Since the cross-encoder encodes the query $N$ times, which contributes to the inefficiency, a straightforward way to accelerate inference is to reduce the number of query encodings. Here we encode the query without taking its candidates into account, so the query needs to be encoded only once.

To preserve the expressiveness of cross-attention, a simplified cross-attention is performed at several interaction layers. As shown in Figure 2, the context embeddings $E^{j-1}$ of the candidates are allowed to attend over the intermediate token embeddings of the query, thus obtaining context-aware representations $E^j$ and $H^j$ for the candidates and the query, respectively.

Concretely, at each interaction layer, the key and value matrices of the query, denoted $\hat{K}$ and $\hat{V}$, are utilized by the candidates in two ways. (1) Producing contextualized representations for the candidates:

$$E^j = \mathrm{Attn}\big(Q, [K; \hat{K}], [V; \hat{V}]\big), \qquad (1)$$

where $Q$, $K$, $V$ are derived from $E^{j-1}$ with a linear transformation. $E^j$ is supposed to contain semantics from both the query and the candidates. (2) Compressing the semantics of the query into a vector for each candidate:

$$H^j = \mathrm{Gate}\big(\mathrm{Attn}(\bar{Q}, \hat{K}, \hat{V}),\, H^{j-1}\big), \qquad (2)$$

where $\bar{Q} \in \mathbb{R}^{N \times d}$ is derived from $E^{j-1}$ by a pooling operation, $H^j \in \mathbb{R}^{N \times d}$ stands for the candidate-aware query states, and $H^0$ is initialized as a zero matrix.
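To make Equations (1) and (2) concrete, the following is a minimal single-layer sketch in PyTorch. It is a simplified reconstruction under stated assumptions: a single attention head, mean pooling for $\bar{Q}$, and a sigmoid gate for $\mathrm{Gate}$, none of which are fully specified in the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionLayer(nn.Module):
    """Sketch of one MixEncoder interaction layer (Eqs. 1-2).

    Assumptions: a single attention head, mean pooling for Q_bar,
    and a sigmoid gate; the paper leaves these details open here.
    """

    def __init__(self, d: int):
        super().__init__()
        self.q_proj = nn.Linear(d, d)    # Q from E^{j-1}
        self.k_proj = nn.Linear(d, d)    # K from E^{j-1}
        self.v_proj = nn.Linear(d, d)    # V from E^{j-1}
        self.gate = nn.Linear(2 * d, d)  # gate over [summary; H^{j-1}]

    @staticmethod
    def attn(q, k, v):
        # Plain scaled dot-product attention.
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ v

    def forward(self, E_prev, H_prev, K_hat, V_hat):
        """E_prev: (N, k, d) candidate context embeddings; H_prev: (N, d)
        query states; K_hat, V_hat: (L, d) key/value matrices of the query."""
        N = E_prev.size(0)
        Q, K, V = self.q_proj(E_prev), self.k_proj(E_prev), self.v_proj(E_prev)
        # Eq. (1): each candidate attends to its own keys/values concatenated
        # with the (shared, broadcast) keys/values of the query.
        K_cat = torch.cat([K, K_hat.expand(N, -1, -1)], dim=1)  # (N, k+L, d)
        V_cat = torch.cat([V, V_hat.expand(N, -1, -1)], dim=1)
        E_new = self.attn(Q, K_cat, V_cat)                      # (N, k, d)
        # Eq. (2): pool each candidate into one vector (Q_bar), attend over the
        # query, and gate the result with the previous query state H^{j-1}.
        Q_bar = E_prev.mean(dim=1, keepdim=True)                # (N, 1, d)
        summary = self.attn(Q_bar, K_hat.expand(N, -1, -1),
                            V_hat.expand(N, -1, -1)).squeeze(1)  # (N, d)
        g = torch.sigmoid(self.gate(torch.cat([summary, H_prev], dim=-1)))
        H_new = g * summary + (1 - g) * H_prev                  # (N, d)
        return E_new, H_new
```

Since all $N$ candidates are batched along the first dimension, one pass through this layer models the query-candidate interaction for every candidate in parallel, which is the source of the speedup described above.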
3.2 Prediction

Let $H$ and $E$ denote the query states and the candidate context embeddings generated by the last interaction layer, respectively. For the $i$-th candidate, its representation is the mean of the $i$-th row of $E$, denoted as $e_i$. The representation of the query with respect to this candidate is the $i$-th row of $H$, denoted as $h_i$. The cosine similarity between $e_i$ and $h_i$ is used as the semantic similarity. Additionally, we can pass $e_i$ and $h_i$ to a classifier for classification tasks.
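Continuing the shapes used above ($E$: $(N, k, d)$, $H$: $(N, d)$), a small sketch of this scoring step follows; the classifier head is an illustrative assumption rather than a detail given in the text.

```python
import torch
import torch.nn.functional as F

def score_candidates(E: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Cosine similarity s_i between e_i (mean of E[i]) and h_i = H[i]."""
    e = E.mean(dim=1)                         # (N, d): mean over k context embeddings
    return F.cosine_similarity(e, H, dim=-1)  # (N,) relevance scores for ranking

# For classification tasks, e_i and h_i can instead feed a small head;
# this 3-way NLI-style head is a hypothetical example.
classifier = torch.nn.Linear(2 * 768, 3)

def classify(E: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    e = E.mean(dim=1)
    return classifier(torch.cat([e, H], dim=-1))  # (N, num_labels) logits
```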