Once is Enough: A Lightweight Cross-Attention for
Fast Sentence Pair Modeling
Yuanhang Yang1, Shiyi Qi1, Chuanyi Liu1, Qifan Wang2,
Cuiyun Gao1, Zenglin Xu1
1Harbin Institute of Technology, Shenzhen, China
2Meta AI, CA, USA
{ysngkil, syqi12138}@gmail.com liuchuanyi@hit.edu.cn wqfcr@fb.com
{gaocuiyun, xuzenglin}@hit.edu.cn
Abstract
Transformer-based models have achieved great
success on sentence pair modeling tasks, such
as answer selection and natural language infer-
ence (NLI). These models generally perform
cross-attention over input pairs, leading to pro-
hibitive computational costs. Recent studies
propose dual-encoder and late interaction ar-
chitectures for faster computation. However,
the balance between the expressiveness of cross-attention
and the computation speedup still needs to be
better coordinated. To this end, this paper in-
troduces a novel paradigm MixEncoder for ef-
ficient sentence pair modeling. MixEncoder
involves a lightweight cross-attention mecha-
nism. It avoids the repeated encoding of the
same query for different candidates, thus allowing
the query-candidate interactions to be modeled
in parallel. Extensive experiments con-
ducted on four tasks demonstrate that our Mix-
Encoder can speed up sentence pair modeling by over
113x while achieving comparable performance
as the more expensive cross-attention mod-
els. The source code is available at
https://github.com/ysngki/MixEncoder.
1 Introduction
Sentence pair modeling, such as natural language
inference, question answering, and information
retrieval, is an essential task in natural language
processing (Nogueira and Cho, 2020; Qu et al.,
2021; Zhao et al., 2021). These tasks can be de-
picted as a procedure of scoring the candidates
given a query. Recently, Transformer-based mod-
els (Vaswani et al., 2017; Devlin et al., 2019) have
shown promising performance on sentence pair
modeling tasks due to the expressiveness of the
pre-trained cross-encoder. As shown in Figure 1(a),
the cross-encoder takes a pair of query and candi-
date as input and calculates the interaction between
them at each layer by the input-wide self-attention
mechanism. Despite its effective text representation
power, the cross-encoder incurs excessive
computation costs, especially when the number of
candidates is very large (e.g., the interaction must
be calculated N times if there are N candidates).
This computation cost, therefore, restricts the use
of these cross-encoder models in many real-world
applications (Chen et al., 2020).
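To make the cost concrete, the following sketch (ours, not the authors' code; the checkpoint name is a stand-in) scores N candidates with a cross-encoder: every pair is concatenated and fully re-encoded, so the query passes through the Transformer once per candidate.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in checkpoint; any cross-encoder-style model behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)

def cross_encoder_scores(query, candidates):
    # Each (query, candidate) pair becomes one concatenated input sequence,
    # so the query is re-encoded once per candidate: N candidates, N passes.
    inputs = tokenizer([query] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze(-1)  # one relevance score per pair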
To tackle this issue, we propose a lightweight
cross-attention mechanism, called MixEncoder,
that speeds up the inference while maintaining
the expressiveness of cross-attention. Specifically,
the proposed MixEncoder accelerates the cross-
attention by performing attention only from candidates
to the query, involving only a few tokens and
only a few layers. This lightweight cross-attention
avoids repetitive query encoding, supporting the
processing of multiple candidates in parallel and
thus reducing computation costs. Additionally,
MixEncoder allows pre-computing each candidate
into several dense context embeddings and storing
them offline to further accelerate inference.
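A simplified sketch of this mechanism, under our reading of the description (the helper name and shapes are illustrative, not the released implementation): the query is encoded once, each candidate is pre-compressed offline into k dense context embeddings, and attention runs only from those embeddings to the query states, for all N candidates in one batched operation.

import torch
import torch.nn.functional as F

def lightweight_cross_attention(query_states, candidate_contexts):
    # query_states:       (L_q, d)  token states of the single encoded query
    # candidate_contexts: (N, k, d) k pre-computed context embeddings per candidate
    # returns:            (N, k, d) candidate embeddings enriched with query info
    d = query_states.size(-1)
    # Candidates attend to the query; the query is never re-encoded per candidate.
    scores = torch.einsum("nkd,ld->nkl", candidate_contexts, query_states) / d**0.5
    attn = F.softmax(scores, dim=-1)                       # (N, k, L_q)
    return torch.einsum("nkl,ld->nkd", attn, query_states)

# Toy usage: one 12-token query, 100 candidates with k=2 context embeddings.
q = torch.randn(12, 768)
c = torch.randn(100, 2, 768)
print(lightweight_cross_attention(q, c).shape)  # torch.Size([100, 2, 768])

Because the attention keys and values come from the single encoded query, all N candidates can be scored with one batched operation instead of N separate forward passes.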
We evaluate MixEncoder for sentence pair mod-
eling on four benchmark datasets related to tasks
of natural language inference, dialogue, and infor-
mation retrieval. The results demonstrate that MixEncoder
achieves a better balance between effectiveness
and efficiency. For example, MixEncoder achieves a sub-
stantial speedup of more than 113x over the cross-
encoder and provides competitive performance.
2 Background
Extensive studies, including dual-encoders (Reimers
and Gurevych, 2019) and late interaction models
(MacAvaney et al., 2020; Gao et al., 2020; Chen
et al., 2020; Khattab and Zaharia, 2020), have been
proposed to accelerate the transformer inference on
sentence pair modeling tasks.
As shown in Figure 1, dual-encoders process
the query and candidates separately, allowing pre-
computing the candidates to accelerate online in-
ference. However, this speedup is built upon
sacrificing the expressiveness of cross-attention.
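For contrast, a minimal dual-encoder sketch (ours; the checkpoint and mean pooling are illustrative choices): query and candidates are encoded independently, so candidate vectors can be pre-computed and cached offline, but no token-level query-candidate interaction ever takes place.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (B, L, d)
    return hidden.mean(dim=1)                          # mean pooling -> (B, d)

candidate_vecs = embed(["candidate one", "candidate two"])  # computed once, stored offline
scores = embed(["a query"]) @ candidate_vecs.T              # online: one encoding + dot products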