
Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization
Nishant Yadav†, Nicholas Monath♦, Rico Angell†,
Manzil Zaheer♦, and Andrew McCallum†
†University of Massachusetts Amherst ♦Google Research
{nishantyadav, rangell, mccallum}@cs.umass.edu
{nmonath,manzilzaheer}@google.com
Abstract
Efficient k-nearest neighbor search is a fundamental task, foundational for many problems in NLP. When the similarity is measured by dot-product between dual-encoder vectors or ℓ2-distance, there already exist many scalable and efficient search methods. But not so when similarity is measured by more accurate and expensive black-box neural similarity models, such as cross-encoders, which jointly encode the query and candidate neighbor. The cross-encoders' high computational cost typically limits their use to reranking candidates retrieved by a cheaper model, such as a dual-encoder or TF-IDF. However, the accuracy of such a two-stage approach is upper-bounded by the recall of the initial candidate set, and potentially requires additional training to align the auxiliary retrieval model with the cross-encoder model. In this paper, we present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder. Retrieval is made efficient with CUR decomposition, a matrix decomposition approach that approximates all pairwise cross-encoder distances from a small subset of rows and columns of the distance matrix. Indexing items using our approach is computationally cheaper than training an auxiliary dual-encoder model through distillation. Empirically, for k > 10, our approach provides test-time recall-vs-computational-cost trade-offs superior to the current widely-used methods that re-rank items retrieved using a dual-encoder or TF-IDF.
1 Introduction
Finding top-k scoring items for a given query is a fundamental sub-routine of recommendation and information retrieval systems (Kowalski, 2007; Das et al., 2017). For instance, in question answering systems, the query corresponds to a question and the item corresponds to a document or a passage. Neural networks are widely used to model the similarity between a query and an item in such applications (Zamani et al., 2018; Hofstätter et al., 2019; Karpukhin et al., 2020; Qu et al., 2021). In this work, we focus on efficient k-nearest neighbor search for one such similarity function – the cross-encoder model.
Cross-encoder models output a scalar similarity score by jointly encoding the query-item pair, and often generalize better to new domains and unseen data (Chen et al., 2020; Wu et al., 2020; Thakur et al., 2021) as compared to dual-encoder¹ models, which independently embed the query and the item in a vector space and use simple functions such as dot-product to measure similarity. However, due to the black-box nature of the cross-encoder based similarity function, the computational cost of brute-force search with cross-encoders is prohibitively high. This often limits the use of cross-encoder models to re-ranking items retrieved using a separate retrieval model such as a dual-encoder or a TF-IDF-based model (Logeswaran et al., 2019; Zhang and Stratos, 2021; Qu et al., 2021). The accuracy of such a two-stage approach is upper bounded by the recall of relevant items by the initial retrieval model. Much recent work either attempts to distill information from an expensive but more expressive cross-encoder model into a cheaper student model such as a dual-encoder (Wu et al., 2020; Hofstätter et al., 2020; Lu et al., 2020; Qu et al., 2021; Liu et al., 2022), or focuses on cheaper alternatives to the cross-encoder model while attempting to capture fine-grained interactions between the query and the item (Humeau et al., 2020; Khattab and Zaharia, 2020; Luan et al., 2021).
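To make the contrast concrete, the following minimal sketch spells out the two-stage retrieve-and-rerank pipeline described above. The embed and cross_score functions are hypothetical placeholders for a trained dual-encoder tower and a cross-encoder forward pass; this is an illustration of the baseline in general, not any particular system's implementation.

import numpy as np

def two_stage_knn(query, items, embed, cross_score, k=10, n_candidates=100):
    # Stage 1: dual-encoder retrieval. Items are embedded independently
    # of the query (in practice precomputed offline), and candidates are
    # selected by dot-product similarity.
    q_vec = embed(query)
    item_vecs = np.stack([embed(x) for x in items])
    candidates = np.argsort(-(item_vecs @ q_vec))[:n_candidates]

    # Stage 2: cross-encoder re-ranking of the shortlist only. Note the
    # recall ceiling: a true top-k item missed in stage 1 can never be
    # recovered by stage 2.
    exact = np.array([cross_score(query, items[i]) for i in candidates])
    return [int(candidates[j]) for j in np.argsort(-exact)[:k]]

Only n_candidates cross-encoder calls are made per query, rather than one per item, which is what makes the baseline tractable and also what caps its recall.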
In this work, we tackle the fundamental task of efficient k-nearest neighbor search for a given query according to the cross-encoder. Our proposed approach, ANNCUR, uses CUR decomposition (Mahoney and Drineas, 2009), a matrix factorization approach, to approximate cross-encoder scores.
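As a toy illustration of this machinery, the sketch below (ours, not the paper's released code) shows how a CUR decomposition reconstructs a matrix from a small subset of its rows and columns. The synthetic low-rank matrix is a hypothetical stand-in for a matrix of cross-encoder scores between queries (rows) and items (columns).

import numpy as np

def cur_approximation(M, row_idx, col_idx):
    # CUR decomposition: approximate M from a subset of its columns
    # C = M[:, col_idx], a subset of its rows R = M[row_idx, :], and the
    # pseudo-inverse of their intersection W = M[row_idx][:, col_idx],
    # giving M_hat = C @ pinv(W) @ R.
    C = M[:, col_idx]
    R = M[row_idx, :]
    W = M[np.ix_(row_idx, col_idx)]
    return C @ np.linalg.pinv(W) @ R

# Toy check on a synthetic rank-5 "score matrix" of 50 queries x 200 items.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 200))
M_hat = cur_approximation(M, row_idx=np.arange(10), col_idx=np.arange(10))
print(np.abs(M - M_hat).max())  # ~0: 10 rows/columns suffice for rank 5

The reconstruction is exact here because the sampled rows and columns span the matrix; with real cross-encoder scores the matrix is only approximately low-rank, and the choice of sampled rows and columns governs the approximation error.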
¹ Also referred to as two-tower models or Siamese networks.