Constrained Approximate Similarity
Search on Proximity Graph
Weijie Zhao, Shulong Tan, Ping Li
Cognitive Computing Lab
Baidu Research
10900 NE 8th St. Bellevue, WA 98004, USA
{zhaoweijie12, tanshulong2011, pingli98}@gmail.com
Abstract
Search engines and recommendation systems are built to efficiently display relevant information
from massive numbers of candidates. Typically, a three-stage mechanism is employed in
such systems: (i) a small collection of items is first retrieved by, e.g., approximate near
neighbor search algorithms; (ii) a collection of constraints is then applied to the retrieved items;
(iii) a fine-grained ranking neural network is employed to determine the final recommendation.
We observe a major defect of the original three-stage pipeline: although we only target
retrieving k vectors in the final recommendation, we have to preset a sufficiently large s (s > k)
for each query, and "hope" the number of surviving vectors after the filtering is not smaller than
k. That is, at least k vectors among the s similar candidates satisfy the query constraints.
In this paper, we investigate this constrained similarity search problem and attempt to
merge the similarity search stage and the filtering stage into one single search operation. We
introduce AIRSHIP, a system that integrates user-defined function filtering into the similarity
search framework. The proposed system does not need to build extra indices, nor does it require
prior knowledge of the query constraints. We propose three optimization strategies: (1) starting
point selection, (2) multi-direction search, and (3) biased priority queue selection. The first
optimization, starting point selection, locates good starting points for the graph searching
algorithm: we would like to start the graph search inside a cluster of satisfied vectors, so that a
number of satisfied vectors can be retrieved along the path to the query vector. Second, we
propose a multi-direction search optimization. Compared with the single priority queue used in
the original graph search, we keep two priority queues of search candidates: one stores the
satisfied vectors and the other maintains the remaining unsatisfied vectors. When both priority
queues are non-empty, we choose between them with a weight ratio, alter_ratio. On top of that,
we propose a biased priority queue selection to adaptively select the queue for searching: when
the top candidate from the satisfied vector queue is better than the one from the other queue,
we override the alter_ratio restriction and select the satisfied vector queue. All three
optimizations improve the similarity search performance. The proposed algorithm has a
hyper-parameter alter_ratio that is data/query dependent; we present an estimation method to
choose it adaptively. Experimental evaluations on both synthetic and real data confirm the
effectiveness of the proposed AIRSHIP algorithm.
We focus on constrained graph-based approximate near neighbor (ANN) search in this study, in
part because graph-based ANN is known to achieve excellent performance. We believe it is also
possible to develop constrained hashing-based ANN or constrained quantization-based ANN.
arXiv:2210.14958v2 [cs.IR] 8 Nov 2022
1 Introduction
The task of searching for similar items is a standard routine in numerous industrial applications as
well as research problems in computer science and other fields. Finding the exact near (or nearest)
neighbors is often expensive, and typically an approximate solution is sufficient. Research
on developing efficient algorithms for approximate near neighbor (ANN) search dates back at least
to the 1970s (Friedman et al., 1975, 1977). In recent years, ANN algorithms have become a crucial
component in modern recommender systems (Davidson et al., 2010; Dror et al., 2011; Fan et al., 2019;
Kanakia et al., 2019), among many other application domains. In this paper, we study ANN algorithms
for an important application scenario, that is, the retrieved items must satisfy certain pre-specified
constraints. More specifically, we focus on graph-based ANN algorithms under constraints. We
should emphasize that, although this research problem was motivated by the production needs at
Baidu, we only report experiments using public datasets.
1.1 Algorithms for Approximate Near Neighbor (ANN) Search
Broadly speaking, most approximate similarity search methods fall into the following five categories:
Hash-based methods. Traditionally, hashing-based methods (Broder et al., 1997; Indyk and
Motwani, 1998; Charikar, 2002; Datar et al., 2004; Li and Church, 2005; Lv et al., 2007; Paulevé
et al., 2010; Shrivastava and Li, 2012, 2014; Wang and Dasgupta, 2016; Li, 2019; Li et al., 2021)
partition the base vectors into a constant number of buckets via a specific hashing function.
The chosen hashing function guarantees that the collision probability is positively correlated
with the similarity of two vectors, i.e., similar vectors have a high probability of being assigned
to the same hash bucket. Relatively recently, learning-to-hash methods (Kulis and Darrell,
2009; Grauman and Fergus, 2013; Lai et al., 2015; Wang et al., 2016a; Dong et al., 2020)
have become popular. Typically they learn the hashing on a sample of data so that the learned
hash function can better fit the data distribution compared with classic data-independent
hashing algorithms.
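As an illustration of the data-independent flavor, the following is a minimal sketch of random-hyperplane hashing (SimHash-style) in Python. The function names and NumPy-based bucketing are our own illustrative choices, not code from any system cited above:

```python
import numpy as np

def build_lsh_index(vectors, num_bits=8, seed=0):
    """Hash each base vector into one of 2^num_bits buckets using random
    hyperplanes (SimHash): similar vectors tend to land in the same bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((num_bits, vectors.shape[1]))
    buckets = {}
    for i, v in enumerate(vectors):
        # Each bit records which side of a random hyperplane v falls on.
        key = tuple((planes @ v) > 0)
        buckets.setdefault(key, []).append(i)
    return planes, buckets

def query_lsh(q, planes, buckets):
    """Candidates are simply the base vectors sharing q's bucket."""
    return buckets.get(tuple((planes @ q) > 0), [])
```

In practice, multiple independent hash tables are used so that a near neighbor missed by one table can still collide with the query in another.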
Tree-based methods (Friedman et al., 1975, 1977; Jagadish et al., 2005; Cayton, 2008; Ram
and Gray, 2012; Curtin et al., 2013; Ram and Sinha, 2019) partition the high-dimensional data
space into smaller regions via a decision tree/space-partitioning tree. The partitioned spaces
are organized within a tree: a leaf node corresponds to the finest granularity of
the partitioned space, and edges between tree nodes represent the hierarchical relationship of
partitioned spaces. Hamilton et al. (2020) recently proposed a tree-based method that builds
an inverted index for each label/attribute. However, it is infeasible to build an index for each label
combination without prior knowledge of the query constraints. Our proposed solution operates on
the original index and does not need to build any additional indices; moreover, we do not need
to know the query constraints beforehand.
Quantization-based algorithms (Jégou et al., 2011; Ge et al., 2013; Wang et al., 2016b; Wu
et al., 2017; Xu et al., 2018; Echihabi et al., 2021) learn a code-book from the base vectors and
quantize the original high-dimensional vectors into low-dimensional quantized values. The
quantized value is a compact representation of the original vector, and the distances
between quantized vectors approximate the distances between the original vectors.
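The product quantization idea behind several of these methods can be sketched as follows: split each vector into m subvectors, learn a small k-means codebook per subspace, and compare a query against stored codes via per-subspace lookup tables (asymmetric distance computation). This is a simplified illustration under our own naming, not the implementation of any cited system:

```python
import numpy as np

def train_pq(X, m=4, k=16, iters=10, seed=0):
    """Learn one k-centroid codebook per subspace via Lloyd's k-means.
    X: (n, d) array with d divisible by m."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sub = X.reshape(n, m, d // m)
    books = []
    for j in range(m):
        C = sub[rng.choice(n, k, replace=False), j]  # random init
        for _ in range(iters):
            a = np.argmin(((sub[:, j, None] - C) ** 2).sum(-1), axis=1)
            for c in range(k):
                if (a == c).any():
                    C[c] = sub[a == c, j].mean(axis=0)
        books.append(C)
    return books

def encode(X, books):
    """Quantize each subvector to the id of its nearest centroid."""
    n, m = X.shape[0], len(books)
    sub = X.reshape(n, m, -1)
    return np.stack(
        [np.argmin(((sub[:, j, None] - books[j]) ** 2).sum(-1), axis=1)
         for j in range(m)], axis=1)

def adc_distance(q, codes, books):
    """Asymmetric distance: precompute per-subspace distance tables for q,
    then sum the table entries indexed by each stored code."""
    m = len(books)
    qs = q.reshape(m, -1)
    tables = [((books[j] - qs[j]) ** 2).sum(-1) for j in range(m)]
    return sum(tables[j][codes[:, j]] for j in range(m))
```

The memory saving is the point: each vector is stored as m small integers instead of d floats, and query-time distances cost m table lookups per candidate.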
Graph-based algorithms have attracted much attention recently due to their superior
performance over other similarity search techniques on many real-world datasets and in
industrial applications (Hajebi et al., 2011; Wu et al., 2014; Malkov et al., 2014; Malkov and
Yashunin, 2020; Fu et al., 2019; Tan et al., 2019; Zhou et al., 2019; Tan et al., 2021a; Xu et
al., 2022). Graph-based algorithms build a proximity graph as an index and perform a graph
search to answer the query. We investigate constrained similarity search on graph-based
algorithms. Recently, Zhao et al. (2020) developed the first GPU-based graph-based ANN algorithm.
Neural ranking. To resolve issues related to proximity search in various neural models, in
recent years the subject of "neural ranking" has attracted increasingly more attention (Zhu
et al., 2018, 2019; Tan et al., 2020; Zhuo et al., 2020; Gao et al., 2020; Tan et al., 2021b; Yu
et al., 2022; Zhao et al., 2022). Effective algorithms for neural ranking can be very desirable in
practice. For example, while the standard "two-tower" model is popular and fairly effective,
practitioners might hope to replace the simple "cosine" loss function with a more sophisticated
deep neural net for better retrieval accuracy. However, once we have replaced the cosine by a
neural net, we can no longer use the standard ANN techniques to achieve fast ranking.
Among the above methods, hashing-based methods are typically the simplest, but their performance
highly depends on the data and the chosen hashing functions. Graph-based ANN methods typically
outperform the other methods, often considerably. In this study, we focus on graph-based ANN methods.
1.2 Approximate Near Neighbor Search Under Constraints
In this paper, we consider a practical scenario commonly encountered in industrial applications, as
illustrated in Figure 1. That is, the returned search results have to satisfy certain constraints. While
engineers and researchers from industry would probably agree this is a ubiquitous task in general,
how to effectively solve the problem depends on the pattern of the constraints.
Figure 1: The three-stage pipeline vs. our proposed constrained similarity search workflow.
Upper (original): base vectors (n) -> similarity search -> similar candidates (s) -> filtering ->
similar & constrained (c) -> ranking -> top-k results (k). Lower (proposed): base vectors (n) ->
constrained similarity search -> similar & constrained (k) -> ranking -> top-k results (k).
In Figure 1 (upper), we describe the usual strategy. That is, we consider there are initially n
vectors. For a particular query, the ANN search algorithm returns s vectors (from the n vectors).
After applying the constraints on the s vectors, the number is reduced to c. Finally, we apply the
re-ranking algorithm (which can be either simply based on the similarity or based on the prediction
results from a neural net or other learning model) to obtain the final top-k results. Obviously, with
this strategy, practitioners would like to make sure that c > k, but this is not always easy to achieve
unless we increase s to be much larger than k. For example, when the target is to retrieve the
top-1000 vectors, practitioners may have to first retrieve the top-5000 or even top-10000 vectors
and then apply the constraints on the retrieved vectors. In this study, we hope to develop better
schemes to improve the efficiency.
Different from the pipeline in the upper part of Figure 1, there is another obvious strategy:
simply apply the constraints on the original n vectors before conducting approximate near
neighbor search. If engineers know (or guess) beforehand that directly filtering the n vectors
would result in only a small number of vectors, then this would be a good strategy, provided that
the cost of applying the constraints on all n vectors is not too high (for example, when indices have
been built for those constraints). In this paper, we do not consider this scenario.
As illustrated in the bottom part of Figure 1, in this study our goal is to directly retrieve the
top-k vectors which satisfy the constraints. We focus on the graph-based ANN algorithm in this
paper, and we believe that one can also develop, with effort, retrieval algorithms of a similar
nature based on other ANN methods such as hashing-based or quantization-based ANN algorithms.
We leave them for future research.
1.3 Constrained Similarity Search
More formally, we merge the similarity search stage and the filtering stage into one single search
operation. The merging has two principal advantages:
1. The constrained similarity search targets returning the top-k similar candidates that satisfy
the constraints. We do not need to estimate s to ensure the search performance.
2. The similarity search algorithm takes the constraints into consideration: additional indices
and optimizations can potentially be applied to the search.
Problem Statement. We have a collection of n d-dimensional base vectors V = {v_1, v_2, ..., v_n}.
Each vector has m attributes: the attributes of the i-th vector are {a_i1, a_i2, ..., a_im}. In
addition, we are given Q queries. The j-th query consists of a query vector q_j and a user-defined
function f_j as the constraint. For each query vector q_j, we are required to return k similar
candidates (r_j1, r_j2, ..., r_jk) from the base vectors, and each candidate r_j· satisfies the
constraint, i.e., f_j(r_j·) = true.
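For concreteness, the problem statement admits a tiny executable reference: an exact linear scan that returns the constrained top-k. The variable names (V, A, q, f) mirror the notation above; the instance itself is a hypothetical toy example of our own construction:

```python
import numpy as np

# Toy instance: n base vectors, m integer attributes per vector,
# and one query (q, f) where f is a user-defined constraint function.
n, d, m = 1000, 32, 2
rng = np.random.default_rng(0)
V = rng.standard_normal((n, d))        # base vectors v_1..v_n
A = rng.integers(0, 5, size=(n, m))    # attributes a_i1..a_im

q = rng.standard_normal(d)             # query vector q_j
f = lambda i: A[i, 0] == 3             # constraint f_j (a UDF on vector ids)

def constrained_brute_force(q, V, f, k=10):
    """Reference answer: exact constrained top-k via a full linear scan.
    AIRSHIP aims to approximate this result without scanning all n vectors."""
    order = np.argsort(((V - q) ** 2).sum(axis=1))  # all vectors, by distance
    return [i for i in order if f(i)][:k]           # keep satisfied ones only
```

Note that the scan costs O(nd) per query; the point of constrained graph search is to avoid touching most of the n vectors while returning (approximately) the same list.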
Challenges & approaches. We identify two main challenges for constrained similarity search.
The first challenge is that we are required to efficiently identify the "interesting region" that
contains vectors meeting the given constraints. There is no sub-linear algorithm to tackle this
challenge without any assumptions about the query constraints and vector distributions: before
knowing the queries, it is impractical to build indices for all possible constraints in
pre-processing; after the queries arrive, building indices for all the different constraints is
more time-consuming than a linear scan. Therefore, we make one assumption in this paper: for each
constraint, at least p% of the base vectors are satisfied vectors.
Secondly, the distribution of the vectors that satisfy the constraints is unknown, yet we need to
retrieve similar vectors from them, and similarity search techniques require us to construct an
index beforehand. In this paper, we propose an adaptive searching algorithm on the proximity graph
that exploits the clustered nature of the vector distribution: vectors which meet the constraints
are very likely to cluster in real-world datasets, because the vector representations are trained
based on those attributes. We leverage a heuristic algorithm to determine the hyper-parameters of
the proposed searching algorithm.
Why not an inverted index? We have explained this earlier, but we would like to reiterate the
problem since it is a common question. For applications in which it is efficient to build
indices for the constraints and the fraction of data vectors after filtering is small, we might
want to go with this straightforward approach. In real applications, however, it is often inefficient
to build indices for all the (potentially combinatorial number of) constraints. Creating this many
indices is not only time-consuming but also space-intensive, because we would need to duplicate a
vector each time it is captured by a query constraint; note that the satisfied vectors for different
query constraints may overlap. Hence, in this paper we seek more efficient solutions for scenarios in
which the inverted index approach is not applicable.
Contributions. Our major contributions are summarized as follows:
We introduce AIRSHIP (an acronym for AttrIbute-constRained Similarity searcH on proxImity
graPh) that integrates a user-defined function filtering into the similarity search framework.
AIRSHIP requires no extra indices nor prior knowledge of the query constraints.
We propose three optimizations for the attribute-constrained searching problem: starting point
selection, multi-direction search, and biased priority queue selection.
We present an estimation method to choose the hyper-parameter of the searching algorithm.
We experimentally evaluate the proposed algorithm on both synthetic and real datasets. The
empirical results confirm the effectiveness of our optimizations.
2 AIRSHIP: Attribute-Constrained Similarity Search
In this section, we introduce our solutions to the attribute-constrained similarity search problem.
We begin with a baseline adaptation of graph-based similarity search methods. Then, we discuss the
drawbacks of the baseline solution and propose our optimizations step by step. Each optimization is
built on top of the previous one.
2.1 Vanilla Graph-Based Similarity Search
Algorithm 1 illustrates the workflow of the vanilla graph-based similarity search. In Lines 1-3, a
random starting point S is selected and the basic data structures are initialized. In each iteration,
the vector that is closest to the query point is extracted from the priority queue (Line 5). The
algorithm terminates when we have obtained K candidates and the current searched vector is worse
than the worst vector in topk (Lines 6-8). The only difference from the original graph-based
similarity search algorithm lies in Lines 9-14: we only update the top-K candidates when the current
searched vector satisfies the query constraint. After that, Lines 15-21 iterate over all unvisited
neighbors of now_idx, add them to the priority queue, and mark them as visited.
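A minimal Python sketch of this baseline follows, assuming graph is a prebuilt proximity graph given as an adjacency map and f is the query's user-defined constraint. It mirrors the structure described above (best-first expansion, constraint check before updating the top-K, early termination), but is our own illustrative rendering rather than the authors' code:

```python
import heapq
import numpy as np

def vanilla_constrained_search(graph, V, q, f, K, start):
    """Baseline constrained search on a proximity graph.
    graph[i] lists the neighbor ids of vector i; only vectors passing
    the constraint f may enter the top-K result."""
    dist = lambda i: float(((V[i] - q) ** 2).sum())
    cand = [(dist(start), start)]   # min-heap of candidates to expand
    topk = []                       # max-heap via negated distances
    visited = {start}
    while cand:
        d, i = heapq.heappop(cand)  # closest unexpanded vector
        # Stop once we hold K results and this candidate cannot improve them.
        if len(topk) == K and d > -topk[0][0]:
            break
        if f(i):                    # the only change vs. plain graph search:
            heapq.heappush(topk, (-d, i))  # update top-K with satisfied only
            if len(topk) > K:
                heapq.heappop(topk)        # drop the farthest result
        for j in graph[i]:          # expand all unvisited neighbors
            if j not in visited:
                visited.add(j)
                heapq.heappush(cand, (dist(j), j))
    return [i for _, i in sorted((-d, i) for d, i in topk)]
```

On a toy chain graph where only even-indexed vectors satisfy the constraint, the search walks past the unsatisfied vectors and still returns the closest satisfied ones, which foreshadows the inefficiency discussed next: every unsatisfied vector on the way is still expanded.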
Note that this vanilla algorithm is inefficient when the query vector is not located near the
satisfied vectors. Figure 2(a) presents an example: the searching algorithm goes from the starting
point toward the query vector; however, no satisfied vectors can be found along that path. The
searching algorithm keeps expanding the search radius and cannot reach any satisfied vector until
it has enumerated many very similar but unsatisfied vectors.