Constrained Approximate Similarity
Search on Proximity Graph
Weijie Zhao, Shulong Tan, Ping Li
Cognitive Computing Lab
Baidu Research
10900 NE 8th St. Bellevue, WA 98004, USA
{zhaoweijie12, tanshulong2011, pingli98}@gmail.com
Abstract
Search engines and recommendation systems are built to efficiently display relevant information
from massive numbers of candidates. Typically, a three-stage mechanism is employed in
such systems: (i) a small collection of items is first retrieved by, e.g., approximate near
neighbor search algorithms; (ii) a collection of constraints is then applied to the retrieved items;
(iii) a fine-grained ranking neural network is employed to determine the final recommendation.
We observe a major defect of the original three-stage pipeline: although we only target
retrieving k vectors in the final recommendation, we have to preset a sufficiently large s (s > k)
for each query, and "hope" the number of surviving vectors after the filtering is not smaller than
k. That is, at least k vectors among the s similar candidates satisfy the query constraints.
In this paper, we investigate this constrained similarity search problem and attempt to
merge the similarity search stage and the filtering stage into one single search operation. We
introduce AIRSHIP, a system that integrates user-defined function filtering into the similarity
search framework. The proposed system does not need to build extra indices, nor does it require
prior knowledge of the query constraints. We propose three optimization strategies: (1) starting
point selection, (2) multi-direction search, and (3) biased priority queue selection. The first
optimization, starting point selection, locates good starting points for the graph searching
algorithm: we would like to start the graph search inside a cluster of satisfied vectors, so that a
number of satisfied vectors can be retrieved along the path to the query vector. Second, we
propose a multi-direction search optimization. Compared with the single priority queue used in
the original graph search, we keep two priority queues of search candidates: one stores the
satisfied vectors and the other maintains the remaining unsatisfied vectors. When both priority
queues are non-empty, we choose between them with a weight ratio, alter_ratio. On top of that,
we propose a biased priority queue selection to adaptively select the queue for searching: when
the top candidate from the satisfied vector queue is better than the one from the other queue,
we override the alter_ratio restriction and select the satisfied vector queue. All three
optimizations improve the similarity search performance. The proposed algorithm has a
hyper-parameter alter_ratio that is data/query dependent; we present an estimation method to
choose it adaptively. Experimental evaluations on both synthetic and real data confirm the
effectiveness of the proposed AIRSHIP algorithm.
We focus on constrained graph-based approximate near neighbor (ANN) search in this study, in
part because graph-based ANN is known to achieve excellent performance. We believe it is also
possible to develop constrained hashing-based ANN or constrained quantization-based ANN.
arXiv:2210.14958v2 [cs.IR] 8 Nov 2022
1 Introduction
The task of searching for similar items is a standard routine in numerous industrial applications as
well as research problems in computer science and other fields. Finding the exact near (or nearest)
neighbors is often expensive, and typically an approximate solution is sufficient. Research
on developing efficient algorithms for approximate near neighbor (ANN) search dates back at least
to the 1970s (Friedman et al., 1975, 1977). In recent years, ANN algorithms have become a crucial
component in modern recommender systems (Davidson et al., 2010; Dror et al., 2011; Fan et al., 2019;
Kanakia et al., 2019), among many other application domains. In this paper, we study ANN algorithms
for an important application scenario, that is, the retrieved items must satisfy certain pre-specified
constraints. More specifically, we focus on graph-based ANN algorithms under constraints. We
should emphasize that, although this research problem was motivated by the production needs at
Baidu, we only report experiments using public datasets.
1.1 Algorithms for Approximate Near Neighbor (ANN) Search
Broadly speaking, most approximate similarity search methods fall into the following five categories:
Hash-based methods. Traditionally, hashing-based methods (Broder et al., 1997; Indyk and
Motwani, 1998; Charikar, 2002; Datar et al., 2004; Li and Church, 2005; Lv et al., 2007; Paulevé
et al., 2010; Shrivastava and Li, 2012, 2014; Wang and Dasgupta, 2016; Li, 2019; Li et al., 2021)
partition the base vectors into a constant number of buckets via a specific hashing function.
The chosen hashing function guarantees that the collision probability is positively correlated
with the similarity of two vectors, i.e., similar vectors have a high probability of being assigned
to the same hash bucket. Relatively recently, learning-to-hash methods (Kulis and Darrell,
2009; Grauman and Fergus, 2013; Lai et al., 2015; Wang et al., 2016a; Dong et al., 2020)
have become popular. Typically they learn the hashing on a sample of data so that the learned
hash function can better fit the data distribution compared with classic data-independent
hashing algorithms.
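As an illustration of the data-independent flavor, the following is a minimal sketch of random-hyperplane hashing (SimHash-style) in Python. The function names and NumPy-based bucketing are our own illustrative choices, not code from any system cited above:

```python
import numpy as np

def build_lsh_index(vectors, num_bits=8, seed=0):
    """Hash each base vector into one of 2^num_bits buckets using random
    hyperplanes (SimHash): similar vectors tend to land in the same bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((num_bits, vectors.shape[1]))
    buckets = {}
    for i, v in enumerate(vectors):
        # Each bit records which side of a random hyperplane v falls on.
        key = tuple((planes @ v) > 0)
        buckets.setdefault(key, []).append(i)
    return planes, buckets

def query_lsh(q, planes, buckets):
    """Candidates are simply the base vectors sharing q's bucket."""
    return buckets.get(tuple((planes @ q) > 0), [])
```

In practice, multiple independent hash tables are used so that a near neighbor missed by one table can still collide with the query in another.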
Tree-based methods (Friedman et al., 1975, 1977; Jagadish et al., 2005; Cayton, 2008; Ram
and Gray, 2012; Curtin et al., 2013; Ram and Sinha, 2019) partition the high-dimensional data
space into smaller regions via a decision tree/space-partitioning tree. The partitioned spaces
are organized within a tree: a leaf node corresponds to the finest granularity of
the partitioned space, and edges between tree nodes represent the hierarchical relationship of
partitioned spaces. Hamilton et al. (2020) recently proposed a tree-based method that builds
an inverted index for each label/attribute. However, it is infeasible to build an index for each label
combination without prior knowledge of the query constraints. Our proposed solution operates on
the original index and does not need to build any additional indices; moreover, we do not need
to know the query constraints beforehand.
Quantization-based algorithms (Jégou et al., 2011; Ge et al., 2013; Wang et al., 2016b; Wu
et al., 2017; Xu et al., 2018; Echihabi et al., 2021) learn a code-book from the base vectors and
quantize the original high-dimensional vectors into low-dimensional quantized values. The
quantized value is a compact representation of the original vector, and the distances
between quantized vectors approximate the distances between the original vectors.
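The product quantization idea behind several of these methods can be sketched as follows: split each vector into m subvectors, learn a small k-means codebook per subspace, and compare a query against stored codes via per-subspace lookup tables (asymmetric distance computation). This is a simplified illustration under our own naming, not the implementation of any cited system:

```python
import numpy as np

def train_pq(X, m=4, k=16, iters=10, seed=0):
    """Learn one k-centroid codebook per subspace via Lloyd's k-means.
    X: (n, d) array with d divisible by m."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sub = X.reshape(n, m, d // m)
    books = []
    for j in range(m):
        C = sub[rng.choice(n, k, replace=False), j]  # random init
        for _ in range(iters):
            a = np.argmin(((sub[:, j, None] - C) ** 2).sum(-1), axis=1)
            for c in range(k):
                if (a == c).any():
                    C[c] = sub[a == c, j].mean(axis=0)
        books.append(C)
    return books

def encode(X, books):
    """Quantize each subvector to the id of its nearest centroid."""
    n, m = X.shape[0], len(books)
    sub = X.reshape(n, m, -1)
    return np.stack(
        [np.argmin(((sub[:, j, None] - books[j]) ** 2).sum(-1), axis=1)
         for j in range(m)], axis=1)

def adc_distance(q, codes, books):
    """Asymmetric distance: precompute per-subspace distance tables for q,
    then sum the table entries indexed by each stored code."""
    m = len(books)
    qs = q.reshape(m, -1)
    tables = [((books[j] - qs[j]) ** 2).sum(-1) for j in range(m)]
    return sum(tables[j][codes[:, j]] for j in range(m))
```

The memory saving is the point: each vector is stored as m small integers instead of d floats, and query-time distances cost m table lookups per candidate.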
Graph-based algorithms have attracted much attention recently due to their superior
performance over other similarity search techniques on many real-world datasets and in
industrial applications (Hajebi et al., 2011; Wu et al., 2014; Malkov et al., 2014; Malkov and
Yashunin, 2020; Fu et al., 2019; Tan et al., 2019; Zhou et al., 2019; Tan et al., 2021a; Xu et
al., 2022). Graph-based algorithms build a proximity graph as an index and perform a graph
search to answer the query. We investigate constrained similarity search on graph-based
algorithms. Recently, Zhao et al. (2020) developed the first GPU-based graph-based ANN algorithm.
Neural ranking. To resolve issues related to proximity search in various neural models, in
recent years the subject of "neural ranking" has attracted increasingly more attention (Zhu
et al., 2018, 2019; Tan et al., 2020; Zhuo et al., 2020; Gao et al., 2020; Tan et al., 2021b; Yu
et al., 2022; Zhao et al., 2022). Effective algorithms for neural ranking can be very desirable in
practice. For example, while the standard "two-tower" model is popular and fairly effective,
practitioners might hope to replace the simple "cosine" loss function with a more sophisticated
deep neural net for better retrieval accuracy. However, once we have replaced the cosine by a
neural net, we can no longer use the standard ANN techniques to achieve fast ranking.
Among the above methods, hashing-based methods are typically the simplest, but their performance
highly depends on the data and the chosen hashing functions. Graph-based ANN methods typically
outperform the other methods, often considerably. In this study, we focus on graph-based ANN methods.
1.2 Approximate Near Neighbor Search Under Constraints
In this paper, we consider a practical scenario commonly encountered in industrial applications, as
illustrated in Figure 1. That is, the returned search results have to satisfy certain constraints. While
engineers and researchers from industry would probably agree this is a ubiquitous task in general,
how to effectively solve the problem depends on the pattern of the constraints.
Figure 1: The three-stage pipeline vs. our proposed constrained similarity search workflow.
Upper (original): base vectors (n) -> similarity search -> similar candidates (s) -> filtering ->
similar & constrained (c) -> ranking -> top-k results (k). Lower (proposed): base vectors (n) ->
constrained similarity search -> similar & constrained (k) -> ranking -> top-k results (k).
In Figure 1 (upper), we describe the usual strategy. That is, we consider there are initially n
vectors. For a particular query, the ANN search algorithm returns s vectors (from the n vectors).
After applying the constraints on the s vectors, the number is reduced to c. Finally, we apply the
re-ranking algorithm (which can be either simply based on the similarity or based on the prediction
results from a neural net or other learning model) to obtain the final top-k results. Obviously, with
this strategy, practitioners would like to make sure that c > k, but this is not always easy to achieve
unless we increase s to be much larger than k. For example, when the target is to retrieve the
top-1000 vectors, practitioners may have to first retrieve the top-5000 or even top-10000 vectors
and then apply the constraints on the retrieved vectors. In this study, we hope to develop better
schemes to improve the efficiency.
Different from the pipeline in the upper part of Figure 1, there is another obvious strategy:
simply apply the constraints on the original n vectors before conducting approximate near
neighbor search. If engineers know (or guess) beforehand that directly filtering the n vectors
would result in only a small number of vectors, then this would be a good strategy, provided that
the cost of applying the constraints on all n vectors is not too high (for example, when indices have
been built for those constraints). In this paper, we do not consider this scenario.
As illustrated in the bottom part of Figure 1, in this study our goal is to directly retrieve the
top-k vectors which satisfy the constraints. We focus on the graph-based ANN algorithm in this
paper, and we believe that one can also develop, with effort, retrieval algorithms of a similar
nature based on other ANN methods such as hashing-based or quantization-based ANN algorithms.
We leave them for future research.
1.3 Constrained Similarity Search
More formally, we merge the similarity search stage and the filtering stage into one single search
operation. The merging has two principal advantages:
1. The constrained similarity search targets returning the top-k similar candidates that satisfy
the constraints. We do not need to estimate s to ensure the search performance.
2. The similarity search algorithm takes the constraints into consideration: additional indices
and optimizations can potentially be applied to the search.
Problem Statement. We have a collection of n d-dimensional base vectors V = {v_1, v_2, ..., v_n}.
Each vector has m attributes: the attributes of the i-th vector are {a_i1, a_i2, ..., a_im}. In
addition, we are given Q queries. The j-th query consists of a query vector q_j and a user-defined
function f_j as the constraint. For each query vector q_j, we are required to return k similar
candidates (r_j1, r_j2, ..., r_jk) from the base vectors, and each candidate r_j· satisfies the
constraint, i.e., f_j(r_j·) = true.
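For concreteness, the problem statement admits a tiny executable reference: an exact linear scan that returns the constrained top-k. The variable names (V, A, q, f) mirror the notation above; the instance itself is a hypothetical toy example of our own construction:

```python
import numpy as np

# Toy instance: n base vectors, m integer attributes per vector,
# and one query (q, f) where f is a user-defined constraint function.
n, d, m = 1000, 32, 2
rng = np.random.default_rng(0)
V = rng.standard_normal((n, d))        # base vectors v_1..v_n
A = rng.integers(0, 5, size=(n, m))    # attributes a_i1..a_im

q = rng.standard_normal(d)             # query vector q_j
f = lambda i: A[i, 0] == 3             # constraint f_j (a UDF on vector ids)

def constrained_brute_force(q, V, f, k=10):
    """Reference answer: exact constrained top-k via a full linear scan.
    AIRSHIP aims to approximate this result without scanning all n vectors."""
    order = np.argsort(((V - q) ** 2).sum(axis=1))  # all vectors, by distance
    return [i for i in order if f(i)][:k]           # keep satisfied ones only
```

Note that the scan costs O(nd) per query; the point of constrained graph search is to avoid touching most of the n vectors while returning (approximately) the same list.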
Challenges & approaches. We identify two main challenges for constrained similarity search.
The first challenge is that we are required to efficiently identify the "interesting region" that
contains vectors meeting the given constraints. There is no sub-linear algorithm to tackle this
challenge without any assumptions about the query constraints and vector distributions: before
knowing the queries, it is impractical to build indices for all possible constraints in
pre-processing; after the queries arrive, building indices for all the different constraints is
more time-consuming than a linear scan. Therefore, we make one assumption in this paper: for each
constraint, at least p% of the base vectors are satisfied vectors.
Secondly, the distribution of the vectors that satisfy the constraints is unknown, yet we need to
retrieve similar vectors from them, and similarity search techniques require us to construct an
index beforehand. In this paper, we propose an adaptive searching algorithm on the proximity graph
that exploits the clustered nature of the vector distribution: vectors which meet the constraints
are very likely to cluster in real-world datasets, because the vector representations are trained
based on those attributes. We leverage a heuristic algorithm to determine the hyper-parameters of
the proposed searching algorithm.
Why not an inverted index? We have explained this earlier, but we would like to reiterate the
problem since it is a common question. For applications in which it is efficient to build
indices for the constraints and the fraction of data vectors after filtering is small, we might
want to go with this straightforward approach. In real applications, however, it is often inefficient
to build indices for all the (potentially combinatorial number of) constraints. Creating this many
indices is not only time-consuming but also space-intensive, because we would need to duplicate a
vector each time it is captured by a query constraint; note that the satisfied vectors for different
query constraints may overlap. Hence, in this paper we seek more efficient solutions for scenarios in
which the inverted index approach is not applicable.
Contributions. Our major contributions are summarized as follows:
We introduce AIRSHIP (an acronym for AttrIbute-constRained Similarity searcH on proxImity
graPh) that integrates a user-defined function filtering into the similarity search framework.
AIRSHIP requires no extra indices nor prior knowledge of the query constraints.
We propose three optimizations for the attribute-constrained searching problem: starting point
selection, multi-direction search, and biased priority queue selection.
We present an estimation method to choose the hyper-parameter of the searching algorithm.
We experimentally evaluate the proposed algorithm on both synthetic and real datasets. The
empirical results confirm the effectiveness of our optimizations.
2 AIRSHIP: Attribute-Constrained Similarity Search
In this section, we introduce our solutions to the attribute-constrained similarity search problem.
We begin with a baseline adaptation of graph-based similarity search methods. Then, we discuss the
drawbacks of the baseline solution and propose our optimizations step by step. Each optimization is
built on top of the previous one.
2.1 Vanilla Graph-Based Similarity Search
Algorithm 1 illustrates the workflow of the vanilla graph-based similarity search. In Lines 1-3, a
random starting point S is selected and the basic data structures are initialized. In each iteration,
the vector that is closest to the query point is extracted from the priority queue (Line 5). The
algorithm terminates when we have obtained K candidates and the current searched vector is worse
than the worst vector in topk (Lines 6-8). The only difference from the original graph-based
similarity search algorithm lies in Lines 9-14: we only update the top-K candidates when the current
searched vector satisfies the query constraint. After that, Lines 15-21 iterate over all unvisited
neighbors of now_idx, add them to the priority queue, and mark them as visited.
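A minimal Python sketch of this baseline follows, assuming graph is a prebuilt proximity graph given as an adjacency map and f is the query's user-defined constraint. It mirrors the structure described above (best-first expansion, constraint check before updating the top-K, early termination), but is our own illustrative rendering rather than the authors' code:

```python
import heapq
import numpy as np

def vanilla_constrained_search(graph, V, q, f, K, start):
    """Baseline constrained search on a proximity graph.
    graph[i] lists the neighbor ids of vector i; only vectors passing
    the constraint f may enter the top-K result."""
    dist = lambda i: float(((V[i] - q) ** 2).sum())
    cand = [(dist(start), start)]   # min-heap of candidates to expand
    topk = []                       # max-heap via negated distances
    visited = {start}
    while cand:
        d, i = heapq.heappop(cand)  # closest unexpanded vector
        # Stop once we hold K results and this candidate cannot improve them.
        if len(topk) == K and d > -topk[0][0]:
            break
        if f(i):                    # the only change vs. plain graph search:
            heapq.heappush(topk, (-d, i))  # update top-K with satisfied only
            if len(topk) > K:
                heapq.heappop(topk)        # drop the farthest result
        for j in graph[i]:          # expand all unvisited neighbors
            if j not in visited:
                visited.add(j)
                heapq.heappush(cand, (dist(j), j))
    return [i for _, i in sorted((-d, i) for d, i in topk)]
```

On a toy chain graph where only even-indexed vectors satisfy the constraint, the search walks past the unsatisfied vectors and still returns the closest satisfied ones, which foreshadows the inefficiency discussed next: every unsatisfied vector on the way is still expanded.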
Note that this vanilla algorithm is inefficient when the query vector is not located near the
satisfied vectors. Figure 2(a) presents an example: the searching algorithm goes from the starting
point toward the query vector; however, no satisfied vectors can be found along that path. The
searching algorithm keeps expanding the search radius and cannot reach any satisfied vector until
it has enumerated many very similar but unsatisfied vectors.