1 Introduction
The task of searching for similar items is a standard routine in numerous industrial applications as
well as research problems in computer science and other fields. Finding the exact near (or nearest)
neighbors is often expensive, and an approximate solution is typically sufficient. Research
on efficient algorithms for approximate near neighbor (ANN) search dates back at least
to the 1970s (Friedman et al., 1975, 1977). In recent years, ANN algorithms have become a crucial
component of modern recommender systems (Davidson et al., 2010; Dror et al., 2011; Fan et al., 2019;
Kanakia et al., 2019), among many other application domains. In this paper, we study ANN algorithms
for an important application scenario in which the retrieved items must satisfy certain pre-specified
constraints. More specifically, we focus on graph-based ANN algorithms under constraints. We
should emphasize that, although this research problem was motivated by production needs at
Baidu, we only report experiments on public datasets.
1.1 Algorithms for Approximate Near Neighbor (ANN) Search
Broadly speaking, most approximate similarity search methods fall into the following five categories:
• Hash-based methods. Traditionally, hashing-based methods (Broder et al., 1997; Indyk and
Motwani, 1998; Charikar, 2002; Datar et al., 2004; Li and Church, 2005; Lv et al., 2007; Paulevé
et al., 2010; Shrivastava and Li, 2012, 2014; Wang and Dasgupta, 2016; Li, 2019; Li et al., 2021)
partition the base vectors into a constant number of buckets via a specific hash function.
The chosen hash function guarantees that the collision probability is positively correlated with the
similarity of two vectors, i.e., similar vectors are assigned to the same hash bucket with high
probability; a minimal sketch of this idea is given after this list. Relatively recently, learning-to-hash
methods (Kulis and Darrell, 2009; Grauman and Fergus, 2013; Lai et al., 2015; Wang et al., 2016a;
Dong et al., 2020) have become popular. They typically learn the hash function on a sample of the
data so that it fits the data distribution better than the classic data-independent hashing algorithms.
• Tree-based methods (Friedman et al., 1975, 1977; Jagadish et al., 2005; Cayton, 2008; Ram
and Gray, 2012; Curtin et al., 2013; Ram and Sinha, 2019) partition the high-dimensional data
space into smaller regions via a decision tree/space-partitioning tree. The partitioned regions
are organized within a tree: a leaf node corresponds to the finest granularity of the partitioned
space, and edges between tree nodes represent the hierarchical relationship of the partitioned
regions; a minimal kd-tree sketch is given after this list. Hamilton et al. (2020) recently proposed
a tree-based method that builds an inverted index for each label/attribute. Without prior knowledge
of the query constraints, however, it is infeasible to build an index for every label combination.
Our proposed solution performs the search on the original index and does not need to build any
additional indices. Moreover, we do not need to know the query constraints beforehand.
• Quantization-based algorithms (Jégou et al., 2011; Ge et al., 2013; Wang et al., 2016b; Wu
et al., 2017; Xu et al., 2018; Echihabi et al., 2021) learn a codebook from the base vectors and
quantize the original high-dimensional vectors into low-dimensional quantized values. Each
quantized value is a compact representation of the original vector, and the distances between
quantized vectors approximate the distances between the original vectors; a product-quantization
sketch is given after this list.
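To make the hash-based bucketing concrete, below is a minimal, self-contained sketch of classical data-independent LSH using random hyperplanes (SimHash; Charikar, 2002). The dimensions, bit count, and toy data are illustrative assumptions, not parameters from this paper.

```python
import numpy as np

def simhash_bucket(x, hyperplanes):
    """Map a vector to a hash bucket via random-hyperplane LSH (SimHash).

    Each hyperplane contributes one sign bit; vectors with a small angle
    between them agree on most bits and thus land in the same bucket
    with high probability.
    """
    bits = hyperplanes @ x > 0                      # one sign bit per hyperplane
    return int("".join("1" if b else "0" for b in bits), 2)

rng = np.random.default_rng(0)
dim, num_bits = 128, 16
planes = rng.standard_normal((num_bits, dim))       # data-independent hash

# Index: assign every base vector to its bucket.
buckets = {}
for i, v in enumerate(rng.standard_normal((1000, dim))):   # toy base vectors
    buckets.setdefault(simhash_bucket(v, planes), []).append(i)

# Query: only candidates in the query's bucket are scored exactly.
query = rng.standard_normal(dim)
candidates = buckets.get(simhash_bucket(query, planes), [])
```

For random hyperplanes, two vectors agree on each bit with probability 1 - θ/π, where θ is the angle between them, so more similar vectors collide more often.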
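The following sketch illustrates the space-partitioning idea behind tree-based methods with a hand-rolled kd-tree and a greedy, no-backtracking descent. It is a simplified illustration under toy assumptions (leaf size, data, dimensionality), not the construction used by any of the cited methods.

```python
import numpy as np

def build_kdtree(points, idx, depth=0, leaf_size=16):
    # Internal nodes split the current region at the median along one
    # coordinate axis; leaves are the finest cells of the partition.
    if len(idx) <= leaf_size:
        return ("leaf", idx)
    axis = depth % points.shape[1]
    order = idx[np.argsort(points[idx, axis])]
    mid = len(order) // 2
    return ("node", axis, points[order[mid], axis],
            build_kdtree(points, order[:mid], depth + 1, leaf_size),
            build_kdtree(points, order[mid:], depth + 1, leaf_size))

def leaf_candidates(node, query):
    # Greedy descent without backtracking: an *approximate* search that
    # returns only the points in the leaf cell containing the query.
    while node[0] == "node":
        _, axis, split, left, right = node
        node = left if query[axis] < split else right
    return node[1]

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 8))               # trees favor low dimensions
tree = build_kdtree(base, np.arange(len(base)))
query = rng.standard_normal(8)
cand = leaf_candidates(tree, query)
nearest = cand[np.argmin(((base[cand] - query) ** 2).sum(axis=1))]
```

Exact tree search would backtrack into sibling cells whose bounding regions may contain closer points; skipping that step is what makes the descent approximate but fast.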
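Finally, the sketch below illustrates product quantization in the spirit of Jégou et al. (2011): each vector is split into sub-vectors, one small k-means codebook is learned per subspace, and asymmetric distances are computed against the reconstructed codewords. The parameters (4 subspaces, 16 centroids, 10 Lloyd iterations) are toy assumptions; practical systems use larger codebooks and additional structures such as inverted file lists.

```python
import numpy as np

def train_pq(base, m=4, k=16, iters=10, seed=0):
    # Learn one k-means codebook per subspace: each vector is split into
    # m sub-vectors, and each sub-vector is quantized independently.
    rng = np.random.default_rng(seed)
    d = base.shape[1] // m
    books = []
    for s in range(m):
        sub = base[:, s * d:(s + 1) * d]
        cent = sub[rng.choice(len(sub), k, replace=False)]
        for _ in range(iters):                      # plain Lloyd iterations
            assign = np.argmin(((sub[:, None, :] - cent[None]) ** 2).sum(-1), 1)
            for c in range(k):
                if np.any(assign == c):
                    cent[c] = sub[assign == c].mean(0)
        books.append(cent)
    return books

def encode(x, books):
    # Code = one centroid id per subspace; m small integers replace d floats.
    d = len(x) // len(books)
    return [int(np.argmin(((c - x[s * d:(s + 1) * d]) ** 2).sum(-1)))
            for s, c in enumerate(books)]

def approx_dist2(query, code, books):
    # Asymmetric distance: compare the raw query against the quantized
    # base vector, summing per-subspace squared distances.
    d = len(query) // len(books)
    return sum(((books[s][code[s]] - query[s * d:(s + 1) * d]) ** 2).sum()
               for s in range(len(books)))

rng = np.random.default_rng(1)
base = rng.standard_normal((500, 32))
books = train_pq(base)
codes = [encode(v, books) for v in base]
q = rng.standard_normal(32)
best = min(range(len(base)), key=lambda i: approx_dist2(q, codes[i], books))
```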