SicHash — Small Irregular Cuckoo Tables for Perfect Hashing
Hans-Peter Lehmann*, Peter Sanders†, Stefan Walzer‡
Abstract
A Perfect Hash Function (PHF) is a hash function that
has no collisions on a given input set. PHFs can be
used for space efficient storage of data in an array, or for
determining a compact representative of each object in
the set. In this paper, we present the PHF construction
algorithm SicHash — Small Irregular Cuckoo Tables for
Perfect Hashing. At its core, SicHash uses a known
technique: It places objects in a cuckoo hash table
and then stores the final hash function choice of each
object in a retrieval data structure. We combine the
idea with irregular cuckoo hashing, where each object
has a different number of hash functions. Additionally,
we use many small tables that we overload beyond
their asymptotic maximum load factor. The most space
efficient competitors often use brute force methods to
determine the PHFs. SicHash provides a more direct
construction algorithm that only rarely needs to re-
compute parts. Our implementation improves the state
of the art in terms of space usage versus construction
time for a wide range of configurations. At the same
time, it provides very fast queries.
1 Introduction
A Perfect Hash Function (PHF) is a hash function that does not have collisions on a given set S of objects, i.e., that is injective. In this paper, we call the number of objects N and the size of the output range M. A PHF with M = N is called a Minimal Perfect Hash Function (MPHF). For M close to N, it is likely that an ordinary hash function has collisions. It is therefore necessary to store additional information, specific to the set S, that helps to avoid these collisions. The lower space bound for an MPHF is 1.44 bits/object [2]. PHFs can be represented with less space, depending on the load factor N/M. In the literature, load factors between 0.8 and 1.0 are common [2, 38].
PHFs can be used to implement a hash table where each query only needs to access a single cell. When we know that the hash table is never queried for objects x ∉ S, table cells can store the plain payload data without keys. This results in a retrieval data structure that can be updated efficiently. Refer to Section 2 for more details about retrieval data structures and an introduction to cuckoo hashing, which is a main building block of SicHash. There is a wide range of PHF construction algorithms, which we review extensively in Section 3.

* Karlsruhe Institute of Technology, Germany. † Karlsruhe Institute of Technology, Germany. ‡ Cologne University, Germany.
The basic idea of SicHash is to distribute the input
objects to a number of small buckets and build a cuckoo
hash table in each. To obtain a PHF, we store which
hash function index was finally used to place each object
by using a retrieval data structure. The reason for
constructing small tables is the use of overloading, which
we describe in Section 4. In Section 5, we explain SicHash
in detail, giving construction and query algorithms.
Enhancements of the basic scheme, including one that
produces MPHFs, are given in Section 6. We analyze
SicHash in Section 7. We then provide an extensive
experimental evaluation in Section 8, comparing it with
a wide range of competitors. In Section 9, we summarize
the results and discuss possible future work.
Comparison with Previous Approaches. The most space-efficient previous algorithms perform brute-force search as a core step to determine a perfect hash function. In particular, both CHD [2] and RecSplit [15] at some point try out random hash functions (on subproblems) until one happens to be injective. Refer to Section 3 for details. Our construction is more directed than this because it constructs cuckoo hash tables as its base case, which is possible in polynomial time. The directedness is also visible in the experiments, where our method can construct PHFs with the same space requirements significantly faster than the competitors.
Our Contribution. We combine and refine several known ideas in a novel way, leading to excellent space-construction time trade-offs with very low query time. We base SicHash on the known idea of PHF generation through cuckoo hashing. We use irregular cuckoo hashing, which was previously considered for reducing search time [10]. For that application it was of little help apart from allowing to interpolate between two uniform degree cases. In contrast, for our application of reducing space, it is helpful even for integer average degrees. Space is further reduced using the novel idea of overloading the cuckoo hash tables, i.e., loading them with more objects than would be possible in an asymptotic sense, exploiting that the tables are small. All this keeps the queries extremely simple — basically the cost of a single access to a retrieval data structure. This further profits from recent advances on fast static retrieval data structures with virtually no space overhead [13]. In turn, our PHFs can be used to obtain improved updateable retrieval data structures as discussed before.

arXiv:2210.01560v2 [cs.DS] 8 Nov 2022
2 Preliminaries
Cuckoo Hashing.
Cuckoo hashing [
37
] is a well
known approach to hash tables with open addressing.
In a basic cuckoo hash table, each object can be placed
in one of two cells, determined by two hash functions.
Queries load the two candidate cells and compare both
objects. Insertion applies one of the hash functions and
places the new object in the corresponding cell. If the
cell is already occupied, the object previously placed in
that cell is pushed out and is recursively inserted using
its other hash function.
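The displacement loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation; the CRC-based `_h` is an assumption of the sketch, standing in for the two random hash functions.

```python
import zlib

def _h(obj, salt, m):
    # Deterministic stand-in for a random hash function (an assumption of
    # this sketch; any good hash family works here).
    return zlib.crc32(f"{salt}:{obj}".encode()) % m

class BinaryCuckooTable:
    """Toy cuckoo hash table where each object has two candidate cells."""

    def __init__(self, num_cells, seed=0):
        self.m = num_cells
        self.seed = seed
        self.cells = [None] * num_cells  # each cell holds at most one object

    def _candidates(self, obj):
        return (_h(obj, f"{self.seed}a", self.m),
                _h(obj, f"{self.seed}b", self.m))

    def insert(self, obj, max_displacements=100):
        """Insert obj, displacing previously placed objects along a chain.
        Returns False if the chain gets too long (construction failure)."""
        cell = self._candidates(obj)[0]
        for _ in range(max_displacements):
            obj, self.cells[cell] = self.cells[cell], obj  # kick out occupant
            if obj is None:
                return True
            c1, c2 = self._candidates(obj)
            cell = c2 if cell == c1 else c1  # retry in the *other* cell
        return False

    def contains(self, obj):
        c1, c2 = self._candidates(obj)
        return self.cells[c1] == obj or self.cells[c2] == obj
```

At load factors below the 50% threshold discussed in Section 3.1, such insertions succeed with high probability.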
Instead of locating each object in one of two cells, the idea can be generalized to d cells [17] by using d hash functions. In irregular cuckoo hash tables, different objects can have a different number of choices [10]. For example, some percentage of the objects get d1 choices, some d2 choices, and some d3 choices. Averaging over the di, the method enables d-ary cuckoo hashing with non-integer d and higher load factors than a simple interpolation between two ordinary cuckoo hash tables [10].
Retrieval Data Structures. A retrieval data structure or static function on a set S of objects describes a function f : S → {0,1}^r that returns a specific r-bit value for each object. Applying the function to an object not in S can return an arbitrary value. The lower bound on the space requirement of a retrieval data structure is rN bits. The best retrieval data structures now come very close to this lower bound (around 1% overhead) and are also quite fast [13].
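One classic way to realize such a structure is to map each object to three cells and store each value as the XOR of those three cells, solving the resulting linear system by peeling. The following is a simplified sketch with roughly 35% space overhead, not the near-optimal construction of [13]; `_positions` is a hypothetical hash mapping introduced for the sketch.

```python
import zlib

SEGMENTS = 3  # three cells per key, one in each table segment

def _positions(key, seed, seg_size):
    # One cell per segment; the segments make the three positions distinct.
    return [i * seg_size + zlib.crc32(f"{seed}:{i}:{key}".encode()) % seg_size
            for i in range(SEGMENTS)]

def build_retrieval(values, seed=0, overhead=1.35):
    """Build a static function key -> value (small int) by peeling.
    Returns the cell array, or None if peeling fails (retry new seed)."""
    seg = max(1, int(overhead * len(values) / SEGMENTS) + 1)
    pos = {k: _positions(k, seed, seg) for k in values}
    cell_keys = [set() for _ in range(SEGMENTS * seg)]
    for k, ps in pos.items():
        for p in ps:
            cell_keys[p].add(k)
    # Peel: repeatedly detach a key that is alone in some cell.
    order = []
    stack = [c for c, ks in enumerate(cell_keys) if len(ks) == 1]
    while stack:
        c = stack.pop()
        if len(cell_keys[c]) != 1:
            continue  # stale stack entry
        (k,) = cell_keys[c]
        order.append((k, c))
        for p in pos[k]:
            cell_keys[p].discard(k)
            if len(cell_keys[p]) == 1:
                stack.append(p)
    if len(order) != len(values):
        return None
    cells = [0] * (SEGMENTS * seg)
    # Assign in reverse peeling order: each key owns one cell that no
    # earlier-assigned key touches, so its XOR can be fixed freely.
    for k, c in reversed(order):
        acc = 0
        for p in pos[k]:
            acc ^= cells[p]
        cells[c] ^= acc ^ values[k]
    return cells

def query_retrieval(cells, key, seed):
    seg = len(cells) // SEGMENTS
    acc = 0
    for p in _positions(key, seed, seg):
        acc ^= cells[p]
    return acc
```

Queries for keys outside the input set return an arbitrary XOR of three cells, matching the definition above.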
PHF Construction by Cuckoo Hashing. To the best of our knowledge, constructing PHFs through cuckoo hashing was only mentioned very briefly before [13]. In this paragraph, we give a more detailed and intuitive introduction to the idea. A related idea is the construction of PHFs by solving a matching as described by, e.g., Botelho et al. [5] and Navarro [36, Section 4.5.3].

Assume that all input objects are inserted into a d-ary cuckoo hash table. The table then implicitly describes an injective mapping from objects to table cells, because each cell only stores one object. For perfect hashing, we are not interested in storing the objects themselves but only in mapping objects to numbers, e.g., table cell indices. Because each object can only be placed in d cells using d hash functions h_i, we can remember the placement of each object by simply storing which of the hash functions was finally used to place the object. We can do that using only about log d bits¹ per object by constructing a retrieval data structure. A query for an object x then retrieves the hash function index i(x) and evaluates h_{i(x)}(x) to obtain a perfect hash function.
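The idea can be sketched end to end. The toy code below places objects into a d-ary cuckoo table by random-walk insertion and records the chosen hash index per object. For clarity, the indices are kept in a plain dict; the actual scheme stores them in a ~log2(d)-bit retrieval data structure. The hash family `h` is an assumption of the sketch.

```python
import random
import zlib

def h(obj, i, m, seed):
    # i-th hash function of obj into a table of size m (sketch assumption).
    return zlib.crc32(f"{seed}:{i}:{obj}".encode()) % m

def build_phf(objects, m, d=3, seed=0, max_steps=10000):
    """Random-walk d-ary cuckoo placement; returns {obj: hash index} or
    None on failure. The dict plays the role of the retrieval structure."""
    rng = random.Random(seed)
    cells = [None] * m   # occupant of each table cell
    choice = {}          # hash function index that finally placed each object
    for obj in objects:
        cur = obj
        placed = False
        for _ in range(max_steps):
            i = rng.randrange(d)
            c = h(cur, i, m, seed)
            cur, cells[c] = cells[c], cur  # place cur, evict old occupant
            choice[cells[c]] = i           # remember index of placed object
            if cur is None:
                placed = True
                break
        if not placed:
            return None  # random walk too long: construction failure
    return choice

def query_phf(choice, obj, m, seed):
    # Perfect hash value: evaluate the recorded hash function.
    return h(obj, choice[obj], m, seed)
```

Since every object ends up in its own cell, the query values are distinct, i.e., the function is perfect on the input set.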
Elias-Fano Coding. Elias-Fano coding [14, 16] is a way to efficiently store a monotonic sequence of N integers. It consists of two data structures, a bit vector H and an array L. An item at position i is split into two parts. The log N upper bits u are stored as a 1-bit in H[i + u]. The remaining lower bits are directly stored in L. Items can be accessed in constant time by finding the i-th 1-bit in H using a select1 data structure and by looking up the lower bits in L. The space usage of an Elias-Fano coded sequence is 2N + N⌈log(U/N)⌉ bits, where U is the maximum value of an item.
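A minimal sketch of encoding and random access follows. The number of lower bits approximates ⌊log(U/N)⌋, and the select1 step is a linear scan here; real implementations answer it in constant time with an auxiliary structure.

```python
def ef_encode(seq):
    """Elias-Fano encode a non-decreasing sequence of non-negative ints.
    Returns the upper-bits vector H, the lower-bits array L, and the
    number of lower bits per item."""
    n = len(seq)
    u = seq[-1] + 1  # upper bound U on the values
    low_bits = max(0, (u // n).bit_length() - 1)  # roughly log2(U/N)
    mask = (1 << low_bits) - 1
    lows = [x & mask for x in seq]
    H = [0] * (n + (seq[-1] >> low_bits) + 1)
    for i, x in enumerate(seq):
        H[i + (x >> low_bits)] = 1  # the i-th item sets exactly one bit
    return H, lows, low_bits

def ef_access(H, lows, low_bits, i):
    """Item i = ((select1(H, i) - i) << low_bits) | lows[i]."""
    ones = -1
    for p, bit in enumerate(H):  # linear select1 for clarity
        if bit:
            ones += 1
            if ones == i:
                return ((p - i) << low_bits) | lows[i]
    raise IndexError(i)
```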
Golomb-Rice Coding. Golomb coding [23] with parameter k can be used to store a sequence of integers that have a geometric distribution. The idea is to store each integer x as a quotient q = ⌊x/k⌋ in unary coding and a remainder x − qk in truncated binary coding. Rice coding [39] is Golomb coding where k is a power of 2. This makes arithmetic more efficient and simplifies storing the remainder in a normal array with binary coding. Items can be accessed in constant time by looking up the array and reconstructing the quotient using a select1 query.
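For illustration, a Rice coder with divisor 2**k can be sketched as follows; a Python list stands in for the unary bit stream, and the select1 query is again a linear scan.

```python
def rice_encode(values, k):
    """Rice-encode non-negative ints with divisor 2**k: unary-coded
    quotients go to a bit list, k-bit remainders to a plain array."""
    unary, remainders = [], []
    for x in values:
        q, r = x >> k, x & ((1 << k) - 1)
        unary.extend([0] * q + [1])  # q zeros, then a terminating 1-bit
        remainders.append(r)
    return unary, remainders

def rice_access(unary, remainders, k, i):
    """Random access to the i-th value. The quotient is the number of
    zeros between the (i-1)-th and i-th 1-bits; a select1 structure
    makes this O(1), here we scan."""
    ones, prev = -1, -1
    for p, bit in enumerate(unary):
        if bit:
            ones += 1
            if ones == i:
                q = p - prev - 1  # zeros since the previous 1-bit
                return (q << k) | remainders[i]
            prev = p
    raise IndexError(i)
```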
3 Related Work
In the following, we first describe variants and enhance-
ments of cuckoo hashing from the literature. Afterwards,
we introduce existing PHFs, most of which we later in-
clude in our experimental evaluation (see Section 8).
3.1 Cuckoo Hashing.
After describing variants of
cuckoo hashing, we describe construction algorithms and
maximum load factors.
Variants. Higher maximum load factors can be achieved by making the cells larger, so that they hold more than one object [12]. When additionally allowing the cells to overlap [31], even higher load factors are possible [42]. For our application to perfect hashing, we only consider cells of size 1. In external memory, I/Os can be reduced by choosing candidate cells on the same page [11]. Maintaining two tables [37] of asymmetric size [30] can improve the search time because more objects can be placed using their first hash function. Giving each object d instead of 2 choices [17] increases the maximum load factor. Irregular cuckoo hashing [10] uses a different number of hash functions for each object. A similar idea can also be found in coding theory, where each message bit is covered by an irregular number of check bits [34]. Specifically, the probability that a message bit is covered by i check bits is proportional to 1/i. Another related result is the weighted Bloom filter [7], where objects get a different number of hash functions (and therefore a different false positive probability) based on their query frequency and membership likelihood.

¹ Throughout this paper, log x stands for log₂ x.
Construction. The enhancement to d-ary cuckoo hashing [17] makes insertions more complex because it is no longer clear which of the alternative cells to displace objects to. Common ways to perform insertion are to find a shortest move sequence by performing breadth-first search (BFS) in a graph defining possible object moves, or to perform a random walk in that graph. Both approaches need constant expected time when the table is not too highly loaded [17, 44, 21, 19, 26, 27].

In this paper, we are interested in the static case, where all objects to be stored are known from the start. In that case, it is also possible to construct the whole hash table at once instead of using incremental insertions. Let us model the cuckoo hash table as a bipartite graph. The first set of graph nodes is simply the set of input objects and the second set represents the table cells. Edges connect each object to its candidate cells, as determined by the d hash functions. A matching of size N then gives a collision-free assignment from objects to table cells. This can be calculated using, for example, the Hopcroft-Karp-Karzanov algorithm [24] or the LSA algorithm [26, 27].
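The matching formulation can be made concrete with a short augmenting-path search (Kuhn's algorithm). This is an illustrative sketch rather than the asymptotically faster algorithms named above.

```python
def cuckoo_matching(candidates):
    """candidates[j] = list of candidate cells of object j. Returns a list
    cell_of[j] assigning every object a distinct cell, or None if no
    perfect matching exists. Simple augmenting-path search."""
    owner = {}  # cell -> object currently matched to it

    def try_assign(j, visited):
        for c in candidates[j]:
            if c in visited:
                continue
            visited.add(c)
            # Take a free cell, or recursively re-seat its current owner.
            if c not in owner or try_assign(owner[c], visited):
                owner[c] = j
                return True
        return False

    for j in range(len(candidates)):
        if not try_assign(j, set()):
            return None
    cell_of = [None] * len(candidates)
    for c, j in owner.items():
        cell_of[j] = c
    return cell_of
```

The re-seating step mirrors cuckoo displacement: an augmenting path in the bipartite graph corresponds to a chain of object moves.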
Load Factors and Space Usage. Classic cuckoo hashing with d = 2 hash functions has a maximum load factor of at most N/M = 0.5. Using d = 4 hash functions already increases the maximum load factor to 0.9768 [18, 43]. In our construction, the load factor of the PHF equals the load factor of the cuckoo hash table, and the storage space is determined by the number of hash functions d. This means that a PHF from binary cuckoo hashing with a load factor of 0.5 can be represented using log 2 = 1 bit per object. A PHF with a load factor of 0.9768 can be implemented using 2 bits per object.

Ref. [10] gives maximum load factors for irregular cuckoo hashing, depending on the distribution of hash functions used. When looking at a specific average number of hash functions d′ ∈ ℝ, the best load factors are given by combining objects with ⌊d′⌋ and ⌈d′⌉ hash functions [10]. As we will see in Section 4, this is not the case in the context of PHFs because we are looking at the storage space instead of the average number of hash functions.
3.2 Perfect Hashing. The perfect hashing problem has been studied since the 1970s [41, 25, 6] and is still an active area of research. In the following paragraphs, we describe more recent papers.

Order-Preserving. An order-preserving PHF maintains the order that the input objects are given in. CHM [9] and BMZ [3] construct an undirected graph with edges {(h1(x), h2(x)) | x ∈ S}. They then assign a number to each vertex, such that for each edge, the sum of the numbers stored in adjacent vertices gives the desired PHF value. This can be done by assigning 0 to an arbitrary vertex and then performing depth-first search to assign all neighbors by simple subtraction. CHM and BMZ store 2.09N and 1.15N integer numbers, respectively, therefore needing O(N log N) space. Note that space near N log N can also be achieved by explicitly storing the rank of the objects in a retrieval data structure.
BDZ. In the BDZ algorithm [5], also called the RAM algorithm or the BPZ algorithm, each input object is mapped to an edge in a random hypergraph using d independent hash functions h_i. The hypergraph needs to be peelable² in order to continue. By peeling the graph, BDZ determines i(x), such that h_{i(x)}(x) is unique for each object x. It then uses a linear equation system to determine a function g, such that i(x) = (Σ_{0 ≤ i < d} g(h_i(x))) mod d.

Even though our presentation using cuckoo hashing sounds different, SicHash is similar to this idea. The BDZ algorithm's task of finding a unique h_{i(x)}(x) for each object can be interpreted as placing the objects in a cuckoo hash table. The function g serves as a retrieval data structure that maps each object to a hash function index i(x). The most important difference is that the BDZ algorithm couples retrieval and object placement by using the same set of hash functions. In particular, g is evaluated on the entire range M, so the space to store g depends on M. SicHash, in contrast, separates the two tasks of object placement and hash function retrieval. This enables using a retrieval data structure of size N instead. Moreover, SicHash uses irregular cuckoo hashing, which cannot be represented efficiently with the integrated retrieval data structure of BDZ. Finally, SicHash does not depend on peelability.
WBPM. Weaver et al. [45] describe an algorithm for calculating MPHFs that is based on weighted bipartite matchings (WBPM). The left set of the graph is determined by the M = N input objects and the right set is determined by the N possible hash values. The edges are determined by applying O(log N) hash functions to each object, where an edge determined from the i-th hash function has weight i. The weighted matching can be solved with a weight of 1.83N, giving an assignment from objects to hash values. For storing which hash function to use for each object, WBPM uses a 1-bit retrieval data structure. The keys to the retrieval data structure are tuples of object and hash function index. The stored value is 1 for the hash function to finally be used, and 0 for all smaller indices. The weight of 1.83N therefore also equals the space usage of the final data structure, except for overheads like prefix sums due to bucketing.

SicHash uses a similar structure but simplifies each of the ingredients. Instead of a weighted bipartite matching, SicHash (implicitly) solves an unweighted bipartite matching by constructing a cuckoo hash table. Instead of querying a 1-bit retrieval data structure multiple times for each hash function evaluation, SicHash only performs a single query to a retrieval data structure. While WBPM constructs a retrieval data structure consisting of 1.83N objects, SicHash generates retrieval data structures with a total of N objects, which makes the construction faster. While WBPM's space usage is competitive for MPHFs, constructing a non-minimal PHF is less efficient. With a load factor of 0.85, for example, SicHash achieves a space usage of 1.43N bits, while our preliminary experiments show that a matching like above has a weight of 1.54N. This stems from the fact that SicHash stores the selected hash function index using binary code, while WBPM effectively uses unary code.

² Possibility of obtaining a graph without edges by iteratively taking away edges that contain a node with degree 1 [5, 43].
CHD [2] / FCH [20] / PTHash [38]. The basic idea of this approach is to hash the input objects to a number of buckets of expected equal size. After sorting the buckets by their size, the methods search for a hash function that stores the entire bucket in the final output domain without causing collisions. CHD [2] tries out hash functions linearly, so that the data to store for each bucket can be compressed efficiently. With a load factor of 81%, CHD can construct a PHF using 1.4 bits per object. With a load factor of 99%, it achieves 1.98 bits per object. In FCH [20], the bucket sizes are asymmetric — using the default parameters, 60% of the objects are mapped to 30% of the buckets. After hashing the objects of a bucket, it searches for a rotation value that (mod M) places all objects into the output range without collisions. FCH produces MPHFs with about 2 bits per object. PTHash [38] combines the two ideas by using asymmetric buckets and trying hash functions linearly to enable a compressible representation. To speed up searching for the hash function in each bucket, it first generates a non-minimal PHF. Instead of using the well known rank trick [4] to convert a PHF to an MPHF, PTHash uses a new approach that enables faster queries: We can interpret the values in the output range M that no object is mapped to as holes. PTHash then stores an array of size M − N that re-maps each object with a hash value > N to one of the holes. Given that the holes are a monotonic sequence of integers, the array can be compressed with Elias-Fano coding. Because most objects are mapped to values ≤ N, the expensive lookups into the Elias-Fano sequence are only rarely needed.
MeraculousHash [8] / FiPHa [35] / BBHash [33]. This approach hashes the objects to γN buckets, for γ ∈ ℝ, γ ≥ 1. If a bucket has exactly one object mapped to it, that mapping is injective. Objects from buckets that hold more than one object are bumped to the next layer of the same data structure. For each layer, a bit vector of size γN indicates which buckets store a single object, i.e., where the recursive query can stop. Together with a rank data structure, that bit vector can also be used to make the resulting hash function minimal. This approach allows fast construction and queries but needs space of at least e bits per object [35] – more than 4 for really good speed.
RecSplit. RecSplit [15] distributes all objects to buckets of expected equal size. Within each bucket, RecSplit first searches for a hash function with binary output that splits the set of objects in half.³ This is repeated recursively until the set of objects has a small, configurable size. Usual values for the leaf size are about 8–12 objects. At a leaf, RecSplit then tries out hash functions until it finds one that is a bijection. For each bucket, RecSplit stores the hash function seeds that make up the splitting tree, the seeds for each leaf, and a prefix sum of the bucket sizes. There are configurations that need only 1.56 bits per object.
4 Overloading

In cuckoo hashing, the maximum load factor of a table is a widely studied subject (see Section 3.1). For example, it is well known that a table with d = 2 hash functions has a load threshold of 50%. For M → ∞, the probability of successful table construction with load factor > 50% approaches 0, while it approaches 1 for load factors < 50%.

Let us now look at a very small table of size M = 3 storing N = 2 objects. This table has a load factor of 66%, which is more than the load threshold. Still, the probability of successful construction, i.e., not all four hash function values being the same, is 1 − 3 · (1/3)^4 ≈ 96%. This shows that the load factors can be considerably higher than the asymptotic limits when using small tables. We call a table that contains more objects than the asymptotic limit overloaded.

To illustrate the achievable load factors, we incrementally construct cuckoo hash tables. Figure 1 gives a box plot for the load factors at which the insertion finally fails. It shows three fundamental observations that we use in SicHash to increase the load factor while decreasing the amount of memory needed.

³ The number of splits is larger for the bottom recursion layers.
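The success probability of this small example can be computed exactly by enumerating all 3^4 combinations of the four hash values:

```python
from fractions import Fraction
from itertools import product

# Table with M = 3 cells, N = 2 objects, two hash values each (a1, a2 for
# object A; b1, b2 for object B). Construction succeeds iff the objects
# can be placed in distinct cells, i.e., unless all four values coincide.
M = 3
success = sum(
    any(ca != cb for ca in (a1, a2) for cb in (b1, b2))
    for a1, a2, b1, b2 in product(range(M), repeat=4)
)
p = Fraction(success, M ** 4)
assert p == 1 - 3 * Fraction(1, 3) ** 4  # 78/81
print(float(p))  # ≈ 0.963, far above the asymptotic 50% threshold
```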