
structure. The keys to the retrieval data structure are
tuples of object and hash function index. The stored
value is 1 for the hash function that is finally used, and 0 for all smaller indices. The weight of 1.83N therefore also equals the space usage of the final data structure, except for overheads like prefix sums due to bucketing.
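To make the query side concrete, the following C++ sketch shows how a WBPM-based query could probe such a 1-bit retrieval data structure. It is illustrative only: the unordered_map stands in for a succinct retrieval structure, and retrieve() and hashWithSeed() are made-up helper names, not the actual interface.

```cpp
// Illustrative sketch only: probing the 1-bit retrieval data structure.
#include <cstdint>
#include <functional>
#include <unordered_map>

std::unordered_map<uint64_t, bool> retrievalBits; // stand-in, not succinct

uint64_t hashWithSeed(uint64_t object, uint64_t index, uint64_t range) {
    return std::hash<uint64_t>{}(object ^ (index * 0x9E3779B97F4A7C15ULL)) % range;
}

bool retrieve(uint64_t object, uint64_t index) {
    return retrievalBits.at(object * 64 + index); // toy (object, index) key
}

// Probe indices 0, 1, 2, ... until the stored bit is 1. The expected number
// of probes equals the matching weight per object (about 1.83).
uint64_t query(uint64_t object, uint64_t range) {
    for (uint64_t index = 0;; ++index)
        if (retrieve(object, index)) return hashWithSeed(object, index, range);
}
```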
SicHash uses a similar structure but simplifies each of
the ingredients. Instead of a weighted bipartite matching,
SicHash (implicitly) solves a non-weighted bipartite
matching by constructing a cuckoo hash table. Instead of
querying a 1-bit retrieval data structure multiple times
for each hash function evaluation, SicHash only performs
a single query to a retrieval data structure. While WBPM
constructs a retrieval data structure consisting of 1.83N objects, SicHash generates retrieval data structures with a total of N objects, which makes the construction faster.
While WBPM’s space usage is competitive for MPHFs,
constructing a non-minimal PHF is less efficient. With a
load factor of 0.85, for example, SicHash achieves a space usage of 1.43N bits, while our preliminary experiments show that a matching like the one above has a weight of 1.54N.
This stems from the fact that SicHash stores the selected
hash function index using binary code, while WBPM
effectively uses unary code.
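For contrast with the probing loop above, here is a sketch of a SicHash-style query under the same illustrative assumptions: a single retrieval query returns the chosen hash function index in binary code. This is not SicHash's actual interface.

```cpp
// Illustrative sketch only: one retrieval query, binary-coded index.
#include <cstdint>
#include <functional>
#include <unordered_map>

std::unordered_map<uint64_t, uint8_t> retrievalCode; // stand-in for a k-bit retrieval structure

uint64_t hashWithSeed(uint64_t object, uint64_t index, uint64_t range) {
    return std::hash<uint64_t>{}(object ^ (index * 0x9E3779B97F4A7C15ULL)) % range;
}

uint64_t query(uint64_t object, uint64_t range) {
    uint8_t index = retrievalCode.at(object);  // single retrieval query
    return hashWithSeed(object, index, range); // cell in the cuckoo hash table
}
```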
CHD [2] / FCH [20] / PTHash [38].
The basic
idea of this approach is to hash the input objects to
a number of buckets with expected equal size. After
sorting the buckets by their size, the methods search
for a hash function that stores the entire bucket in the
final output domain without causing collisions. CHD [2] tries out hash functions linearly, so that the data to store for each bucket can be compressed efficiently. With a load factor of 81%, CHD can construct a PHF using 1.4 bits per object. With a load factor of 99%, it achieves 1.98 bits per object. In FCH [20], the bucket sizes are
asymmetric — using the default parameters, 60% of the
objects are mapped to 30% of the buckets. After hashing
the objects of a bucket, it searches for a rotation value
that (mod M) places all objects into the output range
without collisions. FCH produces MPHFs with about
2 bits per object. PTHash [38] combines the two ideas
by using asymmetric buckets and trying hash functions
linearly to enable a compressible representation. To speed
up searching for the hash function in each bucket, it first
generates a non-minimal PHF. Instead of using the well-known rank trick [4] to convert a PHF to an MPHF, PTHash uses a new approach that enables faster queries: We can interpret the values in the output range M that no object is mapped to as holes. PTHash now stores an array of size M − N that re-maps each object with a hash value > N to one of the holes. Given that the holes are a monotonic sequence of integers, the array can be compressed with Elias-Fano coding. Because most objects are mapped to values ≤ N, the expensive lookups into the Elias-Fano sequence are only rarely needed.
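The following sketch illustrates this hole re-mapping idea under simplifying assumptions: a plain std::vector takes the place of an Elias-Fano sequence, pos[] is assumed to be some non-minimal PHF mapping N objects injectively into [0, M), and the helper names are hypothetical.

```cpp
// Illustrative sketch of hole re-mapping, not PTHash's actual code.
#include <cstdint>
#include <vector>

// Build the array of size M - N: remap[p - N] holds the hole in [0, N)
// assigned to position p. Repeating the previous value for unused entries
// keeps the array monotone, so Elias-Fano coding would apply.
std::vector<uint64_t> buildRemap(const std::vector<uint64_t>& pos,
                                 uint64_t N, uint64_t M) {
    std::vector<bool> taken(M, false);
    for (uint64_t p : pos) taken[p] = true;
    std::vector<uint64_t> holes; // values < N that no object is mapped to
    for (uint64_t v = 0; v < N; ++v)
        if (!taken[v]) holes.push_back(v);
    std::vector<uint64_t> remap(M - N, 0);
    uint64_t used = 0;
    for (uint64_t p = N; p < M; ++p) {
        if (taken[p]) ++used;  // this position consumes the next hole
        remap[p - N] = (used == 0) ? 0 : holes[used - 1];
    }
    return remap;
}

// Minimal query: values < N pass through; larger values go to holes.
uint64_t minimalQuery(uint64_t p, uint64_t N, const std::vector<uint64_t>& remap) {
    return p < N ? p : remap[p - N];
}
```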
MeraculousHash [8] / FiPHa [35] / BBHash [33].
This approach hashes the objects to γN buckets, for γ ∈ ℝ, γ ≥ 1. If a bucket has exactly one
object mapped to it, that mapping is injective. Objects
from buckets that hold more than one object are bumped
to the next layer of the same data structure. For each layer, a bit vector of size γN indicates which buckets store a single object, i.e., where the recursive query can stop. Together with a rank data structure, that bit vector can also be used to make the resulting hash function minimal. This approach allows fast construction and queries but needs at least e ≈ 2.72 bits of space per object [35], and more than 4 bits for really good speed.
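A minimal sketch of this layered construction, assuming a plain std::vector<bool> in place of a rank-enabled bit vector; hashLayer() is a made-up helper, not the API of any of the cited implementations.

```cpp
// Illustrative sketch of the layered bumping approach.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

uint64_t hashLayer(uint64_t object, uint64_t layer, uint64_t range) {
    return std::hash<uint64_t>{}(object ^ (layer * 0x9E3779B97F4A7C15ULL)) % range;
}

// Per layer: keep objects that land alone in a bucket, bump the rest.
std::vector<std::vector<bool>> build(std::vector<uint64_t> objects, double gamma) {
    std::vector<std::vector<bool>> layers;
    for (uint64_t layer = 0; !objects.empty(); ++layer) {
        uint64_t m = std::max<uint64_t>(1, uint64_t(gamma * objects.size()));
        std::vector<uint32_t> count(m, 0);
        for (uint64_t x : objects) ++count[hashLayer(x, layer, m)];
        std::vector<bool> bits(m);
        for (uint64_t i = 0; i < m; ++i) bits[i] = (count[i] == 1);
        std::vector<uint64_t> bumped;
        for (uint64_t x : objects)
            if (!bits[hashLayer(x, layer, m)]) bumped.push_back(x);
        layers.push_back(std::move(bits));
        objects = std::move(bumped);
    }
    return layers;
}

// Walk the layers until the bit vector signals a singleton bucket. A real
// implementation returns the rank of that bit to obtain a minimal function.
uint64_t queryLayer(const std::vector<std::vector<bool>>& layers, uint64_t x) {
    for (uint64_t layer = 0; layer < layers.size(); ++layer) {
        uint64_t b = hashLayer(x, layer, layers[layer].size());
        if (layers[layer][b]) return layer;
    }
    return ~uint64_t(0); // not reached for objects in the input set
}
```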
RecSplit.
RecSplit [15] distributes all objects to buckets of expected equal size. Within each bucket, RecSplit first searches for a hash function with binary output that splits the set of objects in half.³ This is
repeated recursively until the set of objects has a small,
configurable size. Usual values for the leaf size are about
8–12 objects. At a leaf, RecSplit then tries out hash
functions until it finds one that is a bijection. For each
bucket, RecSplit stores the hash function seeds that make
up the splitting tree, the seeds for each leaf, and a prefix
sum of the bucket sizes. There are configurations that
need only 1.56 bits per object.
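A sketch of the recursive seed search, assuming fanout 2 on all levels (per the footnote, RecSplit actually uses larger fanouts in the bottom layers); hashSeed() is a hypothetical helper, not RecSplit's code.

```cpp
// Illustrative sketch of RecSplit-style recursive splitting.
#include <cstdint>
#include <functional>
#include <vector>

uint64_t hashSeed(uint64_t object, uint64_t seed, uint64_t range) {
    return std::hash<uint64_t>{}(object ^ (seed * 0x9E3779B97F4A7C15ULL)) % range;
}

// Appends the seeds of the splitting tree in preorder, then the leaf seeds.
// Storing these seeds (plus bucket prefix sums) is what gets written out.
void recSplit(std::vector<uint64_t> set, size_t leafSize, std::vector<uint64_t>& seeds) {
    if (set.size() <= leafSize) {
        for (uint64_t seed = 0;; ++seed) { // search for a bijection on the leaf
            std::vector<bool> hit(set.size(), false);
            bool ok = true;
            for (uint64_t x : set) {
                uint64_t h = hashSeed(x, seed, set.size());
                if (hit[h]) { ok = false; break; }
                hit[h] = true;
            }
            if (ok) { seeds.push_back(seed); return; }
        }
    }
    size_t half = set.size() / 2;
    for (uint64_t seed = 0;; ++seed) { // search for a seed splitting in half
        std::vector<uint64_t> left, right;
        for (uint64_t x : set)
            (hashSeed(x, seed, 2) == 0 ? left : right).push_back(x);
        if (left.size() == half) {
            seeds.push_back(seed);
            recSplit(std::move(left), leafSize, seeds);
            recSplit(std::move(right), leafSize, seeds);
            return;
        }
    }
}
```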
4 Overloading
In cuckoo hashing, the maximum load factor of a table is
a widely studied subject (see Section 3.1). For example,
it is well known that a table with d = 2 hash functions has a load threshold of 50%. For M → ∞, the probability of successful table construction with load factor > 50% approaches 0, while it approaches 1 for load factors < 50%.
Let us now look at a very small table of size M = 3 storing N = 2 objects. This table has a load factor of ≈ 66%, which is more than the load threshold. Still, the probability of successful construction, i.e., not all four hash function values being the same, is 1 − 3 · (1/3)⁴ ≈ 96%.
This shows that the load factors can be considerably
higher than the asymptotic limits when using small tables.
We call a table that contains more objects than the
asymptotic limit overloaded.
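The probability above can be verified by exhaustively enumerating all 3⁴ = 81 combinations of the four hash values; a small self-contained check:

```cpp
// Exhaustive check of the example: M = 3 cells, N = 2 objects with two hash
// values each. Construction fails only if all four values coincide.
#include <cstdio>

int main() {
    int success = 0, total = 0;
    for (int a = 0; a < 3; ++a) for (int b = 0; b < 3; ++b)   // object 1
    for (int c = 0; c < 3; ++c) for (int d = 0; d < 3; ++d) { // object 2
        ++total;
        if (!(a == b && b == c && c == d)) ++success;
    }
    std::printf("%d/%d = %.1f%%\n", success, total, 100.0 * success / total);
    // Prints 78/81 = 96.3%, matching 1 - 3 * (1/3)^4.
}
```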
To illustrate the achievable load factors, we incrementally construct cuckoo hash tables. Figure 1 gives a box plot of the load factors at which the insertion finally fails.
It shows three fundamental observations that we use in
SicHash to increase the load factor while decreasing the
amount of memory needed.
³ The number of splits is larger for the bottom recursion layers.