SicHash — Small Irregular Cuckoo Tables for Perfect Hashing
Hans-Peter Lehmann*, Peter Sanders†, Stefan Walzer‡
Abstract
A Perfect Hash Function (PHF) is a hash function that
has no collisions on a given input set. PHFs can be
used for space efficient storage of data in an array, or for
determining a compact representative of each object in
the set. In this paper, we present the PHF construction
algorithm SicHash — Small Irregular Cuckoo Tables for
Perfect Hashing. At its core, SicHash uses a known
technique: It places objects in a cuckoo hash table
and then stores the final hash function choice of each
object in a retrieval data structure. We combine the
idea with irregular cuckoo hashing, where each object
has a different number of hash functions. Additionally,
we use many small tables that we overload beyond
their asymptotic maximum load factor. The most space
efficient competitors often use brute force methods to
determine the PHFs. SicHash provides a more direct
construction algorithm that only rarely needs to re-
compute parts. Our implementation improves the state
of the art in terms of space usage versus construction
time for a wide range of configurations. At the same
time, it provides very fast queries.
1 Introduction
A Perfect Hash Function (PHF) is a hash function that does not have collisions on a given set S of objects, i.e., that is injective. In this paper, we call the number of objects N and the size of the output range M. A PHF with M = N is called a Minimal Perfect Hash Function (MPHF). For M close to N, it is likely that an ordinary hash function has collisions. It is therefore necessary to store additional information, specific to the set S, that helps to avoid these collisions. The lower space bound for an MPHF is 1.44 bits/object [2]. PHFs can be represented with less space, depending on the load factor N/M. In the literature, load factors between 0.8 and 1.0 are common [2, 38].
PHFs can be used to implement a hash table where each query only needs to access a single cell. When we know that the hash table is never queried for objects x ∉ S, table cells can store the plain payload data without keys. This results in a retrieval data structure that can be updated efficiently. Refer to Section 2 for more details about retrieval data structures and an introduction to cuckoo hashing, which is a main building block of SicHash. There is a wide range of PHF construction algorithms, which we review extensively in Section 3.

* Karlsruhe Institute of Technology, Germany. † Karlsruhe Institute of Technology, Germany. ‡ Cologne University, Germany.
The basic idea of SicHash is to distribute the input
objects to a number of small buckets and build a cuckoo
hash table in each. To obtain a PHF, we store which
hash function index was finally used to place each object
by using a retrieval data structure. The reason for
constructing small tables is the use of overloading, which
we describe in Section 4. In Section 5, we explain SicHash
in detail, giving construction and query algorithms.
Enhancements of the basic scheme, including one that
produces MPHFs, are given in Section 6. We analyze
SicHash in Section 7. We then provide an extensive
experimental evaluation in Section 8, comparing it with
a wide range of competitors. In Section 9, we summarize
the results and discuss possible future work.
Comparison with Previous Approaches. The most space-efficient previous algorithms perform brute-force search as a core step to determine a perfect hash function. In particular, both CHD [2] and RecSplit [15] at some point try out random hash functions (on subproblems) until one happens to be injective. Refer to Section 3 for details. Our construction is more directed than this because it constructs cuckoo hash tables as its base case, which is possible in polynomial time. The directedness is also visible in the experiments, where our method can construct PHFs with the same space requirements significantly faster than the competitors.
Our Contribution. We combine and refine several known ideas in a novel way, leading to excellent space-construction time trade-offs with very low query time. We base SicHash on the known idea of PHF generation through cuckoo hashing. We use irregular cuckoo hashing, which was previously considered for reducing search time [10]. For that application it was of little help apart from allowing to interpolate between two uniform degree cases. In contrast, for our application of reducing space, it is helpful even for integer average degrees. Space is further reduced using the novel idea of overloading the cuckoo hash tables, i.e., loading them with more objects than would be possible in an asymptotic sense, exploiting that the tables are small. All this keeps the queries extremely simple — basically the cost of a single access to a retrieval data structure. This further profits from recent advances on fast static retrieval data structures with virtually no space overhead [13]. In turn, our PHFs can be used to obtain improved updateable retrieval data structures as discussed before.

arXiv:2210.01560v2 [cs.DS] 8 Nov 2022
2 Preliminaries
Cuckoo Hashing.
Cuckoo hashing [
37
] is a well
known approach to hash tables with open addressing.
In a basic cuckoo hash table, each object can be placed
in one of two cells, determined by two hash functions.
Queries load the two candidate cells and compare both
objects. Insertion applies one of the hash functions and
places the new object in the corresponding cell. If the
cell is already occupied, the object previously placed in
that cell is pushed out and is recursively inserted using
its other hash function.
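The displacement loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation; the CRC-based `_h` is an assumption of the sketch, standing in for the two random hash functions.

```python
import zlib

def _h(obj, salt, m):
    # Deterministic stand-in for a random hash function (an assumption of
    # this sketch; any good hash family works here).
    return zlib.crc32(f"{salt}:{obj}".encode()) % m

class BinaryCuckooTable:
    """Toy cuckoo hash table where each object has two candidate cells."""

    def __init__(self, num_cells, seed=0):
        self.m = num_cells
        self.seed = seed
        self.cells = [None] * num_cells  # each cell holds at most one object

    def _candidates(self, obj):
        return (_h(obj, f"{self.seed}a", self.m),
                _h(obj, f"{self.seed}b", self.m))

    def insert(self, obj, max_displacements=100):
        """Insert obj, displacing previously placed objects along a chain.
        Returns False if the chain gets too long (construction failure)."""
        cell = self._candidates(obj)[0]
        for _ in range(max_displacements):
            obj, self.cells[cell] = self.cells[cell], obj  # kick out occupant
            if obj is None:
                return True
            c1, c2 = self._candidates(obj)
            cell = c2 if cell == c1 else c1  # retry in the *other* cell
        return False

    def contains(self, obj):
        c1, c2 = self._candidates(obj)
        return self.cells[c1] == obj or self.cells[c2] == obj
```

At load factors below the 50% threshold discussed in Section 3.1, such insertions succeed with high probability.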
Instead of locating each object in one of two cells, the idea can be generalized to d cells [17] by using d hash functions. In irregular cuckoo hash tables, different objects can have a different number of choices [10]. For example, some percentage of the objects get d1 choices, some d2 choices, and some d3 choices. Averaging over the di, the method enables d-ary cuckoo hashing with non-integer d and higher load factors than a simple interpolation between two ordinary cuckoo hash tables [10].
Retrieval Data Structures. A retrieval data structure or static function on a set S of objects describes a function f : S → {0,1}^r that returns a specific r-bit value for each object. Applying the function to an object not in S can return an arbitrary value. The lower bound on the space requirement of a retrieval data structure is rN bits. The best retrieval data structures now come very close to this lower bound (around 1% overhead) and are also quite fast [13].
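One classic way to realize such a structure is to map each object to three cells and store each value as the XOR of those three cells, solving the resulting linear system by peeling. The following is a simplified sketch with roughly 35% space overhead, not the near-optimal construction of [13]; `_positions` is a hypothetical hash mapping introduced for the sketch.

```python
import zlib

SEGMENTS = 3  # three cells per key, one in each table segment

def _positions(key, seed, seg_size):
    # One cell per segment; the segments make the three positions distinct.
    return [i * seg_size + zlib.crc32(f"{seed}:{i}:{key}".encode()) % seg_size
            for i in range(SEGMENTS)]

def build_retrieval(values, seed=0, overhead=1.35):
    """Build a static function key -> value (small int) by peeling.
    Returns the cell array, or None if peeling fails (retry new seed)."""
    seg = max(1, int(overhead * len(values) / SEGMENTS) + 1)
    pos = {k: _positions(k, seed, seg) for k in values}
    cell_keys = [set() for _ in range(SEGMENTS * seg)]
    for k, ps in pos.items():
        for p in ps:
            cell_keys[p].add(k)
    # Peel: repeatedly detach a key that is alone in some cell.
    order = []
    stack = [c for c, ks in enumerate(cell_keys) if len(ks) == 1]
    while stack:
        c = stack.pop()
        if len(cell_keys[c]) != 1:
            continue  # stale stack entry
        (k,) = cell_keys[c]
        order.append((k, c))
        for p in pos[k]:
            cell_keys[p].discard(k)
            if len(cell_keys[p]) == 1:
                stack.append(p)
    if len(order) != len(values):
        return None
    cells = [0] * (SEGMENTS * seg)
    # Assign in reverse peeling order: each key owns one cell that no
    # earlier-assigned key touches, so its XOR can be fixed freely.
    for k, c in reversed(order):
        acc = 0
        for p in pos[k]:
            acc ^= cells[p]
        cells[c] ^= acc ^ values[k]
    return cells

def query_retrieval(cells, key, seed):
    seg = len(cells) // SEGMENTS
    acc = 0
    for p in _positions(key, seed, seg):
        acc ^= cells[p]
    return acc
```

Queries for keys outside the input set return an arbitrary XOR of three cells, matching the definition above.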
PHF Construction by Cuckoo Hashing. To the best of our knowledge, constructing PHFs through cuckoo hashing was only mentioned very briefly before [13]. In this paragraph, we give a more detailed and intuitive introduction to the idea. A related idea is the construction of PHFs by solving a matching as described by, e.g., Botelho et al. [5] and Navarro [36, Section 4.5.3].

Assume that all input objects are inserted into a d-ary cuckoo hash table. The table then implicitly describes an injective mapping from objects to table cells, because each cell only stores one object. For perfect hashing, we are not interested in storing the objects themselves but only in mapping objects to numbers, e.g., table cell indices. Because each object can only be placed in d cells using d hash functions h_i, we can remember the placement of each object by simply storing which of the hash functions was finally used to place the object. We can do that using only about log d bits¹ per object by constructing a retrieval data structure. A query for an object x then retrieves the hash function index i(x) and evaluates h_{i(x)}(x) to obtain a perfect hash function.
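The idea can be sketched end to end. The toy code below places objects into a d-ary cuckoo table by random-walk insertion and records the chosen hash index per object. For clarity, the indices are kept in a plain dict; the actual scheme stores them in a ~log2(d)-bit retrieval data structure. The hash family `h` is an assumption of the sketch.

```python
import random
import zlib

def h(obj, i, m, seed):
    # i-th hash function of obj into a table of size m (sketch assumption).
    return zlib.crc32(f"{seed}:{i}:{obj}".encode()) % m

def build_phf(objects, m, d=3, seed=0, max_steps=10000):
    """Random-walk d-ary cuckoo placement; returns {obj: hash index} or
    None on failure. The dict plays the role of the retrieval structure."""
    rng = random.Random(seed)
    cells = [None] * m   # occupant of each table cell
    choice = {}          # hash function index that finally placed each object
    for obj in objects:
        cur = obj
        placed = False
        for _ in range(max_steps):
            i = rng.randrange(d)
            c = h(cur, i, m, seed)
            cur, cells[c] = cells[c], cur  # place cur, evict old occupant
            choice[cells[c]] = i           # remember index of placed object
            if cur is None:
                placed = True
                break
        if not placed:
            return None  # random walk too long: construction failure
    return choice

def query_phf(choice, obj, m, seed):
    # Perfect hash value: evaluate the recorded hash function.
    return h(obj, choice[obj], m, seed)
```

Since every object ends up in its own cell, the query values are distinct, i.e., the function is perfect on the input set.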
Elias-Fano Coding. Elias-Fano coding [14, 16] is a way to efficiently store a monotonic sequence of N integers. It consists of two data structures, a bit vector H and an array L. An item at position i is split into two parts. The log N upper bits u are stored as a 1-bit in H[i + u]. The remaining lower bits are directly stored in L. Items can be accessed in constant time by finding the i-th 1-bit in H using a select1 data structure and by looking up the lower bits in L. The space usage of an Elias-Fano coded sequence is 2N + N⌈log(U/N)⌉ bits, where U is the maximum value of an item.
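A minimal sketch of encoding and random access follows. The number of lower bits approximates ⌊log(U/N)⌋, and the select1 step is a linear scan here; real implementations answer it in constant time with an auxiliary structure.

```python
def ef_encode(seq):
    """Elias-Fano encode a non-decreasing sequence of non-negative ints.
    Returns the upper-bits vector H, the lower-bits array L, and the
    number of lower bits per item."""
    n = len(seq)
    u = seq[-1] + 1  # upper bound U on the values
    low_bits = max(0, (u // n).bit_length() - 1)  # roughly log2(U/N)
    mask = (1 << low_bits) - 1
    lows = [x & mask for x in seq]
    H = [0] * (n + (seq[-1] >> low_bits) + 1)
    for i, x in enumerate(seq):
        H[i + (x >> low_bits)] = 1  # the i-th item sets exactly one bit
    return H, lows, low_bits

def ef_access(H, lows, low_bits, i):
    """Item i = ((select1(H, i) - i) << low_bits) | lows[i]."""
    ones = -1
    for p, bit in enumerate(H):  # linear select1 for clarity
        if bit:
            ones += 1
            if ones == i:
                return ((p - i) << low_bits) | lows[i]
    raise IndexError(i)
```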
Golomb-Rice Coding. Golomb coding [23] with parameter k can be used to store a sequence of integers that have a geometric distribution. The idea is to store each integer x as a quotient q = ⌊x/k⌋ in unary coding and a remainder x − qk in truncated binary coding. Rice coding [39] is Golomb coding where k is a power of 2. This makes arithmetic more efficient and simplifies storing the remainder in a normal array with binary coding. Items can be accessed in constant time by looking up the array and reconstructing the quotient using a select1 query.
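For illustration, a Rice coder with divisor 2**k can be sketched as follows; a Python list stands in for the unary bit stream, and the select1 query is again a linear scan.

```python
def rice_encode(values, k):
    """Rice-encode non-negative ints with divisor 2**k: unary-coded
    quotients go to a bit list, k-bit remainders to a plain array."""
    unary, remainders = [], []
    for x in values:
        q, r = x >> k, x & ((1 << k) - 1)
        unary.extend([0] * q + [1])  # q zeros, then a terminating 1-bit
        remainders.append(r)
    return unary, remainders

def rice_access(unary, remainders, k, i):
    """Random access to the i-th value. The quotient is the number of
    zeros between the (i-1)-th and i-th 1-bits; a select1 structure
    makes this O(1), here we scan."""
    ones, prev = -1, -1
    for p, bit in enumerate(unary):
        if bit:
            ones += 1
            if ones == i:
                q = p - prev - 1  # zeros since the previous 1-bit
                return (q << k) | remainders[i]
            prev = p
    raise IndexError(i)
```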
3 Related Work
In the following, we first describe variants and enhance-
ments of cuckoo hashing from the literature. Afterwards,
we introduce existing PHFs, most of which we later in-
clude in our experimental evaluation (see Section 8).
3.1 Cuckoo Hashing.
After describing variants of
cuckoo hashing, we describe construction algorithms and
maximum load factors.
Variants. Higher maximum load factors can be achieved by making the cells larger, so that they hold more than one object [12]. When additionally allowing the cells to overlap [31], even higher load factors are possible [42]. For our application to perfect hashing, we only consider cells of size 1. In external memory, I/Os can be reduced by choosing candidate cells on the same page [11]. Maintaining two tables [37] of asymmetric size [30] can improve the search time because more objects can be placed using their first hash function. Giving each object d instead of 2 choices [17] increases the maximum load factor. Irregular cuckoo hashing [10] uses a different number of hash functions for each object. A similar idea can also be found in coding theory, where each message bit is covered by an irregular number of check bits [34]. Specifically, the probability that a message bit is covered by i check bits is proportional to 1/i. Another related result is the weighted Bloom filter [7], where objects get a different number of hash functions (and therefore a different false positive probability) based on their query frequency and membership likelihood.

¹ Throughout this paper, log x stands for log₂ x.
Construction. The enhancement to d-ary cuckoo hashing [17] makes insertions more complex because it is no longer clear which of the alternative cells to displace objects to. Common ways to perform insertion are to find a shortest move sequence by performing breadth-first search (BFS) in a graph defining possible object moves, or to perform a random walk in that graph. Both approaches need constant expected time when the table is not too highly loaded [17, 44, 21, 19, 26, 27].

In this paper, we are interested in the static case, where all objects to be stored are known from the start. In that case, it is also possible to construct the whole hash table at once instead of using incremental insertions. Let us model the cuckoo hash table as a bipartite graph. The first set of graph nodes is simply the set of input objects and the second set represents the table cells. Edges connect each object to its candidate cells, as determined by the d hash functions. A matching of size N then gives a collision-free assignment from objects to table cells. This can be calculated using, for example, the Hopcroft-Karp-Karzanov algorithm [24] or the LSA algorithm [26, 27].
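The matching formulation can be made concrete with a short augmenting-path search (Kuhn's algorithm). This is an illustrative sketch rather than the asymptotically faster algorithms named above.

```python
def cuckoo_matching(candidates):
    """candidates[j] = list of candidate cells of object j. Returns a list
    cell_of[j] assigning every object a distinct cell, or None if no
    perfect matching exists. Simple augmenting-path search."""
    owner = {}  # cell -> object currently matched to it

    def try_assign(j, visited):
        for c in candidates[j]:
            if c in visited:
                continue
            visited.add(c)
            # Take a free cell, or recursively re-seat its current owner.
            if c not in owner or try_assign(owner[c], visited):
                owner[c] = j
                return True
        return False

    for j in range(len(candidates)):
        if not try_assign(j, set()):
            return None
    cell_of = [None] * len(candidates)
    for c, j in owner.items():
        cell_of[j] = c
    return cell_of
```

The re-seating step mirrors cuckoo displacement: an augmenting path in the bipartite graph corresponds to a chain of object moves.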
Load Factors and Space Usage. Classic cuckoo hashing with d = 2 hash functions has a maximum load factor of at most N/M = 0.5. Using d = 4 hash functions already increases the maximum load factor to 0.9768 [18, 43]. In our construction, the load factor of the PHF equals the load factor of the cuckoo hash table, and the storage space is determined by the number of hash functions d. This means that a PHF from binary cuckoo hashing with a load factor of 0.5 can be represented using log 2 = 1 bit per object. A PHF with a load factor of 0.9768 can be implemented using 2 bits per object.

Ref. [10] gives maximum load factors for irregular cuckoo hashing, depending on the distribution of hash functions used. When looking at a specific average number of hash functions d′ ∈ ℝ, the best load factors are given by combining objects with ⌊d′⌋ and ⌈d′⌉ hash functions [10]. As we will see in Section 4, this is not the case in the context of PHFs because we are looking at the storage space instead of the average number of hash functions.
3.2 Perfect Hashing. The perfect hashing problem has been studied since the 1970s [41, 25, 6] and is still an active area of research. In the following paragraphs, we describe more recent papers.

Order-Preserving. An order-preserving PHF maintains the order that the input objects are given in. CHM [9] and BMZ [3] construct an undirected graph with edges {(h1(x), h2(x)) | x ∈ S}. They then assign a number to each vertex, such that for each edge, the sum of the numbers stored in adjacent vertices gives the desired PHF value. This can be done by assigning 0 to an arbitrary vertex and then performing depth-first search to assign all neighbors by simple subtraction. CHM and BMZ store 2.09N and 1.15N integer numbers, respectively, therefore needing O(N log N) space. Note that space near N log N can also be achieved by explicitly storing the rank of the objects in a retrieval data structure.
BDZ. In the BDZ algorithm [5], also called the RAM algorithm or the BPZ algorithm, each input object is mapped to an edge in a random hypergraph using d independent hash functions h_i. The hypergraph needs to be peelable² in order to continue. By peeling the graph, BDZ determines i(x), such that h_{i(x)}(x) is unique for each object x. It then uses a linear equation system to determine a function g, such that i(x) = (Σ_{0 ≤ i < d} g(h_i(x))) mod d.

Even though our presentation using cuckoo hashing sounds different, SicHash is similar to this idea. The BDZ algorithm's task of finding a unique h_{i(x)}(x) for each object can be interpreted as placing the objects in a cuckoo hash table. The function g serves as a retrieval data structure that maps each object to a hash function index i(x). The most important difference is that the BDZ algorithm couples retrieval and object placement by using the same set of hash functions. In particular, g is evaluated on the entire range M, so the space to store g depends on M. SicHash, in contrast, separates the two tasks of object placement and hash function retrieval. This enables using a retrieval data structure of size N instead. Moreover, SicHash uses irregular cuckoo hashing, which cannot be represented efficiently with the integrated retrieval data structure of BDZ. Finally, SicHash does not depend on peelability.
WBPM. Weaver et al. [45] describe an algorithm for calculating MPHFs that is based on weighted bipartite matchings (WBPM). The left set of the graph is determined by the M = N input objects and the right set is determined by the N possible hash values. The edges are determined by applying O(log N) hash functions to each object, where an edge determined from the i-th hash function has weight i. The weighted matching can be solved with a weight of 1.83N, giving an assignment from objects to hash values. For storing which hash function to use for each object, WBPM uses a 1-bit retrieval data structure. The keys to the retrieval data structure are tuples of object and hash function index. The stored value is 1 for the hash function to finally be used, and 0 for all smaller indices. The weight of 1.83N therefore also equals the space usage of the final data structure, except for overheads like prefix sums due to bucketing.

SicHash uses a similar structure but simplifies each of the ingredients. Instead of a weighted bipartite matching, SicHash (implicitly) solves an unweighted bipartite matching by constructing a cuckoo hash table. Instead of querying a 1-bit retrieval data structure multiple times for each hash function evaluation, SicHash only performs a single query to a retrieval data structure. While WBPM constructs a retrieval data structure consisting of 1.83N objects, SicHash generates retrieval data structures with a total of N objects, which makes the construction faster. While WBPM's space usage is competitive for MPHFs, constructing a non-minimal PHF is less efficient. With a load factor of 0.85, for example, SicHash achieves a space usage of 1.43N bits, while our preliminary experiments show that a matching like above has a weight of 1.54N. This stems from the fact that SicHash stores the selected hash function index using binary code, while WBPM effectively uses unary code.

² Possibility of obtaining a graph without edges by iteratively taking away edges that contain a node with degree 1 [5, 43].
CHD [2] / FCH [20] / PTHash [38]. The basic idea of this approach is to hash the input objects to a number of buckets of expected equal size. After sorting the buckets by their size, the methods search for a hash function that stores the entire bucket in the final output domain without causing collisions. CHD [2] tries out hash functions linearly, so that the data to store for each bucket can be compressed efficiently. With a load factor of 81%, CHD can construct a PHF using 1.4 bits per object. With a load factor of 99%, it achieves 1.98 bits per object. In FCH [20], the bucket sizes are asymmetric — using the default parameters, 60% of the objects are mapped to 30% of the buckets. After hashing the objects of a bucket, it searches for a rotation value that (mod M) places all objects into the output range without collisions. FCH produces MPHFs with about 2 bits per object. PTHash [38] combines the two ideas by using asymmetric buckets and trying hash functions linearly to enable a compressible representation. To speed up searching for the hash function in each bucket, it first generates a non-minimal PHF. Instead of using the well known rank trick [4] to convert a PHF to an MPHF, PTHash uses a new approach that enables faster queries: We can interpret the values in the output range M that no object is mapped to as holes. PTHash then stores an array of size M − N that re-maps each object with a hash value > N to one of the holes. Given that the holes are a monotonic sequence of integers, the array can be compressed with Elias-Fano coding. Because most objects are mapped to values ≤ N, the expensive lookups into the Elias-Fano sequence are only rarely needed.
MeraculousHash [8] / FiPHa [35] / BBHash [33]. This approach hashes the objects to γN buckets, for γ ∈ ℝ, γ ≥ 1. If a bucket has exactly one object mapped to it, that mapping is injective. Objects from buckets that hold more than one object are bumped to the next layer of the same data structure. For each layer, a bit vector of size γN indicates which buckets store a single object, i.e., where the recursive query can stop. Together with a rank data structure, that bit vector can also be used to make the resulting hash function minimal. This approach allows fast construction and queries but needs space of at least e bits per object [35] – more than 4 for really good speed.
RecSplit. RecSplit [15] distributes all objects to buckets of expected equal size. Within each bucket, RecSplit first searches for a hash function with binary output that splits the set of objects in half.³ This is repeated recursively until the set of objects has a small, configurable size. Usual values for the leaf size are about 8–12 objects. At a leaf, RecSplit then tries out hash functions until it finds one that is a bijection. For each bucket, RecSplit stores the hash function seeds that make up the splitting tree, the seeds for each leaf, and a prefix sum of the bucket sizes. There are configurations that need only 1.56 bits per object.
4 Overloading

In cuckoo hashing, the maximum load factor of a table is a widely studied subject (see Section 3.1). For example, it is well known that a table with d = 2 hash functions has a load threshold of 50%. For M → ∞, the probability of successful table construction with load factor > 50% approaches 0, while it approaches 1 for load factors < 50%.

Let us now look at a very small table of size M = 3 storing N = 2 objects. This table has a load factor of 66%, which is more than the load threshold. Still, the probability of successful construction, i.e., not all four hash function values being the same, is 1 − 3 · (1/3)^4 ≈ 96%. This shows that the load factors can be considerably higher than the asymptotic limits when using small tables. We call a table that contains more objects than the asymptotic limit overloaded.

To illustrate the achievable load factors, we incrementally construct cuckoo hash tables. Figure 1 gives a box plot for the load factors at which the insertion finally fails. It shows three fundamental observations that we use in SicHash to increase the load factor while decreasing the amount of memory needed.

³ The number of splits is larger for the bottom recursion layers.
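The success probability of this small example can be computed exactly by enumerating all 3^4 combinations of the four hash values:

```python
from fractions import Fraction
from itertools import product

# Table with M = 3 cells, N = 2 objects, two hash values each (a1, a2 for
# object A; b1, b2 for object B). Construction succeeds iff the objects
# can be placed in distinct cells, i.e., unless all four values coincide.
M = 3
success = sum(
    any(ca != cb for ca in (a1, a2) for cb in (b1, b2))
    for a1, a2, b1, b2 in product(range(M), repeat=4)
)
p = Fraction(success, M ** 4)
assert p == 1 - 3 * Fraction(1, 3) ** 4  # 78/81
print(float(p))  # ≈ 0.963, far above the asymptotic 50% threshold
```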