Locality-Preserving Minimal Perfect Hashing of K-Mers Giulio Ermanno Pibiri12 Yoshihiro Shibuya3 Antoine Limasset4

2025-05-02 0 0 2.36MB 9 页 10玖币

侵权投诉

Locality-Preserving Minimal Perfect Hashing of

K-Mers

Giulio Ermanno Pibiri 1,2, Yoshihiro Shibuya 3, Antoine Limasset 4

1Ca’ Foscari University of Venice, Venice, Italy

2ISTI-CNR, Pisa, Italy

3University Gustave Eiffel, Marne-la-Vallée, France

4University of Lille and CNRS, Lille, France

Abstract

Motivation: Minimal perfect hashing is the problem of mapping a static set of ndistinct keys into the

address space {1,...,n}bijectively. It is well-known that nlog2(e)bits are necessary to specify a minimal

perfect hash function (MPHF) f, when no additional knowledge of the input keys is to be used. However,

it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower

the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys:

since two consecutive k-mers share an overlap of k−1symbols, it seems possible to beat the classic

log2(e)bits/key barrier in this case. Moreover, we would like fto map consecutive k-mers to consecutive

addresses, as to also preserve as much as possible their relationship in the codomain. This is a useful

feature in practice as it guarantees a certain degree of locality of reference for f, resulting in a better

evaluation time when querying consecutive k-mers.

Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF

designed for k-mers extracted consecutively from a collection of strings. We design a construction whose

space usage decreases for growing kand discuss experiments with a practical implementation of the

method: in practice, the functions built with our method can be several times smaller and even faster to

query than the most efﬁcient MPHFs in the literature.

Code Availability: https://github.com/jermp/lphash

Data Availability: https://zenodo.org/record/7239205

1 Introduction

Given a universe set U, a function f:U→[n] = {1,...,n}is a

minimal perfect hash function (MPHF, henceforth) for a set S⊆U

with n=|S|if f(x)6=f(y)for all x, y ∈S,x6=y. In simpler

words, fmaps each key of Sinto a distinct integer in [n]. The function is

allowed to return any value in [n]for a key x∈U\S. A classic result

established that nlog2(e)=1.442nbits are essentially necessary to

represent such functions for |U|  n[Mehlhorn,1982]. Minimal perfect

hashing is a central problem in data structure design and has received

considerable attention, both in theory and practice. In fact, many practical

constructions have been proposed (see, e.g., [Pibiri and Trani,2021a] and

references therein). These algorithms ﬁnd MPHFs that take space close to

the theoretic-minimum, e.g., 2 – 3 bits/key, retain very fast lookup time,

and scale well to very large sets. Applications of minimal perfect hashing

range from computer networks [Lu et al.,2006] to databases [Chang and

Lin,2005], as well as language models [Pibiri and Venturini,2019,Strimel

et al.,2020], compilers, and operating systems. MPHFs have been also

used recently in Bioinformatics to implement fast and compact dictionaries

for ﬁxed-length DNA strings [Pibiri,2022b,a,Almodaresi et al.,2018,

Marchet et al.,2021].

In its simplicity and versatility, the minimal perfect hashing problem

does not take into account speciﬁc types of inputs, nor the intrinsic

relationships between the input keys. Each key x∈Sis considered

independentlyfromanyotherkey inthesetand, assuch, P[f(x) = i]≈1

for any ﬁxed i∈[n]. In practice, however, the input keys often present

some regularities that we could exploit to let fact “less randomly” on S.

This, in turn, would permit to achieve a lower space complexity for f.

We therefore consider in this paper the following special setting of the

minimal perfect hashing problem: the elements of Sare all the distinct

sub-strings of length k, for some k > 0, from a given collection X

of strings. The elements of Sare called k-mers. The crucial point is

that any two consecutive k-mers in a string of Xhave indeed a strong

intrinsic relationship in that they share an overlap of k−1symbols. It

seems proﬁtable to exploit the overlap information to preserve (as much

as possible) the local relationship between consecutive k-mers as to reduce

the randomness of f, thus lowering its bit complexity and evaluation time.

arXiv:2210.13097v2 [cs.DS] 12 Apr 2023

2G. E. Pibiri, Y. Shibuya, and A. Limasset

In particular, we are interested in the design of a locality-preserving

MPHF in the following sense. Given a query sequence Q, if f(x) = j

for some k-mer x∈Q, we would like fto hash Next(x)to j+ 1,

Next(Next(x)) to j+ 2, and so on, where Next(x)is the k-mer following

xin Q(assuming Next(x)and Next(Next(x)) are in Xas well). This

behavior of fis very desirable in practice, at least for two important

reasons. First, it implies compression for satellite values associated to k-

mers. Typical satellite values are abundance counts, reference identiﬁers

(sometimes called “colors”), or contig identiﬁers (e.g., unitigs) in a

de Bruijn graph. Consecutive k-mers tend to have very similar – if

not identical – satellite values, hence hashing consecutive k-mers to

consecutive identiﬁers induce a natural clustering of the associated satellite

values which is amenable to effective compression. The second important

reason is, clearly, faster evaluation time when querying for consecutive

k-mers in a sequence. This streaming query modality is the query

modality employed by k-mer-based applications [Almodaresi et al.,2018,

Bingmann et al.,2019,Marchet et al.,2021,Robidou and Peterlongo,

2021,Pibiri,2022b].

We formalize the notion of locality-preserving MPHF along with

other preliminary deﬁnitions in Section 2. We show how to obtain

a locality-preserving MPHF in very compact space in Section 3. To

achieve this result, we make use of two algorithmic tools: random

minimizers [Schleimer et al.,2003,Roberts et al.,2004] and a novel

partitioning scheme for sub-sequences of consecutive k-mers sharing

the same minimizers (super-k-mers) which allows a more parsimonious

memory layout. The space of the proposed solution decreases for growing

kand the data structure is built in linear time in the size of the input

(number of distinct k-mers). In Section 4we present experiments across

a breadth of datasets to show that the construction is practical too: the

functions can be several times smaller and even faster to query than the

most efﬁcient, albeit “general-purpose”, minimal perfect hash functions.

We conclude in Section 5where we also sketch some promising future

directions. Our C++ implementation of the method is publicly available at

https://github.com/jermp/lphash.

2 Notation and Deﬁnitions

Let Xbe a set of strings over an alphabet Σ. Throughout the paper we focus

on the DNA alphabet Σ = {A,C,G,T}to better highlight the connection

with our concrete application but our algorithms can be generalized to

work for arbitrary alphabets. A sub-string of length kof a string S∈ X

is called a k-mer of S.

Deﬁnition 1 (Spectrum). The k-mer spectrum of Xis the set of all

distinct k-mers of the strings in X. Formally: spectrumk(X) := {x∈

Σk| ∃S∈ X such that xis a k-mer of S}.

Deﬁnition 2 (Spectrum-Preserving String Set). A spectrum-preserving

string set (or SPSS) Sof Xis a set of strings such that (i) each string of

Shas length at least k, and (ii) spectrumk(S) = spectrumk(X).

Since our goal is to build a MPHF for the k-mers in a SPSS, we are

interested in a SPSS Swhere each k-mer is seen only once, i.e., for each

k-mer x∈spectrumk(S)there is only one string of Swhere xappears

once. We assume that no k-mer appearing at the end of a string shares an

overlap of k−1symbols with the ﬁrst k-mer of another string, otherwise

we could reduce the number of strings in Sand obtain a smaller SPSS.

In the following, we make use of this form of SPSS which is suitable for

the minimal perfect hashing problem. We remark that efﬁcient algorithms

exist to compute such SPSSs (see, e.g., [Rahman and Medvedev,2020,

Bˇrinda et al.,2021,Khan and Patro,2021,Khan et al.,2022]).

The input for our problem is therefore a SPSS Sfor Xwith |S| strings

and n > 1distinct k-mers. Without loss of generality, we index k-mers

based on their positions in S, assuming an order S1, S2, S3,... of the

strings of Sis ﬁxed, and we indicate with xithe i-th k-mer in S, for

i= 1,...,n.

We want to build a MPHF f: Σk→[n]for S; more precisely, for the

ndistinct k-mers in spectrumk(S). We remark again that our objective is

to exploit the overlap of k−1symbols between consecutive k-mers from

a string of Sto preserve their locality, and hence reduce the bit complexity

of fas well as its evaluation time when querying k-mers in sequence.

We deﬁne a locality-preserving MPHF, or LP-MPHF, for Sas follows.

Deﬁnition 3 (LP-MPHF). Let f: Σk→[n]be a MPHF for Sand Abe

the set {1≤i<n| ∃S∈ S, xi, xi+1 ∈S∧f(xi+1) = f(xi)+1}.

The function fis (1 −ε)-locality-preserving for Sif ε≥1− |A|/n.

Intuitively, the “best” LP-MPHF for Sis the one having the smallest

ε, so we look for practical constructions with small ε. On the other hand,

note that a “classic” MPHF corresponds to the case where the locality-

preserving property is almost always not satisﬁed and, as a consequence,

εwill be approximately 1.

Two more considerations are in order. First, it should be clear that the

way we deﬁne locality-preservation in Deﬁnition 3is only pertinent to

SPSSs where having consecutive hash codes for consecutive k-mers is a

very desirable property as motivated in Section 1. A different deﬁnition of

locality-preservation could instead be given if we were considering generic

input keys. Second, we did not use the term order-preserving to stress the

distinction from classic order-preserving functions in the literature [Fox

et al.,1991] that make it possible to preserve any wanted order and, as

such, incur in an avoidable Ω(log n)-bit overhead per key. Here, we are

interested in preserving only the input order of the k-mers which is the

one that matters in practice.

Deﬁnition 4 (Fragmentation Factor). Given a SPSS Swith |S| strings

and n=|spectrumk(S)|distinct k-mers, we deﬁne the fragmentation

factor of Sas α:= (|S| − 1)/n.

The fragmentation factor of Sis a measure of how contiguous the

k-mers in Sare. The minimum fragmentation α= 0 is achieved for

|S| = 1 and, in this case, xishares an overlap of k−1symbols with

xi+1 for all i= 1,...,n−1. This ideal scenario is, however, unlikely to

happen in practice. On the other hand, the worst-case scenario of maximum

fragmentation α= 1 −1/n is achieved when |S| =nand k-mers do

not share any overlap (of length k−1). This is also unlikely to happen

given that k-mers are extracted consecutively from the strings of Xand,

as a result, many overlaps are expected. A more realistic scenario happens,

instead, when |S|  n, resulting in εα. For the rest of the paper, we

focus on this latter scenario to make our analysis meaningful.

From Deﬁnition 3and 4it is easy to see that ε≥1/n when α= 0,

and ε= 1 when α= 1 −1/n. In general, we have ε≥α+ 1/n since

there are at least |S| − 1indexes ifor which f(xi+1)6=f(xi)+1. How

small εcan actually be therefore depends on the input SPSS (and on the

strategy used to implement fin practice, as we are going to illustrate in

Section 3).

Lastly in this section, we deﬁne minimizers and super-k-mers that will

be one of the main ingredients used in Section 3.

Deﬁnition 5 (Random Minimizer of a k-mer). Given a k-mer xand

a random hash function h, the minimizer of xis any m-mer µsuch that

h(µ)≤h(y)for any other m-mer yof x, for some m≤k.

In case the minimizer of xis not unique, we break ties by taking the

leftmost m-mer in x. For convenience, we indicate with w=k−m+1 the

number of m-mers in a k-mer. (Note that Deﬁnition 5deﬁnes a minimizer

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

Locality-PreservingMinimalPerfectHashingofK-MersGiulioErmannoPibiri1,2,YoshihiroShibuya3,AntoineLimasset41Ca'FoscariUniversityofVenice,Venice,Italy2ISTI-CNR,Pisa,Italy3UniversityGustaveEiffel,Marne-la-Vallée,France4UniversityofLilleandCNRS,Lille,FranceAbstractMotivation:Minimalperfecthashingisthepro...

展开>> 收起<<

Locality-Preserving Minimal Perfect Hashing of K-Mers Giulio Ermanno Pibiri12 Yoshihiro Shibuya3 Antoine Limasset4.pdf

共9页,预览2页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Locality-Preserving Minimal Perfect Hashing of K-Mers Giulio Ermanno Pibiri12 Yoshihiro Shibuya3 Antoine Limasset4

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: