Locality-Preserving Minimal Perfect Hashing of K-Mers Giulio Ermanno Pibiri12 Yoshihiro Shibuya3 Antoine Limasset4

2025-05-02 0 0 2.36MB 9 页 10玖币
侵权投诉
Locality-Preserving Minimal Perfect Hashing of
K-Mers
Giulio Ermanno Pibiri 1,2, Yoshihiro Shibuya 3, Antoine Limasset 4
1Ca’ Foscari University of Venice, Venice, Italy
2ISTI-CNR, Pisa, Italy
3University Gustave Eiffel, Marne-la-Vallée, France
4University of Lille and CNRS, Lille, France
Abstract
Motivation: Minimal perfect hashing is the problem of mapping a static set of ndistinct keys into the
address space {1,...,n}bijectively. It is well-known that nlog2(e)bits are necessary to specify a minimal
perfect hash function (MPHF) f, when no additional knowledge of the input keys is to be used. However,
it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower
the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys:
since two consecutive k-mers share an overlap of k1symbols, it seems possible to beat the classic
log2(e)bits/key barrier in this case. Moreover, we would like fto map consecutive k-mers to consecutive
addresses, as to also preserve as much as possible their relationship in the codomain. This is a useful
feature in practice as it guarantees a certain degree of locality of reference for f, resulting in a better
evaluation time when querying consecutive k-mers.
Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF
designed for k-mers extracted consecutively from a collection of strings. We design a construction whose
space usage decreases for growing kand discuss experiments with a practical implementation of the
method: in practice, the functions built with our method can be several times smaller and even faster to
query than the most efficient MPHFs in the literature.
Code Availability: https://github.com/jermp/lphash
Data Availability: https://zenodo.org/record/7239205
1 Introduction
Given a universe set U, a function f:U[n] = {1,...,n}is a
minimal perfect hash function (MPHF, henceforth) for a set SU
with n=|S|if f(x)6=f(y)for all x, y S,x6=y. In simpler
words, fmaps each key of Sinto a distinct integer in [n]. The function is
allowed to return any value in [n]for a key xU\S. A classic result
established that nlog2(e)=1.442nbits are essentially necessary to
represent such functions for |U|  n[Mehlhorn,1982]. Minimal perfect
hashing is a central problem in data structure design and has received
considerable attention, both in theory and practice. In fact, many practical
constructions have been proposed (see, e.g., [Pibiri and Trani,2021a] and
references therein). These algorithms find MPHFs that take space close to
the theoretic-minimum, e.g., 2 – 3 bits/key, retain very fast lookup time,
and scale well to very large sets. Applications of minimal perfect hashing
range from computer networks [Lu et al.,2006] to databases [Chang and
Lin,2005], as well as language models [Pibiri and Venturini,2019,Strimel
et al.,2020], compilers, and operating systems. MPHFs have been also
used recently in Bioinformatics to implement fast and compact dictionaries
for fixed-length DNA strings [Pibiri,2022b,a,Almodaresi et al.,2018,
Marchet et al.,2021].
In its simplicity and versatility, the minimal perfect hashing problem
does not take into account specific types of inputs, nor the intrinsic
relationships between the input keys. Each key xSis considered
independentlyfromanyotherkey inthesetand, assuch, P[f(x) = i]1
n
for any fixed i[n]. In practice, however, the input keys often present
some regularities that we could exploit to let fact “less randomly” on S.
This, in turn, would permit to achieve a lower space complexity for f.
We therefore consider in this paper the following special setting of the
minimal perfect hashing problem: the elements of Sare all the distinct
sub-strings of length k, for some k > 0, from a given collection X
of strings. The elements of Sare called k-mers. The crucial point is
that any two consecutive k-mers in a string of Xhave indeed a strong
intrinsic relationship in that they share an overlap of k1symbols. It
seems profitable to exploit the overlap information to preserve (as much
as possible) the local relationship between consecutive k-mers as to reduce
the randomness of f, thus lowering its bit complexity and evaluation time.
© The Author 2023. 1
arXiv:2210.13097v2 [cs.DS] 12 Apr 2023
2G. E. Pibiri, Y. Shibuya, and A. Limasset
In particular, we are interested in the design of a locality-preserving
MPHF in the following sense. Given a query sequence Q, if f(x) = j
for some k-mer xQ, we would like fto hash Next(x)to j+ 1,
Next(Next(x)) to j+ 2, and so on, where Next(x)is the k-mer following
xin Q(assuming Next(x)and Next(Next(x)) are in Xas well). This
behavior of fis very desirable in practice, at least for two important
reasons. First, it implies compression for satellite values associated to k-
mers. Typical satellite values are abundance counts, reference identifiers
(sometimes called “colors”), or contig identifiers (e.g., unitigs) in a
de Bruijn graph. Consecutive k-mers tend to have very similar – if
not identical – satellite values, hence hashing consecutive k-mers to
consecutive identifiers induce a natural clustering of the associated satellite
values which is amenable to effective compression. The second important
reason is, clearly, faster evaluation time when querying for consecutive
k-mers in a sequence. This streaming query modality is the query
modality employed by k-mer-based applications [Almodaresi et al.,2018,
Bingmann et al.,2019,Marchet et al.,2021,Robidou and Peterlongo,
2021,Pibiri,2022b].
We formalize the notion of locality-preserving MPHF along with
other preliminary definitions in Section 2. We show how to obtain
a locality-preserving MPHF in very compact space in Section 3. To
achieve this result, we make use of two algorithmic tools: random
minimizers [Schleimer et al.,2003,Roberts et al.,2004] and a novel
partitioning scheme for sub-sequences of consecutive k-mers sharing
the same minimizers (super-k-mers) which allows a more parsimonious
memory layout. The space of the proposed solution decreases for growing
kand the data structure is built in linear time in the size of the input
(number of distinct k-mers). In Section 4we present experiments across
a breadth of datasets to show that the construction is practical too: the
functions can be several times smaller and even faster to query than the
most efficient, albeit “general-purpose”, minimal perfect hash functions.
We conclude in Section 5where we also sketch some promising future
directions. Our C++ implementation of the method is publicly available at
https://github.com/jermp/lphash.
2 Notation and Definitions
Let Xbe a set of strings over an alphabet Σ. Throughout the paper we focus
on the DNA alphabet Σ = {A,C,G,T}to better highlight the connection
with our concrete application but our algorithms can be generalized to
work for arbitrary alphabets. A sub-string of length kof a string S∈ X
is called a k-mer of S.
Definition 1 (Spectrum). The k-mer spectrum of Xis the set of all
distinct k-mers of the strings in X. Formally: spectrumk(X) := {x
Σk| ∃S∈ X such that xis a k-mer of S}.
Definition 2 (Spectrum-Preserving String Set). A spectrum-preserving
string set (or SPSS) Sof Xis a set of strings such that (i) each string of
Shas length at least k, and (ii) spectrumk(S) = spectrumk(X).
Since our goal is to build a MPHF for the k-mers in a SPSS, we are
interested in a SPSS Swhere each k-mer is seen only once, i.e., for each
k-mer xspectrumk(S)there is only one string of Swhere xappears
once. We assume that no k-mer appearing at the end of a string shares an
overlap of k1symbols with the first k-mer of another string, otherwise
we could reduce the number of strings in Sand obtain a smaller SPSS.
In the following, we make use of this form of SPSS which is suitable for
the minimal perfect hashing problem. We remark that efficient algorithms
exist to compute such SPSSs (see, e.g., [Rahman and Medvedev,2020,
rinda et al.,2021,Khan and Patro,2021,Khan et al.,2022]).
The input for our problem is therefore a SPSS Sfor Xwith |S| strings
and n > 1distinct k-mers. Without loss of generality, we index k-mers
based on their positions in S, assuming an order S1, S2, S3,... of the
strings of Sis fixed, and we indicate with xithe i-th k-mer in S, for
i= 1,...,n.
We want to build a MPHF f: Σk[n]for S; more precisely, for the
ndistinct k-mers in spectrumk(S). We remark again that our objective is
to exploit the overlap of k1symbols between consecutive k-mers from
a string of Sto preserve their locality, and hence reduce the bit complexity
of fas well as its evaluation time when querying k-mers in sequence.
We define a locality-preserving MPHF, or LP-MPHF, for Sas follows.
Definition 3 (LP-MPHF). Let f: Σk[n]be a MPHF for Sand Abe
the set {1i<n| ∃S∈ S, xi, xi+1 Sf(xi+1) = f(xi)+1}.
The function fis (1 ε)-locality-preserving for Sif ε1− |A|/n.
Intuitively, the “best” LP-MPHF for Sis the one having the smallest
ε, so we look for practical constructions with small ε. On the other hand,
note that a “classic” MPHF corresponds to the case where the locality-
preserving property is almost always not satisfied and, as a consequence,
εwill be approximately 1.
Two more considerations are in order. First, it should be clear that the
way we define locality-preservation in Definition 3is only pertinent to
SPSSs where having consecutive hash codes for consecutive k-mers is a
very desirable property as motivated in Section 1. A different definition of
locality-preservation could instead be given if we were considering generic
input keys. Second, we did not use the term order-preserving to stress the
distinction from classic order-preserving functions in the literature [Fox
et al.,1991] that make it possible to preserve any wanted order and, as
such, incur in an avoidable Ω(log n)-bit overhead per key. Here, we are
interested in preserving only the input order of the k-mers which is the
one that matters in practice.
Definition 4 (Fragmentation Factor). Given a SPSS Swith |S| strings
and n=|spectrumk(S)|distinct k-mers, we define the fragmentation
factor of Sas α:= (|S| − 1)/n.
The fragmentation factor of Sis a measure of how contiguous the
k-mers in Sare. The minimum fragmentation α= 0 is achieved for
|S| = 1 and, in this case, xishares an overlap of k1symbols with
xi+1 for all i= 1,...,n1. This ideal scenario is, however, unlikely to
happen in practice. On the other hand, the worst-case scenario of maximum
fragmentation α= 1 1/n is achieved when |S| =nand k-mers do
not share any overlap (of length k1). This is also unlikely to happen
given that k-mers are extracted consecutively from the strings of Xand,
as a result, many overlaps are expected. A more realistic scenario happens,
instead, when |S|  n, resulting in εα. For the rest of the paper, we
focus on this latter scenario to make our analysis meaningful.
From Definition 3and 4it is easy to see that ε1/n when α= 0,
and ε= 1 when α= 1 1/n. In general, we have εα+ 1/n since
there are at least |S| − 1indexes ifor which f(xi+1)6=f(xi)+1. How
small εcan actually be therefore depends on the input SPSS (and on the
strategy used to implement fin practice, as we are going to illustrate in
Section 3).
Lastly in this section, we define minimizers and super-k-mers that will
be one of the main ingredients used in Section 3.
Definition 5 (Random Minimizer of a k-mer). Given a k-mer xand
a random hash function h, the minimizer of xis any m-mer µsuch that
h(µ)h(y)for any other m-mer yof x, for some mk.
In case the minimizer of xis not unique, we break ties by taking the
leftmost m-mer in x. For convenience, we indicate with w=km+1 the
number of m-mers in a k-mer. (Note that Definition 5defines a minimizer
摘要:

Locality-PreservingMinimalPerfectHashingofK-MersGiulioErmannoPibiri1,2,YoshihiroShibuya3,AntoineLimasset41Ca'FoscariUniversityofVenice,Venice,Italy2ISTI-CNR,Pisa,Italy3UniversityGustaveEiffel,Marne-la-Vallée,France4UniversityofLilleandCNRS,Lille,FranceAbstractMotivation:Minimalperfecthashingisthepro...

展开>> 收起<<
Locality-Preserving Minimal Perfect Hashing of K-Mers Giulio Ermanno Pibiri12 Yoshihiro Shibuya3 Antoine Limasset4.pdf

共9页,预览2页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:9 页 大小:2.36MB 格式:PDF 时间:2025-05-02

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 9
客服
关注