
2G. E. Pibiri, Y. Shibuya, and A. Limasset
In particular, we are interested in the design of a locality-preserving
MPHF in the following sense. Given a query sequence Q, if f(x) = j
for some k-mer x∈Q, we would like fto hash Next(x)to j+ 1,
Next(Next(x)) to j+ 2, and so on, where Next(x)is the k-mer following
xin Q(assuming Next(x)and Next(Next(x)) are in Xas well). This
behavior of fis very desirable in practice, at least for two important
reasons. First, it implies compression for satellite values associated to k-
mers. Typical satellite values are abundance counts, reference identifiers
(sometimes called “colors”), or contig identifiers (e.g., unitigs) in a
de Bruijn graph. Consecutive k-mers tend to have very similar – if
not identical – satellite values, hence hashing consecutive k-mers to
consecutive identifiers induce a natural clustering of the associated satellite
values which is amenable to effective compression. The second important
reason is, clearly, faster evaluation time when querying for consecutive
k-mers in a sequence. This streaming query modality is the query
modality employed by k-mer-based applications [Almodaresi et al.,2018,
Bingmann et al.,2019,Marchet et al.,2021,Robidou and Peterlongo,
2021,Pibiri,2022b].
We formalize the notion of locality-preserving MPHF along with
other preliminary definitions in Section 2. We show how to obtain
a locality-preserving MPHF in very compact space in Section 3. To
achieve this result, we make use of two algorithmic tools: random
minimizers [Schleimer et al.,2003,Roberts et al.,2004] and a novel
partitioning scheme for sub-sequences of consecutive k-mers sharing
the same minimizers (super-k-mers) which allows a more parsimonious
memory layout. The space of the proposed solution decreases for growing
kand the data structure is built in linear time in the size of the input
(number of distinct k-mers). In Section 4we present experiments across
a breadth of datasets to show that the construction is practical too: the
functions can be several times smaller and even faster to query than the
most efficient, albeit “general-purpose”, minimal perfect hash functions.
We conclude in Section 5where we also sketch some promising future
directions. Our C++ implementation of the method is publicly available at
https://github.com/jermp/lphash.
2 Notation and Definitions
Let Xbe a set of strings over an alphabet Σ. Throughout the paper we focus
on the DNA alphabet Σ = {A,C,G,T}to better highlight the connection
with our concrete application but our algorithms can be generalized to
work for arbitrary alphabets. A sub-string of length kof a string S∈ X
is called a k-mer of S.
Definition 1 (Spectrum). The k-mer spectrum of Xis the set of all
distinct k-mers of the strings in X. Formally: spectrumk(X) := {x∈
Σk| ∃S∈ X such that xis a k-mer of S}.
Definition 2 (Spectrum-Preserving String Set). A spectrum-preserving
string set (or SPSS) Sof Xis a set of strings such that (i) each string of
Shas length at least k, and (ii) spectrumk(S) = spectrumk(X).
Since our goal is to build a MPHF for the k-mers in a SPSS, we are
interested in a SPSS Swhere each k-mer is seen only once, i.e., for each
k-mer x∈spectrumk(S)there is only one string of Swhere xappears
once. We assume that no k-mer appearing at the end of a string shares an
overlap of k−1symbols with the first k-mer of another string, otherwise
we could reduce the number of strings in Sand obtain a smaller SPSS.
In the following, we make use of this form of SPSS which is suitable for
the minimal perfect hashing problem. We remark that efficient algorithms
exist to compute such SPSSs (see, e.g., [Rahman and Medvedev,2020,
Bˇrinda et al.,2021,Khan and Patro,2021,Khan et al.,2022]).
The input for our problem is therefore a SPSS Sfor Xwith |S| strings
and n > 1distinct k-mers. Without loss of generality, we index k-mers
based on their positions in S, assuming an order S1, S2, S3,... of the
strings of Sis fixed, and we indicate with xithe i-th k-mer in S, for
i= 1,...,n.
We want to build a MPHF f: Σk→[n]for S; more precisely, for the
ndistinct k-mers in spectrumk(S). We remark again that our objective is
to exploit the overlap of k−1symbols between consecutive k-mers from
a string of Sto preserve their locality, and hence reduce the bit complexity
of fas well as its evaluation time when querying k-mers in sequence.
We define a locality-preserving MPHF, or LP-MPHF, for Sas follows.
Definition 3 (LP-MPHF). Let f: Σk→[n]be a MPHF for Sand Abe
the set {1≤i<n| ∃S∈ S, xi, xi+1 ∈S∧f(xi+1) = f(xi)+1}.
The function fis (1 −ε)-locality-preserving for Sif ε≥1− |A|/n.
Intuitively, the “best” LP-MPHF for Sis the one having the smallest
ε, so we look for practical constructions with small ε. On the other hand,
note that a “classic” MPHF corresponds to the case where the locality-
preserving property is almost always not satisfied and, as a consequence,
εwill be approximately 1.
Two more considerations are in order. First, it should be clear that the
way we define locality-preservation in Definition 3is only pertinent to
SPSSs where having consecutive hash codes for consecutive k-mers is a
very desirable property as motivated in Section 1. A different definition of
locality-preservation could instead be given if we were considering generic
input keys. Second, we did not use the term order-preserving to stress the
distinction from classic order-preserving functions in the literature [Fox
et al.,1991] that make it possible to preserve any wanted order and, as
such, incur in an avoidable Ω(log n)-bit overhead per key. Here, we are
interested in preserving only the input order of the k-mers which is the
one that matters in practice.
Definition 4 (Fragmentation Factor). Given a SPSS Swith |S| strings
and n=|spectrumk(S)|distinct k-mers, we define the fragmentation
factor of Sas α:= (|S| − 1)/n.
The fragmentation factor of Sis a measure of how contiguous the
k-mers in Sare. The minimum fragmentation α= 0 is achieved for
|S| = 1 and, in this case, xishares an overlap of k−1symbols with
xi+1 for all i= 1,...,n−1. This ideal scenario is, however, unlikely to
happen in practice. On the other hand, the worst-case scenario of maximum
fragmentation α= 1 −1/n is achieved when |S| =nand k-mers do
not share any overlap (of length k−1). This is also unlikely to happen
given that k-mers are extracted consecutively from the strings of Xand,
as a result, many overlaps are expected. A more realistic scenario happens,
instead, when |S| n, resulting in εα. For the rest of the paper, we
focus on this latter scenario to make our analysis meaningful.
From Definition 3and 4it is easy to see that ε≥1/n when α= 0,
and ε= 1 when α= 1 −1/n. In general, we have ε≥α+ 1/n since
there are at least |S| − 1indexes ifor which f(xi+1)6=f(xi)+1. How
small εcan actually be therefore depends on the input SPSS (and on the
strategy used to implement fin practice, as we are going to illustrate in
Section 3).
Lastly in this section, we define minimizers and super-k-mers that will
be one of the main ingredients used in Section 3.
Definition 5 (Random Minimizer of a k-mer). Given a k-mer xand
a random hash function h, the minimizer of xis any m-mer µsuch that
h(µ)≤h(y)for any other m-mer yof x, for some m≤k.
In case the minimizer of xis not unique, we break ties by taking the
leftmost m-mer in x. For convenience, we indicate with w=k−m+1 the
number of m-mers in a k-mer. (Note that Definition 5defines a minimizer