Computing MEMs and Relatives on Repetitive Text Collections Gonzalo Navarro12

2025-04-29 0 0 795.07KB 39 页 10玖币
侵权投诉
Computing MEMs and Relatives
on Repetitive Text Collections
Gonzalo Navarro1,2*
1Center for Biotechnology and Bioengineering (CeBiB), Chile.
2Department of Computer Science, University of Chile, Chile.
Corresponding author(s). E-mail(s): gnavarro@dcc.uchile.cl;
Abstract
We consider the problem of computing the Maximal Exact Matches (MEMs) of
a given pattern P[1 ..m]on a large repetitive text collection T[1 ..n], which
is represented as a (hopefully much smaller) run-length context-free grammar of
size grl . We show that the problem can be solved in time O(m2logϵn), for any
constant ϵ > 0, on a data structure of size O(grl). Further, on a locally consistent
grammar of size O(δlog n
δ), the time decreases to O(mlog m(log m+logϵn)).
The value δis a function of the substring complexity of Tand Ω(δlog n
δ)is a
tight lower bound on the compressibility of repetitive texts T, so our structure
has optimal size in terms of nand δ. We extend our results to several related
problems, such as finding k-MEMs, MUMs, rare MEMs, and applications.
Keywords: repetitive texts, substring complexity, grammar compression, locally
consistent parsing, compressed data structures, MEMs, MUMs
1 Introduction and Related Work
Inexact sequence matching is the norm in Bioinformatic applications. Mutations in the
genomes, and even the possible errors that arise in the sequence acquisition process,
makes researchers expect to see differences between the patterns they look for and what
they call their occurrences (Gusfield,1997;Ohlebusch,2013;akinen et al,2015). For
example, when assembling a genome one must align the reads (which are sequences
obtained from the DNA of an individual, a few hundreds or thousands nucleotides
long) to a reference genome (of length typically in the billions) or a set thereof. For the
reasons above, the read may not appear in the genome in exact form, so one looks for
1
arXiv:2210.09914v5 [cs.DS] 4 Sep 2023
the longest substrings of the read that appear in the genome, to determine the most
likely positions where to align it. In pangenomics, one may compare the substrings of a
whole gene or chromosome against another, or against a set of genomes representative
of a population, to find conserved regions and spot the places that differ, which may
indicate genetic variations or diseases.
Those examples are applications of one of the most relevant tools for inexact match-
ing, which is finding the Maximal Exact Matches (MEMs) of a given pattern P[1 . . m]
in a text T[1 . . n]. A MEM is a maximal substring P[i . . j] that appears in T(i.e.,
P[i1. . j] and P[i . . j + 1] are out of bounds or do not occur in T). In the align-
ment of reads, mis in the hundreds or thousands; when aligning a chromosome mcan
surpass the millions. If the text is one genome, ncan be in the billions, but genomic
collections may contain hundreds to thousands of genomes.
In this paper we are interested in the case where Tis known in advance and thus
can be indexed so that, later, we can efficiently find the MEMs of many patterns P
in it. This is the case in the two applications we have mentioned; in the pangenomic
case Tcan be taken as the concatenation of all the genomes in the collection. Many
other applications of the problem of finding the MEMs of a new string in a static
collection of strings are mentioned by Gusfield (1997), Ohlebusch (2013), and akinen
et al (2015), including simplified problems like finding longest common substrings and
equivalent problems like computing matching statistics. some later.
Finding MEMs on an indexed text Tis a classic problem in stringology and can be
solved in optimal O(m) time using a suffix tree of T(Weiner,1973;McCreight,1976).
Gusfield (1997, Sec. 7.8), for example, shows how to solve the analogous problem of
computing matching statistics (we discuss this equivalence later). In applications where
we handle massive texts T, however, suffix trees are too large to be maintained in
main memory, even if they use linear space. The suffix tree of a single human genome,
for example, whose length is about 3 billion bases, may require 60GB of memory with
a decent implementation. This rules out suffix trees to represent genomes on the large
bioinformatic collections that are arising, and make researchers look for alternatives
using less space. For example, Ohlebusch et al (2010) and Belazzougui et al (2013)
use indices based on the Burrows-Wheeler Transform (BWT) (Burrows and Wheeler,
1994) to compute MEMs in time O(mlog σ), where σis the alphabet size. Such indices
take just a few GBs on a human genome, an order of magnitude below the space
required with suffix trees.
When considering large collections of genomes, however, even such sharp space
reduction can be insufficient. Projects to sequence 100,000 human genomes have been
completed1, and current projects aim to sequencing millions of human genomes2.
In that scenario, even the BWT-based representations will need petabytes of main
memory to run, or resort to orders of magnitude slower disk storage.
A fortunate situation is that many of the fastest growing text collections, including
genome collections, are highly repetitive (Navarro,2021a): two genomes of the same
species feature a small percentage of differences only. Several text indices exploiting
repetitiveness to reduce space have appeared (Navarro,2021b). While probably not
1https://www.genomicsengland.co.uk/initiatives/100000-genomes-project
2https://b1mg-project.eu
2
competitive with the BWT-based indices we mentioned when indexing one genome,
those indices may take orders of magnitude less space than the raw data, not just of
its indices, when representing a collection of many genomes.
Those compressed indices support exact pattern matching, that is, they can list
all the positions where Poccurs in T. While useful, this is insufficient to efficiently
implement the MEM finding algorithms. This is the problem we address in this paper.
1.1 MEM finding with indices for repetitive text collections
The classic MEM-finding algorithm runs on a suffix tree, and those that run on BWT-
based data structures emulate it. Compressed suffix trees for highly repetitive text
collections do exist, but do not compress that much. Gagie et al (2020) show how to
simulate a suffix tree within space O(rlog n
r), where ris the number of equal-letter
runs in the BWT of T. This representation can simulate the suffix-tree-based algorithm
in time O(mlog n
r) if we run it backwards on P, using operations parent and Weiner
link instead of child and suffix link. The problem is the space: while ris an accepted
measure of repetitiveness (Kempa and Kociumaka,2020), it is a weak one (Navarro,
2021a;Kempa and Kociumaka,2020), and multiplying it by log n
rmakes it grow by an
order of magnitude. Current implementations of compressed suffix trees for repetitive
texts achieve remarkable space, but still use at least 2–4 bits per symbol (Russo et al,
2011;Farruggia et al,2018;aceres and Navarro,2022;Boucher et al,2021a).
Another trend has been to expand the functionality of a more basic compressed
text index for repetitive texts so as to support specific operations, MEMs in our case.
Bannai et al (2020) show how to compute matching statistics (from where MEMs
are easily extracted in O(m) time) by extending the RLBWT-index (akinen et al,
2010), in O(m(s+ log log n)) time and O(r) space, with the help of a data structure
that provides access to a symbol of Tin time O(s). This can be, for example, the
samples of the RLBWT-index, which add O(n/s) space to the index, or a context-free
grammar of T, which provides access in time s=O(log n) (Bille et al,2015). Various
implementations of this idea (Rossi et al,2022;Boucher et al,2021b;Tatarnikov et al,
2022) showed its practicality on large genome collections, with indices that are an
order of magnitude smaller than the text.
All those results have been obtained on the so-called suffix-based compressed indices
for repetitive collections (Navarro,2021b). This is natural because those emulate vari-
ants of suffix trees or arrays (Manber and Myers,1993), which simplifies the problem
of simulating the suffix tree traversal of the classic MEM-finding algorithm. Even
the naive algorithm of searching for all the O(m2) substrings of Pcan be run in
O(m2log log n) time on those O(r)-sized indices.
The problem is much harder on the so-called parsing-based indices (Navarro,
2021b). Those are potentially smaller than the suffix-based indices because they build
on stronger measures of repetitiveness. For example, the size gof the smallest context-
free grammar that generates Tis usually considerably smaller than r(Navarro,2021a).
Because these indices cut Tinto phrases, even exact pattern matching is complicated
because the occurrences of Pcan appear in many different forms, and many possible
cuts of Pmust be tried out (m1 in the general case) (Claude et al,2021). This makes
the problem of finding MEMs considerably harder. We are only aware of the results
3
of Gao (2022), who computes matching statistics in time O(m2logϵγ+mlog n) using
O(δlog n
δ) space (for any constant ϵ > 0), or O(m2+mlog γlog log γ+mlog n) using
O(δlog n
δ+γlog γ) space. Here δγare lower-bounding measures of repetitiveness
(Kempa and Prezza,2018;Christiansen et al,2020). The size O(δlog n
δ) matches a
tight lower bound on the size of compressed representations of T(Kociumaka et al,
2023b), so a structure of this size uses asymptotically optimal space for every nand δ.
1.2 Our contribution
We solve the MEM finding problem, and several relatives, within less space and/or
time than previous work, for the case of repetitive text collections.
Let grl be the size of any run-length context-free grammar generating T(those
include and extend classic context-free grammars). The smallest such grammar is of
size grl =O(δlog n
δ) (Kociumaka et al,2023b). We first show that, on an index of size
O(grl), one can compute the MEMs in time O(m2logϵgrl), for any constant ϵ > 0.
This is done by sliding the window P[i . . j] of the classic algorithm while we simulate
the process of searching for that window with the grammar. The simulation is carefully
crafted to avoid expensive operations, so the time stays proportional to the number of
cuts tried out on a single search for P. The space O(grl) is the least known to support
direct access to Twith logarithmic time guarantees (Navarro,2021a), so improving
our space is likely to involve breaking this long-standing barrier as well. Our result
essentially matches the first one of Gao (2022), which, although he did not claim so,
could also run within O(grl) space.
We further show that, on a particular grammar featuring local consistency prop-
erties (Kociumaka et al,2022), we can reduce the time to O(mlog m(log m+ logϵn))
by exploiting the fact that only O(log(ji+ 1)) cuts need to be tried out for P[i . . j],
and using much more sophisticated techniques to amortize the costs. This grammar is
of size O(δlog n
δ), which is optimal for every nand δ, and within this space we sharply
break the quadratic time of previous solutions that run in grammar-bounded space.
We then turn to consider several relatives of the MEM finding problem, adapting
our main algorithm to solve them:
A natural generalization of MEMs are k-MEMs, the maximal substrings of Pthat
appear at least ktimes in T. Those identify the parts of Pthat have sufficient
support in T, for example regions of a gene that appear in most genomes of a
collection, or regions in a reference genome that have sufficient coverage in a set of
reads. Even with kgiven at query time, this problem is also easily solved in O(m)
time with a suffix tree (Navarro,2016), but obtaining the same on grammars is not
so direct. We generalize our results on MEMs to find k-MEMs in time O(km2logϵn)
within O(grl) space, or in time O(mlog m(log m+klogϵn)) within O(δlog n
δ) space.
For k=ω(log2n), we provide faster solutions that run in time O(m2log2+ϵn) and
O(g) space, or in time O(mlog mlog2+ϵn) and O(δlog n
δ) space.
A stricter version of MEMs are MUMs (maximal unique matches), which are
maximal matches that appear exactly once in Pand in T. MUMs have various
applications to sequence alignment (Delcher et al,1999;akinen et al,2015;Giu-
liani et al,2022). Classical solutions using suffix trees (Sung,2010) and suffix arrays
4
(Ohlebusch,2013) compute MUMs in O(m+n) time, though they are easily seen
to take O(m) time if Pand Tare indexed separately. MUMs are also computed
in time O(mlog σ) using a BWT-based index (Belazzougui et al,2013). The only
compressed-space solution for repetitive texts we know (Giuliani et al,2022) com-
putes MUMs in O(r) space (plus a grammar on T) and O(mlog n) time. We compute
MUMs within the same space and time complexities as for finding the MEMs.
A natural generalization of MUMs are rare MEMs (Ohlebusch and Kurtz,2008),
which also have applications in whole genome alignment (Ohlebusch,2013, p. 419).
We say that a MEM is k-rare if it appears at most ktimes in Pand in T, so MUMs
are 1-rare MEMs. We show that k-rare MEMs can be computed within the same
space and time of k-MEMs.
We finally show, through applications to problems like data compression and
genome assembly, that our techniques open the door to using parsing-based indices in
stringology problems that had been addressed only through the more powerful (but
more space-consuming) suffix-based ones.
A conference version of this paper (Navarro,2023) appeared in Proc. CPM 2023.
In this journal version the key result, Section 6, was largely rewritten due to simplifi-
cations, improvements, and filling considerable gaps of the conference version. Several
minor problems were fixed in other sections as well. A new set of extensions of the
basic result to related problems, as well as applications of our result, are also included
in Sections 7and 8. Finally, the whole paper includes more detailed explanations,
figures, and a better presentation.
The roadmap of the paper is as follows. In Section 2we formally define MEMs, the
problem of finding them, and the variants we consider in the paper. We also describe
the algorithms for finding MEMs and how they are adapted to find their variants.
In Section 3we describe the basic grammar-based index for pattern matching, on
which our simple solution builds. This simple solution, quadratic in m, is described in
Section 4. Section 5then describes the more sophisticated locally consistent grammar
we will use, and how the pattern matching problem is solved on it. The more com-
plex, subquadratic, algorithm to find MEMs on locally consistent grammars is then
described in Section 6. This is the central result of the article. Section 7shows how
the tools we have developed can be used to solve the related problems we have con-
sidered: k-MEMs, MUMs, k-rare MEMs, and so on. We explore some direct and not
so direct applications of our results in Section 8. We give our conclusions and future
work directions in Section 9.
2 Maximal Exact Matches (MEMs) and Relatives
We use the classic notation on strings S[1 . . n], so S[i] is the ith symbol of S,S[i . . j]
denotes S[i]···S[j] (i.e., the concatenation of symbols S[i] to S[j] and the empty
string εif i>j), S[. . j] = S[1 . . j] and S[i . .] = S[i . . n]. The concatenation of strings
Sand Sis denoted S·S. We assume that the reader is familiar with the concepts
related to suffix trees (Weiner,1973;McCreight,1976;Crochemore and Rytter,2002).
We start by defining the problems of finding MEMs and k-MEMs, and how to solve
them on suffix trees.
5
摘要:

ComputingMEMsandRelativesonRepetitiveTextCollectionsGonzaloNavarro1,2*1CenterforBiotechnologyandBioengineering(CeBiB),Chile.2DepartmentofComputerScience,UniversityofChile,Chile.Correspondingauthor(s).E-mail(s):gnavarro@dcc.uchile.cl;AbstractWeconsidertheproblemofcomputingtheMaximalExactMatches(MEMs)...

展开>> 收起<<
Computing MEMs and Relatives on Repetitive Text Collections Gonzalo Navarro12.pdf

共39页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!

相关推荐

分类:图书资源 价格:10玖币 属性:39 页 大小:795.07KB 格式:PDF 时间:2025-04-29

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 39
客服
关注