Computing MEMs and Relatives on Repetitive Text Collections Gonzalo Navarro12

2025-04-29 1 0 795.07KB 39 页 10玖币

侵权投诉

Computing MEMs and Relatives

on Repetitive Text Collections

Gonzalo Navarro1,2*

1Center for Biotechnology and Bioengineering (CeBiB), Chile.

2Department of Computer Science, University of Chile, Chile.

Corresponding author(s). E-mail(s): gnavarro@dcc.uchile.cl;

Abstract

We consider the problem of computing the Maximal Exact Matches (MEMs) of

a given pattern P[1 ..m]on a large repetitive text collection T[1 ..n], which

is represented as a (hopefully much smaller) run-length context-free grammar of

size grl . We show that the problem can be solved in time O(m2logϵn), for any

constant ϵ > 0, on a data structure of size O(grl). Further, on a locally consistent

grammar of size O(δlog n

δ), the time decreases to O(mlog m(log m+logϵn)).

The value δis a function of the substring complexity of Tand Ω(δlog n

δ)is a

tight lower bound on the compressibility of repetitive texts T, so our structure

has optimal size in terms of nand δ. We extend our results to several related

problems, such as ﬁnding k-MEMs, MUMs, rare MEMs, and applications.

Keywords: repetitive texts, substring complexity, grammar compression, locally

consistent parsing, compressed data structures, MEMs, MUMs

1 Introduction and Related Work

Inexact sequence matching is the norm in Bioinformatic applications. Mutations in the

genomes, and even the possible errors that arise in the sequence acquisition process,

makes researchers expect to see diﬀerences between the patterns they look for and what

they call their occurrences (Gusﬁeld,1997;Ohlebusch,2013;M¨akinen et al,2015). For

example, when assembling a genome one must align the reads (which are sequences

obtained from the DNA of an individual, a few hundreds or thousands nucleotides

long) to a reference genome (of length typically in the billions) or a set thereof. For the

reasons above, the read may not appear in the genome in exact form, so one looks for

arXiv:2210.09914v5 [cs.DS] 4 Sep 2023

the longest substrings of the read that appear in the genome, to determine the most

likely positions where to align it. In pangenomics, one may compare the substrings of a

whole gene or chromosome against another, or against a set of genomes representative

of a population, to ﬁnd conserved regions and spot the places that diﬀer, which may

indicate genetic variations or diseases.

Those examples are applications of one of the most relevant tools for inexact match-

ing, which is ﬁnding the Maximal Exact Matches (MEMs) of a given pattern P[1 . . m]

in a text T[1 . . n]. A MEM is a maximal substring P[i . . j] that appears in T(i.e.,

P[i−1. . j] and P[i . . j + 1] are out of bounds or do not occur in T). In the align-

ment of reads, mis in the hundreds or thousands; when aligning a chromosome mcan

surpass the millions. If the text is one genome, ncan be in the billions, but genomic

collections may contain hundreds to thousands of genomes.

In this paper we are interested in the case where Tis known in advance and thus

can be indexed so that, later, we can eﬃciently ﬁnd the MEMs of many patterns P

in it. This is the case in the two applications we have mentioned; in the pangenomic

case Tcan be taken as the concatenation of all the genomes in the collection. Many

other applications of the problem of ﬁnding the MEMs of a new string in a static

collection of strings are mentioned by Gusﬁeld (1997), Ohlebusch (2013), and M¨akinen

et al (2015), including simpliﬁed problems like ﬁnding longest common substrings and

equivalent problems like computing matching statistics. some later.

Finding MEMs on an indexed text Tis a classic problem in stringology and can be

solved in optimal O(m) time using a suﬃx tree of T(Weiner,1973;McCreight,1976).

Gusﬁeld (1997, Sec. 7.8), for example, shows how to solve the analogous problem of

computing matching statistics (we discuss this equivalence later). In applications where

we handle massive texts T, however, suﬃx trees are too large to be maintained in

main memory, even if they use linear space. The suﬃx tree of a single human genome,

for example, whose length is about 3 billion bases, may require 60GB of memory with

a decent implementation. This rules out suﬃx trees to represent genomes on the large

bioinformatic collections that are arising, and make researchers look for alternatives

using less space. For example, Ohlebusch et al (2010) and Belazzougui et al (2013)

use indices based on the Burrows-Wheeler Transform (BWT) (Burrows and Wheeler,

1994) to compute MEMs in time O(mlog σ), where σis the alphabet size. Such indices

take just a few GBs on a human genome, an order of magnitude below the space

required with suﬃx trees.

When considering large collections of genomes, however, even such sharp space

reduction can be insuﬃcient. Projects to sequence 100,000 human genomes have been

completed1, and current projects aim to sequencing millions of human genomes2.

In that scenario, even the BWT-based representations will need petabytes of main

memory to run, or resort to orders of magnitude slower disk storage.

A fortunate situation is that many of the fastest growing text collections, including

genome collections, are highly repetitive (Navarro,2021a): two genomes of the same

species feature a small percentage of diﬀerences only. Several text indices exploiting

repetitiveness to reduce space have appeared (Navarro,2021b). While probably not

1https://www.genomicsengland.co.uk/initiatives/100000-genomes-project

2https://b1mg-project.eu

competitive with the BWT-based indices we mentioned when indexing one genome,

those indices may take orders of magnitude less space than the raw data, not just of

its indices, when representing a collection of many genomes.

Those compressed indices support exact pattern matching, that is, they can list

all the positions where Poccurs in T. While useful, this is insuﬃcient to eﬃciently

implement the MEM ﬁnding algorithms. This is the problem we address in this paper.

1.1 MEM ﬁnding with indices for repetitive text collections

The classic MEM-ﬁnding algorithm runs on a suﬃx tree, and those that run on BWT-

based data structures emulate it. Compressed suﬃx trees for highly repetitive text

collections do exist, but do not compress that much. Gagie et al (2020) show how to

simulate a suﬃx tree within space O(rlog n

r), where ris the number of equal-letter

runs in the BWT of T. This representation can simulate the suﬃx-tree-based algorithm

in time O(mlog n

r) if we run it backwards on P, using operations parent and Weiner

link instead of child and suﬃx link. The problem is the space: while ris an accepted

measure of repetitiveness (Kempa and Kociumaka,2020), it is a weak one (Navarro,

2021a;Kempa and Kociumaka,2020), and multiplying it by log n

rmakes it grow by an

order of magnitude. Current implementations of compressed suﬃx trees for repetitive

texts achieve remarkable space, but still use at least 2–4 bits per symbol (Russo et al,

2011;Farruggia et al,2018;C´aceres and Navarro,2022;Boucher et al,2021a).

Another trend has been to expand the functionality of a more basic compressed

text index for repetitive texts so as to support speciﬁc operations, MEMs in our case.

Bannai et al (2020) show how to compute matching statistics (from where MEMs

are easily extracted in O(m) time) by extending the RLBWT-index (M¨akinen et al,

2010), in O(m(s+ log log n)) time and O(r) space, with the help of a data structure

that provides access to a symbol of Tin time O(s). This can be, for example, the

samples of the RLBWT-index, which add O(n/s) space to the index, or a context-free

grammar of T, which provides access in time s=O(log n) (Bille et al,2015). Various

implementations of this idea (Rossi et al,2022;Boucher et al,2021b;Tatarnikov et al,

2022) showed its practicality on large genome collections, with indices that are an

order of magnitude smaller than the text.

All those results have been obtained on the so-called suﬃx-based compressed indices

for repetitive collections (Navarro,2021b). This is natural because those emulate vari-

ants of suﬃx trees or arrays (Manber and Myers,1993), which simpliﬁes the problem

of simulating the suﬃx tree traversal of the classic MEM-ﬁnding algorithm. Even

the naive algorithm of searching for all the O(m2) substrings of Pcan be run in

O(m2log log n) time on those O(r)-sized indices.

The problem is much harder on the so-called parsing-based indices (Navarro,

2021b). Those are potentially smaller than the suﬃx-based indices because they build

on stronger measures of repetitiveness. For example, the size gof the smallest context-

free grammar that generates Tis usually considerably smaller than r(Navarro,2021a).

Because these indices cut Tinto phrases, even exact pattern matching is complicated

because the occurrences of Pcan appear in many diﬀerent forms, and many possible

cuts of Pmust be tried out (m−1 in the general case) (Claude et al,2021). This makes

the problem of ﬁnding MEMs considerably harder. We are only aware of the results

of Gao (2022), who computes matching statistics in time O(m2logϵγ+mlog n) using

O(δlog n

δ) space (for any constant ϵ > 0), or O(m2+mlog γlog log γ+mlog n) using

O(δlog n

δ+γlog γ) space. Here δ≤γare lower-bounding measures of repetitiveness

(Kempa and Prezza,2018;Christiansen et al,2020). The size O(δlog n

δ) matches a

tight lower bound on the size of compressed representations of T(Kociumaka et al,

2023b), so a structure of this size uses asymptotically optimal space for every nand δ.

1.2 Our contribution

We solve the MEM ﬁnding problem, and several relatives, within less space and/or

time than previous work, for the case of repetitive text collections.

Let grl be the size of any run-length context-free grammar generating T(those

include and extend classic context-free grammars). The smallest such grammar is of

size grl =O(δlog n

δ) (Kociumaka et al,2023b). We ﬁrst show that, on an index of size

O(grl), one can compute the MEMs in time O(m2logϵgrl), for any constant ϵ > 0.

This is done by sliding the window P[i . . j] of the classic algorithm while we simulate

the process of searching for that window with the grammar. The simulation is carefully

crafted to avoid expensive operations, so the time stays proportional to the number of

cuts tried out on a single search for P. The space O(grl) is the least known to support

direct access to Twith logarithmic time guarantees (Navarro,2021a), so improving

our space is likely to involve breaking this long-standing barrier as well. Our result

essentially matches the ﬁrst one of Gao (2022), which, although he did not claim so,

could also run within O(grl) space.

We further show that, on a particular grammar featuring local consistency prop-

erties (Kociumaka et al,2022), we can reduce the time to O(mlog m(log m+ logϵn))

by exploiting the fact that only O(log(j−i+ 1)) cuts need to be tried out for P[i . . j],

and using much more sophisticated techniques to amortize the costs. This grammar is

of size O(δlog n

δ), which is optimal for every nand δ, and within this space we sharply

break the quadratic time of previous solutions that run in grammar-bounded space.

We then turn to consider several relatives of the MEM ﬁnding problem, adapting

our main algorithm to solve them:

•A natural generalization of MEMs are k-MEMs, the maximal substrings of Pthat

appear at least ktimes in T. Those identify the parts of Pthat have suﬃcient

support in T, for example regions of a gene that appear in most genomes of a

collection, or regions in a reference genome that have suﬃcient coverage in a set of

reads. Even with kgiven at query time, this problem is also easily solved in O(m)

time with a suﬃx tree (Navarro,2016), but obtaining the same on grammars is not

so direct. We generalize our results on MEMs to ﬁnd k-MEMs in time O(km2logϵn)

within O(grl) space, or in time O(mlog m(log m+klogϵn)) within O(δlog n

δ) space.

For k=ω(log2n), we provide faster solutions that run in time O(m2log2+ϵn) and

O(g) space, or in time O(mlog mlog2+ϵn) and O(δlog n

δ) space.

•A stricter version of MEMs are MUMs (maximal unique matches), which are

maximal matches that appear exactly once in Pand in T. MUMs have various

applications to sequence alignment (Delcher et al,1999;M¨akinen et al,2015;Giu-

liani et al,2022). Classical solutions using suﬃx trees (Sung,2010) and suﬃx arrays

(Ohlebusch,2013) compute MUMs in O(m+n) time, though they are easily seen

to take O(m) time if Pand Tare indexed separately. MUMs are also computed

in time O(mlog σ) using a BWT-based index (Belazzougui et al,2013). The only

compressed-space solution for repetitive texts we know (Giuliani et al,2022) com-

putes MUMs in O(r) space (plus a grammar on T) and O(mlog n) time. We compute

MUMs within the same space and time complexities as for ﬁnding the MEMs.

•A natural generalization of MUMs are rare MEMs (Ohlebusch and Kurtz,2008),

which also have applications in whole genome alignment (Ohlebusch,2013, p. 419).

We say that a MEM is k-rare if it appears at most ktimes in Pand in T, so MUMs

are 1-rare MEMs. We show that k-rare MEMs can be computed within the same

space and time of k-MEMs.

We ﬁnally show, through applications to problems like data compression and

genome assembly, that our techniques open the door to using parsing-based indices in

stringology problems that had been addressed only through the more powerful (but

more space-consuming) suﬃx-based ones.

A conference version of this paper (Navarro,2023) appeared in Proc. CPM 2023.

In this journal version the key result, Section 6, was largely rewritten due to simpliﬁ-

cations, improvements, and ﬁlling considerable gaps of the conference version. Several

minor problems were ﬁxed in other sections as well. A new set of extensions of the

basic result to related problems, as well as applications of our result, are also included

in Sections 7and 8. Finally, the whole paper includes more detailed explanations,

ﬁgures, and a better presentation.

The roadmap of the paper is as follows. In Section 2we formally deﬁne MEMs, the

problem of ﬁnding them, and the variants we consider in the paper. We also describe

the algorithms for ﬁnding MEMs and how they are adapted to ﬁnd their variants.

In Section 3we describe the basic grammar-based index for pattern matching, on

which our simple solution builds. This simple solution, quadratic in m, is described in

Section 4. Section 5then describes the more sophisticated locally consistent grammar

we will use, and how the pattern matching problem is solved on it. The more com-

plex, subquadratic, algorithm to ﬁnd MEMs on locally consistent grammars is then

described in Section 6. This is the central result of the article. Section 7shows how

the tools we have developed can be used to solve the related problems we have con-

sidered: k-MEMs, MUMs, k-rare MEMs, and so on. We explore some direct and not

so direct applications of our results in Section 8. We give our conclusions and future

work directions in Section 9.

2 Maximal Exact Matches (MEMs) and Relatives

We use the classic notation on strings S[1 . . n], so S[i] is the ith symbol of S,S[i . . j]

denotes S[i]···S[j] (i.e., the concatenation of symbols S[i] to S[j] and the empty

string εif i>j), S[. . j] = S[1 . . j] and S[i . .] = S[i . . n]. The concatenation of strings

Sand S′is denoted S·S′. We assume that the reader is familiar with the concepts

related to suﬃx trees (Weiner,1973;McCreight,1976;Crochemore and Rytter,2002).

We start by deﬁning the problems of ﬁnding MEMs and k-MEMs, and how to solve

them on suﬃx trees.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

ComputingMEMsandRelativesonRepetitiveTextCollectionsGonzaloNavarro1,2*1CenterforBiotechnologyandBioengineering(CeBiB),Chile.2DepartmentofComputerScience,UniversityofChile,Chile.Correspondingauthor(s).E-mail(s):gnavarro@dcc.uchile.cl;AbstractWeconsidertheproblemofcomputingtheMaximalExactMatches(MEMs)...

展开>> 收起<<

Computing MEMs and Relatives on Repetitive Text Collections Gonzalo Navarro12.pdf

共39页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Computing MEMs and Relatives on Repetitive Text Collections Gonzalo Navarro12

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: