
competitive with the BWT-based indices we mentioned when indexing one genome,
those indices may take orders of magnitude less space than the raw data, not just of
its indices, when representing a collection of many genomes.
Those compressed indices support exact pattern matching, that is, they can list
all the positions where Poccurs in T. While useful, this is insufficient to efficiently
implement the MEM finding algorithms. This is the problem we address in this paper.
1.1 MEM finding with indices for repetitive text collections
The classic MEM-finding algorithm runs on a suffix tree, and those that run on BWT-
based data structures emulate it. Compressed suffix trees for highly repetitive text
collections do exist, but do not compress that much. Gagie et al (2020) show how to
simulate a suffix tree within space O(rlog n
r), where ris the number of equal-letter
runs in the BWT of T. This representation can simulate the suffix-tree-based algorithm
in time O(mlog n
r) if we run it backwards on P, using operations parent and Weiner
link instead of child and suffix link. The problem is the space: while ris an accepted
measure of repetitiveness (Kempa and Kociumaka,2020), it is a weak one (Navarro,
2021a;Kempa and Kociumaka,2020), and multiplying it by log n
rmakes it grow by an
order of magnitude. Current implementations of compressed suffix trees for repetitive
texts achieve remarkable space, but still use at least 2–4 bits per symbol (Russo et al,
2011;Farruggia et al,2018;C´aceres and Navarro,2022;Boucher et al,2021a).
Another trend has been to expand the functionality of a more basic compressed
text index for repetitive texts so as to support specific operations, MEMs in our case.
Bannai et al (2020) show how to compute matching statistics (from where MEMs
are easily extracted in O(m) time) by extending the RLBWT-index (M¨akinen et al,
2010), in O(m(s+ log log n)) time and O(r) space, with the help of a data structure
that provides access to a symbol of Tin time O(s). This can be, for example, the
samples of the RLBWT-index, which add O(n/s) space to the index, or a context-free
grammar of T, which provides access in time s=O(log n) (Bille et al,2015). Various
implementations of this idea (Rossi et al,2022;Boucher et al,2021b;Tatarnikov et al,
2022) showed its practicality on large genome collections, with indices that are an
order of magnitude smaller than the text.
All those results have been obtained on the so-called suffix-based compressed indices
for repetitive collections (Navarro,2021b). This is natural because those emulate vari-
ants of suffix trees or arrays (Manber and Myers,1993), which simplifies the problem
of simulating the suffix tree traversal of the classic MEM-finding algorithm. Even
the naive algorithm of searching for all the O(m2) substrings of Pcan be run in
O(m2log log n) time on those O(r)-sized indices.
The problem is much harder on the so-called parsing-based indices (Navarro,
2021b). Those are potentially smaller than the suffix-based indices because they build
on stronger measures of repetitiveness. For example, the size gof the smallest context-
free grammar that generates Tis usually considerably smaller than r(Navarro,2021a).
Because these indices cut Tinto phrases, even exact pattern matching is complicated
because the occurrences of Pcan appear in many different forms, and many possible
cuts of Pmust be tried out (m−1 in the general case) (Claude et al,2021). This makes
the problem of finding MEMs considerably harder. We are only aware of the results
3