1 Substring Density Estimation from Traces Kayvon Mazooji and Ilan Shomorony

2025-04-30 0 0 597.96KB 22 页 10玖币
侵权投诉
1
Substring Density Estimation from Traces
Kayvon Mazooji and Ilan Shomorony
Department of Electrical and Computer Engineering
University of Illinois, Urbana-Champaign
mazooji2@illinois.edu, ilans@illinois.edu
Abstract
In the trace reconstruction problem, one seeks to reconstruct a binary string sfrom a collection
of traces, each of which is obtained by passing sthrough a deletion channel. It is known that
exp( ˜
O(n1/5)) traces suffice to reconstruct any length-nstring with high probability. We consider
a variant of the trace reconstruction problem where the goal is to recover a “density map” that
indicates the locations of each length-ksubstring throughout s. We show that 2·poly(n) traces
suffice to recover the density map with error at most . As a result, when restricted to a set of source
strings whose minimum “density map distance” is at least 1/poly(n), the trace reconstruction
problem can be solved with polynomially many traces.
1 Introduction
In the trace reconstruction problem, there is an unknown binary string s∈ {0,1}n, which we wish
to reconstruct based on Tsubsequences (or traces) of s. Each trace is obtained by passing the source
string sthrough a deletion channel, which deletes each bit of sindependently with probability p.
The main question of interest is how many traces are needed in order to reconstruct scorrectly.
This problem was originally proposed by Batu et al. [1], motivated by problems in sequence
alignment, phylogeny, and computational biology [2]. Most of the work on the trace reconstruction
problem has focused on characterizing the minimum number of traces needed for reconstructing the
source string sexactly. The most common formulation of the problem, known as worst-case trace
reconstruction [3], requires the reconstruction algorithm to recover s∈ {0,1}nexactly with high
probability as n for any string s∈ {0,1}n. While this problem has received considerable
attention, there is still a significant gap between upper and lower bounds on the number of traces
needed. Currently, the best lower bound is ˜
Ω(n3/2), while the best upper bound is exp( ˜
O(n1/5)),
both due to Chase [4, 5].
The exponential gap between the best known lower and upper bounds has motivated the
formulation of several variants of the trace reconstruction problem where tighter bounds can hopefully
be obtained. For example, in the average-case trace reconstruction problem, sis assumed to be drawn
uniformly at random from all {0,1}nstrings. In this case, it is known that only T= exp(O(log1/3(n)))
traces are sufficient [6]. An approximate trace reconstruction problem, where a fraction of the
recovered bits is allowed to be incorrect, has also been formulated [7], and the problem of finding
the maximum likelihood sequence sfrom a small number of traces (possibly insufficient for exact
reconstruction) has been recently studied [8]. We can also consider a more modest goal than the
reconstruction of the source sequence sitself. One example that is particularly relevant to our
discussion is the reconstruction of the k-subword deck of s[9, 10].
The k-subword deck of a binary sequence s∈ {0,1}nis the multiset of all length-ksubstrings,
i.e., {s[i:i+k1] : i= 1, . . . , n k+ 1}. Equivalently, the k-subword deck can be defined by the
counts Ns,x of the number of times xappears in sas a substring:
Sk(s) = [Ns,x :x∈ {0,1}k].(1)
The work of Kayvon Mazooji and Ilan Shomorony was supported in part by the National Science Foundation (NSF)
under grants CCF-2007597 and CCF-2046991.
arXiv:2210.10917v1 [cs.IT] 19 Oct 2022
2
Fig. 1: (a) Example of a source binary string sand its k-subword deck (or k-spectrum) Sk(s),
for k= 4. (b) Given S4(s) in (a), one can build a de Bruijn graph where the elements in
S4(s) correspond to edges (with multiplicities) and the nodes correspond to 3-mers. Notice that
scorresponds to an Eulerian path on the de Bruijn graph, but such a path is not unique; for
example, s0= 000111000111000000000111 corresponds to another Eulerian path.
As shown in [10], for k=O(log n), the k-subword deck Sk(s) can be recovered with poly(n) traces.
The k-subword deck of a sequence is an important object in bioinformatics, with applications in
error correction [11, 12], sequence assembly [13, 14], and genomic complexity analysis [15, 16]. In
these contexts, the k-subword deck Sk(s) is often referred to as the k-spectrum, and each length-k
substring is called a k-mer. Intuitively, as long as kis large enough, the k-subword deck can uniquely
determine the source sequence s. In fact, a classical result by Ukkonen [17] provides a necessary and
sufficient condition for Sk(s) to uniquely determine sbased on the length of the “interleaved repeats”
in s[18]. In particular, if there are no repeats of length k1 in s, one can reconstruct sfrom Sk(s)
by simply merging k-mers with a prefix-suffix match of length k1. More generally, given Sk(s),
one can build the de Bruijn graph, where nodes correspond to (k1)-mers and edges correspond
to k-mers, and sis guaranteed to be an Eulerian path in the graph [13, 19]. This is illustrated in
Figure 1.
While the k-subword deck is a natural intermediate goal towards the reconstruction of s(and
can be recovered with only poly(n) traces), it does not capture all the information present in the
traces. For example, the k-subword deck Sk(s) in Figure 1 also admits the reconstruction s0=
000111000111000000000111, even though s0should be easy to distinguish from sbased on traces
(by estimating the length of the second and third runs of zeros). Motivated by this shortcoming of
the k-subword deck, we propose the idea of a k-mer density map, as a kind of localized k-subword
deck where, in addition to knowing the number of times a given k-mer appears in s, we have some
information about where it occurs.
For a k-mer x∈ {0,1}k, let Is,x ∈ {0,1}nk+1 be the indicator vector of the occurrences of xin s;
i.e., Is,x[j] = {sj:j+k1=x}, as illustrated in Figure 2. Notice that recovering the k-subword deck
can be seen as recovering PjIs,x[j] for each x∈ {0,1}k. Also notice that recovering sis equivalent
to recovering Is,x for all x∈ {0,1}k. A k-mer density map can be obtained by computing
Ks,x[i] =
nk+1
X
j=1
h(i, j)Is,x[j] (2)
for some “smoothing kernel” h(i, j), as illustrated in Figure 2. Intuitively, for a given x,Ks,x gives
a coarse indication of the occurrences of xin s. Moreover, if his such that Pih(i, j) = 1 for each
j, it holds that
X
i
Ks,x[i] = X
iX
j
h(i, j)Is,x[j] = X
jIs,x[j]X
i
h(i, j) = X
jIs,x[j],
which means that the k-subword deck Sk(s) is a function of Ks,x, and the density map Ks,x can be
thought of as a generalization of the k-subword deck that provides information about k-mer location.
3
Fig. 2: For each x∈ {0,1}k,Is,x indicates the occurrences of xin s. The density map Ks,x can be
obtained via Ks,x[i] = Pnk+1
j=1 h(i, j)Is,x[j].
We will focus on a specific choice of h(i, j) that will render Ks,x easier to estimate from the traces.
We will let h(i, j) be the probability that a binomial random variable with j1 trials and probability
parameter 1 pis equal to i1; i.e., h(i, j) = j1
i1(1 p)i1pji. This is also the probability that
the jth bit of s(if not deleted) ends up as the ith bit of a trace. Hence we have
Ks,x[i] =
nk+1
X
j=1 j1
i1!(1 p)i1pjiIs,x[j].(3)
for i∈ {1, . . . , n k+ 1}.Notice that the maximum value of h(i, j) for a fixed joccurs when
ij(1 p) so the kernel h(·, j) has its peak shifted to the left and Ks,x is a density map of
occurrences of xin sshifted to the left. Operationally, (1 p)kKs,x[i] is the probability that a fully
preserved copy of xin sappears in position ion a given trace of s.
We define the k-mer density map of sas Ks= [Ks,x :x∈ {0,1}k] (the concatenation of all vectors
Ks,x). If the k-mer density map Ksis known exactly,scan also be recovered exactly. This can be
seen by noticing the invertibility of the upper-triangular matrix Fthat transforms the binary vector
Is,x into the vector Ks,x (for a fixed x). The matrix Fis upper triangular with non-zero entries on
the main diagonal, which makes it invertible. While invertible, Fis ill-conditioned since some of the
entries on the diagonal are close to 0, making the transformation from Ks,x to Is,x sensitive to noise
in Ks,x.
We present an algorithm that, given Ttraces, constructs an estimate ˆ
Ksfor the k-mer density
map. Our main result establishes that we can achieve estimation error
kˆ
KsKsk= max
x,i |ˆ
Ks,x[i]Ks,x[i]|< 
using T=2·poly(n) traces. Hence, the density map Kscan be estimated with maximum error
n= 1/g(n) for g(n)poly(n) using polynomially many traces. In particular, given a set of candidate
source strings A⊂{0,1}nsuch that, for any s, s0∈ A,
kKsKs0k2,
the true source sequence s∈ A can be recovered with 2·poly(n) traces. This adds to the existing
literature on classes of strings recoverable/distinguishable with polynomially many traces [20–22].
4
Since Is,x and Ks,x are related through an invertible (albeit ill-conditioned) linear transformation
Ks,x =FIs,x, the approximate recovery of the k-mer density map ˆ
Ks,x suggests natural reconstruc-
tion algorithms for Is,x, e.g., based on a regularized least squares problem
min
ˆ
Is,x kˆ
Ks,x Fˆ
Is,xk2
2+δkˆ
Is,xk2
2,
which is a convex program if ˆ
Is,x is allowed to be real-valued. The solution ˆ
Is,x can then be converted
into a reconstructed string ˆs∈ {0,1}nthrough a majority voting across candidate k-mers for each
position. Hence, in contrast to much of the theoretical literature on the trace reconstruction problem,
the k-mer density map leads to new reconstruction approaches.
Our main result relies on a nontrivial estimator for Ks,x that simultaneously uses count information
for all binary strings ythat are supersequences of x. The estimator is obtained by first deriving
a recursive formula for Ks,x, then applying a known result in the combinatorics of strings on the
expansion of the recursive formula to obtain a non-recursive formula. An application of McDiarmid’s
inequality is then used to prove the estimator is successful with high probability. To the best of
our knowledge, these techniques have not appeared in the trace reconstruction literature, where
most recent results have been based on complex analysis [3–6, 23]. Our techniques also lead to
an improvement on a previously known upper bound [10] on the number of traces needed for
reconstructing the k-subword deck of sfor p < 0.5.
A. Related work
The current best upper bounds for worst-case trace reconstruction [3, 5, 23] are obtained using
algorithms that consider each pair of possible source strings y, z ∈ {0,1}n, and decide whether the
set of traces looks more like a set of traces from y, or more like a set of traces from z(we formalize
this shortly). Then if there are enough traces, with high probability the true source string swill
be the unique string such that the set of traces looks more like it came from sthan from any
other string in {0,1}n. Therefore, the trace reconstruction problem is closely related to the trace-
distinguishing problem [21, 22], where we want to decide from the set of traces whether the source
string is either yor zwith high probability. The best existing upper bound of exp(O(n1/5)) for
worst-case trace reconstruction [5] is proved using the fact that any two strings can be distinguished
using exp(O(n1/5)) traces. It has also been shown that string pairs at constant Hamming distance
can be distinguished using poly(n) traces by McGregor, Price and Vorotnikova [20], and separately
by Grigorescu, Sudan, and Zhu using different techniques [21]. It was recently shown that strings at
constant Levenshtein distance can be distinguished using poly(n) traces by Sima and Bruck [22].
To distinguish between two strings yand zfrom a set of traces, current state of the art algorithms
identify a function fy,z such that |E[fy,z(˜
Y)]E[fy,z(˜
Z)]|is sufficiently large where ˜
Ydenotes a trace
of y. Given Ttraces ˜
S1,..., ˜
STof a source string s,E[fy,z(˜
S)] can be estimated as 1
TPT
i=1 fy,z(˜
Si)
and we say that ybeats zif 1
TPT
i=1 fy,z(˜
Si) is closer to E[fy,z(˜
Y)] than to E[fy,z(˜
Z)]. Observe that
if fy,z is such that |E[fy,z (˜
Y)] E[fy,z(˜
Z)]|is large enough, then assuming the source string is yor
z, we can distinguish between the two cases given a reasonable number of traces. If there is a unique
string usuch that ubeats all other strings, then uis output as the reconstruction.
The first such function fy,z introduced by Nazarov and Peres [23], and independently by De,
O’Donnell, and Servedio [3], is simply the value of a single bit in the trace, i.e., fy,z(˜
S) = ˜
S[i] for
source string sand some index ithat depends on yand z([9] also uses a single bit approach). An
argument based on an existing result on complex-valued polynomials was used to show that for any
pair of strings y, z, there is some index iy,z such that |E[˜
Y[iy,z]]E[˜
Z[iy,z]]| ≥ exp (O(n1/3)). Using
this result along with a standard concentration inequality, the upper bound of exp (O(n1/3)) traces
is obtained. This choice of fy,z is known in the literature as a single bit statistic. The current best
5
upper bound by Chase [5] picks a more complicated function fy,z that involves multiple bits, and
proves a novel result on complex polynomials in order to prove a gap of |E[fy,z(˜
Y)] E[fy,z(˜
Z)]| ≥
exp (˜
O(n1/5)), which yields an upper bound of exp ( ˜
O(n1/5)) traces. A choice of fy,z that uses
multiple bits in combination (e.g. that of Chase [5]) is known in the literature as a multi-bit statistic.
There are similarities between the approach used in [5] and the approach used in this paper, and
it may be possible to obtain results similar to those in this paper based on the complex analysis
techniques explored in [3, 5, 23].
Obtaining lower bounds for the number of traces needed in the trace reconstruction problem
amounts to proving a lower bound on the number of traces required to distinguish two strings (trace-
distinguishing problem) since any algorithm to solve the trace reconstruction problem can be used
to solve the trace-distinguishing problem. For a string a, let aidenote the string where ais repeated
itimes. The current best lower bound of ˜
Ω(n3/2) for trace reconstruction discovered by Chase [4] is
obtained by analyzing the string pair (01)m1(01)m+1 and (01)m+11(01)mwhere n= 2m+ 3.
The trace reconstruction problem has also been considered in the smoothed complexity model
by Chen, De, Lee, Servedio, and Sinha [10], in which a worst-case string is passed through a
binary symmetric channel before trace generation, and the noise-corrupted string needs to be
reconstructed with high probability (recall that a binary symmetric channel flips each bit in the string
independently with some fixed probability). Chen et al. proved that in the smoothed complexity
model, trace reconstruction requires poly(n) traces. This result relies on the simple fact that if there
are no repeated substrings of length k1 or greater in the source string s, then the k-subword deck
uniquely determines s. The authors prove that for an arbitrary string, the (log n)-subword deck can
be reconstructed with high probability using poly(n) traces, and prove there will not be any repeats
in the source string of length at least (log n1) with high probability after it is passed through a
binary symmetric channel, thereby proving that poly(n) traces suffice for trace reconstruction in the
smoothed complexity model.
The bulk of the work done in [10] is proving that the (log n)-subword deck can be reconstructed
with high probability using poly(n) traces for any p < 1. In order to prove this, a formula for the
number of times a substring xis present in the source string sis derived in terms of the expected
number of times a trace contains x, and the expected number of times a trace contains supersequences
of x. The expected values of these statistics are then estimated from the set of traces in order to
estimate the number of times xappears in s. Concentration inequalities are used to prove that the
number of times xappears is estimated correctly with high probability. Narayanan and Ren [24]
provide a similar result on reconstructing the (100 log n)-subword deck for circular strings in poly(n)
time. Our result showing that the (log n)-subword deck can be computed using poly(n) traces with
high probability for p < 0.5 was originally proved independently, without knowledge of [10] and [24].
B. Notation
Strings in this paper are binary and indexed starting from 1. If the index iis negative, x[i] is the
(i)th element starting from the right end of x. For example, if s= 1001, then s[1] = 1, s[2] = 0,
s[1] = 1, and s[2] = 0. Let s∈ {0,1}nbe the length-nstring we are trying to recover. The string
swill be called the source string. A trace of sis denoted by ˜
S, and is generated by deleting each bit
of sindependently with probability p.
For a given string x, we let |x|denote the length of x. For a string aand integer r,ardenotes
the string formed by concatenating rcopies of a. A subsequence of xis a string that can be formed
by deleting elements from x, and a supersequence of xis a string that can be formed by inserting
elements into x. This is in contrast to a substring of x, which is a string that appears in x. We let
x[i, j]=(x[i], x[i+ 1], . . . , x[j]) be the substring of xthe begins at position iand ends at position
j. For example, if s= 10101, then s[1 : 3] = 101 and s[2 : 2] = 010. For two stings xand y,
摘要:

1SubstringDensityEstimationfromTracesKayvonMazoojiandIlanShomoronyDepartmentofElectricalandComputerEngineeringUniversityofIllinois,Urbana-Champaignmazooji2@illinois.edu,ilans@illinois.eduAbstractInthetracereconstructionproblem,oneseekstoreconstructabinarystringsfromacollectionoftraces,eachofwhichiso...

展开>> 收起<<
1 Substring Density Estimation from Traces Kayvon Mazooji and Ilan Shomorony.pdf

共22页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:22 页 大小:597.96KB 格式:PDF 时间:2025-04-30

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 22
客服
关注