1 Substring Density Estimation from Traces Kayvon Mazooji and Ilan Shomorony

2025-04-30 0 0 597.96KB 22 页 10玖币

侵权投诉

Substring Density Estimation from Traces

Kayvon Mazooji and Ilan Shomorony

Department of Electrical and Computer Engineering

University of Illinois, Urbana-Champaign

mazooji2@illinois.edu, ilans@illinois.edu

Abstract

In the trace reconstruction problem, one seeks to reconstruct a binary string sfrom a collection

of traces, each of which is obtained by passing sthrough a deletion channel. It is known that

exp( ˜

O(n1/5)) traces suﬃce to reconstruct any length-nstring with high probability. We consider

a variant of the trace reconstruction problem where the goal is to recover a “density map” that

indicates the locations of each length-ksubstring throughout s. We show that −2·poly(n) traces

suﬃce to recover the density map with error at most . As a result, when restricted to a set of source

strings whose minimum “density map distance” is at least 1/poly(n), the trace reconstruction

problem can be solved with polynomially many traces.

1 Introduction

In the trace reconstruction problem, there is an unknown binary string s∈ {0,1}n, which we wish

to reconstruct based on Tsubsequences (or traces) of s. Each trace is obtained by passing the source

string sthrough a deletion channel, which deletes each bit of sindependently with probability p.

The main question of interest is how many traces are needed in order to reconstruct scorrectly.

This problem was originally proposed by Batu et al. [1], motivated by problems in sequence

alignment, phylogeny, and computational biology [2]. Most of the work on the trace reconstruction

problem has focused on characterizing the minimum number of traces needed for reconstructing the

source string sexactly. The most common formulation of the problem, known as worst-case trace

reconstruction [3], requires the reconstruction algorithm to recover s∈ {0,1}nexactly with high

probability as n→ ∞ for any string s∈ {0,1}n. While this problem has received considerable

attention, there is still a signiﬁcant gap between upper and lower bounds on the number of traces

needed. Currently, the best lower bound is ˜

Ω(n3/2), while the best upper bound is exp( ˜

O(n1/5)),

both due to Chase [4, 5].

The exponential gap between the best known lower and upper bounds has motivated the

formulation of several variants of the trace reconstruction problem where tighter bounds can hopefully

be obtained. For example, in the average-case trace reconstruction problem, sis assumed to be drawn

uniformly at random from all {0,1}nstrings. In this case, it is known that only T= exp(O(log1/3(n)))

traces are suﬃcient [6]. An approximate trace reconstruction problem, where a fraction of the

recovered bits is allowed to be incorrect, has also been formulated [7], and the problem of ﬁnding

the maximum likelihood sequence sfrom a small number of traces (possibly insuﬃcient for exact

reconstruction) has been recently studied [8]. We can also consider a more modest goal than the

reconstruction of the source sequence sitself. One example that is particularly relevant to our

discussion is the reconstruction of the k-subword deck of s[9, 10].

The k-subword deck of a binary sequence s∈ {0,1}nis the multiset of all length-ksubstrings,

i.e., {s[i:i+k−1] : i= 1, . . . , n −k+ 1}. Equivalently, the k-subword deck can be deﬁned by the

counts Ns,x of the number of times xappears in sas a substring:

Sk(s) = [Ns,x :x∈ {0,1}k].(1)

The work of Kayvon Mazooji and Ilan Shomorony was supported in part by the National Science Foundation (NSF)

under grants CCF-2007597 and CCF-2046991.

arXiv:2210.10917v1 [cs.IT] 19 Oct 2022

Fig. 1: (a) Example of a source binary string sand its k-subword deck (or k-spectrum) Sk(s),

for k= 4. (b) Given S4(s) in (a), one can build a de Bruijn graph where the elements in

S4(s) correspond to edges (with multiplicities) and the nodes correspond to 3-mers. Notice that

scorresponds to an Eulerian path on the de Bruijn graph, but such a path is not unique; for

example, s0= 000111000111000000000111 corresponds to another Eulerian path.

As shown in [10], for k=O(log n), the k-subword deck Sk(s) can be recovered with poly(n) traces.

The k-subword deck of a sequence is an important object in bioinformatics, with applications in

error correction [11, 12], sequence assembly [13, 14], and genomic complexity analysis [15, 16]. In

these contexts, the k-subword deck Sk(s) is often referred to as the k-spectrum, and each length-k

substring is called a k-mer. Intuitively, as long as kis large enough, the k-subword deck can uniquely

determine the source sequence s. In fact, a classical result by Ukkonen [17] provides a necessary and

suﬃcient condition for Sk(s) to uniquely determine sbased on the length of the “interleaved repeats”

in s[18]. In particular, if there are no repeats of length k−1 in s, one can reconstruct sfrom Sk(s)

by simply merging k-mers with a preﬁx-suﬃx match of length k−1. More generally, given Sk(s),

one can build the de Bruijn graph, where nodes correspond to (k−1)-mers and edges correspond

to k-mers, and sis guaranteed to be an Eulerian path in the graph [13, 19]. This is illustrated in

Figure 1.

While the k-subword deck is a natural intermediate goal towards the reconstruction of s(and

can be recovered with only poly(n) traces), it does not capture all the information present in the

traces. For example, the k-subword deck Sk(s) in Figure 1 also admits the reconstruction s0=

000111000111000000000111, even though s0should be easy to distinguish from sbased on traces

(by estimating the length of the second and third runs of zeros). Motivated by this shortcoming of

the k-subword deck, we propose the idea of a k-mer density map, as a kind of localized k-subword

deck where, in addition to knowing the number of times a given k-mer appears in s, we have some

information about where it occurs.

For a k-mer x∈ {0,1}k, let Is,x ∈ {0,1}n−k+1 be the indicator vector of the occurrences of xin s;

i.e., Is,x[j] = {sj:j+k−1=x}, as illustrated in Figure 2. Notice that recovering the k-subword deck

can be seen as recovering PjIs,x[j] for each x∈ {0,1}k. Also notice that recovering sis equivalent

to recovering Is,x for all x∈ {0,1}k. A k-mer density map can be obtained by computing

Ks,x[i] =

n−k+1

j=1

h(i, j)Is,x[j] (2)

for some “smoothing kernel” h(i, j), as illustrated in Figure 2. Intuitively, for a given x,Ks,x gives

a coarse indication of the occurrences of xin s. Moreover, if his such that Pih(i, j) = 1 for each

j, it holds that

Ks,x[i] = X

h(i, j)Is,x[j] = X

jIs,x[j]X

h(i, j) = X

jIs,x[j],

which means that the k-subword deck Sk(s) is a function of Ks,x, and the density map Ks,x can be

thought of as a generalization of the k-subword deck that provides information about k-mer location.

Fig. 2: For each x∈ {0,1}k,Is,x indicates the occurrences of xin s. The density map Ks,x can be

obtained via Ks,x[i] = Pn−k+1

j=1 h(i, j)Is,x[j].

We will focus on a speciﬁc choice of h(i, j) that will render Ks,x easier to estimate from the traces.

We will let h(i, j) be the probability that a binomial random variable with j−1 trials and probability

parameter 1 −pis equal to i−1; i.e., h(i, j) = j−1

i−1(1 −p)i−1pj−i. This is also the probability that

the jth bit of s(if not deleted) ends up as the ith bit of a trace. Hence we have

Ks,x[i] =

n−k+1

j=1 j−1

i−1!(1 −p)i−1pj−iIs,x[j].(3)

for i∈ {1, . . . , n −k+ 1}.Notice that the maximum value of h(i, j) for a ﬁxed joccurs when

i≈j(1 −p) so the kernel h(·, j) has its peak shifted to the left and Ks,x is a density map of

occurrences of xin sshifted to the left. Operationally, (1 −p)kKs,x[i] is the probability that a fully

preserved copy of xin sappears in position ion a given trace of s.

We deﬁne the k-mer density map of sas Ks= [Ks,x :x∈ {0,1}k] (the concatenation of all vectors

Ks,x). If the k-mer density map Ksis known exactly,scan also be recovered exactly. This can be

seen by noticing the invertibility of the upper-triangular matrix Fthat transforms the binary vector

Is,x into the vector Ks,x (for a ﬁxed x). The matrix Fis upper triangular with non-zero entries on

the main diagonal, which makes it invertible. While invertible, Fis ill-conditioned since some of the

entries on the diagonal are close to 0, making the transformation from Ks,x to Is,x sensitive to noise

in Ks,x.

We present an algorithm that, given Ttraces, constructs an estimate ˆ

Ksfor the k-mer density

map. Our main result establishes that we can achieve estimation error

kˆ

Ks−Ksk∞= max

x,i |ˆ

Ks,x[i]−Ks,x[i]|< 

using T=−2·poly(n) traces. Hence, the density map Kscan be estimated with maximum error

n= 1/g(n) for g(n)∈poly(n) using polynomially many traces. In particular, given a set of candidate

source strings A⊂{0,1}nsuch that, for any s, s0∈ A,

kKs−Ks0k∞≥2,

the true source sequence s∈ A can be recovered with −2·poly(n) traces. This adds to the existing

literature on classes of strings recoverable/distinguishable with polynomially many traces [20–22].

Since Is,x and Ks,x are related through an invertible (albeit ill-conditioned) linear transformation

Ks,x =FIs,x, the approximate recovery of the k-mer density map ˆ

Ks,x suggests natural reconstruc-

tion algorithms for Is,x, e.g., based on a regularized least squares problem

min

Is,x kˆ

Ks,x −Fˆ

Is,xk2

2+δkˆ

Is,xk2

which is a convex program if ˆ

Is,x is allowed to be real-valued. The solution ˆ

Is,x can then be converted

into a reconstructed string ˆs∈ {0,1}nthrough a majority voting across candidate k-mers for each

position. Hence, in contrast to much of the theoretical literature on the trace reconstruction problem,

the k-mer density map leads to new reconstruction approaches.

Our main result relies on a nontrivial estimator for Ks,x that simultaneously uses count information

for all binary strings ythat are supersequences of x. The estimator is obtained by ﬁrst deriving

a recursive formula for Ks,x, then applying a known result in the combinatorics of strings on the

expansion of the recursive formula to obtain a non-recursive formula. An application of McDiarmid’s

inequality is then used to prove the estimator is successful with high probability. To the best of

our knowledge, these techniques have not appeared in the trace reconstruction literature, where

most recent results have been based on complex analysis [3–6, 23]. Our techniques also lead to

an improvement on a previously known upper bound [10] on the number of traces needed for

reconstructing the k-subword deck of sfor p < 0.5.

A. Related work

The current best upper bounds for worst-case trace reconstruction [3, 5, 23] are obtained using

algorithms that consider each pair of possible source strings y, z ∈ {0,1}n, and decide whether the

set of traces looks more like a set of traces from y, or more like a set of traces from z(we formalize

this shortly). Then if there are enough traces, with high probability the true source string swill

be the unique string such that the set of traces looks more like it came from sthan from any

other string in {0,1}n. Therefore, the trace reconstruction problem is closely related to the trace-

distinguishing problem [21, 22], where we want to decide from the set of traces whether the source

string is either yor zwith high probability. The best existing upper bound of exp(O(n1/5)) for

worst-case trace reconstruction [5] is proved using the fact that any two strings can be distinguished

using exp(O(n1/5)) traces. It has also been shown that string pairs at constant Hamming distance

can be distinguished using poly(n) traces by McGregor, Price and Vorotnikova [20], and separately

by Grigorescu, Sudan, and Zhu using diﬀerent techniques [21]. It was recently shown that strings at

constant Levenshtein distance can be distinguished using poly(n) traces by Sima and Bruck [22].

To distinguish between two strings yand zfrom a set of traces, current state of the art algorithms

identify a function fy,z such that |E[fy,z(˜

Y)]−E[fy,z(˜

Z)]|is suﬃciently large where ˜

Ydenotes a trace

of y. Given Ttraces ˜

S1,..., ˜

STof a source string s,E[fy,z(˜

S)] can be estimated as 1

TPT

i=1 fy,z(˜

Si)

and we say that ybeats zif 1

TPT

i=1 fy,z(˜

Si) is closer to E[fy,z(˜

Y)] than to E[fy,z(˜

Z)]. Observe that

if fy,z is such that |E[fy,z (˜

Y)] −E[fy,z(˜

Z)]|is large enough, then assuming the source string is yor

z, we can distinguish between the two cases given a reasonable number of traces. If there is a unique

string usuch that ubeats all other strings, then uis output as the reconstruction.

The ﬁrst such function fy,z introduced by Nazarov and Peres [23], and independently by De,

O’Donnell, and Servedio [3], is simply the value of a single bit in the trace, i.e., fy,z(˜

S) = ˜

S[i] for

source string sand some index ithat depends on yand z([9] also uses a single bit approach). An

argument based on an existing result on complex-valued polynomials was used to show that for any

pair of strings y, z, there is some index iy,z such that |E[˜

Y[iy,z]]−E[˜

Z[iy,z]]| ≥ exp (−O(n1/3)). Using

this result along with a standard concentration inequality, the upper bound of exp (O(n1/3)) traces

is obtained. This choice of fy,z is known in the literature as a single bit statistic. The current best

upper bound by Chase [5] picks a more complicated function fy,z that involves multiple bits, and

proves a novel result on complex polynomials in order to prove a gap of |E[fy,z(˜

Y)] −E[fy,z(˜

Z)]| ≥

exp (−˜

O(n1/5)), which yields an upper bound of exp ( ˜

O(n1/5)) traces. A choice of fy,z that uses

multiple bits in combination (e.g. that of Chase [5]) is known in the literature as a multi-bit statistic.

There are similarities between the approach used in [5] and the approach used in this paper, and

it may be possible to obtain results similar to those in this paper based on the complex analysis

techniques explored in [3, 5, 23].

Obtaining lower bounds for the number of traces needed in the trace reconstruction problem

amounts to proving a lower bound on the number of traces required to distinguish two strings (trace-

distinguishing problem) since any algorithm to solve the trace reconstruction problem can be used

to solve the trace-distinguishing problem. For a string a, let aidenote the string where ais repeated

itimes. The current best lower bound of ˜

Ω(n3/2) for trace reconstruction discovered by Chase [4] is

obtained by analyzing the string pair (01)m1(01)m+1 and (01)m+11(01)mwhere n= 2m+ 3.

The trace reconstruction problem has also been considered in the smoothed complexity model

by Chen, De, Lee, Servedio, and Sinha [10], in which a worst-case string is passed through a

binary symmetric channel before trace generation, and the noise-corrupted string needs to be

reconstructed with high probability (recall that a binary symmetric channel ﬂips each bit in the string

independently with some ﬁxed probability). Chen et al. proved that in the smoothed complexity

model, trace reconstruction requires poly(n) traces. This result relies on the simple fact that if there

are no repeated substrings of length k−1 or greater in the source string s, then the k-subword deck

uniquely determines s. The authors prove that for an arbitrary string, the (log n)-subword deck can

be reconstructed with high probability using poly(n) traces, and prove there will not be any repeats

in the source string of length at least (log n−1) with high probability after it is passed through a

binary symmetric channel, thereby proving that poly(n) traces suﬃce for trace reconstruction in the

smoothed complexity model.

The bulk of the work done in [10] is proving that the (log n)-subword deck can be reconstructed

with high probability using poly(n) traces for any p < 1. In order to prove this, a formula for the

number of times a substring xis present in the source string sis derived in terms of the expected

number of times a trace contains x, and the expected number of times a trace contains supersequences

of x. The expected values of these statistics are then estimated from the set of traces in order to

estimate the number of times xappears in s. Concentration inequalities are used to prove that the

number of times xappears is estimated correctly with high probability. Narayanan and Ren [24]

provide a similar result on reconstructing the (100 log n)-subword deck for circular strings in poly(n)

time. Our result showing that the (log n)-subword deck can be computed using poly(n) traces with

high probability for p < 0.5 was originally proved independently, without knowledge of [10] and [24].

B. Notation

Strings in this paper are binary and indexed starting from 1. If the index iis negative, x[i] is the

(−i)th element starting from the right end of x. For example, if s= 1001, then s[1] = 1, s[2] = 0,

s[−1] = 1, and s[−2] = 0. Let s∈ {0,1}nbe the length-nstring we are trying to recover. The string

swill be called the source string. A trace of sis denoted by ˜

S, and is generated by deleting each bit

of sindependently with probability p.

For a given string x, we let |x|denote the length of x. For a string aand integer r,ardenotes

the string formed by concatenating rcopies of a. A subsequence of xis a string that can be formed

by deleting elements from x, and a supersequence of xis a string that can be formed by inserting

elements into x. This is in contrast to a substring of x, which is a string that appears in x. We let

x[i, j]=(x[i], x[i+ 1], . . . , x[j]) be the substring of xthe begins at position iand ends at position

j. For example, if s= 10101, then s[1 : 3] = 101 and s[2 : −2] = 010. For two stings xand y,

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

1SubstringDensityEstimationfromTracesKayvonMazoojiandIlanShomoronyDepartmentofElectricalandComputerEngineeringUniversityofIllinois,Urbana-Champaignmazooji2@illinois.edu,ilans@illinois.eduAbstractInthetracereconstructionproblem,oneseekstoreconstructabinarystringsfromacollectionoftraces,eachofwhichiso...

展开>> 收起<<

1 Substring Density Estimation from Traces Kayvon Mazooji and Ilan Shomorony.pdf

共22页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

1 Substring Density Estimation from Traces Kayvon Mazooji and Ilan Shomorony

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: