
4
Since Is,x and Ks,x are related through an invertible (albeit ill-conditioned) linear transformation
Ks,x =FIs,x, the approximate recovery of the k-mer density map ˆ
Ks,x suggests natural reconstruc-
tion algorithms for Is,x, e.g., based on a regularized least squares problem
min
ˆ
Is,x kˆ
Ks,x −Fˆ
Is,xk2
2+δkˆ
Is,xk2
2,
which is a convex program if ˆ
Is,x is allowed to be real-valued. The solution ˆ
Is,x can then be converted
into a reconstructed string ˆs∈ {0,1}nthrough a majority voting across candidate k-mers for each
position. Hence, in contrast to much of the theoretical literature on the trace reconstruction problem,
the k-mer density map leads to new reconstruction approaches.
Our main result relies on a nontrivial estimator for Ks,x that simultaneously uses count information
for all binary strings ythat are supersequences of x. The estimator is obtained by first deriving
a recursive formula for Ks,x, then applying a known result in the combinatorics of strings on the
expansion of the recursive formula to obtain a non-recursive formula. An application of McDiarmid’s
inequality is then used to prove the estimator is successful with high probability. To the best of
our knowledge, these techniques have not appeared in the trace reconstruction literature, where
most recent results have been based on complex analysis [3–6, 23]. Our techniques also lead to
an improvement on a previously known upper bound [10] on the number of traces needed for
reconstructing the k-subword deck of sfor p < 0.5.
A. Related work
The current best upper bounds for worst-case trace reconstruction [3, 5, 23] are obtained using
algorithms that consider each pair of possible source strings y, z ∈ {0,1}n, and decide whether the
set of traces looks more like a set of traces from y, or more like a set of traces from z(we formalize
this shortly). Then if there are enough traces, with high probability the true source string swill
be the unique string such that the set of traces looks more like it came from sthan from any
other string in {0,1}n. Therefore, the trace reconstruction problem is closely related to the trace-
distinguishing problem [21, 22], where we want to decide from the set of traces whether the source
string is either yor zwith high probability. The best existing upper bound of exp(O(n1/5)) for
worst-case trace reconstruction [5] is proved using the fact that any two strings can be distinguished
using exp(O(n1/5)) traces. It has also been shown that string pairs at constant Hamming distance
can be distinguished using poly(n) traces by McGregor, Price and Vorotnikova [20], and separately
by Grigorescu, Sudan, and Zhu using different techniques [21]. It was recently shown that strings at
constant Levenshtein distance can be distinguished using poly(n) traces by Sima and Bruck [22].
To distinguish between two strings yand zfrom a set of traces, current state of the art algorithms
identify a function fy,z such that |E[fy,z(˜
Y)]−E[fy,z(˜
Z)]|is sufficiently large where ˜
Ydenotes a trace
of y. Given Ttraces ˜
S1,..., ˜
STof a source string s,E[fy,z(˜
S)] can be estimated as 1
TPT
i=1 fy,z(˜
Si)
and we say that ybeats zif 1
TPT
i=1 fy,z(˜
Si) is closer to E[fy,z(˜
Y)] than to E[fy,z(˜
Z)]. Observe that
if fy,z is such that |E[fy,z (˜
Y)] −E[fy,z(˜
Z)]|is large enough, then assuming the source string is yor
z, we can distinguish between the two cases given a reasonable number of traces. If there is a unique
string usuch that ubeats all other strings, then uis output as the reconstruction.
The first such function fy,z introduced by Nazarov and Peres [23], and independently by De,
O’Donnell, and Servedio [3], is simply the value of a single bit in the trace, i.e., fy,z(˜
S) = ˜
S[i] for
source string sand some index ithat depends on yand z([9] also uses a single bit approach). An
argument based on an existing result on complex-valued polynomials was used to show that for any
pair of strings y, z, there is some index iy,z such that |E[˜
Y[iy,z]]−E[˜
Z[iy,z]]| ≥ exp (−O(n1/3)). Using
this result along with a standard concentration inequality, the upper bound of exp (O(n1/3)) traces
is obtained. This choice of fy,z is known in the literature as a single bit statistic. The current best