2
NOWADAYS, the information explosion leads to the generation of massive data, that brings great challenges to traditional
storage systems, such as mobile hard disks, USB flash memory, and integrated circuits. When utilizing these storage
mediums, several problems arise inevitably, including insufficient storage duration, high energy consumption, and environmental
pollution [1]. Meanwhile, Deoxyribonucleic Acid (DNA) molecule emerges as a promising storage medium, owing to its
theoretically high storage density and long storage term, which fits the request of storing huge amounts of data [2], [3]. The
workflow of DNA storage is summarized in Figure 1.
Generally, the DNA storage consists of firstly encoding binary stream to the alphabet {A, T, C, G}strings, chemically
synthesizing short DNA oligos, namely references, and then storing the synthesized DNA strands in vitro or in vivo. To read
the information via next-generation sequencing, the references should be retrieved from a large, unordered collection of error-
prone reads. This is because both synthesis and sequencing in DNA storage inevitably introduce insertion-deletion-substitution
(IDS) errors to the DNA strands, with the error probability being 1%-2% in the mainstream next-generation sequencing and up
to 10% for Nanopore sequencers [4]. During sequencing, each single reference outputs an uncertain number of noisy copies,
and the reads corresponding to different references are gathered without ordering [2], [5]. Clustering is usually applied on
the sequencing file, such that the noisy reads originated from the same reference are grouped into clusters [6]. After that, the
multi-read reconstruction, which is the topic of this paper, is performed to infer the the original reference from a cluster of
noisy reads [7].
During the past ten years, a lot of research has been devoted to the sequence reconstruction problem in DNA storage.
Roughly, they are divided into three categories: the consensus methods of Bitwise Majority Alignment (BMA) [7]–[10], the
statistical inference methods [11]–[13], and the recent deep learning ones [14]–[16]. The BMA and its variations are elaborated
for IDS channels and applied to DNA storage systems in [7]–[9]. They perform position-to-position alignment among multiple
reads and implement a majority voting strategy. The BMA-based methods are effective, especially for datasets with low IDS
error rates.
The second category is based on statistical inference, where at each position of the sequence, the maximum a posterior
(MAP) probabilities of all the possible input symbols are estimated and compared [11]–[13]. In [11], marker codes are inserted
into LDPC codes at fixed intervals for error correction, and the decoder is based on a forward and backward (FB) algorithm.
In [12], a drift vector is introduced to model the insertion/deletion errors in each received word, and a factor graph is derived
for joint probability estimation. Concatenated codes are considered in [13], whose inner codes and channels are modeled as
joint Hidden Markov Models (HMM) and the BCJR inference is derived. The so-called Trellis BMA marries BMA with BCJR
decoding and achieves a linear complexity in the number of traces [17]. However, due to the computational overhead, the