Robust Multi-Read Reconstruction from
Contaminated Clusters Using Deep Neural Network
for DNA Storage
Yun Qin, Fei Zhu, Member, IEEE, and Bo Xi
Abstract
DNA has immense potential as an emerging data storage medium. The principle of DNA storage is the conversion and flow
of digital information between binary code streams, quaternary bases, and actual DNA fragments. This process inevitably
introduces errors, posing challenges to accurate data recovery. Sequence reconstruction consists of inferring the DNA reference
from a cluster of erroneous copies. A common assumption in existing methods is that all the strands within a cluster are noisy
copies originating from the same reference, thereby contributing equally to the reconstruction. However, this assumption does
not always hold, given the existence of contaminated sequences caused, for example, by DNA fragmentation and rearrangement
during the DNA storage process. This paper proposes a robust multi-read reconstruction model using a deep neural network
(DNN), which is resilient to contaminated clusters with outlier sequences, as well as to noisy reads with
insertion-deletion-substitution (IDS) errors. The effectiveness and robustness of the method are validated on three
next-generation sequencing datasets, where a series of comparative experiments are performed by simulating the varying
contamination levels that occur during the process of DNA storage.
Index Terms
DNA storage, sequence reconstruction, robust method, attention, deep neural network.
I. INTRODUCTION
This work was supported by the National Key Research and Development Program of China (No. 2020YFA0712100).
The authors are with the Center for Applied Mathematics, Tianjin University, China (fei.zhu@tju.edu.cn).
arXiv:2210.11106v1 [cs.IR] 20 Oct 2022
NOWADAYS, the information explosion generates massive amounts of data, which poses great challenges to traditional
storage systems, such as mobile hard disks, USB flash memory, and integrated circuits. These storage media inevitably
suffer from several problems, including insufficient storage duration, high energy consumption, and environmental
pollution [1]. Meanwhile, the deoxyribonucleic acid (DNA) molecule has emerged as a promising storage medium, owing to its
theoretically high storage density and long storage lifetime, which meets the demand for storing huge amounts of data [2], [3].
The workflow of DNA storage is summarized in Figure 1.
Generally, DNA storage consists of first encoding the binary stream into strings over the alphabet {A, T, C, G}, chemically
synthesizing short DNA oligos, namely references, and then storing the synthesized DNA strands in vitro or in vivo. To read
the information via next-generation sequencing, the references must be retrieved from a large, unordered collection of
error-prone reads. This is because both synthesis and sequencing in DNA storage inevitably introduce
insertion-deletion-substitution (IDS) errors into the DNA strands, with an error probability of 1%-2% for mainstream
next-generation sequencing and up to 10% for Nanopore sequencers [4]. During sequencing, each reference yields an uncertain
number of noisy copies, and the reads corresponding to different references are gathered without ordering [2], [5]. Clustering
is usually applied to the sequencing file, so that the noisy reads originating from the same reference are grouped together [6].
After that, multi-read reconstruction, which is the topic of this paper, is performed to infer the original reference from a
cluster of noisy reads [7].
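The pipeline described above (binary stream to bases, then noisy reads of each reference) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the 2-bit mapping, the function names `bits_to_bases`, `ids_channel`, and `make_cluster`, and the independent per-position error model are hypothetical choices made here for illustration, and are not taken from the paper or from any particular DNA storage codec.

```python
import random

# Assumed alphabet order for the illustrative 2-bit mapping (00->A, 01->C, 10->G, 11->T).
BASES = "ACGT"

def bits_to_bases(bits: str) -> str:
    """Map each pair of bits to one base. Real codecs add constraints and error-correcting codes."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))

def ids_channel(ref: str, p_ins: float, p_del: float, p_sub: float,
                rng: random.Random) -> str:
    """Return one noisy read of `ref` under independent per-position IDS errors (assumed model)."""
    out = []
    for base in ref:
        if rng.random() < p_ins:
            # Insertion: a random base appears before the current one.
            out.append(rng.choice(BASES))
        r = rng.random()
        if r < p_del:
            # Deletion: the current base is dropped.
            continue
        if r < p_del + p_sub:
            # Substitution: the current base is replaced by a different one.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def make_cluster(ref: str, n_reads: int, rng: random.Random, p: float = 0.01) -> list:
    """Generate a cluster of noisy reads that all originate from the same reference."""
    return [ids_channel(ref, p, p, p, rng) for _ in range(n_reads)]

if __name__ == "__main__":
    rng = random.Random(0)
    reference = bits_to_bases("0001101100011011")
    for read in make_cluster(reference, 5, rng, p=0.05):
        print(read)
```

A cluster produced this way satisfies the assumption that every read derives from the same reference, which is exactly the assumption that contaminated sequences violate.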
Over the past ten years, considerable research has been devoted to the sequence reconstruction problem in DNA storage.
The methods fall roughly into three categories: consensus methods based on Bitwise Majority Alignment (BMA) [7]–[10],
statistical inference methods [11]–[13], and recent deep learning methods [14]–[16]. BMA and its variations were developed
for IDS channels and applied to DNA storage systems in [7]–[9]. They perform position-by-position alignment among multiple
reads and apply a majority voting strategy. The BMA-based methods are effective, especially for datasets with low IDS
error rates.
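The alignment-and-vote idea can be sketched as follows. This is a simplified illustration and not the exact algorithm of [7]–[10]: the function name `bma_sketch` and the one-symbol lookahead heuristic used to guess whether a disagreeing read suffered an insertion, a deletion, or a substitution are assumptions made here to keep the example short.

```python
from collections import Counter

def bma_sketch(reads, length):
    """Simplified majority-alignment consensus over a cluster of reads.

    Each read keeps its own pointer. At every output position the majority
    symbol under the pointers is emitted, and each disagreeing read is
    realigned with a one-symbol lookahead heuristic (an assumption here).
    """
    ptrs = [0] * len(reads)
    out = []
    for _ in range(length):
        current = [r[p] for r, p in zip(reads, ptrs) if p < len(r)]
        if not current:
            break
        maj = Counter(current).most_common(1)[0][0]
        # Next symbol as predicted by the reads that agree with the majority.
        nxt = Counter(r[p + 1] for r, p in zip(reads, ptrs)
                      if p + 1 < len(r) and r[p] == maj)
        nxt_maj = nxt.most_common(1)[0][0] if nxt else None
        for i, (r, p) in enumerate(zip(reads, ptrs)):
            if p >= len(r):
                continue
            if r[p] == maj:
                ptrs[i] = p + 1   # read agrees: advance normally
            elif p + 1 < len(r) and r[p + 1] == maj:
                ptrs[i] = p + 2   # guessed insertion: skip the extra symbol
            elif r[p] == nxt_maj:
                pass              # guessed deletion: keep the pointer in place
            else:
                ptrs[i] = p + 1   # guessed substitution: advance normally
        out.append(maj)
    return "".join(out)

if __name__ == "__main__":
    cluster = [
        "ACGTTGCA",   # clean copy
        "ACATTGCA",   # one substitution
        "ACGTGCA",    # one deletion
        "ACGATTGCA",  # one insertion
    ]
    print(bma_sketch(cluster, 8))
```

Every read carries the same weight in the vote, so a read that does not originate from the reference at all can distort the majority at every position.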
The second category is based on statistical inference, where at each position of the sequence, the maximum a posteriori
(MAP) probabilities of all the possible input symbols are estimated and compared [11]–[13]. In [11], marker codes are inserted
into LDPC codes at fixed intervals for error correction, and the decoder is based on a forward-backward (FB) algorithm.
In [12], a drift vector is introduced to model the insertion/deletion errors in each received word, and a factor graph is derived
for joint probability estimation. Concatenated codes are considered in [13], where the inner codes and channels are modeled as
joint Hidden Markov Models (HMM) and BCJR inference is derived. The so-called Trellis BMA marries BMA with BCJR
decoding and achieves a linear complexity in the number of traces [17]. However, due to the computational overhead, the
number of reads per cluster that these methods can feasibly handle rarely exceeds ten in practical DNA storage systems.
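For concreteness, the symbol-wise MAP rule used by this category can be written as follows. The notation is an assumption introduced here for illustration: x_t denotes the reference symbol at position t, and y^{(1)}, ..., y^{(N)} denote the N reads in the cluster.

```latex
\hat{x}_t \;=\; \arg\max_{a \in \{\mathrm{A},\mathrm{T},\mathrm{C},\mathrm{G}\}}
P\bigl(x_t = a \,\big|\, y^{(1)}, \dots, y^{(N)}\bigr)
```

The forward-backward and BCJR algorithms mentioned above are ways of computing these posteriors on a trellis or hidden Markov model rather than by enumerating all candidate references.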
With the emergence of deep learning, a few recent works have attempted to exploit deep neural networks (DNN) to address
multi-read reconstruction [14], as well as single-read reconstruction [15], in DNA storage systems. Similar in spirit to
this work, their main idea is to train a DNN model with good error correction capacity that maps a cluster of noisy reads
to the corresponding DNA reference. As this work also focuses on multi-read reconstruction using DNN, the relevant
works [14]–[16] are reviewed in Section II.
In practice, the stability and robustness of current DNA storage systems are threatened by contaminated sequences that occur
at different stages of DNA storage. Unlike a noisy read, which differs from its reference by only a few IDS errors, a
contaminated sequence here refers to a strand with a much larger edit distance from the original DNA reference. Several factors
contribute to the occurrence of contaminated sequences. In long-term storage and under certain conditions, DNA strands are
susceptible to degradation, which results in strand breaks and loss [18]. Nonspecific amplification inevitably causes frequent DNA
breaks and rearrangements, where oligos are fragmented and rejoined into new ones [19], as shown in Figure 2. Contaminated
sequences also include the complementary strands of the references produced during sequencing [20]. For security purposes,
contaminated sequences are also intentionally added for data encryption in [21]–[23]. Clearly, the existence of contaminated
sequences makes the already challenging reconstruction problem more difficult [18], [19].
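The distinction between a noisy read and a contaminated sequence can be made concrete with the edit (Levenshtein) distance. The sketch below is illustrative only: the helper names and the toy strands are assumptions made here, and the reverse complement is used as just one example of a contaminating strand.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

# Watson-Crick base pairing.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

if __name__ == "__main__":
    reference = "AAAACCCC"
    noisy_read = "AAACCCC"                        # a single deletion
    contaminated = reverse_complement(reference)  # "GGGGTTTT"
    print(edit_distance(reference, noisy_read))   # small distance
    print(edit_distance(reference, contaminated)) # large distance
```

In this toy case the noisy read is one edit away from the reference, while the complementary strand shares no base with it at all, so treating both as equally informative copies would mislead any voting scheme.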
In all the aforementioned methods, every strand within a cluster contributes equally to the reconstruction of the reference
strand, which is valid only when the cluster under reconstruction comprises solely noisy copies originating from the same
reference. However, this prerequisite of perfect clustering is not always achievable, given the properties of current DNA storage
systems. When the sequencing file contains a portion of contaminated sequences, clustering algorithms will fail to generate
clusters in accordance with the latent DNA references. Moreover, as sequencing is biased towards strands with specific
properties, existing perfect clustering methods (e.g., [6], [24], [25]) risk losing rarely sequenced references [14], [18],
that is, assigning their reads to the wrong clusters. To the best of our knowledge, there is no method that differentiates
sequence quality and reliability within the cluster in the context of sequence reconstruction for DNA storage.
This paper proposes a robust multi-read reconstruction method based on DNN. Taking advantage of the attention mechanism
and the conformer block, the proposed model is resilient to contaminated clusters with outlier sequences, as well as noisy
reads with IDS errors. The main contributions are as follows:
• Integration of sequence quality into multi-read reconstruction. To date, this is the first multi-read reconstruction method
that takes into account sequence reliability within the cluster. After being scored according to sequence quality by the
attention module, strands contribute to the reconstruction to varying degrees. Thus, the effect of various kinds of
contaminated sequences can be suppressed automatically.
• Error correction capacity for IDS errors within the cluster. The proposed model corrects IDS errors within the cluster.
The Conformer-Encoder has strong feature extraction ability, such that the local features extracted by the convolutional
layers and the global features extracted by the attention module are integrated. The resulting features are high-level and
representative, such that the underlying reference of the noisy cluster can be well recovered by a single-layer long
short-term memory (LSTM) decoder.
• Sequence reconstruction model accommodating varying cluster sizes. The network is trained directly on clusters
of different sizes, rather than summing up the reads within a cluster to form a structured input format [14]. It is
therefore compatible with input clusters of varying sizes at the testing stage.
• Small network with fewer parameters. The proposed neural network has a small structure (2.5M parameters) with
good generalization ability. This helps to mitigate the overfitting caused by the shortage of training data when using
DNN to address the sequence reconstruction problem in DNA storage.
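The first contribution, letting reads contribute to varying degrees, can be illustrated with a toy weighted vote. This is not the proposed model: the learned attention scores are replaced here by a hand-crafted similarity score, and the names `similarity`, `read_weights`, `weighted_consensus`, and the `temperature` parameter are all hypothetical. The sketch also assumes that reads are already positionally aligned, which does not hold in general under IDS errors.

```python
import math
from collections import defaultdict

def similarity(a: str, b: str) -> float:
    """Fraction of matching positions (a crude stand-in for a learned score)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def read_weights(reads, temperature=0.1):
    """Softmax over each read's mean similarity to the other reads in the cluster."""
    scores = []
    for i, r in enumerate(reads):
        others = [similarity(r, o) for j, o in enumerate(reads) if j != i]
        scores.append(sum(others) / len(others))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_consensus(reads, weights):
    """Position-wise vote in which each read counts with its own weight."""
    length = max(len(r) for r in reads)
    out = []
    for pos in range(length):
        votes = defaultdict(float)
        for r, w in zip(reads, weights):
            if pos < len(r):
                votes[r[pos]] += w
        out.append(max(votes, key=votes.get))
    return "".join(out)

if __name__ == "__main__":
    cluster = ["ACGTACGT", "ACGTACGT", "ACGTTCGT", "TGCATGCA"]  # last read is an outlier
    weights = read_weights(cluster)
    print([round(w, 3) for w in weights])
    print(weighted_consensus(cluster, weights))
```

The outlier receives a weight close to zero and therefore has almost no influence on the consensus, which is the qualitative behaviour that the attention module is described as learning from data.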
Fig. 1: Overview of the DNA storage system. The workflow consists of five stages: encoding, synthesis, storage, sequencing,
and decoding.
Fig. 2: Illustration of strand breaks and rearrangements in DNA data storage.
The rest of this paper is organized as follows. The related work is reviewed in Section II. In Section III, we present the
proposed multi-read reconstruction model. Experimental results and analysis are given in Section IV. Finally, Section V
concludes the paper.
II. RELATED WORK
We succinctly review several deep learning-based sequence reconstruction methods in DNA storage. The most relevant
literature to this paper is the so-called DNAformer, a scalable and robust solution for the DNA sequence reconstruction recently