

Robust Multi-Read Reconstruction from Contaminated Clusters Using Deep Neural Network for DNA Storage

Yun Qin, Fei Zhu, Member, IEEE, and Bo Xi

Abstract

DNA has immense potential as an emerging data storage medium. The principle of DNA storage is the conversion and flow of digital information between binary code streams, quaternary bases, and actual DNA fragments. This process inevitably introduces errors, posing challenges to accurate data recovery. Sequence reconstruction consists of inferring the DNA reference from a cluster of erroneous copies. A common assumption in existing methods is that all the strands within a cluster are noisy copies originating from the same reference, and thus contribute equally to the reconstruction. However, this assumption does not always hold, given the existence of contaminated sequences caused, for example, by DNA fragmentation and rearrangement during the DNA storage process. This paper proposes a robust multi-read reconstruction model using a deep neural network (DNN), which is resilient to contaminated clusters with outlier sequences as well as to noisy reads with insertion-deletion-substitution (IDS) errors. The effectiveness and robustness of the method are validated on three next-generation sequencing datasets, where a series of comparative experiments is performed by simulating the varying contamination levels that occur during the process of DNA storage.

Index Terms

DNA storage, sequence reconstruction, robust method, attention, deep neural network.

I. INTRODUCTION

This work was supported by the National Key Research and Development Program of China (No. 2020YFA0712100).

The authors are with the Center for Applied Mathematics, Tianjin University, China (fei.zhu@tju.edu.cn).

arXiv:2210.11106v1 [cs.IR] 20 Oct 2022

Nowadays, the information explosion generates massive amounts of data, which poses great challenges to traditional storage systems such as mobile hard disks, USB flash memory, and integrated circuits. Several problems inevitably arise with these storage media, including insufficient storage duration, high energy consumption, and environmental pollution [1]. Meanwhile, the deoxyribonucleic acid (DNA) molecule has emerged as a promising storage medium owing to its theoretically high storage density and long storage lifetime, which meet the demand for storing huge amounts of data [2], [3]. The workflow of DNA storage is summarized in Figure 1.

Generally, DNA storage consists of first encoding the binary stream into strings over the alphabet {A, T, C, G}, chemically synthesizing short DNA oligos, namely the references, and then storing the synthesized DNA strands in vitro or in vivo. To read the information via next-generation sequencing, the references must be retrieved from a large, unordered collection of error-prone reads. This is because both synthesis and sequencing in DNA storage inevitably introduce insertion-deletion-substitution (IDS) errors into the DNA strands, with an error probability of 1%-2% in mainstream next-generation sequencing and up to 10% for Nanopore sequencers [4]. During sequencing, each reference yields an uncertain number of noisy copies, and the reads corresponding to different references are gathered without ordering [2], [5]. Clustering is usually applied to the sequencing file, such that the noisy reads originating from the same reference are grouped into a cluster [6]. After that, multi-read reconstruction, which is the topic of this paper, is performed to infer the original reference from a cluster of noisy reads [7].
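The encoding and IDS channel described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the two-bit mapping (00→A, 01→C, 10→G, 11→T), the function names `bits_to_dna`, `ids_channel`, and `make_cluster`, and the default error rates are all hypothetical choices made for illustration, not the encoding or channel model used in the paper.

```python
import random

ALPHABET = "ACGT"
# Assumed 2-bit mapping, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def bits_to_dna(bits):
    """Map a binary string (even length) to a string over {A, C, G, T}."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def ids_channel(reference, p_ins, p_del, p_sub, rng):
    """Pass a reference through a simple i.i.d. IDS channel and return one noisy read."""
    out = []
    for base in reference:
        # Insertion: a random base is emitted before the current one.
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            continue  # Deletion: the current base is dropped.
        if r < p_del + p_sub:
            # Substitution: the current base is replaced by a different one.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_cluster(reference, n_reads, rng, p_ins=0.01, p_del=0.01, p_sub=0.01):
    """Generate a cluster of noisy copies of the same reference."""
    return [ids_channel(reference, p_ins, p_del, p_sub, rng) for _ in range(n_reads)]


if __name__ == "__main__":
    rng = random.Random(0)
    reference = bits_to_dna("0001101100011011")
    for read in make_cluster(reference, 5, rng, p_ins=0.05, p_del=0.05, p_sub=0.05):
        print(read)
```

Because insertions and deletions change the read length, reads in the same cluster are generally not aligned position by position, which is what makes reconstruction harder than simple voting.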

During the past ten years, much research has been devoted to the sequence reconstruction problem in DNA storage. Existing methods fall roughly into three categories: consensus methods based on Bitwise Majority Alignment (BMA) [7]–[10], statistical inference methods [11]–[13], and recent deep learning methods [14]–[16]. BMA and its variants are designed for IDS channels and applied to DNA storage systems in [7]–[9]. They perform position-to-position alignment among multiple reads and apply a majority voting strategy. The BMA-based methods are effective, especially for datasets with low IDS error rates.
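To make the alignment-and-voting idea concrete, the following is a minimal sketch of the basic pointer-based majority alignment for a deletion-only channel. It is a simplified illustration, not the IDS-capable variants of [7]–[9]: it assumes that reads contain deletions only and that the reference length is known, and the function name `bma_deletion_only` is hypothetical.

```python
from collections import Counter


def bma_deletion_only(reads, ref_len):
    """Reconstruct a reference from reads corrupted by deletions only.

    Each read keeps a pointer. At every output position, the symbols under
    the pointers are put to a majority vote. Reads that agree with the
    majority advance their pointer; reads that disagree are assumed to have
    lost the current symbol and stay where they are.
    """
    pointers = [0] * len(reads)
    consensus = []
    for _ in range(ref_len):
        votes = Counter(
            read[p] for read, p in zip(reads, pointers) if p < len(read)
        )
        if not votes:
            break  # Every read is exhausted.
        majority = votes.most_common(1)[0][0]
        consensus.append(majority)
        for j, read in enumerate(reads):
            p = pointers[j]
            if p < len(read) and read[p] == majority:
                pointers[j] += 1
    return "".join(consensus)


if __name__ == "__main__":
    # Reference "ACGTAC": one clean copy and two copies with one deletion each.
    print(bma_deletion_only(["ACGTAC", "AGTAC", "ACGAC"], ref_len=6))
```

Every read receives one vote of equal weight at each position, so a read that does not originate from the same reference can distort the vote. This is the equal-contribution assumption discussed later in this section.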

The second category is based on statistical inference, where at each position of the sequence, the maximum a posteriori (MAP) probabilities of all possible input symbols are estimated and compared [11]–[13]. In [11], marker codes are inserted into LDPC codes at fixed intervals for error correction, and the decoder is based on a forward-backward (FB) algorithm. In [12], a drift vector is introduced to model the insertion/deletion errors in each received word, and a factor graph is derived for joint probability estimation. Concatenated codes are considered in [13], where the inner codes and channels are modeled as joint hidden Markov models (HMMs) and BCJR inference is derived. The so-called Trellis BMA combines BMA with BCJR decoding and achieves linear complexity in the number of traces [17]. However, due to the computational overhead, the number of reads per cluster can hardly exceed ten when these methods are applied in practical DNA storage systems.

With the emergence of deep learning, a few recent works have attempted to exploit deep neural networks (DNNs) to address multi-read reconstruction [14], as well as single-read reconstruction [15], in DNA storage systems. Similar in spirit to this work, the main idea is to train a DNN model with good error correction capability that maps a cluster of noisy reads to the corresponding DNA reference. As this work also focuses on multi-read reconstruction using DNNs, the relevant works [14]–[16] are reviewed in Section II.

In practice, the stability and robustness of current DNA storage systems are threatened by contaminated sequences that occur at different stages of DNA storage. Unlike a noisy read, which differs from its reference by only a few IDS errors, a contaminated sequence refers to a strand with a much larger edit distance from the original DNA reference. Several factors contribute to the occurrence of contaminated sequences. In long-term storage and under certain conditions, DNA strands are susceptible to degradation, which results in strand breaks and loss [18]. Nonspecific amplification inevitably causes frequent DNA breaks and rearrangements, where oligos are fragmented and rejoined into new ones [19], as shown in Figure 2. Contaminated sequences also include the complementary strands of the references produced during sequencing [20]. For security purposes, contaminated sequences are also intentionally added for data encryption in [21]–[23]. Clearly, the existence of contaminated sequences makes the already challenging reconstruction problem more difficult [18], [19].
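The distinction between a noisy read and a contaminated sequence rests on edit distance, which can be sketched as follows. The code uses the standard Levenshtein dynamic program. The helper names and the `max_error_rate` threshold of 0.3 are hypothetical values chosen for illustration; the paper does not define contamination by a fixed threshold, and in practice the reference is unknown at reconstruction time.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def reverse_complement(strand):
    """Return the reverse complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def is_contaminated(read, reference, max_error_rate=0.3):
    """Flag a read whose normalized edit distance exceeds an assumed threshold."""
    return levenshtein(read, reference) / len(reference) > max_error_rate


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    for read in ["ACGTCGTAC", "TTTTTTTTTT", reverse_complement("AACGGATTCA")]:
        print(read, levenshtein(read, reference), is_contaminated(read, reference))
```

A check of this kind is only possible in simulation, where the reference is available. A reconstruction method has to judge the reliability of each read from the cluster alone.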

In all the aforementioned methods, every strand within a cluster contributes equally to the reconstruction of the reference strand, which is valid only when the cluster under reconstruction consists solely of noisy copies originating from the same reference. However, such a prerequisite of perfect clustering is not always achievable given the properties of current DNA storage systems. When the sequencing file contains a portion of contaminated sequences, clustering algorithms fail to generate clusters in accordance with the latent DNA references. Moreover, as sequencing is biased towards strands with specific properties, existing perfect clustering methods (e.g., [6], [24], [25]) risk losing rarely sequenced references [14], [18], that is, assigning their reads to the wrong clusters. To the best of our knowledge, no existing method differentiates sequence quality and reliability within a cluster in the context of sequence reconstruction for DNA storage.

This paper proposes a robust multi-read reconstruction method based on a DNN. Taking advantage of the attention mechanism and the conformer block, the proposed model is resilient to contaminated clusters with outlier sequences, as well as to noisy reads with IDS errors. The main contributions are as follows:

• Integration of sequence quality into multi-read reconstruction. To date, this is the first multi-read reconstruction method that takes into account sequence reliability within the cluster. After being scored according to sequence quality by the attention module, strands contribute to the reconstruction to varying degrees. Thus, the effect of various kinds of contaminated sequences can be suppressed automatically.

• Error correction capability for IDS errors within the cluster. The proposed model corrects IDS errors within the cluster. The Conformer-Encoder has strong feature extraction ability: the local features extracted by the convolutional layers and the global features extracted by the attention module are effectively integrated. The resulting features are high-level and representative, so that the underlying reference of the noisy cluster can be well recovered by a single-layer long short-term memory (LSTM) decoder.

• Sequence reconstruction model accommodating varying cluster sizes. The network is trained directly on clusters of different sizes, rather than summing up the reads within a cluster to form a structured input format [14]. It is therefore compatible with input clusters of varying sizes at the testing stage.

• Small network with fewer parameters. The proposed neural network has a small structure (≈2.5M parameters) with good generalization ability. This helps mitigate the overfitting caused by the shortage of training data when using a DNN to address the sequence reconstruction problem in DNA storage.
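The idea that strands contribute to the reconstruction to varying degrees can be illustrated with a hand-crafted stand-in for the learned scoring. The sketch below is not the paper's attention module or Conformer-Encoder. It assumes equal-length reads with substitution errors only, scores each read by its mean Hamming similarity to the other reads, turns the scores into weights with a softmax, and takes a weighted vote per position. The function names and the `temperature` value are hypothetical.

```python
import math
from collections import defaultdict


def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length reads agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def read_weights(reads, temperature=0.1):
    """Softmax weights from each read's mean similarity to the other reads."""
    scores = []
    for i, read in enumerate(reads):
        sims = [hamming_similarity(read, other)
                for j, other in enumerate(reads) if j != i]
        scores.append(sum(sims) / len(sims))
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_consensus(reads, weights):
    """Position-wise vote in which each read counts with its own weight."""
    consensus = []
    for pos in range(len(reads[0])):
        tally = defaultdict(float)
        for read, w in zip(reads, weights):
            tally[read[pos]] += w
        consensus.append(max(tally, key=tally.get))
    return "".join(consensus)


if __name__ == "__main__":
    # Three noisy copies of "ACGTACGTAC" and one unrelated outlier.
    cluster = ["ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC", "TTGGCCAATT"]
    weights = read_weights(cluster)
    print([round(w, 3) for w in weights])
    print(weighted_consensus(cluster, weights))
```

A similarity heuristic of this kind breaks down when several outliers resemble each other, since they then reinforce one another's scores. It also accepts clusters of any size, because the weights are computed per read rather than from a fixed-size input.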

Fig. 1: Overview of the DNA storage system. The workflow consists of five stages: encoding, synthesis, storage, sequencing, and decoding.

Fig. 2: Illustration of strand breaks and rearrangements in DNA data storage.

The rest of this paper is organized as follows. The related work is reviewed in Section II. In Section III, we present the proposed multi-read reconstruction model. Experimental results and analysis are given in Section IV. Finally, Section V concludes the paper.

II. RELATED WORK

We briefly review several deep learning-based sequence reconstruction methods in DNA storage. The work most relevant to this paper is the so-called DNAformer, a scalable and robust solution for DNA sequence reconstruction recently
