NDN-TR70 - Utilizing NDN-DPDK for Kubernetes Genomics Data
Lake
Sankalpa Timilsina, Justin Presley, David Reddick, Susmit Shannigrahi,
Tennessee Tech,
Xusheng Ai, Coleman Mcknight, Alex Feltus
Clemson University
October 20, 2022
Abstract
As the volume of genomics samples rapidly expands due to increased access to high-resolution DNA sequencing technology, the need for a scalable platform to aggregate dispersed datasets and enable easy access to the vast wealth of available DNA sequences is paramount.
In this work, we introduce and demonstrate a novel way to use Named Data Networking
(NDN) in conjunction with a Kubernetes cluster to design a flexible and scalable genomics
Data Lake in the cloud. In addition, the use of the NDN Data Plane Development Kit
(DPDK) provides efficient and accessible distribution of the datasets to researchers anywhere. This report explains the need to deploy a Data Lake for genomics data, describes what is necessary to deploy it successfully, and gives detailed instructions for replicating the proposed design. Finally, it outlines options for future enhancements.
1 Introduction
A Data Lake is a large-scale repository for storing raw and intermediate files for data analytics. Why create a Data Lake for genomics data? Since scientists constructed the reference human DNA genome over two
decades ago, there has been an explosion of genomes sequenced from thousands of organisms across the
tree of life. Each genome sequence is the blueprint for that species and serves as a scaffold for organizing
new knowledge on the flow of information from DNA to a trait. While interesting from a general biology
perspective, the applications of merging the genome with other genomics datasets have practical implications,
including applications in medicine, agriculture, bioenergy, and other fields.
A barrier to using genomics datasets is that they are often difficult to find and access (download) from multiple public repositories and to merge with other public or local data so that they can be co-analyzed (interoperable). This makes the data un-FAIR (Findable, Accessible, Interoperable, Reusable; [2]).
Here we propose a FAIR Data Lake where genomics datasets from different sources are named and co-
published without the end-user needing to know how to transfer data to the Data Lake. Datasets are then
pulled from the data lake using an intuitive naming convention based on Named Data Networking (NDN;
[12]). An overview of genomics data naming can be found in [28].
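As a minimal illustration of such a naming convention (the component layout below is hypothetical and is not the scheme from [28]), a genomics dataset could be published under a hierarchical NDN name; the sketch uses the python-ndn library to construct and print one:

    # Minimal sketch of a hypothetical hierarchical genomics name, built with
    # the python-ndn library (pip install python-ndn). The component layout
    # (/genomics/<species>/<assay>/<sample-id>/<file>) is illustrative only.
    from ndn.encoding import Name

    # Hypothetical dataset name: top-level prefix, species, assay type, sample, file.
    name = Name.from_str('/genomics/homo_sapiens/rna-seq/SAMPLE-0001/reads.fastq')

    # python-ndn represents a name as a list of TLV-encoded components.
    print(Name.to_str(name))   # canonical URI form of the name
    print(len(name))           # number of hierarchical components (5 here)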
Since genomics datasets are massive (gigabyte to terabyte-scale), the Data Lake must be a flexible,
optimized, and scalable system. We believe that clouds (public or private) are the scalable solution, so we
are building a Data Lake framework for the cloud. Specifically, we will build the data lake using containers
that run on a Kubernetes (K8s; [17])-based system so that the Data Lake can scale dynamically as well as
be deployed on multiple commercial and public clouds. One of the critical problems of location-independent
deployment has been management complexity. While current data lakes can be deployed on any cloud
platform, they need to be configured manually to be accessible to the users. The data also depends on the
IP address and DNS name of the platform where the data lake is deployed.
We will use the open-source Data Plane Development Kit (DPDK) for network transport, which allows
location-independent optimization. This technical report describes the design and implementation of the
NDN Data Lake for the cloud.
2 Background
2.1 Named Data Networking
Currently, layer 3 of the OSI model [13] uses the Internet Protocol (IP), which implements the host-based networking (TCP/IP) paradigm: data is transferred from one host to another using a location-based address (the IP address). In contrast to IP, Named Data Networking (NDN) [32] is one of the
most developed and implemented designs using the Information-Centric Networking (ICN) paradigm. ICN
fully removes the location requirement that was introduced with the TCP/IP paradigm and shifts the focus
from location to data.
Transferring data via NDN from one host to another is accomplished via two types of packets: Interest
packets and Data packets. Receiving data is as simple as expressing (sending) an Interest packet and receiving
a matching Data packet. The applications decide which name(s) they are going to request. Once an Interest
is expressed, NDN utilizes name-based forwarding to forward the packet toward the data source. Data is
also signed and can be verified by the recipient; therefore, it can come from anywhere: a publisher, a proxy,
or an in-network cache.
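To make this exchange concrete, the following is a minimal consumer sketch written with the python-ndn library; it assumes a local NDN forwarder is running and that some producer answers the hypothetical name /example/genomics/demo.

    # Minimal NDN consumer sketch (python-ndn). Assumes a local forwarder is
    # running and a producer serves the hypothetical name below.
    from ndn.app import NDNApp
    from ndn.encoding import Name
    from ndn.types import InterestNack, InterestTimeout

    app = NDNApp()

    async def fetch():
        try:
            # Express an Interest and await the matching, signed Data packet.
            data_name, meta_info, content = await app.express_interest(
                Name.from_str('/example/genomics/demo'),
                must_be_fresh=True, can_be_prefix=False, lifetime=4000)
            size = len(bytes(content)) if content else 0
            print('Received Data:', Name.to_str(data_name), size, 'bytes')
        except (InterestNack, InterestTimeout) as err:
            print('Retrieval failed:', err)
        finally:
            app.shutdown()

    if __name__ == '__main__':
        app.run_forever(after_start=fetch())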
By using NDN, consumers and producers gain several benefits, such as data availability after server failures, a significant decrease in server traffic, and faster data retrieval. Instead of securing a connection, NDN secures the data itself, removing most connection-oriented attacks, including the Man-In-The-Middle
attack. In addition, serving and replicating data across nodes is built into NDN. In NDN, the names are
hierarchical, similar to the HTTP Uniform Resource Locator (URL). However, NDN names are Uniform
Resource Identifiers (URIs) - unlike URLs, they point to a piece of content and not the location of the
content. Hierarchical names provide the ability to reduce in-network state as well as make discovery easier.
All these properties make NDN an excellent mechanism for constructing a named genomics Content Delivery
Network (CDN) – the Genomics Data Lake.
2.1.1 Forwarders
To utilize the NDN architecture, a forwarder must be present to route and fulfill NDN Interests properly. NDN-DPDK [31] and the NDN Forwarding Daemon (NFD) [26] are network forwarders for NDN that
support Interest and Data forwarding as well as content caching in the network. This is accomplished
by abstracting lower-level network transport technologies into NDN Faces, maintaining fundamental data
structures such as the Content Store (CS), Pending Interest Table (PIT), and Forwarding Information Base (FIB), and implementing packet processing logic [25].
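As a highly simplified illustration (a toy model, not how NFD or NDN-DPDK actually implement these tables), name-based forwarding with the FIB amounts to a longest-prefix match over hierarchical name components:

    # Toy illustration of FIB longest-prefix matching over hierarchical names.
    # Real forwarders (NFD, NDN-DPDK) use far more elaborate data structures;
    # this sketch only conveys the idea behind name-based forwarding.
    def longest_prefix_match(fib, name):
        """Return the next-hop face for the longest FIB prefix matching `name`."""
        components = [c for c in name.split('/') if c]
        for length in range(len(components), 0, -1):
            prefix = '/' + '/'.join(components[:length])
            if prefix in fib:
                return fib[prefix]
        return None  # no route; a real forwarder would NACK or drop the Interest

    # Hypothetical FIB: name prefix -> outgoing face identifier.
    fib = {
        '/genomics': 'face-256',
        '/genomics/homo_sapiens': 'face-257',
    }

    print(longest_prefix_match(fib, '/genomics/homo_sapiens/rna-seq/SAMPLE-0001'))
    # -> 'face-257' (the most specific matching prefix wins)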
NDN-DPDK [31] is a high-speed NDN forwarder developed with the Data Plane Development Kit
(DPDK)[3]. DPDK includes data plane libraries and polling-mode network interface controller drivers for
offloading TCP packet processing from the operating system kernel to user-space programs. This offloading
enables better computational efficiency and packet throughput than is attainable with the kernel’s interrupt-
driven processing.
With NFD, high-speed forwarding remains a challenge due to variable-length name-based lookups as well as packet state updates. In this project, we choose the NDN-DPDK forwarder due to its performance
advantages over NFD. While running on commodity hardware, the NDN-DPDK forwarder can reach a
forwarding speed of more than 100 Gbps [29]. This will be useful in transferring data between NDN data
lakes and cloud deployments (such as the Pacific Research Platform (PRP) Kubernetes cluster [30]).
2.2 Kubernetes
In recent years, container technology has been gaining increasing traction. A container is a unit of software
that bundles code and all dependencies required for the app to run. Containers also lend themselves well
to the architectural approach where an application is separated into multiple services that rely on each
other to perform the full desired function [22]. Kubernetes [17] is an open-source container orchestration
framework that provides an ideal platform for automating containerized applications in different deployment
environments. Kubernetes typically deploys containers in environments with high-bandwidth, low-latency network connections. Applications are spread across service nodes to provide high availability (the applications remain accessible to users at all times) and scalability (applications scale quickly when more users try to access them). Moreover, Kubernetes provides disaster-recovery mechanisms that prevent users from losing data when hardware or software failures occur in the service center.
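For instance, scaling a containerized service can be driven through the Kubernetes API; the sketch below uses the official kubernetes Python client with a hypothetical deployment name.

    # Sketch: scale a Deployment through the Kubernetes API using the official
    # `kubernetes` Python client. The deployment name and namespace are
    # hypothetical placeholders.
    from kubernetes import client, config

    config.load_kube_config()      # read cluster credentials from ~/.kube/config
    apps = client.AppsV1Api()

    # Raise the replica count so more pods are available to serve requests.
    apps.patch_namespaced_deployment_scale(
        name='ndn-datalake',       # hypothetical Deployment
        namespace='default',
        body={'spec': {'replicas': 3}})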
2.2.1 Limitations of external inbound access
When working with Kubernetes, internal networking inside a cluster follows default communication rules that facilitate networking between services on different hosts in the cluster. This communication is further facilitated by cluster IPs attached to each pod, which allow straightforward internal communication but no external access. While communication restricted to the inside of a Kubernetes cluster may be sufficient for some applications, others require external access. A Kubernetes mechanism such as a NodePort service, an NGINX ingress, or a load balancer is necessary to resolve this limitation [11][18].
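As a sketch of one such option, a NodePort Service can expose an in-cluster NDN forwarder on the conventional NDN port (6363/UDP); the labels, names, and node port below are hypothetical.

    # Sketch: expose an in-cluster NDN forwarder externally with a NodePort
    # Service, created through the official `kubernetes` Python client.
    # Labels, names, and the node port are hypothetical.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    service = client.V1Service(
        metadata=client.V1ObjectMeta(name='ndn-forwarder'),
        spec=client.V1ServiceSpec(
            type='NodePort',
            selector={'app': 'ndn-forwarder'},   # pods to expose
            ports=[client.V1ServicePort(
                protocol='UDP',
                port=6363,          # conventional NDN port inside the cluster
                target_port=6363,
                node_port=30363)])) # externally reachable port on every node

    core.create_namespaced_service(namespace='default', body=service)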
2.3 Data Lake
A Data Lake is a repository that centralizes raw data from many data sources into a single data
store. This data repository is designed to handle large amounts of data ranging from multiple terabytes to
petabytes. The data is long-lived, unprocessed, and left in a form that other services can access and use for
analytics or machine learning, depending on the application [23][24][33].
3 Overview
Figure 1: The figure demonstrates a high-level overview of the proposed Data Lake using Kubernetes. As
shown, the data storage and Docker containers are contained inside the Kubernetes cluster with networking
services used to provide external access to a client application.
The proposed Data Lake for genomics research is built upon the highly scalable platform of Kuber-
netes. Inside Kubernetes, multiple functions are required to enable the successful creation and access of the
novel genomics Data Lake that utilizes NDN and DPDK to discover and deliver needed samples. To assist
with comprehension of the overall design, which is explained in more detail later in this report, Figure 1 provides a high-level overview of the proposal. As discussed before, one of the first and most essential
components is Kubernetes. All the storage and features needed to facilitate the content delivery are run