NDN-TR70 - Utilizing NDN-DPDK for Kubernetes Genomics Data
Lake
Sankalpa Timilsina, Justin Presley, David Reddick, Susmit Shannigrahi,
Tennessee Tech,
Xusheng Ai, Coleman Mcknight, Alex Feltus
Clemson University
October 20, 2022
Abstract
As the volume of genomics samples rapidly expands due to increased access to high-resolution DNA sequencing technology, the need for a scalable platform to aggregate dispersed datasets and enable easy access to the vast wealth of available DNA sequences is paramount.
In this work, we introduce and demonstrate a novel way to use Named Data Networking
(NDN) in conjunction with a Kubernetes cluster to design a flexible and scalable genomics
Data Lake in the cloud. In addition, the use of the NDN Data Plane Development Kit
(DPDK) provides efficient and accessible distribution of the datasets to researchers anywhere. This report explains the need to deploy a Data Lake for genomics data, describes what is necessary to deploy it successfully, and gives detailed instructions for replicating the proposed design. Finally, it outlines options for future enhancements.
1 Introduction
A Data Lake is a large-scale repository for storing raw and intermediate files for data analytics. Why create a Data Lake for genomics data? Since scientists constructed the reference human DNA genome over two
decades ago, there has been an explosion of genomes sequenced from thousands of organisms across the
tree of life. Each genome sequence is the blueprint for that species and serves as a scaffold for organizing
new knowledge on the flow of information from DNA to a trait. While interesting from a general biology
perspective, the applications of merging the genome with other genomics datasets have practical implications,
including applications in medicine, agriculture, bioenergy, and other fields.
A barrier to using genomics datasets is that they are often difficult to find and access (download) from multiple public repositories and to merge with other public or local data so that they can be co-analyzed (interoperable). This makes the data un-FAIR (Findable, Accessible, Interoperable, Reusable; [2]).
Here we propose a FAIR Data Lake where genomics datasets from different sources are named and co-
published without the end-user needing to know how to transfer data to the Data Lake. Datasets are then
pulled from the data lake using an intuitive naming convention based on Named Data Networking (NDN;
[12]). An overview of genomics data naming can be found in [28].
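As a minimal illustration of such a naming convention (the component layout below is hypothetical and is not the scheme from [28]), a genomics dataset could be published under a hierarchical NDN name; the sketch uses the python-ndn library to construct and print one:

    # Minimal sketch of a hypothetical hierarchical genomics name, built with
    # the python-ndn library (pip install python-ndn). The component layout
    # (/genomics/<species>/<assay>/<sample-id>/<file>) is illustrative only.
    from ndn.encoding import Name

    # Hypothetical dataset name: top-level prefix, species, assay type, sample, file.
    name = Name.from_str('/genomics/homo_sapiens/rna-seq/SAMPLE-0001/reads.fastq')

    # python-ndn represents a name as a list of TLV-encoded components.
    print(Name.to_str(name))   # canonical URI form of the name
    print(len(name))           # number of hierarchical components (5 here)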
Since genomics datasets are massive (gigabyte to terabyte-scale), the Data Lake must be a flexible,
optimized, and scalable system. We believe that clouds (public or private) are the scalable solution, so we
are building a Data Lake framework for the cloud. Specifically, we will build the data lake using containers
that run on a Kubernetes (K8s; [17])-based system so that the Data Lake can scale dynamically as well as
be deployed on multiple commercial and public clouds. One of the critical problems of location-independent
deployment has been management complexity. While current data lakes can be deployed on any cloud
platform, they need to be configured manually to be accessible to the users. The data also depends on the
IP address and DNS name of the platform where the data lake is deployed.
We will use the open-source Data Plane Development Kit (DPDK) for network transport, which allows
location-independent optimization. This technical report describes the design and implementation of the
NDN Data Lake for the cloud.
2 Background
2.1 Named Data Networking
Currently, layer 3 of the OSI model [13] uses the Internet Protocol (IP), which implements the host-based networking (TCP/IP) paradigm: data is transferred from one host to another using a location-based address (the IP address). In contrast to IP, Named Data Networking (NDN) [32] is one of the
most developed and implemented designs using the Information-Centric Networking (ICN) paradigm. ICN
fully removes the location requirement that was introduced with the TCP/IP paradigm and shifts the focus
from location to data.
Transferring data via NDN from one host to another is accomplished via two types of packets: Interest
packets and Data packets. Receiving data is as simple as expressing (sending) an Interest packet and receiving
a matching Data packet. The applications decide which name(s) they are going to request. Once an Interest
is expressed, NDN utilizes name-based forwarding to forward the packet toward the data source. Data is
also signed and can be verified by the recipient; therefore, it can come from anywhere: a publisher, a proxy,
or an in-network cache.
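To make this exchange concrete, the following is a minimal consumer sketch written with the python-ndn library; it assumes a local NDN forwarder is running and that some producer answers the hypothetical name /example/genomics/demo.

    # Minimal NDN consumer sketch (python-ndn). Assumes a local forwarder is
    # running and a producer serves the hypothetical name below.
    from ndn.app import NDNApp
    from ndn.encoding import Name
    from ndn.types import InterestNack, InterestTimeout

    app = NDNApp()

    async def fetch():
        try:
            # Express an Interest and await the matching, signed Data packet.
            data_name, meta_info, content = await app.express_interest(
                Name.from_str('/example/genomics/demo'),
                must_be_fresh=True, can_be_prefix=False, lifetime=4000)
            size = len(bytes(content)) if content else 0
            print('Received Data:', Name.to_str(data_name), size, 'bytes')
        except (InterestNack, InterestTimeout) as err:
            print('Retrieval failed:', err)
        finally:
            app.shutdown()

    if __name__ == '__main__':
        app.run_forever(after_start=fetch())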
By using NDN, consumers and producers gain several benefits, such as data availability after server failures, a significant decrease in server traffic, and faster data retrieval. Instead of securing a connection, NDN secures the data itself, removing most connection-oriented attacks, including the Man-In-The-Middle
attack. In addition, serving and replicating data across nodes is built into NDN. In NDN, the names are
hierarchical, similar to the HTTP Uniform Resource Locator (URL). However, NDN names are Uniform
Resource Identifiers (URIs) - unlike URLs, they point to a piece of content and not the location of the
content. Hierarchical names provide the ability to reduce in-network state as well as make discovery easier.
All these properties make NDN an excellent mechanism for constructing a named genomics Content Delivery
Network (CDN) – the Genomics Data Lake.
2.1.1 Forwarders
To utilize the NDN architecture, a forwarder must be present to route and fulfill NDN Interests properly. NDN-DPDK [31] and the NDN Forwarding Daemon (NFD) [26] are network forwarders for NDN that
support Interest and Data forwarding as well as content caching in the network. This is accomplished
by abstracting lower-level network transport technologies into NDN Faces, maintaining fundamental data
structures such as the Content Store (CS), Pending Interest Table (PIT), and Forwarding Information Base (FIB), and implementing packet processing logic [25].
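As a highly simplified illustration (a toy model, not how NFD or NDN-DPDK actually implement these tables), name-based forwarding with the FIB amounts to a longest-prefix match over hierarchical name components:

    # Toy illustration of FIB longest-prefix matching over hierarchical names.
    # Real forwarders (NFD, NDN-DPDK) use far more elaborate data structures;
    # this sketch only conveys the idea behind name-based forwarding.
    def longest_prefix_match(fib, name):
        """Return the next-hop face for the longest FIB prefix matching `name`."""
        components = [c for c in name.split('/') if c]
        for length in range(len(components), 0, -1):
            prefix = '/' + '/'.join(components[:length])
            if prefix in fib:
                return fib[prefix]
        return None  # no route; a real forwarder would NACK or drop the Interest

    # Hypothetical FIB: name prefix -> outgoing face identifier.
    fib = {
        '/genomics': 'face-256',
        '/genomics/homo_sapiens': 'face-257',
    }

    print(longest_prefix_match(fib, '/genomics/homo_sapiens/rna-seq/SAMPLE-0001'))
    # -> 'face-257' (the most specific matching prefix wins)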
NDN-DPDK [31] is a high-speed NDN forwarder developed with the Data Plane Development Kit
(DPDK)[3]. DPDK includes data plane libraries and polling-mode network interface controller drivers for
offloading TCP packet processing from the operating system kernel to user-space programs. This offloading
enables better computational efficiency and packet throughput than is attainable with the kernel’s interrupt-
driven processing.
With NFD, high-speed forwarding remains a challenge due to variable-length name-based lookups as well as packet state updates. In this project, we choose the NDN-DPDK forwarder due to its performance
advantages over NFD. While running on commodity hardware, the NDN-DPDK forwarder can reach a
forwarding speed of more than 100 Gbps [29]. This will be useful in transferring data between NDN data
lakes and cloud deployments (such as the Pacific Research Platform (PRP) Kubernetes cluster [30]).
2.2 Kubernetes
In recent years, container technology has been gaining increasing traction. A container is a unit of software
that bundles code and all dependencies required for the app to run. Containers also lend themselves well
to the architectural approach where an application is separated into multiple services that rely on each
other to perform the full desired function [22]. Kubernetes [17] is an open-source container orchestration
framework that provides an ideal platform for automating containerized applications in different deployment
environments. Kubernetes typically deploys containers in environments with high-bandwidth, low-latency network connections. Applications are spread across service nodes to provide high availability (the applications remain accessible to users at all times) and scalability (applications scale quickly when more users try to access them). Moreover, Kubernetes provides disaster-recovery mechanisms that prevent users from losing data when hardware or software failures occur in the service center.
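For instance, scaling a containerized service can be driven through the Kubernetes API; the sketch below uses the official kubernetes Python client with a hypothetical deployment name.

    # Sketch: scale a Deployment through the Kubernetes API using the official
    # `kubernetes` Python client. The deployment name and namespace are
    # hypothetical placeholders.
    from kubernetes import client, config

    config.load_kube_config()      # read cluster credentials from ~/.kube/config
    apps = client.AppsV1Api()

    # Raise the replica count so more pods are available to serve requests.
    apps.patch_namespaced_deployment_scale(
        name='ndn-datalake',       # hypothetical Deployment
        namespace='default',
        body={'spec': {'replicas': 3}})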
2.2.1 Limitations of external inbound access
When working with Kubernetes, internal networking inside a cluster follows default communication rules that facilitate networking between services on different hosts in the cluster. This communication is further facilitated by cluster IPs attached to each pod, which allow straightforward internal communication but no external access. While communication restricted to the inside of a Kubernetes cluster may be sufficient for some applications, others require external access. A Kubernetes mechanism such as a NodePort service, an NGINX ingress, or a load balancer is necessary to resolve this limitation [11][18].
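As a sketch of one such option, a NodePort Service can expose an in-cluster NDN forwarder on the conventional NDN port (6363/UDP); the labels, names, and node port below are hypothetical.

    # Sketch: expose an in-cluster NDN forwarder externally with a NodePort
    # Service, created through the official `kubernetes` Python client.
    # Labels, names, and the node port are hypothetical.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    service = client.V1Service(
        metadata=client.V1ObjectMeta(name='ndn-forwarder'),
        spec=client.V1ServiceSpec(
            type='NodePort',
            selector={'app': 'ndn-forwarder'},   # pods to expose
            ports=[client.V1ServicePort(
                protocol='UDP',
                port=6363,          # conventional NDN port inside the cluster
                target_port=6363,
                node_port=30363)])) # externally reachable port on every node

    core.create_namespaced_service(namespace='default', body=service)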
2.3 Data Lake
A Data Lake is a repository that centralizes raw data from many data sources into a single data
store. This data repository is designed to handle large amounts of data ranging from multiple terabytes to
petabytes. The data is long-lived, unprocessed, and left in a form that other services can access and use for
analytics or machine learning, depending on the application [23][24][33].
3 Overview
Figure 1: The figure demonstrates a high-level overview of the proposed Data Lake using Kubernetes. As
shown, the data storage and Docker containers are contained inside the Kubernetes cluster with networking
services used to provide external access to a client application.
The proposed Data Lake for genomics research is built upon the highly scalable platform of Kuber-
netes. Inside Kubernetes, multiple functions are required to enable the successful creation and access of the
novel genomics Data Lake that utilizes NDN and DPDK to discover and deliver needed samples. To assist
with comprehension of the overall design, which is explained in more detail later in this report, Figure 1 provides a high-level overview of the proposal. As discussed before, one of the first and most essential
components is Kubernetes. All the storage and features needed to facilitate the content delivery are run