RMLStreamer-SISO: an RDF stream generator
from streaming heterogeneous data
Sitt Min Oo¹ [0000-0001-9157-7507], Gerald Haesendonck¹ [0000-0003-1605-3855],
Ben De Meester¹ [0000-0003-0248-0987], and Anastasia
Dimou² [0000-0003-2138-7972]
¹ IDLab, Dept. Electronics & Information Systems, Ghent University – imec, Belgium
{x.sittminoo, gerald.haesendonck, ben.demeester}@ugent.be
² KU Leuven, Dept. Computer Science – Leuven.AI – Flanders Make, Belgium
anastasia.dimou@kuleuven.be
Abstract. Stream-reasoning query languages such as CQELS and C-SPARQL
enable query answering over RDF streams. Unfortunately, there is currently
a lack of efficient RDF stream generators to feed RDF stream reasoners.
State-of-the-art RDF stream generators are limited with regard to the
velocity and volume of streaming data they can handle. To efficiently
generate RDF streams in a scalable way, we extended the RMLStreamer to also
generate RDF streams from dynamic heterogeneous data streams. This paper
introduces a scalable solution that relies on a dynamic window approach to
generate RDF streams with low latency and high throughput from multiple
heterogeneous data streams. Our evaluation shows that our solution
outperforms the state of the art by achieving millisecond latency (compared
to the seconds that state-of-the-art solutions need), constant memory usage
for all workloads, and a sustainable throughput of around 70,000 records/s
(compared to the 10,000 records/s that state-of-the-art solutions sustain).
This opens up access to numerous data streams for integration with the
semantic web.
Resource type: Software
License: MIT License
URL: https://github.com/RMLio/RMLStreamer/releases/tag/v2.3.0
Keywords: RML · Stream processing · Window Joins · Knowledge
graph generation
1 Introduction
An increasing portion of data is continuous in nature, e.g., sensor events, user
activities on a website, or financial trade events. This type of data is known as
data streams: sequences of unbounded tuples generated continuously at different
rates and volumes [3]. Due to the temporal nature of data streams, low-latency
computation of analytical results is needed to react in time in different use cases,
e.g., fraud detection [9]. Thus, stream processing engines must efficiently handle
low-latency computation over data of varying velocity and volume.
arXiv:2210.14599v1 [cs.DB] 26 Oct 2022
On the one hand, different frameworks were proposed to handle data streams,
e.g., Flink, Spark or Storm [6,26,19]. On the other hand, RDF stream processing
(RSP) engines, e.g., CQELS and C-SPARQL [16,1,5], were widely studied
and perform high-throughput analysis of RDF streams with low memory
footprints [16]. Yet, these stream processing frameworks are not substantially used
in the domain of RDF graph generation from streaming data sources, despite
the demand of these mature RSP engines for more RDF streams.
Between data processing frameworks and stream processing engines, there are
tools to generate RDF streams from heterogeneous data streams (e.g., SPARQL-Generate
[17], RDF-Gen [21], TripleWave [18], Cefriel’s Chimera [22]). However,
some of these tools, such as TripleWave and SPARQL-Generate, become inefficient
when the data stream scales in terms of volume and velocity. Others are either
not open source (RDF-Gen) or not suitable for mapping streaming data
(Cefriel’s Chimera). Overall, there are no RDF stream generators that keep up
with the needs of stream reasoning engines while taking advantage of data
processing frameworks to efficiently produce RDF streams.
In this paper, we present the RMLStreamer-SISO, a parallel, vertically and
horizontally scalable stream processing engine to generate RDF streams from
heterogeneous data streams of any format (e.g., JSON, CSV, XML). We
extended previous preliminary work [13] on a heterogeneous data stream mapping
solution: an open source implementation on top of Apache Flink [6], available
under the MIT license, which generates high-volume RDF data from high-volume
heterogeneous data. RMLStreamer-SISO extends RMLStreamer to also support any
input data streams and export RDF streams (Stream-In-Stream-Out (SISO)).
RMLStreamer-SISO now supports a much larger part of the RML specification³,
including all features of RML except relational databases.
The RMLStreamer-SISO outperforms the state-of-the-art tools when
handling high-velocity data streams, increasing the throughput it can handle
while maintaining low latency. The RMLStreamer-SISO achieves millisecond
latency, as opposed to the seconds that state-of-the-art solutions need, constant
memory usage for all workloads, and a sustainable throughput of around 70,000
records/s, compared to the 10,000 records/s that state-of-the-art solutions sustain.
Through the utilization of a low-latency tool like RMLStreamer-SISO, legacy
streaming systems could exploit the unique characteristics of real-life streaming
data, while enabling analysts to exploit semantic reasoning over knowledge
graphs in real time and have access to more reliable data.
The contributions presented in this paper are: (i) an algorithm to generate
RDF streams from heterogeneous streaming data; (ii) its implementation,
the RMLStreamer-SISO, as an extension of RMLStreamer; and (iii) an evaluation
demonstrating that the RMLStreamer-SISO outperforms the state of the art.
The paper is structured as follows: Section 2 discusses related work, Section 3
presents the approach and its implementation, Section 4 describes the evaluation
of RMLStreamer-SISO against the state of the art, Section 5 reports the results
of our evaluations, and Section 7 concludes our work with possible future work.
³ Implementation report of RML: https://rml.io/implementation-report/
2 Related Works
Streaming RDF mapping engines transform heterogeneous data streams to RDF
data streams. Several solutions exist in the literature for generating RDF from
persistent data sources [23,14,13,2], but only a few generate RDF from data streams
[21,17,18]. Although the implementation details are elaborated in these works,
their evaluations are designed without considering the different data stream
behaviours or the resource contention between different evaluation components.
TripleWave [18] generates RDF streams from streaming or static data
sources using R2RML mappings, and publishes them as an RDF stream. However,
the R2RML mappings of TripleWave are invalid according to the R2RML
specification, and it does not support joins. Although it is purported to support
several input sources, the user has to write the code to process the input data
and iterate over them before using the tool. An improper implementation of this
code can result in poor performance. Last, it is not designed to support distributed
parallel processing, resulting in limited scaling with data volume and velocity.
RDF-Gen [21] generates static or streaming RDF data from static or
streaming data sources. A data connector communicates with the data source,
iterates over its data entries, and converts every entry to a record of values.
These records are converted to RDF using a graph template: a listing of RDF-like
statements with variables bound to the record values coming from data
connectors. RDF-Gen generates RDF on a per-record basis, theoretically allowing a
distributed parallel processing set-up. However, the current implementation and
documentation show no indication of a clustered setup nor how to run it.
SPARQL-Generate [17] extends SPARQL 1.1 syntax to support mapping
of heterogeneous data to RDF data. SPARQL-Generate could be implemented
on top of any SPARQL query engine, and knowledge engineers with SPARQL
experience could use it with ease. The reference implementation of
SPARQL-Generate⁴ generates RDF streams from data streams, even though this is not
reported in the original paper. Although joining data from multiple sources is
supported, SPARQL-Generate waits for one of the data streams to end before
consuming other data sources to join the data. Thus, joins with unbounded
streaming data sources are not supported. The implementation is based on a
single-machine setup and does not scale with data volume and velocity.
Cefriel’s Chimera [22] is an integration framework based on Apache Camel⁵,
split into four “blocks” of components to map heterogeneous data to RDF data:
lifting block, data enricher, inference enricher, and lowering block. Chimera aims
to be modular and allows each block to be replaced with custom implementations.
The current implementation uses a modified version of RMLMapper⁶ in
the lifting block for data stream processing. However, the whole RML mapping
process is recreated with each incoming message, which could lead to high
performance overhead in a highly dynamic data stream environment.
⁴ SPARQL-Generate: https://github.com/sparql-generate/sparql-generate
⁵ Apache Camel: https://camel.apache.org/
⁶ RMLMapper: https://github.com/RMLio/rmlmapper-java
3 Stream In - Stream Out (SISO)
We extend RMLStreamer’s architecture for generating RDF from persistent big
data sources [13] to also generate RDF streams from heterogeneous data streams
with high data velocity and volume, while keeping the latency low. The RDF
Mapping Language (RML) [10], a superset of R2RML, expresses customized mappings
from heterogeneous data sources to RDF datasets. We illustrate the concepts
of RML with the example RML document in Listing 1.2.
We break the process of generating RDF from a data stream into tasks and
subtasks (Figure 1). Each task or subtask is a stream processing operator acting
on an incoming data stream. Operators can be chained one after the other to form
a pipeline that results in one or more outgoing data streams. This
approach introduces parallelism at both the data and the processing level: data
streams are processed in parallel, and operators are executed in parallel.
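The idea of chaining operators into a pipeline can be sketched in a few lines. The following is a minimal illustration using Python generators, not the RMLStreamer-SISO implementation (which runs on Apache Flink). The operator names, the `chain` helper, and the `http://example.org/` IRI scheme are assumptions made for this example only.

```python
import json

def parse_json(lines):
    """Ingestion: turn raw text records into dictionaries."""
    for line in lines:
        yield json.loads(line)

def to_statements(records):
    """Mapping: emit one (subject, predicate, object) tuple per record field."""
    for rec in records:
        # Hypothetical IRI scheme, chosen only for illustration.
        subject = f"http://example.org/lane/{rec['id']}"
        for key, value in rec.items():
            if key != "id":
                yield (subject, f"http://example.org/{key}", value)

def serialize(statements):
    """Serialization: render each tuple as an N-Triples-like line."""
    for s, p, o in statements:
        yield f'<{s}> <{p}> "{o}" .'

def chain(*operators):
    """Compose operators so that each one consumes the previous one's output."""
    def pipeline(stream):
        for op in operators:
            stream = op(stream)
        return stream
    return pipeline

pipeline = chain(parse_json, to_statements, serialize)
raw = ['{"speed":123.0, "time":"14:42:00", "id":"lane1"}']
output = list(pipeline(raw))
for line in output:
    print(line)
```

Because every operator only depends on the records it receives, several copies of the same operator could in principle run side by side on different partitions of the stream, which is the kind of parallelism the paragraph above describes.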
To illustrate RMLStreamer-SISO’s pipeline, we use the examples in Listings
1.1 and 1.2. The mapping document in Listing 1.2 is used to join JSON data
(Listing 1.1) from websocket streams with a dynamic window join and map it to RDF.
[Figure 1: pipeline diagram. Stages: Ingestion, Pre-mapping (optional), Mapping,
Combination. Components: a) Data source; b) Data source connector; c) Records;
d) Partitioner*; e) Item generator*; f) Data items; g) Stream processing operators
(Window operators, FnO functions); h) Statement generators* (Subject generator,
Predicate generator, Object generator); i) Abstract RDF statements; j) RDF serializer;
k) Serialized RDF statements; l) Stream merger; m) Sink writer.
Legend: data; component; components mapped to Flink operator(s); data flow;
* introduces parallelism.]
Fig. 1. Workflow of RMLStreamer. Data flows from the Data Source at the top through
all the components pipeline to the Sink writer at the bottom.
Listing 1.1. Data records from 2 data streams “Flow” & “Speed”.
// data records from Speed stream
{"speed":123.0, "time":"14:42:00", "id":"lane1"}
// data records from Flow stream
{"flow":1680, "time":"14:42:00", "id":"lane1"}
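To make the join over these two streams concrete, the following sketch joins Speed and Flow records that share an `id` and fall into the same time window. It is a simplified illustration under stated assumptions: it uses a fixed 60-second window rather than the dynamic window described in this paper, it works on already-collected lists instead of unbounded streams, and the function names and the second Flow record are hypothetical.

```python
import json
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed fixed window size; the paper's window is dynamic

def to_seconds(hhmmss):
    """Convert an 'HH:MM:SS' string to seconds since midnight."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s

def window_join(speed_records, flow_records, window=WINDOW_SECONDS):
    """Join records that share an id and fall into the same time window."""
    buckets = defaultdict(list)
    for rec in speed_records:
        buckets[(rec["id"], to_seconds(rec["time"]) // window)].append(rec)
    joined = []
    for rec in flow_records:
        key = (rec["id"], to_seconds(rec["time"]) // window)
        for match in buckets.get(key, []):
            joined.append({**match, **rec})
    return joined

speed = [json.loads('{"speed":123.0, "time":"14:42:00", "id":"lane1"}')]
flow = [
    json.loads('{"flow":1680, "time":"14:42:00", "id":"lane1"}'),
    # Hypothetical extra record that falls outside the window of the Speed record.
    json.loads('{"flow":900, "time":"15:10:00", "id":"lane1"}'),
]
result = window_join(speed, flow)
print(result)
```

Only the first Flow record is joined, since the second one lies in a different window. A fixed window like this forces a trade-off between latency and completeness of the join, which is one motivation for adapting the window size to the stream.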