
On the one hand, different frameworks have been proposed to handle data streams,
e.g., Flink, Spark or Storm [6,26,19]. On the other hand, RDF stream processing
(RSP) engines, e.g., CQELS and C-SPARQL [16,1,5], have been widely studied
and perform high-throughput analysis of RDF streams with low memory footprints
[16]. Yet, these stream processing frameworks are rarely used
for RDF graph generation from streaming data sources, despite
the demand of these mature RSP engines for more RDF streams.
Between data processing frameworks and stream processing engines, there are
tools to generate RDF streams from heterogeneous data streams (e.g.
SPARQL-Generate [17], RDFGen [21], TripleWave [18], Cefriel’s Chimera [22]). However,
some of these tools, such as TripleWave and SPARQL-Generate, become inefficient
when the data stream scales in terms of volume and velocity. Other
tools are either not open source or not suitable for mapping streaming data, such as
RDFGen and Cefriel’s Chimera, respectively. Overall, there are no RDF stream
generators that keep up with the needs of stream reasoning engines while taking
advantage of data processing frameworks to efficiently produce RDF streams.
In this paper, we present the RMLStreamer-SISO, a parallel, vertically and
horizontally scalable stream processing engine to generate RDF streams from
heterogeneous data streams of any format (e.g. JSON, CSV, XML). We
extend previous preliminary work [13] on a heterogeneous data stream mapping
solution: an open source implementation on top of Apache Flink [6], available
under the MIT license, which generates high volume RDF data from high volume
heterogeneous data. RMLStreamer-SISO extends RMLStreamer to also support
arbitrary input data streams and to export RDF streams (Stream-In-Stream-Out (SISO)).
RMLStreamer-SISO now supports a much larger part of the RML specification³,
including all RML features except relational databases.
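To make the stream-in-stream-out idea concrete, the following is a minimal sketch, not the RMLStreamer-SISO implementation and not RML syntax. It assumes a hypothetical mapping rule (`SUBJECT_TEMPLATE`, `PREDICATE_TO_FIELD`), a hypothetical `http://example.org/` namespace, and JSON records carrying an `id` field; each incoming record is mapped to N-Triples lines as soon as it arrives, without buffering the stream.

```python
import json
from typing import Dict, Iterable, Iterator

# Hypothetical mapping rule (an assumption for illustration, not RML syntax):
# a subject IRI template and a dictionary from predicate IRI to the JSON
# field that supplies the object literal.
SUBJECT_TEMPLATE = "http://example.org/sensor/{id}"
PREDICATE_TO_FIELD: Dict[str, str] = {
    "http://example.org/temperature": "temp",
    "http://example.org/timestamp": "ts",
}


def escape_literal(value: object) -> str:
    """Escape characters that are not allowed raw in an N-Triples literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def map_stream(records: Iterable[str]) -> Iterator[str]:
    """Lazily map each incoming JSON record to N-Triples lines."""
    for raw in records:
        record = json.loads(raw)
        subject = SUBJECT_TEMPLATE.format(id=record["id"])
        for predicate, field in PREDICATE_TO_FIELD.items():
            if field in record:  # skip fields missing from this record
                literal = escape_literal(record[field])
                yield f'<{subject}> <{predicate}> "{literal}" .'


# Usage: any iterable of JSON strings can stand in for the input stream.
incoming = iter(['{"id": "s1", "temp": 21.5, "ts": "2023-01-01T00:00:00Z"}'])
for triple in map_stream(incoming):
    print(triple)
```

Because `map_stream` is a generator, memory use in this sketch does not grow with the length of the stream; a real engine would additionally distribute the mapping across parallel workers.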
The RMLStreamer-SISO outperforms the state-of-the-art tools when
handling high velocity data streams, increasing the throughput it can handle
while maintaining low latency. The RMLStreamer-SISO achieves millisecond
latency, as opposed to the seconds that state-of-the-art solutions need, constant
memory usage for all workloads, and a sustainable throughput of around 70,000
records/s, compared to the 10,000 records/s that state-of-the-art solutions sustain.
Through the utilization of a low-latency tool like RMLStreamer-SISO, legacy
streaming systems could exploit the unique characteristics of real-life streaming
data, while enabling analysts to exploit semantic reasoning using knowledge
graphs in real time and have access to more reliable data.
The contributions presented in this paper are: (i) an algorithm to generate
RDF streams from heterogeneous streaming data; (ii) its implementation,
the RMLStreamer-SISO, as an extension of RMLStreamer; and (iii) an evaluation
demonstrating that the RMLStreamer-SISO outperforms the state of the art.
The paper is structured as follows: Section 2 discusses related work, Section 3
the approach and its implementation, Section 4 the evaluation of RMLStreamer-SISO
against the state of the art, Section 5 the results of our evaluations, and
Section 7 concludes our work with possible future work.
³ Implementation report of RML: https://rml.io/implementation-report/