The amount of data on the web has grown
rapidly, and its metadata is widely used to fully
exploit web information resources. The Semantic
Web is the World Wide Web Consortium's vision of a
Web of data: a common framework that allows data to
be shared and reused across applications and
enterprises. This vision requires a definition of
the relations among data that enables better,
automatic data interchange. The Resource Description
Framework (RDF), one of the fundamental building
blocks of the Semantic Web, gives a formal definition
for the interchange of data. It is a standard for
describing web resources.
RDF data takes the form of subject-predicate-object
statements, called triples. The subject denotes the
resource, while the predicate denotes the relation or
property that holds between the subject and the
object.
For example, one way to represent the notion
“The woman has the sweets” in RDF is as the triple:
a subject denoting “the woman”, a predicate denoting
“has”, and an object denoting “the sweets.”
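As a minimal sketch of the example above, a triple can be modeled as a three-field record and written out in the N-Triples line format. The `http://example.org/` namespace and the names `Triple` and `to_ntriples` are assumptions made for illustration only; they are not part of the system described in this paper.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    object: str


# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"

# "The woman has the sweets" as a single triple of IRIs.
woman_has_sweets = Triple(EX + "woman", EX + "has", EX + "sweets")


def to_ntriples(t: Triple) -> str:
    """Serialize a triple whose three terms are all IRIs as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.object}> ."


print(to_ntriples(woman_has_sweets))
```

Here all three terms are IRIs; in general the object of a triple may also be a literal value such as a string or a number.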
Many types of storage engines have been
designed and evaluated for triples. One of them is the
triple store, a purpose-built database for the
storage and retrieval of triples. Queries on these
triples are written in SPARQL, a language designed
specifically to query RDF databases. The efficiency
of RDF data analysis depends on the performance of
the RDF storage and query engine.
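To make the idea of a triple store concrete, the following is a toy in-memory sketch, not a real engine and not SPARQL itself. It supports only the simplest kind of lookup, a single triple pattern in which `None` plays the role of a variable. The class name `TinyTripleStore` and the `ex:` prefixed names are hypothetical.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class TinyTripleStore:
    """Toy triple store: a set of triples plus single-pattern matching."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(self,
              s: Optional[str] = None,
              p: Optional[str] = None,
              o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching the pattern; None matches any term."""
        for triple in sorted(self._triples):
            if all(want is None or want == got
                   for want, got in zip((s, p, o), triple)):
                yield triple


store = TinyTripleStore()
store.add("ex:woman", "ex:has", "ex:sweets")
store.add("ex:woman", "ex:knows", "ex:baker")
store.add("ex:baker", "ex:has", "ex:bread")

# Roughly the pattern of: SELECT ?s ?o WHERE { ?s ex:has ?o }
for triple in store.match(p="ex:has"):
    print(triple)
```

A real store answers such patterns through indexes rather than a full scan, and a SPARQL query typically joins several patterns; the performance of those joins is what the benchmarks discussed in this paper measure.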
Traditional RDF database systems query
data from native RDF stores or from relational
database systems. The motivation for native
RDF-specific stores is that the relational model is not
particularly well suited to the storage and retrieval
of RDF data, because RDF is a graph data model.
However, relational database systems are equipped
with mature optimization techniques for storing and
querying data.
A NoSQL database is another type of database,
one that is not relational and does not use SQL to
query the data. NoSQL data models can be divided
into four types: document databases using the
JSON data format, key-value databases, column
store databases, and graph databases. NoSQL databases
differ from relational databases in characteristics
such as being schema-free and supporting replication.
The motivation for this approach includes simplicity
of design and horizontal scaling to support big data.
Recently, NoSQL databases have been more
successful than traditional relational database
systems at processing big data effectively in the
cloud [1]. To gain performance, NoSQL databases
sacrifice ACID (Atomicity, Consistency, Isolation,
and Durability), a set of properties that
guarantee that database transactions are processed
reliably [2]. However, the advocates of NoSQL
databases argue that they should instead address the
triple of requirements known as CAP: consistency (C),
availability (A), and partition tolerance (P) [1].
of the question how to process web data quickly.
Thus, we propose a method that exploits a NoSQL
database, specifically MongoDB, to store and query
RDF. MongoDB is chosen because it is one of the most
widely used NoSQL databases. The system first invokes
the NoSQL API to retrieve MongoDB data in JSON format.
Then, the JSON parser module converts the JSON
data to RDF data. We evaluate our design and
implementation using the Berlin SPARQL
Benchmark, one of the most widely accepted
benchmarks for RDF storage systems, to compare the
performance of three systems: Apache Jena
TDB (native RDF store), MySQL (relational database),
and MongoDB (NoSQL database).
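The JSON-to-RDF conversion step can be illustrated with a short sketch. This is not the paper's parser module: the document layout (an `_id` field plus flat key-value fields), the `http://example.org/` namespace, and the function names are all assumptions, and every value is simplified to a plain string literal. The JSON text is given inline rather than fetched from MongoDB so that the sketch runs on its own.

```python
import json
from typing import Iterator

BASE = "http://example.org/"  # hypothetical namespace for illustration


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes, and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def json_to_ntriples(doc_json: str, id_field: str = "_id") -> Iterator[str]:
    """Turn one flat JSON document into N-Triples lines.

    Assumed mapping: the id field names the subject, every other key
    becomes a predicate, and its value becomes a plain string literal.
    """
    doc = json.loads(doc_json)
    subject = f"<{BASE}{doc[id_field]}>"
    for key, value in doc.items():
        if key == id_field:
            continue
        predicate = f"<{BASE}{key}>"
        yield f'{subject} {predicate} "{escape_literal(str(value))}" .'


# A hypothetical product document, as a document store might return it.
sample = '{"_id": "product1", "label": "Sweets", "price": 9.5}'
lines = list(json_to_ntriples(sample))
for line in lines:
    print(line)
```

A fuller converter would also need to handle nested documents, arrays, and typed literals, which this sketch deliberately leaves out.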
data management research. Bizer and Schultz [3]
proposed the Berlin SPARQL Benchmark (BSBM) for
comparing the performance of native RDF stores
(Sesame, Virtuoso, Jena TDB, and Jena SDB),
SPARQL-to-SQL rewriters (D2R Server and Virtuoso
RDF Views), and relational database management
systems (MySQL and Virtuoso RDBMS). The rewriting
approach outperformed native RDF storage as the
dataset size increased. The other important result was
that relational database management systems were
faster than the SPARQL-to-SQL rewriters. The authors
explained that RDF stores might not yet have
optimization techniques as mature as those of SQL
query engines. Our paper also uses the BSBM
benchmark to evaluate RDF storage systems, but we
additionally propose and evaluate the use of a
NoSQL database as an RDF query processing
system.
There has been some work on querying RDF
data from NoSQL databases [4-6]. Cudre-Mauroux
et al. [4] made the first attempt at characterizing and
comparing NoSQL stores and native RDF stores for
RDF processing. They used the Berlin SPARQL
Benchmark and the DBpedia SPARQL Benchmark to
evaluate and compare a native RDF store (4store)
with four NoSQL databases: Jena+HBase,
Hive+HBase, CumulusRDF, and Couchbase.
All experiments were performed on the Amazon Elastic
Compute Cloud (EC2) infrastructure. In the
experimental results, NoSQL systems such as
Jena+HBase processed simple SPARQL queries
more efficiently than native RDF stores such as 4store.
On the other hand, for more complex SPARQL queries
requiring many joins and filters, NoSQL
systems took longer than 4store. Both
this related work and our work compare NoSQL
systems and native RDF systems, but our paper also
evaluates the performance of a relational
database system.
Angles and Gutierrez studied the RDF model
from a database perspective and compared it with
other database models [5]. However, they did not
implement and evaluate a graph database for storing
and querying RDF data, as we do. More recently, Bendar et
al. [6] compared RDF databases,
NoSQL databases, and relational databases for
Semantic Web applications using a benchmark that they
developed themselves. However, they did not
analyze the types of queries for which each database
was suitable.
Sequeda and Miranker [7] chose to execute
SPARQL queries on an RDF representation of legacy
relational data by implementing a system called
Ultrawrap. Ultrawrap encoded a logical representation
of the database as an RDF graph using SQL views
and translated SPARQL queries into SQL queries.
To improve query execution time, detection of
unsatisfiable conditions and self-join elimination could
be applied to the SQL produced by translating
SPARQL queries.
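The general idea of exposing relational data as triples through an SQL view can be sketched with Python's built-in `sqlite3` module. This is an illustration of the concept only, not Ultrawrap's actual implementation; the `product` table, the `triples` view, and the way subjects are named are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, label TEXT, price REAL);
INSERT INTO product VALUES (1, 'Sweets', 9.5), (2, 'Bread', 3.0);

-- One SELECT per column: each row contributes one triple per attribute.
CREATE VIEW triples AS
    SELECT 'product/' || id AS s, 'label' AS p, label AS o FROM product
    UNION ALL
    SELECT 'product/' || id AS s, 'price' AS p, CAST(price AS TEXT) AS o
    FROM product;
""")

# A single triple pattern (?s label ?o) becomes a filter on the view.
rows = conn.execute(
    "SELECT s, o FROM triples WHERE p = 'label' ORDER BY s"
).fetchall()
print(rows)
```

A query with several triple patterns on the same subject turns into self-joins of this view, which is why optimizations such as self-join elimination matter for this approach.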
Alexaki et al. [8] presented the ICS-FORTH
RDFSuite, a suite of tools for RDF validation, storage,
and querying. They proposed the design of a persistent
RDF store (RSSDB) for loading resource descriptions
in an Object Relational Database Management
System (ORDBMS) by using RDF schema knowledge.
They also presented RQL as a declarative language
for querying both RDF descriptions and schemas.
However, they did not compare their proposed system
with other database systems and did not use a
standard benchmark like BSBM.
Several researchers have attempted to
design and develop RDF storage and query engines
using relational DBMSs [9-11]. Harris et al. [9]
proposed 3store as an RDF storage and query engine
and extended it to support a SPARQL query interface
[10]. However, 3store was not evaluated against
other systems [9-10]. Jena1 [11] and
Jena2 [12] are popular Semantic Web programmers’
toolkits that have been downloaded several
thousand times. Jena1 is an open-source project,
implemented in Java and freely available for
download. Its core capability is manipulating RDF
graphs. Jena2 was extended to support multiple,
flexible presentations of RDF graphs and to provide
a simple, minimalist view of the RDF graph to
application programmers.
There are several works about scalable RDF
engines for storing, indexing, and querying [13-16].
The main focus of Jena2 was to improve
performance and scalability by addressing these problems:
too many joins, a single statement table, reification
storage bloat, and query optimization [13]. To address
these issues, the Jena2 schema design supported a
denormalized schema for storing resource URIs and
simple literal values directly in the statement table. In
addition, to improve performance through locality and
caching, Jena2 also supported the use of multiple
statement tables.
Sesame [14] was one of the first architectures
aimed at efficiently storing and querying
large amounts of RDF data. However, some
operations, such as aggregates, were unsupported [15].
Also, implementing a triple store directly in PostgreSQL
was faster than going through Sesame’s interfaces and
SeRQL [15]. Abadi et al. [15] proposed the approach
of vertically partitioning the RDF data. The results
showed that vertical partitioning achieved
performance similar to that of the property table
technique, which was proposed to reduce the number of self-joins.
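Vertical partitioning can be sketched in a few lines: instead of one large three-column triples table, the data is split into one two-column (subject, object) table per predicate. The sketch below uses in-memory dictionaries rather than database tables, and the function name and `ex:` identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def vertically_partition(
    triples: Iterable[Triple],
) -> Dict[str, List[Tuple[str, str]]]:
    """Group triples into one (subject, object) table per predicate."""
    tables: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    # Keep each table sorted by subject so that tables can be merge-joined.
    for rows in tables.values():
        rows.sort()
    return dict(tables)


data = [
    ("ex:woman", "ex:has", "ex:sweets"),
    ("ex:baker", "ex:has", "ex:bread"),
    ("ex:woman", "ex:knows", "ex:baker"),
]
tables = vertically_partition(data)
for predicate, rows in sorted(tables.items()):
    print(predicate, rows)
```

A query that touches only a few predicates then reads only the corresponding tables instead of scanning every triple.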
The RDF-3X (RDF Triple eXpress) [16],
designed and implemented from scratch specifically
for the management and querying of RDF data,
outperformed the previously best alternative [15] by
one or two orders of magnitude.
The contributions of this paper are as
follows: 1) applying MongoDB to store and query
RDF data; and 2) using the standard Berlin SPARQL
Benchmark to compare three kinds of database
systems: a native RDF store, a relational database, and
a NoSQL database. The analysis of the comparison can
be a guideline for choosing an appropriate database
system for different kinds of applications. For example,
relational databases are suitable for applications with
complex queries while NoSQL databases should be
used for applications with simple queries.
