Distributed SPARQL Evaluation
SPARQL is the W3C standard query language for data
expressed in the Resource Description Framework (RDF).
The growing volume of available RDF data creates a strong
need for, and research interest in, efficient and scalable
distributed SPARQL query evaluators. In this context, we
propose strategies for storing RDF datasets in a
distributed manner and methods for querying them
efficiently.
Experimental Studies
This project is strongly backed by practical experiments
and cluster tuning: here, theoretical concepts face the
test of reality. We therefore compare our systems with
competitors from the scientific literature.
-
We systematically benchmarked a panel of open-source,
distributed SPARQL evaluators that are recent or popular
in the state of the art, namely 4store, CliqueSquare,
CumulusRDF, CouchBaseRDF, S2RDF, RYA and PigSPARQL.
We present tutorials and the obtained results
here.
-
We also propose a new reading grid, designed specifically
for a distributed context, that ranks SPARQL evaluators
according to five criteria: velocity, immediacy,
dynamicity, parsimony and resiliency.
Sources
All the SPARQL evaluators developed are openly available
under the terms of the CeCILL license on the team GitHub,
along with other related software. In this particular
project, we share the following evaluators:
-
SPARQLGX
An Efficient
Distributed SPARQL Evaluator Based on Apache
Spark.
-
RDFHive
A Direct Evaluator of
SPARQL on top of Apache Hive.
-
SDE (i.e. SPARQLGX as a Direct
Evaluator)
A Solution to Directly
Evaluate SPARQL using Apache Spark.
Related Publications
-
[CONFERENCE] SPARQLGX :
Une Solution Distribuée pour RDF Traduisant SPARQL
vers Spark [HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil
Layaïda
BDA 2016 - 32ème Conférence sur la
Gestion de Données - Principes, Technologies et
Applications, Nov 2016, Poitiers, France. BDA2016
SPARQL is a query language standardized by the W3C
for querying data expressed in RDF (Resource
Description Framework). As the volume of available
RDF data grows, considerable research effort has
gone into enabling efficient, distributed
evaluation of SPARQL queries. In this context, we
propose and share SPARQLGX: our distributed RDF
storage solution, which uses Apache Spark to
evaluate SPARQL queries and stores data on Hadoop
infrastructure (HDFS). SPARQLGX relies on a
translator from SPARQL queries to a sequence of
instructions executable by Spark, adopting
evaluation strategies according to (1) the data
storage scheme used and (2) statistics on the
data. We show that SPARQLGX can evaluate SPARQL
queries over several billion RDF triples
distributed across several nodes. We also compare
SPARQLGX with other state-of-the-art solutions. We
thus demonstrate the performance obtained by
letting participants reproduce the presented
results themselves through various scenarios that
put several state-of-the-art solutions in direct
competition. We show in this work that, despite
its relatively simple architecture, SPARQLGX is an
interesting alternative in many use cases.
-
[CONFERENCE] SPARQLGX in
Action: Efficient Distributed Evaluation of SPARQL
with Apache Spark [HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil
Layaïda
15th International Semantic Web
Conference (ISWC 2016 demo paper), Oct 2016, Kobe,
Japan.
We demonstrate SPARQLGX: our implementation of a
distributed SPARQL evaluator. We show that
SPARQLGX makes it possible to evaluate SPARQL
queries on billions of triples distributed across
multiple nodes, while providing attractive
performance figures.
-
[CONFERENCE] SPARQLGX:
Efficient Distributed Evaluation of SPARQL with Apache
Spark [HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil
Layaïda
The 15th International Semantic Web
Conference, Oct 2016, Kobe, Japan.
DOI: 10.1007/978-3-319-46547-0_9
SPARQL is the W3C standard query language for
querying data expressed in the Resource
Description Framework (RDF). The increasing
amounts of RDF data available raise a major need
and research interest in building efficient and
scalable distributed SPARQL query evaluators. In
this context, we propose SPARQLGX: our
implementation of a distributed RDF datastore
based on Apache Spark. SPARQLGX is designed to
leverage existing Hadoop infrastructures for
evaluating SPARQL queries. SPARQLGX relies on a
translation of SPARQL queries into executable
Spark code that adopts evaluation strategies
according to (1) the storage method used and (2)
statistics on data. We show that SPARQLGX makes
it possible to evaluate SPARQL queries on billions
of triples distributed across multiple nodes,
while providing attractive performance figures. We
report on experiments which show how SPARQLGX
compares to related state-of-the-art
implementations and we show that our approach
scales better than these systems in terms of
supported dataset size. With its simple design,
SPARQLGX represents an interesting alternative in
several scenarios.
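
To make the idea of choosing an evaluation strategy from (1) the storage method and (2) statistics on data more concrete, here is a minimal sketch in plain Python. It is not SPARQLGX code and does not use Spark. It assumes, purely for illustration, that triples are stored in one partition per predicate and that the only statistic is the size of each partition; the abstract above specifies neither. All function names are hypothetical.

```python
from collections import defaultdict


def partition_by_predicate(triples):
    """Assumed storage layout: one list of (subject, object) pairs per predicate."""
    store = defaultdict(list)
    for s, p, o in triples:
        store[p].append((s, o))
    return store


def is_var(term):
    """Variables are written SPARQL-style, with a leading '?'."""
    return term.startswith("?")


def match_pattern(store, pattern):
    """Return one binding dict per stored pair matching a triple pattern.

    Simplification: the predicate of the pattern must be a constant.
    """
    s, p, o = pattern
    results = []
    for subj, obj in store.get(p, []):
        binding, ok = {}, True
        for term, value in ((s, subj), (o, obj)):
            if is_var(term):
                # A variable used twice in a pattern must bind to one value.
                if binding.get(term, value) != value:
                    ok = False
                binding[term] = value
            elif term != value:
                ok = False
        if ok:
            results.append(binding)
    return results


def join(left, right):
    """Nested-loop join of two binding lists on their shared variables."""
    out = []
    for l in left:
        for r in right:
            if all(l[k] == r[k] for k in l.keys() & r.keys()):
                out.append({**l, **r})
    return out


def evaluate_bgp(store, patterns):
    """Evaluate a conjunction of triple patterns.

    Assumed statistic: patterns over the smallest predicate partitions are
    evaluated first, to keep intermediate results small.
    """
    ordered = sorted(patterns, key=lambda tp: len(store.get(tp[1], [])))
    solutions = [{}]
    for tp in ordered:
        solutions = join(solutions, match_pattern(store, tp))
    return solutions


if __name__ == "__main__":
    triples = [
        ("alice", "knows", "bob"),
        ("bob", "knows", "carol"),
        ("alice", "livesIn", "Grenoble"),
        ("bob", "livesIn", "Poitiers"),
        ("carol", "livesIn", "Kobe"),
    ]
    store = partition_by_predicate(triples)
    # Roughly: SELECT * WHERE { ?x knows ?y . ?y livesIn ?city }
    for row in evaluate_bgp(store, [("?x", "knows", "?y"), ("?y", "livesIn", "?city")]):
        print(row)
```

In this sketch, each triple pattern reads only the partition of its predicate, and the join order is driven by partition sizes. A distributed engine would replace the in-memory lists and nested-loop join with distributed collections and joins.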