Damien Graux
Since January 2016: Postdoctoral Researcher in Tyrex Team (Inria, France).
Prior Position: PhD student from November 2013 to December 2016 in Tyrex Team (Inria, France).
Advisors: Nabil Layaïda and Pierre Genevès. Funded by: the Datalyse project.
Defended on December 15, 2016. [Dissertation & Slides]. ☺
Research activities
Currently, I'm pushing further what I developed during
my PhD thesis by integrating SPARQL evaluators into
larger systems where various kinds of data structures
are involved: several query results are needed (and
aggregated) to build a complex answer. More
specifically, I'm trying to design efficient languages
to facilitate the development of optimized ETL pipelines
in a semantic context.
During my PhD thesis, I focused on Semantic Web standards,
especially on the Resource Description Framework (RDF)
and its dedicated query language, SPARQL. My main goal
was to design efficient tools to evaluate SPARQL queries
on very large RDF datasets (i.e. ≥100GB). To this end,
I provided a new reading grid to rank SPARQL evaluators
before designing several efficient ones. More
particularly, I worked on:
- Distributed systems, e.g. Apache Hadoop, Apache Spark...
- RDF storage methods
- SPARQL evaluation strategies in a distributed context
- RDF/SPARQL benchmarks
As a side project during my PhD research, I also
designed a semantic trip-planning pipeline that
aggregates heterogeneous datasets (e.g. GTFS, RDF, CSV)
to offer users touristic alternatives at flight
stopovers.
Previously (before 2013), I worked on designing and
implementing broadcast algorithms with special
properties such as UTO (uniform and totally
ordered). This work, mainly developed in C, is
openly available on GitHub [here].
Publications
(List generated from hal.inria.fr.)
- [PRE-PRINT] SPARUB: SPARQL UPDATE Benchmark
[HAL, PDF, Abstract]
Damien Graux, Pierre Genevès, Nabil Layaïda
2017
One aim of the RDF data model, as standardized by the W3C, is to facilitate the evolution of data over time without requiring all the data consumers to be changed. To this end, one of the latest additions to the SPARQL standard query language is an update language for RDF graphs. The research on efficient and scalable SPARQL evaluation methods increasingly relies on standardized methodologies for benchmarking and comparing systems. However, current RDF benchmarks do not support graph updates. We propose and share SPARUB: a benchmark for the SPARQL UPDATE language on RDF graphs. The aim of SPARUB is not to be yet another RDF benchmark. Instead, it provides the means to automatically extend and improve existing RDF benchmarks along a new dimension of data updates, while preserving their structure and query scenarios.
- [PRE-PRINT] HAP: Building Pipelines with Heterogeneous Data and Hive
[HAL, PDF, Abstract]
Damien Graux, Pierre Genevès, Nabil Layaïda
2017
The increasing number of available datasets gives opportunities to build large and complex applications which aggregate results coming from several sources. These emerging use cases require new systems where combinations of heterogeneous sources are both allowed and efficient. To tackle these challenges, we provide a simple high-level set of primitives – called HAP – to easily describe processing chains. These descriptions are then compiled into optimized SQL queries executed by Hive.
- [PHD] On the Efficient Distributed Evaluation of SPARQL Queries
[HAL, Abstract]
Damien Graux
Web. Université Grenoble Alpes, 2016. English
The Semantic Web standardized by the World Wide Web Consortium aims at providing a common framework that allows data to be shared and analyzed across applications. The Resource Description Framework (RDF) and the query language SPARQL constitute two major components of this vision. Because of the increasing amounts of RDF data available, dataset distribution across clusters is poised to become a standard storage method. As a consequence, efficient and distributed SPARQL evaluators are needed. To tackle these needs, we first benchmark several state-of-the-art distributed SPARQL evaluators while monitoring a set of metrics which is appropriate in a distributed context (e.g. network traffic). Then, an analysis driven by typical use cases leads us to define new development perspectives in the field of distributed SPARQL evaluation. On the basis of these perspectives, we design several efficient distributed SPARQL evaluators whose performance is validated and compared to state-of-the-art evaluators. For instance, our distributed SPARQL evaluator named SPARQLGX offers efficient time performance while being resilient to the loss of nodes.
- [CONFERENCE] SPARQLGX : Une Solution Distribuée pour RDF Traduisant SPARQL vers Spark
[HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda
BDA 2016 - 32ème Conférence sur la Gestion de Données - Principes, Technologies et Applications, Nov 2016, Poitiers, France. BDA2016
SPARQL is a query language standardized by the W3C for querying data expressed in RDF (Resource Description Framework). With the growing volumes of available RDF data, many research efforts have been devoted to the efficient distributed evaluation of SPARQL queries. In this context, we propose and share SPARQLGX: our distributed RDF storage solution, which uses Apache Spark to evaluate SPARQL queries and stores data on Hadoop infrastructures (HDFS). SPARQLGX relies on a translator from SPARQL queries to a sequence of instructions executable by Spark, adopting evaluation strategies according to (1) the data storage scheme used and (2) statistics on the data. We show that SPARQLGX makes it possible to evaluate SPARQL queries on several billion RDF triples distributed across multiple nodes. We also compare SPARQLGX with other state-of-the-art solutions. We thus demonstrate the performance obtained by letting participants reproduce the presented results themselves, through various scenarios that put several state-of-the-art solutions in direct competition. We show in this work that, while having a relatively simple architecture, SPARQLGX is an interesting alternative in many use cases.
- [CONFERENCE] SPARQLGX in Action: Efficient Distributed Evaluation of SPARQL with Apache Spark
[HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda
15th International Semantic Web Conference (ISWC 2016 demo paper), Oct 2016, Kobe, Japan. 15th International Semantic Web Conference
We demonstrate SPARQLGX: our implementation of a distributed SPARQL evaluator. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures.
- [CONFERENCE] Smart Trip Alternatives for the Curious
[HAL, Abstract]
Damien Graux, Pierre Genevès, Nabil Layaïda
15th International Semantic Web Conference (ISWC 2016 demo paper), Oct 2016, Kobe, Japan. 15th International Semantic Web Conference
When searching for flights, current systems often suggest routes involving waiting times at stopovers. There might exist alternative routes which are more attractive from a touristic perspective because their duration is not necessarily much longer while offering enough time in an appropriate place. Choosing among such alternatives requires additional planning efforts to make sure that e.g. points of interest can conveniently be reached in the allowed time frame. We present a system that automatically computes smart trip alternatives between any two cities. To do so, it searches points of interest in large semantic datasets considering the set of accessible areas around each possible layover. It then elects feasible alternatives and displays their differences with respect to the default trip.
- [CONFERENCE] SPARQLGX: Efficient Distributed Evaluation of SPARQL with Apache Spark
[HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda
The 15th International Semantic Web Conference, Oct 2016, Kobe, Japan. The 15th International Semantic Web Conference, <10.1007/978-3-319-46547-0_9>
SPARQL is the W3C standard query language for querying data expressed in the Resource Description Framework (RDF). The increasing amounts of RDF data available raise a major need and research interest in building efficient and scalable distributed SPARQL query evaluators. In this context, we propose SPARQLGX: our implementation of a distributed RDF datastore based on Apache Spark. SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries. SPARQLGX relies on a translation of SPARQL queries into executable Spark code that adopts evaluation strategies according to (1) the storage method used and (2) statistics on data. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures. We report on experiments which show how SPARQLGX compares to related state-of-the-art implementations and we show that our approach scales better than these systems in terms of supported dataset size. With its simple design, SPARQLGX represents an interesting alternative in several scenarios.
- [PRE-PRINT] A Multi-Criteria Experimental Ranking of Distributed SPARQL Evaluators
[HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda
Submitted. 2016
SPARQL is the standard language for querying RDF data. There exists a variety of SPARQL query evaluation systems implementing different architectures for the distribution of data and computations. Differences in architectures coupled with specific optimizations, e.g. for preprocessing and indexing, make these systems incomparable from a purely theoretical perspective. This results in many implementations solving the SPARQL query evaluation problem while exhibiting very different behaviors, not all of them being suited to every context. We provide a new perspective on distributed SPARQL evaluators, based on multi-criteria experimental rankings. Our suggested set of 5 features (namely velocity, immediacy, dynamicity, parsimony, and resiliency) provides a more comprehensive description of the behaviors of distributed evaluators when compared to traditional runtime performance metrics. We show how these features help in more accurately evaluating to which extent a given system is appropriate for a given use case. For this purpose, we systematically benchmarked a panel of 10 state-of-the-art implementations. We ranked them using a reading grid that helps in pinpointing the advantages and limitations of current technologies for the distributed evaluation of SPARQL queries.
- [CONFERENCE] TRAINS : a Throughput-Efficient Uniform Total Order Broadcast Algorithm
[HAL, Abstract]
Michel Simatic, Arthur Foltz, Damien Graux, Nicolas Hascoet, Stéphanie Ouillon, Nathan Reboud, Tiezhen Wang
NTDS - ICPE 2015 : International Conference on Protocol Engineering (ICPE) and International Conference on New Technologies of Distributed Systems (NTDS), Jul 2015, Paris, France. IEEE, Proceedings NTDS - ICPE 2015 : International Conference on Protocol Engineering (ICPE) and International Conference on New Technologies of Distributed Systems (NTDS), pp.1 - 8, 2015, <10.1109/NOTERE.2015.7293477>
Within data centers, many applications rely on a uniform total order broadcast algorithm to achieve load-balancing or fault-tolerance. In this context, achieving high throughput for uniform total order broadcast algorithms is an important issue: it contributes to optimizing data center resource usage and to reducing its energy consumption. This paper presents TRAINS, a throughput-efficient uniform total order broadcast algorithm. The paper estimates TRAINS performance. It evaluates the prediction-oriented throughput efficiency (POTE), i.e. the theoretical ratio between bytes delivered and bytes transmitted on the network. TRAINS POTE improves the POTE of the best algorithm of the literature. For 5 processes, the POTE improvement reaches a peak of 250% for 10-byte messages. Experimental evaluation confirms TRAINS high throughput capabilities. The trade-off of this throughput improvement is the alteration of the latency. The worst alteration is in the case of 2 processes: 125%.
Software
I also contributed to:
- SPARQLGX: an efficient distributed evaluator of SPARQL
  queries based on Apache Spark.
- RDFHive: a direct evaluator of SPARQL queries on top of
  Apache Hive.
- SDE: a solution to directly evaluate SPARQL queries
  using Apache Spark.
- GTFS-Store: the first distributed store dedicated to
  GTFS datasets, which uses Apache Spark to find paths.
Teaching
Between September 2014 and June 2016, i.e. over two
school years, I was a math oral examiner
(interrogateur oral, aka khôlleur) at Lycée
Champollion (Grenoble, France). I taught more than
70 hours to MP and PC students, i.e. roughly the
equivalent, in the American education system, of
second-year bachelor's students in mathematics and
in physics, respectively.
An excerpt of the exercises I gave (in
French): [kholles.pdf]
Misc.
In parallel, I also try to find time to:
- Develop and update electronics projects that
  have been pending for several years, such as:
  - a basic 8-bit game console created from
    scratch which only uses an ATmega168. [here]
  - a slide presenter costing less than $2, which
    outputs slideshows on a simple video wire (like
    the yellow ones behind TVs) while using only one
    microcontroller: in this case an ATmega168. [here]
- Finish (at last) the coinche book I'm writing.
  [The current (and unfinished) version of the book
  in French.]
- Smile and go out with friends...