Damien Graux

Since January 2016:
Postdoctoral Researcher in the Tyrex Team (Inria, France).

Prior Position:
PhD student from November 2013 to December 2016 in the Tyrex Team (Inria, France).
Advisors: Nabil Layaïda and Pierre Genevès. Funded by: the Datalyse project.
Defended on December 15, 2016. [Dissertation & Slides]. ☺

Research activities

Currently, I'm extending the work of my PhD thesis by integrating SPARQL evaluators into larger systems that involve various kinds of data structures, where several query results must be obtained and aggregated to build a complex answer. More specifically, I'm trying to design efficient languages that facilitate the development of optimized ETL pipelines in a semantic context.

During my PhD thesis, I focused on Semantic Web standards, especially the Resource Description Framework (RDF) and its dedicated query language, SPARQL. My main goal was to design efficient tools to evaluate SPARQL queries on very large RDF datasets (i.e. ≥ 100 GB). To that end, I first proposed a new reading grid to rank SPARQL evaluators, and then designed several efficient ones. In particular, I worked on:

Distributed systems e.g. Apache Hadoop, Apache Spark...
RDF storage methods
SPARQL evaluation strategies in a distributed context
RDF/SPARQL Benchmarks

As a pastime alongside my main PhD research, I also designed a semantic pipeline for trip planning that aggregates heterogeneous datasets (e.g. GTFS, RDF, CSV) in order to offer users touristic alternatives at plane stopovers.

Previously (before 2013), I worked on designing and implementing broadcast algorithms with special properties such as UTO (uniform and totally ordered). This work, mainly developed in C, is openly available on GitHub [here].

Publications

(List generated from hal.inria.fr.)

[PRE-PRINT] SPARUB: SPARQL UPDATE Benchmark [HAL, PDF, Abstract]
Damien Graux, Pierre Genevès, Nabil Layaïda
2017

One aim of the RDF data model, as standardized by the W3C, is to facilitate the evolution of data over time without requiring all the data consumers to be changed. To this end, one of the latest additions to the SPARQL standard query language is an update language for RDF graphs. The research on efficient and scalable SPARQL evaluation methods increasingly relies on standardized methodologies for benchmarking and comparing systems. However, current RDF benchmarks do not support graph updates. We propose and share SPARUB: a benchmark for the SPARQL UPDATE language on RDF graphs. The aim of SPARUB is not to be yet another RDF benchmark. Instead, it provides the means to automatically extend and improve existing RDF benchmarks along a new dimension of data updates, while preserving their structure and query scenarios.
[PRE-PRINT] HAP: Building Pipelines with Heterogeneous Data and Hive [HAL, PDF, Abstract]
Damien Graux, Pierre Genevès, Nabil Layaïda
2017

The increasing number of available datasets gives opportunities to build large and complex applications which aggregate results coming from several sources. These emerging use cases require new systems where combinations of heterogeneous sources are both allowed and efficient. To tackle these challenges, we provide a simple high-level set of primitives, called HAP, to easily describe processing chains. These descriptions are then compiled into optimized SQL queries executed by Hive.
[PHD] On the Efficient Distributed Evaluation of SPARQL Queries [HAL, Abstract]
Damien Graux
Web. Université Grenoble Alpes, 2016. English

The Semantic Web standardized by the World Wide Web Consortium aims at providing a common framework that allows data to be shared and analyzed across applications. The Resource Description Framework (RDF) and the query language SPARQL constitute two major components of this vision. Because of the increasing amounts of RDF data available, dataset distribution across clusters is poised to become a standard storage method. As a consequence, efficient and distributed SPARQL evaluators are needed. To tackle these needs, we first benchmark several state-of-the-art distributed SPARQL evaluators while monitoring a set of metrics which is appropriate in a distributed context (e.g. network traffic). Then, an analysis driven by typical use cases leads us to define new development perspectives in the field of distributed SPARQL evaluation. On the basis of these perspectives, we design several efficient distributed SPARQL evaluators whose performances are validated and compared to state-of-the-art evaluators. For instance, our distributed SPARQL evaluator named SPARQLGX offers efficient time performances while being resilient to the loss of nodes.
[CONFERENCE] SPARQLGX : Une Solution Distribuée pour RDF Traduisant SPARQL vers Spark [HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda
BDA 2016 - 32ème Conférence sur la Gestion de Données - Principes, Technologies et Applications, Nov 2016, Poitiers, France. BDA2016

SPARQL is a query language standardized by the W3C for querying data expressed in RDF (Resource Description Framework). With the growing volumes of available RDF data, many research efforts have been devoted to the efficient distributed evaluation of SPARQL queries. In this context, we propose and share SPARQLGX: our distributed RDF storage solution, which uses Apache Spark to evaluate SPARQL queries and stores data through Hadoop infrastructures (HDFS). SPARQLGX relies on a translator from SPARQL queries to a sequence of instructions executable by Spark, adopting evaluation strategies according to (1) the data storage scheme used and (2) statistics on the data. We show that SPARQLGX makes it possible to evaluate SPARQL queries on several billion RDF triples distributed across several nodes. We also compare SPARQLGX to other state-of-the-art solutions. We thus demonstrate the performance obtained by letting participants reproduce the presented results themselves, through different scenarios that put several state-of-the-art solutions in direct competition. We show in this work that, while having a relatively simple architecture, SPARQLGX represents an interesting alternative in many use cases.
[CONFERENCE] SPARQLGX in Action: Efficient Distributed Evaluation of SPARQL with Apache Spark [HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda
15th International Semantic Web Conference (ISWC 2016 demo paper), Oct 2016, Kobe, Japan. 15th International Semantic Web Conference

We demonstrate SPARQLGX: our implementation of a distributed SPARQL evaluator. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures.
[CONFERENCE] Smart Trip Alternatives for the Curious [HAL, Abstract]
Damien Graux, Pierre Genevès, Nabil Layaïda
15th International Semantic Web Conference (ISWC 2016 demo paper), Oct 2016, Kobe, Japan. 15th International Semantic Web Conference

When searching for flights, current systems often suggest routes involving waiting times at stopovers. There might exist alternative routes which are more attractive from a touristic perspective because their duration is not necessarily much longer while offering enough time in an appropriate place. Choosing among such alternatives requires additional planning efforts to make sure that e.g. points of interest can conveniently be reached in the allowed time frame. We present a system that automatically computes smart trip alternatives between any two cities. To do so, it searches points of interest in large semantic datasets considering the set of accessible areas around each possible layover. It then elects feasible alternatives and displays their differences with respect to the default trip.
[CONFERENCE] SPARQLGX: Efficient Distributed Evaluation of SPARQL with Apache Spark [HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda
The 15th International Semantic Web Conference, Oct 2016, Kobe, Japan. The 15th International Semantic Web Conference, <10.1007/978-3-319-46547-0_9>

SPARQL is the W3C standard query language for querying data expressed in the Resource Description Framework (RDF). The increasing amounts of RDF data available raise a major need and research interest in building efficient and scalable distributed SPARQL query evaluators. In this context, we propose SPARQLGX: our implementation of a distributed RDF datastore based on Apache Spark. SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries. SPARQLGX relies on a translation of SPARQL queries into executable Spark code that adopts evaluation strategies according to (1) the storage method used and (2) statistics on data. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures. We report on experiments which show how SPARQLGX compares to related state-of-the-art implementations and we show that our approach scales better than these systems in terms of supported dataset size. With its simple design, SPARQLGX represents an interesting alternative in several scenarios.
[PRE-PRINT] A Multi-Criteria Experimental Ranking of Distributed SPARQL Evaluators [HAL, PDF, Abstract]
Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda
Submitted. 2016

SPARQL is the standard language for querying RDF data. There exists a variety of SPARQL query evaluation systems implementing different architectures for the distribution of data and computations. Differences in architectures coupled with specific optimizations, e.g. for preprocessing and indexing, make these systems incomparable from a purely theoretical perspective. This results in many implementations solving the SPARQL query evaluation problem while exhibiting very different behaviors, not all of them being adapted in any context. We provide a new perspective on distributed SPARQL evaluators, based on multi-criteria experimental rankings. Our suggested set of 5 features (namely velocity, immediacy, dynamicity, parsimony, and resiliency) provides a more comprehensive description of the behaviors of distributed evaluators when compared to traditional runtime performance metrics. We show how these features help in more accurately evaluating to which extent a given system is appropriate for a given use case. For this purpose, we systematically benchmarked a panel of 10 state-of-the-art implementations. We ranked them using a reading grid that helps in pinpointing the advantages and limitations of current technologies for the distributed evaluation of SPARQL queries.
[CONFERENCE] TRAINS : a Throughput-Efficient Uniform Total Order Broadcast Algorithm [HAL, Abstract]
Michel Simatic, Arthur Foltz, Damien Graux, Nicolas Hascoet, Stéphanie Ouillon, Nathan Reboud, Tiezhen Wang
NTDS - ICPE 2015 : International Conference on Protocol Engineering (ICPE) and International Conference on New Technologies of Distributed Systems (NTDS), Jul 2015, Paris, France. IEEE, Proceedings NTDS - ICPE 2015 : International Conference on Protocol Engineering (ICPE) and International Conference on New Technologies of Distributed Systems (NTDS), pp.1 - 8, 2015, <10.1109/NOTERE.2015.7293477>

Within data centers, many applications rely on a uniform total order broadcast algorithm to achieve load-balancing or fault-tolerance. In this context, achieving high throughput for uniform total order broadcast algorithms is an important issue: It contributes to optimize data center resources usage and to reduce its energy consumption. This paper presents TRAINS, a throughput-efficient uniform total order broadcast algorithm. The paper estimates TRAINS performance. It evaluates the prediction-oriented throughput efficiency (POTE), i.e. the theoretical ratio between bytes delivered and bytes transmitted on the network. TRAINS POTE improves the POTE of the best algorithm of the literature. For 5 processes, the POTE improvement reaches a peak of 250% for 10 bytes messages. Experimental evaluation confirms TRAINS high throughput capabilities. The trade-off of this throughput improvement is the alteration of the latency. The worst alteration is in the case of 2 processes: 125%.

Software

I also contributed to:

SPARQLGX: an efficient distributed evaluator of SPARQL queries based on Apache Spark.
RDFHive: a direct evaluator of SPARQL queries on top of Apache Hive.
SDE: a solution to directly evaluate SPARQL queries using Apache Spark.
GTFS-Store: the first distributed store dedicated to GTFS datasets which uses Apache Spark to find paths.
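
To give a rough idea of what evaluating a SPARQL query over RDF triples involves, here is a toy in-memory sketch in Python. It is an illustration only and is not how SPARQLGX, RDFHive or SDE are implemented: those tools rely on Apache Spark or Apache Hive, whereas this sketch uses plain lists and a nested-loop join. All names (match_pattern, join, evaluate_bgp, DATA, QUERY) and the sample triples are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def is_var(term: str) -> bool:
    """SPARQL variables start with '?'."""
    return term.startswith("?")


def match_pattern(pattern: Triple, triples: Iterable[Triple]) -> List[Binding]:
    """Return one binding per triple that matches a single triple pattern."""
    results = []
    for triple in triples:
        binding: Binding = {}
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                # A variable repeated inside one pattern must bind consistently.
                if binding.setdefault(p_term, t_term) != t_term:
                    break
            elif p_term != t_term:
                break
        else:
            results.append(binding)
    return results


def join(left: List[Binding], right: List[Binding]) -> List[Binding]:
    """Nested-loop join: keep pairs that agree on every shared variable."""
    joined = []
    for lhs in left:
        for rhs in right:
            if all(lhs[v] == rhs[v] for v in lhs.keys() & rhs.keys()):
                joined.append({**lhs, **rhs})
    return joined


def evaluate_bgp(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Evaluate a conjunction of triple patterns (a basic graph pattern)."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        solutions = join(solutions, match_pattern(pattern, triples))
    return solutions


# Hypothetical sample data (subject, predicate, object).
DATA = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "livesIn", "Grenoble"),
    ("carol", "livesIn", "Kobe"),
]

# Roughly: SELECT ?x ?city WHERE { alice knows ?y . ?y knows ?x . ?x livesIn ?city }
QUERY = [
    ("alice", "knows", "?y"),
    ("?y", "knows", "?x"),
    ("?x", "livesIn", "?city"),
]

if __name__ == "__main__":
    for row in evaluate_bgp(QUERY, DATA):
        print(row["?x"], row["?city"])
```

Each triple pattern is matched independently and the partial results are joined on their shared variables. A distributed evaluator has to perform the same kind of joins, with the data spread across several nodes.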

Teaching

Between September 2014 and June 2016, i.e. over two school years, I was an oral examiner in mathematics (interrogateur oral, a.k.a. khôlleur) at Lycée Champollion (Grenoble, France). I taught more than 70 hours to MP and PC students, who roughly correspond, in the American education system, to second-year undergraduates majoring in mathematics and in physics respectively.

An excerpt of the exercises I gave (in French). [kholles.pdf]

Misc.

In parallel, I also try to find time to:

Develop and update electronics projects that have been pending for several years, such as:
- a basic 8-bit game console built from scratch using only an ATmega168. [here]
- a slide presenter costing less than $2, which outputs slideshows over a simple video cable (like the yellow ones behind TVs) using a single microcontroller, in this case an ATmega168. [here]
Finish (at last) the coinche book I'm writing. [The current (and unfinished) version of the book in French.]
Smile and go out with friends...

Contact

Mail:
Phone: +33 4 76 61 52 45
Address: Desk B206 - 655 avenue de l'Europe - Montbonnot, 38334 Saint Ismier Cedex
Inria Grenoble - Rhône-Alpes, Inovallée (45°13'4.43"N, 5°48'26.77"E)
