Tyrex | Project

Datalyse RDF Pipeline

Datalyse RDF Pipeline is a Hadoop and Spark-based application written in Java for extracting, cleaning and publishing RDF triples from raw data stored in Excel/CSV files. The application is a workflow of modules (tasks) coordinated using Apache Hue/Oozie. The tasks are:

(1) CSV/XLS importers, which convert CSV/XLS data into RDF triples. The conversion is distributed so that it can handle very large files. It produces raw triples that the other tasks can use as input.
(2) RDF importers, which load data into triple stores.
(3) An OSM enrichment task, which retrieves geographical data from OpenStreetMap to enrich the source triples dataset.
(4) A statistical normalization task.
(5) An alignment task, which aligns the data with a target ontology.
(6) A Silk task, which uses the XML-based Silk Link Specification Language (Silk-LSL) to establish equivalence relations and generate links between two datasets.
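To make the CSV import step in task (1) more concrete, here is a minimal single-machine sketch of how one CSV row might be turned into raw triples in N-Triples syntax. It is not the pipeline's actual code: the class name `CsvToTriples`, the `http://example.org/` namespace, the use of the first column as the subject identifier, and the naive comma split (no quoted fields) are all assumptions made for illustration. In a distributed setting the same per-row function could plausibly be applied to each line of the input in a Spark or Hadoop map step.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CsvToTriples {
    // Hypothetical namespace; the real pipeline's IRI scheme is not documented here.
    static final String BASE = "http://example.org/";

    // Escape the characters that N-Triples requires to be escaped in string literals.
    static String escapeLiteral(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r")
                    .replace("\t", "\\t");
    }

    // Build an IRI such as <http://example.org/resource/42>, percent-encoding the local part.
    // URLEncoder.encode(String, Charset) requires Java 10 or later.
    static String iri(String kind, String local) {
        return "<" + BASE + kind + "/" + URLEncoder.encode(local, StandardCharsets.UTF_8) + ">";
    }

    // Assumption: column 0 identifies the row; every other non-empty cell
    // becomes one triple with a plain string literal as its object.
    static List<String> rowToTriples(String[] header, String[] row) {
        List<String> triples = new ArrayList<>();
        String subject = iri("resource", row[0]);
        for (int i = 1; i < header.length && i < row.length; i++) {
            if (row[i].isEmpty()) {
                continue; // skip missing values instead of emitting empty literals
            }
            triples.add(subject + " " + iri("property", header[i])
                    + " \"" + escapeLiteral(row[i]) + "\" .");
        }
        return triples;
    }

    public static void main(String[] args) {
        // Naive split: does not handle quoted fields containing commas.
        String[] header = "id,name,city".split(",", -1);
        String[] row = "42,Alice,Grenoble".split(",", -1);
        rowToTriples(header, row).forEach(System.out::println);
    }
}
```

Running `main` prints one triple per non-empty cell, for example `<http://example.org/resource/42> <http://example.org/property/name> "Alice" .`. Because each row is converted independently, this kind of function parallelizes naturally over the lines of a large file.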