Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Isuru Suriarachchi and Beth Plale School of Informatics and Computing Indiana University
IEEE E-science 2016 : Hot Topics
The Data Lake has arisen within the last couple of ...
Credit: https://www.linkedin.com/topic/data-warehouse-architecture
[Figure: Data Lake overview]
Sources: Clickstream, Sensor Data, IoT Devices, Social Media, Cloud Platforms, Server Logs
Entry point: Ingest API
Inside the Data Lake: Data passed through a chain of Transform steps, with Metadata and Lineage
Analysis
Big Data Processing Frameworks Ex: Hadoop, Spark, Storm
[Figure: Chain of transformations sharing IDs]
Transformations T1, T2, T3 over data products d1 to d8.
Backward provenance from the central provenance store (nodes shown: d1, d3, d4, d6, d7, d8).
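A backward provenance query over such a chain can be sketched as a graph traversal. The snippet below is a minimal illustration, not the authors' implementation: the `LINEAGE` records and the `backward_provenance` function are hypothetical, and the edges between T1, T2, T3 and d1 to d8 are assumed purely so the example runs, since the slide text does not state which transformation used or generated which data product.

```python
from collections import deque

# Hypothetical lineage records keyed by transformation ID. The "used" and
# "generated" edges are assumptions for illustration only.
LINEAGE = {
    "T1": {"used": ["d1"], "generated": ["d2", "d3", "d4"]},
    "T2": {"used": ["d3", "d4"], "generated": ["d5", "d6", "d7"]},
    "T3": {"used": ["d6", "d7"], "generated": ["d8"]},
}


def backward_provenance(lineage, target):
    """Return the target plus every data product it was derived from."""
    # Index each data product by the transformation that generated it.
    producer = {}
    for transformation, record in lineage.items():
        for product in record["generated"]:
            producer[product] = transformation

    seen = {target}
    queue = deque([target])
    while queue:
        product = queue.popleft()
        transformation = producer.get(product)
        if transformation is None:
            continue  # raw input: nothing upstream is recorded
        for source in lineage[transformation]["used"]:
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return sorted(seen)


print(backward_provenance(LINEAGE, "d8"))
```

The traversal only works because every transformation reports the same ID for a given data product; without shared IDs, the records from separate systems could not be joined into one graph.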
[Figure: Data Lake architecture with integrated provenance]
Data Lake: Raw Data from various sources, Ingest API, Data Import, Data Export
Transformations: Batch Processing (Ex: Hadoop, Spark), Stream Processing (Ex: Storm, Spark), Workflow Engines (Ex: Kepler), Legacy Scripts
Lineage path: Prov Stream, Messaging System
Provenance Subsystem: Ingest API, Query API, Prov Stream Processing, Prov Storage
Uses: Monitoring, Debugging, Reproducing, Data Quality, Queries, Visualization
[Figure: Client Library]
Layers: Application Layer API, Komadu Client Layer, RabbitMQ Client Layer
Example call: client.addGeneration(A, E)
Internals: batching, prov thread pool
Destinations: RabbitMQ Server, Komadu
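The batching and thread pool labels suggest a client that buffers provenance events and ships them asynchronously, so the application is not blocked on each call. The following is a rough sketch of that pattern using only the Python standard library. It is not the Komadu client API: `BatchingProvClient`, `add_generation`, the event fields, and the `send` callback (a stand-in for publishing to a message broker) are all hypothetical names introduced for illustration.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor


class BatchingProvClient:
    """Hypothetical client: buffers events and sends them in batches."""

    def __init__(self, send, batch_size=3, workers=2):
        self._send = send              # assumed callback, e.g. a broker publish
        self._batch_size = batch_size
        self._buffer = []
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def add_generation(self, activity, entity):
        """Record that `activity` generated `entity`; returns without waiting."""
        event = {"type": "generation", "activity": activity, "entity": entity}
        with self._lock:
            self._buffer.append(event)
            if len(self._buffer) < self._batch_size:
                return
            # Batch is full: swap in a fresh buffer while holding the lock.
            batch, self._buffer = self._buffer, []
        # Serialize and send on a worker thread, outside the lock.
        self._pool.submit(self._send, json.dumps(batch))

    def close(self):
        """Flush any partial batch and wait for pending sends to finish."""
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._pool.submit(self._send, json.dumps(batch))
        self._pool.shutdown(wait=True)


sent = []  # stands in for the message queue
client = BatchingProvClient(sent.append, batch_size=2)
for entity in ("E1", "E2", "E3"):
    client.add_generation("A", entity)
client.close()
print(len(sent))  # number of batches sent
```

Batching trades a small delay in provenance delivery for fewer messages, which matters when transformations emit an event per data item.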