Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes - PowerPoint PPT Presentation



SLIDE 1

Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes

Isuru Suriarachchi and Beth Plale School of Informatics and Computing Indiana University

IEEE E-science 2016 : Hot Topics

SLIDE 2

The Data Lake has arisen within the last couple of years as the conceptualization of a data management framework flexible enough to support the multiple data processing tools needed for truly Big Data analytics.

SLIDE 3

Data Warehouse

  • Supports multidimensional analytical processing
    – Online Analytical Processing (OLAP) or Multidimensional OLAP
  • Numeric facts (measures) categorized by dimensions, creating a vector space (the OLAP cube)
  • Interface is a matrix interface, e.g., pivot tables
  • Schema is a star schema or a snowflake schema
  • Storage is largely a relational database
SLIDE 4

Credit: https://www.linkedin.com/topic/data-warehouse-architecture

Data Warehouse Architecture

  • ETL: Extraction, Transformation, Load
SLIDE 5

Challenging the Warehouse: Big Data

  • From numerous sources
    – Social media, sensor data, IoT devices, server logs, clickstreams, etc.
  • Not all numeric (quantitative), thus differently structured
    – Structured, semi-structured, unstructured
  • Continuously generated or archived
SLIDE 6

Suitability of Data Warehouse for Today’s Big Data

  • ETL imposes a burden
    – Schema on write
    – Inflexibility/inefficiency at ingest time
    – Information loss upon schema translation
  • Weak fit for popular Big Data analytical tools (e.g., Spark, Hadoop) and data serving platforms (e.g., HDFS, S3)

SLIDE 7

Data Lake

  • A scalable storage infrastructure with no schema enforcement at ingest
  • Data ingested in raw form: no loss
  • Schema-on-read (see the sketch below)
  • Integrated transformations
    – With, e.g., Hadoop, Spark

[Figure: Data Lake overview. Sources such as clickstreams, sensor data, IoT devices, social media, cloud platforms, and server logs enter through an Ingest API; the lake holds raw data, metadata, and lineage; transformations and analysis run on Big Data processing frameworks, e.g., Hadoop, Spark, Storm.]
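
A minimal schema-on-read sketch in Java using Spark's public API. The HDFS path, application name, and the assumption that the raw tweets are line-delimited JSON are mine, not details from the deck; the point is only that no schema is imposed at ingest, and structure is inferred when the data is read for analysis.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SchemaOnReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("schema-on-read-sketch")
                .getOrCreate();

        // No schema was enforced when these files were ingested; Spark infers
        // the structure here, at read time (schema-on-read).
        Dataset<Row> tweets = spark.read().json("hdfs:///datalake/raw/tweets/");

        tweets.printSchema();                 // structure discovered from the raw data
        System.out.println(tweets.count());   // e.g., number of raw records

        spark.stop();
    }
}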

SLIDE 8

Data Lake Challenges

  • Increased flexibility leads to harder manageability
    – Differently typed data can easily be dumped into the Data Lake
    – Data products can be in different stages of their lifecycle: raw, half-processed, processed, etc.
    – Can easily turn into “data swamps”
  • Requires traceability!
    – Provenance can help

SLIDE 9

Data Provenance

  • Information about the activities, entities, and people involved in producing a data product
  • Standards
    – OPM
    – PROV
  • If a Data Lake ensures that every data product’s provenance is in place starting from the data product’s origin, critical traceability can be had

SLIDE 10

What could a provenance perspective bring to a Data Lake?

  • Track origins of data and chained transformations
  • Contribute to reuse, and to determinations of trust and quality
  • React! Minimally constrain what enters a Lake?

SLIDE 11

Challenges in Provenance Capturing

  • Chains of transformations
    – Different analytics systems: Hadoop, Spark, etc.
  • The need is end-to-end integrated provenance across transformations
  • System-specific provenance collection methods are less useful
    – Integration/stitching problems
    – E.g., RAMP, HadoopProv for Hadoop

SLIDE 12

Solution to minimal lake governance

  • All components in the lake stream provenance to a central provenance subsystem
    – Stores provenance for long-term queries
    – Monitors the provenance stream in real time
  • An event in the stream is represented by an edge in the provenance graph
  • Global, lake-wide policy: a uniform Persistent ID (PID) (Handle, UUID, DOI) attached to all data objects in the Data Lake
    – Required to guarantee integrated provenance

SLIDE 13

Model

  • PID assigned to all data objects
    – Granularity
  • Transformations T1, T2, and T3
    – Distributed
    – May use different frameworks

[Figure: data objects d1 through d8 flowing through transformations T1, T2, and T3: a chain of transformations sharing IDs, with backward provenance reconstructed from the central provenance store.]
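
A hypothetical sketch of this model (the class and field names are illustrative, not the paper's): each event streamed to the central store is one edge of the provenance graph, and because every data object carries a lake-wide PID, the edge T1 emits when it generates d3 and the edge T2 emits when it uses d3 reference the same identifier, so traces from different systems stitch into one chain.

import java.time.Instant;

/** One edge of the provenance graph, as streamed to the central provenance store. */
public class ProvEdge {
    public enum Relation { USED, GENERATED }

    public final String activityId;  // e.g., one run of transformation T1, T2, or T3
    public final String dataPid;     // lake-wide persistent ID of the data object
    public final Relation relation;
    public final Instant timestamp;

    public ProvEdge(String activityId, String dataPid, Relation relation) {
        this.activityId = activityId;
        this.dataPid = dataPid;
        this.relation = relation;
        this.timestamp = Instant.now();
    }

    public static void main(String[] args) {
        // T1 generates d3; later, T2 uses d3. Both edges carry the same PID,
        // so a backward query over the central store walks from T2's outputs
        // through d3 and back toward d1.
        ProvEdge generated = new ProvEdge("run-T1", "pid:d3", Relation.GENERATED);
        ProvEdge used = new ProvEdge("run-T2", "pid:d3", Relation.USED);
        System.out.println(generated.dataPid.equals(used.dataPid));  // true: the traces stitch
    }
}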

SLIDE 14

Provenance traces integrate across the systems of the Data Lake

SLIDE 15

Reference Architecture

[Figure: reference architecture. Raw data from various sources enters the Data Lake through an Ingest API. Transformations run on batch processing frameworks (e.g., Hadoop, Spark), stream processing frameworks (e.g., Storm, Spark), workflow engines (e.g., Kepler), and legacy scripts, with data import and export paths. Every component streams lineage through a messaging system into a Provenance Subsystem with its own ingest and query APIs, provenance stream processing, and provenance storage, supporting monitoring, debugging, reproducing, data quality, queries, and visualization.]

  • Real-time provenance stream processing
  • Stored provenance for long-term usage
SLIDE 16

Prototype Use Case

  • Different frameworks used
    – Flume: captures tweets and writes them into HDFS
    – Hadoop job: computes hashtag counts
    – Spark job: computes category counts (a sketch follows below)
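
A rough sketch of what the Spark category-count job might look like in Java; the paths, record layout, and category logic are assumptions, not the paper's code.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CategoryCountSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("category-count-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Tweets written into HDFS by Flume, assumed one record per line.
            JavaPairRDD<String, Integer> counts = sc
                    .textFile("hdfs:///datalake/raw/tweets/")
                    .mapToPair(line -> new Tuple2<>(categoryOf(line), 1))
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile("hdfs:///datalake/derived/category-counts/");
        }
    }

    // Placeholder: the real job would parse the tweet and map it to a topic category.
    private static String categoryOf(String line) {
        return line.contains("#nba") ? "sports" : "other";
    }
}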

SLIDE 17

Central provenance store

  • Uses Komadu
    – A distributed provenance collection tool
    – Visualization, custom queries
  • I. Suriarachchi, Q. Zhou and B. Plale (2015). Komadu: A Capture and Visualization System for Scientific Data Provenance. Journal of Open Research Software 3(1):e4
SLIDE 18

Client Library

  • Log4j-like API for provenance capture
  • Dedicated thread pool in the provenance layer
  • Batching to minimize network overhead

[Figure: client library layers. The application layer calls the Komadu client layer, e.g., client.addGeneration(A, E); events are batched on a dedicated provenance thread pool and sent through the RabbitMQ client layer to the RabbitMQ server and on to Komadu.]
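
The deck shows only the single call client.addGeneration(A, E). The sketch below is a hypothetical illustration of the Log4j-like, batched, thread-pooled design this slide describes, not Komadu's actual client API: application code fires provenance events without blocking, and a background thread drains them in batches toward the messaging layer.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical Log4j-style provenance client: callers fire events, a background
 *  thread batches them and ships each batch toward the messaging layer (RabbitMQ/Komadu). */
public class ProvClientSketch {
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
    private final ExecutorService provPool = Executors.newSingleThreadExecutor();
    private final int batchSize;

    public ProvClientSketch(int batchSize) {
        this.batchSize = batchSize;
        provPool.submit(this::drainLoop);
    }

    /** Non-blocking, like a log statement: records that activity A generated entity E. */
    public void addGeneration(String activityId, String entityPid) {
        events.offer("generation," + activityId + "," + entityPid);
    }

    private void drainLoop() {
        List<String> batch = new ArrayList<>();
        try {
            while (true) {
                batch.add(events.take());              // wait for at least one event
                events.drainTo(batch, batchSize - 1);  // fill the rest of the batch if available
                publish(batch);                        // one network send per batch
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void publish(List<String> batch) {
        // Placeholder for the messaging-system send that forwards the batch to the store.
        System.out.println("publishing batch of " + batch.size() + " provenance events");
    }
}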

SLIDE 19

Use case evaluation

  • Flume, Hadoop, and Spark jobs instrumented using the Komadu client libraries
  • Jobs stream provenance events into the central provenance store (Komadu)
  • Persistent IDs (UUIDs) assigned to each data object at entry to the data lake; the PID persists thereafter with the data object
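
A small sketch of that PID policy, assuming UUIDs as in this evaluation; how the PID is attached to the object (file metadata, a sidecar record, etc.) is left open here. The identifier is minted once, at entry to the lake, and carried with the data object from then on.

import java.util.UUID;

public class PidAtIngest {
    public static void main(String[] args) {
        // Minted exactly once, at entry to the data lake...
        String pid = UUID.randomUUID().toString();

        // ...and carried with the data object through every later transformation,
        // e.g., stored alongside the raw file or embedded in its metadata record.
        System.out.println("ingested object assigned PID " + pid);
    }
}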

SLIDE 20

Use case evaluation: experimental environment

  • 5 small VM instances, each with 2 × 2.5 GHz cores, 4 GB RAM, 50 GB local storage
  • 4 VM instances used for the HDFS cluster
  • 3.23 GB of Twitter data collected over 5 days running Flume on the master node
  • Hadoop and Spark set up on top of the HDFS cluster
  • Separate instance for RabbitMQ and Komadu
SLIDE 21

Use case evaluation: Metrics

  • Batch size:
    – Impact of batch size on provenance capture efficiency, measured by total execution time for Hadoop using the provenance event batching mechanism in the Komadu library
  • Overhead of provenance capture:
    – Measured against total tool-specific execution time
    – Measures the overhead of the customized value field (in the key-value pair)
    – Measures the overhead of provenance capture for Hadoop and Spark

SLIDE 22

Batch Size Test

  • Hadoop job execution times with varying batch sizes
  • Optimal batch size: ~5000 KB
SLIDE 23

Overhead: Hadoop

  • custom val: emits the PID with the key-value pair, as (#nba, <2, id>) instead of (#nba, 2)
  • data prov HDFS: writes provenance into HDFS, the approach used by HadoopProv and RAMP
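
An illustrative sketch of the "custom val" variant: a Hadoop mapper that carries the contributing data object's PID inside the value field, so downstream output looks like (#nba, <count, id>) rather than (#nba, count). The input layout and value encoding are assumptions; a reducer (not shown) would sum the counts and forward the PIDs as provenance.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Hashtag-count mapper using the "custom value" trick: the value carries the
 *  count together with the PID of the contributing data object. */
public class HashtagWithPidMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input layout: "<pid>\t<tweet text>" per line.
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) return;
        String pid = parts[0];

        for (String token : parts[1].split("\\s+")) {
            if (token.startsWith("#")) {
                // Count plus PID in the value field; a reducer would sum the counts
                // and keep (or forward as provenance) the contributing PIDs.
                context.write(new Text(token.toLowerCase()), new Text("1|" + pid));
            }
        }
    }
}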

SLIDE 24

Overhead: Spark

  • Higher provenance capture overhead compared to Hadoop

SLIDE 25

Future Work

  • Performance overhead is prohibitively high
    – Decouple PID assignment from execution? Examine granularity
  • Live provenance stream processing for real-time monitoring/reaction
  • Explore minimal provenance at on-line rates and more comprehensive provenance at off-line rates

SLIDE 26

Work funded in part by the National Science Foundation, award OCI-0940824

IEEE E-science 2016 : Hot Topics