Data Ingestion for the Connected World



slide-1
SLIDE 1

Data Ingestion for the Connected World

John Meehan, Cansu Aslantas, Stan Zdonik (Brown University); Nesime Tatbul (Intel Labs & MIT); Jiang Du (University of Toronto)

slide-2
SLIDE 2

The IoT Era

slide-3
SLIDE 3

Traditional Data Ingestion (ETL)

[Figure: traditional ETL pipeline. Stages: Extract, Transform, Load. Labels: data sources, flat files, staging, data cleaning, data normalization, intermediate results, data warehouse, OLAP/storage.]

slide-4
SLIDE 4

An Example: TPC-DI

http://www.tpc.org/tpcdi/ (Poess et al., VLDB 2014)

  • Brokerage firm
  • 6 heterogeneous sources
  • 3 key parts:
    1. Ingest raw data
    2. ETL transform
    3. Update warehouse
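
The three parts listed above can be illustrated with a minimal batch ETL sketch. This is not the TPC-DI specification: the raw file layout, the column names, and the `dim_trade` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; a real TPC-DI run reads several heterogeneous files.
RAW_TRADES = """trade_id,symbol,quantity,price
1, ibm ,10,101.5
2,MSFT,,250.0
3,aapl,5,180.25
"""


def ingest(raw_text):
    """Part 1: read raw rows from a flat file (here an in-memory string)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Part 2: clean and normalize; rows with a missing quantity are dropped."""
    cleaned = []
    for row in rows:
        if not row["quantity"].strip():
            continue
        cleaned.append((
            int(row["trade_id"]),
            row["symbol"].strip().upper(),
            int(row["quantity"]),
            float(row["price"]),
        ))
    return cleaned


def load(conn, rows):
    """Part 3: bulk-load the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_trade "
        "(trade_id INTEGER PRIMARY KEY, symbol TEXT, quantity INTEGER, price REAL)"
    )
    with conn:  # commits the whole batch as one transaction
        conn.executemany("INSERT INTO dim_trade VALUES (?, ?, ?, ?)", rows)


def run_etl():
    conn = sqlite3.connect(":memory:")
    load(conn, transform(ingest(RAW_TRADES)))
    return conn.execute(
        "SELECT trade_id, symbol, quantity, price FROM dim_trade ORDER BY trade_id"
    ).fetchall()
```

Calling `run_etl()` runs the three parts in sequence on one batch, which mirrors the once-a-day style of loading that a batch pipeline implies.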
slide-5
SLIDE 5

An Example: TPC-DI


✓ Data collected into flat files
✓ Heterogeneous data types
✓ Incremental update from an OLTP source, once a day

slide-6
SLIDE 6

An Example: TPC-DI


✓ Storage for intermediate results
✓ Transactional state management

slide-7
SLIDE 7

An Example: TPC-DI


✓ Bulk loading

slide-8
SLIDE 8

Streaming Data Ingestion

  • In modern apps such as IoT:
    – real-time streams of data arrive from a large number of sources
    – the majority of these sources report in the form of time series
    – data currency & low latency are key for real-time decision making & control

✓ Need a stream-based ingestion architecture
✓ Must pay attention to the time-series data type and operations (both during ingestion & analytics)

slide-9
SLIDE 9

An Architecture for Streaming Data Ingestion

slide-10
SLIDE 10

Implementation

[Figure: implementation architecture. Components: data sources; data collector (Kafka); streaming ETL (S-Store, with stored procedures SP1, SP2, SP3 over main-memory storage); data migrator (BigDAWG); OLAP (Postgres, disk storage).]

slide-11
SLIDE 11

Implementation

[Figure: architecture diagram, repeated from Slide 10.]

slide-12
SLIDE 12
S-Store: Shared Mutable State in Streaming

  • A hybrid system for transaction & stream processing
    – combines main-memory OLTP with streaming constructs (windowing, triggers, dataflow graphs)
  • Transactions as user-defined stored procedures (Java + SQL)
  • Three complementary correctness guarantees
    – ACID, for individual transactions
    – Ordered execution, for streams and dataflow graphs
    – Exactly-once processing, for streams (no loss or duplicates due to failures/recovery)
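
The ordered-execution and exactly-once guarantees can be illustrated with a toy sketch. It is not S-Store's implementation: the class name, the integer batch ids, and the in-memory buffer are all assumptions made for illustration. Batches that arrive early are held until their turn, and a batch that is redelivered (for example, during recovery) is ignored.

```python
class OrderedExactlyOnceProcessor:
    """Toy processor: applies batches in batch-id order, each at most once."""

    def __init__(self):
        self.applied = []          # state built up by the processed batches
        self.last_committed = 0    # highest batch id whose effects are applied
        self.pending = {}          # batches that arrived ahead of their turn

    def submit(self, batch_id, values):
        # Duplicate delivery (e.g. replay after recovery): ignore it.
        if batch_id <= self.last_committed or batch_id in self.pending:
            return
        self.pending[batch_id] = list(values)
        # Apply every batch that is now next in line, in id order.
        while self.last_committed + 1 in self.pending:
            next_id = self.last_committed + 1
            # A real system would commit the state change and the id bump atomically.
            self.applied.extend(self.pending.pop(next_id))
            self.last_committed = next_id
```

For example, submitting batch 2 before batch 1 leaves the state untouched until batch 1 arrives, and resubmitting either batch afterwards has no effect.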
slide-13
SLIDE 13

Example: A TPC-DI Dataflow Graph in S-Store

[Figure: TPC-DI dataflow graph. Labels: DATE, TIME, STATUS, TYPE; SECURITY LOOKUP; ACCOUNT LOOKUP; UPDATE TRADE DATA (STAGING). Tables: Date, Time, Status, Type, DimSecurity, DimAccount, DimTrade.]

slide-14
SLIDE 14

Example: A TPC-DI Dataflow Graph in S-Store

[Figure: same dataflow graph as Slide 13, with two transaction executions, TE1 and TE2, marked.]

Transaction Execution (TE) = An instance of a stored procedure executing on an input batch

slide-15
SLIDE 15

Example: A TPC-DI Dataflow Graph in S-Store

[Figure: same dataflow graph as Slide 13, with TE1 and TE2 marked and the shared tables highlighted.]

Shared state read or written by TEs
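
The relationship between stored procedures, transaction executions, and shared state can be sketched as follows. This is an illustration only, written in Python rather than the Java + SQL that S-Store uses; the table contents, surrogate keys, and function names are hypothetical, and the graph is reduced to three steps.

```python
# Shared state (tables) that transaction executions read or write.
dim_security = {"IBM": 101, "MSFT": 102}   # symbol -> hypothetical surrogate key
dim_account = {"acct-7": 7001}             # account id -> hypothetical surrogate key
dim_trade = []                             # written by the last procedure


def security_lookup(batch):
    """Procedure 1: resolve each trade's symbol against DimSecurity (read)."""
    return [dict(t, security_key=dim_security[t["symbol"]]) for t in batch]


def account_lookup(batch):
    """Procedure 2: resolve each trade's account against DimAccount (read)."""
    return [dict(t, account_key=dim_account[t["account"]]) for t in batch]


def update_trade(batch):
    """Procedure 3: append the resolved trades to DimTrade (write)."""
    for t in batch:
        dim_trade.append((t["trade_id"], t["security_key"], t["account_key"]))
    return batch


DATAFLOW = [security_lookup, account_lookup, update_trade]


def run_batch(batch_id, batch, log):
    """Push one input batch through the graph; each procedure call is one TE."""
    for procedure in DATAFLOW:
        log.append((procedure.__name__, batch_id))  # one TE = (procedure, batch)
        batch = procedure(batch)


def demo():
    dim_trade.clear()
    log = []
    run_batch(1, [{"trade_id": 1, "symbol": "IBM", "account": "acct-7"}], log)
    run_batch(2, [{"trade_id": 2, "symbol": "MSFT", "account": "acct-7"}], log)
    return log, list(dim_trade)
```

Two input batches passing through three procedures yield six transaction executions in this sketch, all of which touch the same three tables.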

slide-16
SLIDE 16

Implementation

[Figure: architecture diagram, repeated from Slide 10.]

slide-17
SLIDE 17

Data Migrator

  • Provides durable migration into the data warehouse using an ack mechanism that simulates 2PC
  • Leverages the BigDAWG polystore middleware (see Session 4)
    – can support a variety of destination warehouses
    – can participate in federated querying
  • Supports both “push” and “pull” modes
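
One way to picture an ack-based migration is the sketch below. It does not reproduce the actual migrator or any BigDAWG API: the `Warehouse` and `Migrator` classes, the batch ids, and the `ack_lost` flag are assumptions for illustration, and only the "push" mode is shown. The source discards a staged batch only after the destination acknowledges it, and the destination ignores a batch it has already applied, so a resend after a lost ack neither loses nor duplicates rows.

```python
class Warehouse:
    """Destination: applies each batch once and acknowledges it."""

    def __init__(self):
        self.rows = []
        self.applied_batches = set()

    def receive(self, batch_id, rows):
        if batch_id not in self.applied_batches:  # skip a resend after a lost ack
            self.rows.extend(rows)
            self.applied_batches.add(batch_id)
        return batch_id  # the acknowledgement


class Migrator:
    """Source side: keeps a batch staged until the warehouse acks it."""

    def __init__(self, warehouse):
        self.warehouse = warehouse
        self.staged = {}

    def stage(self, batch_id, rows):
        self.staged[batch_id] = list(rows)

    def push(self, batch_id, ack_lost=False):
        ack = self.warehouse.receive(batch_id, self.staged[batch_id])
        if ack_lost:
            return False  # simulated failure: keep the batch and retry later
        if ack == batch_id:
            del self.staged[batch_id]  # safe to discard only after the ack
        return True
```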

slide-18
SLIDE 18

TPC-DI Experiment: Push vs. Pull Tradeoffs

  • How often to migrate? Push or pull?
  • Impacts:
    – Maximum ingest latency in S-Store
    – Query execution time in Postgres
    – Staleness of the query results in Postgres
  • Result summary: Push in small batches, every 1-5 seconds. Fine-grained ingestion performs well.

slide-19
SLIDE 19

Ongoing Work

  • Time-series data management (ingestion & beyond)
    – New ingestion challenges and opportunities (e.g., synchronization/alignment of time series, using predictive techniques for dealing with missing/delayed values)
    – Append-based updates, window-based reads
    – Need to support complex analytics operations (forecasting/prediction, pattern matching, anomaly detection, signal processing)
    – Exploit the resources on edge devices
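
The alignment problem mentioned above can be made concrete with a small sketch. It is an assumption-laden illustration, not the authors' method: the function name and data layout are hypothetical, and carrying the last observed value forward is a deliberately simple stand-in for a real predictive technique.

```python
def align(series_by_sensor, timestamps):
    """Align several time series on a shared list of timestamps.

    A missing reading is filled with the sensor's last observed value.
    Readings missing before the first observation stay None.
    """
    aligned = {}
    for sensor, readings in series_by_sensor.items():
        last = None
        row = []
        for ts in timestamps:
            if ts in readings:
                last = readings[ts]
            row.append(last)
        aligned[sensor] = row
    return aligned
```

Each sensor's readings are given as a mapping from timestamp to value, so sensors that report at different times end up as equal-length rows on the same grid.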