Data Ingestion for the Connected World


John Meehan, Cansu Aslantas, Stan Zdonik (Brown University); Nesime Tatbul (Intel Labs & MIT); Jiang Du (University of Toronto)


  1. Data Ingestion for the Connected World. John Meehan, Cansu Aslantas, Stan Zdonik (Brown University); Nesime Tatbul (Intel Labs & MIT); Jiang Du (University of Toronto)

  2. The IoT Era

  3. Traditional Data Ingestion (ETL): Extract, Transform, Load [Diagram labels: Data Sources, Flat Files, Staging, Data Cleaning, Data Normalization, Intermediate Results, Data Warehouse, OLAP/Storage]

  4. An Example: TPC-DI • Brokerage firm • 6 heterogeneous sources • 3 key parts: 1. Ingest raw data 2. ETL transform 3. Update warehouse (http://www.tpc.org/tpcdi/; Poess et al., VLDB 2014)

  5. An Example: TPC-DI • Brokerage firm • 6 heterogeneous sources • 3 key parts: 1. Ingest raw data 2. ETL transform 3. Update warehouse • Callouts: ✓ Data collected into flat files ✓ Heterogeneous data types ✓ Incremental update from an OLTP source, once a day (http://www.tpc.org/tpcdi/; Poess et al., VLDB 2014)

  6. An Example: TPC-DI • Brokerage firm • 6 heterogeneous sources • 3 key parts: 1. Ingest raw data 2. ETL transform 3. Update warehouse • Callouts: ✓ Storage for intermediate results ✓ Transactional state management (http://www.tpc.org/tpcdi/; Poess et al., VLDB 2014)

  7. An Example: TPC-DI • Brokerage firm • 6 heterogeneous sources • 3 key parts: 1. Ingest raw data 2. ETL transform 3. Update warehouse • Callouts: ✓ Bulk loading (http://www.tpc.org/tpcdi/; Poess et al., VLDB 2014)

  8. Streaming Data Ingestion • In modern apps such as IoT: – real-time streams of data from a large number of sources – the majority of these sources report in the form of time-series – data currency & low latency are key for real-time decision making & control ✓ Need a stream-based ingestion architecture ✓ Must pay attention to the time-series data type and operations (both during ingestion & analytics)

  9. An Architecture for Streaming Data Ingestion

  10. Implementation [Architecture diagram labels: Data Sources, Kafka, Data Collector, S-Store (SP1, SP2, SP3), Streaming ETL, Main-Memory Storage, Data Migrator, BigDAWG, Postgres, OLAP, Disk Storage]

  11. Implementation [architecture diagram repeated from slide 10]

  12. S-Store: Shared Mutable State in Streaming • A hybrid system for transaction & stream processing – combines main-memory OLTP with streaming constructs (windowing, triggers, dataflow graphs) • Transactions as user-defined stored procedures (Java + SQL) • Three complementary correctness guarantees – ACID, for individual transactions – Ordered execution, for streams and dataflow graphs – Exactly-once processing, for streams (no loss or duplicates due to failures/recovery)
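
The ordered-execution and exactly-once guarantees listed on slide 12 can be illustrated with a toy sketch. This is not S-Store's implementation; the class `OrderedExactlyOnce` and its fields are hypothetical names, and it assumes every input batch carries a consecutive integer id starting at 1. A real system would also commit the state change and the batch id atomically.

```python
from dataclasses import dataclass, field


@dataclass
class OrderedExactlyOnce:
    """Toy processor: applies batches in batch-id order, each at most once."""
    last_committed: int = 0                      # highest batch id already applied
    pending: dict = field(default_factory=dict)  # early arrivals waiting their turn
    state: list = field(default_factory=list)    # stand-in for shared mutable state

    def submit(self, batch_id: int, rows: list) -> None:
        # Duplicate delivery (for example a replay after recovery) is ignored.
        if batch_id <= self.last_committed or batch_id in self.pending:
            return
        self.pending[batch_id] = rows
        # Apply every batch that is now next in sequence.
        while self.last_committed + 1 in self.pending:
            nxt = self.last_committed + 1
            self.state.extend(self.pending.pop(nxt))  # the "transaction" body
            self.last_committed = nxt


proc = OrderedExactlyOnce()
proc.submit(2, ["b"])   # arrives early, held back
proc.submit(1, ["a"])   # unblocks batch 2
proc.submit(1, ["a"])   # replayed duplicate, skipped
proc.submit(3, ["c"])
```

After these calls the state holds the rows of batches 1, 2 and 3 in that order, each applied once, regardless of arrival order or replays.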

  13. Example: A TPC-DI Dataflow Graph in S-Store [Dataflow graph. Node labels: Trade Data (Staging); Date, Time, Status, Type; Security Lookup; Account Lookup; Update. Table labels: Date, Time, Status, Type, DimSecurity, DimAccount, DimTrade]

  14. Example: A TPC-DI Dataflow Graph in S-Store [dataflow graph as on slide 13, with TE1 and TE2 marked] Transaction Execution (TE) = an instance of a stored procedure executing on an input batch

  15. Example: A TPC-DI Dataflow Graph in S-Store [dataflow graph as on slide 13, with TE1 and TE2 marked] Shared state: read or written by TEs

  16. Implementation [architecture diagram repeated from slide 10]

  17. Data Migrator • Provides durable migration into the data warehouse using an ack mechanism that simulates 2PC • Leverages the BigDAWG polystore middleware (see Session 4) – can support a variety of destination warehouses – can participate in federated querying • Supports both “push” and “pull” modes
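
One way to picture the ack mechanism on slide 17 is the following minimal sketch of push-mode migration. It is an illustration under stated assumptions, not the BigDAWG or S-Store API: `Warehouse`, `Migrator`, and the `drop_ack` flag are hypothetical, and the destination is assumed to apply batches idempotently by batch id. The source keeps each batch until the destination acknowledges it, so a lost ack leads to a resend rather than to lost or duplicated rows.

```python
class Warehouse:
    """Hypothetical destination: applies each batch id at most once."""

    def __init__(self):
        self.rows = []
        self.applied = set()

    def load(self, batch_id, rows):
        if batch_id not in self.applied:   # re-sent batches are no-ops
            self.rows.extend(rows)
            self.applied.add(batch_id)
        return batch_id                    # the acknowledgement


class Migrator:
    """Hypothetical source side: keeps a batch until its ack arrives."""

    def __init__(self, warehouse):
        self.warehouse = warehouse
        self.outbox = {}                   # batch_id -> rows not yet acknowledged

    def stage(self, batch_id, rows):
        self.outbox[batch_id] = list(rows)

    def push(self, drop_ack=False):
        for batch_id in sorted(self.outbox):
            ack = self.warehouse.load(batch_id, self.outbox[batch_id])
            if drop_ack:
                continue                   # simulate a lost ack: keep batch for retry
            if ack == batch_id:
                del self.outbox[batch_id]  # safe to forget only after the ack


warehouse = Warehouse()
migrator = Migrator(warehouse)
migrator.stage(1, ["trade-1", "trade-2"])
migrator.push(drop_ack=True)   # data arrives, ack is lost
migrator.push()                # retry: destination ignores the duplicate
```

A pull mode would invert the control flow: the destination would request unacknowledged batches instead of the source sending them.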

  18. TPC-DI Experiment: Push vs. Pull Tradeoffs • How often to migrate? Push or pull? • Impacts: – Maximum ingest latency in S-Store – Query execution time in Postgres – Staleness of the query results in Postgres • Result summary: Push in small batches, every 1-5 seconds. Fine-grained ingestion performs well.
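
The "push in small batches" result on slide 18 can be sketched as an interval-based buffer. The names `IntervalPusher`, `interval_s`, and `sink` are hypothetical, and the clock is passed in explicitly so the example is deterministic. A shorter interval lowers staleness at the destination at the cost of more, smaller loads.

```python
class IntervalPusher:
    """Buffers rows and pushes them once the interval has elapsed."""

    def __init__(self, interval_s, sink):
        self.interval_s = interval_s   # e.g. somewhere in the 1-5 second range
        self.sink = sink               # callable that receives a list of rows
        self.buffer = []
        self.last_flush = 0.0

    def ingest(self, row, now):
        self.buffer.append(row)
        if now - self.last_flush >= self.interval_s:
            self.sink(self.buffer)     # push one small batch
            self.buffer = []           # start a fresh buffer
            self.last_flush = now


pushed = []
pusher = IntervalPusher(2.0, pushed.append)
for t in (0.5, 1.0, 2.0, 2.5, 4.5):
    pusher.ingest(t, now=t)            # the row is just its arrival time here
```

With a 2-second interval, the five rows above leave the buffer as two batches.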

  19. Ongoing Work • Time-series data management (ingestion & beyond) – New ingestion challenges and opportunities (e.g., synchronization/alignment of time-series, using predictive techniques for dealing with missing/delayed values) – Append-based updates, window-based reads – Need to support complex analytics operations (forecasting/prediction, pattern matching, anomaly detection, signal processing) – Exploit the resources on edge devices
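
As a small illustration of the alignment problem named on slide 19, the sketch below resamples two irregular series onto a shared time grid. It uses last observation carried forward, which is the simplest possible stand-in for the predictive techniques the slide mentions and not the authors' method. The function `align_locf` and the sample readings are made up for the example.

```python
from bisect import bisect_right


def align_locf(series, grid):
    """Resample (timestamp, value) pairs onto grid, carrying the last value forward."""
    series = sorted(series)
    times = [t for t, _ in series]
    out = []
    for g in grid:
        i = bisect_right(times, g)                   # readings with timestamp <= g
        out.append(series[i - 1][1] if i else None)  # None: nothing observed yet
    return out


temperature = [(0, 20.0), (7, 21.0), (19, 22.5)]     # irregular sensor readings
humidity = [(3, 40), (12, 42)]
grid = [0, 5, 10, 15, 20]

aligned = list(zip(grid,
                   align_locf(temperature, grid),
                   align_locf(humidity, grid)))
```

Each row of `aligned` holds one grid timestamp with the most recent reading from each series, or `None` where a sensor has not reported yet.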
