MaSM: Efficient Online Updates in Data Warehouses Manos - PowerPoint PPT Presentation

MaSM: Efficient Online Updates in Data Warehouses Manos Athanassoulis 1 Shimin Chen 2 Anastasia Ailamaki 1 Phillip Gibbons 2 Radu Stoica 1 1 EPFL 2 Intel Labs

Freshness vs Performance • Data warehouse workload – Read-only queries (scans) – Scattered updates – Difficult to combine efficiently TPCH queries (on avg) 2.5 Normalized execution time 2 • Traditionally two choices Freshness 1.5 – Freshness: in-place updates Performance 1 – Performance: batch updates 0.5 0 Query only Query w/ Query only + Ideal • Ideally, zero overhead updates Updates only 2

Freshness vs Performance • Data warehouse workload – Read-only queries (scans) – Scattered updates – Difficult to combine efficiently TPCH queries (on avg) 2.5 Normalized execution time 2 • Traditionally two choices Freshness 1.5 – Freshness: in-place updates Performance 1 – Performance: batch updates 0.5 0 Query only Query w/ Query only + Ideal • Ideally, zero overhead updates Updates only Is zero overhead possible? 3

Freshness AND Performance In Memory Buffered Updates cache updates in memory 1000 normalized migration overhead [Stonebraker et al .’05] [Heman et al.’10] 100 Ø Apply them online Ø Apply them as differential updates 10 ideal x Large memory overhead x Trade-off migration overhead 1 16MB 128MB 1GB 8GB for memory footprint in-memory buffer size To sum up Update Approach Freshness Performance ↓ mem overhead Batched X J J In place X J J In-memory differential X J J 4

Freshness AND Performance In Memory Buffered Updates cache updates in memory 1000 normalized migration overhead [Stonebraker et al .’05] [Heman et al.’10] 100 Ø Apply them online Ø Apply them as differential updates 10 ideal x Large memory overhead x Trade-off migration overhead 1 16MB 128MB 1GB 8GB for memory footprint in-memory buffer size To sum up Update Approach Freshness Performance ↓ mem overhead Batched X J J In place X J J In-memory differential X J J Can we have the cake and eat it too? 5

Use MaSM! Incoming updates Merge data Answer from disks query and flash SSD • Buffer updates on Flash instead of memory Ø Flash has larger capacity and smaller price • But: Flash friendly design is important – Avoid random writes – Limit total writes • e.g. Log-Structure Merge Tree incurs a high number writes per update [O’ Neil et al.’96] 6

Use MaSM! Incoming updates Merge data Answer from disks query and flash SSD • Buffer updates on Flash instead of memory Ø Flash has larger capacity and smaller price • But: Flash friendly design is important – Avoid random writes Let’s think again! – Limit total writes • e.g. Log-Structure Merge Tree incurs a high number writes per update [O’ Neil et al.’96] 7

MaSM core idea Updates (U) Data (D) Current data K Value Type Key Value Key Value Type ü Outer join: D ⟕ U 1 V1’ Mod 1 V1 5 V5’ Mod 5 V5’’ Mod 2 V2 ü Keep latest update only 19 V19’ Mod 9 N/A Del 3 V3 1 V1’ Mod 19 V19’ Mod 4 V4 9 N/A Del Efficient execution 125 V125 Ins 5 V5 125 V125 Ins ü Discard duplicates 6 V6 5 V5’’ Mod ü Re-use information for 7 V7 future queries 8 V8 9 V9 Sort-Merge Join ü Intuitively does both 8

MaSM core idea Updates (U) Data (D) Current data K Value Type Key Value Key Value Type ü Outer join: D ⟕ U 1 V1’ Mod 1 V1 5 V5’ Mod 5 V5’’ Mod 2 V2 ü Keep latest update only 19 V19’ Mod 9 N/A Del 3 V3 1 V1’ Mod 19 V19’ Mod 4 V4 9 N/A Del Efficient execution 125 V125 Ins 5 V5 125 V125 Ins ü Discard duplicates 6 V6 5 V5’’ Mod ü Re-use information for 7 V7 future queries 8 V8 9 V9 Sort-Merge Join MaSM merges data with updates using sort- ü Intuitively does both merge join and materializing sorted runs 9

Outline • Introduction – Prior work: Differential Updates – MaSM sneak peak • MaSM architecture • Evaluation – Query response time – Sustained update rate • Conclusions 10

MaSM in detail Updates Main memory M= !!" M pages SSD e.g. GBs Disks (main data) e.g. TBs 11

MaSM in detail Main memory M= !!" Merge data & updates Incoming Mem Scan query Merge updates Table Range Scan M pages Run Scan Run Scan Run Scan M pages SSD e.g. GBs Disks (main data) e.g. TBs 12

MaSM in detail Main memory M= !!" Merge data & updates Incoming Mem Scan query Merge updates Table Range Scan M pages Run Scan Run Scan Run Scan M pages SSD e.g. GBs Disks (main data) e.g. TBs Merge pages from HDD, SSD and RAM with negligible overhead! 13

Reducing MaSM memory Main memory ! ≤ 2, S ≤ & Merge data & updates Incoming Mem Scan query Merge updates Table Range Scan S pages Run Scan Run Scan Run Scan αM-S pages SSD e.g. GBs Disks (main data) e.g. TBs 1-pass runs 2-pass runs 14

Reducing MaSM memory Main memory ! ≤ 2, S ≤ & Merge data & updates Incoming Mem Scan query Merge updates Table Range Scan S pages Run Scan Run Scan Run Scan αM-S pages SSD e.g. GBs Disks (main data) e.g. TBs 1-pass runs 2-pass runs Trade-off extra writes for memory size. 15

Impact of α on SSD wear 2 SSD writes per update ! − #. !%& ! 1 Memory footprint = αM 0.2M 2M 0.2 2.0 f (M) ≤ α ≤ 2 e.g., M=1000, 0.2 ≤ α ≤ 2 M= **+ 16

Impact of α on SSD wear 2 SSD writes per update ! − #. !%& ! 1 Memory footprint = αM 0.2M 2M 0.2 2.0 f (M) ≤ α ≤ 2 e.g., M=1000, 0.2 ≤ α ≤ 2 M= **+ 10x smaller memory for 2x more writes! 17

Outline • Introduction – Prior work: Differential Updates – MaSM sneak peak • MaSM architecture • Evaluation – Query response time – Sustained update rate • Conclusions 18

Experimental setup • Dell Precision 690 Workstation – Intel Xeon Quad (2.33MHz, 8MB L2), 4GB DRAM, Ubuntu Linux, kernel 2.6.24 • Dedicated SATA disk for main data – 7200rpm Seagate Barracuda, 77MB/s sequential bandwidth • Intel X25-E SSD for caching updates – 250 MB/s sequential read, 170MB/s sequential write bandwidth; 35,000 4KB-sized random reads/second • Prototype row store: – Implemented in-place updates, indexed updates, MaSM 19

Query performance on synthetic data 100GB main data, 4GB flash for cached updates, 16MB memory normalized time 4 3 2 1 0 4KB 100KB 1MB 10MB 100MB 1GB 10GB 100GB range size in-place updates MaSM w/ coarse-grain index MaSM w/ fine-grain index • MaSM has negligible impact on 10MB or larger scans • MaSM with fine-grain index incurs 4% overhead for 4KB ranges (modeling point queries) 20

TPCH replay experiment 3537s 2000 execution time (s) 1500 1000 500 0 q10 q11 q12 q13 q14 q15 q16 q18 q19 q21 q22 q1 q2 q3 q4 q5 q6 q7 q8 q9 query w/o updates query w/ in-place updates query w/ MaSM updates • Replay TPCH disk traces recorded from commercial row store; random online updates 21

TPCH replay experiment 3537s 2000 execution time (s) 1500 1000 500 0 q10 q11 q12 q13 q14 q15 q16 q18 q19 q21 q22 q1 q2 q3 q4 q5 q6 q7 q8 q9 query w/o updates query w/ in-place updates query w/ MaSM updates • Replay TPCH disk traces recorded from commercial row store; random online updates Queries with MaSM see less than 1% overhead! 22

Update performance 14000 Update Rate (upd/s) 12K 12000 10000 8000 6K 6000 3.5K 4000 2000 48 0 in-place updates MaSM 2GB SSD MaSM 4GB SSD MaSM 8GB SSD 23

Update performance 14000 Update Rate (upd/s) 12K 12000 10000 8000 6K 6000 3.5K 4000 2000 48 Efficient usage of a few GB of flash can 0 in-place updates MaSM 2GB SSD MaSM 4GB SSD MaSM 8GB SSD increase update rate up to 258x! 24

To sum up MaSM enables on-line updates in DW • Negligible query overhead (less than 1% for TPCH) • Supports a high update rate (up to 12k) • Tunable memory footprint vs SSD wear • Low migration cost (one-time 2.2x) • SSD-friendly behavior – Limited number of writes per updates TPCH queries (on avg) 2.5 – No random writes on SSD Normalized execution time • Easy DBMS integration 2 • Ensure ACID properties 1.5 1 0.5 0 Query only Query w/ Query only + Ideal 25 MaSM updates Updates only

To sum up MaSM enables on-line updates in DW • Negligible query overhead (less than 1% for TPCH) Update Approach Freshness Performance ↓ mem overhead • Supports a high update rate (up to 12k) Batched X J J • Tunable memory footprint vs SSD wear In place J X J • Low migration cost (one-time 2.2x) In-memory differential X J J • SSD-friendly behavior MaSM and SSD J J J – Limited number of writes per updates TPCH queries (on avg) 2.5 – No random writes on SSD Normalized execution time • Easy DBMS integration 2 • Ensure ACID properties 1.5 Thank you! 1 0.5 0 Query only Query w/ Query only + Ideal 26 MaSM updates Updates only

MaSM: Efficient Online Updates in Data Warehouses Manos - PowerPoint PPT Presentation

MaSM: Efficient Online Updates in Data Warehouses Manos Athanassoulis 1 Shimin Chen 2 Anastasia Ailamaki 1 Phillip Gibbons 2 Radu Stoica 1 1 EPFL 2 Intel Labs Freshness vs Performance Data warehouse workload Read-only queries (scans)

Towards Benchmarking Stream Data Warehouses Arian Br, Lukasz Golab 02.11.2012 Stream Data

BUSINESS ANALYTICS CHAPTER 29 LECTURE OUTLINE Data warehouses Comparison with

Towards a Formal Model for View Maintenance in Data Warehouses D. Agrawal, A. El Abbadi, A.

Data Science in the Wild Lecture 12: Memory-Based Data Warehouses Eran Toch Data Science in the

Data Mining for Knowledge Management Data Warehouses Themis Palpanas University of Trento

Carnegie Mellon Univ. Problem Dept. of Computer Science Getting the data: Data Warehouses,

Mission Updates Payload and Subsystems Updates Rocket and Subsystems Updates

Developing Data Warehouses with Quality in Mind Yannis Vassiliou National Technical University

XCube XML for Data Warehouses Wolfgang Hmmer Andreas Bauer Gunnar Harde DOLAP 2003 New

Data Warehouses Google BigQuery AWS Redshift Azure Data Lake Apache Hive Motiv tivation

The DataPath System: A Data-Centric Analytic Processing Engine for Large Data Warehouses Subi

Altinity Building Multi-Petabyte Data Warehouses with ClickHouse Alexander Zaitsev LifeSteet,

Value-driven Approach Designing Extended Data Warehouses Nabila BERKANI & Selma KHOURI

in data warehouses Manfred Jeusfeld University of Skvde, Sweden 1 (c) 2015 M. Jeusfeld,

Dynamic Workload Management for Very Large Data Warehouses Juggling Feathers and Bowling Balls

Towards a Formal Model for View Maintenance in Data Warehouses D. Agrawal , A. El Abbadi , A.

Predicting Performance and Cost of Serverless Computing Functions with SAAF Robert Cordingly, Wen

Virtualization-based Bandwidth Management for Parallel Storage Systems Yiqi Xu , Lixi Wang,

Adaptable IO System (ADIOS) http://www.cc.gatech.edu/~lofstead/adios Cray User Group 2008 May 8,

Using R for time series analysis and spatial-temporal distribution of global burnt surface

Puzzles in B Decays Alakabha Datta University of Mississippi April 21, 2017 WIN 2017, Irvine

Storage and Data Challenges for Production Nisha Talagala CEO, Pyxeda AI Machine Learning

Spark and HPC for High Energy Physics Data Analyses Marc Paterno, Jim Kowalkowski, and Saba

Simple Data Storage: SQLite Mahdi Roozbahani Lecturer, Computational Science and Engineering,

MaSM: Efficient Online Updates in Data Warehouses Manos - PowerPoint PPT Presentation

MaSM: Efficient Online Updates in Data Warehouses Manos Athanassoulis 1 Shimin Chen 2 Anastasia Ailamaki 1 Phillip Gibbons 2 Radu Stoica 1 1 EPFL 2 Intel Labs Freshness vs Performance Data warehouse workload Read-only queries (scans)

Towards Benchmarking Stream Data Warehouses Arian Br, Lukasz Golab 02.11.2012 Stream Data

BUSINESS ANALYTICS CHAPTER 29 LECTURE OUTLINE Data warehouses Comparison with

Towards a Formal Model for View Maintenance in Data Warehouses D. Agrawal, A. El Abbadi, A.

Data Science in the Wild Lecture 12: Memory-Based Data Warehouses Eran Toch Data Science in the

Data Mining for Knowledge Management Data Warehouses Themis Palpanas University of Trento

Carnegie Mellon Univ. Problem Dept. of Computer Science Getting the data: Data Warehouses,

Mission Updates Payload and Subsystems Updates Rocket and Subsystems Updates

Developing Data Warehouses with Quality in Mind Yannis Vassiliou National Technical University

XCube XML for Data Warehouses Wolfgang Hmmer Andreas Bauer Gunnar Harde DOLAP 2003 New

Data Warehouses Google BigQuery AWS Redshift Azure Data Lake Apache Hive Motiv tivation

The DataPath System: A Data-Centric Analytic Processing Engine for Large Data Warehouses Subi

Altinity Building Multi-Petabyte Data Warehouses with ClickHouse Alexander Zaitsev LifeSteet,

Value-driven Approach Designing Extended Data Warehouses Nabila BERKANI &amp; Selma KHOURI

in data warehouses Manfred Jeusfeld University of Skvde, Sweden 1 (c) 2015 M. Jeusfeld,

Dynamic Workload Management for Very Large Data Warehouses Juggling Feathers and Bowling Balls

Towards a Formal Model for View Maintenance in Data Warehouses D. Agrawal , A. El Abbadi , A.

Predicting Performance and Cost of Serverless Computing Functions with SAAF Robert Cordingly, Wen

Virtualization-based Bandwidth Management for Parallel Storage Systems Yiqi Xu , Lixi Wang,

Adaptable IO System (ADIOS) http://www.cc.gatech.edu/~lofstead/adios Cray User Group 2008 May 8,

Using R for time series analysis and spatial-temporal distribution of global burnt surface

Puzzles in B Decays Alakabha Datta University of Mississippi April 21, 2017 WIN 2017, Irvine

Storage and Data Challenges for Production Nisha Talagala CEO, Pyxeda AI Machine Learning

Spark and HPC for High Energy Physics Data Analyses Marc Paterno, Jim Kowalkowski, and Saba

Simple Data Storage: SQLite Mahdi Roozbahani Lecturer, Computational Science and Engineering,

Value-driven Approach Designing Extended Data Warehouses Nabila BERKANI & Selma KHOURI