MaSM: Efficient Online Updates in Data Warehouses


  1. MaSM: Efficient Online Updates in Data Warehouses
     Manos Athanassoulis (1), Shimin Chen (2), Anastasia Ailamaki (1), Phillip Gibbons (2), Radu Stoica (1)
     (1) EPFL, (2) Intel Labs

  3. Freshness vs Performance
     • Data warehouse workload
       – Read-only queries (scans)
       – Scattered updates
       – Difficult to combine efficiently
     • Traditionally two choices
       – Freshness: in-place updates
       – Performance: batch updates
     • Ideally, zero-overhead updates
     [Figure: TPC-H queries (on avg), normalized execution time (axis 0 to 2.5) for "Query only", "Query w/ updates", "Query only + Updates only", and "Ideal"]
     Is zero overhead possible?

  5. Freshness AND Performance
     In-memory buffered updates: cache updates in memory [Stonebraker et al. ’05] [Heman et al. ’10]
       – Apply them online
       – Apply them as differential updates
       ✗ Large memory overhead
       ✗ Trade-off: migration overhead for memory footprint
     [Figure: normalized migration overhead (log scale, 1 to 1000) vs in-memory buffer size (16MB, 128MB, 1GB, 8GB), with the ideal marked]
     To sum up:
       Update approach          Freshness  Performance  Low memory overhead
       Batched                  ✗          ✓            ✓
       In place                 ✓          ✗            ✓
       In-memory differential   ✓          ✓            ✗
     Can we have our cake and eat it too?

  7. Use MaSM!
     [Diagram: incoming updates are cached on flash SSD; a query is answered by merging data from disks and flash SSD]
     • Buffer updates on flash instead of memory
       – Flash has larger capacity and a lower price
     • But: flash-friendly design is important
       – Avoid random writes
       – Limit total writes
     • e.g., a Log-Structured Merge Tree incurs a high number of writes per update [O’Neil et al. ’96]
     Let’s think again!

  9. MaSM core idea
     Data (D), sorted by key:
       Key    1   2   3   4   5   6   7   8   9
       Value  V1  V2  V3  V4  V5  V6  V7  V8  V9
     Updates (U), in arrival order:
       Key  Value  Type
       5    V5’    Mod
       19   V19’   Mod
       1    V1’    Mod
       9    N/A    Del
       125  V125   Ins
       5    V5’’   Mod
     Updates sorted by key, latest update per key only:
       Key  Value  Type
       1    V1’    Mod
       5    V5’’   Mod
       9    N/A    Del
       19   V19’   Mod
       125  V125   Ins
     Current data
       ✓ Outer join: D ⟕ U
       ✓ Keep latest update only
     Efficient execution
       ✓ Discard duplicates
       ✓ Re-use information for future queries
     Sort-Merge Join
       ✓ Intuitively does both
     MaSM merges data with updates using sort-merge join and materializing sorted runs
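The core idea on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `latest_updates` and `merge_scan` are hypothetical, records are plain tuples, and a Mod or Ins whose key is absent from the data excerpt is simply emitted as a new record.

```python
from itertools import groupby


def latest_updates(updates):
    """Sort updates by key and keep only the most recent update per key.

    `updates` is a list of (key, value, kind) tuples in arrival order.
    """
    # Tag each update with its arrival position so "latest" is well defined.
    tagged = sorted(
        ((key, seq, value, kind) for seq, (key, value, kind) in enumerate(updates)),
        key=lambda t: (t[0], t[1]),
    )
    result = []
    for _, group in groupby(tagged, key=lambda t: t[0]):
        *_, last = group  # the last element of the group is the newest update
        result.append((last[0], last[2], last[3]))
    return result


def merge_scan(data, updates):
    """Yield current (key, value) pairs from key-sorted data and key-sorted updates."""
    i = j = 0
    while i < len(data) or j < len(updates):
        if j == len(updates) or (i < len(data) and data[i][0] < updates[j][0]):
            yield data[i]  # no pending update for this record
            i += 1
            continue
        key, value, kind = updates[j]
        if i < len(data) and data[i][0] == key:
            i += 1  # the update supersedes the stored record
        if kind != "Del":
            yield (key, value)  # Mod or Ins: emit the new value
        j += 1


# The example tables from the slide.
data = [(k, f"V{k}") for k in range(1, 10)]
updates = [
    (5, "V5'", "Mod"), (19, "V19'", "Mod"), (1, "V1'", "Mod"),
    (9, None, "Del"), (125, "V125", "Ins"), (5, "V5''", "Mod"),
]
current = list(merge_scan(data, latest_updates(updates)))
```

Both inputs are consumed in key order in a single pass, which is why a sort-merge join handles the outer join and the duplicate elimination together.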

  10. Outline
      • Introduction
        – Prior work: Differential Updates
        – MaSM sneak peek
      • MaSM architecture
      • Evaluation
        – Query response time
        – Sustained update rate
      • Conclusions

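  13. MaSM in detail
      [Diagram, top to bottom]
      • Main memory, M = √‖SSD‖
        – Incoming updates fill an in-memory buffer (M pages), read by a Mem Scan
        – Run Scans read the sorted runs on the SSD (M pages)
        – Merge updates, then merge data & updates to answer the query
      • SSD (e.g., GBs): materialized sorted runs of updates
      • Disks (main data, e.g., TBs): read by a Table Range Scan
      Merge pages from HDD, SSD and RAM with negligible overhead!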
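The "Merge updates" step in this diagram can be pictured as a k-way merge over key-sorted inputs (the in-memory buffer plus the runs on the SSD). The Python sketch below is a simplified illustration, assuming each run is a key-sorted list of `(key, value, kind)` tuples and that runs are passed oldest first; `merge_updates` is a hypothetical name, not part of MaSM.

```python
import heapq


def merge_updates(runs):
    """Merge key-sorted update runs into one sorted stream, newest update per key.

    `runs` is ordered oldest first; the last entry plays the role of the
    in-memory buffer, the others stand in for sorted runs on the SSD.
    """
    # Tag every update with the age of its run so newer runs win on equal keys.
    tagged = (
        [(key, age, value, kind) for key, value, kind in run]
        for age, run in enumerate(runs)
    )
    pending = None
    for key, _, value, kind in heapq.merge(*tagged, key=lambda t: (t[0], t[1])):
        if pending is not None and pending[0] != key:
            yield pending  # the key changed, so `pending` is the newest for its key
        pending = (key, value, kind)
    if pending is not None:
        yield pending


ssd_run_old = [(1, "a", "Mod"), (5, "b", "Mod")]
ssd_run_new = [(5, "c", "Mod"), (9, None, "Del")]
memory_buffer = [(3, "d", "Ins")]
merged = list(merge_updates([ssd_run_old, ssd_run_new, memory_buffer]))
```

Because `heapq.merge` reads each input sequentially and holds only one element per input at a time, the sketch mirrors the one-page-per-run scan pattern suggested by the diagram.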

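  15. Reducing MaSM memory
      [Diagram, top to bottom]
      • Main memory: αM pages in total, α ≤ 2
        – Incoming updates fill an in-memory buffer (S pages), read by a Mem Scan
        – Run Scans read the sorted runs on the SSD (αM − S pages)
        – Merge updates, then merge data & updates to answer the query
      • SSD (e.g., GBs): 1-pass runs and 2-pass runs
      • Disks (main data, e.g., TBs): read by a Table Range Scan
      Trade-off extra writes for memory size.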

  17. Impact of α on SSD wear
      • Memory footprint = αM, with M = √‖SSD‖
      • f(M) ≤ α ≤ 2; e.g., for M = 1000, 0.2 ≤ α ≤ 2
      • SSD writes per update: 2 − 0.25α²
      [Figure: SSD writes per update (axis 1 to 2) vs memory footprint (0.2M to 2M)]
      10x smaller memory for 2x more writes!
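The trade-off on this slide can be tabulated with a short Python sketch. It assumes the curve is 2 − 0.25α² writes per update, which matches the endpoints of the plot (about 2 writes at α = 0.2 and 1 write at α = 2); treat that formula and the function name `ssd_writes_per_update` as assumptions of this sketch rather than something the slide text states unambiguously.

```python
def ssd_writes_per_update(alpha):
    """Assumed write-amplification curve: 2 - 0.25 * alpha^2 writes per update."""
    return 2 - 0.25 * alpha ** 2


M = 1000  # pages, the example value used on the slide

# Memory footprint (alpha * M pages) and SSD writes per update for a few settings.
table = [
    (alpha, alpha * M, ssd_writes_per_update(alpha))
    for alpha in (0.2, 0.5, 1.0, 2.0)
]

for alpha, pages, writes in table:
    print(f"alpha={alpha:<4} memory={pages:7.1f} pages  writes/update={writes:.2f}")
```

Going from α = 2 down to α = 0.2 cuts the memory footprint by a factor of ten while the writes per update rise from 1 to roughly 2, which is the "10x smaller memory for 2x more writes" summary.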

  19. Experimental setup
      • Dell Precision 690 Workstation
        – Intel Xeon Quad (2.33GHz, 8MB L2), 4GB DRAM, Ubuntu Linux, kernel 2.6.24
      • Dedicated SATA disk for main data
        – 7200rpm Seagate Barracuda, 77MB/s sequential bandwidth
      • Intel X25-E SSD for caching updates
        – 250MB/s sequential read, 170MB/s sequential write bandwidth; 35,000 4KB-sized random reads/second
      • Prototype row store
        – Implemented in-place updates, indexed updates, MaSM

  20. Query performance on synthetic data
      100GB main data, 4GB flash for cached updates, 16MB memory
      [Figure: normalized time (axis 0 to 4) vs range size (4KB to 100GB) for in-place updates, MaSM w/ coarse-grain index, and MaSM w/ fine-grain index]
      • MaSM has negligible impact on scans of 10MB or larger
      • MaSM with fine-grain index incurs 4% overhead for 4KB ranges (modeling point queries)
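The 4GB flash and 16MB memory in this configuration are consistent with sizing memory as the square root of the SSD cache size in pages. The sketch below checks that arithmetic; the 64KB page size and the helper name `memory_for_ssd` are assumptions of this example and are not stated on the slide.

```python
import math

PAGE_SIZE = 64 * 1024  # assumed page size in bytes (not given on the slide)


def memory_for_ssd(ssd_bytes, page_size=PAGE_SIZE):
    """Return (M in pages, M in bytes), assuming M = sqrt(number of SSD pages)."""
    ssd_pages = ssd_bytes // page_size
    m_pages = math.isqrt(ssd_pages)  # integer square root
    return m_pages, m_pages * page_size


m_pages, m_bytes = memory_for_ssd(4 * 1024**3)  # 4GB of flash for cached updates
print(m_pages, "pages,", m_bytes // 1024**2, "MB")
```

Under these assumptions a 4GB update cache spans 65,536 pages, so M is 256 pages, or 16MB.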

  22. TPC-H replay experiment
      [Figure: execution time in seconds (axis 0 to 2000, one clipped bar labeled 3537s) for queries q1 to q16, q18, q19, q21 and q22; bars for query w/o updates, query w/ in-place updates, and query w/ MaSM updates]
      • Replay TPC-H disk traces recorded from a commercial row store; random online updates
      Queries with MaSM see less than 1% overhead!

  24. Update performance
      [Figure: update rate (upd/s), axis 0 to 14000]
        Configuration      Update rate (upd/s)
        in-place updates   48
        MaSM 2GB SSD       3.5K
        MaSM 4GB SSD       6K
        MaSM 8GB SSD       12K
      Efficient usage of a few GB of flash can increase update rate up to 258x!
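The speedups implied by the bar labels can be checked with a few lines of Python. The pairing of labels to bars (48 for in-place updates; 3.5K, 6K and 12K for the 2GB, 4GB and 8GB SSD configurations) is an assumption read off the chart order, and the labels appear to be rounded, so the ratios only approximate the 258x quoted on the slide.

```python
# Update rates in updates per second, as read from the (rounded) bar labels.
rates = {
    "in-place updates": 48,
    "MaSM 2GB SSD": 3500,
    "MaSM 4GB SSD": 6000,
    "MaSM 8GB SSD": 12000,
}

baseline = rates["in-place updates"]
speedups = {name: rate / baseline for name, rate in rates.items()}

for name, factor in speedups.items():
    print(f"{name:18s} {factor:6.1f}x")
```

With these rounded inputs the 8GB configuration comes out at 250x over in-place updates, in line with the slide's "up to 258x".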

  26. To sum up
      MaSM enables online updates in DW
      • Negligible query overhead (less than 1% for TPC-H)
      • Supports a high update rate (up to 12K updates/s)
      • Tunable memory footprint vs SSD wear
      • Low migration cost (one-time 2.2x)
      • SSD-friendly behavior
        – Limited number of writes per update
        – No random writes on SSD
      • Easy DBMS integration
      • Ensures ACID properties
        Update approach          Freshness  Performance  Low memory overhead
        Batched                  ✗          ✓            ✓
        In place                 ✓          ✗            ✓
        In-memory differential   ✓          ✓            ✗
        MaSM and SSD             ✓          ✓            ✓
      [Figure: TPC-H queries (on avg), normalized execution time (axis 0 to 2.5) for "Query only", "Query w/ MaSM updates", "Query only + Updates only", and "Ideal"]
      Thank you!
