MaSM: Efficient Online Updates in Data Warehouses



slide-1
SLIDE 1

MaSM: Efficient Online Updates in Data Warehouses

Manos Athanassoulis¹, Shimin Chen², Anastasia Ailamaki¹, Phillip Gibbons², Radu Stoica¹

¹EPFL ²Intel Labs

slide-2
SLIDE 2

Freshness vs Performance

  • Data warehouse workload

– Read-only queries (scans)
– Scattered updates
– Difficult to combine efficiently

  • Traditionally two choices

– Freshness: in-place updates
– Performance: batch updates

  • Ideally, zero overhead

[Figure: normalized execution time (axis 0.5 to 2.5) of TPC-H queries (on avg) for "Query only", "Query w/ updates", "Query only + Updates only", and "Ideal"; annotations: Freshness, Performance]

slide-3
SLIDE 3

Freshness vs Performance


Is zero overhead possible?


slide-4
SLIDE 4

Freshness AND Performance

In-Memory Buffered Updates

[Stonebraker et al. ’05] [Heman et al. ’10]

➢ Apply them online
➢ Apply them as differential updates
✗ Large memory overhead
✗ Trade-off: migration overhead for memory footprint

[Figure: normalized migration overhead (axis 1 to 1000) vs in-memory buffer size (16MB to 8GB) when caching updates in memory, compared with the ideal]

To sum up

Update Approach          Freshness  Performance  ↓ mem overhead
Batched                  ✗          ✓            ✓
In place                 ✓          ✗            ✓
In-memory differential   ✓          ✓            ✗

slide-5
SLIDE 5

Freshness AND Performance


Can we have the cake and eat it too?

slide-6
SLIDE 6

Use MaSM!

  • Buffer updates on Flash instead of memory

➢ Flash has larger capacity and a lower price

  • But: Flash-friendly design is important

– Avoid random writes
– Limit total writes

  • e.g. Log-Structured Merge-Tree [O’Neil et al. ’96] incurs a high number of writes per update

[Diagram: incoming updates are buffered on the SSD; data from disks and flash is merged to answer a query]

slide-7
SLIDE 7

Use MaSM!

Let’s think again!


slide-8
SLIDE 8

MaSM core idea

Data (D)

Key  Value
1    V1
2    V2
3    V3
4    V4
5    V5
6    V6
7    V7
8    V8
9    V9

Updates (U), in arrival order

Key  Value  Type
5    V5’    Mod
19   V19’   Mod
1    V1’    Mod
9    N/A    Del
125  V125   Ins
5    V5’’   Mod

Updates sorted by key, duplicates discarded

Key  Value  Type
1    V1’    Mod
5    V5’’   Mod
9    N/A    Del
19   V19’   Mod
125  V125   Ins

Current data
✓ Outer join: D ⟕ U
✓ Keep latest update only

Efficient execution
✓ Discard duplicates
✓ Re-use information for future queries

Sort-Merge Join
✓ Intuitively does both

slide-9
SLIDE 9


MaSM core idea


MaSM merges data with updates using sort-merge join and materializes sorted runs.
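The core idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the tuple layouts, the function names `latest_sorted` and `merge`, and the rule that a `Mod` for a key absent from the data is ignored are all hypothetical choices made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Row = Tuple[int, str]                      # (key, value)
Update = Tuple[int, Optional[str], str]    # (key, value, type: Mod/Del/Ins)

def latest_sorted(updates: Iterable[Update]) -> List[Update]:
    """Keep only the latest update per key, then sort by key."""
    latest: Dict[int, Update] = {}
    for upd in updates:                    # later arrivals overwrite earlier ones
        latest[upd[0]] = upd
    return sorted(latest.values(), key=lambda u: u[0])

def merge(data: Iterable[Row], updates: List[Update]) -> Iterator[Row]:
    """Sort-merge key-sorted data with key-sorted, deduplicated updates."""
    it = iter(updates)
    upd = next(it, None)
    for key, value in data:
        # Updates whose keys sort before this row: only inserts create rows.
        while upd is not None and upd[0] < key:
            if upd[2] == "Ins":
                yield (upd[0], upd[1])
            upd = next(it, None)
        if upd is not None and upd[0] == key:
            if upd[2] != "Del":            # Mod replaces the value, Del drops the row
                yield (key, upd[1])
            upd = next(it, None)
        else:
            yield (key, value)             # no update for this row
    while upd is not None:                 # updates past the end of the data
        if upd[2] == "Ins":
            yield (upd[0], upd[1])
        upd = next(it, None)

data = [(k, f"V{k}") for k in range(1, 10)]
arrivals = [(5, "V5'", "Mod"), (19, "V19'", "Mod"), (1, "V1'", "Mod"),
            (9, None, "Del"), (125, "V125", "Ins"), (5, "V5''", "Mod")]
current = list(merge(data, latest_sorted(arrivals)))
print(current)
```

With the slide's example, key 5 ends up with V5’’ (the later of its two modifications), key 9 disappears, and key 125 is appended. Both inputs are read once in key order, which is what makes the merge compatible with a sequential table scan.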

slide-10
SLIDE 10

Outline

  • Introduction

– Prior work: Differential Updates
– MaSM sneak peek

  • MaSM architecture
  • Evaluation

– Query response time
– Sustained update rate

  • Conclusions

slide-11
SLIDE 11

MaSM in detail

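[Diagram: incoming updates arrive in main memory, which holds M pages, with $M = \sqrt{\lVert SSD \rVert}$; the SSD (e.g. GBs) caches updates; disks (e.g. TBs) hold the main data]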
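As a rough worked example of the memory sizing on this slide, the sketch below assumes the garbled "M=" expression reads M = √‖SSD‖, with both M and ‖SSD‖ counted in pages. The 64KB page size and the function name `masm_memory_pages` are assumptions made for illustration; the slides do not state a page size.

```python
import math

def masm_memory_pages(ssd_bytes: int, page_bytes: int) -> int:
    """Return M = sqrt(||SSD||), where ||SSD|| is the SSD cache size in pages."""
    ssd_pages = ssd_bytes // page_bytes
    return math.isqrt(ssd_pages)           # integer square root (floor)

KiB, MiB, GiB = 2**10, 2**20, 2**30
page = 64 * KiB                            # assumed page size, not from the slides
m = masm_memory_pages(4 * GiB, page)
print(m, "pages =", m * page // MiB, "MiB")
```

Under these assumptions a 4GB flash cache needs 256 pages, i.e. 16MB of memory, which happens to match the "4GB flash for cached updates, 16MB memory" configuration listed on the synthetic-data evaluation slide.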

slide-12
SLIDE 12

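MaSM in detail

[Diagram: an incoming query merges data and updates. A Table Range Scan over the disks (main data, e.g. TBs) is merged with the output of "Merge updates", which combines Run Scans over the SSD (e.g. GBs) with a Mem Scan over main memory. Main memory holds M pages + M pages, with $M = \sqrt{\lVert SSD \rVert}$]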

slide-13
SLIDE 13


MaSM in detail

Merge pages from HDD, SSD and RAM with negligible overhead!
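The "Merge updates" step in the diagram (several Run Scans plus a Mem Scan feeding one sorted stream) can be approximated with a k-way merge. The following is a minimal sketch, assuming each run is already sorted by (key, timestamp) and that a larger timestamp means a newer update; the tuple layout and the name `merge_updates` are hypothetical.

```python
import heapq
from itertools import groupby
from typing import Iterable, Iterator, List, Optional, Tuple

# (key, timestamp, value, type); every run is sorted by (key, timestamp)
Update = Tuple[int, int, Optional[str], str]

def merge_updates(runs: List[Iterable[Update]]) -> Iterator[Update]:
    """K-way merge of sorted update runs, keeping the newest update per key."""
    merged = heapq.merge(*runs, key=lambda u: (u[0], u[1]))
    for _, group in groupby(merged, key=lambda u: u[0]):
        *_, newest = group                 # last item in a group has the largest timestamp
        yield newest

ssd_run_1 = [(1, 3, "V1'", "Mod"), (5, 1, "V5'", "Mod"), (19, 2, "V19'", "Mod")]
ssd_run_2 = [(9, 4, None, "Del"), (125, 5, "V125", "Ins")]
mem_buffer = [(5, 6, "V5''", "Mod")]
for upd in merge_updates([ssd_run_1, ssd_run_2, mem_buffer]):
    print(upd)
```

`heapq.merge` reads each run lazily and sequentially, so only the head of each run needs to be resident at a time. That is the property that lets a small number of memory pages cover many runs on the SSD.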

slide-14
SLIDE 14

Reducing MaSM memory

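[Diagram: an incoming query merges data and updates. A Table Range Scan over the disks (main data, e.g. TBs) is merged with the output of "Merge updates", which combines Run Scans over the SSD (e.g. GBs) with a Mem Scan over main memory. Main memory is split into αM−S pages and S pages; the SSD holds 1-pass runs and 2-pass runs]

α ≤ 2, S ≤ …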

slide-15
SLIDE 15

Reducing MaSM memory


Trade off extra writes for memory size.

slide-16
SLIDE 16

Impact of α on SSD wear

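[Figure: SSD writes per update (y-axis, 1 to 2) vs α (x-axis, 0.2 to 2.0); memory footprint = αM, ranging from 0.2M to 2M]

SSD writes per update: $2 - 0.25\alpha^2$

$f(M) \le \alpha \le 2$; e.g., for M = 1000, $0.2 \le \alpha \le 2$

$M = \sqrt{\lVert SSD \rVert}$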

slide-17
SLIDE 17

Impact of α on SSD wear


10x smaller memory for 2x more writes!
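The trade-off can be checked numerically. The sketch below reads the garbled expression on this slide as 2 − 0.25α² SSD writes per update, and assumes the lower bound is f(M) = 2/∛M. The latter is an assumption chosen because it reproduces the slide's example (M = 1000 gives 0.2); the slide itself only writes f(M). Function names are hypothetical.

```python
def alpha_lower_bound(m_pages: float) -> float:
    """Assumed lower bound f(M) = 2 / cube_root(M) on alpha."""
    return 2.0 / m_pages ** (1.0 / 3.0)

def ssd_writes_per_update(alpha: float) -> float:
    """Assumed reading of the slide's formula: 2 - 0.25 * alpha^2."""
    return 2.0 - 0.25 * alpha ** 2

M = 1000
print("alpha range: %.2f to 2" % alpha_lower_bound(M))
for alpha in (0.2, 0.5, 1.0, 2.0):
    print("alpha=%.1f  memory=%.0f pages  writes/update=%.2f"
          % (alpha, alpha * M, ssd_writes_per_update(alpha)))
```

At α = 2 each update is written about once with a footprint of 2M pages; at α = 0.2 the footprint drops to 0.2M pages while writes rise to about 1.99 per update, consistent with the slide's "10x smaller memory for 2x more writes".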

slide-18
SLIDE 18

Outline

  • Introduction

– Prior work: Differential Updates
– MaSM sneak peek

  • MaSM architecture
  • Evaluation

– Query response time
– Sustained update rate

  • Conclusions

slide-19
SLIDE 19

Experimental setup

  • Dell Precision 690 Workstation

– Intel Xeon Quad (2.33GHz, 8MB L2), 4GB DRAM, Ubuntu Linux, kernel 2.6.24

  • Dedicated SATA disk for main data

– 7200rpm Seagate Barracuda, 77MB/s sequential bandwidth

  • Intel X25-E SSD for caching updates

– 250MB/s sequential read, 170MB/s sequential write bandwidth; 35,000 4KB-sized random reads/second

  • Prototype row store:

– Implemented in-place updates, indexed updates, MaSM


slide-20
SLIDE 20

Query performance on synthetic data

  • MaSM has negligible impact on 10MB or larger scans
  • MaSM with fine-grain index incurs 4% overhead for 4KB ranges (modeling point queries)

100GB main data, 4GB flash for cached updates, 16MB memory

[Figure: normalized time (axis 1 to 4) vs range size (4KB to 100GB) for in-place updates, MaSM w/ coarse-grain index, and MaSM w/ fine-grain index]

slide-21
SLIDE 21

TPC-H replay experiment

  • Replay TPC-H disk traces recorded from a commercial row store; random online updates

[Figure: execution time (s, axis 500 to 2000) for queries q1 to q16, q18, q19, q21, and q22: query w/o updates, query w/ in-place updates, query w/ MaSM updates; one annotation reads 3537s]

slide-22
SLIDE 22

TPC-H replay experiment

Queries with MaSM see less than 1% overhead!


slide-23
SLIDE 23

Update performance

[Figure: update rate (upd/s), axis 2000 to 14000]

Configuration      Update rate (upd/s)
in-place updates   48
MaSM 2GB SSD       3.5K
MaSM 4GB SSD       6K
MaSM 8GB SSD       12K

slide-24
SLIDE 24

Update performance


Efficient usage of a few GB of flash can increase update rate up to 258x!

slide-25
SLIDE 25

To sum up

MaSM enables online updates in DW

  • Negligible query overhead (less than 1% for TPC-H)
  • Supports a high update rate (up to 12K updates/s)
  • Tunable memory footprint vs SSD wear
  • Low migration cost (one-time 2.2x)
  • SSD-friendly behavior

– Limited number of writes per update
– No random writes on SSD

  • Easy DBMS integration
  • Ensures ACID properties

[Figure: normalized execution time (axis 0.5 to 2.5) of TPC-H queries (on avg) for "Query only", "Query w/ updates", "Query only + Updates only", "Ideal", and "MaSM"]

slide-26
SLIDE 26

To sum up


Update Approach          Freshness  Performance  ↓ mem overhead
Batched                  ✗          ✓            ✓
In place                 ✓          ✗            ✓
In-memory differential   ✓          ✓            ✗
MaSM and SSD             ✓          ✓            ✓

Thank you!