Cheap data analytics using cold storage devices



SLIDE 1

Cheap data analytics using cold storage devices

Renata Borovica-Gajic, Raja Appuswamy, and Anastasia Ailamaki

SLIDE 2

Proliferation of cold data

  • “80% enterprise data is cold with 60% CAGR” [Horison]
  • “cold data: an incredibly valuable piece of the analysis pipeline” [Intel]

Cold Storage Devices (CSD) to the rescue

  • PB of storage at cost ~ tape and latency ~ disks

[Figure: a CSD with disk groups A and B. Labels: active disks; power one disk; cool one disk; latency ~ 10 ms; latency ~ 10 secs]

SLIDE 3

CSD in the storage tiering hierarchy

[Figure: storage tiers ordered by data access latency (ns, µs, ms, sec, min, hour) and cost]

  • Performance tier ($$$): DRAM, SSD, 15k RPM HDD
  • Capacity tier ($$): 7200 RPM HDD
  • Archival tier ($): VTL

SLIDE 4

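CSD in the storage tiering hierarchy

[Figure: storage tiers ordered by data access latency (ns, µs, ms, sec, min, hour) and cost, with CSD placed next to VTL as a candidate COLD tier]

  • Performance tier ($$$): DRAM, SSD, 15k RPM HDD
  • Capacity tier ($$): 7200 RPM HDD
  • Archival tier ($): VTL
  • COLD tier ($): CSD?

Can we shrink tiers to further save cost?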

SLIDE 5

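CSD in the storage tiering hierarchy

[Figure: storage tiers ordered by data access latency (ns, µs, ms, sec, min, hour) and cost, with VTL no longer shown and CSD as the candidate COLD tier]

  • Performance tier ($$$): DRAM, SSD, 15k RPM HDD
  • Capacity tier ($$): 7200 RPM HDD
  • COLD tier ($): CSD?

Can we shrink tiers to further save cost?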

SLIDE 6

CSD in the storage tiering hierarchy

[Figure: two tiers by data access latency (ns, µs, ms, sec, min, hour). Performance tier ($$$): DRAM, SSD, 15k RPM HDD. COLD tier ($): CSD]

[Figure: cost (x1000$) of storing 100TB of data, Trad. 3-tier vs. CSD 2-tier; annotation: $159,641 [Horison, 2015]]

  • CSD offers significant cost savings (40%)
  • But… can we run queries over CSD?

SLIDE 7

Query execution over CSD

[Figure: virtualized enterprise data center. Clients; VM1, VM2, VM3 running DB1, DB2, DB3; network; HDD-based capacity tier and cold storage tier holding A1 A2 A3, C1 C2 C3, B1 B2 B3 B4 as blocks or objects]

Traditional setting

  • Uniform access
  • Control layout
  • Static (pull-based) execution

Pull-based execution will trigger unwarranted group switches

SLIDE 8

What this means for an enterprise datacenter…

[Figure: average execution time (x1000 sec) vs. number of clients (groups), 1 to 5; series: PostgreSQL, Ideal]

[Figure: average execution time (x1000 sec) vs. group switch latency (sec), ticks at 10 and 20; series: CSD, HDD]

Setting: multitenant enterprise datacenter; clients: PostgreSQL, TPCH 50, Q12; CSD: shared; layout: one client per group

Lost opportunity: CSD relegated to archival storage

SLIDE 9

Need hardware-software codesign

  • 1. Data access has to be hardware-driven to minimize group switches
  • 2. Query execution engine has to process data pushed from storage in an out-of-order (unpredictable) manner
  • 3. Reduce data round-trips to cold storage by smart data caching

SLIDE 10

Skipper to the rescue

[Figure: virtualized enterprise data center. VM1, VM2, VM3 run DB1, DB2, DB3 (PostgreSQL with cache management and an MJoin operator over Hash operators on Scan A, Scan B, Scan C). Objects A1, B1, C1, A2 arrive over the network from cold storage, which has an I/O scheduler and an object-group map]

  • Opportunistic execution with multi-way joins
  • Progress driven caching
  • Novel ranking algorithm

SLIDE 11

Multi-way joins in PostgreSQL

Setting: Query AxBxC, A: A1, A2; B: B1, B2; C: C1, C2

[Figure: VM running PostgreSQL with a cache manager and a state manager. MJoin sits above Hash operators on Scan A, Scan B, Scan C. Objects shown arriving: A1, B1, C1, A2; the cache holds A1, A2, C1, B1]

Subplans:

  • Initially pending: A1,B1,C1; A1,B1,C2; A1,B2,C1; A1,B2,C2; A2,B1,C1; A2,B1,C2; A2,B2,C1; A2,B2,C2
  • Executed once A1, B1, C1, A2 are cached: A1,B1,C1; A2,B1,C1
  • Still pending: A1,B1,C2; A1,B2,C1; A1,B2,C2; A2,B1,C2; A2,B2,C1; A2,B2,C2

Enable out-of-order opportunistic execution
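The subplan bookkeeping on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Skipper's PostgreSQL implementation: the function names `enumerate_subplans` and `on_arrival` are hypothetical, the cache is assumed to be unbounded, and "executing" a subplan is reduced to moving it from the pending set to the executed list.

```python
from itertools import product

def enumerate_subplans(partitions):
    """Return every combination of one object per table as a set of tuples."""
    return set(product(*partitions.values()))

def on_arrival(obj, cache, pending, executed):
    """Cache the pushed object, then run every pending subplan that is now complete."""
    cache.add(obj)
    # A subplan is runnable once all of its objects are in the cache.
    ready = sorted(plan for plan in pending if set(plan) <= cache)
    for plan in ready:
        pending.remove(plan)
        executed.append(plan)
    return ready

# Setting from the slide: query AxBxC with two objects per table.
partitions = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]}
pending = enumerate_subplans(partitions)
cache, executed = set(), []

# Objects are pushed by storage in an order the engine does not control.
for obj in ["A1", "B1", "C1", "A2"]:
    on_arrival(obj, cache, pending, executed)

print(executed)      # subplans A1,B1,C1 and A2,B1,C1
print(len(pending))  # 6 subplans remain
```

Because each arrival is checked against the whole pending set, the order in which storage delivers objects does not matter for correctness, only for how early each subplan can run.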

SLIDE 12

Progress driven caching

Setting: Query AxBxC, cache size: 4, cache full, evict a candidate

  • Cache: A1, A2, C1, B1; incoming object: C2
  • Executed: A1,B1,C1; A2,B1,C1
  • Pending: A1,B1,C2; A1,B2,C1; A1,B2,C2; A2,B1,C2; A2,B2,C1; A2,B2,C2

LRU: no progress (drop B1)

New “Max progress” algorithm: cache becomes A1, A2, B1, C2; progress: 2

[Figure: per-object progress table over A1, A2, B1, C1; values shown: 1, 1, 2]

Minimizes data roundtrips, maximizes query progress
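A minimal sketch of a progress-driven eviction choice for the setting above. The scoring rule used here is an assumption made for illustration: each candidate victim is scored by how many pending subplans would be executable from the cache that remains after the swap, and the victim with the highest score is chosen. The slides do not spell out the exact rule, and `choose_victim` and `executable` are hypothetical names.

```python
def executable(cache, pending):
    """Pending subplans whose objects are all present in the cache."""
    return [plan for plan in pending if set(plan) <= cache]

def choose_victim(cache, incoming, pending):
    """Pick the cached object whose eviction leaves the most executable subplans.

    Assumed scoring rule (not taken from the slides): progress is the number
    of pending subplans runnable after swapping the victim for the incoming object.
    """
    best_victim, best_progress = None, -1
    for victim in sorted(cache):  # sorted for a deterministic tie-break
        candidate_cache = (cache - {victim}) | {incoming}
        progress = len(executable(candidate_cache, pending))
        if progress > best_progress:
            best_victim, best_progress = victim, progress
    return best_victim, best_progress

# Setting from the slide: cache of size 4 is full and C2 arrives.
cache = {"A1", "A2", "B1", "C1"}
pending = [
    ("A1", "B1", "C2"), ("A1", "B2", "C1"), ("A1", "B2", "C2"),
    ("A2", "B1", "C2"), ("A2", "B2", "C1"), ("A2", "B2", "C2"),
]

victim, progress = choose_victim(cache, "C2", pending)
print(victim, progress)  # C1 2

# Dropping B1 instead, as in the LRU case on the slide, leaves nothing runnable.
print(len(executable((cache - {"B1"}) | {"C2"}, pending)))  # 0
```

Under this rule the cache becomes A1, A2, B1, C2 with a progress of 2, which agrees with the outcome shown on the slide.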

SLIDE 13

Rank-based scheduling

Which group to switch to?

Group table:

  • G1: O1 (DB1), O3 (DB3)
  • G2: O2 (DB2), O4 (DB4)
  • G3: O5 (DB5)

Requests: O1, O2, O3, O4, O5

  • FCFS: fair but inefficient
  • Max-requests: efficient, not fair (O5 starves)
  • New ranking algorithm: balances efficiency and fairness

Rank(G) = #Requests + ∑Wait

– #Requests provides efficiency
– ∑Wait provides fairness

[Figure: timelines showing the order in which objects are served under each policy]
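The difference between max-requests and the ranking formula can be illustrated with a small simulation. Several details are assumptions made for this sketch rather than facts from the slides: waiting time is counted in group switches, ties are broken by group name, a switch serves all pending requests of the chosen group, and the clients of G1 and G2 immediately re-issue their requests while G3's client asks only once. The names `simulate`, `rank`, and `max_requests` are hypothetical.

```python
def rank(waits):
    """Rank(G) = #Requests + sum of waits (wait measured in group switches here)."""
    return len(waits) + sum(waits)

def max_requests(waits):
    """Baseline policy: score a group only by how many requests it has."""
    return len(waits)

def simulate(score, rounds, reissue):
    """Return the sequence of groups switched to under the given scoring policy."""
    # One wait counter per pending request, following the group table on the slide.
    pending = {"G1": [0, 0], "G2": [0, 0], "G3": [0]}
    schedule = []
    for _ in range(rounds):
        candidates = [g for g in sorted(pending) if pending[g]]
        if not candidates:
            break
        # max() keeps the first maximum, so ties go to the lower group name.
        chosen = max(candidates, key=lambda g: score(pending[g]))
        schedule.append(chosen)
        for g in pending:
            if g == chosen:
                # Served requests leave; busy clients re-issue them with wait 0.
                pending[g] = [0] * len(pending[g]) if g in reissue else []
            else:
                pending[g] = [w + 1 for w in pending[g]]
    return schedule

busy = {"G1", "G2"}
print(simulate(max_requests, 5, busy))  # ['G1', 'G1', 'G1', 'G1', 'G1']
print(simulate(rank, 5, busy))          # ['G1', 'G2', 'G1', 'G2', 'G3']
```

In this toy run the count-only policy never reaches G3, while the accumulated wait term eventually lifts G3's rank above the busier groups.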

SLIDE 14

Skipper in action

[Figure: average exec. time (x1000 sec) vs. number of clients, 1 to 5; series: PostgreSQL, Ideal, Skipper]

[Figure: average exec. time (x1000 sec) vs. group switch latency (sec), ticks at 10, 20, 30, 40; series: PostgreSQL, Ideal, Skipper]

Setting: multitenant enterprise datacenter; clients: TPCH 50, Q12; CSD: shared; layout: one client per group

  • Skipper performs within 20% of HDD-based capacity tier
  • Skipper is resilient to group switch latency

SLIDE 15

Minimizing group switches

[Figure: exec. time breakdown (%), 0% to 100%, for PostgreSQL and Skipper; components: transfer time, switch time, processing]

Setting: multitenant enterprise datacenter; 5 clients: TPCH 50, Q12; CSD: shared; layout: one client per group

Skipper substantially reduces overhead of group switches

SLIDE 16

Conclusions

  • Cold storage can substantially reduce TCO
    – But DBMS performance suffers due to pull-based execution
  • Skipper enables efficient query execution over CSD with
    – Out-of-order execution based on multi-way joins
    – Novel progress based caching policy
    – Rank based I/O scheduling
  • Skipper makes data analytics over CSD as a service possible
    – Providers reduce cost by offloading data to CSD
    – Customers reduce cost by running inexpensive data analytics over CSD

Thank you!