How to Win a Hot Dog Eating Contest: Incremental View Maintenance with Batch Updates
Milos Nikolic, Mohammad Dashti, Christoph Koch DATA lab, EPFL SIGMOD, 28th June 2016
REAL-TIME APPLICATIONS: Web Analytics, Sensor Networks, Cloud Monitoring
DECISION SUPPORT RUNTIME ENGINE: continuously arriving data (EVENTS) feeds continuously evaluated views, which drive ACTIONS.
Incremental view maintenance: Q(D + ∆D) = Q(D) + ∆Q(D, ∆D)
PROBLEM: DBMSes and stream engines with classical IVM can have poor performance on fast, long-lived data
OUR APPROACH: compilation of SQL queries into incremental engines
PERFORMANCE: a million view refreshes/sec for single-tuple updates
Recursive IVM, then Code Generation (C++, Scala, Spark)
Relations: R(A,B), S(B,C)
Q := SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B

[Figure: delta for an update ΔR. The batch ΔR is pre-aggregated as SUM(A) GROUP BY B, joined on B with S aggregated as SUM(C) GROUP BY B, and the product SUM(L*R) of the two sides gives the change to Q = SUM(A*C), which updates Q.]
[Figure: with materialized views mR := SELECT B, SUM(A) FROM R GROUP BY B and mS := SELECT B, SUM(C) FROM S GROUP BY B pre-computed, an update ΔR (or ΔS) is pre-aggregated and joined with the pre-computed mS (or mR) to update Q, avoiding any scan of the base relations.]
ON UPDATE R BY ΔR:
    // Pre-aggregate batch
    tmp[B] := SELECT B, SUM(A) FROM ΔR GROUP BY B
    // Update Q
    Q += SELECT SUM(tmp.V * mS.V) FROM tmp, mS WHERE tmp.B = mS.B
    // Update mR
    mR[B] += SELECT * FROM tmp
[Figure: the pre-aggregate tmp := SELECT B, SUM(A) FROM ΔR GROUP BY B is a common delta expression: it is computed once and reused both to update Q (joined with mS) and to update mR.]
GENERATED CODE:

void onUpdateR(List<T> dR) {
  // Pre-aggregate batch
  HashMap<int,int> tmp;
  foreach (dA, dB) in dR
    tmp[dB] += dA;
  // Update Q (of type int)
  foreach (k, v) in tmp
    Q += v * mS[k];
  // Update mR
  foreach (k, v) in tmp
    mR[k] += v;
}
CODE SPECIALIZATION:
    Primitive-type parameters
    No intermediate maps
    Loop elimination
    Partial evaluation, inlining
SPECIALIZED (single-tuple):

void onUpdateR(int dA, int dB) {
  Q += dA * mS[dB];
  mR[dB] += dA;
}
BASELINE: the batch version onUpdateR(List<T> dR) shown above.
[Figure: normalized throughput of TPC-H Q3 and Q9 over a 10GB stream (C++), for batch sizes BS = 1, 10, 100, 1K, 10K, 100K, relative to single-tuple processing.]
MAIN RESULTS:
1) Best performance with medium batch sizes (= bite sizes)
2) Single-tuple processing faster for 5 queries; 7 queries within 20% of best-batch performance
3) Batch pre-aggregation can enable cheaper maintenance
4) Orders of magnitude faster than a DBMS
[Figure: a local IVM program is a sequence of statements grouped into triggers, e.g. ON UPDATE R and ON UPDATE S.]
DESIGN CHOICE 1: Local ➞ Distributed programs
CHALLENGE: dependencies among statements prevent arbitrary re-orderings
DESIGN CHOICE 2: synchronous execution model (on top of Spark)

LOCATION TAGS: LOCAL, PARTITIONED BY KEY, RANDOM; annotate each node in the query plan with location info
LOCATION TRANSFORMERS: insert communication
HOLISTIC OPTIMIZATION: minimize network cost
Location transformers: REPARTITION, GATHER, SCATTER