How to Win a Hot Dog Eating Contest: Incremental View Maintenance with Batch Updates



SLIDE 1

How to Win a Hot Dog Eating Contest: Incremental View Maintenance with Batch Updates

Milos Nikolic, Mohammad Dashti, Christoph Koch DATA lab, EPFL SIGMOD, 28th June 2016

SLIDE 2

REALTIME APPLICATIONS


Web Analytics Sensor Networks Cloud Monitoring

[Diagram: events from continuously arriving data feed a decision-support runtime engine, which maintains continuously evaluated views that drive actions.]

SLIDE 3

REALTIME SYSTEMS: REQUIREMENTS


LOW-LATENCY PROCESSING: incremental view maintenance, Q(D + ΔD) = Q(D) + ΔQ(D, ΔD)

SCALABLE PROCESSING: synchronous execution model

COMPLEX CONTINUOUS QUERIES: SQL queries (with nested aggregates); no window semantics
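As a toy illustration of the IVM equation above (our own sketch, not from the talk): for Q(D) = SUM(D), the delta query depends only on the update batch, so a refresh costs O(|ΔD|) instead of the O(|D|) of full re-evaluation.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Full evaluation: Q(D) = SUM(D), cost O(|D|).
int evaluate(const std::vector<int>& d) {
    return std::accumulate(d.begin(), d.end(), 0);
}

// Delta query: for SUM, dQ(D, dD) = SUM(dD), cost O(|dD|);
// the old database D is not needed at all.
int delta(const std::vector<int>& dd) {
    return std::accumulate(dd.begin(), dd.end(), 0);
}

// Incremental refresh: Q(D + dD) = Q(D) + dQ(D, dD).
int refresh(int q, const std::vector<int>& dd) {
    return q + delta(dd);
}
```

For example, refresh(evaluate({1, 2, 3}), {4, 5}) and evaluate({1, 2, 3, 4, 5}) both yield 15. For queries with joins or nested aggregates the delta still pays off, but it is no longer independent of the old database.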

SLIDE 4

IN THIS TALK


Q1: How does the size of an update batch affect the performance of incremental computation?
Q2 (idea): How can we achieve efficient distributed incremental computation?

SLIDE 5

PROBLEM: DBMSs and stream engines with classical IVM can have poor performance on fast, long-lived data.
OUR APPROACH: compilation of SQL queries into incremental engines, via recursive IVM and code generation (C++, Scala, Spark).
PERFORMANCE: million view refreshes/sec for single-tuple updates.

HIGH-PERFORMANCE INCREMENTAL COMPUTATION

SLIDE 6

Relations: R(A,B), S(B,C)
Q := SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B

Delta for an update batch ΔR: substitute ΔR for R in the plan, giving
  ΔRQ := SELECT SUM(ΔR.A * S.C) FROM ΔR, S WHERE ΔR.B = S.B
Optimized delta ΔRQ: push the aggregates below the join, giving
  ΔRQ := SUM(L*R) over (SUM(A) GROUP BY B from ΔR) joined on B with (SUM(C) GROUP BY B from S)
The optimized delta is then used to update Q.
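The optimized delta for this query can be sketched in C++ (our own naming, not DBToaster's generated code): ΔR is pre-aggregated by B and joined with the per-B aggregate of S, so the delta touches only the batch and one map entry per distinct key, never re-joining the full R and S.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

using Tuples = std::vector<std::pair<int, int>>;  // (value, B)

// SUM(value) GROUP BY B over a relation; for S this is
// SUM(C) GROUP BY B, computed once and reused for every batch.
std::map<int, int> groupSum(const Tuples& rel) {
    std::map<int, int> agg;
    for (auto [v, b] : rel) agg[b] += v;
    return agg;
}

// Optimized delta: SUM over B of SUM(dR.A per B) * SUM(S.C per B),
// i.e. SUM(L*R) over the two pre-aggregated sides joined on B.
int deltaQ(const Tuples& dR, const std::map<int, int>& mS) {
    int dq = 0;
    for (auto [b, sumA] : groupSum(dR)) {
        auto it = mS.find(b);
        if (it != mS.end()) dq += sumA * it->second;
    }
    return dq;
}
```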

SLIDE 7

Relations: R(A,B), S(B,C)
Q := SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B

Pre-compute the per-key aggregates as materialized views:
  mR := SUM(A) GROUP BY B over R
  mS := SUM(C) GROUP BY B over S
Optimized delta ΔRQ: SUM(L*R) over (SUM(A) GROUP BY B from ΔR) joined on B with mS; updates Q.
Optimized delta ΔSQ: SUM(L*R) over (SUM(C) GROUP BY B from ΔS) joined on B with mR; updates Q.

SLIDE 8


ON UPDATE R BY ΔR:
  // Pre-aggregate batch
  tmp[B] := SELECT B, SUM(A) FROM ΔR GROUP BY B
  // Update Q
  Q += SELECT SUM(tmp.V * mS.V) FROM tmp, mS WHERE tmp.B = mS.B
  // Update mR
  mR[B] += SELECT * FROM tmp

Common delta expressions: the batch pre-aggregation SUM(A) GROUP BY B over ΔR appears both in the update to Q (joined with mS on B) and in the update to mR, so it is computed once into tmp and reused.

SLIDE 9


The trigger from the previous slide, compiled to generated code:

void onUpdateR(List<T> dR) {
  // Pre-aggregate batch
  HashMap<int,int> tmp;
  foreach (dA,dB) in dR
    tmp[dB] += dA;
  // Update Q (of type int)
  foreach (k,v) in tmp
    Q += v * mS[k];
  // Update mR
  foreach (k,v) in tmp
    mR[k] += v;
}
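Fleshing out this pseudocode into compilable C++ gives the following sketch (our own names and types; the actual DBToaster-generated code differs), including the symmetric trigger for updates to S:

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Batch triggers for Q = SUM(R.A * S.C) WHERE R.B = S.B.
// mR and mS are the materialized per-B aggregates of R and S.
struct QueryEngine {
    long long Q = 0;
    std::unordered_map<int, long long> mR, mS;

    void onUpdateR(const std::vector<std::pair<int, int>>& dR) {  // (A, B)
        // Pre-aggregate the batch: tmp[B] = SUM(A)
        std::unordered_map<int, long long> tmp;
        for (auto [a, b] : dR) tmp[b] += a;
        for (auto [b, v] : tmp) {
            auto it = mS.find(b);
            if (it != mS.end()) Q += v * it->second;  // update Q
            mR[b] += v;                               // update mR
        }
    }

    void onUpdateS(const std::vector<std::pair<int, int>>& dS) {  // (C, B)
        std::unordered_map<int, long long> tmp;
        for (auto [c, b] : dS) tmp[b] += c;
        for (auto [b, v] : tmp) {
            auto it = mR.find(b);
            if (it != mR.end()) Q += v * it->second;  // update Q
            mS[b] += v;                               // update mS
        }
    }
};
```

Both triggers share one shape: pre-aggregate the batch once, then reuse that aggregate to refresh Q against the other relation's materialized view and to maintain this relation's own view.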

SLIDE 10


CODE SPECIALIZATION: primitive-type parameters, no intermediate maps, loop elimination, partial evaluation, inlining.

BASELINE (batch trigger):

void onUpdateR(List<T> dR) {
  // Pre-aggregate batch
  HashMap<int,int> tmp;
  foreach (dA,dB) in dR
    tmp[dB] += dA;
  // Update Q (of type int)
  foreach (k,v) in tmp
    Q += v * mS[k];
  // Update mR
  foreach (k,v) in tmp
    mR[k] += v;
}

SPECIALIZED (single-tuple trigger):

void onUpdateR(int dA, int dB) {
  Q += dA * mS[dB];
  mR[dB] += dA;
}

SLIDE 11

SINGLE-TUPLE VS. BATCH IVM


[Plot: normalized throughput (0.0 to 1.6) for TPC-H Q3 and Q9 at batch sizes BS = 1, 10, 100, 1K, 10K, 100K, against a single-tuple baseline. Setup: TPC-H, 10GB stream, batch size = 1…100,000, C++.]

MAIN RESULTS:
1) Best performance with medium batch sizes (= bite sizes)
2) Single-tuple processing is faster for 5 queries; 7 queries are within 20% of best-batch performance
3) Batch pre-aggregation can enable cheaper maintenance
4) Orders of magnitude faster than a DBMS

SLIDE 12

DISTRIBUTED IVM


[Diagram: a local IVM program, shown as statements 1–7 grouped into ON UPDATE R and ON UPDATE S triggers.]

DESIGN CHOICE 1: compile local programs into distributed programs.
CHALLENGE: dependencies among statements prevent arbitrary re-orderings.
DESIGN CHOICE 2: synchronous execution model (on top of Spark).

SLIDE 13

OUR APPROACH

LOCATION TAGS (LOCAL, PARTITIONED BY KEY, RANDOM): annotate each node in the query plan with location info.
LOCATION TRANSFORMERS (REPARTITION, GATHER, SCATTER): insert communication operations into the query plan to preserve query semantics.
HOLISTIC OPTIMIZATION: minimize network cost.
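One way to picture location transformers (an illustrative sketch of ours, not the paper's implementation): every plan node carries a location tag, and a transformer inserts a REPARTITION exchange whenever an operator needs its input partitioned by a key it is not already partitioned on.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Location tags from the slide: LOCAL, PARTITIONED BY KEY, RANDOM.
enum class Loc { Local, PartitionedByKey, Random };

// Hypothetical plan node: an operator label plus a location tag.
struct PlanNode {
    std::string op;  // e.g. "Scan(R)", "Repartition(B)"
    Loc loc = Loc::Local;
    std::shared_ptr<PlanNode> child;
};

// Location transformer: if the input is not already partitioned by
// the required key, wrap it in a REPARTITION exchange so that, e.g.,
// a join on that key sees co-partitioned inputs.
std::shared_ptr<PlanNode> ensurePartitioned(std::shared_ptr<PlanNode> n,
                                            const std::string& key) {
    if (n->loc == Loc::PartitionedByKey) return n;  // already placed
    auto ex = std::make_shared<PlanNode>();
    ex->op = "Repartition(" + key + ")";
    ex->loc = Loc::PartitionedByKey;
    ex->child = n;
    return ex;
}
```

The holistic optimization step would then choose tags for all nodes so that as few such exchanges as possible are inserted, minimizing network cost.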

SLIDE 14

CONCLUSION

Much more in the paper:

  • Single-tuple vs. batch incremental processing (single-tuple can be better!), plus more experiments
  • Distributed IVM (plus an optimization framework)
  • IVM of queries with nested aggregates
  • Code and data-structure specialization

Download: http://www.dbtoaster.org