DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY - - PowerPoint PPT Presentation

data analytics using deep learning
SMART_READER_LITE
LIVE PREVIEW

DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY - - PowerPoint PPT Presentation

DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY ARULRAJ L E C T U R E # 0 8 : Q U E R Y E X E C U T I O N administrivia Reminders Sign up for discussion slots on Thursday Proposal presentations on next Wednesday


slide-1
SLIDE 1

DATA ANALYTICS USING DEEP LEARNING

GT 8803 // FALL 2019 // JOY ARULRAJ

L E C T U R E # 0 8 : Q U E R Y E X E C U T I O N

slide-2
SLIDE 2

GT 8803 // Fall 2019

administrivia

  • Reminders

– Sign up for discussion slots on Thursday – Proposal presentations on next Wednesday – 2 page document + 5 minute presentation – Assignment 1 due on Wednesday

2

slide-3
SLIDE 3

GT 8803 // Fall 2019

LAST CLASS

  • Storage models: NSM, DSM, and FSM

3

1 2 Georgia Tech 15000 Wisconsin 30000 Atlanta Madison ID University Enrollment City 3 Carnegie Mellon 4 UC Berkeley 6000 30000 Pittsburgh Berkeley

FLEXIBLE STORAGE MODEL

slide-4
SLIDE 4

GT 8803 // Fall 2019

LAST CLASS

  • Compression

– Zone maps – Dictionary encoding

4

Original Data

val 100 200 300 400 400

slide-5
SLIDE 5

GT 8803 // Fall 2019

LAST CLASS

  • Compression

– Zone maps – Dictionary encoding

5

Zone Map

val 100 400 280 1400 type MIN MAX AVG SUM 5 COUNT

Original Data

val 100 200 300 400 400

slide-6
SLIDE 6

GT 8803 // Fall 2019

LAST CLASS

  • Compression

– Zone maps – Dictionary encoding

6

Zone Map

val 100 400 280 1400 type MIN MAX AVG SUM 5 COUNT

Original Data

val 100 200 300 400 400

SELECT * FROM table WHERE val > 600

slide-7
SLIDE 7

GT 8803 // Fall 2019

LAST CLASS

  • Visual Storage Engine

– Convolutional Auto Encoder

7

ENCODER (CONVOLUTIONS) DECODER (DECONVOLUTIONS) COMPRESSED DATA

slide-8
SLIDE 8

GT 8803 // Fall 2019

TODAY’s AGENDA

  • Query Processing Models
  • Access Methods
  • Expression Evaluation
  • Visual Query Execution Engine

8

slide-9
SLIDE 9

GT 8803 // Fall 2018

QUERY PROCESSING MODELS

9

slide-10
SLIDE 10

GT 8803 // Fall 2019

ANATOMY OF A DATABASE SYSTEM

Connection Manager + Admission Control Query Parser Query Optimizer Query Executor Lock Manager (Concurrency Control) Access Methods (or Indexes) Buffer Pool Manager Log Manager Memory Manager + Disk Manager Networking Manager

10

Query Transactional Storage Manager Query Processor Shared Utilities Process Manager

Source: Anatomy of a Database System

slide-11
SLIDE 11

GT 8803 // Fall 2019

QUERY PLAN

  • The operators are arranged in a tree.

– Data flows from the leaves toward the root. – Output of the root node is the result of the query.

11

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

slide-12
SLIDE 12

GT 8803 // Fall 2019

QUERY PROCESSING MODELS

  • A DBMS's processing model defines how the

system executes a query plan.

– Different trade-offs for different workloads. – Top-down or Bottom-up execution – Determines what data is sent between operators

  • Three approaches:

– Tuple-at-a-time Model – Operator-at-a-time Model – Vector-at-a-time Model

12

slide-13
SLIDE 13

GT 8803 // Fall 2019

#1: TUPLE-AT-A-TIME MODEL

  • Each query plan operator implements a next

function.

– On each invocation, the operator returns either a single tuple or a null marker if there are no more tuples. – The operator implements a loop that calls next on its children to retrieve their tuples and then process them.

  • Top-down plan processing.
  • Also called Iterator / Volcano / Pipeline model.

13

slide-14
SLIDE 14

GT 8803 // Fall 2018

#1: TUPLE-AT-A-TIME MODEL

14

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

slide-15
SLIDE 15

GT 8803 // Fall 2018

#1: TUPLE-AT-A-TIME MODEL

15

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

for t in A: emit(t) for t1 in left.Next(): buildHashTable(t1) for t2 in right.Next(): if probe(t2): emit(t1⨝t2) for t in child.Next(): emit(projection(t)) for t in child.Next(): if evalPred(t): emit(t) for t in B: emit(t)

slide-16
SLIDE 16

GT 8803 // Fall 2018

#1: TUPLE-AT-A-TIME MODEL

16

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

for t in A: emit(t) for t1 in left.Next(): buildHashTable(t1) for t2 in right.Next(): if probe(t2): emit(t1⨝t2) for t in child.Next(): emit(projection(t)) for t in child.Next(): if evalPred(t): emit(t) for t in B: emit(t)

slide-17
SLIDE 17

GT 8803 // Fall 2018

#1: TUPLE-AT-A-TIME MODEL

17

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

for t in A: emit(t) for t1 in left.Next(): buildHashTable(t1) for t2 in right.Next(): if probe(t2): emit(t1⨝t2) for t in child.Next(): emit(projection(t)) for t in child.Next(): if evalPred(t): emit(t) for t in B: emit(t)

1

slide-18
SLIDE 18

GT 8803 // Fall 2018

#1: TUPLE-AT-A-TIME MODEL

18

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

for t in A: emit(t) for t1 in left.Next(): buildHashTable(t1) for t2 in right.Next(): if probe(t2): emit(t1⨝t2) for t in child.Next(): emit(projection(t)) for t in child.Next(): if evalPred(t): emit(t) for t in B: emit(t)

1

slide-19
SLIDE 19

GT 8803 // Fall 2018

#1: TUPLE-AT-A-TIME MODEL

19

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

for t in A: emit(t) for t1 in left.Next(): buildHashTable(t1) for t2 in right.Next(): if probe(t2): emit(t1⨝t2) for t in child.Next(): emit(projection(t)) for t in child.Next(): if evalPred(t): emit(t) for t in B: emit(t)

1 2

slide-20
SLIDE 20

GT 8803 // Fall 2018

#1: TUPLE-AT-A-TIME MODEL

20

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

for t in A: emit(t) for t1 in left.Next(): buildHashTable(t1) for t2 in right.Next(): if probe(t2): emit(t1⨝t2) for t in child.Next(): emit(projection(t)) for t in child.Next(): if evalPred(t): emit(t) for t in B: emit(t)

1 2 3

slide-21
SLIDE 21

GT 8803 // Fall 2018

#1: TUPLE-AT-A-TIME MODEL

21

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

for t in A: emit(t) for t1 in left.Next(): buildHashTable(t1) for t2 in right.Next(): if probe(t2): emit(t1⨝t2) for t in child.Next(): emit(projection(t)) for t in child.Next(): if evalPred(t): emit(t) for t in B: emit(t)

1 2 3 5 4

slide-22
SLIDE 22

GT 8803 // Fall 2019

#1: TUPLE-AT-A-TIME MODEL

  • This is used in almost every DBMS.

– Allows for tuple pipelining.

  • Some operators will block until children

emit all of their tuples.

– Joins, Subqueries, Order By – Known as pipeline breakers

  • Output control works easily with this

approach.

– Limit

22

slide-23
SLIDE 23

GT 8803 // Fall 2019

#2: OPERATOR-AT-A-TIME MODEL

  • Each operator processes its input all at once

and then emits its output all at once.

– The operator "materializes" it output as a single result. – The DBMS can push down hints into to avoid scanning too many tuples.

  • Bottom-up plan processing.

23

slide-24
SLIDE 24

GT 8803 // Fall 2018

#2: OPERATOR-AT-A-TIME MODEL

24

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

  • ut = { }

for t in A:

  • ut.add(t)
  • ut = { }

for t1 in left.Output(): buildHashTable(t1) for t2 in right.Output(): if probe(t2): out.add(t1⨝t2)

  • ut = { }

for t in child.Output():

  • ut.add(projection(t))
  • ut = { }

for t in child.Output(): if evalPred(t): out.add(t)

  • ut = { }

for t in B:

  • ut.add(t)
slide-25
SLIDE 25

GT 8803 // Fall 2018

#2: OPERATOR-AT-A-TIME MODEL

25

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

  • ut = { }

for t in A:

  • ut.add(t)
  • ut = { }

for t1 in left.Output(): buildHashTable(t1) for t2 in right.Output(): if probe(t2): out.add(t1⨝t2)

  • ut = { }

for t in child.Output():

  • ut.add(projection(t))
  • ut = { }

for t in child.Output(): if evalPred(t): out.add(t)

  • ut = { }

for t in B:

  • ut.add(t)

1

slide-26
SLIDE 26

GT 8803 // Fall 2018

#2: OPERATOR-AT-A-TIME MODEL

26

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

  • ut = { }

for t in A:

  • ut.add(t)
  • ut = { }

for t1 in left.Output(): buildHashTable(t1) for t2 in right.Output(): if probe(t2): out.add(t1⨝t2)

  • ut = { }

for t in child.Output():

  • ut.add(projection(t))
  • ut = { }

for t in child.Output(): if evalPred(t): out.add(t)

  • ut = { }

for t in B:

  • ut.add(t)

1

slide-27
SLIDE 27

GT 8803 // Fall 2018

#2: OPERATOR-AT-A-TIME MODEL

27

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

  • ut = { }

for t in A:

  • ut.add(t)
  • ut = { }

for t1 in left.Output(): buildHashTable(t1) for t2 in right.Output(): if probe(t2): out.add(t1⨝t2)

  • ut = { }

for t in child.Output():

  • ut.add(projection(t))
  • ut = { }

for t in child.Output(): if evalPred(t): out.add(t)

  • ut = { }

for t in B:

  • ut.add(t)

1 2

slide-28
SLIDE 28

GT 8803 // Fall 2018

#2: OPERATOR-AT-A-TIME MODEL

28

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

  • ut = { }

for t in A:

  • ut.add(t)
  • ut = { }

for t1 in left.Output(): buildHashTable(t1) for t2 in right.Output(): if probe(t2): out.add(t1⨝t2)

  • ut = { }

for t in child.Output():

  • ut.add(projection(t))
  • ut = { }

for t in child.Output(): if evalPred(t): out.add(t)

  • ut = { }

for t in B:

  • ut.add(t)

1 2 3

slide-29
SLIDE 29

GT 8803 // Fall 2018

#2: OPERATOR-AT-A-TIME MODEL

29

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

  • ut = { }

for t in A:

  • ut.add(t)
  • ut = { }

for t1 in left.Output(): buildHashTable(t1) for t2 in right.Output(): if probe(t2): out.add(t1⨝t2)

  • ut = { }

for t in child.Output():

  • ut.add(projection(t))
  • ut = { }

for t in child.Output(): if evalPred(t): out.add(t)

  • ut = { }

for t in B:

  • ut.add(t)

4 1 2 3

slide-30
SLIDE 30

GT 8803 // Fall 2018

#2: OPERATOR-AT-A-TIME MODEL

30

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

  • ut = { }

for t in A:

  • ut.add(t)
  • ut = { }

for t1 in left.Output(): buildHashTable(t1) for t2 in right.Output(): if probe(t2): out.add(t1⨝t2)

  • ut = { }

for t in child.Output():

  • ut.add(projection(t))
  • ut = { }

for t in child.Output(): if evalPred(t): out.add(t)

  • ut = { }

for t in B:

  • ut.add(t)

5 4 1 2 3

slide-31
SLIDE 31

GT 8803 // Fall 2019

#2: OPERATOR-AT-A-TIME MODEL

  • Better for OLTP workloads

– Transactions typically only access a small number of tuples at a time. – Lower execution / coordination overhead.

  • Not good for OLAP queries with large

intermediate results.

31

slide-32
SLIDE 32

GT 8803 // Fall 2019

#3: VECTOR-AT-A-TIME MODEL

  • Like tuple-at-a-time model, each operator

implements a next function.

  • Each operator emits a vector (i.e., batch) of

tuples instead of a single tuple.

– The operator's internal loop processes multiple tuples at a time. – The size of the batch can vary based on hardware

  • r query properties.

32

slide-33
SLIDE 33

GT 8803 // Fall 2018

#3: VECTOR-AT-A-TIME MODEL

33

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

  • ut = { }

for t in A:

  • ut.add(t)

if |out|> n : emit(out)

  • ut = { }

for t1 in left.Next(): buildHashTable(t1) for t2 in right.Next(): if probe(t2): out.add(t1⨝t2) if |out|> n: emit(out)

  • ut = { }

for t in child.Next():

  • ut.add(projection(t))

if |out|> n: emit(out)

  • ut = { }

for t in child.Next(): if evalPred(t): out.add(t) if |out|> n : emit(out)

  • ut = { }

for t in B:

  • ut.add(t)

if |out|> n : emit(out)

slide-34
SLIDE 34

GT 8803 // Fall 2018

#3: VECTOR-AT-A-TIME MODEL

34

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

  • ut = { }

for t in A:

  • ut.add(t)

if |out|> n : emit(out)

  • ut = { }

for t1 in left.Next(): buildHashTable(t1) for t2 in right.Next(): if probe(t2): out.add(t1⨝t2) if |out|> n: emit(out)

  • ut = { }

for t in child.Next():

  • ut.add(projection(t))

if |out|> n: emit(out)

  • ut = { }

for t in child.Next(): if evalPred(t): out.add(t) if |out|> n : emit(out)

1 2 3

  • ut = { }

for t in B:

  • ut.add(t)

if |out|> n : emit(out)

slide-35
SLIDE 35

GT 8803 // Fall 2018

#3: VECTOR-AT-A-TIME MODEL

35

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

  • ut = { }

for t in A:

  • ut.add(t)

if |out|> n : emit(out)

  • ut = { }

for t1 in left.Next(): buildHashTable(t1) for t2 in right.Next(): if probe(t2): out.add(t1⨝t2) if |out|> n: emit(out)

  • ut = { }

for t in child.Next():

  • ut.add(projection(t))

if |out|> n: emit(out)

  • ut = { }

for t in child.Next(): if evalPred(t): out.add(t) if |out|> n : emit(out)

1 2 3

  • ut = { }

for t in B:

  • ut.add(t)

if |out|> n : emit(out)

5 4

slide-36
SLIDE 36

GT 8803 // Fall 2019

#3: VECTOR-AT-A-TIME MODEL

36

slide-37
SLIDE 37

GT 8803 // Fall 2019

#3: VECTOR-AT-A-TIME MODEL

  • Q: What is the target workload for this model:

OLTP or OLAP?

– Reduces number of next function invocations – Works well for OLAP

  • Q: Why is it better than operator-at-a-time

model in traditional disk-centric DBMSs?

– Intermediate tables may not fit in main-memory – Fetching batches of tuples reduces number of page accesses

37

slide-38
SLIDE 38

GT 8803 // Fall 2019

#3: VECTOR-AT-A-TIME MODEL

  • Q: What is the target workload for this model:

OLTP or OLAP?

– Reduces number of next function invocations – Works well for OLAP

  • Q: Why is it better than operator-at-a-time

model in traditional disk-centric DBMSs?

– Intermediate tables may not fit in main-memory – Fetching batches of tuples reduces number of page accesses

38

slide-39
SLIDE 39

GT 8803 // Fall 2019

#3: VECTOR-AT-A-TIME MODEL

  • Q: What is the target workload for this model:

OLTP or OLAP?

– Reduces number of next function invocations – Works well for OLAP

  • Q: Why is it better than operator-at-a-time

model in traditional disk-centric DBMSs?

– Intermediate tables may not fit in main-memory – Fetching batches of tuples reduces number of page accesses

39

slide-40
SLIDE 40

GT 8803 // Fall 2019

#3: VECTOR-AT-A-TIME MODEL

  • Ideal for OLAP queries

– Greatly reduces the number of invocations per

  • perator.

– Allows for operators to use vectorized instructions (SIMD) to process batches of tuples.

40

slide-41
SLIDE 41

GT 8803 // Fall 2018

QUERY PROCESSING MODELS: SUMMARY

41

Vector-at-a-time

→Direction: Top-Down →Emits: Tuple Batch →Target: OLAP

Operator-at-a-time

→Direction: Bottom-Up →Emits: Entire Tuple Set →Target: OLTP

Tuple-at-a-time

→Direction: Top-Down →Emits: Single Tuple →Target: General Purpose

slide-42
SLIDE 42

GT 8803 // Fall 2019

ACCESS METHODS

  • An access method is a way that

the DBMS can access the data stored in a table.

– Not defined in relational algebra – Physical database design

  • Three basic methods:

– Sequential Scan – Index Scan – Multi-index / "Bitmap" Scan

42

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

slide-43
SLIDE 43

GT 8803 // Fall 2019

ACCESS METHODS

  • An access method is a way that

the DBMS can access the data stored in a table.

– Not defined in relational algebra – Physical database design

  • Three basic methods:

– Sequential Scan – Index Scan – Multi-index / "Bitmap" Scan

43

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.value > 100

A B

A.id=B.id value>100 A.id, B.value

s

p

slide-44
SLIDE 44

GT 8803 // Fall 2019

#1: SEQUENTIAL SCAN

  • For each page in the table:

– Retrieve it from the buffer pool. – Iterate over each tuple and check whether to include it.

  • The DBMS maintains an internal cursor that

tracks the last page / slot it examined.

44

for page in table.pages: for t in page.tuples: if evalPred(t): // Do Something!

slide-45
SLIDE 45

GT 8803 // Fall 2019

#1: SEQUENTIAL SCAN

  • This is almost always the worst thing that the

DBMS can do to execute a query.

  • Sequential scan optimizations:

– Prefetching – Parallelization – Buffer Pool Bypass – Zone Maps – Late materialization – Heap clustering

45

slide-46
SLIDE 46

GT 8803 // Fall 2019

#1: SEQUENTIAL SCAN

  • This is almost always the worst thing that the

DBMS can do to execute a query.

  • Sequential scan optimizations:

– Prefetching – Parallelization – Buffer Pool Bypass – Zone Maps – Late materialization – Heap clustering

46

slide-47
SLIDE 47

GT 8803 // Fall 2019

ZONE MAPS

  • Pre-computed aggregates for the attribute

values in a page.

– DBMS checks the zone map first to decide whether it wants to access the page.

47

Original Data

val 100 200 300 400 400

slide-48
SLIDE 48

GT 8803 // Fall 2019

ZONE MAPS

  • Pre-computed aggregates for the attribute

values in a page.

– DBMS checks the zone map first to decide whether it wants to access the page.

48

Zone Map

val 100 400 280 1400 type MIN MAX AVG SUM 5 COUNT

Original Data

val 100 200 300 400 400

slide-49
SLIDE 49

GT 8803 // Fall 2019

ZONE MAPS

  • Pre-computed aggregates for the attribute

values in a page.

– DBMS checks the zone map first to decide whether it wants to access the page.

49

Zone Map

val 100 400 280 1400 type MIN MAX AVG SUM 5 COUNT

Original Data

val 100 200 300 400 400

SELECT * FROM table WHERE val > 600

slide-50
SLIDE 50

GT 8803 // Fall 2019

LATE MATERIALIZATION

  • DSM DBMSs can delay stitching together

tuples until the upper parts of the query plan.

50

slide-51
SLIDE 51

GT 8803 // Fall 2019

LATE MATERIALIZATION

  • DSM DBMSs can delay stitching together

tuples until the upper parts of the query plan.

51

1 2 3

a b c

SELECT AVG(C) FROM foo JOIN bar ON foo.b = bar.b WHERE a > 100

slide-52
SLIDE 52

GT 8803 // Fall 2019

LATE MATERIALIZATION

  • DSM DBMSs can delay stitching together

tuples until the upper parts of the query plan.

52

1 2 3

a b c

SELECT AVG(C) FROM foo JOIN bar ON foo.b = bar.b WHERE a > 100

bar foo

foo.b=bar.b a>100 AVG(c)

s

γ

slide-53
SLIDE 53

GT 8803 // Fall 2019

LATE MATERIALIZATION

  • DSM DBMSs can delay stitching together

tuples until the upper parts of the query plan.

53

1 2 3

a b c

SELECT AVG(C) FROM foo JOIN bar ON foo.b = bar.b WHERE a > 100

bar foo

foo.b=bar.b a>100 AVG(c)

s

γ

slide-54
SLIDE 54

GT 8803 // Fall 2019

LATE MATERIALIZATION

  • DSM DBMSs can delay stitching together

tuples until the upper parts of the query plan.

54

1 2 3

a b c

SELECT AVG(C) FROM foo JOIN bar ON foo.b = bar.b WHERE a > 100

bar foo

foo.b=bar.b a>100 AVG(c)

s

γ

slide-55
SLIDE 55

GT 8803 // Fall 2019

LATE MATERIALIZATION

  • DSM DBMSs can delay stitching together

tuples until the upper parts of the query plan.

55

1 2 3

a b c

SELECT AVG(C) FROM foo JOIN bar ON foo.b = bar.b WHERE a > 100

bar foo

foo.b=bar.b a>100 AVG(c)

s

γ

Offsets

slide-56
SLIDE 56

GT 8803 // Fall 2019

LATE MATERIALIZATION

  • DSM DBMSs can delay stitching together

tuples until the upper parts of the query plan.

56

1 2 3

a b c

SELECT AVG(C) FROM foo JOIN bar ON foo.b = bar.b WHERE a > 100

bar foo

foo.b=bar.b a>100 AVG(c)

s

γ

Offsets Offsets

slide-57
SLIDE 57

GT 8803 // Fall 2019

LATE MATERIALIZATION

  • DSM DBMSs can delay stitching together

tuples until the upper parts of the query plan.

57

1 2 3

a b c

SELECT AVG(C) FROM foo JOIN bar ON foo.b = bar.b WHERE a > 100

bar foo

foo.b=bar.b a>100 AVG(c)

s

γ

Offsets Offsets Result

slide-58
SLIDE 58

GT 8803 // Fall 2019

HEAP CLUSTERING

  • Tuples are sorted in the

heap's pages using the order specified by the clustering index.

  • If the query accesses tuples using the

clustering index's attributes, then the DBMS can jump directly to the pages that it needs.

58

101 102 103 104

slide-59
SLIDE 59

GT 8803 // Fall 2019

HEAP CLUSTERING

  • Tuples are sorted in the

heap's pages using the order specified by the clustering index.

  • If the query accesses tuples using the

clustering index's attributes, then the DBMS can jump directly to the pages that it needs.

59

101 102 103 104

Scan Direction

slide-60
SLIDE 60

GT 8803 // Fall 2019

HEAP CLUSTERING

  • Tuples are sorted in the

heap's pages using the order specified by the clustering index.

  • If the query accesses tuples using the

clustering index's attributes, then the DBMS can jump directly to the pages that it needs.

60

101 102 103 104

Scan Direction

slide-61
SLIDE 61

GT 8803 // Fall 2019

#2: INDEX SCAN

  • The DBMS picks an index to find the tuples

that the query needs.

  • Which index to use depends on:

– What attributes the index contains – What attributes the query references – The attribute's value domains – Predicate composition – Whether the index has unique or non-unique keys

61

slide-62
SLIDE 62

GT 8803 // Fall 2019

#2: INDEX SCAN

  • The DBMS picks an index to find the tuples

that the query needs.

  • Which index to use depends on:

– What attributes the index contains – What attributes the query references – The attribute's value domains – Predicate composition – Whether the index has unique or non-unique keys

62

slide-63
SLIDE 63

GT 8803 // Fall 2019

#2: INDEX SCAN

  • Suppose that we a single table

with 100 tuples and 2 indexes:

– Index #1: age – Index #2: dept

63

SELECT * FROM students WHERE age < 30 AND dept = 'CS' AND country = 'US'

slide-64
SLIDE 64

GT 8803 // Fall 2019

#2: INDEX SCAN

  • Suppose that we a single table

with 100 tuples and 2 indexes:

– Index #1: age – Index #2: dept

64

SELECT * FROM students WHERE age < 30 AND dept = 'CS' AND country = 'US'

There are 99 people under the age of 30 but only 2 people in the CS department. Scenario #1

slide-65
SLIDE 65

GT 8803 // Fall 2019

#2: INDEX SCAN

  • Suppose that we a single table

with 100 tuples and 2 indexes:

– Index #1: age – Index #2: dept

65

SELECT * FROM students WHERE age < 30 AND dept = 'CS' AND country = 'US'

There are 99 people under the age of 30 but only 2 people in the CS department. Scenario #1 There are 99 people in the CS department but

  • nly 2 people under

the age of 30. Scenario #2

slide-66
SLIDE 66

GT 8803 // Fall 2019

#3: MULTI-INDEX SCAN

  • If there are multiple indexes that the DBMS

can use for a query:

– Compute sets of record ids using each matching index. – Combine these sets based on the query's predicates (union vs. intersect). – Retrieve the records and apply any remaining terms.

  • Postgres calls this Bitmap Scan

66

slide-67
SLIDE 67

GT 8803 // Fall 2019

#3: MULTI-INDEX SCAN

  • With an index on age and an index
  • n dept,

– We can retrieve the record ids satisfying age<30 using the first, – Then retrieve the record ids satisfying dept='CS' using the second, – Take their intersection – Retrieve records and check country='US'.

67

SELECT * FROM students WHERE age < 30 AND dept = 'CS' AND country = 'US'

slide-68
SLIDE 68

GT 8803 // Fall 2019

#3: MULTI-INDEX SCAN

  • Set intersection can be done

with bitmaps, hash tables,

  • r Bloom filters.

68

age<30 dept='CS'

SELECT * FROM students WHERE age < 30 AND dept = 'CS' AND country = 'US'

slide-69
SLIDE 69

GT 8803 // Fall 2019

#3: MULTI-INDEX SCAN

  • Set intersection can be done

with bitmaps, hash tables,

  • r Bloom filters.

69

age<30 dept='CS'

record ids

SELECT * FROM students WHERE age < 30 AND dept = 'CS' AND country = 'US'

slide-70
SLIDE 70

GT 8803 // Fall 2019

#3: MULTI-INDEX SCAN

  • Set intersection can be done

with bitmaps, hash tables,

  • r Bloom filters.

70

age<30 dept='CS'

record ids record ids

SELECT * FROM students WHERE age < 30 AND dept = 'CS' AND country = 'US'

slide-71
SLIDE 71

GT 8803 // Fall 2019

#3: MULTI-INDEX SCAN

  • Set intersection can be done

with bitmaps, hash tables,

  • r Bloom filters.

71

age<30 dept='CS'

record ids record ids

country='US'

fetch records

SELECT * FROM students WHERE age < 30 AND dept = 'CS' AND country = 'US'

slide-72
SLIDE 72

GT 8803 // Fall 2019

INDEX SCAN PAGE SORTING

  • Retrieving tuples in the order

that appear in an unclustered index is inefficient.

  • The DBMS can first figure out

all the tuples that it needs and then sort them based on their page id.

72

101 102 103 104

slide-73
SLIDE 73

GT 8803 // Fall 2019

INDEX SCAN PAGE SORTING

  • Retrieving tuples in the order

that appear in an unclustered index is inefficient.

  • The DBMS can first figure out

all the tuples that it needs and then sort them based on their page id.

73

101 102 103 104

Scan Direction

slide-74
SLIDE 74

GT 8803 // Fall 2019

INDEX SCAN PAGE SORTING

  • Retrieving tuples in the order

that appear in an unclustered index is inefficient.

  • The DBMS can first figure out

all the tuples that it needs and then sort them based on their page id.

74

101 102 103 104

Scan Direction

slide-75
SLIDE 75

GT 8803 // Fall 2019

INDEX SCAN PAGE SORTING

  • Retrieving tuples in the order

that appear in an unclustered index is inefficient.

  • The DBMS can first figure out

all the tuples that it needs and then sort them based on their page id.

75

101 102 103 104

Page 102 Page 103 Page 102 Page 104 Page 104 Page 103 Page 102 Page 101 Page 104 Page 103 Page 102 Page 103

Scan Direction

slide-76
SLIDE 76

GT 8803 // Fall 2019

INDEX SCAN PAGE SORTING

  • Retrieving tuples in the order

that appear in an unclustered index is inefficient.

  • The DBMS can first figure out

all the tuples that it needs and then sort them based on their page id.

76

101 102 103 104

Page 102 Page 103 Page 102 Page 104 Page 104 Page 103 Page 102 Page 101 Page 104 Page 103 Page 102 Page 103

Scan Direction

slide-77
SLIDE 77

GT 8803 // Fall 2019

INDEX SCAN PAGE SORTING

  • Retrieving tuples in the order

that appear in an unclustered index is inefficient.

  • The DBMS can first figure out

all the tuples that it needs and then sort them based on their page id.

77

101 102 103 104

Page 102 Page 103 Page 102 Page 104 Page 104 Page 103 Page 102 Page 101 Page 104 Page 103 Page 102 Page 103 Page 102 Page 101 Page 102 Page 102 Page 103 Page 104 Page 103 Page 104 Page 101 Page 102 Page 103 Page 104

Scan Direction

slide-78
SLIDE 78

GT 8803 // Fall 2019

CLUSTERED VS UNCLUSTERED INDEX

78

slide-79
SLIDE 79

GT 8803 // Fall 2019

CLUSTERED VS UNCLUSTERED INDEX

  • Q: What is the difference between a clustered

and unclustered index?

– Clustered: Tuples are stored physically on disk in the same order as the index. Only one per table. – Unclustered: Ordered differently. Values are only pointers to the physical tuples.

79

slide-80
SLIDE 80

GT 8803 // Fall 2018

EXPRESSION EVALUATION

80

slide-81
SLIDE 81

GT 8803 // Fall 2019

EXPRESSION EVALUATION

  • The DBMS represents a

WHERE clause as an expression tree.

81

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.val > 100

slide-82
SLIDE 82

GT 8803 // Fall 2019

EXPRESSION EVALUATION

  • The DBMS represents a

WHERE clause as an expression tree.

82

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.val > 100

slide-83
SLIDE 83

GT 8803 // Fall 2019

EXPRESSION EVALUATION

  • The DBMS represents a

WHERE clause as an expression tree.

83

SELECT A.id, B.value FROM A, B WHERE A.id = B.id AND B.val > 100

Attribute(B.id)

=

Attribute(A.id)

AND >

Attribute(val) Constant(100)

slide-84
SLIDE 84

GT 8803 // Fall 2019

EXPRESSION EVALUATION

  • The nodes in the tree represent different

expression types:

– Comparison Operators (=, <, >, !=) – Logical Operators (AND, OR) – Arithmetic Operators (+, -, *, /, %) – Constant Values – Tuple Attribute References

84

slide-85
SLIDE 85

GT 8803 // Fall 2018

EXPRESSION EVALUATION

85

SELECT * FROM B WHERE B.val = ? + 1

slide-86
SLIDE 86

GT 8803 // Fall 2018

EXPRESSION EVALUATION

86

SELECT * FROM B WHERE B.val = ? + 1

slide-87
SLIDE 87

GT 8803 // Fall 2018

Execution Context

EXPRESSION EVALUATION

87

SELECT * FROM B WHERE B.val = ? + 1

Current Tuple (123, 1000) Query Parameters (int:999) Table Schema B→(int:id, int:val)

Attribute(val) Constant(1) = + Parameter(0)

slide-88
SLIDE 88

GT 8803 // Fall 2018

Execution Context

EXPRESSION EVALUATION

88

SELECT * FROM B WHERE B.val = ? + 1

Current Tuple (123, 1000) Query Parameters (int:999) Table Schema B→(int:id, int:val)

Attribute(val) Constant(1) = + Parameter(0)

slide-89
SLIDE 89

GT 8803 // Fall 2018

Execution Context

EXPRESSION EVALUATION

89

SELECT * FROM B WHERE B.val = ? + 1

Current Tuple (123, 1000) Query Parameters (int:999) Table Schema B→(int:id, int:val)

Attribute(val) Constant(1) = + Parameter(0)

slide-90
SLIDE 90

GT 8803 // Fall 2018

1000

Execution Context

EXPRESSION EVALUATION

90

SELECT * FROM B WHERE B.val = ? + 1

Current Tuple (123, 1000) Query Parameters (int:999) Table Schema B→(int:id, int:val)

Attribute(val) Constant(1) = + Parameter(0)

slide-91
SLIDE 91

GT 8803 // Fall 2018

1000

Execution Context

EXPRESSION EVALUATION

91

SELECT * FROM B WHERE B.val = ? + 1

Current Tuple (123, 1000) Query Parameters (int:999) Table Schema B→(int:id, int:val)

Attribute(val) Constant(1) = + Parameter(0)

slide-92
SLIDE 92

GT 8803 // Fall 2018

1000

Execution Context

EXPRESSION EVALUATION

92

SELECT * FROM B WHERE B.val = ? + 1

Current Tuple (123, 1000) Query Parameters (int:999) Table Schema B→(int:id, int:val)

Attribute(val) Constant(1) = + Parameter(0)

slide-93
SLIDE 93

GT 8803 // Fall 2018

1000 999

Execution Context

EXPRESSION EVALUATION

93

SELECT * FROM B WHERE B.val = ? + 1

Current Tuple (123, 1000) Query Parameters (int:999) Table Schema B→(int:id, int:val)

Attribute(val) Constant(1) = + Parameter(0)

slide-94
SLIDE 94

GT 8803 // Fall 2018

1000 999

Execution Context

EXPRESSION EVALUATION

94

SELECT * FROM B WHERE B.val = ? + 1

Current Tuple (123, 1000) Query Parameters (int:999) Table Schema B→(int:id, int:val)

Attribute(val) Constant(1) = + Parameter(0)

slide-95
SLIDE 95

GT 8803 // Fall 2018

1000 999 1

Execution Context

EXPRESSION EVALUATION

95

SELECT * FROM B WHERE B.val = ? + 1

Current Tuple (123, 1000) Query Parameters (int:999) Table Schema B→(int:id, int:val)

Attribute(val) Constant(1) = + Parameter(0)

slide-96
SLIDE 96

GT 8803 // Fall 2018

1000 999 1 true 1000

Execution Context

EXPRESSION EVALUATION

96

SELECT * FROM B WHERE B.val = ? + 1

Current Tuple (123, 1000) Query Parameters (int:999) Table Schema B→(int:id, int:val)

Attribute(val) Constant(1) = + Parameter(0)

slide-97
SLIDE 97

GT 8803 // Fall 2019

TREE VS STRING REPRESENTATION

97

slide-98
SLIDE 98

GT 8803 // Fall 2019

TREE VS STRING REPRESENTATION

  • Q: Why are queries and expressions

represented as a tree as opposed to a string?

– Structural elements simplify manipulation as

  • pposed to a string representation

– Works well for complex recursive structures (e.g., sub-queries, nested expressions)

98

slide-99
SLIDE 99

GT 8803 // Fall 2019

SUMMARY

  • The same query plan be executed in multiple

ways.

  • (Most) DBMSs will want to use an index scan

as much as possible.

  • Expression trees are flexible but slow.

99

slide-100
SLIDE 100

GT 8803 // Fall 2018

VISUAL QUERY EXECUTION ENGINE

100

slide-101
SLIDE 101

GT 8803 // Fall 2019

TABLE

  • BLAZEIT

– Each table corresponds to a video – Each tuple corresponds to a detected object – Each table should correspond to an unique entity – Object re-identification

101

Object ID Frame ID Class Mask Features Content 1 1 Bus Mask [F1, F2] Content

slide-102
SLIDE 102

GT 8803 // Fall 2019

TABLE

  • Prior visual query engines

– Assume that data is already populated (i.e., data created externally typically by humans). – In contrast, BLAZEIT automatically populates the data using an off-the-shelf classifier

  • Late materialization depending on query

– Similar to an un-materialized view. – This laziness enables several optimizations

102

slide-103
SLIDE 103

GT 8803 // Fall 2019

EARLY VS LATE MATERIALIZATION

  • Early materialization is inefficient

– Ad-hoc queries: Pre-computing all possible features would be expensive. Wasteful for ad-hoc queries since many of the columns with extracted features may never be used. – Online queries: For queries on live newscasts or broadcast games, it could be faster to execute the queries and classifier directly on the live data.

103

slide-104
SLIDE 104

GT 8803 // Fall 2019

QUERY EXAMPLE

  • Imagine having to narrow down a video of

passing vehicles captured on a traffic camera to red SUVs that are exceeding the speed limit

104

SELECT cameraID, frameID, 𝒟1(ℱ1(vehBox)) AS vehType, 𝒟2(ℱ2(vehBox)) AS vehColor FROM (PROCESS inputVideo PRODUCE cameraID, frameID, vehBox USING VehDetector) WHERE vehType=SUV ∧ vehColor=red;

slide-105
SLIDE 105

GT 8803 // Fall 2019

#1: FILTERING

  • To speed up this query, we can train a filtering

classifier that drops frames that are unlikely to satisfy the predicate

105

OLD PLAN NEW PLAN

slide-106
SLIDE 106

GT 8803 // Fall 2019

#1: FILTERING

  • A filtering classifier must be much faster

relative to the ML model that it bypasses.

– It must filter out a large portion of the input and have few false negatives – Also known as a probabilistic predicate or specialized model

106

slide-107
SLIDE 107

GT 8803 // Fall 2019

#1: FILTERING

  • Different classifiers are appropriate for

different inputs and predicates

– Linear SVMs – Deep neural networks

  • Content-based filtering

– Based on low-level visual features (e.g., avg. color) – If an analyst were to query for “red buses”, we could filter the video to have a certain number of red pixels, or a certain level of redness.

107

slide-108
SLIDE 108

GT 8803 // Fall 2019

#1: FILTERING

  • Temporal filtering

– Filtering based on temporal cues – Example: Analyst may want to find buses in the scene for at least K frames. In this case, sub-sample the video at a rate of (K−1)/2.

  • Spatial filtering

– Only regions of interest (ROIs) of the scene are considered. – ROI can be used to train smaller faster models

108

slide-109
SLIDE 109

GT 8803 // Fall 2019

#2: SAMPLING

  • When the query contains a tolerated error

rate, we can sample from the video

– Example: User is interested in some statistic over the data, such as average number of cars per frame – Only populate a small number of tuples (or not populate them at all) for faster execution. – In cases where the filtering classifier is accurate enough, we can return the answer directly from the filtering classifier

109

slide-110
SLIDE 110

GT 8803 // Fall 2019

#2: SAMPLING

  • In a cardinality-limited filtering query, user is

typically interested in a rare event

– Example: A clip containing at least one bus and five cars – Intuition is to bias the search towards regions of the video that likely contain the event – Train a filtering classifier to bias which frames to sample.

110

slide-111
SLIDE 111

GT 8803 // Fall 2019

QUERY EXECUTION: SUMMARY

  • Key Optimizations

– Filtering, Sampling

  • Query processing model

– Tuple-at-a-time

  • Access methods

– Sequential scan with a complex predicate – Index scan: Filtering classifier (e.g., vehType) – Multi-Index scan: Spatial access (e.g., R-trees)

111

slide-112
SLIDE 112

GT 8803 // Fall 2019

NEXT LECTURE

  • Given an error budget, how to select between

different filtering classifiers?

– Query optimization

112