data analytics using deep learning
play

DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY - PowerPoint PPT Presentation

DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY ARULRAJ L E C T U R E # 0 8 : Q U E R Y E X E C U T I O N administrivia Reminders Sign up for discussion slots on Thursday Proposal presentations on next Wednesday


  1. #2: OPERATOR-AT-A-TIME MODEL SELECT A.id, B.value out = { } for t in child.Output() : FROM A, B out.add ( projection (t)) WHERE A.id = B.id AND B.value > 100 out = { } p 4 for t 1 in left.Output() : buildHashTable (t 1 ) A.id, B.value for t 2 in right.Output() : if probe (t 2 ): out.add (t 1 ⨝ t 2 ) ⨝ A.id=B.id out = { } 3 for t in child.Output() : s if evalPred (t): out.add (t) value>100 out = { } out = { } 1 2 for t in A : for t in B : A B out.add (t) out.add (t) 29 GT 8803 // Fall 2018

  2. #2: OPERATOR-AT-A-TIME MODEL SELECT A.id, B.value out = { } 5 for t in child.Output() : FROM A, B out.add ( projection (t)) WHERE A.id = B.id AND B.value > 100 out = { } p 4 for t 1 in left.Output() : buildHashTable (t 1 ) A.id, B.value for t 2 in right.Output() : if probe (t 2 ): out.add (t 1 ⨝ t 2 ) ⨝ A.id=B.id out = { } 3 for t in child.Output() : s if evalPred (t): out.add (t) value>100 out = { } out = { } 1 2 for t in A : for t in B : A B out.add (t) out.add (t) 30 GT 8803 // Fall 2018

  3. #2: OPERATOR-AT-A-TIME MODEL • Better for OLTP workloads – Transactions typically only access a small number of tuples at a time. – Lower execution / coordination overhead. • Not good for OLAP queries with large intermediate results. GT 8803 // Fall 2019 31

  4. #3: VECTOR-AT-A-TIME MODEL • Like tuple-at-a-time model, each operator implements a next function. • Each operator emits a vector (i.e., batch ) of tuples instead of a single tuple. – The operator's internal loop processes multiple tuples at a time. – The size of the batch can vary based on hardware or query properties. GT 8803 // Fall 2019 32

  5. #3: VECTOR-AT-A-TIME MODEL out = { } for t in child.Next() : SELECT A.id, B.value out.add ( projection (t)) FROM A, B if | out |> n : emit ( out ) WHERE A.id = B.id out = { } AND B.value > 100 for t 1 in left.Next() : p buildHashTable (t 1 ) for t 2 in right.Next() : A.id, B.value if probe (t 2 ): out.add (t 1 ⨝ t 2 ) if | out |> n : emit ( out ) ⨝ A.id=B.id out = { } for t in child.Next() : s if evalPred (t): out.add (t) value>100 if | out |> n : emit ( out ) out = { } out = { } A B for t in A : for t in B : out.add (t) out.add (t) if | out |> n : emit ( out ) if | out |> n : emit ( out ) 33 GT 8803 // Fall 2018

  6. #3: VECTOR-AT-A-TIME MODEL out = { } 1 for t in child.Next() : SELECT A.id, B.value out.add ( projection (t)) FROM A, B if | out |> n : emit ( out ) WHERE A.id = B.id out = { } AND B.value > 100 2 for t 1 in left.Next() : p buildHashTable (t 1 ) for t 2 in right.Next() : A.id, B.value if probe (t 2 ): out.add (t 1 ⨝ t 2 ) if | out |> n : emit ( out ) ⨝ A.id=B.id out = { } for t in child.Next() : s if evalPred (t): out.add (t) value>100 if | out |> n : emit ( out ) out = { } out = { } 3 A B for t in A : for t in B : out.add (t) out.add (t) if | out |> n : emit ( out ) if | out |> n : emit ( out ) 34 GT 8803 // Fall 2018

  7. #3: VECTOR-AT-A-TIME MODEL out = { } 1 for t in child.Next() : SELECT A.id, B.value out.add ( projection (t)) FROM A, B if | out |> n : emit ( out ) WHERE A.id = B.id out = { } AND B.value > 100 2 for t 1 in left.Next() : p buildHashTable (t 1 ) for t 2 in right.Next() : A.id, B.value if probe (t 2 ): out.add (t 1 ⨝ t 2 ) if | out |> n : emit ( out ) ⨝ A.id=B.id out = { } 4 for t in child.Next() : s if evalPred (t): out.add (t) value>100 if | out |> n : emit ( out ) out = { } out = { } 3 5 A B for t in A : for t in B : out.add (t) out.add (t) if | out |> n : emit ( out ) if | out |> n : emit ( out ) 35 GT 8803 // Fall 2018

  8. #3: VECTOR-AT-A-TIME MODEL GT 8803 // Fall 2019 36

  9. #3: VECTOR-AT-A-TIME MODEL • Q: What is the target workload for this model: OLTP or OLAP? – Reduces number of next function invocations – Works well for OLAP • Q: Why is it better than operator-at-a-time model in traditional disk-centric DBMSs? – Intermediate tables may not fit in main-memory – Fetching batches of tuples reduces number of page accesses GT 8803 // Fall 2019 37

  10. #3: VECTOR-AT-A-TIME MODEL • Q: What is the target workload for this model: OLTP or OLAP? – Reduces number of next function invocations – Works well for OLAP • Q: Why is it better than operator-at-a-time model in traditional disk-centric DBMSs? – Intermediate tables may not fit in main-memory – Fetching batches of tuples reduces number of page accesses GT 8803 // Fall 2019 38

  11. #3: VECTOR-AT-A-TIME MODEL • Q: What is the target workload for this model: OLTP or OLAP? – Reduces number of next function invocations – Works well for OLAP • Q: Why is it better than operator-at-a-time model in traditional disk-centric DBMSs? – Intermediate tables may not fit in main-memory – Fetching batches of tuples reduces number of page accesses GT 8803 // Fall 2019 39

  12. #3: VECTOR-AT-A-TIME MODEL • Ideal for OLAP queries – Greatly reduces the number of invocations per operator. – Allows for operators to use vectorized instructions ( SIMD ) to process batches of tuples. GT 8803 // Fall 2019 40

  13. QUERY PROCESSING MODELS: SUMMARY Tuple-at-a-time → Direction: Top-Down → Emits: Single Tuple → Target: General Purpose Operator-at-a-time Vector-at-a-time → Direction: Bottom-Up → Direction: Top-Down → Emits: Entire Tuple Set → Emits: Tuple Batch → Target: OLTP → Target: OLAP 41 GT 8803 // Fall 2018

  14. ACCESS METHODS SELECT A.id, B.value • An access method is a way that FROM A, B WHERE A.id = B.id the DBMS can access the AND B.value > 100 p data stored in a table. – Not defined in relational algebra A.id, B.value – Physical database design ⨝ A.id=B.id • Three basic methods: s – Sequential Scan value>100 – Index Scan – Multi-index / "Bitmap" Scan A B GT 8803 // Fall 2019 42

  15. ACCESS METHODS SELECT A.id, B.value • An access method is a way that FROM A, B WHERE A.id = B.id the DBMS can access the AND B.value > 100 p data stored in a table. – Not defined in relational algebra A.id, B.value – Physical database design ⨝ A.id=B.id • Three basic methods: s – Sequential Scan value>100 – Index Scan – Multi-index / "Bitmap" Scan A B GT 8803 // Fall 2019 43

  16. #1: SEQUENTIAL SCAN • For each page in the table: for page in table.pages: – Retrieve it from the buffer pool. for t in page.tuples: if evalPred (t): – Iterate over each tuple and // Do Something! check whether to include it. • The DBMS maintains an internal cursor that tracks the last page / slot it examined. GT 8803 // Fall 2019 44

  17. #1: SEQUENTIAL SCAN • This is almost always the worst thing that the DBMS can do to execute a query. • Sequential scan optimizations: – Prefetching – Parallelization – Buffer Pool Bypass – Zone Maps – Late materialization – Heap clustering GT 8803 // Fall 2019 45

  18. #1: SEQUENTIAL SCAN • This is almost always the worst thing that the DBMS can do to execute a query. • Sequential scan optimizations: – Prefetching – Parallelization – Buffer Pool Bypass – Zone Maps – Late materialization – Heap clustering GT 8803 // Fall 2019 46

  19. ZONE MAPS • Pre-computed aggregates for the attribute values in a page. – DBMS checks the zone map first to decide whether it wants to access the page. Original Data val 100 200 300 400 400 GT 8803 // Fall 2019 47

  20. ZONE MAPS • Pre-computed aggregates for the attribute values in a page. – DBMS checks the zone map first to decide whether it wants to access the page. Original Data Zone Map val type val 100 MIN 100 200 MAX 400 300 AVG 280 400 SUM 1400 400 COUNT 5 GT 8803 // Fall 2019 48

  21. ZONE MAPS • Pre-computed aggregates for the attribute values in a page. – DBMS checks the zone map first to decide whether it wants to access the page. Original Data Zone Map val type val SELECT * FROM table 100 MIN 100 WHERE val > 600 200 MAX 400 300 AVG 280 400 SUM 1400 400 COUNT 5 GT 8803 // Fall 2019 49

  22. LATE MATERIALIZATION • DSM DBMSs can delay stitching together tuples until the upper parts of the query plan. GT 8803 // Fall 2019 50

  23. LATE MATERIALIZATION • DSM DBMSs can delay stitching together tuples until the upper parts of the query plan. SELECT AVG (C) FROM foo JOIN bar ON foo.b = bar.b WHERE a > 100 a b c 0 1 2 3 GT 8803 // Fall 2019 51

  24. LATE MATERIALIZATION • DSM DBMSs can delay stitching together tuples until the upper parts of the query plan. γ SELECT AVG (C) FROM foo JOIN bar AVG (c) ON foo.b = bar.b ⨝ WHERE a > 100 foo.b=bar.b s a b c a>100 0 1 bar foo 2 3 GT 8803 // Fall 2019 52

  25. LATE MATERIALIZATION • DSM DBMSs can delay stitching together tuples until the upper parts of the query plan. γ SELECT AVG (C) FROM foo JOIN bar AVG (c) ON foo.b = bar.b ⨝ WHERE a > 100 foo.b=bar.b s a b c a>100 0 1 bar foo 2 3 GT 8803 // Fall 2019 53

  26. LATE MATERIALIZATION • DSM DBMSs can delay stitching together tuples until the upper parts of the query plan. γ SELECT AVG (C) FROM foo JOIN bar AVG (c) ON foo.b = bar.b ⨝ WHERE a > 100 foo.b=bar.b s a b c a>100 0 1 bar foo 2 3 GT 8803 // Fall 2019 54

  27. LATE MATERIALIZATION • DSM DBMSs can delay stitching together tuples until the upper parts of the query plan. γ SELECT AVG (C) FROM foo JOIN bar AVG (c) ON foo.b = bar.b ⨝ WHERE a > 100 foo.b=bar.b s a b c Offsets a>100 0 1 bar foo 2 3 GT 8803 // Fall 2019 55

  28. LATE MATERIALIZATION • DSM DBMSs can delay stitching together tuples until the upper parts of the query plan. γ SELECT AVG (C) FROM foo JOIN bar AVG (c) ON foo.b = bar.b ⨝ WHERE a > 100 foo.b=bar.b Offsets s a b c Offsets a>100 0 1 bar foo 2 3 GT 8803 // Fall 2019 56

  29. LATE MATERIALIZATION • DSM DBMSs can delay stitching together tuples until the upper parts of the query plan. γ SELECT AVG (C) FROM foo JOIN bar AVG (c) Result ON foo.b = bar.b ⨝ WHERE a > 100 foo.b=bar.b Offsets s a b c Offsets a>100 0 1 bar foo 2 3 GT 8803 // Fall 2019 57

  30. HEAP CLUSTERING • Tuples are sorted in the heap's pages using the order specified by the clustering index. 101 102 103 104 • If the query accesses tuples using the clustering index's attributes, then the DBMS can jump directly to the pages that it needs. GT 8803 // Fall 2019 58

  31. HEAP CLUSTERING • Tuples are sorted in the Scan Direction heap's pages using the order specified by the clustering index. 101 102 103 104 • If the query accesses tuples using the clustering index's attributes, then the DBMS can jump directly to the pages that it needs. GT 8803 // Fall 2019 59

  32. HEAP CLUSTERING • Tuples are sorted in the Scan Direction heap's pages using the order specified by the clustering index. 101 102 103 104 • If the query accesses tuples using the clustering index's attributes, then the DBMS can jump directly to the pages that it needs. GT 8803 // Fall 2019 60

  33. #2: INDEX SCAN • The DBMS picks an index to find the tuples that the query needs. • Which index to use depends on: – What attributes the index contains – What attributes the query references – The attribute's value domains – Predicate composition – Whether the index has unique or non-unique keys GT 8803 // Fall 2019 61

  34. #2: INDEX SCAN • The DBMS picks an index to find the tuples that the query needs. • Which index to use depends on: – What attributes the index contains – What attributes the query references – The attribute's value domains – Predicate composition – Whether the index has unique or non-unique keys GT 8803 // Fall 2019 62

  35. #2: INDEX SCAN • Suppose that we a single table SELECT * FROM students WHERE age < 30 with 100 tuples and 2 indexes: AND dept = 'CS' – Index #1: age AND country = 'US' – Index #2: dept GT 8803 // Fall 2019 63

  36. #2: INDEX SCAN • Suppose that we a single table SELECT * FROM students WHERE age < 30 with 100 tuples and 2 indexes: AND dept = 'CS' – Index #1: age AND country = 'US' – Index #2: dept Scenario #1 There are 99 people under the age of 30 but only 2 people in the CS department. GT 8803 // Fall 2019 64

  37. #2: INDEX SCAN • Suppose that we a single table SELECT * FROM students WHERE age < 30 with 100 tuples and 2 indexes: AND dept = 'CS' – Index #1: age AND country = 'US' – Index #2: dept Scenario #1 Scenario #2 There are 99 people There are 99 people in under the age of 30 the CS department but but only 2 people in only 2 people under the CS department. the age of 30. GT 8803 // Fall 2019 65

  38. #3: MULTI-INDEX SCAN • If there are multiple indexes that the DBMS can use for a query: – Compute sets of record ids using each matching index. – Combine these sets based on the query's predicates (union vs. intersect). – Retrieve the records and apply any remaining terms. • Postgres calls this Bitmap Scan GT 8803 // Fall 2019 66

  39. #3: MULTI-INDEX SCAN • With an index on age and an index SELECT * FROM students WHERE age < 30 on dept , AND dept = 'CS' – We can retrieve the record ids AND country = 'US' satisfying age<30 using the first, – Then retrieve the record ids satisfying dept='CS' using the second, – Take their intersection – Retrieve records and check country='US' . GT 8803 // Fall 2019 67

  40. #3: MULTI-INDEX SCAN • Set intersection can be done SELECT * FROM students WHERE age < 30 with bitmaps, hash tables, AND dept = 'CS' or Bloom filters. AND country = 'US' age<30 dept='CS' GT 8803 // Fall 2019 68

  41. #3: MULTI-INDEX SCAN • Set intersection can be done SELECT * FROM students WHERE age < 30 with bitmaps, hash tables, AND dept = 'CS' or Bloom filters. AND country = 'US' age<30 dept='CS' record ids GT 8803 // Fall 2019 69

  42. #3: MULTI-INDEX SCAN • Set intersection can be done SELECT * FROM students WHERE age < 30 with bitmaps, hash tables, AND dept = 'CS' or Bloom filters. AND country = 'US' age<30 dept='CS' record ids record ids GT 8803 // Fall 2019 70

  43. #3: MULTI-INDEX SCAN • Set intersection can be done SELECT * FROM students WHERE age < 30 with bitmaps, hash tables, AND dept = 'CS' or Bloom filters. AND country = 'US' age<30 dept='CS' record ids record ids fetch records country='US' GT 8803 // Fall 2019 71

  44. INDEX SCAN PAGE SORTING • Retrieving tuples in the order that appear in an unclustered index is 101 102 103 104 inefficient. • The DBMS can first figure out all the tuples that it needs and then sort them based on their page id. GT 8803 // Fall 2019 72

  45. INDEX SCAN PAGE SORTING Scan Direction • Retrieving tuples in the order that appear in an unclustered index is 101 102 103 104 inefficient. • The DBMS can first figure out all the tuples that it needs and then sort them based on their page id. GT 8803 // Fall 2019 73

  46. INDEX SCAN PAGE SORTING Scan Direction • Retrieving tuples in the order that appear in an unclustered index is 101 102 103 104 inefficient. • The DBMS can first figure out all the tuples that it needs and then sort them based on their page id. GT 8803 // Fall 2019 74

  47. INDEX SCAN PAGE SORTING Scan Direction • Retrieving tuples in the order that appear in an unclustered index is 101 102 103 104 inefficient. Page 102 • The DBMS can first figure out Page 103 Page 104 Page 104 all the tuples that it needs Page 102 Page 103 and then sort them Page 102 Page 102 Page 101 based on their page id. Page 103 Page 104 Page 103 GT 8803 // Fall 2019 75

  48. INDEX SCAN PAGE SORTING Scan Direction • Retrieving tuples in the order that appear in an unclustered index is 101 102 103 104 inefficient. Page 102 • The DBMS can first figure out Page 103 Page 104 Page 104 all the tuples that it needs Page 102 Page 103 and then sort them Page 102 Page 102 Page 101 based on their page id. Page 103 Page 104 Page 103 GT 8803 // Fall 2019 76

  49. INDEX SCAN PAGE SORTING Scan Direction • Retrieving tuples in the order that appear in an unclustered index is 101 102 103 104 inefficient. Page 102 Page 101 • The DBMS can first figure out Page 103 Page 101 Page 104 Page 102 Page 104 all the tuples that it needs Page 102 Page 102 Page 102 Page 103 Page 102 and then sort them Page 102 Page 103 Page 102 Page 103 Page 101 based on their page id. Page 103 Page 103 Page 104 Page 104 Page 104 Page 103 Page 104 GT 8803 // Fall 2019 77

  50. CLUSTERED VS UNCLUSTERED INDEX GT 8803 // Fall 2019 78

  51. CLUSTERED VS UNCLUSTERED INDEX • Q: What is the difference between a clustered and unclustered index? – Clustered: Tuples are stored physically on disk in the same order as the index. Only one per table. – Unclustered: Ordered differently. Values are only pointers to the physical tuples. GT 8803 // Fall 2019 79

  52. EXPRESSION EVALUATION 80 GT 8803 // Fall 2018

  53. EXPRESSION EVALUATION SELECT A.id, B.value • The DBMS represents a FROM A, B WHERE clause as an WHERE A.id = B.id AND B.val > 100 expression tree . GT 8803 // Fall 2019 81

  54. EXPRESSION EVALUATION SELECT A.id, B.value • The DBMS represents a FROM A, B WHERE clause as an WHERE A.id = B.id AND B.val > 100 expression tree . GT 8803 // Fall 2019 82

  55. EXPRESSION EVALUATION SELECT A.id, B.value • The DBMS represents a FROM A, B WHERE clause as an WHERE A.id = B.id AND B.val > 100 expression tree . AND = > Attribute(A.id) Attribute(B.id) Attribute(val) Constant(100) GT 8803 // Fall 2019 83

  56. EXPRESSION EVALUATION • The nodes in the tree represent different expression types: – Comparison Operators ( = , < , > , != ) – Logical Operators ( AND, OR ) – Arithmetic Operators ( + , - , * , / , % ) – Constant Values – Tuple Attribute References GT 8803 // Fall 2019 84

  57. EXPRESSION EVALUATION SELECT * FROM B WHERE B.val = ? + 1 85 GT 8803 // Fall 2018

  58. EXPRESSION EVALUATION SELECT * FROM B WHERE B.val = ? + 1 86 GT 8803 // Fall 2018

  59. EXPRESSION EVALUATION Execution Context SELECT * FROM B Current Tuple Query Parameters Table Schema WHERE B.val = ? + 1 (123, 1000) (int:999) B→(int:id, int:val) = Attribute(val) + Parameter(0) Constant(1) 87 GT 8803 // Fall 2018

  60. EXPRESSION EVALUATION Execution Context SELECT * FROM B Current Tuple Query Parameters Table Schema WHERE B.val = ? + 1 (123, 1000) (int:999) B→(int:id, int:val) = Attribute(val) + Parameter(0) Constant(1) 88 GT 8803 // Fall 2018

  61. EXPRESSION EVALUATION Execution Context SELECT * FROM B Current Tuple Query Parameters Table Schema WHERE B.val = ? + 1 (123, 1000) (int:999) B→(int:id, int:val) = Attribute(val) + Parameter(0) Constant(1) 89 GT 8803 // Fall 2018

  62. EXPRESSION EVALUATION Execution Context SELECT * FROM B Current Tuple Query Parameters Table Schema WHERE B.val = ? + 1 (123, 1000) (int:999) B→(int:id, int:val) = Attribute(val) + 1000 Parameter(0) Constant(1) 90 GT 8803 // Fall 2018

  63. EXPRESSION EVALUATION Execution Context SELECT * FROM B Current Tuple Query Parameters Table Schema WHERE B.val = ? + 1 (123, 1000) (int:999) B→(int:id, int:val) = Attribute(val) + 1000 Parameter(0) Constant(1) 91 GT 8803 // Fall 2018

  64. EXPRESSION EVALUATION Execution Context SELECT * FROM B Current Tuple Query Parameters Table Schema WHERE B.val = ? + 1 (123, 1000) (int:999) B→(int:id, int:val) = Attribute(val) + 1000 Parameter(0) Constant(1) 92 GT 8803 // Fall 2018

  65. EXPRESSION EVALUATION Execution Context SELECT * FROM B Current Tuple Query Parameters Table Schema WHERE B.val = ? + 1 (123, 1000) (int:999) B→(int:id, int:val) = Attribute(val) + 1000 Parameter(0) Constant(1) 999 93 GT 8803 // Fall 2018

  66. EXPRESSION EVALUATION Execution Context SELECT * FROM B Current Tuple Query Parameters Table Schema WHERE B.val = ? + 1 (123, 1000) (int:999) B→(int:id, int:val) = Attribute(val) + 1000 Parameter(0) Constant(1) 999 94 GT 8803 // Fall 2018

  67. EXPRESSION EVALUATION Execution Context SELECT * FROM B Current Tuple Query Parameters Table Schema WHERE B.val = ? + 1 (123, 1000) (int:999) B→(int:id, int:val) = Attribute(val) + 1000 Parameter(0) Constant(1) 999 1 95 GT 8803 // Fall 2018

  68. EXPRESSION EVALUATION Execution Context SELECT * FROM B Current Tuple Query Parameters Table Schema WHERE B.val = ? + 1 (123, 1000) (int:999) B→(int:id, int:val) = true Attribute(val) + 1000 1000 Parameter(0) Constant(1) 999 1 96 GT 8803 // Fall 2018

  69. TREE VS STRING REPRESENTATION GT 8803 // Fall 2019 97

  70. TREE VS STRING REPRESENTATION • Q: Why are queries and expressions represented as a tree as opposed to a string ? – Structural elements simplify manipulation as opposed to a string representation – Works well for complex recursive structures (e.g., sub-queries, nested expressions) GT 8803 // Fall 2019 98

  71. SUMMARY • The same query plan be executed in multiple ways. • (Most) DBMSs will want to use an index scan as much as possible. • Expression trees are flexible but slow. GT 8803 // Fall 2019 99

  72. VISUAL QUERY EXECUTION ENGINE 100 GT 8803 // Fall 2018

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend