systems infrastructure for data science
play

Systems Infrastructure for Data Science Web Science Group Uni - PowerPoint PPT Presentation

Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Lecture IX: Distributed query processing and optimization Roadmap Overview (Query Decomposition) Data Localization Query Optimization Uni


  1. Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13

  2. Lecture IX: Distributed query processing and optimization

  3. Roadmap • Overview • (Query Decomposition) • Data Localization • Query Optimization Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 3

  4. Query Processing Recap Declarative Procedural Query query specification query execution plan Processor Q Q’ SQL Relational Algebra SELECT ENAME FROM EMP, ASG Π ENAME ( EMP ENO ( σ RESP=“Manager” ( ASG ))) WHERE EMP.ENO = ASG.ENO AND RESP = “Manager” Two important requirements: 1. Correctness: Q’ must be semantically equivalent to Q. 2. Efficiency: Q’ must have the smallest execution cost. Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 4

  5. Cost Metrics • Total cost – processing time at all sites (CPU + I/O) – communication time between sites • In WANs, communication cost usually dominates. • Query response time – time elapsed for executing the query What is the difference between total cost and query response time? Does it change in distributed/parallel settings? Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 5

  6. Complexity of Relational Algebra Operators n: relation cardinality To reduce costs:  The most selective operations should be performed first.  Operations should be ordered by increasing complexity. Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 6

  7. Query Processing in a Centralized System Given: EMP(ENO, ENAME, TITLE) ASG(ENO, PNO, RESP, DUR) Query: Find the names of employees who are managing a project. SELECT ENAME FROM EMP, ASG WHERE EMP.ENO = ASG.ENO AND RESP = “Manager” 1 Π ENAME ( σ RESP=“Manager” AND EMP.ENO=ASG.ENO ( EMP x ASG )) Two equivalent execution plans. 2 Π ENAME ( EMP ENO ( σ RESP=“Manager” ( ASG ))) Which one to use? Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 7

  8. Query Processing in a Distributed System EMP ENO ( σ RESP=“Manager” ( ASG )) • Query: • Data fragments and their allocation to sites: – Site1 : ASG1 = σ ENO ≤ “E3” ( ASG )) – Site2 : ASG2 = σ ENO > “E3” ( ASG )) – Site3 : EMP1 = σ ENO ≤ “E3” ( EMP )) – Site4 : EMP2 = σ ENO > “E3” ( EMP )) – Site5 : Result • Assumptions: – size(EMP) = 400, size(ASG) = 1000, size( σ RESP=“Manager” ( ASG )) = 20 – tuple access cost = 1, tuple transfer cost = 10 – EMP locally indexed on ENO, ASG locally indexed on RESP – uniform data distribution across sites Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 8

  9. Query Processing in a Distributed System 1 join = 400 * 20 * 1 = 8000 selection = 1000 * 1 = 1000 transfer = 400 * 10 = 4000 transfer = 1000 * 10 = 10000 total cost = 10000 + 4000 + 1000 + 8000 = 23000 Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 9

  10. Query Processing in a Distributed System 2 transfer = 10 * 10 = 100 transfer = 10 * 10 = 100 join = 10 * 1 * 2 = 20 join = 10 * 1 * 2 = 20 transfer = 10 * 10 = 100 transfer = 10 * 10 = 100 selection = 10 * 1 = 10 selection = 10 * 1 = 10 total cost = 10 + 10 + 100 + 100 + 20 + 20 + 100 + 100 = 460 Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 10

  11. General Query Optimization Issues • Algorithmic approach: – Cost-based vs. Heuristics-based • Granularity: – Single query at a time vs. Multi-query optimization • Timing: – Static vs. Dynamic vs. Hybrid • Statistics: – what to collect, accuracy, independence, uniformity • Decision mechanism: Specific to – Centralized vs. Distributed vs. Hybrid distributed • Network topology: query processing – WANs vs. LANs Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 11

  12. Distributed Query Processing * * Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 12

  13. Query Decomposition • Goal: To convert global declarative query into a correct and efficient global procedural query • Query decomposition consists of 4 steps: 1. Normalization  Transformation of query predicates into normal form 2. Semantic Analysis  Detection and rejection of semantically incorrect queries 3. Simplification  Elimination of redundant predicates 4. Restructuring  Transformation of the query into algebraic form No distribution-related processing. • Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 13

  14. Sample Query • Transformation of the query into algebraic form Given: EMP(ENO, ENAME, TITLE) ASG(ENO, PNO, RESP, DUR) PROJ(PNO, PNAME, BUDGET, LOC) Query: Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 1 or 2 years. SELECT ENAME FROM EMP, ASG, PROJ WHERE ASG.ENO = EMP.ENO AND ASG.PNO = PROJ.PNO AND ENAME ≠ “J. Doe” AND PROJ.PNAME = “CAD/CAM” AND (DUR = 12 OR DUR = 24) Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 14

  15. Data Localization • Goal: To convert an algebraic query on global relations into an algebraic query on physical fragments • General approach: 1. Generate a localized query by substituting each global relation in the leaves of the operator tree by the appropriate subtree on fragments. Union for horizontal fragments • Join for vertical fragments • 2. Apply reduction techniques on the localized query to generate a simpler and an optimized operator tree. Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 15

  16. Data Localization Example Query plan on global relations • EMP is fragmented as follows: EMP 1 = σ ENO ≤ “E3” (EMP) EMP 2 = σ “E3” < ENO ≤ “E6” (EMP) EMP 3 = σ ENO ≥ “E6” (EMP) • ASG is fragmented as follows: ASG 1 = σ ENO ≤ “E3” (ASG) ASG 2 = σ ENO > “E3” (ASG) Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 16

  17. Data Localization Example Query plan on global relations Localized query plan ∪ ∪ Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science ASG 1 ASG 2 17 EMP 1 EMP 2 EMP 3

  18. Data Localization Reduction for Primary Horizontal Fragmentation • Reduction with Selection – Given relation R and F R = { R 1 , R 2 , …, R w } where R j = σ pj ( R ) : σ pi (R j ) = φ , if ∀ x in R: ¬(p i (x) ∧ p j (x)) SELECT * – Example: EMP is fragmented as before. FROM EMP WHERE ENO = “E5” Localized query Reduced query Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 18

  19. Data Localization Reduction for Primary Horizontal Fragmentation • EMP is fragmented as follows: EMP 1 = σ ENO ≤ “E3” (EMP) • Reduction with Selection EMP 2 = σ “E3” < ENO ≤ “E6” (EMP) EMP 3 = σ ENO ≥ “E6” (EMP) – Given relation R and F R = { R 1 , R 2 , …, R w } where R j = σ pj ( R ) : σ pi (R j ) = φ , if ∀ x in R: ¬(p i (x) ∧ p j (x)) • ASG is fragmented as follows: ASG 1 = σ ENO ≤ “E3” (ASG) SELECT * – Example: EMP is fragmented as before. ASG 2 = σ ENO > “E3” (ASG) FROM EMP WHERE ENO = “E5” Localized query Reduced query Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 19

  20. Data Localization Reduction for Primary Horizontal Fragmentation • Reduction with Join – Apply when fragmentation is done on the join attribute – Distribute Joins over Unions (R 1 ∪ R 2 ) S ⇔ (R 1 S) ∪ (R 2 S) – Eliminate useless Joins R i R j = φ , if ∀ x in R i , ∀ y in R j : ¬(p i (x) ∧ p j (y)) • Example: – EMP and ASG are fragmented as before. SELECT * FROM EMP, ASG WHERE EMP.ENO = ASG.ENO Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 20

  21. Data Localization Reduction for Primary Horizontal Fragmentation • Reduction with Join Example (cont’d): Localized query Reduced query Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 21

  22. Data Localization Reduction for Primary Horizontal Fragmentation • EMP is fragmented as follows: EMP 1 = σ ENO ≤ “E3” (EMP) • Reduction with Join Example (cont’d): EMP 2 = σ “E3” < ENO ≤ “E6” (EMP) Localized query EMP 3 = σ ENO ≥ “E6” (EMP) • ASG is fragmented as follows: ASG 1 = σ ENO ≤ “E3” (ASG) ASG 2 = σ ENO > “E3” (ASG) Reduced query Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 22

  23. Data Localization Reduction for Vertical Fragmentation • Reduction with Projection – Given a relation R defined over attributes A = {A 1 , ..., A n } and vertically fragmented as R i = Π A' (R) where A' ⊆ A : Π D,K (R i ) is useless, if the set of projection attributes D is not in A‘. • Example: – EMP is vertically fragmented as follows: EMP 1 = Π ENO,ENAME (EMP) Localized query Reduced query EMP 2 = Π ENO,TITLE (EMP) SELECT ENAME FROM EMP Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 23

  24. Data Localization Reduction for Derived Horizontal Fragmentation • Example: ASG 1 : ASG ENO EMP 1 SELECT * ASG 2 : ASG ENO EMP 2 FROM EMP, ASG EMP 1 : σ TITLE = “Programmer” (EMP) WHERE ASG.ENO = EMP.ENO AND EMP.TITLE = “Mech. Eng.” EMP 2 : σ TITLE ≠ “Programmer” (EMP) Localized query Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 24

  25. Data Localization Reduction for Derived Horizontal Fragmentation • Example cont’d: Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 25

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend