


  1. COMPRESSED REPRESENTATIONS OF CONJUNCTIVE QUERY RESULTS. Paris Koutris, University of Wisconsin-Madison. Joint work with Shaleen Deep (UW-Madison).

  2. MOTIVATION
     • In many scenarios, a data processing pipeline repeatedly accesses the result of a join query using some access pattern.
     • But the result of a query over large data can itself be large and very expensive to store directly.
     Can we compress the query result so that these accesses can still be performed efficiently?

  3. EXAMPLE: GRAPH DATA. Consider the author relation from DBLP: $R(\mathit{author}, \mathit{paper})$. We want to run a data analysis over the co-author graph, which can be expressed as the following view: $V(x, y) \leftarrow R(x, p), R(y, p)$

  4. EXAMPLE: GRAPH DATA. Graph analytics algorithms access a graph through an API that asks for the set of neighbors of a given vertex, expressed by an adorned view: $V^{bf}(x, y) \leftarrow R(x, p), R(y, p)$
     • $x$ is a bound ($b$) variable
     • $y$ is a free ($f$) variable
     [Xirogiannopoulos & Deshpande, '17]

  5. EXAMPLE: GRAPH DATA. $V^{bf}(x, y) \leftarrow R(x, p), R(y, p)$. How can we solve this problem?
     1. Run each access request from scratch:
        • no extra space needed
        • answer time can be $\Omega(N)$
     2. Create an index on the materialized $V$:
        • space can be $\Omega(N^2)$
        • answer in constant time
     What exists between the two extremes?
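
The two extremes above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names and the toy relation `R` are made up for the example, and `R` is assumed to be a duplicate-free list of (author, paper) pairs.

```python
from collections import defaultdict

def neighbors_from_scratch(R, x):
    """Extreme 1: answer V^bf(x, y) by scanning R on every request (no extra space)."""
    papers = {p for (a, p) in R if a == x}
    return {a for (a, p) in R if p in papers}

def build_materialized_index(R):
    """Extreme 2: materialize V(x, y) and index it by x (space can be quadratic)."""
    by_paper = defaultdict(set)
    for a, p in R:
        by_paper[p].add(a)
    index = defaultdict(set)
    for authors in by_paper.values():
        for a in authors:
            index[a] |= authors      # every pair of co-authors of the paper
    return index

# Hypothetical toy data: (author, paper) pairs.
R = [("ann", "p1"), ("bob", "p1"), ("bob", "p2"), ("cat", "p2")]
index = build_materialized_index(R)
```

Note that the view as written also pairs every author with themselves, so `index["bob"]` contains `"bob"`.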

  6. TALK OUTLINE: (1) Problem Setting, (2) Main Result #1, (3) Main Result #2, (4) Future Work

  7. ADORNED VIEWS. We consider the class of conjunctive queries: $V(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$. An adorned view [Ullman '85] describes an access pattern where some variables are bound ($b$) and others are free ($f$):
     $V^{bbf}(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$
     $V^{fff}(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$
     An adorned view is full if every variable appears in the head.

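  8. COMPRESSED REPRESENTATIONS. [Diagram: in the preprocessing phase, a database $D$ and an adorned view $V^\eta$ are turned, in compression time $T_C$, into a compressed representation of space $S$. In the online phase, access requests are answered from the representation, measured by answer time and delay.]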

  9. PARAMETERS. Goal: construct a space-efficient representation to answer access requests that originate from a given adorned view. Parameters:
     • compression time $T_C$
     • space $S$
     • answering time:
       – total answer time $T_A$ (time to enumerate all results)
       – delay $\delta$ (maximum time between outputting two consecutive tuples)

  10. FACTORIZED DATABASES [Olteanu & Závodný '15]. Suppose that the adorned view has only free variables and the query is full: $V^{f \cdots f}(x_1, \dots, x_k) \leftarrow \dots$ In compression time $T_C = O(|D|^{\mathrm{fhw}(Q)})$ we can construct a compressed representation with space $S = O(|D|^{\mathrm{fhw}(Q)})$ such that we can answer any access request over $D$ with constant delay. Here $\mathrm{fhw}(Q)$ is the fractional hypertree width of $Q$.

  11. EXAMPLE. $|R| = |S| = |T| = N$, $V^{bbf}(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$

      approach                      compression time   space                total answer time   delay
      do nothing                    -                  $O(N)$               $O(N)$              $O(N)$
      materialize + create index    $O(N^{3/2})$       $O(N^{3/2})$         $O(|OUT|)$          $O(1)$
      our results                   $O(N^{3/2})$       $O(N^{3/2}/\tau)$    $O(\tau |OUT|)$     $O(\tau)$

      $OUT$ is the output of an access request.

  12. ALL BOUND VARIABLES. Suppose that the adorned view has only bound variables: $V^{b \cdots b}(x_1, \dots, x_k) \leftarrow \dots$ Then, in time linear in the database size, we can construct a compressed representation with linear space that can answer any access request over $D$ with constant delay. IDEA: simply create a hash index for every relation.
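
A minimal sketch of the hash-index idea, using the triangle query from the earlier slides with all variables bound. The names `build_hash_indexes` and `answer_all_bound` and the toy relations are assumptions made for this illustration; Python sets stand in for the hash indexes.

```python
def build_hash_indexes(relations):
    """One hash index (a Python set of tuples) per relation: linear time and space."""
    return {name: set(tuples) for name, tuples in relations.items()}

def answer_all_bound(indexes, x, y, z):
    """V^bbb(x, y, z) <- R(x, y), S(y, z), T(z, x): three expected O(1) lookups."""
    return (x, y) in indexes["R"] and (y, z) in indexes["S"] and (z, x) in indexes["T"]

# Hypothetical toy database.
indexes = build_hash_indexes({
    "R": [(1, 2), (1, 3)],
    "S": [(2, 5), (3, 6)],
    "T": [(5, 1)],
})
```

With every variable bound, an access request has at most one answer, so a constant number of lookups per request is enough.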

  13. TALK OUTLINE: (1) Problem Setting, (2) Main Result #1, (3) Main Result #2, (4) Future Work

  14. QUERY AS A HYPERGRAPH. Given an adorned view $Q^\eta$, it will be convenient to view it as a hypergraph $H = (V, E)$: $Q^{fffbbb}(x, y, z, w_1, w_2, w_3) \leftarrow R_1(w_1, x, y), R_2(w_2, y, z), R_3(w_3, x, z)$
      • bound variables: $V_b = \{w_1, w_2, w_3\}$
      • free variables: $V_f = \{x, y, z\}$

  15. FRACTIONAL EDGE COVER. A fractional edge cover assigns a weight to each hyperedge such that, for every variable, the sum of the weights of the hyperedges that include it is at least 1. In the running example, each of the three hyperedges is assigned weight 1.

  16. SLACK. Given a fractional edge cover $\mathbf{u}$ and a subset $S$ of the variables, the slack $\alpha(S)$ is the maximum quantity such that $\mathbf{u}/\alpha(S)$ is still a fractional cover of $S$. In the running example, with weight 1 on every hyperedge, $V_f = \{x, y, z\}$ and $\alpha(V_f) = 2$. The slack is always at least one!
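
The definition implies that the slack of $S$ is the smallest total weight covering any variable of $S$, since $\mathbf{u}/\alpha$ covers a variable exactly when $\alpha$ is at most the sum of the weights of the hyperedges containing it. The sketch below checks this on the running example; the helper names are illustrative only.

```python
from fractions import Fraction

# Hyperedges of the running example, one per relation.
EDGES = {
    "R1": {"w1", "x", "y"},
    "R2": {"w2", "y", "z"},
    "R3": {"w3", "x", "z"},
}

def coverage(edges, weights, var):
    """Total weight of the hyperedges that contain `var`."""
    return sum(weights[name] for name, variables in edges.items() if var in variables)

def is_fractional_cover(edges, weights, variables):
    """True if every variable in `variables` is covered with total weight >= 1."""
    return all(coverage(edges, weights, v) >= 1 for v in variables)

def slack(edges, weights, subset):
    """Largest alpha such that weights / alpha still covers every variable in `subset`."""
    return min(coverage(edges, weights, v) for v in subset)

u = {name: Fraction(1) for name in EDGES}      # weight 1 on every hyperedge
all_vars = set().union(*EDGES.values())
free_vars = {"x", "y", "z"}
```

Here `slack(EDGES, u, free_vars)` evaluates to 2, because each free variable lies in two hyperedges, while the slack over all variables is 1, because each $w_i$ lies in only one.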

  17. AGM BOUND. Let $H = (V, E)$ be a hypergraph. For every fractional edge cover $\mathbf{u}$ of $V$, the output size of the corresponding join query is upper bounded by $\prod_{F \in E} |R_F|^{u_F}$. In our example, if all relations have size $N$, we obtain a bound of $N^3$.
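
As a quick numeric check, the bound is just a weighted product of relation sizes. The function name `agm_bound` and the concrete sizes below are assumptions for illustration; the second call uses the triangle query $R(x, y), S(y, z), T(z, x)$ from the earlier slides with weight 1/2 on each edge, which gives the familiar $N^{3/2}$.

```python
import math

def agm_bound(sizes, weights):
    """Product over hyperedges F of |R_F| ** u_F."""
    return math.prod(sizes[f] ** weights[f] for f in sizes)

N = 100

# Running example: three relations of size N, weight 1 each, so the bound is N^3.
example = agm_bound({"R1": N, "R2": N, "R3": N}, {"R1": 1, "R2": 1, "R3": 1})

# Triangle query with weight 1/2 per edge, so the bound is N^(3/2).
triangle = agm_bound({"R": N, "S": N, "T": N}, {"R": 0.5, "S": 0.5, "T": 0.5})
```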

  18. MAIN THEOREM #1. Let $Q^\eta$ be a full adorned view with hypergraph $H = (V, E)$, and let $\mathbf{u}$ be any fractional edge cover of $V$. For any database $D$ and parameter $\tau > 0$, we can construct a compressed representation with:
      • compression time $T_C = \tilde{O}(|D| + \prod_{F \in E} |R_F|^{u_F})$
      • space $S = \tilde{O}(|D| + \prod_{F \in E} |R_F|^{u_F} / \tau^{\alpha(V_f)})$
      • delay $\delta = \tilde{O}(\tau)$
      • answer time $T_A = \tilde{O}(|q(D)| + \tau \cdot |q(D)|^{1/\alpha(V_f)})$

  19. EXAMPLE. $\mathbf{u} = (1, 1, 1)$, $V_f = \{x, y, z\}$, $\alpha(V_f) = 2$:
      • compression time $T_C = \tilde{O}(N^3)$
      • space $S = \tilde{O}(N^3 / \tau^2)$
      • delay $\delta = \tilde{O}(\tau)$
      • answer time $T_A = \tilde{O}(|q(D)| + \tau \cdot |q(D)|^{1/2})$
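
To get a feel for the tradeoff, one can plug numbers into these bounds. The sketch below drops the polylogarithmic factors hidden by $\tilde{O}$ and all constants, so the values are only indicative; the function name and the chosen $N$ and output size are assumptions for the example.

```python
def theorem1_bounds(db_size, agm_bound, slack, tau, out_size):
    """Evaluate the bounds of Main Theorem #1, ignoring constants and polylog factors."""
    return {
        "compression_time": db_size + agm_bound,
        "space": db_size + agm_bound / tau ** slack,
        "delay": tau,
        "answer_time": out_size + tau * out_size ** (1 / slack),
    }

N = 1000
# Running example: |D| = 3N, AGM bound N^3, slack 2, an access request with 100 answers.
table = {
    tau: theorem1_bounds(db_size=3 * N, agm_bound=N ** 3, slack=2, tau=tau, out_size=100)
    for tau in (1, 10, N)
}
```

Under these assumptions, choosing $\tau = N$ brings the space term $N^3/\tau^2$ down to $N$, at the price of delay on the order of $N$.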

  20. THE DATA STRUCTURE (1)
      • Consider an ordering of the free variables $V_f$, e.g. $x \le y \le z$.
      • This induces a lexicographic ordering for all the valuations over $V_f$.
      • Using this ordering, we can define intervals: $I_1 = [(0,0,10), (0,10,20)]$, $I_2 = [(3,1,0), (4,5,0))$.
      • Given a valuation $v_b$ over $V_b$ and an interval $I$, we can estimate an upper bound $T(v_b, I)$ on the cost of computing the query restricted to $v_b, I$ using the AGM bound.
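
Python tuples already compare lexicographically, so interval membership for valuations over $(x, y, z)$ can be illustrated directly. The helper `in_interval` and its `hi_closed` flag are names invented for this sketch.

```python
def in_interval(v, lo, hi, hi_closed=True):
    """True if valuation v lies in [lo, hi], or in [lo, hi) when hi_closed is False."""
    return lo <= v and (v <= hi if hi_closed else v < hi)

# The closed interval [(0,0,10), (0,10,20)] from the slide.
inside = in_interval((0, 5, 0), (0, 0, 10), (0, 10, 20))
# The half-open interval [(3,1,0), (4,5,0)) excludes its right endpoint.
endpoint = in_interval((4, 5, 0), (3, 1, 0), (4, 5, 0), hi_closed=False)
```

The valuation $(0, 5, 0)$ is inside the first interval even though its last coordinate is smaller than 10, because the comparison is decided by the second coordinate.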

  21. THE DATA STRUCTURE (2)
      • The data structure is a binary tree parameterized by a threshold $\tau$.
      • Each node is labeled by an interval $I$.
      • Each node stores a bit for every valuation $v_b$ over $V_b$ with cost $T(v_b, I) > \tau$:
        – 0: the query over $v_b, I$ is empty
        – 1: the query over $v_b, I$ is not empty

  22. THE DATA STRUCTURE (3)
      • The interval in the root node includes all valuations.
      • At the next level, the interval of the parent is split into two smaller intervals.
      • We stop at $\log |D|$ levels.
      • We split the intervals so that the number of bits we need to store at the two sub-intervals is balanced.
      [Figure: binary tree with root $I$, children $I_1, I_2$, and grandchildren $I_3, \dots, I_6$.]

  23. USING THE DATA STRUCTURE. We are given a valuation $v_b$ over $V_b$. Starting from the root:
      • if no bit is stored for $v_b$ at the node, we run the query on the interval
      • if bit = 0, we exit the node
      • if bit = 1, we visit the left and then the right child
      The delay is bounded by the threshold $\tau$.

  24. COROLLARY OF THEOREM #1. Let $Q^\eta$ be a full adorned view with hypergraph $H = (V, E)$, and let $\rho(H)$ be the value of a minimum fractional edge cover. For any input database $D$ and parameter $\tau > 0$, we can construct a compressed representation with:
      • space $S = \tilde{O}(|D| + |D|^{\rho(H)} / \tau)$
      • delay $\delta = \tilde{O}(\tau)$
      For $\tau = 1$, the space matches the AGM bound.

  25. BETTER BOUNDS USING SLACK. Star join query:
      • the fractional edge cover assigns weight 1 to each hyperedge
      • the slack for $\{z\}$ in this case is $n$
      • space $S = \tilde{O}(|D| + |D|^n / \tau^n)$
      • delay $\delta = \tilde{O}(\tau)$
      • answer time $T_A = \tilde{O}(|q(D)| + \tau |q(D)|^{1/n})$
      If we ignored slack, the space would be $|D|^n / \tau$.
      [Figure: star hypergraph with center $z$ and outer variables $x_1, \dots, x_n$; every hyperedge has weight 1.]

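  26. DELAY-SPACE TRADEOFF. [Plot: delay against space, with one curve for slack = 1 and one for slack > 1. Marked values: 1 on the delay axis; $|D|$ and $|D|^\rho$ (the AGM bound) on the space axis.]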

  27. FAST SET INTERSECTION. Given a family of sets $\{S_1, \dots, S_n\}$ with total size $m$, construct a space-efficient data structure such that, given any $i, j$, we can compute $S_i \cap S_j$ as fast as possible [Cohen & Porat '10]. This is a special case of Theorem #1 for $Q^{bbf}(x, y, z) \leftarrow R(x, z), R(y, z)$ with weight 1 on each hyperedge, where $R(s, b)$ encodes that set $s$ contains element $b$.
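
The tree of bits described on the data-structure slides can be sketched for this set-intersection view. The code below is a simplified illustration and not the construction from the talk: it assumes the sets are given as sorted lists, it uses the number of elements touched by a direct intersection as the cost estimate instead of the AGM bound, it splits each interval at the midpoint of the domain instead of balancing the stored bits, and it stops when no bit is stored instead of at a fixed depth. All names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import combinations

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi        # closed interval [lo, hi] of z-values
        self.bits = {}                   # (x, y) -> 0 or 1, stored only for costly pairs
        self.left = self.right = None

def restrict(elems, lo, hi):
    """Slice of the sorted list `elems` that falls inside [lo, hi]."""
    return elems[bisect_left(elems, lo):bisect_right(elems, hi)]

def cost(sets, x, y, lo, hi):
    # Assumed cost estimate: elements touched by a direct intersection on [lo, hi].
    return len(restrict(sets[x], lo, hi)) + len(restrict(sets[y], lo, hi))

def intersect(sets, x, y, lo, hi):
    in_y = set(restrict(sets[y], lo, hi))
    return [z for z in restrict(sets[x], lo, hi) if z in in_y]

def build(sets, domain, tau):
    """Build the tree over `domain`, the non-empty sorted list of all z-values."""
    node = Node(domain[0], domain[-1])
    for x, y in combinations(sorted(sets), 2):
        if cost(sets, x, y, node.lo, node.hi) > tau:
            node.bits[(x, y)] = 1 if intersect(sets, x, y, node.lo, node.hi) else 0
    if node.bits and len(domain) > 1:
        mid = len(domain) // 2           # midpoint split, not the balanced split
        node.left = build(sets, domain[:mid], tau)
        node.right = build(sets, domain[mid:], tau)
    return node

def answers(sets, node, x, y):
    """Enumerate the intersection of two distinct sets named x and y."""
    bit = node.bits.get((min(x, y), max(x, y)))
    if bit == 0:
        return                           # stored bit 0: the interval has no answers
    if bit is None or node.left is None:
        yield from intersect(sets, x, y, node.lo, node.hi)   # cheap: run directly
        return
    yield from answers(sets, node.left, x, y)
    yield from answers(sets, node.right, x, y)

# Hypothetical family of sets, each stored as a sorted list.
sets = {"a": [1, 2, 3, 4, 5, 6, 7, 8], "b": [2, 4, 6, 8, 10, 12], "c": [20]}
domain = sorted(set().union(*sets.values()))
root = build(sets, domain, tau=4)
```

Calling `list(answers(sets, root, "a", "b"))` walks the tree left to right and runs a direct intersection only on intervals whose estimated cost is at most the threshold, which is what keeps the work between consecutive outputs small in this sketch.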

  28. LIMITATIONS OF THEOREM #1. Consider the following adorned view: $Q^{fff}(x, y, z) \leftarrow R(x, z), S(y, z)$
      • Theorem #1 implies that for constant delay we need space $O(N^2)$.
      • But we know that we can achieve the same delay with only linear space (because of acyclicity).
      • Why is there a mismatch in the space bounds? We must take the query structure into account as well.
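
For this particular view, the linear-space alternative is easy to sketch: group both relations by the shared variable $z$ and keep only the $z$-values that occur in both. The function names and toy data are assumptions for illustration, and the relations are assumed to contain no duplicate tuples.

```python
from collections import defaultdict

def build_index(R, S):
    """Linear-space index: group both relations by the shared variable z."""
    r_by_z, s_by_z = defaultdict(list), defaultdict(list)
    for x, z in R:
        r_by_z[z].append(x)
    for y, z in S:
        s_by_z[z].append(y)
    shared = [z for z in r_by_z if z in s_by_z]   # z-values with at least one answer
    return r_by_z, s_by_z, shared

def enumerate_view(index):
    """Enumerate Q^fff(x, y, z) <- R(x, z), S(y, z) without materializing it."""
    r_by_z, s_by_z, shared = index
    for z in shared:
        for x in r_by_z[z]:
            for y in s_by_z[z]:
                yield (x, y, z)

# Hypothetical toy relations.
R = [(1, "a"), (2, "a"), (3, "b")]
S = [(7, "a"), (8, "c")]
index = build_index(R, S)
```

Because every $z$ kept in `shared` has at least one partner on both sides, each step of the nested loops produces an output tuple, so the time between consecutive answers stays constant while the index is no larger than the input.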

  29. TALK OUTLINE: (1) Problem Setting, (2) Main Result #1, (3) Main Result #2, (4) Future Work

  30. TREE DECOMPOSITION. $Q(x_1, \dots, x_7) \leftarrow R_1(x_1, x_2), R_2(x_2, x_3), \dots, R_6(x_6, x_7)$. Given a hypergraph $H = (V, E)$, a tree decomposition of $H$ is a tuple $(T, (B_t))$ where $T$ is a tree, and each bag $B_t$ is a subset of $V$ such that:
      • each edge is contained in some bag
      • for each variable $x$, the set of tree nodes that contain $x$ in their bag is connected
      [Figure: a decomposition of $Q$ with bags $\{x_1, x_2\}, \{x_2, x_3\}, \dots, \{x_6, x_7\}$.]
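
The two conditions can be checked mechanically. The sketch below is an illustration with made-up names; it assumes that `tree_edges` really forms a tree over the bag identifiers (this is not verified) and that the bags are arranged along a path for the example query, which is a guess at the layout of the figure.

```python
from collections import deque

def is_tree_decomposition(hyperedges, bags, tree_edges):
    """bags: node -> set of variables; tree_edges: pairs of nodes, assumed to form a tree."""
    # Condition 1: every hyperedge is contained in some bag.
    if not all(any(e <= b for b in bags.values()) for e in hyperedges):
        return False
    adj = {t: set() for t in bags}
    for s, t in tree_edges:
        adj[s].add(t)
        adj[t].add(s)
    # Condition 2: for each variable, the nodes whose bags contain it are connected.
    for var in set().union(*bags.values()):
        nodes = {t for t, b in bags.items() if var in b}
        start = next(iter(nodes))
        seen, queue = {start}, deque([start])
        while queue:
            for nb in adj[queue.popleft()] & nodes:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if seen != nodes:
            return False
    return True

# Path query from the slide: R_i(x_i, x_{i+1}) for i = 1..6, one bag per relation.
hyperedges = [frozenset({f"x{i}", f"x{i + 1}"}) for i in range(1, 7)]
bags = {i: {f"x{i}", f"x{i + 1}"} for i in range(1, 7)}
path = [(i, i + 1) for i in range(1, 6)]
```

Reordering the tree so that bags 1 and 2 are no longer adjacent, for example with edges `[(1, 3), (3, 2), (2, 4), (4, 5), (5, 6)]`, breaks the connectivity condition for $x_2$.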
