Joins Aggregates Optimization https://fdbresearch.github.io Dan - PowerPoint PPT Presentation

Hypertree Decompositions ⇔ Variable Orders From hypertree decomposition T to variable order ∆: [OZ15] Create a node A in ∆ for a variable A in the top bag in T Recurse with T where A is removed from all bags in T . If top bag empty, then recurse independently on each of its child bags and create children of A in ∆ Update key for each variable at each step. Example: Triangle query R 1 ( A , B ) , R 2 ( A , C ) , R 3 ( B , C ) key ( A ) = ∅ A Step 3: ⇒ key ( B ) = { A } C is removed from T A , B , C B and inserted into ∆ key ( C ) = { A , B } C 20 / 99

Hypertree Decompositions ⇔ Variable Orders From hypertree decomposition T to variable order ∆: [OZ15] Create a node A in ∆ for a variable A in the top bag in T Recurse with T where A is removed from all bags in T . If top bag empty, then recurse independently on each of its child bags and create children of A in ∆ Update key for each variable at each step. Example: Path query R 1 ( A , B ) , R 2 ( B , C ) , R 3 ( C , D ) A , B B , C C , D 21 / 99

Hypertree Decompositions ⇔ Variable Orders From hypertree decomposition T to variable order ∆: [OZ15] Create a node A in ∆ for a variable A in the top bag in T Recurse with T where A is removed from all bags in T . If top bag empty, then recurse independently on each of its child bags and create children of A in ∆ Update key for each variable at each step. Example: Path query R 1 ( A , B ) , R 2 ( B , C ) , R 3 ( C , D ) key ( A ) = ∅ A A , B Step 1: ⇒ A is removed from T B , C and inserted into ∆ C , D 22 / 99

Hypertree Decompositions ⇔ Variable Orders From hypertree decomposition T to variable order ∆: [OZ15] Create a node A in ∆ for a variable A in the top bag in T Recurse with T where A is removed from all bags in T . If top bag empty, then recurse independently on each of its child bags and create children of A in ∆ Update key for each variable at each step. Example: Path query R 1 ( A , B ) , R 2 ( B , C ) , R 3 ( C , D ) key ( A ) = ∅ A A , B key ( B ) = { A } Step 2: B ⇒ B is removed from T B , C and inserted into ∆ C , D 23 / 99

Hypertree Decompositions ⇔ Variable Orders From hypertree decomposition T to variable order ∆: [OZ15] Create a node A in ∆ for a variable A in the top bag in T Recurse with T where A is removed from all bags in T . If top bag empty, then recurse independently on each of its child bags and create children of A in ∆ Update key for each variable at each step. Example: Path query R 1 ( A , B ) , R 2 ( B , C ) , R 3 ( C , D ) key ( A ) = ∅ A A , B key ( B ) = { A } Step 3: B ⇒ C is removed from T B , C key ( C ) = { B } and inserted into ∆ C C , D 24 / 99

Hypertree Decompositions ⇔ Variable Orders From hypertree decomposition T to variable order ∆: [OZ15] Create a node A in ∆ for a variable A in the top bag in T Recurse with T where A is removed from all bags in T . If top bag empty, then recurse independently on each of its child bags and create children of A in ∆ Update key for each variable at each step. Example: Path query R 1 ( A , B ) , R 2 ( B , C ) , R 3 ( C , D ) key ( A ) = ∅ A A , B key ( B ) = { A } Step 4: B ⇒ D is removed from T B , C key ( C ) = { B } and inserted into ∆ C C , D key ( D ) = { C } D 25 / 99

Outline of Part 1: Joins Introduction by Examples Decompositions and Variable Orders Size Bounds for Join Results Worst-Case Optimal Join Algorithms Further Work and References Quiz 26 / 99

How Can We Bound the Size of the Join Result? Example: the path query R 1 ( A , B ) , R 2 ( B , C ) , R 3 ( C , D ) Assumption: All relations have size N . The query result is included in the result of R 1 ( A , B ) , R 3 ( C , D ) ◮ Its size is upper bounded by N 2 = | R 1 | × | R 3 | ◮ All variables are ”covered” by the relations R 1 and R 3 There are databases for which the result size is at least N 2 ◮ Let R 1 = [ N ] × { 1 } , R 2 = { 1 } × [ N ] , R 3 = [ N ] × { 1 } . 27 / 99

How Can We Bound the Size of the Join Result? Example: the path query R 1 ( A , B ) , R 2 ( B , C ) , R 3 ( C , D ) Assumption: All relations have size N . The query result is included in the result of R 1 ( A , B ) , R 3 ( C , D ) ◮ Its size is upper bounded by N 2 = | R 1 | × | R 3 | ◮ All variables are ”covered” by the relations R 1 and R 3 There are databases for which the result size is at least N 2 ◮ Let R 1 = [ N ] × { 1 } , R 2 = { 1 } × [ N ] , R 3 = [ N ] × { 1 } . Conclusion: Size of the query result is Θ( N 2 ) for some input classes 27 / 99

How Can We Bound the Size of the Join Result? Example: the triangle query R 1 ( A , B ) , R 2 ( A , C ) , R 3 ( B , C ) Assumption: All relations have size N . The query result is included in the result of R 1 ( A , B ) , R 3 ( B , C ) ◮ Its size is upper bounded by N 2 = | R 1 | × | R 3 | ◮ All variables are ”covered” by the relations R 1 and R 3 There are databases for which the result size is at least N ◮ Let R 1 = [ N ] × { 1 } , R 2 = [ N ] × { 1 } , R 3 ⊇ { (1 , 1) } 28 / 99

How Can We Bound the Size of the Join Result? Example: the triangle query R 1 ( A , B ) , R 2 ( A , C ) , R 3 ( B , C ) Assumption: All relations have size N . The query result is included in the result of R 1 ( A , B ) , R 3 ( B , C ) ◮ Its size is upper bounded by N 2 = | R 1 | × | R 3 | ◮ All variables are ”covered” by the relations R 1 and R 3 There are databases for which the result size is at least N ◮ Let R 1 = [ N ] × { 1 } , R 2 = [ N ] × { 1 } , R 3 ⊇ { (1 , 1) } Conclusion: Size gap between the N 2 upper bound and the N lower bound! Question: Can we close this gap and give tight size bounds? 28 / 99

Edge Covers and Independent Sets We can generalize the previous examples as follows: For the size upper bound: Cover all nodes (variables) by k edges (relations) ⇒ size ≤ N k . This is an edge cover of the query hypergraph! For the size lower bound: m independent nodes ⇒ construct database such that size ≥ N m . This is an independent set of the query hypergraph! max m = | IndependentSet ( Q ) | ≤ | EdgeCover ( Q ) | = min k max m and min k do not necessarily meet! Can we further refine this analysis? 29 / 99

The Fractional Edge Cover Number ρ ∗ ( Q ) The two bounds meet if we take their fractional versions [AGM08] Fractional edge cover of Q with weight k ⇒ size ≤ N k . Fractional independent set with weight m ⇒ size ≥ N m . By duality of linear programming: max m = | FractionalIndependentSet ( Q ) | = | FractionalEdgeCover ( Q ) | = min k This is the fractional edge cover number ρ ∗ ( Q )! For query Q and database of size N , the query result has size O ( N ρ ∗ ( Q ) ) . 30 / 99

The Fractional Edge Cover Number ρ ∗ ( Q ) For a join query Q ( A 1 ∪ · · · ∪ A n ) = R 1 ( A 1 ) , . . . , R n ( A n ), ρ ∗ ( Q ) is the cost of an optimal solution to the linear program: � minimize i ∈ [ n ] x R i � i : edge R i covers node A x R i ≥ 1 ∀ A ∈ � j ∈ [ n ] A j , subject to x R i ≥ 0 ∀ i ∈ [ n ] . x R i is the weight of edge (relation) R i in the hypergraph of Q Each node (variable) has to be covered by edges with sum of weights ≥ 1 In the integer program variant for the edge cover, x R i ∈ { 0 , 1 } 31 / 99

Example: Compute the Fractional Edge Cover (1/3) Consider the join query Q : R ( A , B , C ) , S ( A , B , D ) , T ( A , E ) , U ( E , F ) . A T R B E F C D U S The three edges R , S , U can cover all nodes. FractionalEdgeCover ( Q ) ≤ 3 Each node C , D , and F must be covered by a distinct edge. FractionalIndependentSet ( Q ) ≥ 3 ⇒ ρ ∗ ( Q ) = 3 ⇒ Size ≤ N 3 and for some inputs is Θ( N 3 ). 32 / 99

Example: Compute the Fractional Edge Cover (2/3) Consider the triangle query: R 1 ( A , B ) , R 2 ( A , C ) , R 3 ( B , C ). minimize x R 1 + x R 2 + x R 3 A subject to R 1 R 2 A : x R 1 + x R 2 ≥ 1 B : x R 1 + x R 3 ≥ 1 B C C : + ≥ 1 x R 2 x R 3 R 3 x R 1 ≥ 0 x R 2 ≥ 0 x R 3 ≥ 0 Our previous size upper bound was N 2 : This is obtained by setting any two of x R 1 , x R 2 , x R 3 to 1. What is the fractional edge cover number for the triangle query? 33 / 99

Example: Compute the Fractional Edge Cover (2/3) Consider the triangle query: R 1 ( A , B ) , R 2 ( A , C ) , R 3 ( B , C ). minimize x R 1 + x R 2 + x R 3 A subject to R 1 R 2 A : x R 1 + x R 2 ≥ 1 B : x R 1 + x R 3 ≥ 1 B C C : + ≥ 1 x R 2 x R 3 R 3 x R 1 ≥ 0 x R 2 ≥ 0 x R 3 ≥ 0 Our previous size upper bound was N 2 : This is obtained by setting any two of x R 1 , x R 2 , x R 3 to 1. What is the fractional edge cover number for the triangle query? We can do better: x R 1 = x R 2 = x R 3 = 1 / 2. Then, ρ ∗ = 3 / 2. √ √ Lower bound reaches N 3 / 2 for R 1 = R 2 = R 3 = [ N ] × [ N ]. 33 / 99

Example: Compute the Fractional Edge Cover (3/3) Consider the (4-cycle) join: R ( A 1 , A 2 ) , S ( A 2 , A 3 ) , T ( A 3 , A 4 ) , W ( A 4 , A 1 ). The linear program for its fractional edge cover number: R minimize x R + x S + x T + x W A 1 A 2 subject to A 1 : x R + x W ≥ 1 W S A 2 : + ≥ 1 x R x S A 3 : x S + x T ≥ 1 A 4 A 3 A 4 : + ≥ 1 x T x W T x R ≥ 0 x S ≥ 0 x T ≥ 0 x W ≥ 0 Possible solution: x R = x T = 1. Another solution: x S = x W = 1. Then, ρ ∗ = 2. Lower bound reaches N 2 for R = T = [ N ] × { 1 } and S = W = { 1 } × [ N ]. 34 / 99

Historical Note on the Fractional Edge Cover Number Tight size bounds via ρ ∗ have been known from earlier works in other contexts: (special case) Loomis-Whitney inequality [LW49] (general case) number of occurrences of a subgraph in a graph [A81] generalization of Loomis-Whitney that subsumes the AGM bound [BT95] Recent insightful travel through the history of this result [H18] 35 / 99

Refinement under Cardinality Constraints Common case in practice: Relations have different sizes Small-size projections of relations may be added to the join query Recall the linear program for computing the fractional edge cover number ρ ∗ ( Q ) of a join query Q ( A 1 ∪ · · · ∪ A n ) = R 1 ( A 1 ) , . . . , R n ( A n ): � minimize i ∈ [ n ] x R i � i : edge R i covers node A x R i ≥ 1 ∀ A ∈ � subject to j ∈ [ n ] A j , x R i ≥ 0 ∀ i ∈ [ n ] . 36 / 99

Refinement under Cardinality Constraints Common case in practice: Relations have different sizes Small-size projections of relations may be added to the join query Add relation sizes into the linear program that computes the result size of a join query Q ( A 1 ∪ · · · ∪ A n ) = R 1 ( A 1 ) , . . . , R n ( A n ): � i ∈ [ n ] x Ri minimize N � i : edge R i covers node A x R i ≥ 1 ∀ A ∈ � j ∈ [ n ] A j , subject to x R i ≥ 0 ∀ i ∈ [ n ] . Assumption: All relations have the same size N . 37 / 99

Refinement under Cardinality Constraints Common case in practice: Relations have different sizes Small-size projections of relations may be added to the join query Add relation sizes into the linear program that computes the result size of a join query Q ( A 1 ∪ · · · ∪ A n ) = R 1 ( A 1 ) , . . . , R n ( A n ): � i ∈ [ n ] N x i minimize � i : edge R i covers node A x R i ≥ 1 ∀ A ∈ � subject to j ∈ [ n ] A j , x R i ≥ 0 ∀ i ∈ [ n ] . Assumption: All relations have the same size N . 38 / 99

Refinement under Cardinality Constraints Common case in practice: Relations have different sizes Small-size projections of relations may be added to the join query Add relation sizes into the linear program that computes the result size of a join query Q ( A 1 ∪ · · · ∪ A n ) = R 1 ( A 1 ) , . . . , R n ( A n ): � i ∈ [ n ] N x i minimize i � i : edge R i covers node A x R i ≥ 1 ∀ A ∈ � subject to j ∈ [ n ] A j , x R i ≥ 0 ∀ i ∈ [ n ] . Assumption: Relation R i has size N i , ∀ i ∈ [ n ]. 39 / 99

Size Bounds for Factorized Representations of Join Results 40 / 99

Recall the Itemized Customer Orders Example Orders (O for short) Dish (D for short) Items (I for short) customer day dish dish item item price Elise Monday burger burger patty patty 6 Elise Friday burger burger onion onion 2 Steve Friday hotdog burger bun bun 2 Joe Friday hotdog hotdog bun sausage 4 hotdog onion hotdog sausage Consider the natural join of the above relations: O(customer, day, dish), D(dish, item), I(item, price) customer day dish item price Elise Monday burger patty 6 Elise Monday burger onion 2 Elise Monday burger bun 2 Elise Friday burger patty 6 Elise Friday burger onion 2 Elise Friday burger bun 2 . . . . . . . . . . . . . . . 41 / 99

Factor Out Common Data Blocks O(customer, day, dish), D(dish, item), I(item, price) customer day dish item price Elise Monday burger patty 6 Elise Monday burger onion 2 Elise Monday burger bun 2 Elise Friday burger patty 6 Elise Friday burger onion 2 Elise Friday burger bun 2 . . . . . . . . . . . . . . . The listing representation of the above query result is: � Elise � × � Monday � × � burger � × � patty � × � 6 � ∪ � Elise � × � Monday � × � burger � × � onion � × � 2 � ∪ � Elise � × � Monday � × � burger � × � bun � × � 2 � ∪ � Elise � × � Friday � × � burger � × � patty � × � 6 � ∪ � Elise � × � Friday � × � burger � × � onion � × � 2 � ∪ � Elise � × � Friday � × � burger � × � bun � × � 2 � ∪ . . . It uses relational product ( × ), union ( ∪ ), and data (singleton relations). The attribute names are not shown to avoid clutter. 42 / 99

This is How A Factorized Join Looks Like! ∪ � burger � � hotdog � dish × × ∪ ∪ ∪ ∪ day � Monday � � Friday � � patty � � bun � � onion � � Friday � � bun � � onion � � sausage � item × × × × × × × × × ∪ ∪ ∪ ∪ ∪ ∪ ∪ ∪ ∪ � Elise � � Elise � � 6 � � 2 � � 2 � � Joe � � Steve � � 2 � � 2 � � 4 � customer price Var order Factorized representation of the join result There are several algebraically equivalent factorized representations defined: by distributivity of product over union and their commutativity; as groundings of variable orders. 43 / 99

.. Now with Further Compression using Caching ∪ � burger � � hotdog � ∅ dish × × ∪ ∪ ∪ ∪ { dish } { dish } day � Monday � � Friday � � patty � � bun � � onion � � bun �� onion � � sausage � � Friday � item × × × × × × × × × { dish , ∪ ∪ ∪ ∪ ∪ ∪ ∪ day } { item } � Elise � � Elise � � 6 � � 2 � � 2 � � 4 � � Joe � � Steve � customer price Observation: price is under item , which is under dish , but only depends on item , .. so the same price appears under an item regardless of the dish . Idea: Cache price for a specific item and avoid repetition! 44 / 99

Same Data, Different Factorization ∪ day � Monday � � Friday � × × ∪ ∪ customer � Elise � � Elise � � Joe � � Steve � × × × × ∪ ∪ ∪ ∪ � burger � � burger � � hotdog � � hotdog � dish × × × × ∪ ∪ ∪ ∪ � patty �� bun �� onion � � patty �� bun �� onion � � bun � � onion � � sausage � � bun � � onion � � sausage � item × × × × × × × × × × × × ∪ ∪ ∪ ∪ ∪ ∪ ∪ ∪ ∪ ∪ ∪ ∪ price � 6 � � 2 � � 2 � � 6 � � 2 � � 2 � � 2 � � 2 � � 4 � � 2 � � 2 � � 4 � 45 / 99

.. and Further Compressed using Caching ∪ day � Monday � � Friday � ∅ × × ∪ ∪ { day } customer � Elise � � Elise � � Joe � � Steve � × × × × ∪ ∪ ∪ ∪ { customer , � burger � � burger � � hotdog � � hotdog � dish day } × × × × ∪ ∪ { dish } � patty �� bun �� onion � � bun � � onion � � sausage � item × × × × × × ∪ ∪ ∪ ∪ { item } price � 6 � � 2 � � 2 � � 4 � 46 / 99

Which factorization should we choose? The size of a factorization is the number of its values. Example: � � � � F 1 = � 1 � ∪ · · · ∪ � n � × � 1 � ∪ · · · ∪ � m � F 2 = � 1 � × � 1 � ∪ · · · ∪ � 1 � × � m � ∪ · · · ∪ � n � × � 1 � ∪ · · · ∪ � n � × � m � . F 1 is factorized, F 2 is a listing representation F 1 ≡ F 2 BUT | F 1 | = m + n ≪ | F 2 | = m ∗ n . How much space does factorization save over the listing representation? 47 / 99

Size Bounds for Join Results Given a join query Q , for any database of size N , the join result admits a listing representation of size O ( N ρ ∗ ( Q ) ). [LW49,A81,BT95,AGM08] 48 / 99

Size Bounds for Join Results Given a join query Q , for any database of size N , the join result admits a listing representation of size O ( N ρ ∗ ( Q ) ). [LW49,A81,BT95,AGM08] a factorization without caching of size O ( N s ( Q ) ). [OZ12] 48 / 99

Size Bounds for Join Results Given a join query Q , for any database of size N , the join result admits a listing representation of size O ( N ρ ∗ ( Q ) ). [LW49,A81,BT95,AGM08] a factorization without caching of size O ( N s ( Q ) ). [OZ12] a factorization with caching of size O ( N fhtw ( Q ) ). [OZ15] 48 / 99

Size Bounds for Join Results Given a join query Q , for any database of size N , the join result admits a listing representation of size O ( N ρ ∗ ( Q ) ). [LW49,A81,BT95,AGM08] a factorization without caching of size O ( N s ( Q ) ). [OZ12] a factorization with caching of size O ( N fhtw ( Q ) ). [OZ15] ρ ∗ ( Q ) ≤ | Q | 1 ≤ fhtw ( Q ) ≤ s ( Q ) ≤ �� up to log | Q | up to | Q | | Q | is the number of relations in Q ρ ∗ ( Q ) is the fractional edge cover number of Q s ( Q ) is the factorization width of Q fhtw ( Q ) is the fractional hypertree width of Q [M10] 48 / 99

Size Bounds for Join Results Given a join query Q , for any database of size N , the join result admits a listing representation of size O ( N ρ ∗ ( Q ) ). [LW49,A81,BT95,AGM08] a factorization without caching of size O ( N s ( Q ) ). [OZ12] a factorization with caching of size O ( N fhtw ( Q ) ). [OZ15] These size bounds are asymptotically tight! Best possible size bounds for factorized representations over variable orders of Q and for listing representation, but not database optimal! There exists arbitrarily large databases for which ◮ the listing representation has size Ω( N ρ ∗ ( Q ) ) ◮ the factorization with/without caching over any variable order of Q has size Ω( N s ( Q ) ) and Ω( N fhtw ( Q ) ) respectively. 49 / 99

Example: The Factorization Width s A T A R B E B E F C D C D F U S The structure of the factorization over the above variable order ∆: � � � �� a � × � b � × � c � × � d � × � e � × � f � a ∈ A b ∈ B c ∈ C d ∈ D e ∈ E f ∈ F The number of values for a variable is dictated by the number of valid tuples of values for its ancestors in ∆: One value � f � for each tuple ( a , e , f ) in the join result. Size of factorization = sum of sizes of results of subqueries along paths . 50 / 99

Example: The Factorization Width s A T A R B E B E F C D C D F U S The factorization width for ∆ is the largest ρ ∗ over subqueries defined by root-to-leaf paths in ∆ s ( Q ) is the minimum factorization width over all variable orders of Q In our example: Path A – E – F has fractional edge cover number 2. ⇒ The number of F -values is ≤ N 2 , but can be ∼ N 2 . All other root-to-leaf paths have fractional edge cover number 1. ⇒ The number of other values is ≤ N . ⇒ Factorization size is O ( N 2 ) s ( Q ) = 2 Recall that ρ ∗ ( Q ) = 3 ⇒ Listing representation size is O ( N 3 ) 51 / 99

Example: The Fractional Hypertree Width fhtw Idea: Avoid repeating identical expressions, store them once and use pointers. key ( A ) = ∅ A T A key ( B ) = { A } key ( E ) = { A } R B E B E F C D C D F U S key ( D ) = { A , B } key ( C ) = { A , B } key ( F ) = { E } � � � � � � �� a � × · · · × � e � × � f � a ∈ A e ∈ E f ∈ F Observation: Variable F only depends on E and not on A : key ( F ) = { E } A value � e � maps to the same union � ( e , f ) ∈ U � f � regardless of its pairings with A -values. ⇒ Define U e = � ( e , f ) ∈ U � f � once for each value � e � and reuse it 52 / 99

Example: The Fractional Hypertree Width fhtw Idea: Avoid repeating identical expressions, store them once and use pointers. key ( A ) = ∅ A T A key ( B ) = { A } key ( E ) = { A } R B E B E F C D C D F U S key ( D ) = { A , B } key ( C ) = { A , B } key ( F ) = { E } A factorization with caching would be: � � � � � � � �� a � × · · · × � e � × U e ; U e = � f � a ∈ A e ∈ E ( e , f ) ∈ U fhtw for ∆ is the largest ρ ∗ ( Q key ( X ) ∪{ X } ) over subqueries Q key ( X ) ∪{ X } defined by the variables key ( X ) ∪ { X } for each variable X in ∆ fhtw ( Q ) is the minimum fhtw over all variable orders of Q In our example: fhtw ( Q ) = 1 < s ( Q ) = 2 < ρ ∗ ( Q ) = 3. 53 / 99

Alternative Characterizations of fhtw The fractional hypertree width fhtw has been originally defined for hypertree decompositions. [M10] Given a join query Q . Let T be the set of hypertree decompositions of the hypergraph of Q . fhtw ( Q ) = min ( T ,χ ) ∈ T max n ∈ T ρ ∗ ( Q χ ( n ) ) 54 / 99

Alternative Characterizations of fhtw The fractional hypertree width fhtw has been originally defined for hypertree decompositions. [M10] Given a join query Q . Let T be the set of hypertree decompositions of the hypergraph of Q . fhtw ( Q ) = min ( T ,χ ) ∈ T max n ∈ T ρ ∗ ( Q χ ( n ) ) Alternative characterization of the fractional hypertree width fhtw using the mapping between hypertree decompositions and variable orders [OZ15] Given a join query Q . Let VO be the set of variable orders of Q . fhtw ( Q ) = min ( F , key ) ∈ VO max v ∈ F ρ ∗ ( Q key ( v ) ∪{ v } ) 54 / 99

Compression by Factorization in Practice 55 / 99

Compression Contest: Factorized vs. Zipped Relations 100 Compression ratio Tabular/Factorized Tabular/Gzip(Tabular) Factorized/Gzip(Factorized) 10 1 1 2 4 8 16 32 Database Scale Result of query Orders ✶ Dish ✶ Items [BKOZ13] Tabular = listing representation in CSV text format Gzip (compression level 6) outputs binary format Factorized representation in text format (each digit takes one character) Observations: Gzip does not exploit distant repetitions! Factorizations can be arbitrarily more succinct than gzipped relations. Gzipping factorizations improves the compression by 3x. 56 / 99

Factorization Gains in Practice (1/4) Retailer dataset used for LogicBlox analytics Relations: Inventory (84M), Sales (1.5M), Clearance (368K), Promotions (183K), Census (1K), Location (1K). Compression factors (caching not used): ◮ 26.61x for natural join of Inventory, Census, Location. ◮ 159.59x for natural join of Inventory, Sales, Clearance, Promotions 57 / 99

Factorization Gains in Practice (2/4) LastFM public dataset Relations: UserArtists (93K), UserFriends (25K), TaggedArtists (186K). Compression factors: ◮ 143.54x for joining two copies of Userartists and Userfriends With caching: 982.86x ◮ 253.34x when also joining on TaggedArtists ◮ 2.53x/ 3.04x/ 924.46x for triangle/4-clique/bowtie query on UserFriends ◮ 9213.51x/ 552Kx/ ≥ 86Mx for versions of triangle/4-clique/bowtie queries with copies for UserArtists for each UserFriend copy 58 / 99

Factorization Gains in Practice (3/4) Twitter public dataset Relation: Follower-Followee (1M) Compression factors: ◮ 2.69x for triangle query ◮ 3.48x for 4-clique query ◮ 4918.73x for bowtie query 59 / 99

Factorization Gains in Practice (4/4) Yelp Dataset Challenge Relations: Business (174K), User (1.3M), Review (5.2M), Category(667K), Attribute (1.3M) Compression factors: ◮ 39.43x for natural join of Business, User, Review, Attribute (with caching) ◮ 185.87x for natural join of Business, User, Review, Attribute, Category (with caching) 60 / 99

Outline of Part 1: Joins Introduction by Examples Decompositions and Variable Orders Size Bounds for Join Results Worst-Case Optimal Join Algorithms Further Work and References Quiz 61 / 99

How Fast Can We Compute Join Results? Given a join query Q , for any database of size N , the join result can be computed in time O ( N ρ ∗ ( Q ) ) as listing representation [NPRR12,V14] O ( N s ( Q ) ) as factorization without caching [OZ15] O ( N fhtw ( Q ) ) as factorization with caching [OZ15] These upper bounds essentially follow the succinctness gap. They are: worst-case optimal (modulo log N ) within the given representation model with respect to data complexity ◮ additional quadratic factor in the number of variables and linear factor in the number of relations in Q 62 / 99

Example: Computing the Factorized Join Result with FDB Our join: O(customer, day, dish), D(dish, item), I(item, price) can be grounded to a factorized representation as follows: Variable Order FDB execution plan � ∅ O ( , , dish ) , D ( dish , ) � dish � dish × { dish } { dish } � � O ( , day , dish ) � day � D ( dish , item ) � item � day item × × { dish , day } { item } � � O ( customer , day , dish ) � customer � I ( item , price ) � price � price customer 63 / 99

Example: Computing the Factorized Join Result with FDB � O ( , , dish ) , D ( dish , ) � dish � × O ( , day , dish ) � day � D ( dish , item ) � item � � � × × � O ( c , day , dish ) � c � � I ( item , p ) � p � 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × O ( , day , burger ) � day � D ( burger , item ) � item � D ( hotdog , item ) � item � O ( , day , hotdog ) � day � � � � � × × × × � O ( c , day , burger ) � c � � I ( item , p ) � p � � I ( item , p ) � p � � O ( c , day , hotdog ) � c � 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ D ( burger , item ) � item � D ( hotdog , item ) � item � O ( , day , hotdog ) � day � � Monday � � Friday � � � � × × × × × � O ( c , friday , burger ) � c � � I ( item , p ) � p � � I ( item , p ) � p � � O ( c , day , hotdog ) � c � � O ( c , monday , burger ) � c � 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ D ( burger , item ) � item � D ( hotdog , item ) � item � O ( , day , hotdog ) � day � � Monday � � Friday � � � � × × × × × ∪ � O ( c , friday , burger ) � c � � I ( item , p ) � p � � I ( item , p ) � p � � O ( c , day , hotdog ) � c � � Elise � 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ D ( burger , item ) � item � D ( hotdog , item ) � item � O ( , day , hotdog ) � day � � Monday � � Friday � � � � × × × × × ∪ ∪ � I ( item , p ) � p � � I ( item , p ) � p � � O ( c , day , hotdog ) � c � � Elise � � Elise � 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ ∪ D ( hotdog , item ) � item � O ( , day , hotdog ) � day � � Monday � � Friday � � patty � � bun � � onion � � � × × × × × × × ∪ ∪ � I ( patty , p ) � p � � I ( onion , p ) � p � � I ( item , p ) � p � � O ( c , day , hotdog ) � c � � Elise � � Elise � � I ( bun , p ) � p � 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ ∪ D ( hotdog , item ) � item � O ( , day , hotdog ) � day � � Monday � � Friday � � patty � � bun � � onion � � � × × × × × × × ∪ ∪ ∪ � I ( onion , p ) � p � � I ( item , p ) � p � � O ( c , day , hotdog ) � c � � Elise � � Elise � � 6 � cache! � I ( bun , p ) � p � price depends on item , but not on dish . Cache prices for specific items ! 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ ∪ D ( hotdog , item ) � item � O ( , day , hotdog ) � day � � Monday � � Friday � � patty � � bun � � onion � � � × × × × × × × ∪ ∪ ∪ ∪ � I ( onion , p ) � p � � I ( item , p ) � p � � O ( c , day , hotdog ) � c � � Elise � � Elise � � 6 � � 2 � cache! cache! price depends on item , but not on dish . Cache prices for specific items ! 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ ∪ D ( hotdog , item ) � item � O ( , day , hotdog ) � day � � Monday � � Friday � � patty � � bun � � onion � � � × × × × × × × ∪ ∪ ∪ ∪ ∪ � I ( item , p ) � p � � O ( c , day , hotdog ) � c � � Elise � � Elise � � 6 � � 2 � � 2 � cache! cache! cache! price depends on item , but not on dish . Cache prices for specific items ! 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ ∪ ∪ O ( , day , hotdog ) � day � � Monday � � Friday � � patty � � bun � � onion � � bun � � onion � � sausage � � × × × × × × × × × ∪ ∪ ∪ ∪ ∪ � I ( bun , p ) � p � � I ( sausage , p ) � p � � O ( c , day , hotdog ) � c � � Elise � � Elise � � 6 � � 2 � � 2 � cache! cache! cache! � I ( onion , p ) � p � price depends on item , but not on dish . Cache prices for specific items ! 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ ∪ ∪ O ( , day , hotdog ) � day � � Monday � � Friday � � patty � � bun � � onion � � bun � � onion � � sausage � � × × × × × × × × × ∪ ∪ ∪ ∪ ∪ � I ( sausage , p ) � p � � O ( c , day , hotdog ) � c � � Elise � � Elise � � 6 � � 2 � � 2 � cache! cache! cache! � I ( onion , p ) � p � price depends on item , but not on dish . Cache prices for specific items ! Reuse cached prices for specific items ! 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ ∪ ∪ O ( , day , hotdog ) � day � � Monday � � Friday � � patty � � bun � � onion � � bun � � onion � � sausage � � × × × × × × × × × ∪ ∪ ∪ ∪ ∪ � I ( sausage , p ) � p � � O ( c , day , hotdog ) � c � � Elise � � Elise � � 6 � � 2 � � 2 � cache! cache! cache! price depends on item , but not on dish . Cache prices for specific items ! Reuse cached prices for specific items ! 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ ∪ ∪ O ( , day , hotdog ) � day � � Monday � � Friday � � patty � � bun � � onion � � bun � � onion � � sausage � � × × × × × × × × × ∪ ∪ ∪ ∪ ∪ ∪ � O ( c , day , hotdog ) � c � � Elise � � Elise � � 6 � � 2 � � 2 � � 4 � cache! cache! cache! price depends on item , but not on dish . Cache prices for specific items ! Reuse cached prices for specific items ! 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ ∪ ∪ ∪ � Monday � � Friday � � patty � � bun � � onion � � bun � � onion � � sausage � � Friday � × × × × × × × × × ∪ ∪ ∪ ∪ ∪ ∪ � O ( c , friday , hotdog ) � c � � Elise � � Elise � � 6 � � 2 � � 2 � � 4 � cache! cache! cache! price depends on item , but not on dish . Cache prices for specific items ! Reuse cached prices for specific items ! 64 / 99

Example: Computing the Factorized Join Result with FDB ∪ � burger � � hotdog � × × ∪ ∪ ∪ ∪ � Monday � � Friday � � patty � � bun � � onion � � bun � � onion � � sausage � � Friday � × × × × × × × × × ∪ ∪ ∪ ∪ ∪ ∪ ∪ � Elise � � Elise � � 6 � � 2 � � 2 � � 4 � � Joe � � Steve � cache! cache! cache! price depends on item , but not on dish . Cache prices for specific items ! Reuse cached prices for specific items ! 64 / 99

Example: Computing the Factorized Join Result with FDB � O ( , , dish ) , D ( dish , ) � dish � × � � O ( , day , dish ) � day � D ( dish , item ) � item � × × � � O ( customer , day , dish ) � customer � I ( item , price ) � price � Relations are sorted following any topological order of the variable order The intersection of relations O and D on dish takes time O ( N min log( N max / N min )), where N m = m ( | π dish O | , | π dish D | ). The remaining operations are lookups in the relations, where we first fix the dish value and then the day and item values. 65 / 99

Joins Aggregates Optimization https://fdbresearch.github.io Dan - PowerPoint PPT Presentation

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School University of Warsaw November 22, 2018 1 / 99 Acknowledgements Some work reported in this course has been done in the context of the FDB

Cement, Aggregates, Mining Presentation Cement, Aggregates and Mining Cement, Aggregates and

AGGREGATES AND POZZOLANIC MATERIALS OVERVIEW Presented by Tom Adams, P.E. April 10, 2018

SQL Workshop Joins Doug Shook Inner Joins Joins are used to combine data from multiple

SQL$Joins Max$Masnick August&7,&2015 What%are%joins?

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School

Overview DW Performance Optimization Choosing aggregates Maintaining views

Breedon Aggregates Breedon Aggregates Full-year 2013 results Preliminary results 4 March 2014

An introduction to Breedon Aggregates October 2013 Peter Tom Simon Vivian Introduction Peter

Socially and Environmentally Responsible Aggregates (SERA) Andrea Bourrie Dufferin Aggregates

JOINS IN SQL By Rohit Dhanwani OBJECTIVES Define and use different types of joins INNER

S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus, March

Joins, and more plotting Joins, and more plotting Abhijit Dasgupta Abhijit Dasgupta Fall, 2019

CS 61: Database Systems Joins Adapted from Silberschatz, Korth, and Sundarshan unless otherwise

Notes on exact meets and joins R. N. Ball, J. Picado and A. Pultr 1 Exact meets and joins.

15-780: Optimization J. Zico Kolter March 14-16, 2015 1 Outline Introduction to optimization

Managing a Healthy Menopause Transition Priorities for Consideration Cardiovascular (and brain)

Targeting your food product for health Dr Amanda Avery, PhD RD, Associate Professor in Nutrition

Large Biological Molecules Multiple Choice Review www.njctl.org Slide 3 / 43 1 Why is

Fitness and Nutrition RALEIGH RACERS SPECIAL OLYMPICS SWIM TEAM Grains Meat and Beans

KELLOGG COMPANY 2019 Q4 EARNINGS February 6, 2020 KELLOGG COMPANY | Q4 2019 EARNINGS | February

Deploy For Growth Forward-Looking Statements This presentation contains, or incorporates by

Total Protein Slides From Jessica Spreitzer, USMEF Economist USDA forecasting record production

Numerical Issues and Influences in the Design of Algebraic Modeling Languages for Optimization

Joins Aggregates Optimization https://fdbresearch.github.io Dan - PowerPoint PPT Presentation

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School University of Warsaw November 22, 2018 1 / 99 Acknowledgements Some work reported in this course has been done in the context of the FDB

Cement, Aggregates, Mining Presentation Cement, Aggregates and Mining Cement, Aggregates and

AGGREGATES AND POZZOLANIC MATERIALS OVERVIEW Presented by Tom Adams, P.E. April 10, 2018

SQL Workshop Joins Doug Shook Inner Joins Joins are used to combine data from multiple

SQL$Joins Max$Masnick August&amp;7,&amp;2015 What%are%joins?

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School

Joins Aggregates Optimization https://fdbresearch.github.io Dan Olteanu PhD Open School

Overview DW Performance Optimization Choosing aggregates Maintaining views

Breedon Aggregates Breedon Aggregates Full-year 2013 results Preliminary results 4 March 2014

An introduction to Breedon Aggregates October 2013 Peter Tom Simon Vivian Introduction Peter

Socially and Environmentally Responsible Aggregates (SERA) Andrea Bourrie Dufferin Aggregates

JOINS IN SQL By Rohit Dhanwani OBJECTIVES Define and use different types of joins INNER

S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus, March

Joins, and more plotting Joins, and more plotting Abhijit Dasgupta Abhijit Dasgupta Fall, 2019

CS 61: Database Systems Joins Adapted from Silberschatz, Korth, and Sundarshan unless otherwise

Notes on exact meets and joins R. N. Ball, J. Picado and A. Pultr 1 Exact meets and joins.

15-780: Optimization J. Zico Kolter March 14-16, 2015 1 Outline Introduction to optimization

Managing a Healthy Menopause Transition Priorities for Consideration Cardiovascular (and brain)

Targeting your food product for health Dr Amanda Avery, PhD RD, Associate Professor in Nutrition

Large Biological Molecules Multiple Choice Review www.njctl.org Slide 3 / 43 1 Why is

Fitness and Nutrition RALEIGH RACERS SPECIAL OLYMPICS SWIM TEAM Grains Meat and Beans

KELLOGG COMPANY 2019 Q4 EARNINGS February 6, 2020 KELLOGG COMPANY | Q4 2019 EARNINGS | February

Deploy For Growth Forward-Looking Statements This presentation contains, or incorporates by

Total Protein Slides From Jessica Spreitzer, USMEF Economist USDA forecasting record production

Numerical Issues and Influences in the Design of Algebraic Modeling Languages for Optimization

SQL$Joins Max$Masnick August&7,&2015 What%are%joins?