
DatalogRA: Datalog with Recursive Aggregation in the Spark RDD Model

Marek Rogala (1), Jan Hidders (2), Jacek Sroka (1). (1) Institute of Informatics, University of Warsaw; (2) Vrije Universiteit Brussel. Jan Hidders (VUB), GRADES 2016, 24 June 2016.


  1. DatalogRA: Datalog with Recursive Aggregation in the Spark RDD Model
     Marek Rogala (1), Jan Hidders (2), Jacek Sroka (1)
     (1) Institute of Informatics, University of Warsaw; (2) Vrije Universiteit Brussel
     Jan Hidders (VUB), GRADES 2016, 24 June 2016

  2. Outline
     1. Introduction
     2. Plain Datalog and its Evaluation
     3. DatalogRA: Syntax and Semantics
     4. Implementation in Spark
     5. Experiments and Evaluation
     6. Conclusions and Future Work

  3. Introduction (section start)

  4. Introduction / Motivation
     - Need for high-level declarative languages for graph processing
     - Datalog seems an interesting starting point:
       - Well-understood semantics
       - Highly parallelizable [Ganguly et al. 1990] [Zhang et al. 1995]
       - Large body of research on optimization [Tekle et al. 2010]
       - Limited recursion matches graph navigation
     - Becomes more interesting when extended with basic arithmetic and stratified aggregation [Mumick et al. 1990] [Shkapsky et al. 2013]
       - Example: counting triangles
     - And even better with recursive aggregation [Lam et al. 2013] (Socialite)
       - Examples: shortest path, PageRank

  5. Introduction / Contribution of Paper
     - Implementation in Spark:
       - Leverages optimizations in Spark (but not yet Spark SQL)
       - Embedding in a mature framework
       - A DatalogRA program can be part of a bigger Spark workflow
     - Semantics:
       - Explicit and more general semantics than Socialite
       - Some investigation of well-definedness of the result

  6. Plain Datalog and its Evaluation (section start)

  7. Plain Datalog and its Evaluation / Syntax of Plain Datalog
     - A database is a finite set of facts of the form $r(v_1, \ldots, v_n)$, where $r$ is a relation name and $(v_1, \ldots, v_n)$ a vector of domain values.
       - E.g., $\{a(1, 2), a(2, 3), b(3, 1)\}$
       - We will assume all domains are finite.
     - A basic Datalog program consists of a set of rules, where a rule is an expression of the form
         $r(\bar{x}) \;\text{:-}\; s_1(\bar{y}_1), \ldots, s_n(\bar{y}_n).$
       where $n \geq 1$, $r, s_1, \ldots, s_n$ are relation names, and $\bar{x}, \bar{y}_1, \ldots, \bar{y}_n$ are tuples of variables and constants (i.e., domain values).
       - Head: $r(\bar{x})$
       - Body: $s_1(\bar{y}_1), \ldots, s_n(\bar{y}_n)$, which is a set of subgoals
     - Operational semantics in terms of a minimal/first fixed point of a function that applies all rules to infer facts.
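
The fixed-point semantics on this slide can be illustrated with a minimal Python sketch of naive evaluation. The function `naive_eval`, the tuple encoding of facts, and the two-rule transitive-closure program `tc` are assumptions made for illustration; they do not come from the slides, which only supply the example database.

```python
def naive_eval(facts, rules):
    """Apply all rules repeatedly until no new fact is inferred (first fixed point)."""
    db = set(facts)
    while True:
        inferred = set()
        for rule in rules:
            inferred |= rule(db)
        if inferred <= db:      # nothing new: fixed point reached
            return db
        db |= inferred

# Hypothetical program (not from the slides):
#   tc(x, y) :- a(x, y).
#   tc(x, z) :- tc(x, y), a(y, z).
def base_rule(db):
    return {("tc", x, y) for (rel, x, y) in db if rel == "a"}

def step_rule(db):
    return {("tc", x, z)
            for (r1, x, y) in db if r1 == "tc"
            for (r2, y2, z) in db if r2 == "a" and y2 == y}

# Example database from the slide: {a(1,2), a(2,3), b(3,1)}
facts = {("a", 1, 2), ("a", 2, 3), ("b", 3, 1)}
result = naive_eval(facts, [base_rule, step_rule])
tc_facts = {f for f in result if f[0] == "tc"}
```

Because all domains are assumed finite, the set of possible facts is finite and the loop terminates.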

  8. Plain Datalog and its Evaluation / Semi-naive Evaluation
     - Basic idea: compute inferred facts based on the atoms newly added in the previous iteration.
     - For example:
       - Take a rule $r(x, y) \;\text{:-}\; s(x, y, z), r(z, 2), r(y, z)$.
       - Assume $r'$ contains the tuples added in the previous step.
       - The tuples added by this rule in the next step are the union of
           $\{(x, y) \mid s(x, y, z) \land r'(z, 2) \land r(y, z)\}$ and
           $\{(x, y) \mid s(x, y, z) \land r(z, 2) \land r'(y, z)\}$.
       - After this we compute the next $r'$ by subtracting existing tuples.
     - Prevents a lot of redundant computation, but the same tuple may still be derived more than once.
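
As a rough sketch of the example rule on this slide, the following Python function evaluates `r(x, y) :- s(x, y, z), r(z, 2), r(y, z)` semi-naively. The function name `semi_naive`, the set-of-tuples encoding, and the sample data are assumptions for illustration only.

```python
def semi_naive(s, r_initial):
    """Semi-naive evaluation of  r(x, y) :- s(x, y, z), r(z, 2), r(y, z)."""
    r = set(r_initial)       # all r-tuples known so far
    delta = set(r_initial)   # r': tuples added in the previous step
    while delta:
        derived = set()
        for (x, y, z) in s:
            # first set of the union: r'(z, 2) and r(y, z)
            if (z, 2) in delta and (y, z) in r:
                derived.add((x, y))
            # second set of the union: r(z, 2) and r'(y, z)
            if (z, 2) in r and (y, z) in delta:
                derived.add((x, y))
        delta = derived - r  # next r': subtract existing tuples
        r |= delta
    return r

# Hypothetical input data
s = {(1, 3, 2), (4, 1, 3)}
r0 = {(2, 2), (3, 2)}
r_final = semi_naive(s, r0)
```

Every derivation considered in an iteration uses at least one tuple from `delta`, which is what avoids recomputing conclusions that depend only on older tuples.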

  9. DatalogRA: Syntax and Semantics (section start)

  10. DatalogRA: Syntax and Semantics / Basic Idea of DatalogRA
      - Based on ideas in Socialite [Lam et al. 2013]
      - Allows recursive aggregation, under certain conditions; i.e., optionally an aggregation function can be specified for the last column of a relation.
      - Example (compute the length of the shortest path from node 1):
          Edge(int src, int sink, int len)
          Path(int target, int dist aggregate Min)
          Path(t, d) :- t = 1, d = 0.
          Path(t, d) :- Path(s, d1), Edge(s, t, d2), d = d1 + d2.
      - Can be generalized to allow aggregation on multiple columns
      - We also allow basic arithmetic predicates and stratified negation
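
To make the shortest-path example concrete, here is a small Python sketch that repeatedly applies the two `Path` rules and then aggregates with Min until nothing changes. It is not the Spark implementation from the talk; the function name `shortest_paths`, the dictionary encoding of `Path`, and the sample edges are assumptions, and termination relies on edge lengths being nonnegative.

```python
def shortest_paths(edges, source=1):
    """Naive fixed point for Path(target, dist aggregate Min)."""
    path = {}  # Path relation as a map: target -> aggregated dist
    while True:
        # Direct consequences of both rules:
        #   Path(t, d) :- t = 1, d = 0.
        #   Path(t, d) :- Path(s, d1), Edge(s, t, d2), d = d1 + d2.
        candidates = [(source, 0)]
        candidates += [(t, path[s] + w) for (s, t, w) in edges if s in path]
        # Aggregate with Min per target, together with the existing facts
        new_path = dict(path)
        for t, d in candidates:
            if t not in new_path or d < new_path[t]:
                new_path[t] = d
        if new_path == path:  # fixed point reached
            return path
        path = new_path

# Hypothetical Edge(src, sink, len) facts
edges = [(1, 2, 4), (1, 3, 1), (3, 2, 2), (2, 4, 5)]
dist = shortest_paths(edges)
```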

  11. DatalogRA: Syntax and Semantics / Semantics of DatalogRA: Operational semantics
      - The semantics of a DatalogRA program $P$ (without negation) is the first fixed point of the immediate consequence operator $\Gamma_P \circ \hat{T}_P$.
      - $\hat{T}_P$ computes the bag of direct consequences of $P$.
      - $\Gamma_P$ is a function that aggregates as specified in $P$.

  12. DatalogRA: Syntax and Semantics / Semantics of DatalogRA: The bag of direct consequences
      - $\hat{T}_P$ computes the bag of direct consequences of $P$:
        - The result bag of a rule $r$ for database $D$, $\hat{r}(D)$, is a bag over $r(D)$ such that the multiplicity of each fact $r(\bar{c})$ in this bag is the number of valuations of the variables in the tail (body) that cause its inference.
        - The bag of direct consequences of $P$ for $D$ is
            $\hat{T}_P(D) = D \uplus \biguplus_{r \in P} \hat{r}(D)$
          where $\uplus$ is the additive bag union.
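
The multiplicity-counting definition can be sketched with Python's `collections.Counter` as the bag type. The example below covers only the single recursive `Path` rule from the shortest-path program; the function names `rule_bag` and `direct_consequences` and the sample facts are assumptions for illustration.

```python
from collections import Counter

def rule_bag(path, edge):
    """Bag of results for  Path(t, d) :- Path(s, d1), Edge(s, t, d2), d = d1 + d2.
    Each satisfying valuation of the body variables adds one to the multiplicity."""
    bag = Counter()
    for (s, d1) in path:
        for (s2, t, d2) in edge:
            if s == s2:
                bag[("Path", t, d1 + d2)] += 1
    return bag

def direct_consequences(path, edge):
    """D additively united with the rule's result bag (Counter + is additive)."""
    db = Counter({("Path", t, d): 1 for (t, d) in path})
    db.update({("Edge", s, t, w): 1 for (s, t, w) in edge})
    return db + rule_bag(path, edge)

# Hypothetical database: two different paths both yield Path(3, 2)
path = {(1, 0), (2, 1)}
edge = {(1, 3, 2), (2, 3, 1)}
bag = direct_consequences(path, edge)
```

Here `Path(3, 2)` is inferred by two different valuations, so it ends up with multiplicity 2, while every fact of the input database keeps multiplicity 1.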

  13. DatalogRA: Syntax and Semantics / Semantics of DatalogRA: The global aggregation function
      - $\Gamma_P$ is a function that aggregates as specified in $P$:
        - If relation $R$ is aggregated in $P$ with $G$: for each vector $\bar{x}$ such that there is a fact of the form $R(\bar{x}, y)$ in the input, replace these facts with $R(\bar{x}, G(\bar{Y}))$, where $\bar{Y}$ is the bag of domain values in which the multiplicity of an element $y$ is the multiplicity of $R(\bar{x}, y)$ in the input.
        - If relation $R$ is not aggregated in $P$: remove duplicate facts for this relation.
      - Note: in both cases the result of $\Gamma_P$ is without duplicates.
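
A minimal Python sketch of such an aggregation step is shown below, assuming the bag is a `Counter` over fact tuples whose first component is the relation name. The function name `gamma` and the `aggregators` mapping are hypothetical names chosen for this illustration.

```python
from collections import Counter

def gamma(bag, aggregators):
    """bag: Counter mapping (relation, v1, ..., vn) -> multiplicity.
    aggregators: relation name -> function over a list of last-column values."""
    groups = {}     # (relation, x1, ..., xk) -> list of y values, with multiplicity
    result = set()  # a set, so the output has no duplicates
    for fact, mult in bag.items():
        if mult <= 0:
            continue
        if fact[0] in aggregators:
            groups.setdefault(fact[:-1], []).extend([fact[-1]] * mult)
        else:
            result.add(fact)  # non-aggregated relation: just drop duplicates
    for key, ys in groups.items():
        result.add(key + (aggregators[key[0]](ys),))
    return result

# Hypothetical input bag
bag = Counter({("Path", 3, 2): 2, ("Path", 3, 5): 1, ("Edge", 1, 3, 2): 1})
with_min = gamma(bag, {"Path": min})
with_sum = gamma(bag, {"Path": sum})
```

With Min the multiplicities have no effect, whereas with Sum the duplicated value 2 is counted twice (2 + 2 + 5 = 9).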

  14. DatalogRA: Syntax and Semantics / Semantics of DatalogRA: Well-definedness
      - So the semantics $P(D)$ is the first fixed point of $\Gamma_P \circ \hat{T}_P$ on $D$.
      - Questions:
        - When is this defined?
        - Is the result a minimal fixed point in some sense?
      - Sufficient condition: $\Gamma_P \circ \hat{T}_P$ is monotonic for some partial ordering over databases.
      - Subset ordering is too strict when aggregation is used.

  15. DatalogRA: Syntax and Semantics / Semantics of DatalogRA: Aggregation-dependent partial order
      - Assume $G$ is based on a binary operator, say $\oplus_G$, that is commutative and associative:
        - $G$ applied to a non-empty bag $\{\{a_1, \ldots, a_n\}\}$ is $a_1 \oplus_G \ldots \oplus_G a_n$.
      - This sometimes implies a partial order: $a \sqsubseteq_G b$ iff $a = b$ or there is a $c$ such that $a \oplus_G c = b$. E.g.,
        - for the Max operator that ordering is $\leq$
        - for Min it is $\geq$
        - for Sum over nonnegative integers it is also $\leq$
        - for Sum over all integers it is not a partial order
      - We consider only those $G$ where $\sqsubseteq_G$ is a partial order.
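
The listed cases can be checked directly against the definition of $\sqsubseteq_G$. The short derivation below is a worked illustration, not taken from the slides; the counterexample with 1 and 2 is one arbitrary choice showing that antisymmetry fails for Sum over all integers.

```latex
\begin{align*}
\text{Min:}\quad
  & a \sqsubseteq_{\mathrm{Min}} b
    \iff a = b \ \lor\ \exists c.\ \min(a, c) = b
    \iff b \leq a
    && (\text{take } c = b) \\
\text{Sum over } \mathbb{N}\text{:}\quad
  & a \sqsubseteq_{\mathrm{Sum}} b
    \iff \exists c \geq 0.\ a + c = b
    \iff a \leq b \\
\text{Sum over } \mathbb{Z}\text{:}\quad
  & 1 + 1 = 2 \ \Rightarrow\ 1 \sqsubseteq_{\mathrm{Sum}} 2,
    \qquad 2 + (-1) = 1 \ \Rightarrow\ 2 \sqsubseteq_{\mathrm{Sum}} 1,
    \qquad \text{but } 1 \neq 2
\end{align*}
```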

  16. DatalogRA: Syntax and Semantics / Semantics of DatalogRA: Aggregation-based database ordering
      - Assume $\sqsubseteq_G$ is a partial order for all $G$ in a program $P$.
      - We let $\sqsubseteq_P$ define a partial order over facts:
        1. if relation $R$ has aggregation operator $G$ in $P$, then $R(\bar{x}, y) \sqsubseteq_P R(\bar{x}', y')$ iff $\bar{x} = \bar{x}'$ and $y \sqsubseteq_G y'$;
        2. if $R$ has no aggregation operator in $P$, then $R(\bar{x}) \sqsubseteq_P R(\bar{x}')$ iff $\bar{x} = \bar{x}'$.
      - We let $\sqsubseteq_P$ also define a partial order over databases:
        1. $D_1 \sqsubseteq_P D_2$ holds iff for all $R(\bar{x}) \in D_1$ there is a fact $R(\bar{x}') \in D_2$ such that $R(\bar{x}) \sqsubseteq_P R(\bar{x}')$.
      - If $P$ is monotonic w.r.t. $\sqsubseteq_P$, i.e., $\Gamma_P \circ \hat{T}_P$ is monotonic under $\sqsubseteq_P$, then $P$ always computes a minimal fixed point.
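
The two orderings can be written down as a pair of Python predicates. This is only a sketch: the names `fact_leq`, `db_leq`, and `orders` are hypothetical, facts are encoded as tuples starting with the relation name, and the value ordering for `Path` is the one for Min (that is, $\geq$).

```python
def fact_leq(f1, f2, orders):
    """f1 ⊑_P f2 for facts encoded as (relation, v1, ..., vn).
    orders maps each aggregated relation to its value ordering ⊑_G."""
    if f1[0] != f2[0] or len(f1) != len(f2):
        return False
    if f1[0] in orders:
        # aggregated relation: same key columns, last column compared by ⊑_G
        return f1[:-1] == f2[:-1] and orders[f1[0]](f1[-1], f2[-1])
    return f1 == f2  # non-aggregated relation: facts must be equal

def db_leq(d1, d2, orders):
    """D1 ⊑_P D2: every fact of D1 is below some fact of D2."""
    return all(any(fact_leq(f1, f2, orders) for f2 in d2) for f1 in d1)

# Path is aggregated with Min, so a ⊑_Min b iff a >= b
orders = {"Path": lambda a, b: a >= b}
d1 = {("Path", 3, 5), ("Edge", 1, 3, 5)}
d2 = {("Path", 3, 2), ("Path", 4, 7), ("Edge", 1, 3, 5)}
forward = db_leq(d1, d2, orders)
backward = db_leq(d2, d1, orders)
```

In this example `d1` is below `d2` although it is not a subset of it, which is the situation the subset ordering cannot capture.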

  17. DatalogRA: Syntax and Semantics / Semantics of DatalogRA: A sufficient condition for monotonicity
      - Also assume all $G$ are idempotent, i.e., $a \oplus_G a = a$ (e.g., for Min and Max).
      - Then multiplicity in the bags is ignored by $\Gamma_P$, so $\Gamma_P \circ \hat{T}_P = \Gamma_P \circ T_P$, where $T_P$ is the classical Datalog inference function.
      - Since $\Gamma_P$ is always monotonic under $\sqsubseteq_P$, it is sufficient to require that $T_P$ is monotonic under $\sqsubseteq_P$.
        - The complexity of deciding this property is still unclear.
      - Under such monotonicity we can essentially do semi-naive evaluation:
        1. “New facts” are those not subsumed (under $\sqsubseteq_P$) by an existing fact.
        2. Infer additional results in $T_P$ for these facts as usual.
        3. Add these results and apply $\Gamma_P$.
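
The three steps can be sketched for the shortest-path program with Min aggregation, where a derived fact `Path(t, d)` is subsumed by an existing `Path(t, d')` exactly when `d >= d'`. This is a plain-Python illustration under those assumptions, not the Spark implementation; the function name and sample edges are hypothetical, and edge lengths are assumed nonnegative.

```python
def semi_naive_shortest_paths(edges, source=1):
    """Semi-naive evaluation of the Path program with Min aggregation."""
    path = {source: 0}    # current aggregated Path relation
    delta = {source: 0}   # "new facts" from the previous iteration
    while delta:
        # Step 2: infer results of the recursive rule from new facts only
        derived = [(t, delta[s] + w) for (s, t, w) in edges if s in delta]
        # Steps 3 and 1: add results, apply Min, keep only non-subsumed facts
        new_delta = {}
        for t, d in derived:
            if t not in path or d < path[t]:  # not subsumed by an existing fact
                path[t] = d
                new_delta[t] = d
        delta = new_delta
    return path

# Hypothetical Edge(src, sink, len) facts
edges = [(1, 2, 4), (1, 3, 1), (3, 2, 2), (2, 4, 5)]
dist = semi_naive_shortest_paths(edges)
```

Only targets whose distance improved are expanded in the next iteration, so work is proportional to the number of improvements rather than to the size of the whole `Path` relation.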

  18. Implementation in Spark (section start)
