DatalogRA: Datalog with Recursive Aggregation in the Spark RDD Model

Marek Rogala (1), Jan Hidders (2), Jacek Sroka (1)
(1) Institute of Informatics, University of Warsaw
(2) Vrije Universiteit Brussel
24 June, 2016

Jan Hidders (VUB), GRADES 2016, 24 June, 2016

Outline

1. Introduction
2. Plain Datalog and its Evaluation
3. DatalogRA: Syntax and Semantics
4. Implementation in Spark
5. Experiments and Evaluation
6. Conclusions and Future Work

Introduction
Motivation

There is a need for high-level declarative languages for graph processing.
Datalog seems an interesting starting point:
- Well-understood semantics
- Highly parallelizable [Ganguly et al. 1990] [Zhang et al. 1995]
- Large body of research on optimization [Tekle et al. 2010]
- Limited recursion matches graph navigation
It becomes more interesting when extended with basic arithmetic and stratified aggregation [Mumick et al. 1990] [Shkapsky et al. 2013]:
- Counting triangles
And it becomes even better with recursive aggregation [Lam et al. 2013] (Socialite):
- Shortest Path, PageRank

Introduction
Contribution of Paper

Implementation in Spark:
- Leverages optimizations in Spark (but not yet Spark SQL)
- Embedding in a mature framework
- A DatalogRA program can be part of a bigger Spark workflow
Semantics:
- Explicit and more general semantics than Socialite
- Some investigation of the well-definedness of the result

Plain Datalog and its Evaluation
Syntax of Plain Datalog

A database is a finite set of facts of the form $r(v_1, \ldots, v_n)$, where $r$ is a relation name and $(v_1, \ldots, v_n)$ a vector of domain values. E.g., $\{a(1,2), a(2,3), b(3,1)\}$.
We will assume all domains are finite.
A basic Datalog program consists of a set of rules, where a rule is an expression of the form
$r(\bar{x}) \text{ :- } s_1(\bar{y}_1), \ldots, s_n(\bar{y}_n).$
where $n \geq 1$, $r, s_1, \ldots, s_n$ are relation names, and $\bar{x}, \bar{y}_1, \ldots, \bar{y}_n$ are tuples of variables and constants (i.e., domain values).
- Head: $r(\bar{x})$
- Body: $s_1(\bar{y}_1), \ldots, s_n(\bar{y}_n)$, which is a set of subgoals
Operational semantics: defined in terms of a minimal/first fixed point of a function that applies all rules to infer facts.
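The fixed-point semantics above can be illustrated with a minimal Python sketch of naive bottom-up evaluation. The two "path" rules are hypothetical and are not taken from the slides; only the example database comes from the slide. Because all domains are finite, the loop is guaranteed to stop.

```python
# Naive bottom-up evaluation: apply all rules until no new fact appears.
# A fact is a tuple (relation_name, v1, ..., vn).

def apply_rules(db):
    """Apply every rule once and return the facts inferred from db."""
    inferred = set()
    # Hypothetical rule 1: path(x, y) :- a(x, y).
    for (rel, *args) in db:
        if rel == "a":
            inferred.add(("path", args[0], args[1]))
    # Hypothetical rule 2: path(x, z) :- path(x, y), a(y, z).
    for (rel1, *p) in db:
        for (rel2, *e) in db:
            if rel1 == "path" and rel2 == "a" and p[1] == e[0]:
                inferred.add(("path", p[0], e[1]))
    return inferred

def least_fixed_point(db):
    """Iterate from the input database until the first fixed point."""
    current = set(db)
    while True:
        extended = current | apply_rules(current)
        if extended == current:
            return current
        current = extended

# Example database from the slide: {a(1,2), a(2,3), b(3,1)}
database = {("a", 1, 2), ("a", 2, 3), ("b", 3, 1)}
result = least_fixed_point(database)
print(sorted(f for f in result if f[0] == "path"))
```

The input facts stay in the result, and the loop stops at the first database that the rules no longer extend.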

Plain Datalog and its Evaluation
Semi-naive Evaluation

Basic idea: compute inferred facts based on the atoms newly added in the previous iteration.
For example:
- take the rule $r(x, y) \text{ :- } s(x, y, z), r(z, 2), r(y, z)$
- assume $r'$ contains the tuples added in the previous step
- the tuples added by this rule in the next step are the union of
  $\{(x, y) \mid s(x, y, z) \wedge r'(z, 2) \wedge r(y, z)\}$ and
  $\{(x, y) \mid s(x, y, z) \wedge r(z, 2) \wedge r'(y, z)\}$
- after this we compute the next $r'$ by subtracting the existing tuples
This prevents a lot of redundant computation, but the same tuple may still be derived more than once.
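As a sketch of the example above, the following Python code evaluates the rule r(x, y) :- s(x, y, z), r(z, 2), r(y, z) both naively and semi-naively. The input relations `s_facts` and `r_initial` are made-up illustration data, and the function names are hypothetical.

```python
def join(s, r_left, r_right):
    """{(x, y) | s(x, y, z) and r_left(z, 2) and r_right(y, z)}"""
    return {(x, y) for (x, y, z) in s
            if (z, 2) in r_left and (y, z) in r_right}

def semi_naive(s, r0):
    r = set(r0)      # all r-tuples known so far
    delta = set(r0)  # r': the tuples added in the previous step
    while delta:
        # At least one r-subgoal must match a tuple from the previous step.
        derived = join(s, delta, r) | join(s, r, delta)
        delta = derived - r  # subtract the existing tuples
        r |= delta
    return r

def naive(s, r0):
    r = set(r0)
    while True:
        extended = r | join(s, r, r)
        if extended == r:
            return r
        r = extended

# Made-up input data for illustration.
s_facts = {(1, 2, 3), (4, 1, 2), (6, 4, 1)}
r_initial = {(3, 2), (2, 3), (2, 2)}
print(sorted(semi_naive(s_facts, r_initial)))
```

Both strategies reach the same fixed point; the semi-naive loop only re-joins against tuples that were new in the previous round.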

DatalogRA: Syntax and Semantics
Basic Idea of DatalogRA

Based on ideas in Socialite [Lam et al. 2013].
Allows recursive aggregation, under certain conditions:
- i.e., optionally an aggregation function can be specified for the last column of a relation
Example (compute the length of the shortest path from node 1):

    Edge(int src, int sink, int len)
    Path(int target, int dist aggregate Min)
    Path(t, d) :- t = 1, d = 0.
    Path(t, d) :- Path(s, d1), Edge(s, t, d2), d = d1 + d2.

This can be generalized to allow aggregation on multiple columns.
We also allow basic arithmetic predicates and stratified negation.
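To make the intended result of the shortest-path program concrete, here is a small Python sketch that iterates the two Path rules and keeps only the minimum distance per target. It is not the Spark implementation from the paper; the edge list and the function name are assumptions for illustration, and edge lengths are assumed to be nonnegative so that the iteration stabilizes.

```python
# Hypothetical Edge(src, sink, len) facts.
edges = [(1, 2, 4), (1, 3, 1), (3, 2, 2), (2, 4, 5), (4, 1, 1)]

def shortest_paths(edges, source=1):
    path = {}  # target -> dist; one entry per target because of "aggregate Min"
    while True:
        # Path(t, d) :- t = 1, d = 0.
        candidates = [(source, 0)]
        # Path(t, d) :- Path(s, d1), Edge(s, t, d2), d = d1 + d2.
        candidates += [(t, path[s] + length)
                       for (s, t, length) in edges if s in path]
        # Aggregate the last column with Min.
        new_path = dict(path)
        for t, d in candidates:
            new_path[t] = min(d, new_path.get(t, d))
        if new_path == path:
            return path
        path = new_path

distances = shortest_paths(edges)
print(distances)
```

Each round re-derives all candidates from the current Path facts, which is the naive strategy; the result is the first database that a further round leaves unchanged.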

DatalogRA: Syntax and Semantics
Semantics of DatalogRA: Operational semantics

The semantics of a DatalogRA program $P$ (without negation) is the first fixed point of the immediate consequence operator $\Gamma_P \circ \hat{T}_P$:
- $\hat{T}_P$ computes the bag of direct consequences of $P$
- $\Gamma_P$ is a function that aggregates as specified in $P$

DatalogRA: Syntax and Semantics
Semantics of DatalogRA: The bag of direct consequences

$\hat{T}_P$ computes the bag of direct consequences of $P$:
- The result bag of a rule $r$ for database $D$, written $\hat{r}(D)$, is a bag over $r(D)$ such that the multiplicity of each fact $r(\bar{c})$ in this bag is the number of valuations of the variables in the body that cause its inference.
- The bag of direct consequences of $P$ for $D$ is
  $\hat{T}_P(D) = D \uplus \biguplus_{r \in P} \hat{r}(D)$
  where $\uplus$ is the additive bag union.
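The bag of direct consequences can be sketched in Python with `collections.Counter` as the bag type. The sketch hard-codes the two Path rules of the shortest-path example and, as a simplification, tracks only the Path facts in the bag; the input data and function names are hypothetical.

```python
from collections import Counter

def recursive_rule_bag(path_facts, edges):
    """Bag for Path(t, d) :- Path(s, d1), Edge(s, t, d2), d = d1 + d2.
    Each valuation of the body variables contributes one occurrence."""
    bag = Counter()
    for (s, d1) in path_facts:
        for (src, t, d2) in edges:
            if src == s:
                bag[(t, d1 + d2)] += 1
    return bag

def t_hat(path_facts, edges):
    """Additive bag union of D and the result bags of all rules."""
    result = Counter(path_facts)   # D itself, every fact with multiplicity 1
    result[(1, 0)] += 1            # Path(t, d) :- t = 1, d = 0. (one valuation)
    result += recursive_rule_bag(path_facts, edges)
    return result

# Hypothetical input: Path(2, 5) is derivable via two different valuations.
edges = [(1, 2, 5), (1, 3, 2), (3, 2, 3)]
path_facts = {(1, 0), (3, 2)}
bag = t_hat(path_facts, edges)
print(bag)
```

Here Path(2, 5) gets multiplicity 2 because both the edge from node 1 and the edge from node 3 derive it, while Path(1, 0) and Path(3, 2) each occur once in D and once as a rule result.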

DatalogRA: Syntax and Semantics
Semantics of DatalogRA: The global aggregation function

$\Gamma_P$ is a function that aggregates as specified in $P$:
- If relation $R$ is aggregated in $P$ with $G$: for each vector $\bar{x}$ such that there is a fact of the form $R(\bar{x}, y)$ in the input, replace these facts with $R(\bar{x}, G(\bar{Y}))$, where $\bar{Y}$ is the bag of domain values in which the multiplicity of an element $y$ is the multiplicity of $R(\bar{x}, y)$ in the input.
- If relation $R$ is not aggregated in $P$: remove duplicate facts for this relation.
Note: in both cases the result of $\Gamma_P$ is without duplicates.
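A minimal Python sketch of the aggregation step for a single relation follows, assuming the bag is a `collections.Counter` over tuples whose last column is the aggregated one. The function name `gamma` and the sample bag are illustrative only.

```python
from collections import Counter, defaultdict

def gamma(bag, aggregate=None):
    """Aggregate the last column per group, or just drop duplicates.

    bag: Counter mapping facts (tuples) to multiplicities.
    aggregate: function over a list of values (e.g. min, max, sum),
               or None if the relation is not aggregated.
    """
    if aggregate is None:
        return set(bag)  # remove duplicate facts
    groups = defaultdict(list)
    for (*key, y), multiplicity in bag.items():
        # Y contains y as often as R(x, y) occurs in the input bag.
        groups[tuple(key)].extend([y] * multiplicity)
    return {key + (aggregate(ys),) for key, ys in groups.items()}

sample = Counter({(2, 5): 2, (2, 4): 1, (3, 2): 1})
print(gamma(sample, min))
print(gamma(sample, sum))
print(gamma(sample))
```

With Min the multiplicities do not matter, whereas with Sum the fact (2, 5) counts twice, which is why the semantics works on bags rather than sets.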

DatalogRA: Syntax and Semantics
Semantics of DatalogRA: Well-definedness

So the semantics of $P(D)$ is the first fixed point of $\Gamma_P \circ \hat{T}_P$ on $D$.
Questions:
- When is this defined?
- Is the result a minimal fixed point in some sense?
Sufficient condition: for some partial ordering over databases, $\Gamma_P \circ \hat{T}_P$ is monotonic.
- Subset ordering is too strict when aggregation is used.

DatalogRA: Syntax and Semantics
Semantics of DatalogRA: Aggregation-dependent partial order

Assume $G$ is based on a binary operator, say $\oplus_G$, that is commutative and associative:
- $G$ applied to a non-empty bag $\{\!\{a_1, \ldots, a_n\}\!\}$ is $a_1 \oplus_G \ldots \oplus_G a_n$
This sometimes induces a partial order: $a \sqsubseteq_G b$ iff $a = b$ or there is a $c$ such that $a \oplus_G c = b$.
E.g.,
- for the Max operator that ordering is $\leq$
- for Min it is $\geq$
- for Sum over nonnegative integers it is also $\leq$
- for Sum over all integers it is not a partial order
We consider only those $G$ where $\sqsubseteq_G$ is a partial order.
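The examples above can be checked by brute force on a small finite sample of the domain. The Python sketch below builds the induced relation from a binary operator and tests antisymmetry; the sample ranges and helper names are assumptions, and a finite sample only illustrates the claim, it does not prove it for all integers.

```python
def induced_leq(op, domain):
    """a below b  iff  a == b or some c in the domain satisfies op(a, c) == b."""
    def leq(a, b):
        return a == b or any(op(a, c) == b for c in domain)
    return leq

def is_antisymmetric(leq, domain):
    """A partial order must not relate two distinct elements in both directions."""
    return all(a == b or not (leq(a, b) and leq(b, a))
               for a in domain for b in domain)

def add(a, b):
    return a + b

nonneg = range(0, 5)   # sample of the nonnegative integers
signed = range(-4, 5)  # sample that includes negative integers

max_leq = induced_leq(max, nonneg)
min_leq = induced_leq(min, nonneg)
sum_leq = induced_leq(add, nonneg)
signed_sum_leq = induced_leq(add, signed)

print(is_antisymmetric(sum_leq, nonneg))         # Sum over nonnegative sample
print(is_antisymmetric(signed_sum_leq, signed))  # Sum with negatives allowed
```

With negative numbers available, 1 + 1 = 2 and 2 + (-1) = 1, so 1 and 2 are related in both directions and antisymmetry fails, which is the reason Sum over all integers is excluded.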

DatalogRA: Syntax and Semantics
Semantics of DatalogRA: Aggregation-based database ordering

Assume $\sqsubseteq_G$ is a partial order for all $G$ in a program $P$.
We let $\sqsubseteq_P$ define a partial order over facts:
1. if relation $R$ has aggregation operator $G$ in $P$, then $R(\bar{x}, y) \sqsubseteq_P R(\bar{x}', y')$ iff $\bar{x} = \bar{x}'$ and $y \sqsubseteq_G y'$
2. if $R$ has no aggregation operator in $P$, then $R(\bar{x}) \sqsubseteq_P R(\bar{x}')$ iff $\bar{x} = \bar{x}'$
We let $\sqsubseteq_P$ also define a partial order over databases:
1. $D_1 \sqsubseteq_P D_2$ holds iff for all $R(\bar{x}) \in D_1$ there is a fact $R(\bar{x}') \in D_2$ such that $R(\bar{x}) \sqsubseteq_P R(\bar{x}')$
If $P$ is monotonic w.r.t. $\sqsubseteq_P$, i.e., $\Gamma_P \circ \hat{T}_P$ is monotonic under $\sqsubseteq_P$, then $P$ always computes a minimal fixed point.
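As a small illustration of the database ordering, the Python sketch below compares two sets of Path facts under the ordering induced by Min (where a value is below another iff it is greater than or equal to it). The function names and the sample databases are hypothetical, and only a single aggregated relation is modeled.

```python
def fact_leq(fact1, fact2, value_leq):
    """R(x, y) is below R(x', y') iff x == x' and y is below y' in the value order."""
    (*x1, y1), (*x2, y2) = fact1, fact2
    return x1 == x2 and value_leq(y1, y2)

def db_leq(d1, d2, value_leq):
    """D1 is below D2 iff every fact in D1 is below some fact in D2."""
    return all(any(fact_leq(f1, f2, value_leq) for f2 in d2) for f1 in d1)

def min_leq(a, b):
    """Value order induced by Min: a is below b iff a >= b."""
    return a >= b

# Hypothetical Path(target, dist) databases.
earlier = {(2, 5), (3, 2)}
later = {(2, 3), (3, 2), (4, 8)}
print(db_leq(earlier, later, min_leq))
print(db_leq(later, earlier, min_leq))
```

The later database is "larger" in this ordering even though it is not a superset: Path(2, 5) was replaced by the better fact Path(2, 3), which the subset ordering would not accept.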

DatalogRA: Syntax and Semantics
Semantics of DatalogRA: A sufficient condition for monotonicity

Also assume that all $G$ are idempotent, i.e., $a \oplus_G a = a$:
- e.g., for Min and Max
Then multiplicity in the bags is ignored by $\Gamma_P$, so $\Gamma_P \circ \hat{T}_P = \Gamma_P \circ T_P$, where $T_P$ is the classical Datalog inference function.
Since $\Gamma_P$ is always monotonic under $\sqsubseteq_P$, it is sufficient to require that $T_P$ is monotonic under $\sqsubseteq_P$.
- The complexity of deciding this property is still unclear.
Under such monotonicity we can essentially do semi-naive evaluation:
1. "New facts" are those not subsumed (under $\sqsubseteq_P$) by an existing fact
2. Infer additional results in $T_P$ for these facts as usual
3. Add these results and apply $\Gamma_P$
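The three steps above can be sketched in Python for the shortest-path program with the Min aggregate. This is a plain in-memory illustration, not the Spark RDD implementation; the edge list and the function name are assumptions, and edge lengths are assumed to be nonnegative.

```python
# Hypothetical Edge(src, sink, len) facts.
edges = [(1, 2, 4), (1, 3, 1), (3, 2, 2), (2, 4, 5), (4, 1, 1)]

def semi_naive_min(edges, source=1):
    path = {source: 0}   # current Path facts: target -> minimal dist so far
    delta = {source: 0}  # the "new facts" of the previous round
    while delta:
        # Step 2: infer consequences using only the new facts.
        derived = {}
        for (s, t, length) in edges:
            if s in delta:
                d = delta[s] + length
                derived[t] = min(d, derived.get(t, d))
        # Step 1: a derived fact is new only if no existing fact subsumes it,
        # i.e. the target is unseen or the distance is strictly smaller.
        delta = {t: d for t, d in derived.items()
                 if t not in path or d < path[t]}
        # Step 3: add the new facts; overwriting keeps the minimum per target.
        path.update(delta)
    return path

distances = semi_naive_min(edges)
print(distances)
```

A fact such as Path(1, 9) derived via the cycle through node 4 is subsumed by Path(1, 0) and therefore never enters the next round, which is what makes the loop stop.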
